How to Extract Stock Names from a JavaScript-Driven Website Using R

Webscraping the Stock Names from a Website: A Deep Dive

Introduction

Webscraping is the process of automatically extracting data from websites. In this article, we will focus on webscraping the stock names from one specific page, which lists stocks in the Biotechnology and Pharmaceuticals sector: https://www.avanza.se/aktier/hitta.html?sectorId=17&s=numberOfOwners.desc&o=1000&sectorName=Bioteknik%20%26%20L%C3%A4kemedel&cc=SE

We will walk through the process in R, discuss the challenges encountered along the way, and provide solutions to overcome them.

Prerequisites

To follow along with this article, you should have the following:

R installed on your system
The rvest and httr2 packages installed in R (rvest parses HTML and selects elements; httr2 provides a simple way to make HTTP requests)
Basic knowledge of HTML and CSS

If you don’t have R installed on your system, you can download it from https://www.r-project.org/

Section 1: Understanding the Website Structure

The website we will be webscraping is https://www.avanza.se/aktier/hitta.html?sectorId=17&s=numberOfOwners.desc&o=1000&sectorName=Bioteknik%20%26%20L%C3%A4kemedel&cc=SE. The website uses a lot of JavaScript to dynamically load its content, so the HTML returned by a plain HTTP request may not contain the stock list. However, we can still webscrape the stock names using R.
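The query string of this URL carries several parameters, one of which is percent-encoded. As a small aside, the sketch below splits the query string with base R and decodes each value. The variable names (query, pairs, params) are illustrative only, and what each parameter means to the site is an open question that this article does not answer.

```r
url <- "https://www.avanza.se/aktier/hitta.html?sectorId=17&s=numberOfOwners.desc&o=1000&sectorName=Bioteknik%20%26%20L%C3%A4kemedel&cc=SE"

# Keep everything after the "?" and split it into key=value pairs
query <- sub("^[^?]*\\?", "", url)
pairs <- strsplit(strsplit(query, "&", fixed = TRUE)[[1]], "=", fixed = TRUE)

# Decode the percent-escapes in each value
params <- vapply(pairs, function(p) utils::URLdecode(p[2]), character(1))
Encoding(params) <- "UTF-8"

# Label each value with its key
names(params) <- vapply(pairs, `[`, character(1), 1)

print(params)
```

Decoding the sectorName value makes it easier to see that the page is filtered to a single sector.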

Section 2: Inspecting the HTML

To determine how to webscrape the stock names from the website, we need to inspect the HTML of the website. We can do this by right-clicking on the stock name and selecting “Inspect” or by using the developer tools in our web browser.

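Using the developer tools, we can see that the stock names are contained within span elements. These spans do not carry a descriptive class. Instead, they carry a generated attribute, _ngcontent-wiw-c195. Attributes of this form are added by the Angular framework to scope component styles, so the exact name may change when the site is rebuilt and should be re-checked before each scrape.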

# Code to inspect the HTML
library(rvest)
page <- read_html("https://www.avanza.se/aktier/hitta.html?sectorId=17&s=numberOfOwners.desc&o=1000&sectorName=Bioteknik%20%26%20L%C3%A4kemedel&cc=SE")
# List the text of every span element found in the static HTML
page %>%
  html_elements("span") %>%
  html_text2() %>%
  print()
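Because _ngcontent-wiw-c195 is an attribute name rather than a class value, a CSS attribute selector is one way to match it. The sketch below tries the selector on a small inline document so that it can be run offline. The markup and the company names are hypothetical placeholders, not data taken from the site.

```r
library(rvest)

# Hypothetical markup imitating the structure seen in the developer tools
sample_page <- minimal_html('
  <ul>
    <li><span _ngcontent-wiw-c195="">Example Pharma AB</span></li>
    <li><span _ngcontent-wiw-c195="">Sample Biotech AB</span></li>
    <li><span class="price">12.34</span></li>
  </ul>
')

# "span[attr]" matches span elements that have the attribute, whatever its value
sample_names <- sample_page %>%
  html_elements("span[_ngcontent-wiw-c195]") %>%
  html_text2()

print(sample_names)
```

The span with class "price" is left out because it does not carry the attribute.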

Section 3: Webscraping the Stock Names

Now that we have inspected the HTML, we can start webscraping the stock names from the website. However, since the website uses a lot of JavaScript, the data may not be present in the HTML that read_html() receives.

As a first attempt, we will make the HTTP request explicitly with httr2 and parse the HTML response with rvest. This retrieves only the markup present in the server's response, so the result may be empty if the names are rendered later by JavaScript.

# Code to webscrape the stock names
library(httr2)
library(rvest)

url <- "https://www.avanza.se/aktier/hitta.html?sectorId=17&s=numberOfOwners.desc&o=1000&sectorName=Bioteknik%20%26%20L%C3%A4kemedel&cc=SE"

# Request the page and parse the HTML body of the response
page <- request(url) %>%
  req_perform() %>%
  resp_body_html()

# Find all span elements that carry the attribute "_ngcontent-wiw-c195"
span_elements <- page %>% html_elements("span[_ngcontent-wiw-c195]")

# Extract the text from each span element
stock_names <- span_elements %>%
  html_text2()

print(stock_names)
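An empty result at this point is a useful signal in its own right. The sketch below imitates, with hypothetical markup, the kind of initial response a JavaScript-rendered page may send: an empty root element that a script fills in after loading. The app-root element and the main.js file name are assumptions made for illustration.

```r
library(rvest)

# Hypothetical initial response: no stock names yet, only a root element
# and a script that would render the content in a browser
static_page <- minimal_html('
  <app-root></app-root>
  <script src="main.js"></script>
')

found <- static_page %>%
  html_elements("span[_ngcontent-wiw-c195]") %>%
  html_text2()

# A zero-length result suggests the names are not in the static HTML
if (length(found) == 0) {
  message("No matching spans in the static HTML; the content is probably rendered by JavaScript.")
}
```

If the real page behaves like this, parsing the static response will not be enough, and the page has to be rendered before it is queried.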

Section 4: Handling JavaScript

As we mentioned earlier, the website uses a lot of JavaScript to dynamically load its content. R’s read_html() function does not execute JavaScript; it only parses the HTML that the server returns.

To overcome this challenge, we need a tool that renders the page in a real browser first. One option is the read_html_live() function in rvest (version 1.0.4 or later), which loads the page in headless Chrome through the chromote package and lets us query the rendered result. It requires chromote and a local Chrome or Chromium installation. RSelenium is a common alternative.

# Code to handle JavaScript
# Requires rvest >= 1.0.4, the chromote package, and Chrome or Chromium
library(rvest)

page <- read_html_live("https://www.avanza.se/aktier/hitta.html?sectorId=17&s=numberOfOwners.desc&o=1000&sectorName=Bioteknik%20%26%20L%C3%A4kemedel&cc=SE")

# Give the page's JavaScript a few seconds to render (the wait time is a guess)
Sys.sleep(3)

# Find all span elements that carry the attribute "_ngcontent-wiw-c195"
span_elements <- page %>% html_elements("span[_ngcontent-wiw-c195]")

# Extract the text from each span element
stock_names <- span_elements %>%
  html_text2()

print(stock_names)
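JavaScript-rendered content can take a moment to appear, so a lookup made immediately after loading may come back empty. One way to cope, sketched below, is to retry the lookup until it returns something or a timeout expires. The helper wait_for_result() and the stand-in fake_fetch() are hypothetical names introduced for this illustration; they are not part of rvest or any other package.

```r
# Hypothetical helper: call `fetch` until it returns a non-empty vector
# or until `timeout` seconds have passed, whichever comes first
wait_for_result <- function(fetch, timeout = 10, interval = 0.5) {
  deadline <- Sys.time() + timeout
  repeat {
    result <- fetch()
    if (length(result) > 0 || Sys.time() >= deadline) {
      return(result)
    }
    Sys.sleep(interval)
  }
}

# Stand-in for a real scrape: returns nothing on the first two calls,
# then returns placeholder names (not data from the site)
calls <- 0
fake_fetch <- function() {
  calls <<- calls + 1
  if (calls < 3) character(0) else c("Example Pharma AB", "Sample Biotech AB")
}

names_found <- wait_for_result(fake_fetch, timeout = 5, interval = 0.1)
print(names_found)
```

In a real script, the function passed as fetch would wrap the span lookup and text extraction from the code above.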

Section 5: Conclusion

Webscraping is a powerful tool for extracting data from websites. In this article, we explored how to webscrape the stock names from a specific website using R. We discussed the challenges encountered during the process and provided solutions to overcome them.

The same steps should carry over to other websites that render their content with JavaScript, although the selectors and waiting times will differ from site to site and need to be re-checked whenever a site changes.

Last modified on 2024-05-30