Web scraping across multiple pages using rvest


I am trying to scrape information about renewable energy manufacturers, suppliers and companies in Europe on the following website: https://www.energy-xprt.com/renewable-energy/companies/location-europe/.

The first step is to collect the URL of each company in the list, but when I run a loop to scrape across pages I only get the company links from the first page. My code looks like this:

library(rvest)

link <- paste0('https://www.energy-xprt.com/renewable-energy/companies/location-europe/page-', 1:78)
result <- lapply(link, function(x) x %>%
                   read_html() %>% html_elements("[class='h2 mb-0']") %>% html_elements('a') %>% html_attr('href')
                 ) %>% unlist() %>% unique()

I expect to obtain a vector containing the URLs of the companies from all 78 pages.


There is 1 answer

margusl On

It seems that static (or perhaps just curl-based?) tools like rvest and httr / httr2 are always redirected to the same page, so let's try to get that content through Chrome / chromote. Pointing chromote at a list of URLs would not work anyway (it redirects back to https://www.energy-xprt.com/renewable-energy/companies ), and instead of triggering click() methods we can call JavaScript fetch() in the chromote Runtime, pull the raw text response from chromote and process it in R with rvest. This way Chrome needs neither to render those pages nor to download linked resources, while the request headers (incl. session cookies) are all set and match our browser session.
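The "always the same page" symptom can be confirmed from the per-page results before switching tools. Below is a minimal base R sketch; the `pages` list and its example.com URLs are made up for illustration and stand in for the un-flattened output of the lapply() call in the question.

```r
# Hypothetical stand-in for the per-page link vectors returned by lapply();
# the URLs are invented, not taken from the site.
pages <- list(
  `page-1` = c("https://example.com/a", "https://example.com/b"),
  `page-2` = c("https://example.com/a", "https://example.com/b"),
  `page-3` = c("https://example.com/a", "https://example.com/b")
)

# TRUE for every page whose links are identical to those of the first page
same_as_first <- vapply(pages, function(x) identical(x, pages[[1]]), logical(1))
same_as_first

# If this is TRUE, every request returned the same listing
all(same_as_first)
```

If all() returns TRUE for the real results, unique() collapses everything to the links of a single page, which matches what the question describes.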

library(chromote)
library(rvest)

library(dplyr)
library(tidyr)
library(purrr)
library(tibble)
library(stringr)

b <- ChromoteSession$new()
{
  b$Page$navigate("https://www.energy-xprt.com/renewable-energy/companies/location-europe")
  b$Page$loadEventFired()
} 
#> $timestamp
#> [1] 163274

# instead of navigating through pages, we'll just fetch page content with js and return it as text;
# create a small helper function in the js runtime to make the following calls a bit more compact;
# the function is asynchronous and returns a Promise, but that's fine
b$Runtime$evaluate("function fetch_text(url) {return fetch(url).then(response => response.text())}")
#> $result
#> $result$type
#> [1] "undefined"

# we'll call fetch_text(..) for all pages; awaitPromise = TRUE waits until the js Promise is resolved
# map() output will be a named list (page-1, page-2, ...) of extracted urls,
# enframe() creates a nested tibble (78×2), which we'll unnest to a long 1542×2 tibble
url_df <- str_c("page-", 1:78) |>
  set_names() |>
  map(\(page) str_glue("fetch_text('https://www.energy-xprt.com/renewable-energy/companies/location-europe/{page}')")) |>
  map(\(js_call) b$Runtime$evaluate(js_call, awaitPromise = TRUE)$result$value |>
        read_html() |>
        html_elements("h2[class='h2 mb-0'] > a") |> 
        html_attr('href')
      ) |>
  enframe(name = "page", value = "url") |>
  unnest_longer(url)
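The parsing and reshaping part of this pipeline can be tried offline. The sketch below assumes made-up markup that only imitates the listing structure (the `html_by_page` list, the `extract_links()` helper and the hrefs are hypothetical) and uses base lapply() in place of purrr::map():

```r
library(rvest)
library(tibble)
library(tidyr)

# Invented HTML fragments, one per page; not copied from the site
html_by_page <- list(
  `page-1` = '<h2 class="h2 mb-0"><a href="/companies/a-1">A</a></h2>
              <h2 class="h2 mb-0"><a href="/companies/b-2">B</a></h2>',
  `page-2` = '<h2 class="h2 mb-0"><a href="/companies/c-3">C</a></h2>
              <h2 class="other"><a href="/ignored">X</a></h2>'
)

# Same selector as above: <a> elements that are direct children of an <h2>
# whose class attribute is exactly "h2 mb-0"
extract_links <- function(html_text) {
  minimal_html(html_text) |>
    html_elements("h2[class='h2 mb-0'] > a") |>
    html_attr("href")
}

# Named list -> nested tibble (one row per page) -> long tibble (one row per url)
toy_df <- lapply(html_by_page, extract_links) |>
  enframe(name = "page", value = "url") |>
  unnest_longer(url)

toy_df
```

The attribute selector matches the class string exactly, so the link under the `<h2 class="other">` heading is skipped.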

Results:

url_df
#> # A tibble: 1,542 × 2
#>    page   url                                                                   
#>    <chr>  <chr>                                                                 
#>  1 page-1 https://www.energy-xprt.com/companies/zygo-corporation-ametek-inc-101…
#>  2 page-1 https://www.energy-xprt.com/companies/weda-ab-10468                   
#>  3 page-1 https://www.energy-xprt.com/companies/lindner-recyclingtech-gmbh-3552 
#>  4 page-1 https://www.energy-xprt.com/companies/real-tech-inc-37526             
#>  5 page-1 https://www.energy-xprt.com/companies/advanced-energy-industries-inc-…
#>  6 page-1 https://www.energy-xprt.com/companies/solar-turbines-incorporated-286…
#>  7 page-1 https://www.energy-xprt.com/companies/envitec-biogas-ag-29450         
#>  8 page-1 https://www.energy-xprt.com/companies/hrs-heat-exchangers-ltd-19747   
#>  9 page-1 https://www.energy-xprt.com/companies/delta-t-devices-ltd-775         
#> 10 page-1 https://www.energy-xprt.com/companies/entech-renewable-energy-solutio…
#> # ℹ 1,532 more rows

tail(url_df)
#> # A tibble: 6 × 2
#>   page    url                                                                  
#>   <chr>   <chr>                                                                
#> 1 page-77 https://www.energy-xprt.com/companies/fuel-tech-inc-35874            
#> 2 page-77 https://www.energy-xprt.com/companies/purenviro-as-40264             
#> 3 page-77 https://www.energy-xprt.com/companies/kt-kinetics-technology-spa-9236
#> 4 page-77 https://www.energy-xprt.com/companies/biogastechnik-sued-gmbh-147066 
#> 5 page-78 https://www.energy-xprt.com/companies/bmh-technology-oy-19660        
#> 6 page-78 https://www.energy-xprt.com/companies/feedwater-ltd-41643

# first 3 urls from every page to make sure we are not running in circles
url_df |> 
  slice_head(n = 3, by = page)
#> # A tibble: 233 × 2
#>    page   url                                                                   
#>    <chr>  <chr>                                                                 
#>  1 page-1 https://www.energy-xprt.com/companies/zygo-corporation-ametek-inc-101…
#>  2 page-1 https://www.energy-xprt.com/companies/weda-ab-10468                   
#>  3 page-1 https://www.energy-xprt.com/companies/lindner-recyclingtech-gmbh-3552 
#>  4 page-2 https://www.energy-xprt.com/companies/zeeco-inc-45526                 
#>  5 page-2 https://www.energy-xprt.com/companies/flammatec-spol-s-ro-130539      
#>  6 page-2 https://www.energy-xprt.com/companies/electria-wind-134493            
#>  7 page-3 https://www.energy-xprt.com/companies/trasfeco-srl-48302              
#>  8 page-3 https://www.energy-xprt.com/companies/enertec-kraftwerke-gmbh-35958   
#>  9 page-3 https://www.energy-xprt.com/companies/n-bio-gmbh-36559                
#> 10 page-4 https://www.energy-xprt.com/companies/neste-27323                     
#> # ℹ 223 more rows

# number of urls collected from each page
url_df |> 
  count(page)
#> # A tibble: 78 × 2
#>    page        n
#>    <chr>   <int>
#>  1 page-1     20
#>  2 page-10    20
#>  3 page-11    20
#>  4 page-12    20
#>  5 page-13    20
#>  6 page-14    20
#>  7 page-15    20
#>  8 page-16    20
#>  9 page-17    20
#> 10 page-18    20
#> # ℹ 68 more rows

Note that some issues remain: for me it currently returns 1089 unique values (the site reports 1542 matches), the order does not match a desktop browser session (perhaps a side effect of the fetch()-based approach...), and occasionally the site just behaves oddly. For example, page-3 was broken for me in a browser at some point: other pagination pages worked fine, but this one triggered a redirect to https://www.energy-xprt.com/renewable-energy/companies.
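To see where the gap between collected rows and unique values comes from, one could count unique URLs and list the pages on which each repeated URL appears. This is a minimal base R sketch on a toy data frame (`toy_urls` and its values are invented and only mimic the shape of url_df):

```r
# Hypothetical stand-in for url_df
toy_urls <- data.frame(
  page = c("page-1", "page-1", "page-2", "page-2", "page-3"),
  url  = c("/companies/a", "/companies/b", "/companies/a", "/companies/c", "/companies/b")
)

n_total  <- nrow(toy_urls)                # rows collected
n_unique <- length(unique(toy_urls$url))  # distinct urls among them

# urls that occur more than once ...
dup_urls <- unique(toy_urls$url[duplicated(toy_urls$url)])

# ... and the pages each of them was seen on
dup_pages <- lapply(split(toy_urls$page, toy_urls$url)[dup_urls], unique)

n_total
n_unique
dup_pages
```

Run against the real url_df, the same steps would show whether the repeats are spread across many pages or concentrated in a few.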