I am trying to scrape information about renewable energy manufacturers, suppliers and companies in Europe from the following website: https://www.energy-xprt.com/renewable-energy/companies/location-europe/.
The first step is to collect the URL of each company in the list, but when I loop over the pagination pages I only obtain the company links from the first page. My code looks like this:
library(rvest)     # read_html(), html_nodes(), html_elements(), html_attr()
library(magrittr)  # %>% pipe

link <- paste0('https://www.energy-xprt.com/renewable-energy/companies/location-europe/page-', 1:78)
result <- lapply(link, function(x) x %>%
  read_html() %>% html_nodes("[class='h2 mb-0']") %>% html_elements('a') %>% html_attr('href')
) %>% unlist() %>% unique()
I expect to obtain a vector that contains the URLs of the companies from all 78 pages.
It seems that static (or just curl-based?) tools like rvest and httr/httr2 are always redirected to the same page, so let's try to get that content through Chrome / chromote. But instead of pointing chromote at a list of URLs (which would not work anyway, as it redirects back to https://www.energy-xprt.com/renewable-energy/companies) or triggering click() methods, we can use a JavaScript fetch() call in the chromote Runtime, pull the raw text response from chromote and process it in R with rvest. This way Chrome does not need to render those pages or download linked resources, while the request headers (including session cookies) are all set correctly and match our browser session.
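A minimal sketch of how that could look (fetch_page() is an illustrative wrapper of my own, the CSS selector is reused from the question, and the waiting/pacing details may need tuning for this particular site):

library(chromote)
library(rvest)
library(magrittr)

b <- ChromoteSession$new()

# Visit the listing once so later fetch() calls run in the same origin and
# reuse the session cookies Chrome has been handed.
{
  b$Page$navigate("https://www.energy-xprt.com/renewable-energy/companies/location-europe/", wait_ = FALSE)
  b$Page$loadEventFired()
}

# Illustrative helper: run fetch() inside the page, await the promise and
# hand the raw HTML text back to R; Chrome neither renders those pages nor
# downloads any linked resources.
fetch_page <- function(url) {
  js <- sprintf("fetch('%s').then(r => r.text())", url)
  b$Runtime$evaluate(js, awaitPromise = TRUE, returnByValue = TRUE)$result$value
}

links <- paste0("https://www.energy-xprt.com/renewable-energy/companies/location-europe/page-", 1:78)

result <- lapply(links, function(u) {
  fetch_page(u) %>%
    read_html() %>%
    html_elements("[class='h2 mb-0'] a") %>%
    html_attr("href")
}) %>% unlist() %>% unique()

length(result)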
Note that there are still some issues with the results: for me it currently returns 1089 unique values (the site reports 1542 matches), the order does not match what a desktop browser session shows (perhaps a side effect of the fetch()-based approach), and occasionally the site just behaves oddly. For example, page-3 was broken for me in a regular browser at some point: the other pagination pages worked fine, but that one triggered a redirect to https://www.energy-xprt.com/renewable-energy/companies.
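One way to soften the occasional redirect is to check the final URL that fetch() reports and retry. A rough sketch, assuming the chromote session b from the snippet above is still open (fetch_page_retry() is an illustrative name, not an existing function):

# fetch() follows redirects but exposes the final URL as r.url, so a bounce
# back to /renewable-energy/companies can be detected and the page re-requested.
fetch_page_retry <- function(url, tries = 3, pause = 2) {
  js <- sprintf(
    "fetch('%s').then(r => r.text().then(body => ({url: r.url, body: body})))",
    url
  )
  for (i in seq_len(tries)) {
    res <- b$Runtime$evaluate(js, awaitPromise = TRUE, returnByValue = TRUE)$result$value
    # keep the response only if we are still on the location-europe listing
    if (grepl("location-europe", res$url, fixed = TRUE)) return(res$body)
    Sys.sleep(pause)  # brief pause before retrying a redirected page
  }
  warning("Still redirected after ", tries, " tries: ", url)
  res$body
}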