download info with rvest


I would like to download information from a website with the rvest package. The information I need is inside the HTML element <div class="col-sm-8">. How can I do this?

My usual approach doesn't work:

    url <- "myurl"
    pagina <- read_html(url)

    titoli <- pagina %>%
      html_nodes("col-sm-8") %>%
      html_text()

There is 1 answer

Answered by Till

The contents of this page are rendered with JavaScript, and read_html() does not execute any JavaScript. You’ll either have to use a scraping technique that renders the whole page in a headless browser (e.g. RSelenium), or you can send requests directly to the site's API (e.g. with httr).

The development version of rvest provides read_html_live() which works for this on my end.

Note: In your code, you’d need to prefix the CSS selector with a . (i.e. ".col-sm-8") so the parser looks for elements of that class rather than for a tag with that name.
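
To illustrate the selector point on its own, here is a minimal sketch that needs no network access. The HTML snippet and the variable names (doc, n_without_dot, with_dot) are made up for the example; minimal_html() and html_elements() come from rvest 1.0.0 or later.

```r
library(rvest)

# A tiny in-memory document (invented for this example).
doc <- minimal_html(
  '<div class="col-sm-8">Title A</div><div class="other">Title B</div>'
)

# Without the dot, "col-sm-8" is treated as a tag name (<col-sm-8>),
# so nothing matches.
n_without_dot <- length(html_elements(doc, "col-sm-8"))
n_without_dot
# 0

# With the dot, ".col-sm-8" matches elements whose class includes col-sm-8.
with_dot <- html_text(html_elements(doc, ".col-sm-8"))
with_dot
# "Title A"
```

The same applies to html_nodes(), which is the older name for html_elements().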

    # Install the rvest dev version
    # remotes::install_github("tidyverse/rvest")

    library(rvest)

    url <- "https://www.assonime.it/attivita-editoriale/Pagine/pubblicazioni.aspx"

    # The first time I tried this, the page timed out; the second try worked for me.
    pagina <- read_html_live(url)

    # Note the leading dot: select elements by class
    titoli <- pagina |>
      html_elements(".col-sm-8") |>
      html_text()

    titoli
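
Since the first attempt timed out for me, one option is to wrap the call in a small retry helper. This is only a sketch: read_with_retry and its arguments (attempts, wait, reader) are names I made up, and it assumes that a timeout surfaces as an ordinary R error.

```r
# Retry a page reader a few times before giving up.
# `read_with_retry`, `attempts`, `wait`, and `reader` are illustrative names,
# not part of rvest.
read_with_retry <- function(url, attempts = 3, wait = 2,
                            reader = rvest::read_html_live) {
  for (i in seq_len(attempts)) {
    # Capture the error object instead of stopping on the first failure.
    result <- tryCatch(reader(url), error = function(e) e)
    if (!inherits(result, "error")) {
      return(result)
    }
    message("Attempt ", i, " failed: ", conditionMessage(result))
    # Pause before the next attempt, but not after the last one.
    if (i < attempts) Sys.sleep(wait)
  }
  stop("All ", attempts, " attempts failed for ", url)
}
```

Usage would then be pagina <- read_with_retry(url), followed by the same html_elements() and html_text() calls as above.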