html_element returning NA and I can't understand why


I'm getting into web scraping with R, and trying to extract a data frame for IMDB's top 250 movies. My code is very short so far:

library(tidyverse)
library(rvest)

page = read_html('https://www.imdb.com/chart/top/?ref_=nv_mv_250')

base = html_elements(page, 'li')
base %>% html_elements('h3') %>% html_text2() %>% str_remove('^[0-9]+\\. ')
base %>% html_element('.sc-b0691f29-7 hrgukm cli-title-metadata')

Whenever I try to use html_element, I seem to get only NAs. This is the case for the last line, which is supposed to extract the year, duration and age rating of each movie, but returns only NAs.

The same thing happens with the second-to-last line, trying to extract h3 elements, in this case that's the titles of movies. If I use html_element, I get a list of NAs, while if I use html_elements I get the desired result (this substitution doesn't work for the last line). What am I doing wrong?

There is 1 answer

Answered by the-mad-statter

To combine class selectors, replace the spaces between the classes with periods: '.sc-b0691f29-7 hrgukm cli-title-metadata' should be '.sc-b0691f29-7.hrgukm.cli-title-metadata'. The version with spaces is a descendant selector: it looks for a cli-title-metadata element nested inside an hrgukm element, nested in turn inside an element with class sc-b0691f29-7. No such tags exist on the page, so it matches nothing. The version without spaces selects elements that have all three classes.
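The difference between the two selector forms can be checked offline. The following is a minimal sketch, assuming a made-up HTML fragment built with rvest's minimal_html(); the class names meta, dark and wide are hypothetical stand-ins for the generated IMDB classes.

```r
library(rvest)

# Hypothetical miniature of the list; class names are invented for illustration
doc <- minimal_html('
  <ul>
    <li><div class="meta dark wide"><span>1994</span><span>2h 22m</span></div></li>
    <li><div class="meta"><span>1972</span></div></li>
  </ul>
')

# Compound selector: a single element that carries all three classes
compound <- html_elements(doc, ".meta.dark.wide")

# Descendant selector: a <wide> tag inside a <dark> tag inside a .meta element
descendant <- html_elements(doc, ".meta dark wide")

length(compound)
length(descendant)
```

Only the first div carries all three classes, so the compound selector finds one node, while the descendant selector finds none because the fragment contains no dark or wide tags.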

Instead of an h3 element selector, you could use a class selector (.ipc-title__text) for the titles and then do a little post-processing.

However, year, duration, and rating are more difficult because the span for the rating is not always present. You could therefore select the container of those spans with another class selector (the one you identified, corrected with periods: .sc-b0691f29-7.hrgukm.cli-title-metadata) and process each container, taking the first span (year), the second span (duration), and the third span (rating) if present:

library(rvest)
library(purrr)
library(tibble)

# Read the page once and reuse it for both queries
page <- read_html('https://www.imdb.com/chart/top/?ref_=nv_mv_250')

# Titles: keep only entries that start with a rank ("1. ", "2. ", ...),
# then strip the rank
titles <- page %>% 
  html_elements(".ipc-title__text") %>% 
  html_text() %>% 
  `[`(grepl("^\\d", .)) %>% 
  sub("^\\d+\\. ", "", .)

# One metadata container per movie, holding two or three spans
ylr <- page %>% 
  html_elements(".sc-b0691f29-7.hrgukm.cli-title-metadata")

# First span: year
years <- map_chr(ylr, ~ html_elements(., "span")[1] %>% html_text())

# Second span: duration
durations <- map_chr(ylr, ~ html_elements(., "span")[2] %>% html_text())

# Third span: rating, which is missing for some movies
ratings <- ylr %>% 
  map_chr(
    ~ ifelse(
      length(html_elements(., "span")) == 3, 
      html_elements(., "span")[3] %>% html_text(), 
      NA_character_
    )
  )

tibble(
  title = titles,
  year = years,
  duration = durations,
  rating = ratings
)
#> # A tibble: 250 × 4
#>    title                                             year  duration rating  
#>    <chr>                                             <chr> <chr>    <chr>   
#>  1 The Shawshank Redemption                          1994  2h 22m   R       
#>  2 The Godfather                                     1972  2h 55m   R       
#>  3 The Dark Knight                                   2008  2h 32m   PG-13   
#>  4 The Godfather Part II                             1974  3h 22m   R       
#>  5 12 Angry Men                                      1957  1h 36m   Approved
#>  6 Schindler's List                                  1993  3h 15m   R       
#>  7 The Lord of the Rings: The Return of the King     2003  3h 21m   PG-13   
#>  8 Pulp Fiction                                      1994  2h 34m   R       
#>  9 The Lord of the Rings: The Fellowship of the Ring 2001  2h 58m   PG-13   
#> 10 The Good, the Bad and the Ugly                    1966  2h 58m   Approved
#> # ℹ 240 more rows

Created on 2024-03-26 with reprex v2.1.0
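As a possible alternative to the length check for the rating, html_element() (singular) returns exactly one result per input node and yields NA where nothing matches, whereas html_elements() drops the non-matches. That is also why calling html_element('h3') on every li gives mostly NAs: most li elements contain no h3. A minimal sketch on a made-up fragment follows; the meta class and the sample values are assumptions, not taken from the IMDB page.

```r
library(rvest)

# Hypothetical fragment: the second movie has no rating span
doc <- minimal_html('
  <ul>
    <li><div class="meta"><span>1994</span><span>2h 22m</span><span>R</span></div></li>
    <li><div class="meta"><span>1931</span><span>1h 27m</span></div></li>
  </ul>
')

items <- html_elements(doc, ".meta")

# html_elements() flattens: only the spans that exist come back
all_third <- items %>% html_elements("span:nth-of-type(3)") %>% html_text()

# html_element() keeps one slot per input node, NA where nothing matched
per_item <- items %>% html_element("span:nth-of-type(3)") %>% html_text()

length(all_third)
length(per_item)
```

Because per_item has one entry per movie, it can go straight into a tibble column alongside the years and durations without an explicit length test.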