I'm getting into web scraping with R and trying to build a data frame of IMDb's top 250 movies. My code is very short so far:
library(tidyverse)
library(rvest)
page = read_html('https://www.imdb.com/chart/top/?ref_=nv_mv_250')
base = html_elements(page, 'li')
base %>% html_elements('h3') %>% html_text2() %>% str_remove('^[0-9]+\\. ')
base %>% html_element('.sc-b0691f29-7 hrgukm cli-title-metadata')
Whenever I try to use html_element, I seem to get only NAs. This is the case for the last line, which is supposed to extract the year, duration and age rating of each movie but returns only NAs.
The same thing happens with the second-to-last line, which extracts the h3 elements (here, the movie titles). If I use html_element, I get a list of NAs, while with html_elements I get the desired result (this substitution doesn't work for the last line). What am I doing wrong?
To combine several classes in one selector, replace the spaces between the class names with periods:
'.sc-b0691f29-7 hrgukm cli-title-metadata' should be '.sc-b0691f29-7.hrgukm.cli-title-metadata'
The version with spaces is a chain of descendant selectors: it looks for a cli-title-metadata element inside an hrgukm element inside an element with class sc-b0691f29-7, so nothing matches. The version with periods selects elements that have all three classes.
Instead of an h3 element selector, you could use a class selector (.ipc-title__text) for the titles and then do a little post-processing.
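To make the difference concrete, here is a minimal sketch that runs offline against an invented HTML fragment built with rvest's minimal_html(). The class names are copied from the question; the fragment and the variable names are assumptions for illustration, not the real IMDb markup. It also shows why html_element gave NAs: it returns one result per input node, and many li elements on a page contain no h3.

```r
library(rvest)

# Hypothetical fragment loosely imitating the chart page; class names come
# from the question, everything else is invented for illustration.
page <- minimal_html('
  <ul>
    <li>
      <h3 class="ipc-title__text">1. Some Movie</h3>
      <div class="sc-b0691f29-7 hrgukm cli-title-metadata">
        <span>1994</span><span>2h 22m</span><span>R</span>
      </div>
    </li>
    <li>A navigation item without a title</li>
  </ul>')

items <- html_elements(page, "li")

# Spaces are descendant combinators, so "hrgukm" and "cli-title-metadata"
# are read as tag names and nothing matches.
spaced <- html_elements(page, ".sc-b0691f29-7 hrgukm cli-title-metadata")

# Periods chain class selectors: one element carrying all three classes.
compound <- html_elements(page, ".sc-b0691f29-7.hrgukm.cli-title-metadata")

# html_element() keeps one slot per input node (NA where nothing matches);
# html_elements() returns only the nodes that matched.
titles_one_per_item <- items %>% html_element("h3") %>% html_text2()
titles_matches_only <- items %>% html_elements("h3") %>% html_text2()

length(spaced)
length(compound)
titles_one_per_item
titles_matches_only
```

On this fragment the spaced selector finds nothing, the compound selector finds the one metadata block, and the html_element version yields an NA for the list item that has no h3.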
However, year, duration, and rating are more difficult because the span for rating is not always present. Therefore, you could select the containers of those spans with another class selector (the one you identified, corrected with periods: .sc-b0691f29-7.hrgukm.cli-title-metadata) and process each one separately, taking the first span as the year, the second as the duration, and the third, if present, as the rating.
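One way this per-movie processing might look is sketched below. It is not the original answer's code: it runs against an invented two-movie fragment (the second movie has no rating span), and the names page, meta, spans and movies are illustrative. For the live page you would replace minimal_html(...) with read_html() on the chart URL, assuming the class names from the question still apply there.

```r
library(rvest)

# Hypothetical stand-in for the chart page: two entries, the second without
# a rating span. Class names come from the question; the content is invented.
page <- minimal_html('
  <ul>
    <li>
      <h3 class="ipc-title__text">1. First Movie</h3>
      <div class="sc-b0691f29-7 hrgukm cli-title-metadata">
        <span>1994</span><span>2h 22m</span><span>R</span>
      </div>
    </li>
    <li>
      <h3 class="ipc-title__text">2. Second Movie</h3>
      <div class="sc-b0691f29-7 hrgukm cli-title-metadata">
        <span>1931</span><span>1h 27m</span>
      </div>
    </li>
  </ul>')

# Titles: select by class, then strip the leading rank ("1. ", "2. ", ...).
raw_titles <- html_text2(html_elements(page, ".ipc-title__text"))
titles <- sub("^[0-9]+\\. ", "", raw_titles)

# One metadata container per movie.
meta <- html_elements(page, ".sc-b0691f29-7.hrgukm.cli-title-metadata")

# Read the spans of each container separately, so a missing rating in one
# movie cannot shift the values of the next.
spans <- lapply(meta, function(node) html_text2(html_elements(node, "span")))

# Indexing past the end of a character vector yields NA, which covers the
# case where the third span (rating) is absent.
movies <- data.frame(
  title    = titles,
  year     = as.integer(vapply(spans, function(s) s[1], character(1))),
  duration = vapply(spans, function(s) s[2], character(1)),
  rating   = vapply(spans, function(s) s[3], character(1))
)

movies
```

Note that data.frame() needs titles and meta to have the same length; if the title selector also matches headings that are not movies on the real page, filter those out first (that is the "little post-processing" mentioned above).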
Created on 2024-03-26 with reprex v2.1.0