I would like to take my current data frame of NFL players and look up each player's age on Wikipedia. I currently have a function that loops through the names in my data frame and replaces the space between the first and last name with a '_'. I then use the 'rvest' library to read the page's HTML and extract the infobox.
The problem with my current loop is the object age_row, where I filter the list to search for 'Born'.
If you would like me to add more information please let me know and I will edit the post.
P.S. I usually work with data frames and rarely touch large nested lists...
Thank you in advance.
getwiki <- function(name){
  for(i in RB_14_22$Player){
    NewName <- gsub(' ','_', i)
    Link <- paste0('https://en.wikipedia.org/wiki/',NewName)
    url <- URLencode(Link)
    page <- read_html(url)
    # Extract the infobox from the page
    infobox <- page %>% html_nodes(".infobox") %>% html_table()
    # Extract the row containing the person's age
    age_row <- infobox[[1]] %>%
      filter(grepl("Born", .)) %>%
      filter(grepl("age", .))
    # Extract the age value from the age row
    age <- age_row[[1, 2]] %>%
      str_extract("\\d+")
    # Print the person's age
    cat("The person's age is:", age, "\n")
  }
}
The head of infobox[[1]] is:
# A tibble: 6 × 14
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14
<chr> <chr> <chr> <chr> <chr> <dbl> <chr> <int> <chr> <int> <chr> <chr> <chr> <int>
1 McCaffrey in 2019 McCaffrey in … NA NA NA NA NA NA NA NA NA NA NA NA
2 No. 23 – San Francisco 49ers No. 23 – San … NA NA NA NA NA NA NA NA NA NA NA NA
3 Position: Running back NA NA NA NA NA NA NA NA NA NA NA NA
4 Personal information Personal info… NA NA NA NA NA NA NA NA NA NA NA NA
5 Born: (1996-06-07) … NA NA NA NA NA NA NA NA NA NA NA NA
6 Height: 5 ft 11 in (1… NA NA NA NA NA NA NA NA NA NA NA NA
Here's a complete example that may help. After filtering for rows where the first column contains "Born" and pulling the second column (which contains the information of interest), you may need to remove hidden characters left over from scraping. Viewing the web page source, there is a numeric entity reference (intended to be parseable by machines) between "age" and the numeric age value. I used str_replace_all to replace anything that is not a parenthesis or an alphanumeric character with a space for now. Then you can use str_extract with a regex look-behind for "age" followed by a numeric value for the age itself.
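The steps described above can be sketched roughly as follows. This is a minimal, offline illustration rather than the original answer's code: `infobox_tbl` is a hypothetical stand-in for `infobox[[1]]`, and the text after the birth date (including the age value) is assumed sample data with a non-breaking space between "age" and the number.

```r
library(dplyr)
library(stringr)

# Hypothetical stand-in for infobox[[1]]. The "Born:" text is assumed sample
# data; \u00A0 is the non-breaking space that sits between "age" and the number.
infobox_tbl <- data.frame(
  X1 = c("Position:", "Born:", "Height:"),
  X2 = c("Running back",
         paste0("(1996-06-07) June 7, 1996 (age", "\u00A0", "26)"),
         "5 ft 11 in"),
  stringsAsFactors = FALSE
)

age <- infobox_tbl %>%
  filter(grepl("Born", X1)) %>%                  # test the first column, not the whole table
  pull(X2) %>%                                   # keep only the second column's text
  str_replace_all("[^()[:alnum:]]", " ") %>%     # hidden characters and punctuation become spaces
  str_extract("(?<=age )\\d+")                   # digits immediately after "age "

age
```

Note that `grepl()` is given the column `X1` rather than `.`, so the filter tests one character vector per row instead of the whole data frame.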
Or, without replacing any characters, you could use str_extract alone and put the Unicode escape \u00A0 (a non-breaking space) for the numeric entity reference directly in the look-behind.
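A small sketch of that second approach, under the same assumptions: `extract_age` is a hypothetical helper name, and `born_text` is made-up sample text, not scraped output. Because `str_extract` is vectorised, the helper can be applied to a whole column of "Born" strings and returns `NA` where no age is found.

```r
library(stringr)

# Hypothetical helper: pull the digits that follow "age" plus a non-breaking
# space (\u00A0), and return them as an integer (NA when there is no match).
extract_age <- function(born_text) {
  as.integer(str_extract(born_text, "(?<=age\u00A0)\\d+"))
}

# Assumed sample text; the second element has no age on purpose.
born_text <- c(
  paste0("(1996-06-07) June 7, 1996 (age", "\u00A0", "26)"),
  "no age listed"
)

ages <- extract_age(born_text)
ages
```

Returning `NA` instead of raising an error may be convenient when looping over many players, since some pages will not have an infobox row in this format.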