Extracting a person's age from Wikipedia


I have a data frame of NFL players and would like to look up each player's age on Wikipedia. My current function loops through the names in the data frame and replaces the space between the first and last name with a '_'. It then uses the 'rvest' library to read the HTML of the player's page and extract the infobox.

The problem with my current loop is the object age_row, where I try to filter the infobox table for the row containing 'Born'.

If you would like me to add more information please let me know and I will edit the post.

P.S. I usually work with data frames and rarely touch large nested lists.

Thank you in advance.

getwiki <- function(name){
  for(i in RB_14_22$Player){
    NewName <- gsub(' ','_', i)
    Link <- paste0('https://en.wikipedia.org/wiki/',NewName)
    url <- URLencode(Link)
    page <- read_html(url)
    # Extract the infobox from the page
    infobox <- page %>% html_nodes(".infobox") %>% html_table()
    # Extract the row containing the person's age
    age_row <- infobox[[1]] %>%
      filter(grepl("Born", .)) %>% 
      filter(grepl("age", .))
    # Extract the age value from the age row
    age <- age_row[[1, 2]] %>% 
      str_extract("\\d+")
    # Print the person's age
    cat("The person's age is:", age, "\n")
  }
}

The head of infobox[[1]] is:

# A tibble: 6 × 14
  X1                           X2             X3    X4    X5       X6 X7       X8 X9      X10 X11   X12   X13     X14
  <chr>                        <chr>          <chr> <chr> <chr> <dbl> <chr> <int> <chr> <int> <chr> <chr> <chr> <int>
1 McCaffrey in 2019            McCaffrey in … NA    NA    NA       NA NA       NA NA       NA NA    NA    NA       NA
2 No. 23 – San Francisco 49ers No. 23 – San … NA    NA    NA       NA NA       NA NA       NA NA    NA    NA       NA
3 Position:                    Running back   NA    NA    NA       NA NA       NA NA       NA NA    NA    NA       NA
4 Personal information         Personal info… NA    NA    NA       NA NA       NA NA       NA NA    NA    NA       NA
5 Born:                        (1996-06-07) … NA    NA    NA       NA NA       NA NA       NA NA    NA    NA       NA
6 Height:                      5 ft 11 in (1… NA    NA    NA       NA NA       NA NA       NA NA    NA    NA       NA

There is 1 answer

Ben On

Here's a complete example that may help. After filtering for the rows where the first column contains "Born" and pulling the second column (which holds the information of interest), you may need to remove hidden characters left over from scraping. In the page source there is a &#160; numeric character reference (a non-breaking space) between "age" and the numeric age value. Below I used str_replace_all to replace anything that is not a parenthesis or an alphanumeric character with a space for now.

Then you can use str_extract with a regex look-behind for "age" to capture the numeric value that follows it.

library(tidyverse)
library(rvest)

NewName <- "Christian_McCaffrey"

Link <- paste0('https://en.wikipedia.org/wiki/', NewName)
page <- read_html(URLencode(Link))

infobox <- page %>% 
  html_nodes(".infobox") %>% 
  html_table()

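infobox[[1]] %>%
  # Keep only the row whose first column contains "Born"
  filter(grepl("Born", .[[1]])) %>%
  pull(2) %>%
  # Replace everything except parentheses and alphanumerics with a space
  str_replace_all("[^\\(\\)[:alnum:]]", " ") %>%
  # Look-behind: match the space and digits that follow "age"
  str_extract("(?<=age) \\d+") %>%
  as.numeric()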

Output

[1] 26
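To see why the filter in the question fails and how the cleaning step behaves, here is a small offline sketch that needs no network access. The data frame toy is a made-up stand-in for infobox[[1]], and the text in its "Born:" row is illustrative only, not the real cell contents. As far as I can tell, grepl("Born", .) in the question tests the whole data frame, which yields one result per column instead of one per row.

```r
library(dplyr)
library(stringr)

# Made-up stand-in for infobox[[1]]; the "Born:" text is illustrative only
toy <- data.frame(
  X1 = c("Position:", "Born:", "Height:"),
  X2 = c("Running back", "June 7, 1996 (age\u00A026)", "5 ft 11 in"),
  stringsAsFactors = FALSE
)

# grepl() on a whole data frame coerces each column to a single string,
# so the result has one element per column, not one per row
per_column <- grepl("Born", toy)
length(per_column) == ncol(toy)

# Testing only the first column gives one TRUE/FALSE per row
born <- toy %>%
  filter(grepl("Born", X1)) %>%
  pull(X2)

age <- born %>%
  # The non-breaking space (and the comma) are replaced by a plain space
  str_replace_all("[^\\(\\)[:alnum:]]", " ") %>%
  # Look-behind for "age", then a space and the digits
  str_extract("(?<=age) \\d+") %>%
  as.numeric()

age
```

Because the per-column result does not have one entry per row, filter() cannot use it to select rows, which would explain the problem with age_row.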

Or, without replacing any characters, you could use str_extract alone with the Unicode escape \u00A0 for the non-breaking space:

infobox[[1]] %>%
  filter(grepl("Born", .[[1]])) %>%
  pull(2) %>%
  str_extract("(?<=age)\u00A0\\d+")
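To apply this to every player in the data frame, one option is to separate the parsing from the fetching. The sketch below is only an outline under a few assumptions: extract_age and get_wiki_age are hypothetical helper names, the infobox is assumed to be the first ".infobox" table on the page, and any player whose page is missing or lacks an age simply gets NA.

```r
library(rvest)
library(stringr)

# Parse the age from one infobox table (a data frame whose first column
# holds labels such as "Born:"); returns NA if no age is found
extract_age <- function(infobox_tbl) {
  born <- infobox_tbl[[2]][grepl("Born", infobox_tbl[[1]])]
  if (length(born) == 0) return(NA_real_)
  # Accept a regular or a non-breaking space between "age" and the digits
  as.numeric(str_extract(born[1], "(?<=age[ \u00A0])\\d+"))
}

# Fetch one player's page and parse it; returns NA on any failure
# (hypothetical helper, not part of rvest)
get_wiki_age <- function(player) {
  link <- paste0("https://en.wikipedia.org/wiki/", gsub(" ", "_", player))
  tryCatch({
    infobox <- read_html(URLencode(link)) %>%
      html_nodes(".infobox") %>%
      html_table()
    if (length(infobox) == 0) NA_real_ else extract_age(infobox[[1]])
  }, error = function(e) NA_real_)
}
```

With the data frame from the question, something like RB_14_22$Age <- vapply(RB_14_22$Player, get_wiki_age, numeric(1)) should then fill in an age column. Names that lead to a disambiguation page or a different person would need to be checked by hand.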