rvest package - Is it possible for html_text() to store an NA value if it does not find an attribute?


As the title states, I'm curious if it is possible for the html_text() function from the rvest package to store an NA value if it is not able to find an attribute on a specific page.

I'm currently running a scrape over 199 pages (which works fine; tested on a few variables already).

Currently, when I search for a value that is only present on some (136) of the 199 pages, html_text() returns a vector of only 136 strings. This is not useful because, without NAs, I am unable to determine which pages contained the variable in question.

I see that html_attr() accepts a default argument, but html_text() does not. Any tips?

Thank you so much!


There are 2 answers

hrbrmstr

If you create a new function to wrap error handling, it'll keep the %>% pipe cleaner and easier to grok for your future self and others:

library(rvest)

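# Like html_text(), but returns NA when extraction errors or matches nothing
html_text_na <- function(x, ...) {

  txt <- try(html_text(x, ...))
  # || short-circuits, so length() is only checked when there was no error
  if (inherits(txt, "try-error") ||
      (length(txt) == 0)) { return(NA) }
  return(txt)

}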

base_url <- "http://www.saem.org/membership/services/residency-directory?RecordID=%d"

record_id <- c(1291, 1000, 1166, 1232, 999)

sapply(record_id, function(i) {

  # read_html() replaces html(), which newer rvest releases no longer provide
  read_html(sprintf(base_url, i)) %>%
    html_nodes("#drpict tr:nth-child(6) .text") %>%
    html_text_na %>%
    as.numeric()

})

## [1]  8 NA 10 27 NA

Also, by running sapply over the vector of record IDs, you automagically get back a vector of whatever values you're trying to extract, one per record, in the same order as record_id.
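For what it's worth, newer versions of rvest appear to make the wrapper unnecessary in the simple case. The sketch below assumes rvest 1.0.0 or later, where html_element() (singular) returns a "missing" node when nothing matches and html_text() turns that into NA. The two in-memory pages and the shortened selector are made-up stand-ins so the example runs without hitting the real site:

```r
library(rvest)

# Two stand-in pages built in memory (hypothetical markup, not the real site):
# the first contains the target cell, the second does not.
pages <- list(
  minimal_html('<table id="drpict"><tr><td class="text">8</td></tr></table>'),
  minimal_html('<table id="drpict"><tr><td>no value here</td></tr></table>')
)

# html_element() (singular) yields exactly one result per page; when nothing
# matches, that result is a missing node, which html_text() reports as NA.
values <- vapply(pages, function(page) {
  page %>%
    html_element("#drpict .text") %>%
    html_text()
}, character(1))

as.numeric(values)
```

Because each page contributes exactly one element, the result stays the same length as the input, which is what the question was after. vapply() is used instead of sapply() only so that an unexpected result shape fails loudly.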

dsifford

Figured it out.

I just needed to add a line of logic to my loop.

Here's a chunk of the code that worked:

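all_data <- c()  # start empty; the original chunk assumed this already existed

for (i in record_id) {
  # Build the page URL and parse it (read_html() replaces the older html())
  site <- paste0("http://www.saem.org/membership/services/residency-directory?RecordID=", i)
  site <- read_html(site)
  this_data <- site %>%
    html_nodes("#drpict tr:nth-child(6) .text") %>%
    html_text() %>%
    as.numeric()
  # No match on this page: store NA so positions still line up with record_id
  if (length(this_data) == 0) {
    this_data <- NA
  }
  all_data <- c(all_data, this_data)
}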
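A possible variation on the loop above: preallocate the result as a vector of NAs named by record ID, so pages without a match simply stay NA and each value can be traced back to its page. This is only a sketch; the in-memory pages, the ".text" selector, and the choice to keep just the first match are assumptions made so it runs offline:

```r
library(rvest)

# Hypothetical stand-in pages keyed by record ID (not fetched from the real site).
pages <- list(
  "1291" = minimal_html('<p class="text">8</p>'),
  "1000" = minimal_html('<p>nothing to extract</p>')
)

# Preallocate one NA slot per page; pages without a match keep their NA.
all_data <- setNames(rep(NA_real_, length(pages)), names(pages))

for (id in names(pages)) {
  this_data <- pages[[id]] %>%
    html_nodes(".text") %>%
    html_text() %>%
    as.numeric()
  if (length(this_data) > 0) {
    all_data[[id]] <- this_data[1]  # keep the first match only
  }
}

all_data
```

Preallocating also avoids growing the vector with c() on every iteration, which hardly matters for 199 pages but keeps the result length fixed at one entry per page.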

Thanks anyway everybody (and @hrbrmstr)! :)