R: xml2 extracts more nodes than expected


I do not understand why this code

library(rvest)
u <- "http://alistapart.com/article/daemonskin"
h <- read_html(u)
html_nodes(h, "div.main-content[itemprop='articleBody'] a") %>%
  html_attr("href")

returns many URLs that come from outside the targeted <div> element.

If you look at the HTML source of the target page:

  • There is only one <div> that matches my query. It starts at line 177.
  • The <div> stops at line 396.
  • Yet many of the URLs returned by html_nodes in the code above are extracted from nodes beyond that line, outside of that <div> (see the sketch below).
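A quick way to see the mismatch (a sketch using the objects from the code above; the exact counts depend on when the page is fetched):

inside    <- html_nodes(h, "div.main-content[itemprop='articleBody'] a")
all_links <- html_nodes(h, "a")
length(inside)     # far more links than you can count by hand inside the <div>
length(all_links)  # total links on the page, for comparison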

Note: You might notice that, at line 242 of the HTML, there is a faulty </div element. However, fixing it does not seem to resolve the issue.
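For reference, this is the kind of patch-and-reparse experiment that note refers to (a sketch; the regex is an assumption about how the stray </div looks in the raw source):

raw_html   <- paste(readLines(u), collapse = "\n")
fixed_html <- sub("</div(?!>)", "</div>", raw_html, perl = TRUE)  # terminate the first unclosed </div

read_html(fixed_html) %>%
  html_nodes("div.main-content[itemprop='articleBody'] a") %>%
  html_attr("href") %>%
  length()
# still far more links than expected, consistent with the note above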

I guess that the problem has to do with how xml2 (i.e. libxml2) is parsing the code.

Any other ideas?


1 Answer

Accepted answer by hrbrmstr:

Well, the main problem is poorly crafted, non-compliant HTML on a site that prides itself on being smarter than everyone else in that regard.

When you do:

library(rvest)
library(purrr)

URL <- "http://alistapart.com/article/daemonskin"

read_html(URL) %>%
  html_nodes("div.main-content[itemprop='articleBody'] a") %>%
  html_attr("href") %>%
  str()
##  chr [1:74] "#comments" ...

the underlying libxml2 library is "fixing up" minor HTML (XML) errors, and it has its own way of doing so. That fixup ends up putting more elements under the particular <div> you're targeting, which is why the selector matches links from elsewhere on the page.
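One way to see what libxml2 actually did (a sketch) is to pull the targeted <div> out of the parsed tree and count the <a> descendants it now contains, or re-serialize the document and compare it with the original source:

library(xml2)

doc <- read_html(URL)
div <- html_node(doc, "div.main-content[itemprop='articleBody']")
length(xml_find_all(div, ".//a"))   # how many <a> nodes the parser placed under the <div>
# write_html(doc, "fixed-up.html")  # inspect libxml2's re-serialized tree by hand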

You can see the difference by comparing against the results from a real browser, which we can simulate with Selenium:

NOTE: I started PhantomJS in WebDriver mode outside of R (something like phantomjs --webdriver=8910)

library(seleniumPipes)

rd <- remoteDr(browserName = "phantomjs", port = 8910)
rd %>% go(URL)

When you do that, it's PhantomJS (WebKit, really) applying its own browser-based HTML fixup to the HTML it receives (and it may also add or remove nodes based on any JavaScript that runs on page load).

When you use the equivalent of document.querySelectorAll() to retrieve the nodes, you get 34 (which is also what I got in Chrome in the Developer Tools console):

rd %>%
  findElements("css", "div.main-content[itemprop='articleBody'] a") %>%
  map_chr(getElementAttribute, "href") %>%
  str()
##  chr [1:34] "http://alistapart.com/article/daemonskin#comments" ...

NOTE that if you decide to do what I normally do and use getPageSource() so you can do all the selecting/extracting with rvest/xml2 functions, you'll get yet another result: it extracts the current HTML page from the virtual browser and passes it to read_html(), so libxml2 applies some additional fixup on top of what WebKit theoretically already cleaned up:

rd %>%
  go(URL) %>%
  getPageSource() %>%
  html_nodes("div.main-content[itemprop='articleBody'] a") %>%
  html_attr("href") %>%
  str()
##  chr [1:32] "#comments" ...
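If you want to pin down exactly which links the libxml2 parse pulls in that the browser does not, one approach (a sketch; it assumes you save the href vectors from the snippets above as xml2_hrefs and browser_hrefs, names made up here) is to resolve the relative hrefs and diff the two sets:

library(xml2)

xml2_abs <- url_absolute(xml2_hrefs, URL)     # the browser reports absolute URLs, so normalize first
spurious <- setdiff(xml2_abs, browser_hrefs)  # links libxml2 placed inside the <div> that the browser did not
length(spurious)
head(spurious)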