I do not understand why this code:

```r
library(rvest)

u <- "http://alistapart.com/article/daemonskin"
h <- read_html(u)

html_nodes(h, "div.main-content[itemprop='articleBody'] a") %>%
  html_attr("href")
```

matches many URLs that are beyond the scope of the targeted `<div>` element.
If you look at the HTML source of the target page:

- There is only one `<div>` that matches my query. It starts at line 177.
- The `<div>` stops at line 396.
- Yet many of the URLs returned by `html_nodes` in the code above are extracted from nodes beyond that line, outside of that `<div>`.
Note: You might notice that, at line 242 of the HTML, there is a faulty `</div` element. However, fixing it does not seem to resolve the issue.
I guess that the problem has to do with how `xml2` (i.e. `libxml2`) is parsing the code.

Any other ideas?
Well, the main problem is poorly-crafted, non-compliant HTML at a site that prides itself on being smarter than everyone else in that regard.
When you do:
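```r
# the selector query from the question
html_nodes(h, "div.main-content[itemprop='articleBody'] a") %>%
  html_attr("href")
```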
the underlying `libxml2` library is "fixing up" minor HTML (XML) errors and has its own way of doing things. Those fixups cause more elements to end up under the particular `<div>` you're targeting, because of the way the parser "repaired" the HTML.

You can see the difference by comparing against the results from a real browser, which we can simulate with Selenium:
NOTE: I started `phantomjs` in webdriver mode outside of R.
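A minimal sketch of that setup, assuming `RSelenium` and a phantomjs instance already listening on its default webdriver port (e.g. started with `phantomjs --webdriver=8910`):

```r
library(RSelenium)

# assumes phantomjs is already running: phantomjs --webdriver=8910
rd <- remoteDriver(browserName = "phantomjs", port = 8910)
rd$open()
rd$navigate("http://alistapart.com/article/daemonskin")
```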
When you do that, it's phantomjs (webkit, really) applying its own browser-based HTML parser fixup methods to the HTML it receives (and it's also adding or subtracting nodes based on any javascript triggers on page load).
When you use the equivalent of `document.querySelectorAll()` to retrieve the nodes, you get 34 (which is also what I got in Chrome in the Developer Tools console):
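A sketch of that query through the webdriver session, under the same assumptions as above (the `anchors` and `hrefs` names are illustrative):

```r
# let the browser's own engine evaluate the CSS selector
anchors <- rd$findElements(using = "css selector",
                           value = "div.main-content[itemprop='articleBody'] a")
length(anchors)
## should report 34 here

hrefs <- unlist(sapply(anchors, function(a) a$getElementAttribute("href")))
```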
NOTE that if you decide to do what I normally do and use `getPageSource()` to do all the selecting/extracting via `rvest`/`xml2` functions, you'll get a different result, since that extracts the current HTML page from the virtual browser and passes it to `read_html()`, which has `libxml2` apply some additional fixup to what webkit theoretically already cleaned up:
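A sketch of that round trip, again under the `RSelenium` assumptions above:

```r
# pull the browser-rendered DOM back into R and let libxml2 re-parse it
pg <- read_html(rd$getPageSource()[[1]])

html_nodes(pg, "div.main-content[itemprop='articleBody'] a") %>%
  html_attr("href")
# the node count differs from the browser's 34, since libxml2
# applies its own fixup pass on top of webkit's
```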