Html Agility Pack cannot find element using xpath but it is working fine with WebDriver


I have already seen these questions, 1 and 2, but their answers are not working for me.

I am creating XPaths for objects. They work fine from WebDriver, but selecting nodes with them using HtmlAgilityPack fails in some cases.

I am using the latest HtmlAgilityPack, 1.4.9.

For example, here is a page.


The XPath for the object highlighted in red is

//section[@id='main-content']/div[2]/div/div/div/div/div/p[1]/a

Similarly, for another object, as shown in the picture


its XPath is

//section[@id='main-content']/div[2]/div/div/div/div/div/ul/li[2]/a

Both of these XPaths work absolutely fine from WebDriver, but HtmlAgilityPack cannot find any object with them.

For the first one I tried

HtmlAgilityPack.HtmlNode.ElementsFlags.Remove("p")

It started to work, but why is this required? Also, there is no luck with the second one.

Is there a list of specific tags that need to be removed from ElementsFlags? If there is, what would the impact be?

My requirement is to fetch objects using XPath from HtmlAgilityPack just as WebDriver does.

Any help will be greatly appreciated.

EDIT 1:

The XPaths we get from HAP are also long ones, like div/div/div/div/div/a. Here's the VB.Net code for the example given by Sir Simon:

Dim selectedNode As HtmlAgilityPack.HtmlNode = htmlAgilityDoc.DocumentNode.SelectSingleNode("//section[@id='main-content']//div[@class='pane-content']//a")

Dim xpathValue As String = selectedNode.XPath

Then the xpathValue we get from HAP is

/html[1]/body[1]/section[1]/div[2]/div[1]/div[1]/div[1]/div[1]/div[1]/a[1]

Simon Mourier (best answer)

WebDriver will always rely on the target browser when working with XPATH. Technically, it's just a fancy bridge to the browser (whether the browser is Firefox or Chrome - IE up to 11 does not support XPATH)

Unfortunately, the DOM (the structure of elements and attributes) that resides in browser memory is not the same as the DOM you probably provided to the Html Agility Pack. It could be the same if you loaded HAP with the content of the DOM from browser memory (the equivalent of document.documentElement.outerHTML, for example). In general this is not the case, because developers use HAP to scrape sites without a browser, so they feed it from a network stream (the response to an HTTP GET request) or a raw file.
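If a browser is available anyway, one way to narrow the gap is to hand HAP the browser's own serialized DOM instead of the raw HTTP response. This is only a minimal sketch, assuming the Selenium WebDriver .NET bindings and ChromeDriver are installed; the class name, the choice of Chrome, and the short XPath are illustrative and not taken from the question.

```csharp
using System;
using HtmlAgilityPack;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

class BrowserDomToHap
{
    static void Main()
    {
        // Assumption: ChromeDriver is installed; any other IWebDriver should behave the same way.
        IWebDriver driver = new ChromeDriver();
        try
        {
            driver.Navigate().GoToUrl("http://www2.epa.gov/languages/traditional-chinese");

            // Serialize the DOM as the browser currently holds it in memory,
            // including elements the browser inferred or scripts added.
            var html = (string)((IJavaScriptExecutor)driver)
                .ExecuteScript("return document.documentElement.outerHTML;");

            var doc = new HtmlDocument();
            doc.LoadHtml(html);

            // HAP still applies its own parsing rules to this markup,
            // so a browser XPath is more likely to match but is not guaranteed to.
            var node = doc.DocumentNode.SelectSingleNode("//section[@id='main-content']//a");
            Console.WriteLine(node == null ? "no match" : node.OuterHtml);
        }
        finally
        {
            driver.Quit();
        }
    }
}
```

The trade-off is that this needs a running browser, which is exactly what most HAP users are trying to avoid.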

This problem is easy to demonstrate. For example, if you create a file that contains only this:

<table><tr><td>hello world</td></tr></table>

(no html, no body tag, this is in fact an invalid html file)

With HAP you can load it like this:

HtmlDocument doc = new HtmlDocument();
doc.Load(myFile);

And the structure HAP will come up with is simply this:

+table
 +tr
  +td
   'hello world'

HAP is not a browser. It's a parser, and it doesn't really know the HTML specifications; it just knows how to parse a bunch of tags and build a DOM from them. It doesn't know, for example, that a document should start with HTML and contain a BODY, or that a TABLE element always gets a TBODY child when a browser builds the DOM.

In a Chrome browser though, if you open this file, inspect it and ask for the XPATH of the TD element, it will report this:

/html/body/table/tbody/tr/td

Because Chrome has just made this up by itself... As you see the two systems don't match.
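The mismatch can be checked directly against HAP. The following is a small sketch (the class and variable names are mine) that loads the same one-line table from a string and tries both the path Chrome reports and a path that follows the source as written. Based on the tree shown above, the first lookup is expected to find nothing and the second to find the TD.

```csharp
using System;
using HtmlAgilityPack;

class TbodyMismatch
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<table><tr><td>hello world</td></tr></table>");

        // Path as Chrome reports it: it relies on html, body and tbody,
        // none of which appear in the source that HAP parsed.
        var fromChrome = doc.DocumentNode.SelectSingleNode("/html/body/table/tbody/tr/td");
        Console.WriteLine(fromChrome == null ? "Chrome path: no match" : "Chrome path: match");

        // Path that follows the source exactly as written.
        var fromSource = doc.DocumentNode.SelectSingleNode("/table/tr/td");
        Console.WriteLine(fromSource == null ? "source path: no match" : "source path: " + fromSource.InnerText);

        // HAP can report its own path for a node, which shows the tree it actually built.
        if (fromSource != null)
        {
            Console.WriteLine(fromSource.XPath);
        }
    }
}
```

Printing node.XPath for a node you did manage to select is a quick way to see how HAP's tree differs from the one the browser shows.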

Note if you have id attributes available in the source HTML, the story is better, for example, with the following HTML:

<table><tr><td id='hw'>hello world</td></tr></table>

Chrome will report the following XPATH (it will try to use id attributes as much as possible):

//*[@id="hw"]

Which can be used in HAP as well. But this does not work all the time. For example, with the following HTML:

<table id='hw'><tr><td>hello world</td></tr></table>

Chrome will now produce this XPATH to the TD:

//*[@id="hw"]/tbody/tr/td

As you can see, this is again not usable in HAP because of that inferred TBODY.

So, in the end, you can't just blindly use browser-generated XPATH outside of those browsers. In other contexts, you will have to find other discriminants.

Actually, I personally think it's somewhat of a good thing because it will make your XPATH more resistant to changes. But you'll have to think :-)
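One possible workaround, sketched below under the assumption that the TD is the only cell of interest under that id, is to keep the id anchor but replace the fixed child steps with the descendant axis (//). That expression does not care whether a TBODY sits between the TABLE and the TR, so it should behave the same in a browser and in HAP. The class name is illustrative.

```csharp
using System;
using HtmlAgilityPack;

class IdAnchoredXPath
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<table id='hw'><tr><td>hello world</td></tr></table>");

        string[] xpaths =
        {
            "//*[@id='hw']/tbody/tr/td", // browser style, depends on the inferred tbody
            "//*[@id='hw']//td"          // descendant axis, indifferent to tbody
        };

        foreach (var xpath in xpaths)
        {
            // SelectSingleNode returns null when the expression matches nothing.
            var node = doc.DocumentNode.SelectSingleNode(xpath);
            Console.WriteLine(xpath + " -> " + (node == null ? "no match" : node.InnerText));
        }
    }
}
```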

Now let's get back to your case :)

The following C# console sample should work fine:

  using System;
  using HtmlAgilityPack;

  class Program
  {
      static void Main(string[] args)
      {
          var web = new HtmlWeb();
          var doc = web.Load("http://www2.epa.gov/languages/traditional-chinese");
          // Anchor on the id, then descend to the first link inside the pane-content DIV.
          var node = doc.DocumentNode.SelectSingleNode("//section[@id='main-content']//div[@class='pane-content']//a");
          Console.WriteLine(node.OuterHtml); // displays <a href="http://www.oehha.ca.gov/fish/pdf/59329_CHINESE.pdf">...etc...</a>
      }
  }

If you look at the structure of the stream or file (or even at what the browser displays, but take care to avoid TBODYs...), the easiest approach is to:

  • find an id (just like browsers do) and/or
  • find unique child or grandchild elements or attributes below it, recursively or not
  • avoid overly precise XPATHs. Things like p/p/p/div/a/div/whatever are bad

So, here, after the main-content id attribute, we just look (recursively with //) for a DIV that has a special class and we look (again recursively) for the first child A available.

This XPATH should work in webdriver and in HAP.
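To collect every link under that DIV rather than only the first one, the same expression can be passed to SelectNodes. The sketch below runs against a small hypothetical fragment that only loosely imitates the page in the question (the markup, hrefs and class name are made up), so it can be tried without network access.

```csharp
using System;
using HtmlAgilityPack;

class LooseXPathDemo
{
    static void Main()
    {
        // Hypothetical markup, loosely modeled on the page in the question.
        const string html =
            "<section id='main-content'><div><div class='pane-content'>" +
            "<ul><li><a href='/a.pdf'>first</a></li><li><a href='/b.pdf'>second</a></li></ul>" +
            "</div></div></section>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Anchor on the id, then on the class, then take every link below it.
        var links = doc.DocumentNode.SelectNodes(
            "//section[@id='main-content']//div[@class='pane-content']//a");

        // By default SelectNodes returns null, not an empty collection, when nothing matches.
        if (links == null)
        {
            Console.WriteLine("no match");
            return;
        }

        foreach (var link in links)
        {
            Console.WriteLine(link.InnerText + " -> " + link.GetAttributeValue("href", ""));
        }
    }
}
```

Checking for null before iterating matters here, because an XPath that silently stops matching after a site redesign would otherwise surface as a NullReferenceException.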

Note this XPATH also works: //div[@class='pane-content']//a but it looks a bit loose to me. Anchoring on id attributes is often a good idea.