I have already looked at these questions (1 and 2), but their answers did not work for me.
I am creating XPaths for objects. They work fine from WebDriver, but when I try to select a node using HtmlAgilityPack, it does not work in some cases.
I am using the latest HtmlAgilityPack, 1.4.9.
For example, here is a page.
The XPath for the object highlighted in red is
Similarly, there is another object, as shown in the picture.
Its XPath is
//section[@id='main-content']/div[2]/div/div/div/div/div/ul/li[2]/a
Both of these XPaths work absolutely fine from WebDriver, but neither finds any object with HtmlAgilityPack.
For the first one, I tried
HtmlAgilityPack.HtmlNode.ElementsFlags.Remove("p")
It started to work, but why is this required? Also, I had no luck with the second one.
Is there a list of specific tags that need to be removed from ElementsFlags? If so, what would the impact be?
My requirement is to fetch objects using XPath from HtmlAgilityPack just as WebDriver does.
Any help will be greatly appreciated.
EDIT 1:
The XPaths we are getting from HAP are also long ones, like div/div/div/div/div/a. Here is the VB.Net code for the example given by Sir Simon:
Dim selectedNode As HtmlAgilityPack.HtmlNode = htmlAgilityDoc.DocumentNode.SelectSingleNode("//section[@id='main-content']//div[@class='pane-content']//a")
Dim xpathValue As String = selectedNode.XPath
Then the xpathValue we get from HAP is
WebDriver will always rely on the target browser when working with XPATH. Technically, it's just a fancy bridge to the browser (whether the browser is Firefox or Chrome - IE up to 11 does not support XPATH)
Unfortunately, the DOM (the element and attribute structure) that resides in browser memory is not the same as the DOM you probably provided to the Html Agility Pack. It could be the same if you loaded the HAP with the content of the DOM from browser memory (an equivalent to document.OuterHtml, for example). In general this is not the case, because developers use HAP to scrape sites without a browser, so they feed it from a network stream (from an HTTP GET request) or a raw file.
This problem is easy to demonstrate. For example, if you create a file that contains only this:
(no html, no body tag, this is in fact an invalid html file)
With HAP you can load it like this:
And the structure HAP will come up with is simply this:
The HAP is not a browser, it's a parser and it doesn't really know HTML specifications, it just knows how to parse a bunch of tags and build a DOM with it. It doesn't know for example a document should start with HTML, and should contain a BODY, or that a TABLE element always has a TBODY child when inferred by a browser.
In a Chrome browser though, if you open this file, inspect it and ask for the XPATH of the TD element, it will report this:
Because Chrome has just made this up by itself... As you see the two systems don't match.
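To make the mismatch concrete, here is a minimal sketch. The fragment, the class name and both XPATH strings are assumptions chosen for illustration (a bare table with no html, body or tbody tags, and the kind of absolute path a browser typically reports for it); they are not taken from your page.

```csharp
using System;
using HtmlAgilityPack;

static class TbodyMismatchDemo
{
    // Hypothetical fragment: no html, body, or tbody tags in the source.
    public const string Html = "<table><tr><td>hello world</td></tr></table>";

    // Returns the first node matching the XPath, or null when nothing matches.
    public static HtmlNode Find(string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(Html);
        return doc.DocumentNode.SelectSingleNode(xpath);
    }

    static void Main()
    {
        // Browser-style absolute path, including elements a browser infers.
        HtmlNode viaBrowserPath = Find("/html/body/table/tbody/tr/td");

        // Path that follows the markup exactly as written.
        HtmlNode viaSourcePath = Find("/table/tr/td");

        Console.WriteLine(viaBrowserPath == null
            ? "browser-style path: no match"
            : "browser-style path: match");
        Console.WriteLine(viaSourcePath == null
            ? "source-style path: no match"
            : "source-style path: " + viaSourcePath.InnerText);
    }
}
```

I would expect the browser-style path to find nothing here, since HAP only builds the nodes that are actually present in the text it was given.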
Note that if you have id attributes available in the source HTML, the story is better. For example, with the following HTML:
Chrome will report the following XPATH (it will try to use id attributes as much as possible):
Which can be used in HAP as well. But this does not work all the time. For example, with the following HTML:
Chrome will now produce this XPATH to the TD:
As you can see, this is not usable in HAP, again because of that inferred TBODY.
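A small sketch of the id case, under the assumption that the id sits on the TABLE rather than on the TD. The markup, the id value foo and the class name are hypothetical. The first XPATH has the shape a browser would produce (with the inferred TBODY step); the second skips whatever lies between the id and the TD by using //.

```csharp
using System;
using HtmlAgilityPack;

static class IdAnchorDemo
{
    // Hypothetical markup: the id is on the table, and the source has no tbody.
    public const string Html = "<table id='foo'><tr><td>hello world</td></tr></table>";

    // Returns the first node matching the XPath, or null when nothing matches.
    public static HtmlNode Find(string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(Html);
        return doc.DocumentNode.SelectSingleNode(xpath);
    }

    static void Main()
    {
        // Browser-shaped path: relies on a TBODY that is not in the source.
        HtmlNode withTbody = Find("//*[@id='foo']/tbody/tr/td");

        // Tolerant path: any TD below the element carrying the id.
        HtmlNode tolerant = Find("//*[@id='foo']//td");

        Console.WriteLine("with tbody step: " + (withTbody == null ? "no match" : "match"));
        Console.WriteLine("with // step:    " + (tolerant == null ? "no match" : tolerant.InnerText));
    }
}
```

The // form has the side benefit of matching in a browser as well, because it does not care whether a TBODY was inserted or not.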
So, in the end, you can't just blindly use browser-generated XPATH in contexts other than those browsers. In other contexts, you will have to find other discriminants.
Actually, I personally think it's somewhat of a good thing, because it will make your XPATH more resistant to changes. But you'll have to think :-)
Now let's get back to your case :)
The following C# sample console case should work fine:
If you look at the structure of the stream or file (or even what the browser displays, but take care, avoid TBODYs...), the easiest is to anchor on an id (just like browsers do) and/or other discriminants. Overly precise XPATHs like p/p/p/div/a/div/whatever are bad.
So, here, after the main-content id attribute, we just look (recursively with //) for a DIV that has a special class, and we look (again recursively) for the first descendant A available.
This XPATH should work in webdriver and in HAP.
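As a self-contained sketch of that lookup, the snippet below runs the same XPATH against a small inline document instead of a live page. The markup, the link targets and the class name are made up for illustration; only the XPATH itself comes from the discussion above.

```csharp
using System;
using HtmlAgilityPack;

static class MainContentDemo
{
    // Hypothetical markup that mimics the shape described above.
    public const string Html =
        "<section id='main-content'>" +
        "<div><div class='pane-content'><ul>" +
        "<li><a href='/first'>First link</a></li>" +
        "<li><a href='/second'>Second link</a></li>" +
        "</ul></div></div>" +
        "</section>";

    // Finds the first A below the pane-content DIV inside the main-content SECTION.
    public static HtmlNode FindFirstLink()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(Html);
        return doc.DocumentNode.SelectSingleNode(
            "//section[@id='main-content']//div[@class='pane-content']//a");
    }

    static void Main()
    {
        HtmlNode link = FindFirstLink();
        if (link == null)
        {
            Console.WriteLine("No matching link found.");
            return;
        }

        Console.WriteLine(link.OuterHtml);
        // The XPath property is HAP's own positional path to the node;
        // it reflects the parsed structure, not the expression used to find it.
        Console.WriteLine(link.XPath);
    }
}
```

Note that the XPath property will still give a long positional path (this is what you saw in your edit). That is fine for diagnostics, but the short, id-anchored expression is the one worth keeping in your code.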
Note this XPATH also works:
//div[@class='pane-content']//a
but it looks a bit loose to me. Anchoring on id attributes is often a good idea.
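To show what "loose" can mean in practice, here is a sketch with a hypothetical page where the same class also appears outside the main section. The markup, hrefs and class name are assumptions for illustration only.

```csharp
using System;
using HtmlAgilityPack;

static class LooseXPathDemo
{
    // Hypothetical markup: a sidebar reuses the pane-content class
    // and appears before the main section in document order.
    public const string Html =
        "<aside><div class='pane-content'><a href='/ad'>Sidebar link</a></div></aside>" +
        "<section id='main-content'>" +
        "<div class='pane-content'><a href='/story'>Story link</a></div>" +
        "</section>";

    // Returns the href of the first node matching the XPath, or null if none matches.
    public static string FirstHref(string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(Html);
        HtmlNode node = doc.DocumentNode.SelectSingleNode(xpath);
        return node == null ? null : node.GetAttributeValue("href", "");
    }

    static void Main()
    {
        // Loose: first pane-content link anywhere in the document.
        Console.WriteLine(FirstHref("//div[@class='pane-content']//a"));

        // Anchored: first pane-content link inside the main-content section.
        Console.WriteLine(FirstHref(
            "//section[@id='main-content']//div[@class='pane-content']//a"));
    }
}
```

With markup like this, the loose expression would pick up the sidebar link first, while the id-anchored one stays inside the section you care about.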