WebDriver can find an element using XPath, Html Agility Pack cannot

c# html-agility-pack visual-studio-2010 webdriver xpath

Question

I have continually had problems with Html Agility Pack; my XPath queries only ever work when they are extremely simple:

//*[@id='some_id']

or

//input

However, whenever they get more complicated, Html Agility Pack can't handle them. Here's an example demonstrating the problem. I'm using WebDriver to navigate to Google and return the page source, which is then passed to Html Agility Pack. Both WebDriver and Html Agility Pack attempt to locate the element/node with the same query (C#):

//The XPath query
const string xpath = "//form//tr[1]/td[1]//input[@name='q']";

//Navigate to Google and get page source
var driver = new FirefoxDriver(new FirefoxProfile()) { Url = "http://www.google.com" };
Thread.Sleep(2000);

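//Can WebDriver find it?
//Note: FindElementByXPath throws NoSuchElementException when nothing matches, so it never returns null
var e = driver.FindElementByXPath(xpath);
Console.WriteLine(e != null ? "Webdriver success" : "Webdriver failure");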

//Can Html Agility Pack find it?
var source = driver.PageSource;
var htmlDoc = new HtmlDocument { OptionFixNestedTags = true };
htmlDoc.LoadHtml(source);
var nodes = htmlDoc.DocumentNode.SelectNodes(xpath);
Console.WriteLine(nodes!=null ? "Html Agility Pack success" : "Html Agility Pack failure");

driver.Quit();

In this case, WebDriver successfully located the item, but Html Agility Pack did not.

I know, I know: in this case it's very easy to change the XPath to one that will work, //input[@name='q']. But that only fixes this specific example, which isn't the point. I need something that exactly, or at least closely, mirrors the behavior of WebDriver's XPath engine, or even that of the FirePath or FireFinder add-ons for Firefox.

If WebDriver can find it, then why can't Html Agility Pack find it too?

Accepted Answer

The issue you're running into is with the FORM element. Html Agility Pack handles that element differently: by default, it never reports the form as having child nodes, so any XPath step that descends through form matches nothing.
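One way to see this for yourself is to look at where the input ends up in the parsed tree. The sketch below is only an illustration: the class name FormTreeProbe, the helper Describe, and the tiny HTML string are made up for this example, and the result depends on the Html Agility Pack version installed, since the default handling of form is not the same in every release.

```csharp
using System;
using HtmlAgilityPack;

public static class FormTreeProbe
{
    // Hypothetical helper: reports whether the parsed <form> has children
    // and which element ended up as the parent of the input.
    public static string Describe(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var form = doc.DocumentNode.SelectSingleNode("//form");
        var input = doc.DocumentNode.SelectSingleNode("//input[@name='q']");

        // SelectSingleNode returns null when nothing matches, so guard both.
        var hasChildren = form != null && form.HasChildNodes;
        var parentName = input != null && input.ParentNode != null
            ? input.ParentNode.Name
            : "(none)";

        return "form.HasChildNodes=" + hasChildren + "; input parent=" + parentName;
    }

    public static void Main()
    {
        // Made-up markup: the input is written directly inside the form.
        Console.WriteLine(Describe(
            "<html><body><form><input name='q'></form></body></html>"));
    }
}
```

If the input's parent is reported as something other than form, the parser has placed the form's contents outside it, which is the behavior described above.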

In the particular example you gave, this query does find the target element:

.//div/div[2]/table/tr/td/table/tr/td/div/table/tr/td/div/div[2]/input

However, the same path anchored under form does not, which shows that the form element is what trips up the query:

.//form/div/div[2]/table/tr/td/table/tr/td/div/table/tr/td/div/div[2]/input

That behavior is configurable, though. If you place this line prior to parsing the HTML, the form will give you child nodes:

HtmlNode.ElementsFlags.Remove("form");
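As a rough before/after check, the sketch below runs a form-anchored query twice, once with the default flags and once after the Remove call. The class FormFlagDemo, the helper CountMatches, and the sample markup are hypothetical names chosen for illustration. Note that ElementsFlags is a static collection, so the change applies to every document parsed afterwards in the same process.

```csharp
using System;
using HtmlAgilityPack;

public static class FormFlagDemo
{
    // Hypothetical helper: parses the HTML and counts nodes matching the XPath.
    // SelectNodes returns null (not an empty collection) when nothing matches.
    public static int CountMatches(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        var nodes = doc.DocumentNode.SelectNodes(xpath);
        return nodes == null ? 0 : nodes.Count;
    }

    public static void Main()
    {
        // Made-up markup standing in for the real page source.
        const string html =
            "<html><body><form action='/search'><div><input name='q'></div></form></body></html>";
        const string xpath = "//form//input[@name='q']";

        // With the default flags; the count here depends on the library version.
        Console.WriteLine("Before: " + CountMatches(html, xpath));

        // Must run before LoadHtml, because the flags are consulted while parsing.
        HtmlNode.ElementsFlags.Remove("form");

        Console.WriteLine("After: " + CountMatches(html, xpath));
    }
}
```

In the question's code, the Remove call would go anywhere before htmlDoc.LoadHtml(source); the XPath and the rest of the program stay unchanged.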



Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow