C# HtmlAgilityPack adding tbody

c# html html-agility-pack xpath

Question

HtmlAgilityPack for C# adds a tbody element to tables in the DOM tree after the LoadHtml call, even if it doesn't exist in the original HTML document. How can I disable this?

My algorithm creates XPath expressions by traversing the DOM tree, and the tbody element that doesn't exist in the original document makes SelectNodes fail to find the desired items. It took me a lot of time to figure this out :|

Is it possible to make SelectNodes also consider nodes added by HtmlAgilityPack?

Example:

<table>
    <tr><td>data</td></tr>
</table>

My application would produce this XPath to extract 'data': //table/tbody/tr/td

The tbody step is in the expression because tbody is in the DOM tree after HtmlAgilityPack parses the HTML; HtmlAgilityPack added it even though it doesn't exist in the source. Because of that

doc.DocumentNode.SelectNodes("//table/tbody/tr/td");

would fail.

In other words, the parent TagName of the tr element (HtmlElement) is 'TBODY', not 'TABLE'. I'm also parsing many different websites, so this is just one such situation.

Either SelectNodes is searching the original HTML code rather than the DOM tree that exists after HtmlDocument.LoadHtml, or it doesn't consider the 'virtual' elements that were added.

Popular Answer

You don't have to use the full hierarchy.

Just use the following if all you want are the tds:

doc.DocumentNode.SelectNodes("//table//td");

or skip the tbody level with // and keep the rest of the hierarchy you care about:

doc.DocumentNode.SelectNodes("//table//tr/td");
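
If the XPath expressions are generated automatically, as in the question, the same idea can be applied to the generated string before it is passed to SelectNodes. The following is only a minimal sketch under that assumption: the class XPathTbody and the method SkipTbody are hypothetical names for illustration, not part of HtmlAgilityPack, and the regular expression only handles simple "/tbody" steps, optionally with a numeric position such as "[1]".

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper (not part of HtmlAgilityPack): rewrites a generated
// XPath so that it no longer depends on a tbody element being present.
public static class XPathTbody
{
    // Matches a "/tbody" step, optionally with a numeric position such as
    // "[1]", that comes after another step and is followed by a further
    // "/" step. A leading "//tbody" is left alone on purpose.
    private static readonly Regex TbodyStep =
        new Regex(@"(?<=[^/])/tbody(\[\d+\])?(?=/)", RegexOptions.IgnoreCase);

    // Replaces each matched tbody step with "/", so that
    // "table/tbody/tr" becomes "table//tr" (descendant search).
    public static string SkipTbody(string xpath)
    {
        if (xpath == null) throw new ArgumentNullException(nameof(xpath));
        return TbodyStep.Replace(xpath, "/");
    }
}
```

With this, doc.DocumentNode.SelectNodes(XPathTbody.SkipTbody("//table/tbody/tr/td")) would query "//table//tr/td". Note that "//" matches descendants at any depth, so rows of nested tables can be matched as well; whether that is acceptable depends on the pages being parsed.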


Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow