C# HtmlAgilityPack adding tbody

c# html html-agility-pack xpath

Question

After the LoadHtml method runs, HtmlAgilityPack adds a tbody element to tables in the DOM tree, even when it was not present in the original HTML text. How can I turn this off?

My program generates various XPath expressions by walking the DOM tree, and because the tbody element is missing from the original document, SelectNodes cannot find the nodes I need. It took me a long time to figure this out.

Is it possible to make SelectNodes take into account the nodes that HtmlAgilityPack has added?

Example:

<table>
    <tr><td>data</td></tr>
</table>

To extract "data", my program would generate the following XPath: /table/tbody/tr/td

The tbody step appears in the expression because the element was found in the DOM tree after the HTML was parsed, even though it does not exist in the source. As a result, this call

doc.DocumentNode.SelectNodes("//table/tbody/tr/td");

fails.

In other words, the parent TagName of the tr element (HtmlElement) is 'TBODY' rather than 'TABLE'. I parse many different websites, so this is just one example.

It seems that SelectNodes searches the original HTML code rather than the DOM tree that exists after the HtmlDocument is loaded. In other words, it ignores any "virtual" elements added by LoadHtml.

1/21/2016 6:28:57 PM

Popular Answer

It's not necessary to use the whole hierarchy.

If all you need are the td elements, just use:

doc.DocumentNode.SelectNodes("//table//td");

Or ignore the tbody node and keep the rest of the relevant hierarchy:

doc.DocumentNode.SelectNodes("//table//tr/td");
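
If the XPath strings are generated automatically, the same idea can be applied to them before they reach SelectNodes. The following is only a rough sketch: TbodyXPath, Relax, and Count are hypothetical helper names, not part of HtmlAgilityPack, and the regular expression assumes tbody only ever appears as a plain step with an optional numeric index. Replacing the step with // also means that rows of nested tables can match, so it may be too loose for some pages.

```csharp
using System;
using System.Text.RegularExpressions;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public static class TbodyXPath
{
    // Hypothetical helper (not part of HtmlAgilityPack): replaces every
    // "/tbody/" step, with an optional index such as "[1]", by the
    // descendant axis "//", so the path matches with or without a tbody.
    public static string Relax(string xpath)
    {
        return Regex.Replace(xpath, @"/+tbody(\[\d+\])?/", "//", RegexOptions.IgnoreCase);
    }

    // Counts the matches; the null check covers the case where
    // SelectNodes returns null instead of an empty collection.
    public static int Count(HtmlDocument doc, string xpath)
    {
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
        return nodes == null ? 0 : nodes.Count;
    }

    public static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<table><tr><td>data</td></tr></table>");

        string generated = "/table/tbody/tr/td"; // path produced from the DOM tree
        string relaxed = Relax(generated);       // tbody step replaced by "//"

        Console.WriteLine(relaxed);
        Console.WriteLine(Count(doc, generated)); // number of matches with tbody
        Console.WriteLine(Count(doc, relaxed));   // number of matches without it
    }
}
```

Running this against a few of the real pages should show whether the relaxed path picks up the intended cells or matches too much.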
1/21/2016 6:12:07 PM


Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow