I need to parse an HTML string like this:
    <widget attribute="1">
      <header> <table> </header>
      <item> <tr><td>content</td></tr> </item>
      <footer> </table> </footer>
    </widget>
I'm using Html Agility Pack and I can find all the "widget" elements:
    HtmlDocument doc = new HtmlDocument();
    doc.OptionAutoCloseOnEnd = false;
    doc.OptionOutputAsXml = false;
    doc.LoadHtml(htmlString);
    HtmlNodeCollection widgets = doc.DocumentNode.SelectNodes("//widget");
My problem comes when I try to get the child nodes of a widget node: Html Agility Pack automatically closes all my tags, so I cannot correctly retrieve the header, item and footer nodes. The output it generates is:
    <header> <table> </table></header>
    <item> <tr> <td><p>Riga n.1</p></td> </tr> </item>
    <footer> </footer>
It closes the table tag inside the header and drops the closing table tag inside the footer. Is there a way to leave these tags unclosed? I searched for documentation on the logic behind the LoadHtml method but found nothing. I think I need to play with the options.
Can you help me?
Html Agility Pack does not generally support overlapping tags by design. However, you can tweak it like this:
    HtmlDocument doc = new HtmlDocument();
    // ElementsFlags is static, so set it before calling LoadHtml
    HtmlNode.ElementsFlags.Add("table", HtmlElementFlag.CanOverlap | HtmlElementFlag.Empty);
    doc.LoadHtml(htmlString);
This instructs the library to treat TABLE as an overlapping tag. As a side note, FORM is the only tag defined as overlapping by default (see the reason here: HtmlAgilityPack -- Does <form> close itself for some reason?).
However, this does not come as a free lunch...
The library will now treat everything inside the table, along with the closing table tags, as a plain text element. The tags inside the parsed table will therefore not be programmatically accessible: you won't see them in the DOM or through XPath queries. That may still be enough for your needs.
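To check what the tweak gives you on your own input, a minimal sketch along these lines may help. It assumes the HtmlAgilityPack package is referenced, reuses the sample markup from the question, and simply lists the element children of each widget; the `Program` class and the output format are only illustrative. It uses the dictionary indexer instead of `Add`, since `ElementsFlags` is a static dictionary and `Add` throws if the key has already been registered.

```csharp
using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // Sample input taken from the question.
        string htmlString =
            "<widget attribute=\"1\"> <header> <table> </header> " +
            "<item> <tr><td>content</td></tr> </item> " +
            "<footer> </table> </footer> </widget>";

        // Static setting: must be in place before LoadHtml is called.
        // The indexer overwrites an existing entry instead of throwing.
        HtmlNode.ElementsFlags["table"] =
            HtmlElementFlag.CanOverlap | HtmlElementFlag.Empty;

        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(htmlString);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection widgets = doc.DocumentNode.SelectNodes("//widget");
        if (widgets == null)
        {
            Console.WriteLine("No widget elements found.");
            return;
        }

        foreach (HtmlNode widget in widgets)
        {
            foreach (HtmlNode child in widget.ChildNodes)
            {
                // Skip the whitespace text nodes between the elements.
                if (child.NodeType != HtmlNodeType.Element)
                {
                    continue;
                }

                // Print each child element's name and its raw inner markup.
                Console.WriteLine("{0}: {1}", child.Name, child.InnerHtml.Trim());
            }
        }
    }
}
```

Because the flag is process-wide, it affects every document parsed afterwards, so it is worth removing the entry again (`HtmlNode.ElementsFlags.Remove("table")`) if other code in the same application relies on normal table parsing.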