How to get HTML text between H1 tags in C#

c# html html-agility-pack

Question

I need to parse an HTML document to extract all the H1 tags and all HTML between them. I have been playing with HtmlAgilityPack to achieve this with some success. I could extract all H1 tags using:

foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//h1"))

But how do I extract all the HTML after each H1 tag until I hit the next H1 tag? This HTML could include anything, such as a table, image, link, or any other element on an HTML page, except another H1 tag.

Thanks in advance.

Popular Answer

Possible solution: get the complete HTML as a string, replace each <h1> opening tag with a character that does not otherwise occur in the markup (e.g. ü, which HTML writes as the entity &uuml;), then split the string on that character into an array.

Now search each piece (with a regex, for example) for nodes that have both a start and an end tag, and parse only those.

Quick and dirty, but should work.
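
A minimal sketch of the replace-and-split idea above, using only the .NET base class library. The class name H1Splitter, the method name SplitByH1, and the choice of U+0001 as the marker character are assumptions made for illustration; the sketch also assumes the marker never occurs in the input and that closing tags are written exactly as </h1>.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class H1Splitter
{
    // Assumption: this control character never occurs in the input HTML.
    private const char Marker = '\u0001';

    // Hypothetical helper: returns (heading text, HTML up to the next <h1>) pairs.
    public static List<KeyValuePair<string, string>> SplitByH1(string html)
    {
        // Step 1: replace every opening <h1 ...> tag with the marker.
        string marked = Regex.Replace(
            html, @"<h1\b[^>]*>", Marker.ToString(), RegexOptions.IgnoreCase);

        // Step 2: split on the marker; element 0 is whatever preceded the first <h1>.
        string[] chunks = marked.Split(Marker);

        var sections = new List<KeyValuePair<string, string>>();
        for (int i = 1; i < chunks.Length; i++)
        {
            // Step 3: keep only chunks that also contain the closing tag.
            int close = chunks[i].IndexOf("</h1>", StringComparison.OrdinalIgnoreCase);
            if (close < 0) continue;

            string heading = chunks[i].Substring(0, close).Trim();
            string body = chunks[i].Substring(close + "</h1>".Length).Trim();
            sections.Add(new KeyValuePair<string, string>(heading, body));
        }
        return sections;
    }
}
```

Calling H1Splitter.SplitByH1 on "<p>intro</p><h1>First</h1><p>a</p><h1>Second</h1><img src=\"x.png\">" should yield two pairs, one keyed "First" with the paragraph as its value and one keyed "Second" with the image tag. A chunk without a closing tag is skipped, which is how an outer H1 that wraps another H1 ends up dropped.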

Please be aware that, as drachenstern mentioned, nested H1 tags will lead to parent nodes not being parsed.



Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow
Is this KB legal? Yes, learn why