I need to parse an HTML document to extract all the H1 tags and all HTML between them. I have been playing with HtmlAgilityPack to achieve this with some success. I could extract all H1 tags using:
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//h1"))
But how do I extract all the HTML after each H1 tag until I hit the next H1 tag? This HTML could include anything (a table, image, link, or any other element found on an HTML page) except another H1 tag.
Thanks in advance.
Possible solution: get the complete HTML as a string, replace every <h1> with a character that does not otherwise occur in the document (e.g. ü, provided the page always writes it as the entity &uuml;), then split the string on that character into an array.
Then search the resulting pieces (with a regex, for example) for nodes that have both a start and an end tag, and parse only those.
Quick and dirty, but should work.
Please be aware that, as drachenstern mentioned, nested H1 tags will lead to parent nodes not being parsed.
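A rough sketch of the splitting idea in C#, using only the standard library. Instead of inserting a placeholder character, this variant records the position of every opening <h1 tag and cuts the string at those positions, which has the same effect without needing a character that is guaranteed to be unused. The class name H1Splitter and the method SplitByH1 are made up for illustration. Since it matches on raw text, it will also be fooled by "<h1" inside comments or scripts, so treat it as quick and dirty too.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of HtmlAgilityPack.
public static class H1Splitter
{
    // Returns one chunk per <h1>: the heading itself plus everything
    // up to (but not including) the next opening <h1> tag.
    public static List<string> SplitByH1(string html)
    {
        var sections = new List<string>();

        // Match opening tags only ("</h1" does not match); \b allows attributes.
        MatchCollection starts = Regex.Matches(html, @"<h1\b", RegexOptions.IgnoreCase);

        for (int i = 0; i < starts.Count; i++)
        {
            int begin = starts[i].Index;
            // The chunk ends where the next <h1> begins, or at the end of the document.
            int end = (i + 1 < starts.Count) ? starts[i + 1].Index : html.Length;
            sections.Add(html.Substring(begin, end - begin));
        }

        // Anything before the first <h1> is deliberately dropped.
        return sections;
    }
}
```

Each returned chunk could then be handed to HtmlAgilityPack on its own (for example via HtmlDocument.LoadHtml) if you need nodes rather than strings. The caveat about nested H1 tags applies here as well: a nested heading starts a new chunk and leaves the outer one cut off.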