In C#, how can I retrieve HTML content between H1 tags?

c# html html-agility-pack

Question

I need to parse an HTML page and extract all of its H1 elements along with the HTML in between them. I've been experimenting with HtmlAgilityPack for this, with some success. Using the following, I was able to select all the H1 tags:

foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//h1"))

But how can I extract all the HTML that follows each H1 tag, up to the next H1 tag? That HTML could contain a table, an image, a link, or any other element that can appear on a page, but not another H1 tag.

I appreciate it.

10/11/2010 11:59:36 PM

Popular Answer

Possible workaround: take the whole HTML as a string, replace each <H1> with a character that does not occur in the HTML (for example ü, since HTML normally writes it as &uuml;), and then split the string on that character into an array.

Then search (with a regex, for instance) for the chunks that contain both start AND end tags, and parse only those.

Quick and crude, but it ought to work.

As drachenstern pointed out, be aware that nested H1 tags will cause the parent nodes to be skipped.
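
A minimal sketch of the split idea above, not the answerer's actual code. It assumes the page is already loaded into a string, and it uses Regex.Split on the opening tag instead of first substituting a placeholder character, which has the same effect in one step. The class name H1Splitter and the method name Split are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: names are illustrative, not from any library.
public static class H1Splitter
{
    // Opening <h1> tag, with or without attributes, in any letter case.
    private static readonly Regex OpenH1 =
        new Regex(@"<h1\b[^>]*>", RegexOptions.IgnoreCase);

    // Closing </h1> tag, in any letter case.
    private static readonly Regex CloseH1 =
        new Regex(@"</h1\s*>", RegexOptions.IgnoreCase);

    // Returns (heading, following HTML) pairs, one per H1 that has an end tag.
    public static List<KeyValuePair<string, string>> Split(string html)
    {
        var sections = new List<KeyValuePair<string, string>>();

        // Cut the page at every opening <h1>; chunks[0] is whatever
        // precedes the first heading, so the loop starts at index 1.
        string[] chunks = OpenH1.Split(html);

        for (int i = 1; i < chunks.Length; i++)
        {
            Match close = CloseH1.Match(chunks[i]);

            // No end tag in this chunk (unclosed H1, or the parent of a
            // nested H1): skip it, as the caveat above describes.
            if (!close.Success)
            {
                continue;
            }

            string heading = chunks[i].Substring(0, close.Index);
            string body = chunks[i].Substring(close.Index + close.Length);
            sections.Add(new KeyValuePair<string, string>(heading, body));
        }

        return sections;
    }
}
```

Calling H1Splitter.Split("<h1>A</h1><p>x</p><h1>B</h1><table></table>") yields two pairs: "A" with "<p>x</p>", and "B" with "<table></table>". Like any regex-based approach, this is quick and crude: it will also split on <h1> text inside comments or scripts, so walking the parsed nodes with HtmlAgilityPack is likely to be more robust on real pages.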

10/12/2010 12:13:27 AM



Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow