How to parse this piece of HTML?

c# html html-agility-pack

Question

good morning! i am using c# (framework 3.5sp1) and want to parse following piece of html via regex:

<h1>My caption</h1>
<p>Here will be some text</p>

<hr class="cs" />
<h2 id="x">CaptionX</h2>
<p>Some text</p>

<hr class="cs" />
<h2 id="x">CaptionX</h2>
<p>Some text</p>

<hr class="cs" />
<h2 id="x">CaptionX</h2>
<p>Some text</p>

i need following output:

  • group 1: content of h1
  • group 2: content of h1-following text
  • group 3-n: content of subcaptions + text

what i have atm:

<hr.*?/>
<h2.*?>(.*?)</h2>
([\W\S]*?)
<hr.*?/>

this will give me every odd subcaption + content (eg. 1, 3, ...) due to the trailing <hr/>. for parsing the h1-caption i have another pattern (<h1.*?>(.*?)</h1>), which only gives me the caption but not the content - i'm fine with that atm.

does anybody have a hint/solution for me or any alternative logics (eg. parsing the html via reader and assigning it this way?)?

edit:
as some brought in HTMLAgilityPack, i was curious about this nice tool. i accomplished getting content of the <h1>-tag.
but ... myproblem is parsing the rest. this is caused by: the tags for the content may vary - from <p> to <div> and <ul>... atm this seems more or less iterate over the whole document and parsing tag for tag ...? any hints?

Accepted Answer

You will really need HTML parser for this


Popular Answer

Don't use regex to parse HTML. Consider using the HTML Agility Pack.




Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow
Is this KB legal? Yes, learn why
Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow
Is this KB legal? Yes, learn why