Good morning! I am using C# (.NET Framework 3.5 SP1) and want to parse the following piece of HTML with a regex:
<h1>My caption</h1>
<p>Here will be some text</p>
<hr class="cs" />
<h2 id="x">CaptionX</h2>
<p>Some text</p>
<hr class="cs" />
<h2 id="x">CaptionX</h2>
<p>Some text</p>
<hr class="cs" />
<h2 id="x">CaptionX</h2>
<p>Some text</p>
I need the following output:
What I have at the moment:
<hr.*?/> <h2.*?>(.*?)</h2> ([\W\S]*?) <hr.*?/>
This gives me only every odd subcaption + content (e.g. 1, 3, ...), because the trailing <hr/> is consumed by the match and the next match can no longer start at it. For parsing the h1 caption I have another pattern (<h1.*?>(.*?)</h1>), which only gives me the caption but not the content. I'm fine with that for now.
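One direction I am considering for the odd-only problem is a lookahead, so that the trailing <hr/> is checked but not consumed. This is only a minimal sketch, assuming the markup always looks like the sample above. The names SectionParser and Extract are made up for illustration, and the pattern is a variation of mine, not a general HTML parser.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class SectionParser
{
    // (?=<hr|$) is a lookahead: it requires the next <hr (or the end of the
    // input) to follow, but does not consume it, so the next match can
    // start at that same <hr>.
    static readonly Regex Section = new Regex(
        @"<hr[^>]*/>\s*<h2[^>]*>(.*?)</h2>\s*(.*?)\s*(?=<hr|$)",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Returns one (caption, content) pair per <hr/> + <h2> section.
    public static List<KeyValuePair<string, string>> Extract(string html)
    {
        var sections = new List<KeyValuePair<string, string>>();
        foreach (Match m in Section.Matches(html))
        {
            sections.Add(new KeyValuePair<string, string>(
                m.Groups[1].Value, m.Groups[2].Value));
        }
        return sections;
    }

    static void Main()
    {
        // Sample input from the question, collapsed onto one line.
        string html =
            "<h1>My caption</h1> <p>Here will be some text</p> " +
            "<hr class=\"cs\" /> <h2 id=\"x\">CaptionX</h2> <p>Some text</p> " +
            "<hr class=\"cs\" /> <h2 id=\"x\">CaptionX</h2> <p>Some text</p> " +
            "<hr class=\"cs\" /> <h2 id=\"x\">CaptionX</h2> <p>Some text</p>";

        foreach (KeyValuePair<string, string> s in Extract(html))
        {
            Console.WriteLine("caption: " + s.Key);
            Console.WriteLine("content: " + s.Value);
        }
    }
}
```

With the sample input this should yield all three h2 sections instead of only the first and third. It would still break on nested or irregular markup, which is part of why I am asking about alternatives.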
Does anybody have a hint or solution for me, or an alternative approach (e.g. parsing the HTML with a reader and assigning the values that way)?
As some of you brought up HTMLAgilityPack, I got curious about this nice tool. I managed to get the content of the
but ... my problem is parsing the rest. The cause: the tags for the content may vary - from
At the moment this seems to come down to more or less iterating over the whole document and parsing it tag by tag ...?