Good morning! I want to use a regex to parse the following bit of HTML in C# (.NET Framework 3.5 SP1):
<h1>My caption</h1> <p>Here will be some text</p> <hr class="cs" /> <h2 id="x">CaptionX</h2> <p>Some text</p> <hr class="cs" /> <h2 id="x">CaptionX</h2> <p>Some text</p> <hr class="cs" /> <h2 id="x">CaptionX</h2> <p>Some text</p>
I need the output as follows:
What I currently have:
<hr.*?/> <h2.*?>(.*?)</h2> ([\W\S]*?) <hr.*?/>
Due to the trailing part of the pattern, this only gives me every odd subcaption + content (1, 3, ...). I have another pattern for the h1 caption, which only returns the caption and not the content, but I'm okay with that for now.
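One idea for the odd-only matches: the trailing `<hr.*?/>` in the pattern consumes the separator that the next match would need as its start, so turning it into a lookahead might help. This is only a minimal sketch, assuming the markup always looks like the sample above; `SectionRegex` and `Parse` are hypothetical names I made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SectionRegex
{
    // Hypothetical helper: returns (caption, content) pairs for each <h2> section.
    public static List<KeyValuePair<string, string>> Parse(string html)
    {
        // The closing separator sits inside a lookahead (?=...), so it is checked
        // but not consumed and can still start the next match. "|$" lets the
        // last section end at the end of the input instead of at an <hr />.
        string pattern = @"<hr[^>]*/>\s*<h2[^>]*>(.*?)</h2>\s*(.*?)\s*(?=<hr[^>]*/>|$)";

        var result = new List<KeyValuePair<string, string>>();
        // Singleline makes '.' match newlines, so content may span several lines.
        foreach (Match m in Regex.Matches(html, pattern, RegexOptions.Singleline))
        {
            result.Add(new KeyValuePair<string, string>(m.Groups[1].Value, m.Groups[2].Value));
        }
        return result;
    }
}
```

Calling `SectionRegex.Parse(html)` on the sample should then yield one pair per `<h2>` section rather than every second one. It is still a regex over HTML, so it would break as soon as the markup deviates from the sample (nested tags, `<hr>` without the slash, and so on).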
Does anybody have a suggestion, a solution, or any other logic I could use (such as parsing the HTML with a reader and assigning the parts that way)?
Some people brought up the HTMLAgilityPack, which intrigued me. I managed to get the content of
but parsing the remaining text is my problem. The tags for the content may vary, starting from
... So for now, it seems I have to iterate over the whole document and process the tags one by one?
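For the tag-by-tag iteration, one option might be to select only the headings and then walk each heading's following siblings until the next separator, instead of looping over the whole document. This is a rough sketch, assuming the HtmlAgilityPack assembly is referenced and that headings, content and `<hr />` separators are all siblings as in the sample; `SectionWalker` and `Parse` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using HtmlAgilityPack;

public static class SectionWalker
{
    // Hypothetical helper: returns (caption, content HTML) pairs for every <h1>/<h2>.
    public static List<KeyValuePair<string, string>> Parse(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new List<KeyValuePair<string, string>>();
        HtmlNodeCollection headings = doc.DocumentNode.SelectNodes("//h1|//h2");
        if (headings == null) return result; // no headings found

        foreach (HtmlNode heading in headings)
        {
            var content = new StringBuilder();
            // Collect every sibling after the heading, whatever its tag,
            // until the next <hr> separator or the next heading.
            for (HtmlNode n = heading.NextSibling; n != null; n = n.NextSibling)
            {
                if (n.Name == "hr" || n.Name == "h1" || n.Name == "h2") break;
                content.Append(n.OuterHtml);
            }
            result.Add(new KeyValuePair<string, string>(heading.InnerText, content.ToString().Trim()));
        }
        return result;
    }
}
```

Because the loop stops on the separator rather than on a specific content tag, it should not matter whether a section's content starts with a `<p>` or with something else.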