How can I parse this HTML code?

c# html html-agility-pack

Question

Good morning! I want to use a regex to parse the following bit of HTML in C# (.NET Framework 3.5 SP1):

<h1>My caption</h1>
<p>Here will be some text</p>

<hr class="cs" />
<h2 id="x">CaptionX</h2>
<p>Some text</p>

<hr class="cs" />
<h2 id="x">CaptionX</h2>
<p>Some text</p>

<hr class="cs" />
<h2 id="x">CaptionX</h2>
<p>Some text</p>

I need the output as follows:

  • group 1: h1 content
  • group 2: the text that follows the h1 tag
  • group 3-n: subcaption content plus text

What I currently have:

<hr.*?/>
<h2.*?>(.*?)</h2>
([\W\S]*?)
<hr.*?/>

Because of the trailing <hr/>, this gives me only every odd subcaption + content (1, 3, ...). I have another pattern for the h1 caption (<h1.*?>(.*?)</h1>), which captures only the caption and not the content, but I'm okay with that for now.

Does anybody have a suggestion, a solution, or some other logic I could use (such as parsing the HTML with a reader and assigning the pieces that way)?

Edit:
I was intrigued by the HtmlAgilityPack that some people brought up. I managed to get the content of the <h1> tag,
but extracting the remaining text is my problem. The content tags may vary, from <p> to <div> and <ul>... For now, it looks like I have to iterate over the whole document and handle the tags one by one? Any hints?

1
3
5/22/2015 10:00:00 PM

Accepted Answer

You really need an HTML parser for this.

9
5/23/2017 11:54:31 AM

Popular Answer

Don't parse HTML with regex. Use the HTML Agility Pack, for example.
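
A minimal sketch of how this could look with the HTML Agility Pack, assuming the fragment from the question is parsed as-is (no surrounding <html>/<body>), so that the headings, paragraphs and <hr> separators are direct children of the document node. The class name SectionSplitter and the method Split are made up for illustration and are not part of the library. The idea is to walk the top-level nodes once, start a new group at every <h1>/<h2>, skip the <hr> separators, and append everything else (<p>, <div>, <ul>, ...) to the current group:

```csharp
using System.Collections.Generic;
using System.Text;
using HtmlAgilityPack;

// Hypothetical helper, not part of HtmlAgilityPack.
static class SectionSplitter
{
    // Returns (caption, content) pairs: the h1 section first, then one per h2.
    public static List<KeyValuePair<string, string>> Split(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var sections = new List<KeyValuePair<string, string>>();
        string caption = null;
        var content = new StringBuilder();

        // Assumption: the relevant nodes are top-level siblings.
        foreach (HtmlNode node in doc.DocumentNode.ChildNodes)
        {
            if (node.Name == "h1" || node.Name == "h2")
            {
                // A heading closes the previous section and opens a new one.
                if (caption != null)
                    sections.Add(new KeyValuePair<string, string>(
                        caption, content.ToString().Trim()));
                caption = node.InnerText.Trim();
                content.Length = 0; // StringBuilder.Clear() is not available in .NET 3.5
            }
            else if (node.Name == "hr")
            {
                // Separator only; nothing to collect.
            }
            else if (caption != null)
            {
                // Any other tag (or text) belongs to the current section.
                content.Append(node.OuterHtml);
            }
        }

        // Flush the last section.
        if (caption != null)
            sections.Add(new KeyValuePair<string, string>(
                caption, content.ToString().Trim()));

        return sections;
    }
}
```

Calling SectionSplitter.Split(html) on the fragment from the question should yield four pairs: the h1 caption with its paragraph, followed by one pair per h2. If the real page nests these elements inside a container, the loop would have to start from that container's ChildNodes instead of doc.DocumentNode.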




Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow