HtmlAgilityPack: how to extract HTML between some tags

c# html-agility-pack

Question

I need to extract all the paragraphs from an HTML string, and also all the text between those tags.

This code does not work when the text parsed into the HtmlDocument differs from the original. In the sample,

some <br />text

is changed to

some <br>text

For example:

string s = "<p>firt paragraph</p>some <br />text<p>another paragraph</p><span>some text between span</span><p>hellow word</p>";
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(s);
var nodes = doc.DocumentNode.SelectNodes("//p");
int lastPos = -1; // end of the previous <p>, or -1 before the first one
foreach (HtmlAgilityPack.HtmlNode n in nodes)
{
    if (lastPos > -1)
    {
        // Slice the parsed HTML from the end of the previous <p> to the start of this one.
        string textNotInP = doc.DocumentNode.OuterHtml.Substring(lastPos, n.StreamPosition - lastPos);
        System.Diagnostics.Debug.WriteLine(textNotInP);
    }
    System.Diagnostics.Debug.WriteLine(n.OuterHtml);
    lastPos = n.StreamPosition + n.OuterHtml.Length;
}

The correct result would be:

<p>firt paragraph</p>
some <br>text
<p>another paragraph</p>
<span>some text between span</span>
<p>hellow word</p>

But the code above returns this:

<p>firt paragraph</p>
some <br>text<p
<p>another paragraph</p>
pan>some text between span</span><p
<p>hellow word</p>

The reason is that StreamPosition returns the node's position relative to the original text, not relative to the HTML as parsed into the HtmlDocument.

Is there a way to get the position of a node relative to the parsed HTML?

Accepted Answer

You can use the OuterHtml property of each <p> element to get the desired HTML:

string s = "<p>firt paragraph</p>some <br />text<p>another paragraph</p><span>some text between span</span><p>hellow word</p>";
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(s);
var nodes = doc.DocumentNode.SelectNodes("//p");
foreach (var item in nodes)
{
    Console.WriteLine(item.OuterHtml);
}

Output:

<p>firt paragraph</p>
<p>another paragraph</p>
<p>hellow word</p>

Or, if you want everything between the first and the last <p> elements, inclusive, you can use the following XPath:

var query = "//node()[preceding-sibling::p or self::p][following-sibling::p or self::p]";

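The XPath selects every node (element or text node) that satisfies both predicates: it is a p element or has a preceding sibling p, and it is a p element or has a following sibling p. In other words, it matches each p itself plus any sibling node that has a p before it and a p after it.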

var nodes = doc.DocumentNode.SelectNodes(query);
foreach (var item in nodes)
{
    Console.WriteLine(item.OuterHtml);
}

Output:

<p>firt paragraph</p>
some
<br />
text
<p>another paragraph</p>
<span>some text between span</span>
<p>hellow word</p>
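
If you want the nodes between two paragraphs joined into a single string, as in the expected result in the question, one option is to walk the child nodes directly instead of computing offsets with StreamPosition. The following is only a sketch: SplitByParagraphs is a hypothetical helper, not part of HtmlAgilityPack, and it assumes the <p> elements are direct children of the document node, as in the sample string. Like the question's loop, it ignores anything before the first <p> and after the last one.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using HtmlAgilityPack;

class Program
{
    // Hypothetical helper: yields each top-level <p> as its own string and,
    // between two <p> elements, the concatenated OuterHtml of the nodes in between.
    static IEnumerable<string> SplitByParagraphs(HtmlDocument doc)
    {
        var between = new StringBuilder();
        bool seenParagraph = false;

        foreach (HtmlNode node in doc.DocumentNode.ChildNodes)
        {
            if (node.Name == "p")
            {
                // Flush whatever was collected since the previous <p>.
                if (seenParagraph && between.Length > 0)
                    yield return between.ToString();
                between.Clear();
                seenParagraph = true;
                yield return node.OuterHtml;
            }
            else if (seenParagraph)
            {
                // Text or element node located after a <p>: collect it.
                between.Append(node.OuterHtml);
            }
        }
        // Nodes collected after the last <p> are intentionally dropped.
    }

    static void Main()
    {
        string s = "<p>firt paragraph</p>some <br />text<p>another paragraph</p><span>some text between span</span><p>hellow word</p>";
        var doc = new HtmlDocument();
        doc.LoadHtml(s);

        foreach (string chunk in SplitByParagraphs(doc))
            Console.WriteLine(chunk);
    }
}
```

Because every chunk comes from the parsed nodes, the strings always agree with the parsed document, whichever way the library serializes the <br> element.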


Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow