HtmlAgilityPack: how to extract the HTML between certain tags

c# html-agility-pack

Question

I need to extract every paragraph, and all the content between paragraphs, from a single HTML file.

My code breaks when the HTML document produced by the parser differs from the original text. For example,

some <br />text

becomes

some <br>text

Example:

string s = "<p>firt paragraph</p>some <br />text<p>another paragraph</p><span>some text between span</span><p>hellow word</p>";
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(s);
var nodes = doc.DocumentNode.SelectNodes("//p");
int lastPos = -1;
foreach (HtmlAgilityPack.HtmlNode n in nodes)
{
    if (lastPos > -1)
    {
        // Content between the previous <p> and this one.
        string textNotInP = doc.DocumentNode.OuterHtml.Substring(lastPos, n.StreamPosition - lastPos);
        System.Diagnostics.Debug.WriteLine(textNotInP);
    }
    System.Diagnostics.Debug.WriteLine(n.OuterHtml);
    lastPos = n.StreamPosition + n.OuterHtml.Length;
}

The expected output would be:

<p>firt paragraph</p>
some <br>text
<p>another paragraph</p>
<span>some text between span</span>
<p>hellow word</p>

However, the code above produces the following result:

<p>firt paragraph</p>
some <br>text<p
<p>another paragraph</p>
pan>some text between span</span><p
<p>hellow word</p>

The reason is that StreamPosition returns the node's position in the original text, not its position in the HTML that HtmlDocument generates.

Is it possible to get the position of a given node in the processed HTML?

5/19/2016 10:32:36 AM

Accepted Answer

You can use the OuterHtml property of each <p> node to get the required HTML:

string s = "<p>firt paragraph</p>some <br />text<p>another paragraph</p><span>some text between span</span><p>hellow word</p>";
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(s);
var nodes = doc.DocumentNode.SelectNodes("//p");
foreach (var item in nodes)
{
    Console.WriteLine(item.OuterHtml);
}

output:

<p>firt paragraph</p>
<p>another paragraph</p>
<p>hellow word</p>

Alternatively, if you want everything between the first <p> and the last <p>, use the following XPath to include all nodes:

var query = "//node()[preceding-sibling::p or self::p][following-sibling::p or self::p]";

The XPath selects every node, whether an element or a text node, that has a preceding sibling p (or is itself a p element) and a following sibling p (or is itself a p element).

var nodes = doc.DocumentNode.SelectNodes(query);
foreach (var item in nodes)
{
    Console.WriteLine(item.OuterHtml);
}

output:

<p>firt paragraph</p>
some
<br />
text
<p>another paragraph</p>
<span>some text between span</span>
<p>hellow word</p>
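
The output above prints each node on its own line, whereas the question wanted the content between two paragraphs as a single string. A minimal sketch of one way to get that, assuming every <p> of interest is a top-level child of the document: walk DocumentNode.ChildNodes and merge each run of non-<p> nodes. ParagraphSplitter and SplitAtParagraphs are hypothetical names introduced here, not part of HtmlAgilityPack. Unlike the XPath query, this sketch also keeps any content before the first <p> or after the last one.

```csharp
using System.Collections.Generic;
using System.Text;
using HtmlAgilityPack;

public static class ParagraphSplitter
{
    // Hypothetical helper: returns each top-level <p> as its own entry and
    // merges every run of non-<p> siblings into a single entry.
    public static List<string> SplitAtParagraphs(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var parts = new List<string>();
        var between = new StringBuilder();

        foreach (HtmlNode node in doc.DocumentNode.ChildNodes)
        {
            if (node.NodeType == HtmlNodeType.Element && node.Name == "p")
            {
                // Flush whatever was collected since the previous <p>.
                if (between.Length > 0)
                {
                    parts.Add(between.ToString());
                    between.Clear();
                }
                parts.Add(node.OuterHtml);
            }
            else
            {
                // Text nodes and other elements are appended as-is.
                between.Append(node.OuterHtml);
            }
        }

        // Keep trailing content that follows the last <p>, if any.
        if (between.Length > 0)
        {
            parts.Add(between.ToString());
        }
        return parts;
    }
}
```

Calling ParagraphSplitter.SplitAtParagraphs(s) with the string from the question should give five entries: the three paragraphs, the "some ... text" run, and the span. Because every entry comes from the parsed nodes' own OuterHtml, no offsets from StreamPosition are involved, so the mismatch described in the question does not arise. Whether the br is written as <br> or <br /> depends on how HtmlAgilityPack serializes the node.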
5/19/2016 11:18:23 AM


Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow