I need to extract all the paragraphs from an HTML string, and also all the text between those tags.
This code does not work when the text parsed into HtmlDocument differs from the original. In the sample,
some <br />text
is changed to
some <br>text
Example:
string s = "<p>firt paragraph</p>some <br />text<p>another paragraph</p><span>some text between span</span><p>hellow word</p>";
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(s);
var nodes = doc.DocumentNode.SelectNodes("//p");
int lastPos = -1;
foreach (HtmlAgilityPack.HtmlNode n in nodes)
{
    if (lastPos > -1)
    {
        // Slice the text between the end of the previous <p> and the start of this one
        string textNotInP = doc.DocumentNode.OuterHtml.Substring(lastPos, n.StreamPosition - lastPos);
        System.Diagnostics.Debug.WriteLine(textNotInP);
    }
    System.Diagnostics.Debug.WriteLine(n.OuterHtml);
    lastPos = n.StreamPosition + n.OuterHtml.Length;
}
The correct result would be:
<p>firt paragraph</p>
some <br>text
<p>another paragraph</p>
<span>some text between span</span>
<p>hellow word</p>
But the code above returns this:
<p>firt paragraph</p>
some <br>text<p
<p>another paragraph</p>
pan>some text between span</span><p
<p>hellow word</p>
The reason is that StreamPosition returns the node's position relative to the original text, not relative to the text as parsed into the HtmlDocument.
Is there a way to get the position of a node relative to the parsed HTML?
You can use the OuterHtml property of each <p> element to get the desired HTML:
string s = "<p>firt paragraph</p>some <br />text<p>another paragraph</p><span>some text between span</span><p>hellow word</p>";
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(s);
var nodes = doc.DocumentNode.SelectNodes("//p");
foreach (var item in nodes)
{
    Console.WriteLine(item.OuterHtml);
}
Output:
<p>firt paragraph</p>
<p>another paragraph</p>
<p>hellow word</p>
Or, if you mean to get everything between the first <p> and the last <p> elements, inclusive, you can use the following XPath:
var query = "//node()[preceding-sibling::p or self::p][following-sibling::p or self::p]";
The XPath grabs all nodes (element or text nodes) that either are a p element themselves, or have both a preceding sibling p and a following sibling p.
var nodes = doc.DocumentNode.SelectNodes(query);
foreach (var item in nodes)
{
    Console.WriteLine(item.OuterHtml);
}
Output:
<p>firt paragraph</p>
some
<br />
text
<p>another paragraph</p>
<span>some text between span</span>
<p>hellow word</p>
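If the in-between content should come back as one chunk per gap (as in the question's expected result) rather than one node per line, a possible alternative is to skip stream positions entirely and walk the parsed nodes in order. The following is only a minimal sketch, assuming that every <p> is a direct child of the document root; the class name ParagraphSplitter and the method Split are hypothetical names made up for this illustration. Unlike the XPath above, it also keeps any content before the first <p> or after the last one.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using HtmlAgilityPack;

static class ParagraphSplitter
{
    // Hypothetical helper: returns each top-level <p> as one entry and each
    // run of non-<p> nodes between paragraphs as one combined entry.
    // Assumption: the paragraphs are direct children of the document root.
    public static List<string> Split(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var chunks = new List<string>();
        var between = new StringBuilder();

        foreach (HtmlNode node in doc.DocumentNode.ChildNodes)
        {
            if (node.Name == "p")
            {
                // Flush whatever was collected since the previous paragraph
                if (between.Length > 0)
                {
                    chunks.Add(between.ToString());
                    between.Clear();
                }
                chunks.Add(node.OuterHtml);
            }
            else
            {
                // Text nodes, <br>, <span>, ... are appended as parsed
                between.Append(node.OuterHtml);
            }
        }

        // Keep trailing content after the last paragraph, if any
        if (between.Length > 0)
        {
            chunks.Add(between.ToString());
        }
        return chunks;
    }

    static void Main()
    {
        string s = "<p>firt paragraph</p>some <br />text<p>another paragraph</p><span>some text between span</span><p>hellow word</p>";
        foreach (string chunk in Split(s))
        {
            Console.WriteLine(chunk);
        }
    }
}
```

Because every chunk is built from the OuterHtml of parsed nodes, it does not matter whether the parser rewrites <br /> as <br>; the offsets into the original string are never used.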