I need to obtain HTML text nodes from, let's say, line 64, line position 45 to line 183, line position 22. I'm pretty new to XPath and I'm not quite sure what my options are. How should I proceed? I had something like this in mind:
var nodes=doc.DocumentNode.SelectNodes("//text()[position() > startPosition and position() < endPosition]");
The HtmlNode class has two properties that matter for what you need to do:
Line (the line where the node begins)
LinePosition (the position within that line where the node begins)
You could do something like:
var nodes = doc.DocumentNode.Descendants("#text").Where(x =>
    (x.Line > 64 || (x.Line == 64 && x.LinePosition >= 45)) &&
    (x.Line < 183 || (x.Line == 183 && x.LinePosition <= 22)));
of course, you can also do
One problem you'll have to deal with: HtmlNode doesn't tell you where a node ends, so the above solution might give you nodes that end on a line greater than 183, or on line 183 but at a position greater than 22. For that, you can use the OuterHtml property of the node and do some string manipulation (get its length to know where it ends, split by \n to know how many lines it spans, etc.).
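The string manipulation described above could look roughly like the sketch below. NodeSpan and EndOf are hypothetical names, not part of HtmlAgilityPack. The sketch assumes that OuterHtml reproduces the source text verbatim and that columns on later lines are counted from firstColumn (0 here, which may not match your parser's convention, so check it against a known node first). The returned position is the one just past the node's last character.

```csharp
using System;

// Hypothetical helper, not part of HtmlAgilityPack.
public static class NodeSpan
{
    // Given where a node starts and its raw text, returns the (line, column)
    // just past its last character. firstColumn is the column number assigned
    // to the first character of a line (assumed 0 here).
    public static (int Line, int Column) EndOf(
        int startLine, int startColumn, string outerHtml, int firstColumn = 0)
    {
        int newlines = 0;
        int lastNewline = -1;
        for (int i = 0; i < outerHtml.Length; i++)
        {
            if (outerHtml[i] == '\n')
            {
                newlines++;
                lastNewline = i;
            }
        }

        // The node stays on its starting line: just advance the column.
        if (newlines == 0)
        {
            return (startLine, startColumn + outerHtml.Length);
        }

        // The node spans several lines: count the characters after the last newline.
        int charsOnLastLine = outerHtml.Length - lastNewline - 1;
        return (startLine + newlines, firstColumn + charsOnLastLine);
    }
}
```

With a node x from the query above, NodeSpan.EndOf(x.Line, x.LinePosition, x.OuterHtml) gives an end position you can compare against line 183, position 22 to drop nodes that run past the range.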
You cannot do this with XPath: it does not know anything about line numbers and character positions within the XML.
The position() function returns the position of a node within the list of nodes currently being filtered, e.g. it returns 1 for the first node in the list, 2 for the second one, and so forth.
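To see that position() counts nodes rather than lines, here is a minimal sketch that uses the standard System.Xml classes instead of HtmlAgilityPack. PositionDemo and Select are hypothetical names introduced only for this illustration.

```csharp
using System.Collections.Generic;
using System.Xml;

// Hypothetical demo class, for illustration only.
public static class PositionDemo
{
    // Evaluates an XPath expression against an XML string and
    // returns the inner text of every node it selects.
    public static List<string> Select(string xml, string xpath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);

        var result = new List<string>();
        foreach (XmlNode node in doc.SelectNodes(xpath))
        {
            result.Add(node.InnerText);
        }
        return result;
    }
}
```

Calling PositionDemo.Select("<r>\n<a>one</a>\n\n\n<a>two</a><a>three</a></r>", "/r/a[position() > 1]") returns "two" and "three": the second and third a elements, no matter how many line breaks sit between them in the source.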
Note, though, that using line and character positions to identify fragments of an XML file is problematic: XML processors routinely reformat XML, adding or removing spaces and line breaks, so the same XML fragment can change position.