How can I get the html between 2 surrounding html elements using htmlagilitypack?

asp.net c# html-agility-pack

Question

I have a need to retrieve the html elements that are contained within 2 other html elements using htmlagilitypack with C#.

As an example, I have the following:

<div id="div1" style="style definition here">
  <strong>
    <font face="Verdana" size="2">Your search request retrieved 0 matches.</font>
  </strong>
  <font face="Verdana" size="2">Some more text here.</font>
  <br><br>
  <!--more html here-->
</div>

I want to return everything between

<div id="div1">

and the first

<br>

without returning either of those elements.

I can't get my head around the syntax required for this, so I would really appreciate it if somebody could explain the best way to get the html that sits between 2 known start tags while ignoring the end tags.

I should also mention that I need to first find the div with the id of div1 within the surrounding html of a complete web page.

I don't need the actual nodes to have reference equality with the nodes that came from a specific HtmlDocument, they just have to be the same content-wise.

Accepted Answer

When HtmlNode instances are returned, multiple calls for the same node will produce the same reference. You can use this to your advantage (although it's an implementation detail, so be careful).

Basically, you'd get all the descendants that are elements up until the node in question. You select the node to start from:

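// The leading // searches the whole document, since the div sits somewhere
// inside a complete page rather than directly under the document root.
HtmlNode divNode = doc.DocumentNode.SelectSingleNode("//div[@id='div1']");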

The node you want to go up to:

// Note that in this case, working off the first node is not necessary, just
// convenient for this example.
HtmlNode brNode = divNode.SelectSingleNode("br");

And then use the TakeWhile extension method on the Enumerable class to take all the elements up until the second element, like so:

// The nodes (TakeWhile and Where need "using System.Linq;").
IEnumerable<HtmlNode> nodes = divNode.Descendants()
    .TakeWhile(n => n != brNode)
    .Where(n => n.NodeType == HtmlNodeType.Element);

It's the comparison in the TakeWhile method (n => n != brNode) that depends on reference comparison (that's the implementation detail part).

The last filter is to give you just element nodes, as that is what you'd typically get with calls to SelectSingleNode; if you want to process other node types, you can omit that.

Cycling through those nodes like this:

foreach (HtmlNode node in nodes)
{
    // Print.
    Console.WriteLine("Node: {0}", node.Name);
}  

Produces:

Node: strong
Node: font
Node: font
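
If what you need is the html itself as a string rather than the individual nodes, a minimal sketch along the same lines follows. It is an assumption on my part that this is what you are after, and it only works when the br is a direct child of the div, as in your example. It walks ChildNodes instead of Descendants so that the font nested inside strong is not emitted a second time when the OuterHtml values are joined. The sample markup and the variable names (html, between) are made up for illustration.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // Sample page built from the markup in the question (illustrative only).
        const string html = @"<html><body>
<div id=""div1"" style=""style definition here"">
  <strong>
    <font face=""Verdana"" size=""2"">Your search request retrieved 0 matches.</font>
  </strong>
  <font face=""Verdana"" size=""2"">Some more text here.</font>
  <br><br>
  <!--more html here-->
</div>
</body></html>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Find the div anywhere in the page, then its first direct-child br.
        HtmlNode divNode = doc.DocumentNode.SelectSingleNode("//div[@id='div1']");
        HtmlNode brNode = divNode?.SelectSingleNode("br");

        // SelectSingleNode returns null when nothing matches.
        if (divNode == null || brNode == null)
        {
            Console.WriteLine("div1 or its br was not found.");
            return;
        }

        // Take the div's direct children up to (not including) the br and
        // join their markup. This relies on the same reference comparison
        // as the TakeWhile call above.
        string between = string.Concat(
            divNode.ChildNodes
                .TakeWhile(n => n != brNode)
                .Select(n => n.OuterHtml));

        Console.WriteLine(between.Trim());
    }
}
```

Text and whitespace nodes are kept here on purpose so the markup comes back as it was written; add the NodeType filter from above if you only want elements.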



Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow