HtmlAgilityPack NextSibling.InnerText value is blank

c# html-agility-pack siblings xpath

Question

I am scraping some data using HtmlAgilityPack.

The HTML looks like this:

<div id="id-here">
  <dl>
    <dt> Field Name </dt>
    <dd> Value for above field name </dd>
    <dt> Field Name </dt>
    <dd> Value for above field name </dd>
    <dt> Field Name </dt>
    <dd> Value for above field name </dd>
  </dl>
</div>

Now the problem I have is that there is not always a set number of fields, so I can't reliably access each of them like:

//*[@id="id-here"]/dl[1]/dd[1]

as dd[1] may be a name on one page and a telephone number on another, where the user failed to fill out a name so the field is hidden.

So I grab all the dt and dd nodes like so:

//*[@id="id-here"]/dl[1]/dt | //*[@id="id-here"]/dl[1]/dd

Now I check each node to see if it matches the field I want and take the NextSibling value like so:

    foreach (HtmlNode node in details)
    {
        if (node.InnerText.Contains("Tel:")) telephone = node.NextSibling.InnerText;
        if (node.InnerText.Contains("Email:")) email = node.NextSibling.InnerText;
    }

This works fine for telephone, but for some reason when the "Email:" node comes up, both NextSibling.InnerHtml and NextSibling.InnerText are blank, although the next sibling definitely has the data. If I actually go to that node in details and look at it, the InnerHtml is the entire formatted link and the InnerText is the email address.

Is NextSibling.InnerText not working because the a tag is making it a child or something? I have had a look in the debugger and just can't find the information I need under NextSibling.

I am sure the answer is ridiculously simple; I just can't figure it out. Can anyone put me out of my misery?

Accepted Answer

The reason this is happening is that if node is a dt element that is separated from its corresponding dd element by some whitespace, then node.NextSibling is an all-whitespace text node (the space between the </dt> and the <dd>). If you look at it in the debugger, you will see that node.NextSibling's NodeType is HtmlNodeType.Text and not HtmlNodeType.Element.
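To see the effect in isolation, here is a minimal sketch that parses two small fragments, one with whitespace between the tags and one without, and reports the node type of the first dt's NextSibling. The class name, method name, and sample markup are made up for illustration.

```csharp
using System;
using HtmlAgilityPack;

public static class SiblingDemo
{
    // Hypothetical helper: returns the node type of whatever directly follows the first <dt>.
    public static HtmlNodeType NextSiblingTypeOfFirstDt(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        HtmlNode dt = doc.DocumentNode.SelectSingleNode("//dt");
        return dt.NextSibling.NodeType;
    }

    public static void Main()
    {
        // Whitespace between </dt> and <dd> is parsed as a text node.
        Console.WriteLine(NextSiblingTypeOfFirstDt("<dl><dt>Tel:</dt>\n  <dd>555-0100</dd></dl>")); // Text

        // With no whitespace in between, the <dd> element itself is the next sibling.
        Console.WriteLine(NextSiblingTypeOfFirstDt("<dl><dt>Tel:</dt><dd>555-0100</dd></dl>")); // Element
    }
}
```

This would also explain why one field can work while another fails on the same page: it depends only on whether the source happens to have whitespace between that particular dt and dd.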

I suggest creating a convenience method to get the text of a dt node's corresponding dd:

internal static string GetMatchingDdValue(HtmlNode dtNode)
{
    var found = dtNode.SelectSingleNode("following-sibling::*[1][self::dd]");
    return found == null ? "" : found.InnerText;
}

Then you can use it like this:

if (node.InnerText.Contains("Tel:")) { telephone = GetMatchingDdValue(node); }

Here's a breakdown of the somewhat tricky XPath used in my method above:

(a) following-sibling::*

^ Select all elements that share the same parent as the current node and occur after it.

(b) following-sibling::*[1]

^ Select the first node in set (a) (if there are any)

(c) following-sibling::*[1][self::dd] 

^ Select all nodes in set (b) that are elements with the name "dd"

SelectSingleNode() selects the first node in set (c), which always contains either 0 or 1 nodes.

You could most likely get by with just following-sibling::dd or following-sibling::*, but the above path contains safeguards. For example, if, for some reason, you had the following HTML and your current node was the Tel: element:

<dl>
  <dt>Tel:</dt>
  <dt>Address:</dt>
  <dd>50 Fake St.</dd>
</dl>

With SelectSingleNode(), following-sibling::dd would give you the result "50 Fake St.", while following-sibling::* would give you the result "Address:". By contrast, following-sibling::*[1][self::dd] selects an empty node set in this case, so the method correctly produces an empty string as the result.
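Putting the pieces together, here is a sketch of how the guarded XPath could be used to read every field into a dictionary, so the caller looks values up by label instead of by position. The class name, method name, and containerId parameter are assumptions for illustration; the id is concatenated into the XPath, so this assumes it contains no quote characters.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class DefinitionListReader
{
    // Hypothetical helper: maps each <dt> label to the text of the <dd> that immediately follows it.
    public static Dictionary<string, string> ReadFields(string html, string containerId)
    {
        var fields = new Dictionary<string, string>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var dtNodes = doc.DocumentNode.SelectNodes("//*[@id='" + containerId + "']/dl[1]/dt");
        if (dtNodes == null) return fields;

        foreach (HtmlNode dt in dtNodes)
        {
            // Same guarded path as above: the first following element, and only if it is a <dd>.
            HtmlNode dd = dt.SelectSingleNode("following-sibling::*[1][self::dd]");
            string label = HtmlEntity.DeEntitize(dt.InnerText).Trim();
            string value = dd == null ? "" : HtmlEntity.DeEntitize(dd.InnerText).Trim();
            fields[label] = value; // a repeated label overwrites the earlier value
        }
        return fields;
    }
}
```

A caller could then write something like `fields.TryGetValue("Email:", out var email)`, which works regardless of how many fields the page shows.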


Popular Answer

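using System;
using System.Text.RegularExpressions;

var html = @"
<div id='id-here'>
  <dl>
    <dt> Field Name </dt>
    <dd> Value for above field name </dd>
    <dt> Field Name </dt>
    <dd> Value for above field name </dd>
    <dt> Field Name </dt>
    <dd> Value for above field name </dd>
  </dl>
</div>";
// Remove whitespace-only runs between tags so that no whitespace text node
// sits between a <dt> and its <dd>. \s+ covers both \r\n and \n line endings.
html = new Regex(@">\s+<").Replace(html, "><");
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
// With the whitespace gone, NextSibling of the first <dt> is the <dd> element itself.
Console.Write(doc.DocumentNode.SelectNodes("//dt")[0].NextSibling.OuterHtml);

Output:

<dd> Value for above field name </dd>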
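If you would rather not rewrite the markup before parsing, one alternative is to leave the HTML untouched and skip over non-element siblings in code. The following is only a sketch; the class and method names are made up for illustration.

```csharp
using System;
using HtmlAgilityPack;

public static class SiblingHelpers
{
    // Hypothetical helper: returns the first following sibling that is an element,
    // skipping whitespace text nodes and comments; returns null if there is none.
    public static HtmlNode NextElementSibling(HtmlNode node)
    {
        HtmlNode current = node.NextSibling;
        while (current != null && current.NodeType != HtmlNodeType.Element)
        {
            current = current.NextSibling;
        }
        return current;
    }
}
```

Note that this returns whatever element comes next, so if a dt might be followed directly by another dt, the caller should also check that the returned node's Name is "dd" before reading its InnerText.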



Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow