I am scraping some data using HtmlAgilityPack.
The HTML looks like this:
<div id="id-here">
<dl>
<dt> Field Name </dt>
<dd> Value for above field name </dd>
<dt> Field Name </dt>
<dd> Value for above field name </dd>
<dt> Field Name </dt>
<dd> Value for above field name </dd>
</dl>
</div>
The problem is that the number of fields is not fixed, so I can't reliably access each of them by position like this:
//*[@id="id-here"]/dl[1]/dd[1]
because dd[1] may be a name on one page and a telephone number on another, where the user did not fill out a name and the field is hidden.
So I grab all the dt and dd nodes like so:
//*[@id="id-here"]/dl[1]/dt | //*[@id="id-here"]/dl[1]/dd
Then I check each node to see whether it matches the field I want and take the NextSibling value like so:
foreach (HtmlNode node in details)
{
if (node.InnerText.Contains("Tel:")) telephone = node.NextSibling.InnerText;
if (node.InnerText.Contains("Email:")) email = node.NextSibling.InnerText;
}
This works fine for telephone, but for some reason when the "Email:" node comes up, both NextSibling.InnerHtml and NextSibling.InnerText are blank, although the next sibling definitely has the data. If I go to that node in details and look at it, its InnerHtml is the entire formatted link and its InnerText is the email address.
Is NextSibling.InnerText not working because the A tag is making it a child or something? I have had a look in the debugger and just can't find the information I need under NextSibling.
I am sure the answer is ridiculously simple; I just can't figure it out. Can anyone put me out of my misery?
The reason this is happening is that if node is a dt element that is separated from its corresponding dd element by some whitespace, then node.NextSibling is an all-whitespace text node (the space between the </dt> and the <dd>). If you look at it in the debugger, you will see that the NodeType of node.NextSibling is HtmlNodeType.Text and not HtmlNodeType.Element.
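To make the whitespace text node visible, and to show one way around it without XPath, here is a minimal sketch that walks NextSibling until it reaches an element. It assumes the HtmlAgilityPack package is referenced. The helper name NextElementSibling and the sample markup (including the email address) are made up for illustration.

```csharp
using System;
using HtmlAgilityPack;

internal static class SiblingDemo
{
    // Hypothetical helper: skips text and comment nodes and returns
    // the next sibling that is an element, or null if there is none.
    internal static HtmlNode NextElementSibling(HtmlNode node)
    {
        var sibling = node.NextSibling;
        while (sibling != null && sibling.NodeType != HtmlNodeType.Element)
        {
            sibling = sibling.NextSibling;
        }
        return sibling;
    }

    private static void Main()
    {
        // Made-up sample markup with a line break between </dt> and <dd>.
        var doc = new HtmlDocument();
        doc.LoadHtml("<dl>\n<dt>Email:</dt>\n<dd><a href='mailto:a@example.com'>a@example.com</a></dd>\n</dl>");

        var dt = doc.DocumentNode.SelectSingleNode("//dt");

        // The immediate sibling is the whitespace between the tags.
        Console.WriteLine(dt.NextSibling.NodeType);

        // The first element sibling is the dd that holds the value.
        var dd = NextElementSibling(dt);
        Console.WriteLine(dd == null ? "" : dd.InnerText);
    }
}
```

Note that this sketch returns whatever element comes next; it does not check that the element is actually a dd.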
I suggest creating a convenience method to get the text of a dt node's corresponding dd:
internal static string GetMatchingDdValue(HtmlNode dtNode)
{
var found = dtNode.SelectSingleNode("following-sibling::*[1][self::dd]");
return found == null ? "" : found.InnerText;
}
Then you can use it like this:
if (node.InnerText.Contains("Tel:")) { telephone = GetMatchingDdValue(node); }
Here's a breakdown of the somewhat tricky XPath used in my method above:
(a) following-sibling::*
^ Select all elements that share the same parent as the current node and occur after it.
(b) following-sibling::*[1]
^ Select the first node in set (a) (if there are any)
(c) following-sibling::*[1][self::dd]
^ Select all nodes in set (b) that are elements with the name "dd"
SelectSingleNode() selects the first node in set (c), which always contains either 0 or 1 nodes.
You could most likely get by with just following-sibling::dd or following-sibling::*, but the above path contains safeguards. For example, if for some reason you had the following XML and your current node was the Tel: element:
<dl>
<dt>Tel:</dt>
<dt>Address:</dt>
<dd>50 Fake St.</dd>
</dl>
following-sibling::dd would give you the result "50 Fake St.", while following-sibling::* would give you the result "Address:". Instead, following-sibling::*[1][self::dd] would select an empty nodeset in this case, so the method would correctly produce an empty string as the result.
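The comparison above can be checked with a small sketch. Because the sample is well-formed XML, this version uses System.Xml from the .NET base library instead of HtmlAgilityPack, on the assumption that the XPath behaves the same way for these expressions. The class and method names are hypothetical.

```csharp
using System;
using System.Xml;

internal static class XPathSafeguardDemo
{
    // Returns the text of the first node matched by xpath relative to
    // context, or an empty string when nothing matches.
    internal static string FirstMatchText(XmlNode context, string xpath)
    {
        var found = context.SelectSingleNode(xpath);
        return found == null ? "" : found.InnerText;
    }

    private static void Main()
    {
        var doc = new XmlDocument();
        doc.LoadXml("<dl><dt>Tel:</dt><dt>Address:</dt><dd>50 Fake St.</dd></dl>");

        // The current node is the Tel: element, which has no dd of its own.
        var tel = doc.SelectSingleNode("/dl/dt[1]");

        var paths = new[]
        {
            "following-sibling::dd",            // "50 Fake St."
            "following-sibling::*",             // "Address:"
            "following-sibling::*[1][self::dd]" // ""
        };

        foreach (var xpath in paths)
        {
            Console.WriteLine("{0} -> \"{1}\"", xpath, FirstMatchText(tel, xpath));
        }
    }
}
```

Only the last expression refuses to pair Tel: with a value that belongs to another field.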
var html = @"
<div id='id-here'>
<dl>
<dt> Field Name </dt>
<dd> Value for above field name </dd>
<dt> Field Name </dt>
<dd> Value for above field name </dd>
<dt> Field Name </dt>
<dd> Value for above field name </dd>
</dl>
</div>";
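// Strip whitespace between tags so that NextSibling lands on the element itself.
// \r?\n matches both CRLF and LF line endings (requires System.Text.RegularExpressions).
html = new Regex(">\r?\n\\s*<").Replace(html, "><");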
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
Console.Write(doc.DocumentNode.SelectNodes("//dt")[0].NextSibling.OuterHtml);
<dd> Value for above field name </dd>