XPath giving different results in browser and HtmlAgilityPack

c# html-agility-pack xpath

Question

I am attempting to parse a section of a webpage using HtmlAgilityPack in a C# program. Below is a simplified version of this section of the page (edited 1/30/2015 2:40PM EST):

<html>
    <body>
        <div id="main-box">
            <div>
                <div>...</div>
                <div>

                    <div class="other-classes row-box">
                        <div>...</div>
                        <div>...</div>
                        <div>
                            <p>
                                <a href="/some/other/path">
                                    <img src="/path/to/img" />
                                </a>
                            </p>
                            <p>
                                ...
                                <a href="/test/path?a=123">Correct</a> extra text
                            </p>
                        </div>
                        <div>
                            ...
                            <p>
                                <ul>
                                    ...
                                    <li>
                                        <span>
                                            <a href="/test/path?a=456&b=123">Never Selected</a>
                                            and <a href="/test/path?a=789">Never Selected</a>.
                                        </span>
                                    </li>
                                </ul>
                            </p>
                        </div>
                        ...
                    </div>

                    <div class="other-classes row-box">
                        <div>...</div>
                        <div>...</div>
                        <div>
                            <p>
                                No "a" tag this time
                            </p>
                        </div>
                        <div>
                            <p>
                                <ul>
                                    <li>
                                        <span>
                                            <span style="display:none;">
                                                <a href="/some/other/path">Never Selected</a>
                                            </span>
                                        </span>
                                    </li>
                                    <li>
                                        <span>
                                            <a href="/test/path?a=abc&b=123">Correct</a>
                                            and <a href="/test/path?a=def">Wrongly Selected</a>.
                                        </span>
                                    </li>
                                </ul>
                            </p>
                        </div>
                        ...
                    </div>

                    <div class="other-classes row-box">
                        <div>...</div>
                        <div>...</div>
                        <div>
                            <p>
                                <span>
                                    <a href="/test/path?a=ghi">Correct</a>
                                </span>
                            </p>
                            <p>
                                ...
                                <a href="/test/path?a=jkl">Wrongly Selected</a> extra text
                            </p>
                        </div>
                        <div>
                            <p>
                                <ul>
                                    ...
                                    <li>
                                        <span>
                                            <a href="/test/path?a=mno&b=123">Never Selected</a>
                                            and <a href="/test/path?a=pqr">Never Selected</a>.
                                        </span>
                                    </li>
                                </ul>
                            </p>
                        </div>
                        ...
                    </div>

                </div>
            </div>
        </div>
    </body>
</html>

I am attempting to get the first and only the first "a" tag with the GET parameter "a" in the 3rd or 4th child div of each div with the class "row-box" (the ones with the the word "Correct" in them in the above example). I came up with the following XPath that gets these nodes and only these nodes in both Chrome's inspector and the Firepath add-on for Firefox (wrapped for legibility):

//div[@id="main-box"]/div/div[2]/div[contains(@class, "row-box")]/div[
  (position() = 3 or position() = 4) and descendant::a[
    contains(@href, "a=")
  ]
][1]/descendant::a[contains(@href, "a=")][1]

However, when I load this page using HttpWebRequest, load the response stream into an HtmlDocument object, and call SelectNodes(xpath) on its DocumentNode property using this XPath, it returns not only the three correct nodes, but also the two tags with the text "Wrongly Selected" in the example above. I noticed that this is effectively the same as if I were to use the XPath above, except without the last "[1]", like this (wrapped for legibility):

//div[@id="main-box"]/div/div[2]/div[contains(@class, "row-box")]/div[
  (position() = 3 or position() = 4) and descendant::a[
    contains(@href, "a=")
  ]
][1]/descendant::a[contains(@href, "a=")]

I have made sure that I am using the latest version of HtmlAgilityPack, attempted several variations on my XPath to determine if maybe it was hitting some arbitrary maximum length or other simple issues like that, and tried to research similar issues without success. I tried throwing together an even simpler HTML structure using the same basic concept to test, but couldn't reproduce the issue with that, so I suspect that it may be some subtle issue with how HtmlAgilityPack parses something in this structure.

If anyone knows what might cause this issue, or has a better way to write an XPath expression that will get the correct nodes and hopefully not cause issues in HtmlAgilityPack, I would be greatly appreciative.

EDIT

As suggested, here is a simplified version of the C# code I'm using, which I have confirmed does reproduce the problem for me.

using System;
using System.Net;
using HtmlAgilityPack;

...

static void Main(string[] args)
{
    string url = "http://www.deerso.com/test.html";
    string xpath = "//div[@id=\"main-box\"]/div/div[2]/div[contains(@class, \"row-box\")]/div[(position() = 3 or position() = 4) and descendant::a[contains(@href, \"a=\")]][1]/descendant::a[contains(@href, \"a=\")][1]";
    int statusCode;
    string htmlText;

    HttpWebRequest request = (HttpWebRequest)HttpWebRequest.Create(url);

    request.Accept = "text/html,*/*";
    request.Proxy = new WebProxy();
    request.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:35.0) Gecko/20100101 Firefox/35.0";

    using (var response = (WebResponse)request.GetResponse())
    {
        statusCode = (int)((HttpWebResponse)response).StatusCode;
        using (var stream = response.GetResponseStream())
        {
            if (stream != null)
            {
                using (var reader = new System.IO.StreamReader(stream))
                {
                    htmlText = reader.ReadToEnd();
                }
            }
            else
            {
                Console.WriteLine("Request to '{0}' failed, response stream was null", url);
                htmlText = null;
                return;
            }
        }
    }

    HtmlNode.ElementsFlags.Remove("form"); //fix for forms
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(htmlText);

    HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);

    foreach (HtmlNode node in nodes)
    {
        Console.WriteLine("Node Found:");
        Console.WriteLine("Text: {0}", node.InnerText);
        Console.WriteLine("Href: {0}", node.Attributes["href"].Value);
        Console.WriteLine();
    }

    Console.WriteLine("Done!");
}

Popular Answer

New answer based on updated Html

We can't use the //a[contains(@href,'a=')][1] filter since that is selecting the first <a> element from its direct parent.

We need to add brackets to include the descendant operator in the filter, i.e.

(//a[contains(@href,'a=')])[1]

However, if we expand that to apply the first descendant filter to each node in another nodeset, the resultant xpath expression is invalid:

//div[contains(@class,'row-box')](//a[contains(@href,'a=')])[1]

I think we need to break it into two steps:

  1. Get the group of div elements containing the particular link we want.
  2. Get the first descendant link element from each element in that group

In C# this looks like:

// Get the <div> elements we know are ancestors to the <a> elements we want
HtmlNodeCollection topDivs = doc.DocumentNode.SelectNodes("//a[contains(@href,'?a=')]/ancestor::div[contains(@class,'row-box')]");

// Create a new list to hold the <a> elements
List<HtmlNode> linksWeWant = new List<HtmlNode>(topDivs.Count)

// Iterate through the <div> elements and get the first descendant
foreach(var div in topDivs)
{
    linksWeWant.Add(div.SelectSingleNode("(//a[contains(@href,'?a=')])[1]"));
}

Old Answer

Using this page as a guide I put together the xpath expression:

When I run it in HtmlAgilityPack I'm getting only these three elements returned:

<a href = "/test/path?a=123">
<a href = "/test/path?a=abc&b=123">
<a href = "/test/path?a=ghi">

Here's a breakdown of the expression:

//div[contains(@class,'row-box')]        -> Get nodeset of <div class="*row-box*"> elements
/descendant::a                           -> From here get all descendant <a> elements
[contains(@href,'a=') and position()=1]  -> Filter according to href value and element being the first descendant

I believe the key difference to the xpath in your question is /descendant::a[contains(@href,'a=') and position()=1] vs /descendant::a[contains(@href,'a=')][1]. Applying the [1] separately is filtering as the first child instead of the first descendant.




Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow
Is this KB legal? Yes, learn why
Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow
Is this KB legal? Yes, learn why