HTML Parsing C# HTMLAgilityPack

c# html-agility-pack xpath

Question

I am having a problem reading some values from a HTML string using the HTMLAgilityPack.

The Two Items i want to read are Newspaper : 82548828 and Fish : 8545852485

But using the code i have wrote so far i can only ever get back the Newspaper item.

I suspect the XPATH i am using is not fully correct, i think the XPATH for the first loop is corrrect as this gives me back the two

I want my second loop to loop over these two items (it thinks there are 6???)

Also is div2.SelectSingleNode(sXPathT); the correct way to extract the groupLabel? or is there a better way?

Thanks

Full Test Code Below

string strTestHTML = @"<div class=\""content\"" data-id=\""123456789\"">" + 
                              "  <div class=\"m-group item\">" +
                              "      <span class=\"group\">" +
                              "          <a href=\"javascript:void(0);\">" +
                              "          <span class=\"group-label\">Newspaper </span>" +
                              "          <span class=\"group-value\">82548828</span>" +
                              "          </a>" +
                              "      </span>" +
                              "      <span class=\"group\">" +
                              "          <a href=\"javascript:void(0);\">" +
                              "          <span class=\"group-label\">Fish </span>" +
                              "          <span class=\"group-value\">8545852485</span>" +
                              "          </a>" +
                              "      </span>" +
                              "  </div>" +
                              "</div>";


            //<div class="content" data-id="123456789">
            string sNewXpath = "//div[contains(@class,'content') and contains(@data-id, '" + "123456789" + "')]";
            //<div class="m-group item">
            string sSecondXPath = "/div[contains(@class,'m-group item')]";
            //<span class="group"
            string sThirdXPath = "//span[contains(@class,'group')]";

            string sXPathT = "//span[contains(@class,'group-label')]";
            string sXPathO = "//span[contains(@class,'group-value')]";

            HtmlAgilityPack.HtmlDocument Doc = new HtmlDocument();
            Doc.LoadHtml(strTestHTML);

            foreach (HtmlNode div in Doc.DocumentNode.SelectNodes(sNewXpath + sSecondXPath))
            {
                foreach (HtmlNode div2 in div.SelectNodes(sThirdXPath))
                {
                    var vOddL = div2.SelectSingleNode(sXPathT);
                    var vOddP = div2.SelectSingleNode(sXPathO);

                    string GroupLabel = vOddL.InnerText.Trim();

                    string GroupValue = vOddP.InnerText.Trim();
                }
            }

EDIT:

Worked out why i was getting 6 items back in the forloop

sThirdXPath was : string sThirdXPath = "//span[contains(@class,'group')]";

should be:

string sThirdXPath = "//span[@class='group']";

Still trying to find the right way to interrogate the HTMLNode contained in div2 to find the values of interest. I assume it needs XPath to match iinside the current node, not HTML document wide.

Updated HTML Sample:

<div class="content" data-id="123456789">
<div class="m-group item">
    <span class="group">
        <a href="javascript:void(0);">
        <span class="group-label">Newspaper </span>
        <span class="group-value">82548828</span>
        </a>
    </span>

    <span class="group">
        <a href="javascript:void(0);">
        <span class="group-label">Fish </span>
        <span class="group-value">8545852485</span>
        </a>
    </span>
</div>
</div>

<div class="content" data-id="987654321">
<div class="m-group item">
    <span class="group">
        <a href="javascript:void(0);">
        <span class="group-label">Bread</span>
        <span class="group-value">82548828</span>
        </a>
    </span>

    <span class="group">
        <a href="javascript:void(0);">
        <span class="group-label">Milk </span>
        <span class="group-value">8545852485</span>
        </a>
    </span>
</div>
</div>

In the above example what is the correct XPATH to access Just Bread and Its Value and Milk and its Value. I assume i need to filter on data-id="987654321 in the XPath?

Accepted Answer

Your suspicion is correct, you already specified the XPath queries for the full path so you don't need a loop. To get "Newspaper" and "Fish" nodes in this example you can simply use SelectNodes instead of looping and calling SelectSingleNode. If there are more items you can loop through the result set of course, I accessed them by index in this example as there are only two of them.

string sXPathT = "//span[contains(@class,'group-label')]";
string sXPathO = "//span[contains(@class,'group-value')]";

HtmlAgilityPack.HtmlDocument Doc = new HtmlDocument();
Doc.LoadHtml(strTestHTML);

var vOddL = Doc.DocumentNode.SelectNodes(sXPathT);
var vOddP = Doc.DocumentNode.SelectNodes(sXPathO);

string GroupLabelNewsPaper = vOddL.ElementAt(0).InnerText.Trim();
string GroupLabelFish = vOddL.ElementAt(1).InnerText.Trim();

string GroupValueNewspaper = vOddP.ElementAt(0).InnerText.Trim();
string GroupValueFish = vOddP.ElementAt(1).InnerText.Trim();

Console.WriteLine($"{GroupLabelNewsPaper}\t{GroupValueNewspaper}");
Console.WriteLine($"{GroupLabelFish}\t{GroupValueFish}");

Output:

Newspaper       82548828
Fish    8545852485

UPDATE: To get a specific content node you can use this XPath:

string xpathForDataId = "//div[@class='content' and @data-id='987654321']";

You can filter the divs with the above expression then get the child nodes of this like this:

string sXPathT = ".//span[contains(@class,'group-label')]";
string sXPathO = ".//span[contains(@class,'group-value')]";
string xpathForDataId = "//div[@class='content' and @data-id='987654321']";

HtmlAgilityPack.HtmlDocument Doc = new HtmlDocument();
Doc.LoadHtml(strTestHTML);

var specificNode = Doc.DocumentNode.SelectSingleNode(xpathForDataId);

var vOddL = specificNode.SelectNodes(sXPathT);
var vOddP = specificNode.SelectNodes(sXPathO);

string GroupLabelBread = vOddL.ElementAt(0).InnerText.Trim();
string GroupLabelMilk = vOddL.ElementAt(1).InnerText.Trim();

string GroupValueBread = vOddP.ElementAt(0).InnerText.Trim();
string GroupValueMilk = vOddP.ElementAt(1).InnerText.Trim();

Console.WriteLine($"{GroupLabelBread}\t{GroupValueBread}");
Console.WriteLine($"{GroupLabelMilk}\t{GroupValueMilk}");

Notice the ".//" in the sXPathT and sXPathO. By that we search the current context only and not the whole document.

Output:

Bread   82548828
Milk    8545852485



Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow
Is this KB legal? Yes, learn why
Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow
Is this KB legal? Yes, learn why