Parsing HTML in C# with HtmlAgilityPack

c# html-agility-pack xpath

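I'm having trouble reading certain values out of an HTML string with HtmlAgilityPack.

The two items I want to read are Newspaper: 82548828 and Fish: 8545852485

But with the code I've written so far I can only get the Newspaper item.

I suspect the XPath I'm using isn't quite right. I think the XPath for the first loop is correct, since it gets me as far as the two items.

I want my second loop to iterate over those two items (it thinks there are 6???)

Also, is div2.SelectSingleNode(sXPathT); the correct way to extract the group label? Or is there a better way?

Thanks

Full test code below: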

string strTestHTML = "<div class=\"content\" data-id=\"123456789\">" +
                              "  <div class=\"m-group item\">" +
                              "      <span class=\"group\">" +
                              "          <a href=\"javascript:void(0);\">" +
                              "          <span class=\"group-label\">Newspaper </span>" +
                              "          <span class=\"group-value\">82548828</span>" +
                              "          </a>" +
                              "      </span>" +
                              "      <span class=\"group\">" +
                              "          <a href=\"javascript:void(0);\">" +
                              "          <span class=\"group-label\">Fish </span>" +
                              "          <span class=\"group-value\">8545852485</span>" +
                              "          </a>" +
                              "      </span>" +
                              "  </div>" +
                              "</div>";


            //<div class="content" data-id="123456789">
            string sNewXpath = "//div[contains(@class,'content') and contains(@data-id, '" + "123456789" + "')]";
            //<div class="m-group item">
            string sSecondXPath = "/div[contains(@class,'m-group item')]";
            //<span class="group"
            string sThirdXPath = "//span[contains(@class,'group')]";

            string sXPathT = "//span[contains(@class,'group-label')]";
            string sXPathO = "//span[contains(@class,'group-value')]";

            HtmlAgilityPack.HtmlDocument Doc = new HtmlDocument();
            Doc.LoadHtml(strTestHTML);

            foreach (HtmlNode div in Doc.DocumentNode.SelectNodes(sNewXpath + sSecondXPath))
            {
                foreach (HtmlNode div2 in div.SelectNodes(sThirdXPath))
                {
                    var vOddL = div2.SelectSingleNode(sXPathT);
                    var vOddP = div2.SelectSingleNode(sXPathO);

                    string GroupLabel = vOddL.InnerText.Trim();

                    string GroupValue = vOddP.InnerText.Trim();
                }
            }

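Edit:

Figured out why I was getting 6 items in the foreach loop.

sThirdXPath was: string sThirdXPath = "//span[contains(@class,'group')]";

It should be:

string sThirdXPath = "//span[@class='group']";

I'm still trying to find the correct way to query the HtmlNode held in div2 for the values of interest. I assume it needs an XPath that matches inside the current node rather than across the whole HTML document.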
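
The count of 6 follows from contains(@class,'group') being a substring test: it also matches the classes group-label and group-value, so all six spans in the sample qualify. The following is a minimal sketch of that effect, and of the difference between "//" and ".//" when called on a node. It assumes the HtmlAgilityPack NuGet package is referenced; the class XPathCountDemo and the helper Count are hypothetical names introduced only for illustration.

```csharp
using System;
using HtmlAgilityPack;

public static class XPathCountDemo
{
    // Sample markup from the question, condensed onto fewer lines.
    public const string Html =
        "<div class=\"content\" data-id=\"123456789\">" +
        "<div class=\"m-group item\">" +
        "<span class=\"group\"><a href=\"javascript:void(0);\">" +
        "<span class=\"group-label\">Newspaper </span>" +
        "<span class=\"group-value\">82548828</span></a></span>" +
        "<span class=\"group\"><a href=\"javascript:void(0);\">" +
        "<span class=\"group-label\">Fish </span>" +
        "<span class=\"group-value\">8545852485</span></a></span>" +
        "</div></div>";

    // Hypothetical helper: number of nodes matching an XPath, 0 when none.
    // By default SelectNodes returns null (not an empty collection) on no match.
    public static int Count(HtmlNode context, string xpath)
    {
        HtmlNodeCollection nodes = context.SelectNodes(xpath);
        return nodes == null ? 0 : nodes.Count;
    }

    public static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(Html);
        HtmlNode root = doc.DocumentNode;

        // Substring match: also hits "group-label" and "group-value".
        Console.WriteLine(Count(root, "//span[contains(@class,'group')]"));
        // Exact attribute match: only the two outer spans.
        Console.WriteLine(Count(root, "//span[@class='group']"));

        // Called on a node, "//" still searches the whole document,
        // while ".//" searches only below that node.
        HtmlNode firstGroup = root.SelectSingleNode("//span[@class='group']");
        Console.WriteLine(Count(firstGroup, "//span[@class='group-label']"));
        Console.WriteLine(Count(firstGroup, ".//span[@class='group-label']"));
    }
}
```

With the sample markup this should print 6, 2, 2 and 1: the last two lines show that the same expression returns both labels with "//" but only the label inside the current span with ".//".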

Updated HTML sample:

<div class="content" data-id="123456789">
<div class="m-group item">
    <span class="group">
        <a href="javascript:void(0);">
        <span class="group-label">Newspaper </span>
        <span class="group-value">82548828</span>
        </a>
    </span>

    <span class="group">
        <a href="javascript:void(0);">
        <span class="group-label">Fish </span>
        <span class="group-value">8545852485</span>
        </a>
    </span>
</div>
</div>

<div class="content" data-id="987654321">
<div class="m-group item">
    <span class="group">
        <a href="javascript:void(0);">
        <span class="group-label">Bread</span>
        <span class="group-value">82548828</span>
        </a>
    </span>

    <span class="group">
        <a href="javascript:void(0);">
        <span class="group-label">Milk </span>
        <span class="group-value">8545852485</span>
        </a>
    </span>
</div>
</div>

In the example above, what is the correct XPath to access just Bread and its value, and Milk and its value? I assume I need to filter on data-id="987654321" in the XPath?

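Accepted answer

Your suspicion is correct. Your XPath queries start with "//", so each one already searches the whole document, and you don't need the nested loops. To get the "Newspaper" and "Fish" nodes in this example, you can simply call SelectNodes once per query instead of looping and calling SelectSingleNode. If there were more items you could loop over the result sets; in this example I access them by index because there are only two.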

// Assumes: using System; using System.Linq; (for ElementAt) using HtmlAgilityPack;
string sXPathT = "//span[contains(@class,'group-label')]";
string sXPathO = "//span[contains(@class,'group-value')]";

HtmlAgilityPack.HtmlDocument Doc = new HtmlDocument();
Doc.LoadHtml(strTestHTML);

var vOddL = Doc.DocumentNode.SelectNodes(sXPathT);
var vOddP = Doc.DocumentNode.SelectNodes(sXPathO);

string GroupLabelNewsPaper = vOddL.ElementAt(0).InnerText.Trim();
string GroupLabelFish = vOddL.ElementAt(1).InnerText.Trim();

string GroupValueNewspaper = vOddP.ElementAt(0).InnerText.Trim();
string GroupValueFish = vOddP.ElementAt(1).InnerText.Trim();

Console.WriteLine($"{GroupLabelNewsPaper}\t{GroupValueNewspaper}");
Console.WriteLine($"{GroupLabelFish}\t{GroupValueFish}");

Output:

Newspaper       82548828
Fish    8545852485

Update: To get a specific content node, you can use this XPath:

string xpathForDataId = "//div[@class='content' and @data-id='987654321']";

You can select the div with the expression above and then query the nodes beneath it, like this:

string sXPathT = ".//span[contains(@class,'group-label')]";
string sXPathO = ".//span[contains(@class,'group-value')]";
string xpathForDataId = "//div[@class='content' and @data-id='987654321']";

HtmlAgilityPack.HtmlDocument Doc = new HtmlDocument();
Doc.LoadHtml(strTestHTML);

var specificNode = Doc.DocumentNode.SelectSingleNode(xpathForDataId);

var vOddL = specificNode.SelectNodes(sXPathT);
var vOddP = specificNode.SelectNodes(sXPathO);

string GroupLabelBread = vOddL.ElementAt(0).InnerText.Trim();
string GroupLabelMilk = vOddL.ElementAt(1).InnerText.Trim();

string GroupValueBread = vOddP.ElementAt(0).InnerText.Trim();
string GroupValueMilk = vOddP.ElementAt(1).InnerText.Trim();

Console.WriteLine($"{GroupLabelBread}\t{GroupValueBread}");
Console.WriteLine($"{GroupLabelMilk}\t{GroupValueMilk}");

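Note the ".//" in sXPathT and sXPathO. The leading dot makes the query search only beneath the current context node instead of the whole document.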

Output:

Bread   82548828
Milk    8545852485
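
If the number of items is not known in advance, the same relative-XPath idea can be applied per group, so each label stays paired with its own value rather than being matched up by index. This is only a sketch under assumptions: it needs the HtmlAgilityPack NuGet package, GroupReader, ReadGroups and SampleHtml are hypothetical names, and the data-id is concatenated into the XPath on the assumption that it contains no quote characters.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class GroupReader
{
    // The updated sample from the question, condensed onto fewer lines.
    public const string SampleHtml =
        "<div class=\"content\" data-id=\"123456789\"><div class=\"m-group item\">" +
        "<span class=\"group\"><a href=\"javascript:void(0);\">" +
        "<span class=\"group-label\">Newspaper </span>" +
        "<span class=\"group-value\">82548828</span></a></span>" +
        "<span class=\"group\"><a href=\"javascript:void(0);\">" +
        "<span class=\"group-label\">Fish </span>" +
        "<span class=\"group-value\">8545852485</span></a></span>" +
        "</div></div>" +
        "<div class=\"content\" data-id=\"987654321\"><div class=\"m-group item\">" +
        "<span class=\"group\"><a href=\"javascript:void(0);\">" +
        "<span class=\"group-label\">Bread</span>" +
        "<span class=\"group-value\">82548828</span></a></span>" +
        "<span class=\"group\"><a href=\"javascript:void(0);\">" +
        "<span class=\"group-label\">Milk </span>" +
        "<span class=\"group-value\">8545852485</span></a></span>" +
        "</div></div>";

    // Hypothetical helper: (label, value) pairs for the content div with this data-id.
    public static List<KeyValuePair<string, string>> ReadGroups(string html, string dataId)
    {
        var result = new List<KeyValuePair<string, string>>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Assumption: dataId contains no quote characters.
        string groupXPath =
            "//div[@class='content' and @data-id='" + dataId + "']//span[@class='group']";
        HtmlNodeCollection groups = doc.DocumentNode.SelectNodes(groupXPath);
        if (groups == null) return result; // nothing matched

        foreach (HtmlNode group in groups)
        {
            // ".//" keeps each lookup inside the current group span.
            HtmlNode label = group.SelectSingleNode(".//span[@class='group-label']");
            HtmlNode value = group.SelectSingleNode(".//span[@class='group-value']");
            if (label == null || value == null) continue; // skip incomplete groups
            result.Add(new KeyValuePair<string, string>(
                label.InnerText.Trim(), value.InnerText.Trim()));
        }
        return result;
    }

    public static void Main()
    {
        foreach (var pair in ReadGroups(SampleHtml, "987654321"))
            Console.WriteLine(pair.Key + "\t" + pair.Value);
    }
}
```

Run against the sample, this should list Bread and Milk with their values, and passing "123456789" instead should list Newspaper and Fish. The null checks matter because SelectNodes and SelectSingleNode return null when nothing matches.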



Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow