How to use Html Agility Pack to get everything between <b> and <br>

c# html-agility-pack html-parsing screen-scraping

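I asked a poorly worded question about this same project last week and received no suggestions, so I will try to be clearer. I am trying to use data from the website www.gtin13.com. For example, if you type peanut butter into the search, I am trying to grab the description: Nabisco Nutter Butter Sandwich Cookies Chocolate Peanut Butter 4 Ct, size: 12 oz, GTIN: 0044000003562, ean: 00-44000-00356-2, upc: 044000003562, and upca: 04400000356. I tried filling a node collection with SelectNodes("<b>"), and all I get are errors. Is it even possible to use Html Agility Pack to get the data between <b> and <br>, and then parse between the /'s? Because of my lack of experience I cannot make any progress. It seems the returned page does not have what I think of as real nodes. If Html Agility Pack cannot do this, can anyone suggest a better way? Ultimately I want to send each piece of data to a SQL table. I hope I have presented this in a way that makes more sense.

The page returns the information in this source format: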

<b><a href="/product/nabisco+nutter+butter+sandwich+cookies+chocolate+peanut+butter+4+ct/">Nabisco Nutter Butter Sandwich Cookies Chocolate Peanut Butter 4 Ct</a></b><br />

Size: 12 oz<br />

GTIN/EAN-13: 0044000003562 / 00-44000-00356-2<br />

UPC-A: 044000003562 / 04400000356<br />



Tags:

<a href="/tag/chocolate/">Chocolate</a>, 

<a href="/tag/cookies/">Cookies</a>, 
 ..<br />

<br >

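Accepted answer

This is not easy because the original document is unstructured (it uses a flat layout rather than a hierarchical one), but here is how to extract the main text fields using Html Agility Pack: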

        // Requires: using System; using HtmlAgilityPack;
        HtmlDocument doc = new HtmlDocument();
        doc.Load("yourDoc.Htm");

        // Get A nodes that have an HREF attribute and sit directly inside a B node.
        // Note: SelectNodes returns null when nothing matches.
        foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//b/a[@href]"))
        {
            // This will contain the anchor's displayed text
            string title = node.InnerText;
            Console.WriteLine("title=" + title);

            // Get the 1st BR after the parent B, and then its next sibling of TEXT type.
            HtmlNode sizeNode = node.SelectSingleNode("../following-sibling::br[1]/following-sibling::text()");
            Console.WriteLine(" size=" + sizeNode.InnerText.Trim());

            // Get the 2nd BR, and then its next sibling of TEXT type.
            HtmlNode eanNode = node.SelectSingleNode("../following-sibling::br[2]/following-sibling::text()");
            Console.WriteLine(" ean=" + eanNode.InnerText.Trim());

            // Get the 3rd BR, and then its next sibling of TEXT type.
            HtmlNode upcNode = node.SelectSingleNode("../following-sibling::br[3]/following-sibling::text()");
            Console.WriteLine(" upc=" + upcNode.InnerText.Trim());
        }

For the sample markup in the question, this will display:

        title=Nabisco Nutter Butter Sandwich Cookies Chocolate Peanut Butter 4 Ct
         size=Size: 12 oz
         ean=GTIN/EAN-13: 0044000003562 / 00-44000-00356-2
         upc=UPC-A: 044000003562 / 04400000356

and so on...
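
The XPath approach above depends on the fields always sitting after the 1st, 2nd, and 3rd `<br />`. As a possible alternative, here is a minimal sketch that walks the siblings after each `<b>` and collects every text node until the next `<b>`. The class name `SiblingWalkSketch`, the helper `TextAfter`, and the abbreviated `html` string are assumptions made for illustration, not part of the original answer. It also reflects that SelectNodes expects an XPath expression such as `//b` rather than a tag such as `<b>` (probably the cause of the errors in the question) and that it returns null when nothing matches.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public static class SiblingWalkSketch
{
    // Hypothetical helper: collects the trimmed, non-empty text nodes that
    // follow a <b> element, stopping at the next <b> sibling (assumed to be
    // the start of the next product entry).
    public static List<string> TextAfter(HtmlNode bold)
    {
        var lines = new List<string>();
        for (HtmlNode n = bold.NextSibling; n != null && n.Name != "b"; n = n.NextSibling)
        {
            if (n.NodeType != HtmlNodeType.Text) continue; // skip <br>, <a>, ...
            string text = n.InnerText.Trim();
            if (text.Length > 0) lines.Add(text);
        }
        return lines;
    }

    public static void Main()
    {
        // Abbreviated copy of the markup shown in the question (tags omitted).
        const string html =
            "<b><a href=\"/product/nabisco+nutter+butter+sandwich+cookies+chocolate+peanut+butter+4+ct/\">" +
            "Nabisco Nutter Butter Sandwich Cookies Chocolate Peanut Butter 4 Ct</a></b><br />\n" +
            "Size: 12 oz<br />\n" +
            "GTIN/EAN-13: 0044000003562 / 00-44000-00356-2<br />\n" +
            "UPC-A: 044000003562 / 04400000356<br />\n";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // XPath: every <b> that contains an <a> with an href attribute.
        HtmlNodeCollection bolds = doc.DocumentNode.SelectNodes("//b[a/@href]");
        if (bolds == null) return; // nothing matched

        foreach (HtmlNode bold in bolds)
        {
            Console.WriteLine("title=" + bold.InnerText.Trim());
            foreach (string line in TextAfter(bold))
                Console.WriteLine("  " + line);
        }
    }
}
```

On the real page the loop would also pick up the `Tags:` text and the commas between tag links, so some filtering by label would probably still be needed.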

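Note: this is not 100% complete, because you still have to parse the size, ean, and upc variables using standard string operations (IndexOf, Substring, and so on) or Regex, but the HTML side of things is done.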
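
The remaining string parsing could look roughly like the sketch below. It assumes each extracted line has the shape `Label: value` and that multiple codes are separated by ` / `, as in the sample from the question. The class name `FieldParsingSketch` and the helpers `SplitField` and `SplitCodes` are made up for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FieldParsingSketch
{
    // Splits "Label: value" at the first colon. If there is no colon,
    // the label is empty and the whole text is returned as the value.
    public static (string Label, string Value) SplitField(string text)
    {
        int colon = text.IndexOf(':');
        if (colon < 0) return ("", text.Trim());
        return (text.Substring(0, colon).Trim(), text.Substring(colon + 1).Trim());
    }

    // Splits a value such as "0044000003562 / 00-44000-00356-2" at the
    // slash separators, dropping the surrounding whitespace.
    public static string[] SplitCodes(string value)
    {
        return Regex.Split(value.Trim(), @"\s*/\s*");
    }

    public static void Main()
    {
        var (label, value) = SplitField("GTIN/EAN-13: 0044000003562 / 00-44000-00356-2");
        string[] codes = SplitCodes(value);

        Console.WriteLine(label);    // GTIN/EAN-13
        Console.WriteLine(codes[0]); // 0044000003562
        Console.WriteLine(codes[1]); // 00-44000-00356-2
    }
}
```

Splitting at the first colon before splitting on slashes keeps the slash inside the `GTIN/EAN-13` label from being treated as a separator.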


Popular answer

Using HTQL, the query to extract the whole table from the page is:

<div (CLASS='BGC')>1.<div (CLASS='CON')>1.<div (CLASS='SC')>1.<div (ID='post-20')>1.<div (CLASS='PostContent')>1.<b sep>2-0 {
  title=<a>1:tx; 
  size=/'Size:'~'<br />'/;
  gtin=/'GTIN/EAN-13:'~'<br />'/;
  upc=/'UPC-A:'~'<br />'/;
  tags=/'Tags:'~'<br />'/;
}

If you just need to send the results to a SQL database, then I suggest you use the IRobotSoft web scraper.




Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow