Problem parsing HTML with HtmlAgilityPack in C#

c# html-agility-pack xpath

I'm just getting to grips with HtmlAgilityPack and XPath, and I'm trying to get the list of companies (the HTML links) from the NASDAQ website:

http://www.nasdaq.com/quotes/nasdaq-100-stocks.aspx

I currently have the following code:

        HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();

        // Create a request for the URL.
        WebRequest request = WebRequest.Create("http://www.nasdaq.com/quotes/nasdaq-100-stocks.aspx");
        // Get the response.
        HttpWebResponse response = (HttpWebResponse)request.GetResponse();
        // Get the stream containing content returned by the server.
        Stream dataStream = response.GetResponseStream();
        // Open the stream using a StreamReader for easy access.
        StreamReader reader = new StreamReader(dataStream);
        // Read the content.
        string responseFromServer = reader.ReadToEnd();
        // Load the content into an HtmlDocument, ready for HAP.
        htmlDoc.LoadHtml(responseFromServer);

        HtmlNodeCollection tl = htmlDoc.DocumentNode.SelectNodes("//*[@id='indu_table']/tbody/tr[*]/td/b/a");
        foreach (HtmlAgilityPack.HtmlNode node in tl)
        {
            Debug.Write(node.InnerText);
        }

        // Clean up the streams and the response.
        reader.Close();
        dataStream.Close();
        response.Close();

I got the XPath used above (//*[@id='indu_table']/tbody/tr[*]/td/b/a) from an XPath plugin for Chrome.

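When I run my project I get an unhandled XPath exception saying the expression has an invalid token.

I'm not sure what is wrong with it. I tried putting a number in the tr[*] part above, but I still get the same error.

I've been looking at this for the last hour. Is there something simple I'm missing?

Thanks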

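Accepted answer

Because the data comes from javascript, you have to parse the javascript rather than the HTML. The Agility Pack therefore doesn't help as much as usual, but it does make things a little easier. Here is how to use the Agility Pack together with Newtonsoft JSON.Net to parse the javascript.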

// Requires the HtmlAgilityPack and Newtonsoft.Json (JSON.Net) packages.
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;
using HtmlAgilityPack;
using Newtonsoft.Json.Linq;

HtmlDocument htmlDoc = new HtmlDocument();
htmlDoc.Load(new WebClient().OpenRead("http://www.nasdaq.com/quotes/nasdaq-100-stocks.aspx"));
List<string> listStocks = new List<string>();
// Find the script block that defines the table_body array.
HtmlNode scriptNode = htmlDoc.DocumentNode.SelectSingleNode("//script[contains(text(),'var table_body =')]");
if (scriptNode != null)
{
  // Use Regex to pull out just the array we're interested in...
  string stockArray = Regex.Match(scriptNode.InnerText, "table_body = (?<Array>\\[.+?\\]);").Groups["Array"].Value;
  JArray jArray = JArray.Parse(stockArray);
  foreach (JToken token in jArray.Children())
  {
    // The first element of each inner array is the ticker symbol.
    listStocks.Add("http://www.nasdaq.com/symbol/" + token.First.Value<string>().ToLower());
  }
}

To explain in a bit more detail: the data comes from one large javascript array on the page, var table_body = [... Each stock is an element of that array and is itself an array.

["ATVI", "Activision Blizzard, Inc", 11.75, 0.06, 0.51, 3058125, 0.06, "N", "N"]

So by parsing the array, taking the first element of each entry, and prepending the fixed URL, we get the same result as the javascript.
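
The extraction step can also be tried offline, without any network call or third-party package. The following is only a minimal sketch under some assumptions: the class and method names (TableBodyParser, ExtractSymbolUrls) are made up for illustration, System.Text.Json is swapped in for JSON.Net, and the array literal is assumed to be strict JSON (double-quoted strings), which System.Text.Json requires.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
public static class TableBodyParser
{
    // Pulls the "table_body" array literal out of a script's text and
    // returns one symbol URL per row. Returns an empty list if the
    // array is not found.
    public static List<string> ExtractSymbolUrls(string scriptText)
    {
        var urls = new List<string>();

        // Same pattern as the answer; Singleline is added (an assumption)
        // so the match still works if the array spans several lines.
        Match match = Regex.Match(
            scriptText,
            @"table_body = (?<Array>\[.+?\]);",
            RegexOptions.Singleline);
        if (!match.Success)
        {
            return urls;
        }

        // Assumes the literal is valid JSON.
        using (JsonDocument doc = JsonDocument.Parse(match.Groups["Array"].Value))
        {
            foreach (JsonElement row in doc.RootElement.EnumerateArray())
            {
                // Index 0 of each inner array is the ticker symbol.
                string symbol = row[0].GetString();
                urls.Add("http://www.nasdaq.com/symbol/" + symbol.ToLowerInvariant());
            }
        }

        return urls;
    }
}
```

Passing it a script text containing the ATVI row shown above yields a list with the single entry http://www.nasdaq.com/symbol/atvi. JSON.Net is more lenient about non-strict literals (single-quoted strings, for example), so the original approach may cope with pages where this sketch would throw.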


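Popular answer

If you look at the page source for that URL, there is actually no element with id=indu_table. It appears to be generated dynamically (i.e. in javascript); the HTML you get when loading directly from the server does not reflect anything changed by client-side script. That is probably why it isn't working.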
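
A quick way to confirm this before writing any XPath is to search the raw server response for the id. The sketch below is a rough diagnostic only: the RawHtmlCheck class and HasElementId method are hypothetical names, and it uses a regular expression over the markup rather than a real HTML parser, so it only recognises quoted id attributes.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for a quick sanity check; not a substitute for a parser.
public static class RawHtmlCheck
{
    // Returns true when the raw markup contains a quoted id attribute with
    // exactly the given value. The attribute name is matched case-insensitively,
    // the value case-sensitively. Attributes such as data-id are ignored.
    public static bool HasElementId(string html, string id)
    {
        string pattern =
            "(?<![\\w-])(?i:id)\\s*=\\s*[\"']" + Regex.Escape(id) + "[\"']";
        return Regex.IsMatch(html, pattern);
    }
}
```

For example, calling RawHtmlCheck.HasElementId(responseFromServer, "indu_table") on the string downloaded in the question would be expected to return false if the table is only built by client-side script, which also means SelectNodes has nothing to match and returns null.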




Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow