Issue with HTMLAgilityPack parsing HTML using C#

c# html-agility-pack xpath

Question

I'm just learning HtmlAgilityPack and XPath, and I'm trying to get a list of companies (as HTML links) from the NASDAQ website:

http://www.nasdaq.com/quotes/nasdaq-100-stocks.aspx

This is the code I currently have:

HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();

// Create a request for the URL.
WebRequest request = WebRequest.Create("http://www.nasdaq.com/quotes/nasdaq-100-stocks.aspx");
// Get the response.
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
// Get the stream containing content returned by the server.
Stream dataStream = response.GetResponseStream();
// Open the stream using a StreamReader for easy access.
StreamReader reader = new StreamReader(dataStream);
// Read the content.
string responseFromServer = reader.ReadToEnd();
// Load the content into the HAP document.
htmlDoc.LoadHtml(responseFromServer);

HtmlNodeCollection tl = htmlDoc.DocumentNode.SelectNodes("//*[@id='indu_table']/tbody/tr[*]/td/b/a");
foreach (HtmlAgilityPack.HtmlNode node in tl)
{
    Debug.Write(node.InnerText);
}

// Clean up the streams and the response.
reader.Close();
dataStream.Close();
response.Close();

I obtained the following XPath using an XPath add-on for Chrome:

//*table[@id='indu_table']/tbody/tr[*]/td/b/a

When I run my project, I get an unhandled XPath exception saying the expression has an invalid token.

I've tried replacing the * in tr[*] above with a number, but I still get the same error, so I'm not sure what's wrong.

I've been looking at this for an hour. Have I missed something obvious?

Thanks

6/13/2012 3:01:11 PM

Accepted Answer

Because the data comes from JavaScript, you have to parse the JavaScript rather than the HTML, so the Agility Pack doesn't help that much, although it does make things a little easier. Here is how it could be done, using the Agility Pack to find the script and Newtonsoft's Json.NET to parse the JavaScript array:

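using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;
using HtmlAgilityPack;
using Newtonsoft.Json.Linq;

HtmlDocument htmlDoc = new HtmlDocument();
htmlDoc.Load(new WebClient().OpenRead("http://www.nasdaq.com/quotes/nasdaq-100-stocks.aspx"));
List<string> listStocks = new List<string>();
// Find the inline script that declares the table_body array.
HtmlNode scriptNode = htmlDoc.DocumentNode.SelectSingleNode("//script[contains(text(),'var table_body =')]");
if (scriptNode != null)
{
  // Use a regex to get just the array we're interested in; Singleline lets '.' match newlines in case the array spans several lines.
  Match match = Regex.Match(scriptNode.InnerText, "table_body = (?<Array>\\[.+?\\]);", RegexOptions.Singleline);
  if (match.Success)
  {
    JArray jArray = JArray.Parse(match.Groups["Array"].Value);
    foreach (JToken token in jArray.Children())
    {
      // Each child is itself an array; its first element is the ticker symbol.
      listStocks.Add("http://www.nasdaq.com/symbol/" + token.First.Value<string>().ToLower());
    }
  }
}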

To explain in a bit more detail, the data comes from one big JavaScript array on the page, var table_body = [... Each stock is one element of that array and is itself an array:

["ATVI", "Activision Blizzard, Inc", 11.75, 0.06, 0.51, 3058125, 0.06, "N", "N"]

Therefore, by parsing the array, taking the first element of each entry, and appending it to the fixed URL prefix, we get the same result as the page's JavaScript.
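As a rough offline illustration of those steps, here is a minimal sketch that runs against a hard-coded string instead of the live page. It assumes the Newtonsoft.Json package is referenced; the class name TableBodyParser, the method name BuildSymbolUrls, and the surrounding script text are made up for this example, and only the ATVI row comes from the page.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using Newtonsoft.Json.Linq;

public static class TableBodyParser
{
    // Hypothetical helper: pulls the table_body array out of script text
    // and builds one symbol URL per stock row.
    public static List<string> BuildSymbolUrls(string scriptText)
    {
        var urls = new List<string>();

        // Singleline lets '.' span newlines; the lazy match stops at the first "];".
        Match match = Regex.Match(
            scriptText,
            @"table_body = (?<Array>\[.+?\]);",
            RegexOptions.Singleline);
        if (!match.Success)
        {
            return urls; // no array found, so there is nothing to build
        }

        JArray rows = JArray.Parse(match.Groups["Array"].Value);
        foreach (JToken row in rows)
        {
            // Each row is itself an array; element 0 is the ticker symbol.
            string symbol = (string)row[0];
            urls.Add("http://www.nasdaq.com/symbol/" + symbol.ToLowerInvariant());
        }
        return urls;
    }

    public static void Main()
    {
        // Assumed script text for illustration; the real page has many rows.
        string script =
            "var table_body = [[\"ATVI\", \"Activision Blizzard, Inc\", " +
            "11.75, 0.06, 0.51, 3058125, 0.06, \"N\", \"N\"]];";

        foreach (string url in BuildSymbolUrls(script))
        {
            Console.WriteLine(url); // prints http://www.nasdaq.com/symbol/atvi
        }
    }
}
```

Keeping the extraction in a method that takes a plain string makes it easy to try against saved page source before pointing it at the live site.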

6/13/2012 5:59:20 PM

Popular Answer

The page source for that URL doesn't actually contain an element with id=indu_table. It appears to be generated dynamically (i.e. in JavaScript), and the HTML you get when loading directly from the server will not reflect anything changed by client script. This is probably why it isn't working.
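One way to check this yourself is to run the XPath against the raw HTML and see whether anything matches. The sketch below assumes the HtmlAgilityPack package is referenced; the class name RawHtmlCheck, the method name HasMatch, and the sample HTML are hypothetical stand-ins for what the server sends before any script runs. Note that SelectNodes returns null rather than an empty collection when nothing matches, so a foreach over the result needs a null check first.

```csharp
using System;
using HtmlAgilityPack;

public static class RawHtmlCheck
{
    // Hypothetical helper: true when the XPath matches at least one node.
    public static bool HasMatch(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
        return nodes != null;
    }

    public static void Main()
    {
        // Assumed sample: the table is absent and the data sits in a script.
        string html =
            "<html><body><div id='quotes'></div>" +
            "<script>var table_body = [[\"ATVI\"]];</script>" +
            "</body></html>";

        // Is the element the question's XPath targets present in the raw HTML?
        Console.WriteLine(HasMatch(html, "//*[@id='indu_table']"));

        // Is there a script that declares the data array instead?
        Console.WriteLine(HasMatch(html, "//script[contains(text(),'var table_body =')]"));
    }
}
```

If the first check reports no match on the downloaded page source, the table is being built on the client, and the XPath copied from the browser's rendered DOM will not apply to the raw response.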




Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow