This is my first effort to use HAP to get an element value. If I attempt to utilize InnerText, I get a null object error.
I am scraping the following URL: http://www.mypivots.com/dailynotes/symbol/659/-1/e-mini-sp500-june-2013 I am trying to get the value for current high from the Day Change Summary Table.
My code is below. First off, could you please tell me whether I am approaching this the right way? If so, is it simply that my XPath value is incorrect?
I used a tool I found called htmlagility helper to get the XPath value. The XPath that Firebug gives me has the same problem: /html/body/div/div/table/tbody/tr/td/table/tbody/tr/td
My code is:
    WebClient myPivotsWC = new WebClient();
    string nodeValue;
    string htmlCode = myPivotsWC.DownloadString("http://www.mypivots.com/dailynotes/symbol/659/-1/e-mini-sp500-june-2013");
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(htmlCode);
    HtmlNode node = doc.DocumentNode.SelectSingleNode("/html/body/div/div/table/tbody/tr/td/table/tbody/tr/td");
    nodeValue=(node.InnerText);
You can't rely on developer tools such as FireBug or Chrome to determine the XPATH for the nodes you're after, because the XPATH those tools give you corresponds to the in-memory HTML DOM built by the browser, while the HTML Agility Pack only knows about the raw HTML sent back by the server.
What you need to do is look at what the server actually sends back (or just do a view source). You'll notice, for instance, that there is no TBODY element: the browser adds it when it builds the DOM, so SelectSingleNode finds no match for your path and returns null, which is why InnerText fails. You also need to find something discriminating to anchor on, for example by using XPATH axes. Your XPATH, even if it worked, would not be very resistant to changes in the document, so to make the scrape more future-proof you need to key it on something more "stable".
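To make the TBODY point concrete, here is a small sketch run against made-up markup. The SampleHtml string, the value 1650.25, and the XPathDemo and TextAt names are assumptions for illustration and are not taken from the real page. It compares a browser-style path containing tbody with an axis-based path anchored on the "High" label, and it checks for null before reading InnerText.

```csharp
using System;
using HtmlAgilityPack;

static class XPathDemo
{
    // Hypothetical markup, NOT copied from the real page: a table with no
    // <tbody>, one label cell and one value cell.
    public const string SampleHtml =
        "<html><body><table>" +
        "<tr><td class='dnTableCell'><a href='#'>High</a></td>" +
        "<td class='dnTableCell'>1650.25</td></tr>" +
        "</table></body></html>";

    // Returns the InnerText of the first match, or null when nothing matches,
    // so the caller never dereferences a null HtmlNode.
    public static string TextAt(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        HtmlNode node = doc.DocumentNode.SelectSingleNode(xpath);
        return node == null ? null : node.InnerText;
    }

    static void Main()
    {
        // Browser-style path: no match, because the raw HTML has no <tbody>.
        Console.WriteLine(TextAt(SampleHtml, "/html/body/table/tbody/tr/td") ?? "(no match)");

        // Axis-based path: start at the "High" link, go up to its cell,
        // then take the cell that follows it in the same row.
        Console.WriteLine(TextAt(SampleHtml,
            "//a[text()='High']/ancestor::td/following-sibling::td") ?? "(no match)");
    }
}
```

The axis-based expression depends on the label text rather than on the position of each element, which is the kind of "stable" anchor meant above. Whether it matches the real page depends on that page's actual markup.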
Here is code that seems to work:
HtmlNode node = doc.DocumentNode.SelectSingleNode("//td[@class='dnTableCell']//a[text()='High']/../../td");
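What it does is this: it selects any TD element, anywhere in the document (the leading // makes the search recursive), whose CLASS attribute is set to 'dnTableCell'; then any A element beneath it whose text equals 'High'; then it goes up two parent levels with ../..; and finally it selects the TD children of the element it lands on. SelectSingleNode returns the first of those.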
As Simon Mourier explained, what you retrieved is the raw HTML sent by the server. The element you need has not been rendered yet, so it does not exist in the DOM and you cannot retrieve it. An easy solution is to use a web renderer to build the DOM, then grab the resulting HTML and scrape that. I use WatiN as follows:
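    // IE lives in the WatiN.Core namespace; urlLink is the address of the page to scrape.
    WatiN.Core.Settings.MakeNewIeInstanceVisible = false;
    WatiN.Core.Settings.AutoMoveMousePointerToTopLeft = false;
    IE ie = new IE();
    ie.GoTo(urlLink);
    ie.WaitForComplete();
    string html = ie.Html;   // HTML of the page after the browser has rendered it
    ie.Close();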
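The HTML that WatiN returns can then be handed to the HTML Agility Pack just like a downloaded string. What follows is only a rough sketch: the RenderedScrape class and FirstText helper are made-up names, the XPath is the one from the other answer and has not been checked against the rendered markup, and it assumes WatiN and Internet Explorer are installed.

```csharp
using System;
using WatiN.Core;

static class RenderedScrape
{
    // Parses the HTML and returns the InnerText of the first node matching
    // the XPath, or null when nothing matches.
    public static string FirstText(string html, string xpath)
    {
        var doc = new HtmlAgilityPack.HtmlDocument();
        doc.LoadHtml(html);
        HtmlAgilityPack.HtmlNode node = doc.DocumentNode.SelectSingleNode(xpath);
        return node == null ? null : node.InnerText;
    }

    [STAThread] // WatiN drives Internet Explorer through COM, which needs an STA thread
    static void Main()
    {
        string urlLink = "http://www.mypivots.com/dailynotes/symbol/659/-1/e-mini-sp500-june-2013";
        string html;

        IE ie = new IE();
        try
        {
            ie.GoTo(urlLink);
            ie.WaitForComplete();
            html = ie.Html;
        }
        finally
        {
            ie.Close(); // close the browser even if navigation throws
        }

        // Assumption: the selector from the other answer also fits the rendered page.
        string value = FirstText(html, "//td[@class='dnTableCell']//a[text()='High']/../../td");
        Console.WriteLine(value ?? "(no match)");
    }
}
```

Keeping the parsing in its own method means the XPath can be tried against a saved copy of the page without launching a browser each time.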