HTML Agility Pack - using XPath to get a single node - Object Reference not set to an instance of an object

html-agility-pack xpath

Question

This is my first attempt at using HAP to get an element's value. When I try to read InnerText, I get a null object error.

I am scraping the following URL: http://www.mypivots.com/dailynotes/symbol/659/-1/e-mini-sp500-june-2013. I am trying to get the current high value from the Day Change Summary table.

My code is at the bottom. First, could you tell me whether I am approaching this the right way? If so, is it simply that my XPath is incorrect?

I got the XPath from a tool I found called htmlagility helper. The problem also occurs with the Firebug version of the XPath, which looks like this: /html/body/div[3]/div/table/tbody/tr[3]/td/table/tbody/tr[5]/td[3]

My code is:

WebClient myPivotsWC = new WebClient();
string nodeValue;
string htmlCode = myPivotsWC.DownloadString("http://www.mypivots.com/dailynotes/symbol/659/-1/e-mini-sp500-june-2013");
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(htmlCode);
HtmlNode node = doc.DocumentNode.SelectSingleNode("/html[1]/body[1]/div[3]/div[1]/table[1]/tbody[1]/tr[3]/td[1]/table[1]/tbody[1]/tr[5]/td[3]");
nodeValue=(node.InnerText);

Thanks, Will.

1
8
4/5/2013 5:52:35 AM

Accepted Answer

You can't rely on developer tools such as Firebug or Chrome to determine the XPATH for the nodes you're after: the XPATH those tools give you corresponds to the browser's in-memory HTML DOM, while the HTML Agility Pack only knows about the raw HTML sent back by the server.

What you need to do is look at what is actually sent back (or just do a view source). You'll notice, for instance, that there is no TBODY element. So find something distinctive in the markup and navigate from it, using XPATH axes for example. Also, your XPATH, even if it worked, would not be very resistant to changes in the document, so you need to anchor on something more "stable" to make the scrape more future-proof.
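
To illustrate the missing TBODY point, here is a minimal sketch. The markup, the class name TbodyDemo and the helper SelectText are made up for illustration and are not taken from the page in the question. It runs a browser-style path (with tbody) and a path matching the raw markup against the same document:

```csharp
using System;
using HtmlAgilityPack;

static class TbodyDemo
{
    // Hypothetical raw markup as a server might send it: note there is no <tbody>.
    public const string RawHtml =
        "<html><body><table><tr><td>1586.00</td></tr></table></body></html>";

    // Hypothetical helper: inner text of the first match, or null when nothing matches.
    public static string SelectText(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        HtmlNode node = doc.DocumentNode.SelectSingleNode(xpath);
        return node?.InnerText;
    }

    static void Main()
    {
        // Browser-style path: includes the tbody that browsers add to their own DOM.
        string viaTbody = SelectText(RawHtml, "/html/body/table/tbody/tr/td");
        // Path that follows the raw markup.
        string direct = SelectText(RawHtml, "/html/body/table/tr/td");

        Console.WriteLine(viaTbody == null ? "tbody path: no match" : viaTbody);
        Console.WriteLine(direct == null ? "direct path: no match" : direct);
    }
}
```

The first path is expected to find nothing, because the parsed document only contains what the raw HTML contains. That unmatched path is what leads to the null reference in the question.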

Here is some code that seems to work:

HtmlNode node = doc.DocumentNode.SelectSingleNode("//td[@class='dnTableCell']//a[text()='High']/../../td[3]");

What it does is this:

  • Find a TD element whose CLASS attribute is set to 'dnTableCell'. The // token means the search is recursive in the XML hierarchy.
  • Look for an A element whose inner text is 'High'.
  • Move up two parents to reach the closest TR element.
  • From there, select the third TD element.
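
The steps above can be tried end to end with a small sketch. The sample markup below is an assumption, loosely modeled on the table described in the question, and the names HighValueDemo and ExtractHigh are hypothetical. It also checks for null before reading InnerText, since SelectSingleNode returns null when nothing matches:

```csharp
using System;
using HtmlAgilityPack;

static class HighValueDemo
{
    // Hypothetical markup; the real page may be structured differently.
    public const string SampleHtml = @"
<table>
  <tr><td class='dnTableCell'><a href='#'>Open</a></td><td>1580.25</td><td>1582.00</td></tr>
  <tr><td class='dnTableCell'><a href='#'>High</a></td><td>1590.00</td><td>1586.50</td></tr>
</table>";

    // Returns the text of the third cell in the 'High' row, or null if it is not found.
    public static string ExtractHigh(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNode node = doc.DocumentNode.SelectSingleNode(
            "//td[@class='dnTableCell']//a[text()='High']/../../td[3]");

        // Guard against a missing node instead of dereferencing it directly.
        return node == null ? null : node.InnerText.Trim();
    }

    static void Main()
    {
        string high = ExtractHigh(SampleHtml);
        Console.WriteLine(high ?? "High value not found");
    }
}
```

Returning null (rather than throwing) lets the caller decide how to handle a page whose layout has changed.
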
28
11/14/2017 6:44:05 PM

Popular Answer

As the accepted answer explains, you are getting the raw HTML sent by the server. The element you need can't be found because it doesn't exist in the DOM yet: the page hasn't been rendered. A simple way around this is to use a web renderer to build the DOM, then grab the resulting HTML and scrape that. I use WatiN like this:

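using WatiN.Core;

// Keep the browser window hidden and leave the mouse pointer where it is.
WatiN.Core.Settings.MakeNewInstanceVisible = false;
WatiN.Core.Settings.AutoMoveMousePointerToTopLeft = false;
IE ie = new IE();
ie.GoTo(urlLink); // urlLink holds the URL of the page to scrape
ie.WaitForComplete();
string html = ie.Html; // HTML of the rendered page
ie.Close();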


Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow