Html Agility Pack - reading div InnerText in table

c# html-agility-pack web-scraping

Question

My problem is that I can't get the InnerText of a div inside a table. I have successfully extracted other kinds of data, but I don't know how to read a div from a table.

In the following picture I've highlighted the div, and I need to get its InnerText - in this case, the number 3.

Click here for first picture

I'm trying to accomplish this using the following XPath:

"//div[@class='kal']//table//tr[2]/td[1]/div[@class='cipars']"

But I'm getting the following error:

Click here for Error message picture

Assuming the rest of the code is written correctly, could anyone point me in the right direction? I have been trying to figure this out, but I can't get any results.

Accepted Answer

So your problem is that you are relying on positions within your XPath. While this can be fine in some cases, it is not here, because you are expecting the first td in a given tr to contain a div with the class.

Looking at the source in Chrome, it shows this is not always the case. You can see this by comparing the "1" element in the calendar, to "2" and "3". You'll notice the "1" element has a number of elements around it, which the others don't.

Your original XPath query does not return an element, which is why you are getting the error. Whenever the XPath query you give HtmlAgilityPack does not match a DOM element, it returns null.
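A minimal sketch of that behaviour, assuming `doc` is an `HtmlDocument` you have already loaded (this requires the HtmlAgilityPack NuGet package) and using the question's XPath:

```csharp
// Hypothetical sketch - `doc` is assumed to be an HtmlDocument
// already loaded with the page's HTML (HtmlAgilityPack NuGet package).
var node = doc.DocumentNode.SelectSingleNode(
    "//div[@class='kal']//table//tr[2]/td[1]/div[@class='cipars']");

if (node == null)
{
    // No element matched the query; dereferencing node.InnerText here
    // is exactly what produces the NullReferenceException.
    Console.WriteLine("XPath matched nothing");
}
```

Guarding against the null result like this makes the failure visible instead of crashing.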

Now, because you haven't shown your entire code, I don't know how this code is being run. However, I am guessing you are trying to loop through all of the calendar items. Either way, you have multiple ways of doing this, but with the descendant XPath selector you can grab the whole lot in one go:

//div[@class='kal']//table//descendant::div[@class='cipars']

This will return all of the calendar items (i.e. 1 through 30).
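To illustrate, here is a hedged sketch of looping over those results (again assuming `doc` is a loaded `HtmlDocument`; note that `SelectNodes`, unlike `SelectSingleNode`, returns a null collection when nothing matches):

```csharp
// Sketch - requires the HtmlAgilityPack NuGet package;
// `doc` is assumed to hold the page's HTML.
var days = doc.DocumentNode.SelectNodes(
    "//div[@class='kal']//table//descendant::div[@class='cipars']");

if (days != null)
{
    foreach (var day in days)
    {
        // Each node's InnerText is one calendar day number.
        Console.WriteLine(day.InnerText.Trim());
    }
}
```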

However, to get all the items in a particular row, you can just stick that tr into the query:

//div[@class='kal']//table//tr[3]/descendant::div[@class='cipars']

This would return 2 to 8 (the second row of calendar items).

To target a specific one, you'll have to make an assumption about the source code of the website. It looks like every "cipars" div has an ancestor td with the class datums, so to get the "3" value from your question:

//div[@class='kal']//table//tr[3]//td[@class='datums'][2]/div[@class='cipars']

Hopefully this is enough to show the issue at least.

Edit

Although you do have an XPath problem, you also have another issue.

The site is built strangely: the calendar is loaded client-side. When I hit that URL, the calendar is created by some JavaScript calling an XML web service (written in PHP), which then calculates the full table used for the calendar.

Because this is JavaScript (client-side code), HtmlAgilityPack won't execute it. Therefore, HtmlAgilityPack never even "sees" the table, and queries against it come back as "not found" (null).

There are two ways around this:

1) Use a tool that will execute the scripts - in other words, load up a browser. A great tool for this is Selenium. This will probably be the better overall solution because all the scripting used by the site will actually run, and you can still use XPath with it, so your queries will not change.

2) Send a request to the same web service the page does. This basically gets back the same HTML the page receives, which you can then feed to HtmlAgilityPack. How do we do that?

Well, you can easily POST data to a web service using C#. For ease of use I've borrowed the code from this SO question. With it, we can send the same request the page does and get the same HTML back.

So to send some POST data, we write a method like so:

// Requires: using System; using System.IO; using System.Net; using System.Text;
public static string SendPost(string url, string postData)
{
    string webpageContent = string.Empty;

    // Encode the form data as the request body
    byte[] byteArray = Encoding.UTF8.GetBytes(postData);

    HttpWebRequest webRequest = (HttpWebRequest)WebRequest.Create(url);
    webRequest.Method = "POST";
    webRequest.ContentType = "application/x-www-form-urlencoded";
    webRequest.ContentLength = byteArray.Length;

    // Write the POST body to the request stream
    using (Stream webpageStream = webRequest.GetRequestStream())
    {
        webpageStream.Write(byteArray, 0, byteArray.Length);
    }

    // Read the full response body back as a string
    using (HttpWebResponse webResponse = (HttpWebResponse)webRequest.GetResponse())
    using (StreamReader reader = new StreamReader(webResponse.GetResponseStream()))
    {
        webpageContent = reader.ReadToEnd();
    }

    return webpageContent;
}

We can call it like so:

string responseBody = SendPost("http://lekcijas.va.lv/lekcijas_request.php", "nodala=IT&kurss=1&gads=2013&menesis=9&c_dala=");

How did I get this? The PHP file we are calling is the web service the page uses, and the POST data matches what the page sends. I found out what data it sends to the service by debugging the JavaScript (using Chrome's developer console), but you may notice it's pretty much the same as what's in the URL. That seems to be intentional.

The responseBody that is returned is the raw HTML of just the calendar table.

What do we do with it now? We load it into HtmlAgilityPack, which can accept raw HTML directly.

var document = new HtmlDocument();
document.LoadHtml(responseBody);

Now, we stick that original XPath in:

var node = document.DocumentNode.SelectSingleNode("//div[@class='kal']//table//tr[3]//td[@class='datums'][2]/div[@class='cipars']");

Now, we print out what should hopefully be "3":

Console.WriteLine(node.InnerText);

My output, running it locally, is indeed: 3.

However, although this gets you past the problem you are having, I assume the rest of the site works the same way. If so, you may still be able to work around it using the technique above, but tools like Selenium were created for this very reason.



Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow
Is this KB legal? Yes, learn why