Get data using HAP (HTML Agility Pack) From Page

.net .net-4.0 c# html-agility-pack

Question

A continuation of this post, I am trying to parse out some data from an HTML page. Here is the HTML (there is more info on the page, but this is the important section):

<table class="integrationteamstats">
<tbody>
<tr>
    <td class="right">
        <span class="mediumtextBlack">Queue:</span>
    </td>
    <td class="left">
        <span class="mediumtextBlack">0</span>
    </td>
    <td class="right">
        <span class="mediumtextBlack">Aban:</span>
    </td>
    <td class="left">
        <span class="mediumtextBlack">0%</span>
    </td>
    <td class="right">
        <span class="mediumtextBlack">Staffed:</span>
    </td>
    <td class="left">
        <span class="mediumtextBlack">0</span>
    </td>
</tr>
<tr>
    <td class="right">
        <span class="mediumtextBlack">Wait:</span>
    </td>
    <td class="left">
        <span class="mediumtextBlack">0:00</span>
    </td>
    <td class="right">
        <span class="mediumtextBlack">Total:</span>
    </td>
    <td class="left">
        <span class="mediumtextBlack">0</span>
    </td>
    <td class="right">
        <span class="mediumtextBlack">On ACD:</span>
    </td>
    <td class="left">
        <span class="mediumtextBlack">0</span>
    </td>
</tr>
</tbody>
</table>

I need to get 2 pieces of information: the data inside of the td below Queue and the data inside the td below Wait (so the Queue count and wait time). Obviously the numbers are going to update frequently.

I have gotten to the point where the HTML is pulled into an HtmlDocument variable. And I've found something along the lines of using an HtmlNodeCollection to gather nodes that meet certain criteria. This is basically where I am stuck:

HtmlNodeCollection tds = 
    new HtmlNodeCollection(this.html.DocumentNode.ParentNode);
tds = this.html.DocumentNode.SelectNodes("//td");

foreach (HtmlNode td in tds)
{
    /* I want to write:
     * If the last node's value was 'Queue', give me the value of this node.
     * and
     * If the last node's value was 'Wait Time', give me the value of this node.
     */
}

And I can go through this with a foreach, but I am not certain how to access the value or how to get the next value.

Accepted Answer

Generally, there's no need for a foreach here, because getting the targeted information directly is pretty easy. With a foreach you'd have to track state from one iteration to the next (what the previous cell contained), which gets unwieldy.

First, you want to get the table. Filtering on the class attribute is generally fragile, as multiple elements in an HTML document can have the same class applied to them. If the table had an id attribute, that would be ideal.

That said, if this is the only table with this class, then you can get the body of the table element using:

// Get the table.
HtmlNode tableBody = document.DocumentNode.SelectSingleNode(
    "//table[@class='integrationteamstats']/tbody");

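The predicate @class='integrationteamstats' only matches when the attribute is exactly that string. If the table might carry additional classes, a more tolerant XPath can help. The following is a minimal sketch, assuming the HtmlAgilityPack package is referenced; the TableFinder class, the FindTableByClass helper, and the sample markup are hypothetical names made up for illustration.

```csharp
using System;
using HtmlAgilityPack;

public static class TableFinder
{
    // Hypothetical helper: finds the first table whose class list contains
    // the given token, even when other classes are present on the element.
    // Assumes className contains no quote characters.
    public static HtmlNode FindTableByClass(HtmlDocument document, string className)
    {
        // Padding the class list with spaces makes the match whole-token,
        // so "integration" does not match "integrationteamstats".
        string xpath =
            "//table[contains(concat(' ', normalize-space(@class), ' '), ' "
            + className + " ')]";

        // SelectSingleNode returns null when nothing matches.
        return document.DocumentNode.SelectSingleNode(xpath);
    }

    public static void Main()
    {
        var document = new HtmlDocument();
        document.LoadHtml(
            "<table class='wide integrationteamstats'><tbody><tr><td>x</td></tr></tbody></table>");

        HtmlNode table = FindTableByClass(document, "integrationteamstats");
        Console.WriteLine(table != null ? "found" : "not found");
    }
}
```

Since SelectSingleNode returns null when nothing matches, checking the result before using it is worthwhile either way.
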
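From there, you want to get the individual rows. They are direct children of the tbody element, but don't index into the ChildNodes property to get them: HAP keeps the whitespace between tags as text nodes, so ChildNodes[0] here is a text node rather than the first row. Select the tr elements by position with a relative XPath instead (XPath positions start at 1):

HtmlNode queueRow = tableBody.SelectSingleNode("tr[1]");
HtmlNode waitRow = tableBody.SelectSingleNode("tr[2]");

Then you want the second td element in each row. There's a span tag in there that wraps the content, but since you want all of the text in the td element in its entirety, you can use the InnerText property to get the value. Trim the result, because InnerText also includes the whitespace around the span:

string queueValue = queueRow.SelectSingleNode("td[2]").InnerText.Trim();
string waitValue = waitRow.SelectSingleNode("td[2]").InnerText.Trim();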
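
The steps above can be put together roughly as follows. This is a sketch rather than a definitive implementation: it assumes the HtmlAgilityPack package is referenced, uses an abbreviated copy of the table from the question, and the QueueStats class and SecondCellText helper are hypothetical names. It walks element children with Elements("tr") and Elements("td"), which skip the whitespace text nodes that HAP keeps in ChildNodes.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class QueueStats
{
    // Abbreviated version of the table from the question.
    public const string SampleHtml = @"
<table class=""integrationteamstats"">
<tbody>
<tr>
    <td class=""right""><span class=""mediumtextBlack"">Queue:</span></td>
    <td class=""left"">
        <span class=""mediumtextBlack"">0</span>
    </td>
</tr>
<tr>
    <td class=""right""><span class=""mediumtextBlack"">Wait:</span></td>
    <td class=""left"">
        <span class=""mediumtextBlack"">0:00</span>
    </td>
</tr>
</tbody>
</table>";

    // Returns the trimmed text of the second <td> in the given zero-based row,
    // or null when the table, row, or cell is missing.
    public static string SecondCellText(HtmlDocument document, int rowIndex)
    {
        HtmlNode tableBody = document.DocumentNode.SelectSingleNode(
            "//table[@class='integrationteamstats']/tbody");
        if (tableBody == null) return null;

        // Elements("tr") yields only <tr> child elements, not text nodes.
        HtmlNode row = tableBody.Elements("tr").ElementAtOrDefault(rowIndex);
        if (row == null) return null;

        HtmlNode cell = row.Elements("td").ElementAtOrDefault(1);
        if (cell == null) return null;

        // Decode entities such as &nbsp; and drop surrounding whitespace.
        return HtmlEntity.DeEntitize(cell.InnerText).Trim();
    }

    public static void Main()
    {
        var document = new HtmlDocument();
        document.LoadHtml(SampleHtml);

        Console.WriteLine("Queue: " + SecondCellText(document, 0));
        Console.WriteLine("Wait:  " + SecondCellText(document, 1));
    }
}
```

Returning null for a missing node keeps the caller in control of what to do when the page layout changes, instead of failing with a NullReferenceException in the middle of the lookup.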

Note, there's repetition here, so if you find there are a lot of rows that you have to parse like this, you might want to factor out some of the logic into helper methods.
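
One possible shape for such a helper looks a value up by its label instead of by position, which is closer to what the question asks for. This is a minimal sketch assuming the HtmlAgilityPack package is referenced; the StatsLookup class, the ValueAfterLabel helper, and the sample values are hypothetical and only for illustration.

```csharp
using System;
using HtmlAgilityPack;

public static class StatsLookup
{
    // Made-up sample data in the same layout as the table from the question.
    public const string SampleHtml = @"
<table class=""integrationteamstats"">
<tbody>
<tr>
    <td class=""right"">
        <span class=""mediumtextBlack"">Queue:</span>
    </td>
    <td class=""left"">
        <span class=""mediumtextBlack"">7</span>
    </td>
    <td class=""right""><span class=""mediumtextBlack"">Aban:</span></td>
    <td class=""left""><span class=""mediumtextBlack"">2%</span></td>
</tr>
<tr>
    <td class=""right""><span class=""mediumtextBlack"">Wait:</span></td>
    <td class=""left""><span class=""mediumtextBlack"">1:35</span></td>
</tr>
</tbody>
</table>";

    // Hypothetical helper: returns the trimmed text of the cell that
    // immediately follows the cell whose text equals the label, or null
    // when there is no such cell. Assumes the label contains no apostrophe,
    // since it is placed inside a single-quoted XPath string literal.
    public static string ValueAfterLabel(HtmlDocument document, string label)
    {
        HtmlNode cell = document.DocumentNode.SelectSingleNode(
            "//table[@class='integrationteamstats']" +
            "//td[normalize-space(.)='" + label + "']" +
            "/following-sibling::td[1]");

        return cell == null ? null : HtmlEntity.DeEntitize(cell.InnerText).Trim();
    }

    public static void Main()
    {
        var document = new HtmlDocument();
        document.LoadHtml(SampleHtml);

        Console.WriteLine("Queue: " + ValueAfterLabel(document, "Queue:"));
        Console.WriteLine("Wait:  " + ValueAfterLabel(document, "Wait:"));
    }
}
```

Because the lookup keys on the label text, it keeps working if the cells are reordered, at the cost of breaking if the labels themselves are renamed.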


Popular Answer

You could also use CsQuery to do this. Since it uses familiar CSS selector syntax and jQuery methods, it can be easier to use than HAP for more complex DOM navigation. For example:

// Gets the text from the cell AFTER the one containing 'text'.
string getNextCellText(CQ dom, string text) {
    // Find the target cell; :contains matches any cell whose text includes 'text'.
    CQ target = dom.Select(".integrationteamstats td:contains(" + text + ")");

    // Return the text contents of the next sibling cell.
    return target.Next().Text();
}

void Main() {
    // 'html' is assumed to be a string holding the page markup.
    var dom = CQ.Create(html);
    string queue = getNextCellText(dom, "Queue:");
    string wait = getNextCellText(dom, "Wait:");

    // ... do stuff
}



Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow