Get specific Tables with Html Agility Pack

c# html-agility-pack xpath

Question

I'm having trouble using HTML Agility Pack to select specific tables. I can't change the HTML itself, so I can't add IDs, classes, or anything else to target.

Could someone please demonstrate how to access each of the following tables separately?

<table class="newTable">
      //table 1 contents
    <table border="0" cellpadding="3" cellspacing="2" width="100%">
         //table 1 - A contents
    </table>
</table>
<table border="0" cellpadding="0" cellspacing="0" class="newTable">
     //table 2 contents
    <table width="100%" border="0" cellspacing="2" cellpadding="0">
        //table 2 - A contents
    </table>
    <table width="100%" border="0" cellspacing="2" cellpadding="0">
       //table 2 - B contents
    </table>
    <table width="100%" cellspacing="2" cellpadding="0">
       //table 2 - C contents
    </table>
</table>
<table>
     //table 3 contents
</table>

If I make the following call right now

HtmlNode table = doc.DocumentNode.SelectSingleNode("//table");
foreach (var cell in table.SelectNodes("//tr/td"))
{
     string someVariable = cell.InnerText;
}

it goes through every cell in the document. I want to be able to access the tables individually so that I can correlate each one with where I store its data.

I've tried something like

doc.DocumentNode.SelectNodes("//table[1]");

However, when I try to specify a table by index, it still reads in either all of the tables or none of them, so it doesn't appear to work.

The same is true for this; it returns either everything or nothing.

foreach (var cell in table.SelectNodes("//table").Skip(some_number))
{
     string someVariable = cell.InnerText;
}

I'm using HTML Agility Pack 1.4.9 from the NuGet package.

EDIT:

I tried to get just the contents of Table 1 - A. Both of the following throw either an encoding-found exception or a null reference exception.

HtmlNode table = doc.DocumentNode.SelectSingleNode("//table/tr/td/table[1]");

HtmlNode table = doc.DocumentNode.SelectSingleNode("//table[1]/tr/td/table[1]");

10/1/2014 5:56:53 PM

Accepted Answer

Your second call goes wrong because the leading "//" in "//tr/td" sends the query back to the root of the document, even though you call it on the table node. You can fix the first half of your problem by using your indexer, and the second by stating that you want to search from your current location:

HtmlNode table = doc.DocumentNode.SelectSingleNode("//table[1]");
foreach (var cell in table.SelectNodes(".//tr/td")) // note the leading "."
{
     string someVariable = cell.InnerText;
}
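
To make the difference between "//tr/td" and ".//tr/td" concrete, here is a minimal sketch. The two-table markup and the names RelativeXPathDemo, FirstTable, and Count are made up for illustration and are not the asker's real page; only LoadHtml, SelectSingleNode, and SelectNodes come from HTML Agility Pack.

```csharp
using System;
using HtmlAgilityPack;

public static class RelativeXPathDemo
{
    // Hypothetical two-table sample, not the asker's real markup.
    public const string Html =
        "<html><body>" +
        "<table><tr><td>a</td></tr></table>" +
        "<table><tr><td>b</td><td>c</td></tr></table>" +
        "</body></html>";

    // Parses the sample and returns the first table in the document.
    public static HtmlNode FirstTable()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(Html);
        return doc.DocumentNode.SelectSingleNode("//table");
    }

    // SelectNodes returns null when nothing matches, so treat null as zero.
    public static int Count(HtmlNode context, string xpath)
    {
        var nodes = context.SelectNodes(xpath);
        return nodes == null ? 0 : nodes.Count;
    }

    public static void Main()
    {
        HtmlNode table = FirstTable();
        // Absolute path: starts at the document root, so all cells match (3).
        Console.WriteLine(Count(table, "//tr/td"));
        // Relative path: starts at this table, so only its own cell matches (1).
        Console.WriteLine(Count(table, ".//tr/td"));
    }
}
```

The null check matters in practice: when an XPath matches nothing, SelectNodes gives back null rather than an empty collection, which is one common source of null reference exceptions in a foreach.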

I'm not sure what else is going on, but the following works in my test once rows and cells are added to your sample tables. That may mean you need to provide a bit more context.

The document I used for the tests is as follows:

<!DOCTYPE html>

<html lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head>
    <meta charset="utf-8" />
    <title></title>
</head>
<body>
    <table class="newTable">
        <tr>
            <td>
                <table border="0" cellpadding="3" cellspacing="2" width="100%">
                    <tr><td>
                        //table 1 - A contents
                    </td></tr>
                </table>
            </td>
        </tr>

    </table>
    <table border="0" cellpadding="0" cellspacing="0" class="newTable">
        <tr>
            <td>
                //table 2 contents
                <table width="100%" border="0" cellspacing="2" cellpadding="0">
                    <tr>
                        <td>
                            //table 2 - A contents
                        </td>
                    </tr>
                </table>
                <table width="100%" border="0" cellspacing="2" cellpadding="0">
                    <tr>
                        <td>
                            //table 2 - B contents
                        </td>
                    </tr>
                </table>
                <table width="100%" cellspacing="2" cellpadding="0">
                    <tr>
                        <td>
                            //table 2 - C contents
                        </td>
                    </tr>
                </table>
            </td>
        </tr>
    </table>
    <table>
        <tr>
            <td>
                //table 3 contents
            </td>
        </tr>
    </table>
</body>
</html>

The code to extract the values you're looking for is as follows:

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(text);

var node1A = doc.DocumentNode.SelectSingleNode("//table[1]//table[1]");
string content1A = node1A.InnerText;
Console.WriteLine(content1A);

var node2C = doc.DocumentNode.SelectSingleNode("//table[2]//table[3]");
string content2C = node2C.InnerText;
Console.WriteLine(content2C);
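
One caveat about indexed expressions like these, which may explain why an index sometimes seems to match too much: in XPath, "//table[1]" selects every table that is the first table child of its own parent, while "(//table)[1]" selects the first table in the whole document. The sketch below illustrates that on a small hypothetical document (the markup and the name PositionalXPathDemo are assumptions for the example, not the asker's page).

```csharp
using System;
using HtmlAgilityPack;

public static class PositionalXPathDemo
{
    // Hypothetical sample: two outer tables, with one and two nested tables.
    public const string Html =
        "<html><body>" +
        "<table><tr><td>T1" +
          "<table><tr><td>T1A</td></tr></table>" +
        "</td></tr></table>" +
        "<table><tr><td>T2" +
          "<table><tr><td>T2A</td></tr></table>" +
          "<table><tr><td>T2B</td></tr></table>" +
        "</td></tr></table>" +
        "</body></html>";

    public static HtmlDocument Load()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(Html);
        return doc;
    }

    public static void Main()
    {
        HtmlDocument doc = Load();

        // The predicate applies per parent: outer T1, T1A, and T2A all
        // qualify as "first table child of their parent", so this prints 3.
        var perParent = doc.DocumentNode.SelectNodes("//table[1]");
        Console.WriteLine(perParent.Count);

        // Parentheses apply the index to the whole result in document order:
        // the second table overall is the nested T1A.
        var second = doc.DocumentNode.SelectSingleNode("(//table)[2]");
        Console.WriteLine(second.InnerText.Trim());
    }
}
```

SelectSingleNode hides this because it returns only the first match, but SelectNodes("//table[1]") can return several tables when tables are nested.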


Update

Okay, I tried your real HTML, and I got a NullReferenceException too. I'm not sure why, but something in it must seriously confuse the Agility Pack. Some testing with the LINQ API does seem to work, though, so I hope that is still an option for you:

var table = doc.DocumentNode.DescendantsAndSelf("table").Skip(1).First().Descendants("table").First();
var tds   = table.Descendants("td");
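
Keep in mind that Descendants("table") walks the tree in document order, so nested tables are counted alongside outer ones when you Skip. If you would rather address tables by nesting level, something like the following sketch might help. TableLinq, TopLevelTables, and NestedTables are hypothetical helper names, and the markup is an invented sample; only Descendants and Ancestors are Agility Pack calls.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class TableLinq
{
    // Hypothetical sample: two outer tables, with one and two nested tables.
    public const string Html =
        "<html><body>" +
        "<table><tr><td>T1" +
          "<table><tr><td>T1A</td></tr></table>" +
        "</td></tr></table>" +
        "<table><tr><td>T2" +
          "<table><tr><td>T2A</td></tr></table>" +
          "<table><tr><td>T2B</td></tr></table>" +
        "</td></tr></table>" +
        "</body></html>";

    // Tables that are not contained in any other table.
    public static List<HtmlNode> TopLevelTables(HtmlDocument doc)
    {
        return doc.DocumentNode.Descendants("table")
                  .Where(t => !t.Ancestors("table").Any())
                  .ToList();
    }

    // Tables whose nearest enclosing table is the given outer table.
    public static List<HtmlNode> NestedTables(HtmlNode outer)
    {
        return outer.Descendants("table")
                    .Where(t => ReferenceEquals(t.Ancestors("table").First(), outer))
                    .ToList();
    }

    public static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(Html);

        List<HtmlNode> top = TopLevelTables(doc);
        List<HtmlNode> inner = NestedTables(top[1]);   // tables inside outer table 2
        Console.WriteLine(inner[1].InnerText.Trim());  // second nested table: T2B
    }
}
```

Because this avoids XPath entirely, it may sidestep whatever was tripping up SelectSingleNode on the real page, though that is untested against the asker's HTML.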
10/2/2014 5:49:40 PM


Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow