Using HtmlAgilityPack to get a specific row and column data

c# html-agility-pack xml-parsing xpath

Question

This is my table

<table class="DataRows" frame="myFrames" rules="Standard" width="100%">

  <colgroup><col width="70" align="CENTER">
  <col width="200" align="LEFT">
  <col width="80" align="LEFT">
  <col align="LEFT">
  <col align="RIGHT">

  </colgroup><thead>

  <col width="70" align="CENTER">
  <col width="200" align="LEFT">
  <col width="80" align="LEFT">
  <col align="LEFT">
  <col align="RIGHT">

  <thead>

  <tr>
    <td valign="TOP"><span class="classicBold"> 20 </span> Kg.
    <td class="BOLD" valign="TOP" nowrap="">
      PA Passion Foods Inc.
    <td class="BOLD">Fax:
    <td>
      222-555666
    <td class="BOLD">
      Processed foods and juices

  <tr>
    <td><a target="_blank" href="">See on Map </a>
    <td>
      120 NW 157TH AVE 
    <td class="BOLD">Warehouse Hours:
    <td colspan="2">


  <tr>
    <td>
    <td><span class="BOLD">
      Jacksonville,
      </span>
      FL 300000
    <td class="BOLD">Url:
    <td colspan="2">
      <a target="_blank" href="">PA Passion</a>
      &nbsp&nbsp
      <span class="BOLD">E-mail:</span>
      zoro@xyz.com

  <tr>
    <td>
    <td class="REDBOLD" colspan="4">


  <tr>
    <td>
    <td colspan="4" align="LEFT">Franchisee for:<span class="BOLD">
 Nutrella


</span>
  <tr>
    <td>
    <td colspan="4" align="LEFT">Franchisee for:<span class="BOLD">
APPLE Foods, Constants
</span>
  <tr>
    <td>
    <td colspan="4" align="LEFT"><span class="BOLD">

</span>

  <tr>
    <td>
    <td colspan="4" align="LEFT">We service:<span class="BOLD">
All occasions and hospitality services
</span>

  <tr>
    <td>
    <td colspan="4" align="LEFT">We sell :<span class="BOLD">
----
</span>

</td></td></tr></td></td></tr></td></td></tr></td></td></tr></td></td></tr></td></td></tr></td></td></td></td></tr></td></td></td></td></tr></td></td></td></td></td></tr>
  </thead>
</table>

I am looping through the matching table nodes in my HTML document using the code below:

foreach (HtmlNode node in htmlAgilityPackDoc.DocumentNode.SelectNodes("//table[contains(@class,'DataRows')]"))
{

}

When I use the following

node.SelectSingleNode(".//tr[1]/td[1]").InnerHtml

I get the following HTML:

<span class="classicBold"> 20 </span> Kg.
        <td class="BOLD" valign="TOP" nowrap="">
          PA Passion Foods Inc.
        <td class="BOLD">Fax:
        <td>
          222-555666
        <td class="BOLD">
          Processed foods and juices

      <tr>
        <td><a target="_blank" href="">See on Map </a>
        <td>
          120 NW 157TH AVE 
        <td class="BOLD">Warehouse Hours:
        <td colspan="2">


      <tr>
        <td>
        <td><span class="BOLD">
          Jacksonville,
          </span>
          FL 300000
        <td class="BOLD">Url:
        <td colspan="2">
          <a target="_blank" href="">PA Passion</a>
          &nbsp&nbsp
          <span class="BOLD">E-mail:</span>
          zoro@xyz.com

      <tr>
        <td>
        <td class="REDBOLD" colspan="4">


      <tr>
        <td>
        <td colspan="4" align="LEFT">Franchisee for:<span class="BOLD">
     Nutrella


    </span>
      <tr>
        <td>
        <td colspan="4" align="LEFT">Franchisee for:<span class="BOLD">
    APPLE Foods, Constants
    </span>
      <tr>
        <td>
        <td colspan="4" align="LEFT"><span class="BOLD">

    </span>

      <tr>
        <td>
        <td colspan="4" align="LEFT">We service:<span class="BOLD">
    All occasions and hospitality services
    </span>

      <tr>
        <td>
        <td colspan="4" align="LEFT">We sell :<span class="BOLD">
    ----
    </span>

    </td></td></tr></td></td></tr></td></td></tr></td></td></tr></td></td></tr></td></td></tr></td></td></td></td></tr></td></td></td></td></tr></td></td></td></td></td>

How do I extract the address 120 NW 157TH AVE from this?

When I tried using

node.SelectSingleNode(".//td[@class='BOLD'][4]/preceding-sibling::td").InnerText;

I get an error:

Object reference not set to an instance of an object

Accepted Answer

Your HTML is malformed: the td and tr tags are not closed where they should be, so the parser nests them inside one another. I suggest using text nodes as your identifiers rather than indices, for example:

.//td[./a[contains(text(),'See on Map')]]/td/text() 

to get

120 NW 157TH AVE

Here is a full example that extracts every field:

    var table = doc.DocumentNode.SelectSingleNode("//table[contains(@class,'DataRows')]");

    // Each lookup anchors on nearby label text instead of a row or column index.
    var name = table.SelectSingleNode(".//td[@class='BOLD']/text()").InnerText.Trim();
    var fax = table.SelectSingleNode(".//td[contains(text(),'Fax')]/td/text()").InnerText.Trim();
    var email = table.SelectSingleNode(".//span[contains(text(),'E-mail')]/following-sibling::text()").InnerText.Trim();
    var address = table.SelectSingleNode(".//td[./a[contains(text(),'See on Map')]]/td/text()").InnerText.Trim();
    // Trim the surrounding whitespace first, then drop the trailing comma after the city name.
    var city = table.SelectSingleNode(".//tr[./td/a[contains(text(),'See on Map')]]//tr/td/td/span").InnerText.Trim().TrimEnd(',');
    var zip = table.SelectSingleNode(".//tr[./td/a[contains(text(),'See on Map')]]//tr/td/td/span/following-sibling::text()").InnerText.Trim();

Note that because your HTML is so messy, the XPath expressions have to be just as messy. Accessing a tr element by index won't work, because each tr ends up nested inside the previous one (within its unclosed td elements). What would be .//tr[4] in a well-formed table is closer to .//tr//tr//tr//tr in yours.
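
On the "Object reference not set to an instance of an object" error from the question: SelectSingleNode returns null when the XPath matches nothing, so reading .InnerText straight off the result throws. Below is a minimal sketch of a null-safe lookup. TextOrNull is a hypothetical helper name, not part of HtmlAgilityPack, and the program assumes the page's HTML has been saved to a file whose path is passed as the first argument.

```csharp
using System;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

static class TableText
{
    // Hypothetical helper (not part of HtmlAgilityPack): returns the trimmed,
    // entity-decoded text of the first node matching xpath, or null on no match.
    public static string TextOrNull(HtmlNode root, string xpath)
    {
        if (root == null) return null;

        // SelectSingleNode returns null when nothing matches, so calling
        // .InnerText on the result directly would throw NullReferenceException.
        HtmlNode match = root.SelectSingleNode(xpath);
        return match == null ? null : HtmlEntity.DeEntitize(match.InnerText).Trim();
    }

    static void Main(string[] args)
    {
        if (args.Length == 0)
        {
            Console.WriteLine("usage: pass the path to a saved HTML file");
            return;
        }

        // Assumption: args[0] points at a file holding the page's HTML.
        var doc = new HtmlDocument();
        doc.Load(args[0]);

        HtmlNode table = doc.DocumentNode.SelectSingleNode("//table[contains(@class,'DataRows')]");
        string address = TextOrNull(table, ".//td[./a[contains(text(),'See on Map')]]/td/text()");

        Console.WriteLine(address ?? "address not found; check the XPath against the parsed tree");
    }
}
```

A null result here means the XPath does not match the tree that HtmlAgilityPack actually built, which is a cue to inspect the parsed nesting rather than the raw markup.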




Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow
Is this KB legal? Yes, learn why