This is my table:
<table class="DataRows" frame="myFrames" rules="Standard" width="100%">
<colgroup><col width="70" align="CENTER">
<col width="200" align="LEFT">
<col width="80" align="LEFT">
<col align="LEFT">
<col align="RIGHT">
</colgroup><thead>
<col width="70" align="CENTER">
<col width="200" align="LEFT">
<col width="80" align="LEFT">
<col align="LEFT">
<col align="RIGHT">
<thead>
<tr>
<td valign="TOP"><span class="classicBold"> 20 </span> Kg.
<td class="BOLD" valign="TOP" nowrap="">
PA Passion Foods Inc.
<td class="BOLD">Fax:
<td>
222-555666
<td class="BOLD">
Processed foods and juices
<tr>
<td><a target="_blank" href="">See on Map </a>
<td>
120 NW 157TH AVE
<td class="BOLD">Warehouse Hours:
<td colspan="2">
<tr>
<td>
<td><span class="BOLD">
Jacksonville,
</span>
FL 300000
<td class="BOLD">Url:
<td colspan="2">
<a target="_blank" href="">PA Passion</a>
  
<span class="BOLD">E-mail:</span>
zoro@xyz.com
<tr>
<td>
<td class="REDBOLD" colspan="4">
<tr>
<td>
<td colspan="4" align="LEFT">Franchisee for:<span class="BOLD">
Nutrella
</span>
<tr>
<td>
<td colspan="4" align="LEFT">Franchisee for:<span class="BOLD">
APPLE Foods, Constants
</span>
<tr>
<td>
<td colspan="4" align="LEFT"><span class="BOLD">
</span>
<tr>
<td>
<td colspan="4" align="LEFT">We service:<span class="BOLD">
All occasions and hospitality services
</span>
<tr>
<td>
<td colspan="4" align="LEFT">We sell :<span class="BOLD">
----
</span>
</td></td></tr></td></td></tr></td></td></tr></td></td></tr></td></td></tr></td></td></tr></td></td></td></td></tr></td></td></td></td></tr></td></td></td></td></td></tr>
</thead>
</table>
I am looping through each matching table in my HTML document using the code below:
foreach (HtmlNode node in htmlAgilityPackDoc.DocumentNode.SelectNodes("//table[contains(@class,'DataRows')]"))
{
}
When I use the following:
node.SelectSingleNode(".//tr[1]/td[1]").InnerHtml
I get the following HTML:
<span class="classicBold"> 20 </span> Kg.
<td class="BOLD" valign="TOP" nowrap="">
PA Passion Foods Inc.
<td class="BOLD">Fax:
<td>
222-555666
<td class="BOLD">
Processed foods and juices
<tr>
<td><a target="_blank" href="">See on Map </a>
<td>
120 NW 157TH AVE
<td class="BOLD">Warehouse Hours:
<td colspan="2">
<tr>
<td>
<td><span class="BOLD">
Jacksonville,
</span>
FL 300000
<td class="BOLD">Url:
<td colspan="2">
<a target="_blank" href="">PA Passion</a>
  
<span class="BOLD">E-mail:</span>
zoro@xyz.com
<tr>
<td>
<td class="REDBOLD" colspan="4">
<tr>
<td>
<td colspan="4" align="LEFT">Franchisee for:<span class="BOLD">
Nutrella
</span>
<tr>
<td>
<td colspan="4" align="LEFT">Franchisee for:<span class="BOLD">
APPLE Foods, Constants
</span>
<tr>
<td>
<td colspan="4" align="LEFT"><span class="BOLD">
</span>
<tr>
<td>
<td colspan="4" align="LEFT">We service:<span class="BOLD">
All occasions and hospitality services
</span>
<tr>
<td>
<td colspan="4" align="LEFT">We sell :<span class="BOLD">
----
</span>
</td></td></tr></td></td></tr></td></td></tr></td></td></tr></td></td></tr></td></td></tr></td></td></td></td></tr></td></td></td></td></tr></td></td></td></td></td>
How do I extract the address 120 NW 157TH AVE from this?
When I tried using:
node.SelectSingleNode(".//td[@class='BOLD'][4]/preceding-sibling::td").InnerText;
I get an error:
Object reference not set to an instance of an object
Your HTML is malformed: the td and tr tags are never closed where they should be, so the parser nests each element inside the previous one. I suggest using text content as your anchor rather than indices. For example:
.//td[./a[contains(text(),'See on Map')]]/td/text()
to get
120 NW 157TH AVE
Here is a full example that extracts every field:
// Each SelectSingleNode call returns null if its XPath matches nothing, so add null checks for pages that may lack a field.
var table = doc.DocumentNode.SelectSingleNode("//table[contains(@class,'DataRows')]");
// The first td with class BOLD holds the company name.
var name = table.SelectSingleNode(".//td[@class='BOLD']/text()").InnerText.Trim();
// The value cell is nested inside the "Fax:" label cell, hence /td rather than a sibling axis.
var fax = table.SelectSingleNode(".//td[contains(text(),'Fax')]/td/text()").InnerText.Trim();
var email = table.SelectSingleNode(".//span[contains(text(),'E-mail')]/following-sibling::text()").InnerText.Trim();
var address = table.SelectSingleNode(".//td[./a[contains(text(),'See on Map')]]/td/text()").InnerText.Trim();
// Trim whitespace first; Trim(',') alone leaves the comma when the span text ends with a newline.
var city = table.SelectSingleNode(".//tr[./td/a[contains(text(),'See on Map')]]//tr/td/td/span").InnerText.Trim().TrimEnd(',');
var zip = table.SelectSingleNode(".//tr[./td/a[contains(text(),'See on Map')]]//tr/td/td/span/following-sibling::text()").InnerText.Trim();
Note that because your HTML is so messy, the XPaths have to be just as messy. Accessing a tr element by index won't work, because every tr element is nested inside the last td of the previous tr. What would be .//tr[4] in a normal table is .//tr//tr//tr//tr in your table.
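As for the "Object reference not set to an instance of an object" error in the question: SelectSingleNode returns null when the XPath matches nothing, so calling .InnerText on the result throws. Below is a minimal sketch of a null-safe wrapper. The names XPathText and TextOrNull are hypothetical, and the sample HTML is a simplified, well-formed stand-in for your table (so it uses following-sibling::td, which would not match in your nested markup).

```csharp
using System;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

static class XPathText
{
    // Hypothetical helper: returns the trimmed, entity-decoded text of the
    // first node matched by xpath, or null when the XPath matches nothing.
    public static string TextOrNull(HtmlNode context, string xpath)
    {
        HtmlNode match = context.SelectSingleNode(xpath); // null on no match
        return match == null ? null : HtmlEntity.DeEntitize(match.InnerText).Trim();
    }
}

class Program
{
    static void Main()
    {
        // Simplified, well-formed sample (an assumption, not the original markup).
        var doc = new HtmlDocument();
        doc.LoadHtml(
            "<table class='DataRows'><tr>" +
            "<td><a href=''>See on Map </a></td>" +
            "<td> 120 NW 157TH AVE </td>" +
            "</tr></table>");

        HtmlNode table = doc.DocumentNode.SelectSingleNode("//table[contains(@class,'DataRows')]");

        // Matches the cell next to the "See on Map" link.
        string address = XPathText.TextOrNull(
            table, ".//td[a[contains(text(),'See on Map')]]/following-sibling::td");

        // Matches nothing, so the helper returns null instead of throwing.
        string missing = XPathText.TextOrNull(table, ".//td[@class='NoSuchClass']");

        Console.WriteLine(address ?? "(no match)");
        Console.WriteLine(missing ?? "(no match)");
    }
}
```

With a wrapper like this, a wrong or outdated XPath shows up as a null value you can log, rather than as an exception in the middle of the loop.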