I'm trying to find a table in an HTML document whose first two rows each contain three columns with text.
I have been experimenting with the following query, which I want to return the table node whose first two rows contain text in the first column:
string xpath = @"//table//table[//tr[1]//td[1]//*[contains(text(), *)] and //tr[2]//td[1]//*[contains(text(), *)]]";
HtmlNode temp = doc.DocumentNode.SelectSingleNode(xpath);
It doesn't work properly.
Here is some sample HTML, which is the table I'm trying to match:
<table width="100%" cellpadding="0" border="0">
<tbody>
<tr>
<td width="27%" valign="center"><b><font size="1" face="Helvetica">SOME TEXT<br></font></b></td>
<td width="1%"></td>
<td width="9%" valign="center"><font size="1" face="Helvetica">SOME TEXT<br></font></td>
<td width="1%"></td>
<td width="25%" valign="center"><font size="1" face="Helvetica">SOME TEXT<br></font></td>
<td width="37%"></td>
</tr>
<tr>
<td valign="center"><font size="1" face="Helvetica">SOME TEXT<br></font></td>
<td></td>
<td valign="center"><font size="1" face="Helvetica">1<br></font></td>
<td></td>
<td valign="center"><font size="1" face="Helvetica">SOME TEXT<br></font></td>
<td></td>
</tr>
</tbody>
</table>
Notice that columns 1, 3, and 5 contain text in the first two rows. That's what I'm trying to match.
//table//table[//tr[1]//td[1]//*[contains(text(), *)] and //tr[2]//td[1]//*[contains(text(), *)]]
There are several problems with this XPath expression:
//table//table selects any table that is a descendant of another table. However, the provided HTML sample contains no nested tables.
In table[//tr[1]//td[1]//*[contains(text(), *)] ..., the //tr inside the predicate is an absolute XPath expression. It selects all tr elements in the whole document, not only those in the subtree rooted at this table element. Most probably you want .//tr instead of //tr.
//td[1] selects every td element that is the first td child of its parent, but most probably you want only the first descendant td element. If so, you need an expression of the form (//td)[1].
//*[contains(text(), *)] selects any element whose first text-node child contains the string value of its first element child. You simply want to verify that a td has a descendant text node, which can be expressed correctly as td[.//text()].
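The differences described above can be made visible with a small experiment. This is only a sketch: it assumes the HtmlAgilityPack NuGet package from the question, and the two-table document and variable names are invented for illustration.

```csharp
using System;
using HtmlAgilityPack; // assumption: HtmlAgilityPack NuGet package, as in the question

// Hypothetical document: table 'a' has text in its first cell, table 'b' does not.
string html =
    "<html><body>" +
    "<table id='a'><tr><td>A1</td><td>A2</td></tr></table>" +
    "<table id='b'><tr><td></td><td>B2</td></tr></table>" +
    "</body></html>";

var doc = new HtmlDocument();
doc.LoadHtml(html);
HtmlNode root = doc.DocumentNode;

// //td[1]: every td that is the first td child of its parent (one per row here).
int firstChildTds = root.SelectNodes("//td[1]").Count;

// (//td)[1]: only the first td of the whole document.
int firstTdOverall = root.SelectNodes("(//td)[1]").Count;

// Absolute //tr in the predicate is evaluated against the whole document,
// so table 'b' is selected too, because table 'a' satisfies the test.
int absolute = root.SelectNodes("//table[//tr[1]/td[1][.//text()]]").Count;

// Relative .//tr only looks inside the table currently being tested.
int relative = root.SelectNodes("//table[.//tr[1]/td[1][.//text()]]").Count;

Console.WriteLine($"{firstChildTds} {firstTdOverall} {absolute} {relative}");
```

With this input the four counts should come out as 2, 1, 2 and 1: the absolute predicate accepts both tables, while the relative one accepts only table 'a'.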
Combining the corrections of all these issues, what you probably want is something like:
//table
[(.//tr)[1]/td[1][.//text()]
and
(.//tr)[2]/td[1][.//text()]
]
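As a rough check, the corrected expression can be run from C# the same way as in the question. The sketch below assumes HtmlAgilityPack and uses a shortened version of the sample table (attributes trimmed, only the first three cells of each row kept); it is not meant as a complete program.

```csharp
using System;
using HtmlAgilityPack; // assumption: HtmlAgilityPack NuGet package, as in the question

// Shortened version of the sample table from the question.
string html = @"
<table width='100%' cellpadding='0' border='0'>
<tbody>
<tr>
<td><b><font size='1' face='Helvetica'>SOME TEXT<br></font></b></td>
<td></td>
<td><font size='1' face='Helvetica'>SOME TEXT<br></font></td>
</tr>
<tr>
<td><font size='1' face='Helvetica'>SOME TEXT<br></font></td>
<td></td>
<td><font size='1' face='Helvetica'>1<br></font></td>
</tr>
</tbody>
</table>";

var doc = new HtmlDocument();
doc.LoadHtml(html);

// The corrected expression: the first two rows of the table must each
// have a first cell that contains a descendant text node.
string xpath = "//table[(.//tr)[1]/td[1][.//text()] and (.//tr)[2]/td[1][.//text()]]";

// SelectSingleNode returns null when nothing matches.
HtmlNode table = doc.DocumentNode.SelectSingleNode(xpath);
Console.WriteLine(table != null ? "table matched" : "no match");
```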
Alternatively, one could write an equivalent but more understandable and less error-prone expression like this:
//table
[descendant::tr[1]/td[1][descendant::text()]
and
descendant::tr[2]/td[1][descendant::text()]
]
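The expressions above test only the first column. If the goal is the one stated in the question, text in columns 1, 3 and 5 of both rows, the same pattern could be extended along the following lines. This is a sketch under assumptions: it uses HtmlAgilityPack, the helper RowHasText and the two test tables are made up for illustration, and it swaps .//text() for normalize-space() so that cells containing only whitespace do not count as text.

```csharp
using System;
using HtmlAgilityPack; // assumption: HtmlAgilityPack NuGet package, as in the question

// Hypothetical helper: builds the test for one row, requiring
// non-whitespace text in cells 1, 3 and 5.
string RowHasText(int row) =>
    $"(.//tr)[{row}][td[1][normalize-space()] and td[3][normalize-space()] and td[5][normalize-space()]]";

string xpath = $"//table[{RowHasText(1)} and {RowHasText(2)}]";

// Two invented tables: only the first has text in columns 1, 3 and 5.
string html = @"
<table id='match'>
<tr><td>SOME TEXT</td><td></td><td>SOME TEXT</td><td></td><td>SOME TEXT</td><td></td></tr>
<tr><td>SOME TEXT</td><td></td><td>1</td><td></td><td>SOME TEXT</td><td></td></tr>
</table>
<table id='other'>
<tr><td>SOME TEXT</td><td></td><td></td><td></td><td></td><td></td></tr>
<tr><td>SOME TEXT</td><td></td><td></td><td></td><td></td><td></td></tr>
</table>";

var doc = new HtmlDocument();
doc.LoadHtml(html);

// SelectNodes returns null when nothing matches, so guard before use.
HtmlNodeCollection matches = doc.DocumentNode.SelectNodes(xpath);
int count = matches == null ? 0 : matches.Count;
string firstId = count > 0 ? matches[0].GetAttributeValue("id", "") : "";
Console.WriteLine($"{count} {firstId}");
```

Here only the table with id 'match' should be selected; the second table would still pass the first-column-only expressions shown earlier.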