Selecting all links inside an HTML table with XPath (and HtmlAgilityPack)

c# html-agility-pack xpath

Question

I want to extract all links whose href attribute begins with http://, https://, or /. These links are inside a table with a specific class (tbody > tr > td, etc.). I thought I could specify just the a element without spelling out the whole path, but that does not seem to work. A NullReferenceException is thrown at the line where the links are selected:

var table = doc.DocumentNode.SelectSingleNode("//table[@class='containerTable']");
if (table != null)
{
    foreach (HtmlNode item in table.SelectNodes("a[starts-with(@href, 'https://')]"))
    {
        // not working: the NullReferenceException is thrown on the foreach line
    }
}

I'm not aware of any guidelines or best practices for XPath. Does running two queries against the document add overhead?

3/20/2010 10:11:18 PM

Accepted Answer

Use:

 //tbody/descendant::a[starts-with(@href, 'https://')
                       or starts-with(@href, 'http://')
                       or starts-with(@href, './')]

You will still have problems unless you change your code to reflect the fact that the instance method XmlNode.SelectNodes() has a return type of XmlNodeList, not HtmlNode.
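
For HtmlAgilityPack specifically (rather than System.Xml), the situation is slightly different: HtmlNode.SelectNodes returns an HtmlNodeCollection, and by default it returns null instead of an empty collection when nothing matches, which is the likely source of the NullReferenceException in the question. Below is a minimal sketch of a null-safe version, not a definitive implementation. The LinkExtractor class, the GetTableLinks method and the sample HTML are hypothetical names made up for illustration, and the '/' prefix follows the question rather than the './' used above.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class LinkExtractor
{
    // Hypothetical helper: returns the href of every matching anchor inside the table.
    public static List<string> GetTableLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var hrefs = new List<string>();
        HtmlNode table = doc.DocumentNode.SelectSingleNode("//table[@class='containerTable']");
        if (table == null)
            return hrefs;

        // ".//a" searches every descendant of the table, at any depth.
        HtmlNodeCollection links = table.SelectNodes(
            ".//a[starts-with(@href, 'https://')" +
            " or starts-with(@href, 'http://')" +
            " or starts-with(@href, '/')]");

        // By default SelectNodes returns null (not an empty collection) when nothing matches.
        if (links == null)
            return hrefs;

        foreach (HtmlNode link in links)
            hrefs.Add(link.GetAttributeValue("href", ""));

        return hrefs;
    }

    public static void Main()
    {
        // Assumed sample input: two links that should match and one mailto link that should not.
        const string html =
            "<table class='containerTable'><tbody><tr>" +
            "<td><a href='https://example.com/a'>A</a></td>" +
            "<td><a href='/local'>B</a></td>" +
            "<td><a href='mailto:x@example.com'>C</a></td>" +
            "</tr></tbody></table>";

        foreach (string href in GetTableLinks(html))
            Console.WriteLine(href);
    }
}
```

Combining the three prefixes into one predicate keeps this to a single query on the table node, so the only other lookup is the one that finds the table itself.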

3/22/2010 10:13:33 AM

Popular Answer

The problem is that after selecting the table, you try to select the anchors as if they were its direct children. There are several tr and td tags in between.

So if you change your XPath to the following, it should work:

"tbody/tr/td/a[starts-with(@href, 'https://')]"

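That explicit path will not work if your anchors are wrapped in anything else. Alternatively, you can select all of the anchors beneath the current node (i.e. the table). Note the leading dot: without it, //a searches the whole document rather than just the table:

".//a[starts-with(@href, 'https://')]"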

For more information on xpath syntax, see this.
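
To make the difference between these path forms concrete, here is a small sketch that counts the matches for each one when evaluated from the table node. The XPathContextDemo class, the Count helper and the sample HTML are hypothetical and exist only for illustration; the sketch assumes HtmlAgilityPack's default behaviour of returning null when nothing matches.

```csharp
using System;
using HtmlAgilityPack;

public static class XPathContextDemo
{
    // Hypothetical helper: counts matches, treating a null result as zero.
    public static int Count(HtmlNode context, string xpath)
    {
        HtmlNodeCollection nodes = context.SelectNodes(xpath);
        return nodes == null ? 0 : nodes.Count;
    }

    public static void Main()
    {
        var doc = new HtmlDocument();
        // Assumed sample input: one anchor outside the table, one inside it.
        doc.LoadHtml(
            "<div><a href='https://example.com/outside'>outside</a></div>" +
            "<table class='containerTable'><tbody><tr>" +
            "<td><a href='https://example.com/inside'>inside</a></td>" +
            "</tr></tbody></table>");

        HtmlNode table = doc.DocumentNode.SelectSingleNode("//table[@class='containerTable']");

        Console.WriteLine(Count(table, "a"));             // direct children of the table only
        Console.WriteLine(Count(table, "tbody/tr/td/a")); // explicit path from the table
        Console.WriteLine(Count(table, ".//a"));          // any descendant of the table
        Console.WriteLine(Count(table, "//a"));           // every anchor in the document
    }
}
```

With this input, the bare "a" query should find nothing because the anchor is not a direct child of the table, while the "//a" query should also count the anchor outside the table because a path starting with // is evaluated from the document root.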



Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow