How to select a table which contains a certain keyword - c# - xpath - htmlagilitypack

c# html-agility-pack keyword select xpath

Question

I need to gather information from a product page whose elements have no class or id attributes. I am using HtmlAgilityPack and C# 4.0.

There are many tables in this product page's source. The prices table contains the string " KDV", so I would like to select the table that contains this string. How can I do that?

For example, the XPath below selects all tables:

string srxPathOfCategory = "//table";
var selectedNodes = myDoc.DocumentNode.SelectNodes(srxPathOfCategory);

The code below selects the table, but starting from the outermost table. I need to select the innermost table which contains that given string:

//table[contains(., ' KDV')]

Accepted Answer

"The code below selects the table, but starting from the outermost table. I need to select the innermost table which contains that given string."

Use:

//table
    [not(descendant::table) 
   and 
     .//text()[contains(., ' KDV')]
    ]

This selects any table in the XML document that doesn't have a table descendant and that has a text node descendant containing the string " KDV".

In general the above expression could select many such table elements.

If you want only one of them selected (say the first), use this XPath expression. Note the enclosing parentheses:

   (//table
        [not(descendant::table) 
       and 
         .//text()[contains(., ' KDV')]
        ]
    )[1]
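
As a rough sketch of how the two expressions above might be run through HtmlAgilityPack: the sample HTML, the id attributes, and the class and member names below are invented for illustration (the real page has no ids), and the HtmlAgilityPack package is assumed to be referenced. SelectNodes returns null rather than an empty collection when nothing matches, so the result is checked before use.

```csharp
using System;
using HtmlAgilityPack; // assumes the HtmlAgilityPack package is referenced

public static class InnermostTableDemo
{
    // Hypothetical sample markup; the ids exist only to make the output readable.
    public const string SampleHtml =
        "<table id='outer'><tr><td>" +
        "<table id='prices'><tr><td>100 TL + KDV</td></tr></table>" +
        "</td></tr></table>" +
        "<table id='other'><tr><td>no match here</td></tr></table>";

    // Tables with no nested table that contain a text node with " KDV".
    public const string AllInnermost =
        "//table[not(descendant::table) and .//text()[contains(., ' KDV')]]";

    // Same expression, parenthesized so that [1] picks the first match overall.
    public const string FirstInnermost = "(" + AllInnermost + ")[1]";

    public static HtmlNode LoadRoot(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        return doc.DocumentNode;
    }

    public static void Main()
    {
        HtmlNode root = LoadRoot(SampleHtml);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection all = root.SelectNodes(AllInnermost);
        Console.WriteLine("matches: " + (all == null ? 0 : all.Count));

        // SelectSingleNode also returns null when nothing matches.
        HtmlNode first = root.SelectSingleNode(FirstInnermost);
        Console.WriteLine(first == null ? "none" : first.GetAttributeValue("id", ""));
    }
}
```

In this sample the outer table is excluded because it has a table descendant, and the third table is excluded because none of its text nodes contains " KDV".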

Remember: If you want to select the first someName element in the document, using this (as in the currently accepted answer) is wrong:

//someName[1]

This is the second most frequently asked question in XPath (after the one about how to select elements with unprefixed names in an XML document with a default namespace).

The expression //someName[1] actually selects every someName element in the document that is the first someName child of its parent. Try it.

The reason for this unintuitive behavior is that the XPath [] operator has a higher precedence (priority) than the // pseudo-operator.

The correct expression, which really selects only the first someName element in any XML document (if one exists), is:

(//someName)[1]

Here the parentheses are used to explicitly override the default XPath operator precedence.
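
The difference is easy to check on a small document. The following is a minimal sketch using System.Xml from the .NET base class library; the XML sample and the class and method names are made up for the demonstration.

```csharp
using System;
using System.Xml;

public static class FirstElementDemo
{
    // Hypothetical sample: two parents, each with its own someName children.
    public const string SampleXml =
        "<root>" +
        "<a><someName>1</someName><someName>2</someName></a>" +
        "<b><someName>3</someName></b>" +
        "</root>";

    // Counts the nodes that the given XPath expression selects in the given XML.
    public static int Count(string xml, string xpath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);
        return doc.SelectNodes(xpath).Count;
    }

    public static void Main()
    {
        // [1] binds to the child step: the first someName under EACH parent.
        Console.WriteLine(Count(SampleXml, "//someName[1]"));   // prints 2

        // Parentheses build the whole node set first, then [1] picks one node.
        Console.WriteLine(Count(SampleXml, "(//someName)[1]")); // prints 1
    }
}
```

The first expression matches the elements containing 1 and 3, one per parent; the second matches only the element containing 1.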


Popular Answer

There might be a more efficient way to do it. Anyway, this is the entire code I have used for your case and it works for me:

        // Requires: using System.IO; using System.Net; using HtmlAgilityPack;
        HtmlDocument doc = new HtmlDocument();
        string url = "http://www.pratikev.com/fractalv33/pratikEv/pages/viewProduct.jsp?pInstanceId=3138821";
        // Dispose both the response and the reader once the page has been read.
        using (var response = WebRequest.Create(url).GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            doc.LoadHtml(reader.ReadToEnd());
        }
        /* An earlier version of this XPath had a bug: it lacked the parentheses
           around the //table/.../font part, so [1] did not pick the first match
           in the whole document.
           See Dimitre's answer for an explanation and an alternative /
           more generic / (needless to say) better approach */
        string xpath = "(//table/tr/td/font[contains(.,'KDV')])[1]/ancestor::table[2]";
        // SelectSingleNode returns null when nothing matches.
        HtmlNode table = doc.DocumentNode.SelectSingleNode(xpath);


Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow
Is this KB legal? Yes, learn why