Extract content with XPath?

c# dom html-agility-pack xml xpath

Question

I have HTML content that I am storing as an XML document (using HTML Agility Pack). I know some XPath, but I am not able to zero in on the exact content I need.

In the example below, I am trying to extract the "src" and "alt" values from the large image:

<html>
<body>
   ....
   <div id="large_image_display">
      <img class="photo" src="images/KC0763_l.jpg" alt="Circles t-shirt - Navy" />
   </div>
   ....
   <div id="small_image_display">
      <img class="photo" src="images/KC0763_s.jpg" alt="Circles t-shirt - Navy" />
   </div>
</body>
</html>

What is the XPath to get "images/KC0763_l.jpg" and "Circles t-shirt - Navy"? This is how far I got, but it is wrong (mostly pseudocode at this point):

\\div[@class='large_image_display']\img[1][@class='photo']@src
\\div[@class='large_image_display']\img[1][@class='photo']@alt

Any help in getting this right would be greatly appreciated.

Accepted Answer

The following XPath will get you the src attributes of the img tags:

'//html/body/div/img[@class="photo"]/@src'

And similarly this will get you to the alt attributes:

'//html/body/div/img[@class="photo"]/@alt'

From there you can read the attribute value. To match only the image inside the div whose id is 'large_image_display', filter further like this:

'//html/body/div[@id="large_image_display"]/img[@class="photo"]/@src'    
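
To check an expression like this outside C#, here is a minimal sketch in Python using the standard library's xml.etree.ElementTree. This is a stand-in, not HTML Agility Pack: ElementTree implements only a subset of XPath and has no /@src step, so the sketch selects the img element and reads its attributes afterwards. The "...." placeholders from the sample are dropped so the markup parses as XML, and the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# Sample markup from the question, minus the "...." placeholders.
SAMPLE = """<html>
<body>
   <div id="large_image_display">
      <img class="photo" src="images/KC0763_l.jpg" alt="Circles t-shirt - Navy" />
   </div>
   <div id="small_image_display">
      <img class="photo" src="images/KC0763_s.jpg" alt="Circles t-shirt - Navy" />
   </div>
</body>
</html>"""

root = ET.fromstring(SAMPLE)  # root is the <html> element

# The path is relative to <html>. ElementTree cannot select attributes
# directly, so select the element and read the attributes from it.
large_img = root.find("./body/div[@id='large_image_display']/img[@class='photo']")

large_src = large_img.get("src")
large_alt = large_img.get("alt")
print(large_src)
print(large_alt)
```

In an engine with full XPath support, the trailing /@src or /@alt step does the attribute lookup that .get() performs here.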

Popular Answer

Use the following XPath expressions:

/html/body/div[@id='large_image_display']/img/@src

and

/html/body/div[@id='large_image_display']/img/@alt

Avoid the // abbreviation where you can: it may force the whole (sub)tree to be scanned, which can make evaluation very inefficient.

In this particular case we know that the html element is the top element of the document and we can simply select it by /html -- not //html.
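
The difference between a descendant scan and an explicit path can be seen on the sample document. The following is a minimal sketch in Python's xml.etree.ElementTree, used here as an assumed stand-in for a full XPath engine. Its paths are relative to the root element, so .//img plays the role of //img and ./body/div/img plays the role of /html/body/div/img. It shows only which nodes match, not how the two forms perform.

```python
import xml.etree.ElementTree as ET

# The question's sample, compacted; <html> becomes the root element.
DOC = ET.fromstring(
    '<html><body>'
    '<div id="large_image_display">'
    '<img class="photo" src="images/KC0763_l.jpg" alt="Circles t-shirt - Navy" />'
    '</div>'
    '<div id="small_image_display">'
    '<img class="photo" src="images/KC0763_s.jpg" alt="Circles t-shirt - Navy" />'
    '</div>'
    '</body></html>'
)

# Descendant search: considers every element under <html>.
scanned = [img.get("src") for img in DOC.findall(".//img")]

# Explicit path: follows only html -> body -> div -> img.
stepped = [img.get("src") for img in DOC.findall("./body/div/img")]

# Explicit path plus the id predicate narrows the match to the large image.
large_only = [
    img.get("src")
    for img in DOC.findall("./body/div[@id='large_image_display']/img")
]

print(scanned)
print(stepped)
print(large_only)
```

On a document this small both forms return the same two images; the explicit path matters more as the tree grows, and the id predicate is what isolates the large image.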

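Your major problem was that your expressions used \ and \\, and there are no such operators in XPath. The correct XPath operators you were trying to use are / and the // abbreviation. In addition, the div in your sample is identified by its id attribute, not class, so the predicate has to be [@id='large_image_display'], and the attribute step needs its own separator: /@src rather than @src.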




Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow