XPath "Not". Ignore branches with a specific tag

html-agility-pack xpath

Question

I have loaded a web page into the HTML Agility Pack and have a DOM. I want to use XPath to pull out all of the text on the page, but not the JavaScript found within <script> tags.

I figure I need //text() plus a 'not' predicate to skip any text that sits inside a <script> element.

I have tried

doc.DocumentNode.SelectNodes("//text()[not(self::script)]")

and

doc.DocumentNode.SelectNodes("//text()[not(script)]")

but neither works. Here is an example of the XPath property of a node that they return (notice the script element):

/html[1]/body[1]/div[2]/div[4]/div[1]/div[1]/div[1]/div[3]/script[1]/#text[1]

I have consulted both of these posts:

Is it possible to do 'not' matching in XPath?

Grab all text from html with Html Agility Pack (This is a good post but it brings out the JS)

Any suggestions?

Accepted Answer

Your first attempt rejects all text nodes that are script elements, and your second rejects all text nodes that have script element children. A text node is never an element and never has children, so in both cases the condition inside not() is never true and nothing is filtered out.
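
To see this for yourself, here is a minimal sketch, assuming the HtmlAgilityPack NuGet package is referenced. The sample markup, the class name and the Count helper are made up for illustration. Since neither predicate can ever reject a text node, all three queries should report the same number of nodes.

```csharp
using System;
using HtmlAgilityPack;

class PredicateCheck
{
    // Hypothetical helper: SelectNodes returns null when nothing matches,
    // so treat null as a count of zero.
    static int Count(HtmlDocument doc, string xpath)
    {
        var nodes = doc.DocumentNode.SelectNodes(xpath);
        return nodes == null ? 0 : nodes.Count;
    }

    static void Main()
    {
        // Made-up sample page with one <script> block.
        var doc = new HtmlDocument();
        doc.LoadHtml("<html><body><p>Hello</p><script>var x = 1;</script></body></html>");

        // Unfiltered text nodes, then the two attempts from the question.
        Console.WriteLine(Count(doc, "//text()"));
        Console.WriteLine(Count(doc, "//text()[not(self::script)]"));
        Console.WriteLine(Count(doc, "//text()[not(script)]"));
    }
}
```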

You haven't explained your requirements clearly, but I guess you want to reject all text nodes that have script elements as their parent, which would be

//text()[not(parent::script)]

or

//*[not(self::script)]/text()
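
Plugged into the Html Agility Pack, the first expression might be used roughly like this. It is only a sketch: it assumes the HtmlAgilityPack NuGet package is referenced, and the sample markup and class name are made up for illustration.

```csharp
using System;
using HtmlAgilityPack;

class TextWithoutScripts
{
    static void Main()
    {
        // Made-up sample page; in practice the document would be loaded
        // from the real page instead.
        const string html =
            "<html><body><p>Hello</p><script>var x = 1;</script><div>World</div></body></html>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Keep only text nodes whose parent element is not <script>.
        var nodes = doc.DocumentNode.SelectNodes("//text()[not(parent::script)]");

        // SelectNodes returns null (not an empty collection) when nothing matches.
        if (nodes == null)
        {
            return;
        }

        foreach (var node in nodes)
        {
            // Decode entities such as &amp; and skip whitespace-only nodes.
            var text = HtmlEntity.DeEntitize(node.InnerText).Trim();
            if (text.Length > 0)
            {
                Console.WriteLine(text);
            }
        }
    }
}
```

The whitespace filtering and entity decoding are optional extras, not part of the XPath fix itself.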



Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow