I have loaded a web page into the HTML Agility Pack and have a DOM. I want to use XPath to pull out all of the text on the page (but not the JavaScript found within <script> tags).
I figure I need a //text() and then a 'not' to ignore any text within a branch that has a <script> in it.
I have tried
doc.DocumentNode.SelectNodes("//text()[not(self::script)]")
and
doc.DocumentNode.SelectNodes("//text()[not(script)]")
but neither works. Here is an example of the XPath property of a node that they return (notice the script element in the path):
/html[1]/body[1]/div[2]/div[4]/div[1]/div[1]/div[1]/div[3]/script[1]/#text[1]
I have consulted both of these posts:
Is it possible to do 'not' matching in XPath?
Grab all text from html with Html Agility Pack (This is a good post but it brings out the JS)
Any suggestions?
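Your first attempt rejects all text nodes that are themselves script elements, and your second rejects all text nodes that have script elements as children. A text node is never an element and never has children, so in both cases the condition inside not() is never true, and the predicate lets every text node through.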
You haven't explained your requirements clearly, but I guess you want to reject all text nodes that have script elements as their parent, which would be
//text()[not(parent::script)]
or
//*[not(self::script)]/text()
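
For what it's worth, here is a minimal sketch of how the first expression might be used from C# with the HTML Agility Pack. The sample HTML string and the class name are made up for illustration, and trimming and skipping whitespace-only nodes is my own assumption about what you want, not something your question states.

```csharp
using System;
using HtmlAgilityPack;

class TextWithoutScripts
{
    static void Main()
    {
        // Hypothetical sample page; in your case the document is already loaded.
        const string html =
            "<html><body><p>Hello</p><script>var x = 1;</script><div>World</div></body></html>";

        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Keep only the text nodes whose parent is not a <script> element.
        HtmlNodeCollection nodes =
            doc.DocumentNode.SelectNodes("//text()[not(parent::script)]");

        // SelectNodes returns null (not an empty collection) when nothing matches.
        if (nodes == null)
        {
            return;
        }

        foreach (HtmlNode node in nodes)
        {
            // Assumption: whitespace-only text nodes are not interesting.
            string text = node.InnerText.Trim();
            if (text.Length > 0)
            {
                Console.WriteLine(text);
            }
        }
    }
}
```

If you also want to drop the contents of other non-visible elements, the same predicate style should extend naturally, for example not(parent::script or parent::style), but that goes beyond what you asked.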