XPath to the first instance of an element with a text length of more than 200 characters

c# html html-agility-pack xpath

Question

How can I find the first element whose own plain text, not including the text of any child elements, is 200 characters or longer?

I'm trying to build an HTML parser similar to Embed.ly, and I've set up a system of fallbacks: I first look for og:description, then I would look for this occurrence, and only then for the description meta tag.

This is because most sites that even include a meta description use that tag to describe the site as a whole rather than the contents of the current page.

Example:

<html>
    <body>
        <div>some characters
            <p>200 characters <span>some more stuff</span></p>
        </div>
    </body>
</html>

What selector could I use to get the "200 characters" portion of that HTML fragment? I don't want the "some more stuff" part either, and I don't care what element it is (except for <script> or <style>), as long as it is the first plain text that contains at least 200 characters.

What structure should the XPath query have?

1
6
3/6/2012 1:11:11 AM

Accepted Answer

Use:

(//*[not(self::script or self::style)]/text()[string-length() > 200])[1]
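As a rough illustration of how this expression might be evaluated from C#, here is a minimal sketch that uses the built-in System.Xml classes rather than Html Agility Pack, so it assumes the markup is well-formed and not in a namespace. The class name LongTextFinder, the method FindFirstLongText, and the threshold parameter are hypothetical names introduced for the example. With Html Agility Pack, the same expression string could be passed to DocumentNode.SelectSingleNode instead.

```csharp
using System.Xml;

public static class LongTextFinder
{
    // The expression from the answer, with the length threshold as a placeholder.
    private const string XPathTemplate =
        "(//*[not(self::script or self::style)]/text()[string-length() > {0}])[1]";

    // Hypothetical helper: returns the first text node longer than `threshold`
    // characters, or an empty string when no text node qualifies.
    public static string FindFirstLongText(string wellFormedMarkup, int threshold)
    {
        var doc = new XmlDocument();
        doc.LoadXml(wellFormedMarkup); // assumption: input parses as XML

        var node = doc.SelectSingleNode(string.Format(XPathTemplate, threshold));
        return node == null ? string.Empty : (node.Value ?? string.Empty);
    }
}
```

Because the expression selects a text() node rather than an element, the returned value contains only the element's own text and none of the text inside child elements such as the <span> in the question.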

If the document is an XHTML document, which means that all elements are in the XHTML namespace, use the following expression instead:

(//*[not(self::x:script or self::x:style)]/text()[string-length() > 200])[1]

where the prefix "x:" must be bound to the XHTML namespace, "http://www.w3.org/1999/xhtml" (or, as many XPath APIs put it, the namespace must be "registered" with this prefix).
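To make the "registering" step concrete, here is a minimal sketch using System.Xml, where the binding is done through an XmlNamespaceManager. The class name XhtmlLongTextFinder and the threshold parameter are hypothetical, and the sketch assumes the XHTML input has no DOCTYPE that would need resolving.

```csharp
using System.Xml;

public static class XhtmlLongTextFinder
{
    private const string XhtmlNamespace = "http://www.w3.org/1999/xhtml";

    // Hypothetical helper: returns the first text node longer than `threshold`
    // characters outside <script>/<style>, or an empty string when none exists.
    public static string FindFirstLongText(string xhtml, int threshold)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xhtml); // assumption: well-formed XHTML without a DOCTYPE

        // Bind ("register") the prefix x to the XHTML namespace.
        var namespaces = new XmlNamespaceManager(doc.NameTable);
        namespaces.AddNamespace("x", XhtmlNamespace);

        string xpath =
            "(//*[not(self::x:script or self::x:style)]/text()[string-length() > "
            + threshold + "])[1]";

        var node = doc.SelectSingleNode(xpath, namespaces);
        return node == null ? string.Empty : (node.Value ?? string.Empty);
    }
}
```

The prefix chosen in the XPath expression does not have to match anything in the document. Only the namespace URI has to match, which is why a document that declares XHTML as its default namespace can still be queried with "x:".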

8
3/6/2012 3:04:19 AM

Popular Answer

I had something like this in mind:

root.SelectNodes("html/body/.//*[(name() !='script') and (name()!='style')]/text()[string-length() > 200]")

It seems to work quite well.



Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow