HtmlAgilityPack: xpath and regex

c# html-agility-pack regex

Question

I'm presently using HTML Agility Pack to do an xpath query search for specific content. Possibly like this:

var col = doc.DocumentNode.SelectNodes("//*[text()[contains(., 'foo'] or @*....

I now want to use a regular expression to search for specified material across the whole HTML source code (text, tags, and attributes). How can HTMLAgilityPack help with this? What would be the best approach to search using a regex and HTML Agility Pack? Can it support xpath+regex?

1
2
11/4/2014 8:37:10 PM

Accepted Answer

For its XPATH functionality, the HTML Agility Pack makes use of the underlying.NET XPATH implementation. It's a good thing that XPATH in.NET is completely expandable (BTW: it's a pity Microsoft isn't investing more on this fantastic technology).

So, let's say I have the following HTML:

<div>hello</div>
<div>hallo</div>

Here is an example of code that compares the nodes with the 'h.llo' regex expression and chooses both nodes:

HtmlNodeNavigator nav = new HtmlNodeNavigator("mypage.htm");
foreach (var node in SelectNodes(nav, "//div[regex-is-match(text(), 'h.llo')]"))
{
    Console.WriteLine(node.OuterHtml); // should dump both div elements
}

Because I've developed a new XPATH function named "regex-is-match" in a unique Xslt/XPath context, it works. The SelectNodes utility code is as follows:

public static IEnumerable<HtmlNode> SelectNodes(HtmlNodeNavigator navigator, string xpath)
{
    if (navigator == null)
        throw new ArgumentNullException("navigator");

    XPathExpression expr = navigator.Compile(xpath);
    expr.SetContext(new HtmlXsltContext());

    object eval = navigator.Evaluate(expr);
    XPathNodeIterator it = eval as XPathNodeIterator;
    if (it != null)
    {
        while (it.MoveNext())
        {
            HtmlNodeNavigator n = it.Current as HtmlNodeNavigator;
            if (n != null && n.CurrentNode != null)
            {
                yield return n.CurrentNode;
            }
        }
    }
}

Here is the support code as well:

    public class HtmlXsltContext : XsltContext
    {
        public HtmlXsltContext()
            : base(new NameTable())
        {
        }

        public override int CompareDocument(string baseUri, string nextbaseUri)
        {
            throw new NotImplementedException();
        }

        public override bool PreserveWhitespace(XPathNavigator node)
        {
            throw new NotImplementedException();
        }

        protected virtual IXsltContextFunction CreateHtmlXsltFunction(string prefix, string name, XPathResultType[] ArgTypes)
        {
            return HtmlXsltFunction.GetBuiltIn(this, prefix, name, ArgTypes);
        }

        public override IXsltContextFunction ResolveFunction(string prefix, string name, XPathResultType[] ArgTypes)
        {
            return CreateHtmlXsltFunction(prefix, name, ArgTypes);
        }

        public override IXsltContextVariable ResolveVariable(string prefix, string name)
        {
            throw new NotImplementedException();
        }

        public override bool Whitespace
        {
            get { return true; }
        }
    }

    public abstract class HtmlXsltFunction : IXsltContextFunction
    {
        protected HtmlXsltFunction(HtmlXsltContext context, string prefix, string name, XPathResultType[] argTypes)
        {
            Context = context;
            Prefix = prefix;
            Name = name;
            ArgTypes = argTypes;
        }

        public HtmlXsltContext Context { get; private set; }
        public string Prefix { get; private set; }
        public string Name { get; private set; }
        public XPathResultType[] ArgTypes { get; private set; }

        public virtual int Maxargs
        {
            get { return Minargs; }
        }

        public virtual int Minargs
        {
            get { return 1; }
        }

        public virtual XPathResultType ReturnType
        {
            get { return XPathResultType.String; }
        }

        public abstract object Invoke(XsltContext xsltContext, object[] args, XPathNavigator docContext);

        public static IXsltContextFunction GetBuiltIn(HtmlXsltContext context, string prefix, string name, XPathResultType[] argTypes)
        {
            if (name == "regex-is-match")
                return new RegexIsMatch(context, name);

            // TODO: create other functions here
            return null;
        }

        public static string ConvertToString(object argument, bool outer, string separator)
        {
            if (argument == null)
                return null;

            string s = argument as string;
            if (s != null)
                return s;

            XPathNodeIterator it = argument as XPathNodeIterator;
            if (it != null)
            {
                if (!it.MoveNext())
                    return null;

                StringBuilder sb = new StringBuilder();
                do
                {
                    HtmlNodeNavigator n = it.Current as HtmlNodeNavigator;
                    if (n != null && n.CurrentNode != null)
                    {
                        if (sb.Length > 0 && separator != null)
                        {
                            sb.Append(separator);
                        }

                        sb.Append(outer ? n.CurrentNode.OuterHtml : n.CurrentNode.InnerHtml);
                    }
                }
                while (it.MoveNext());
                return sb.ToString();
            }

            IEnumerable enumerable = argument as IEnumerable;
            if (enumerable != null)
            {
                StringBuilder sb = null;
                foreach (object arg in enumerable)
                {
                    if (sb == null)
                    {
                        sb = new StringBuilder();
                    }

                    if (sb.Length > 0 && separator != null)
                    {
                        sb.Append(separator);
                    }

                    string s2 = ConvertToString(arg, outer, separator);
                    if (s2 != null)
                    {
                        sb.Append(s2);
                    }
                }
                return sb != null ? sb.ToString() : null;
            }

            return string.Format("{0}", argument);
        }

        public class RegexIsMatch : HtmlXsltFunction
        {
            public RegexIsMatch(HtmlXsltContext context, string name)
                : base(context, null, name, null)
            {
            }

            public override XPathResultType ReturnType { get { return XPathResultType.Boolean; } }
            public override int Minargs { get { return 2; } }

            public override object Invoke(XsltContext xsltContext, object[] args, XPathNavigator docContext)
            {
                if (args.Length < 2)
                    return false;

                return Regex.IsMatch(ConvertToString(args[0], false, null), ConvertToString(args[1], false, null));
            }
        }
    }

Finally, a class named RegexIsMatch implements the regex function. It's not really difficult. It should be noted that the handy utility function ConvertToString attempts to convert any xpath "thing" into a string.

Of course, with this technique, you can quickly construct any XPATH function you want (I often use it to convert between upper- and lowercase letters).

5
11/21/2014 11:23:20 AM

Popular Answer

Immediately quoting

I think the flaw here is that HTML is a Chomsky Type 2 grammar (context free grammar) and RegEx is a Chomsky Type 3 grammar (regular grammar). Since a Type 2 grammar is fundamentally more complex than a Type 3 grammar (see the Chomsky hierarchy), you can't possibly make this work. But many will try, some will claim success and others will find the fault and totally mess you up.

In certain cases, an HTML document's components might benefit from the usage of a regular expression. attempting to utilizeHtmlAgilityPack It is counterintuitive and ultimately ineffective to use a regular expression to analyze the tags and structure of an HTML page.



Related Questions





Related

Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow
Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow