HtmlAgilityPack -- Does
close itself for some reason?

c# html-agility-pack

Question

I just wrote up this test to see if I was crazy...

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using HtmlAgilityPack;

namespace HtmlAgilityPackFormBug
{
    class Program
    {
        static void Main(string[] args)
        {
            var doc = new HtmlDocument();
            doc.LoadHtml(@"
<!DOCTYPE html>
<html>
    <head>
        <title>Form Test</title>
    </head>
    <body>
        <form>
            <input type=""text"" />
            <input type=""reset"" />
            <input type=""submit"" />
        </form>
    </body>
</html>
");
            var body = doc.DocumentNode.SelectSingleNode("//body");
            foreach (var node in body.ChildNodes.Where(n => n.NodeType == HtmlNodeType.Element))
                Console.WriteLine(node.XPath);
            Console.ReadLine();
        }
    }
}

And it outputs:

/html[1]/body[1]/form[1]
/html[1]/body[1]/input[1]
/html[1]/body[1]/input[2]
/html[1]/body[1]/input[3]

But, if I change <form> to <xxx> it gives me:

/html[1]/body[1]/xxx[1]

(As it should). So... it looks like those input elements are not contained within the form, but directly within the body, as if the <form> just closed itself off immediately. What's up with that? Is this a bug?


Digging through the source, I see:

ElementsFlags.Add("form", HtmlElementFlag.CanOverlap | HtmlElementFlag.Empty);

It has the "empty" flag, like META and IMG. Why?? Forms are most definitely not supposed to be empty.

Accepted Answer

This is also reported in this workitem. It contains a suggested workaround from DarthObiwan.

You can change this without recompiling. The ElementFlags list is a static property on the HtmlNode class. It can be removed with

    HtmlNode.ElementsFlags.Remove("form");

before doing the document load


Popular Answer

Since I'm the original HAP author, I can explain why it's marked as empty :)

This is because when HAP was designed, back in 2000, HTML 3.2 was the standard. You're probably aware that tags can perfectly overlap in HTML. That is: <b>bold<i>italic and bold</b>italic</i> (bolditalic and bolditalic) is supported by all browsers (although it's not officially in the HTML specification). And the FORM tag can also perfectly overlap as well.

Since HAP has been designed to handle any HTML content, rather than break most pages that you could find at that time, we just decided to handle overlapping tags as EMPTY (using the ElementFlags property) so:

  • you can still load them
  • you can save them back without breaking the original HTML (If you don't need what's inside the form in any programmatic way).

The only thing you cannot do is work with them with the API, using the tree model, nor with XSL, or anything programmatic. Today, with XHTML/XML almost everywhere, this sounds strange, but that's why I created the ElementFlags :)



Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow
Is this KB legal? Yes, learn why
Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow
Is this KB legal? Yes, learn why