Problem parsing children of a node with HtmlAgilityPack

c# html-agility-pack html-parsing xpath

Question

I'm having a problem parsing the input tag children of a form in html. I can parse them from the root using //input[@type] but not as children of a specific node.

Here's some code that illustrates the problem:

private const string HTML_CONTENT =
        "<html>" +
        "<head>" +
        "<title>Test Page</title>" +
        "<link href='site.css' rel='stylesheet' type='text/css' />" +
        "</head>" +
        "<body>" +
        "<form id='form1' method='post' action='http://www.someplace.com/input'>" +
        "<input type='hidden' name='id' value='test' />" +
        "<input type='text' name='something' value='something' />" +
        "</form>" +
        "<a href='http://www.someplace.com'>Someplace</a>" +
        "<a href='http://www.someplace.com/other'><img src='http://www.someplace.com/image.jpg' alt='Someplace Image'/></a>" +
        "<form id='form2' method='post' action='/something/to/do'>" +
        "<input type='text' name='secondForm' value='this should be in the second form' />" +
        "</form>" +
        "</body>" +
        "</html>";

public void Parser_Test()
    {
        var htmlDoc = new HtmlDocument
        {
            OptionFixNestedTags = true,
            OptionUseIdAttribute = true,
            OptionAutoCloseOnEnd = true,
            OptionAddDebuggingAttributes = true
        };

        byte[] byteArray = Encoding.UTF8.GetBytes(HTML_CONTENT);
        var stream = new MemoryStream(byteArray);
        htmlDoc.Load(stream, Encoding.UTF8, true);
        var nodeCollection = htmlDoc.DocumentNode.SelectNodes("//form");
        if (nodeCollection != null && nodeCollection.Count > 0)
        {
            foreach (var form in nodeCollection)
            {
                var id = form.GetAttributeValue("id", string.Empty);
                if (!form.HasChildNodes)
                    Debug.WriteLine(string.Format("Form {0} has no children", id ) );

                var childCollection = form.SelectNodes("input[@type]");
                if (childCollection != null && childCollection.Count > 0)
                {
                    Debug.WriteLine("Got some child nodes");
                }
                else
                {
                    Debug.WriteLine("Unable to find input nodes as children of Form");
                }
            }
            var inputNodes = htmlDoc.DocumentNode.SelectNodes("//input");
            if (inputNodes != null && inputNodes.Count > 0)
            {
                Debug.WriteLine(string.Format("Found {0} input nodes when parsed from root", inputNodes.Count ) );
            }
        }
        else
        {
            Debug.WriteLine("Found no forms");
        }
    }

What is output is:

Form form1 has no children
Unable to find input nodes as children of Form
Form form2 has no children
Unable to find input nodes as children of Form
Found 3 input nodes when parsed from root

What I would expect is that Form1 and Form2 would both have children and that input[@type] would be able to find 2 nodes for form1 and 1 for form2

Is there a specific configuration setting or method that I'm not using that I should be? Any ideas?

Thanks,

Steve

Accepted Answer

Well, I've given up on HtmlAgilityPack for now. Seems like there is still more work to do in that library to get everything working. To solve this problem I've moved the code over to use the SGMLReader library from here: http://developer.mindtouch.com/SgmlReader

Using this library all my unit tests pass properly and the sample code works as expected.


Popular Answer

Check out this discussion thread on the HtmlAgilityPack site - http://htmlagilitypack.codeplex.com/workitem/21782

This is what they say:

This is not a bug, but a feature and is configurable. FORM is treated like this because many HTML pages used to have overlapping forms, as this was actually a (powerful) feature of original HTML. Now that XML and XHTML exist, everybody assumes that overlapping is an error, but it's not (in HTML 3.2). Check the HtmlNode.cs file, and modify the ElementsFlags collection (or do it at runtime if you prefer)

To modify the HtmlNode.cs file, comment out following line -

ElementsFlags.Add("form", HtmlElementFlag.CanOverlap | HtmlElementFlag.Empty);


Related

Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow
Is this KB legal? Yes, learn why
Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow
Is this KB legal? Yes, learn why