HtmlAgilityPack Drops Option End Tags

html html-agility-pack parsing

Question

I am using HtmlAgilityPack. I create an HtmlDocument and LoadHtml with the following string:

<select id="foo_Bar" name="foo.Bar"><option selected="selected" value="1">One</option><option value="2">Two</option></select>

This does some unexpected things. First, it gives two parser errors, EndTagNotRequired. Second, the select node has 4 children - two for the option tags and two more for the inner text of the option tags. Last, the OuterHtml is like this:

<select id="foo_Bar" name="foo.Bar"><option selected="selected" value="1">One<option value="2">Two</select>

So basically it is deciding for me to drop the closing tags on the options. Let's leave aside for a moment whether it is proper and desirable to do that. I am using HtmlAgilityPack to test HTML generation code, so I don't want it to make any decision for me or give any errors unless the HTML is truly malformed. Is there some way to make it behave how I want? I tried setting some of the options for HtmlDocument, specifically:

 doc.OptionAutoCloseOnEnd = false;
 doc.OptionCheckSyntax = false;
 doc.OptionFixNestedTags = false;

This is not working. If HtmlAgilityPack cannot do what I want, can you recommend something that can?

Accepted Answer

The exact same error is reported on the HAP home page's discussion, but it looks like no meaningful fixes have been made to the project in a few years. Not encouraging.

A quick browse of the source suggests the error might be fixable by commenting out line 92 of HtmlNode.cs:

// they sometimes contain, and sometimes they don 't...
ElementsFlags.Add("option", HtmlElementFlag.Empty);

(Actually no, they always contain label text, although a blank string would also be valid text. A careless author might omit the end-tag, but then that's true of any element.)

ADD

An equivalent solution is calling HtmlNode.ElementsFlags.Remove("option"); before any use of liberary (without need to modify the liberary source code)


Popular Answer

It seems that there is some reason not to parse the Option tag as a "generic" tag, for XHTML compliance, however this can be a real pain in the neck.

My suggestion is to do a whole-string-replace and change all "option" tags to "my_option" tags, that way you:

  1. Don't have to modify the source of the library (and can upgrade it later).
  2. Can parse as you usually would.

The original post on HtmlAgilityPack forum can be found at: http://htmlagilitypack.codeplex.com/Thread/View.aspx?ThreadId=14982




Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow
Is this KB legal? Yes, learn why
Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow
Is this KB legal? Yes, learn why