HTML Agility Pack Conversion to XML <script> corruption

c# html-agility-pack linq-to-xml

Question

I've got an HTML file with a <script> in it:

<html>
   <script type="application/custom+xml">
   <my><xml><goes><here/></goes></xml></my>
   </script>
</html>

I parse it with HTML Agility Pack and then convert it to XML.

HtmlDocument html = new HtmlDocument();
// (loading the HTML above into html is omitted here)
html.OptionOutputAsXml = true;
html.Save(stream);
...
XDocument xml = XDocument.Load(stream);

I then want to use LINQ-to-XML to look at the contents of the script tag, which should contain my XML as CDATA. But HTML Agility Pack messes it up somehow, and I end up with this escaped XML:

<html>
<script type="application/custom+xml">
//<![CDATA[
&lt;my&gt;&lt;xml&gt;&lt;goes&gt;&lt;here/&gt;&lt;/goes&gt;&lt;/xml&gt;&lt;/my&gt;
//]]>//
</script>
</html>

Does anyone know how I can tell HTML Agility Pack not to escape the contents of the script tag?

Accepted Answer

That's rather easy to fix. By default, the Agility Pack treats the content of script tags as CData, which is why your XML comes out escaped. The flag is set in the static constructor of the HtmlNode class, like so:

ElementsFlags.Add("script", HtmlElementFlag.CData);

To change this you don't have to modify the Agility Pack itself. All that's needed is one line, run before your parsing code or just once when your program starts:

HtmlNode.ElementsFlags.Remove("script");

Just add that before your code; with that change it works for me.
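
For reference, here is a minimal end-to-end sketch of where that line fits. It assumes a version of the Agility Pack in which HtmlNode.ElementsFlags is a public static collection keyed by tag name, as the answer above implies. The class name ScriptXmlDemo and the method ExtractScript are made up for illustration, and the HTML is saved to a StringWriter instead of the question's stream so that nothing has to be rewound before it is read back.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Xml.Linq;
using HtmlAgilityPack;

static class ScriptXmlDemo
{
    // Hypothetical helper: parses the HTML, writes it out as XML and
    // returns the first <script> element (or null if there is none).
    public static XElement ExtractScript(string source)
    {
        // Drop the CData flag so the content of <script> is parsed as
        // ordinary markup. ElementsFlags is static, so this affects every
        // document parsed afterwards in the same process.
        HtmlNode.ElementsFlags.Remove("script");

        var html = new HtmlDocument();
        html.LoadHtml(source);
        html.OptionOutputAsXml = true;

        // Save into an in-memory text writer and parse the result back.
        var writer = new StringWriter();
        html.Save(writer);

        XDocument xml = XDocument.Parse(writer.ToString());
        return xml.Descendants("script").FirstOrDefault();
    }

    static void Main()
    {
        const string source =
            "<html><script type=\"application/custom+xml\">" +
            "<my><xml><goes><here/></goes></xml></my>" +
            "</script></html>";

        // Print the script element as LINQ-to-XML sees it.
        Console.WriteLine(ExtractScript(source));
    }
}
```

Because the flag is removed globally, any real JavaScript in script tags will be parsed as markup too. If the same program also handles ordinary pages, it may be safer to add the flag back with HtmlNode.ElementsFlags.Add("script", HtmlElementFlag.CData) once the custom documents have been processed.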




Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow