OptionWriteEmptyNodes break XML declaration using HtmlAgilityPack

c# end-tag html-agility-pack non-well-formed xml-declaration

Question

Here is the super simple code i have:

HtmlDocument htmlDoc = new HtmlDocument();
htmlDoc.OptionWriteEmptyNodes = true;
htmlDoc.Load("sourcefilepath");
htmlDoc.Save("destfilepath", Encoding.UTF8);

Input:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta http-equiv="Content-Type" content="application/xhtml+xml; charset=utf-8"/>
    <link rel="stylesheet" href="main.css" type="text/css"/>
  </head>
  <body>lots of text here, obviously not relevant to this problem</body>
</html>

Output:

<?xml version="1.0" encoding="UTF-8" />
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta http-equiv="Content-Type" content="application/xhtml+xml; charset=utf-8" />
    <link rel="stylesheet" href="main.css" type="text/css" />
  </head>
  <body>lots of text here, obviously not relevant to this problem</body>
</html>

You can see that in the first line there is an error: /> instead of ?> This happens if i set OptionWriteEmptyNodes to true value. It has been set to true, because otherwise meta/link tags(and some others in the document body) won't be closed.

Anyone know how to solve this?

Accepted Answer

Seems like a bug. You should report it to http://htmlagilitypack.codeplex.com.

Still, you can workaround that bug like this:

HtmlNode.ElementsFlags.Remove("meta");
HtmlNode.ElementsFlags.Remove("link");
HtmlDocument htmlDoc = new HtmlDocument();
htmlDoc.Load("sourcefilepath");
htmlDoc.Save("destfilepath", Encoding.UTF8);

Just remove the flags from the meta & link tags that instruct the Html Agility Pack not to close them automatically, and don't set OptionWriteEmptyNodes to true.

It will produce this (note this is slightly different):

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta http-equiv="Content-Type" content="application/xhtml+xml; charset=utf-8"></meta>
    <link rel="stylesheet" href="main.css" type="text/css"></link>
  </head>
  <body>lots of text here, obviously not relevant to this problem</body>
</html>

Popular Answer

Managed to do another way of workaround this problem. This works slightly better in my case than the one above. Basically we are replacing the first child of the DocumentNode, which is the xml declaration.(please note that the input must contain the xml declaration, in my case it's 100%)

HtmlDocument htmlDoc = new HtmlDocument();
htmlDoc.OptionWriteEmptyNodes = true;
htmlDoc.Load("sourcepath");

var newNodeStr = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>";
var newNode = HtmlNode.CreateNode(newNodeStr);

htmlDoc.DocumentNode.ReplaceChild(newNode, htmlDoc.DocumentNode.FirstChild);


htmlDoc.Save("destpath", Encoding.UTF8);

Please note that Simon's workaround works too, so take the one which better fits in your scenario.




Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow
Is this KB legal? Yes, learn why
Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow
Is this KB legal? Yes, learn why