OptionWriteEmptyNodes break XML declaration using HtmlAgilityPack

c# end-tag html-agility-pack non-well-formed xml-declaration

Question

Here is my really basic code:

HtmlDocument htmlDoc = new HtmlDocument();
htmlDoc.OptionWriteEmptyNodes = true;
htmlDoc.Load("sourcefilepath");
htmlDoc.Save("destfilepath", Encoding.UTF8);

Input:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta http-equiv="Content-Type" content="application/xhtml+xml; charset=utf-8"/>
    <link rel="stylesheet" href="main.css" type="text/css"/>
  </head>
  <body>lots of text here, obviously not relevant to this problem</body>
</html>

Output:

<?xml version="1.0" encoding="UTF-8" />
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta http-equiv="Content-Type" content="application/xhtml+xml; charset=utf-8" />
    <link rel="stylesheet" href="main.css" type="text/css" />
  </head>
  <body>lots of text here, obviously not relevant to this problem</body>
</html>

There is a mistake in the first line, as you can see: in place of?> If I set OptionWriteEmptyNodes to true, then this occurs. Meta/link tags, as well as several others included in the document body, won't be closed if it isn't set to true.

Does anybody have a solution for this?

1
0
6/15/2012 9:11:20 AM

Accepted Answer

Appears to be a bug. You need to call http://htmlagilitypack.codeplex.com and report it.

However, you may bypass that problem by doing the following:

HtmlNode.ElementsFlags.Remove("meta");
HtmlNode.ElementsFlags.Remove("link");
HtmlDocument htmlDoc = new HtmlDocument();
htmlDoc.Load("sourcefilepath");
htmlDoc.Save("destfilepath", Encoding.UTF8);

Simply take down the flags from themeta & link elements telling the HTML Agility Pack not to automatically shut them and not to setOptionWriteEmptyNodes to true .

This is what it will create (notice that this is a little different):

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta http-equiv="Content-Type" content="application/xhtml+xml; charset=utf-8"></meta>
    <link rel="stylesheet" href="main.css" type="text/css"></link>
  </head>
  <body>lots of text here, obviously not relevant to this problem</body>
</html>
1
6/15/2012 9:56:26 AM

Popular Answer

I was able to find a different solution to this issue. In my situation, this works a little bit better than the alternative. In essence, we are changing the xml declaration, the DocumentNode's first child. (Note that the xml declaration must be included in the input; in my instance, it is 100 percent)

HtmlDocument htmlDoc = new HtmlDocument();
htmlDoc.OptionWriteEmptyNodes = true;
htmlDoc.Load("sourcepath");

var newNodeStr = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>";
var newNode = HtmlNode.CreateNode(newNodeStr);

htmlDoc.DocumentNode.ReplaceChild(newNode, htmlDoc.DocumentNode.FirstChild);


htmlDoc.Save("destpath", Encoding.UTF8);

Please be aware that Simon's solution also works, so choose the one that best matches your situation.



Related Questions





Related

Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow
Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow