HtmlAgilityPack treats everything after < (less than sign) as attributes

c# html-agility-pack

Question

I get some input via a textarea and convert it into an HTML document, which is later converted into a PDF document.

When my users input the less than sign (<), everything breaks in my HtmlDocument. HtmlAgilityPack suddenly treats everything after the less than sign as an attribute. See the output:

Within this Character Data block I can use double dashes as much as I want (along with <, &,="" ',="" and="" ')="" *and="" *="" %="" myparamentity;="" will="" be="" expanded="" to="" the="" text="" 'has="" been="" expanded'...however,="" i="" can't="" use="" the="" cend="" sequence(if="" i="" need="" to="" use="" it="" i="" must="" escape="" one="" of="" the="" brackets="" or="" the="" greater-than="" sign).="">

It gets a little better if I just add

htmlDocument.OptionOutputOptimizeAttributeValues = true;

which gives me:

Within this Character Data block I can use double dashes as much as I want (along with <, &,= ',= and= ')= *and= *= %= myparamentity;= will= be= expanded= to= the= text= 'has= been= expanded'...however,= i= can't= use= the= cend= sequence(if= i= need= to= use= it= i= must= escape= one= of= the= brackets= or= the= greater-than= sign).=>

I have tried all of the options on the HtmlDocument, and none of them lets me specify that the parser should not be strict. On the other hand, I might be able to live with it stripping away the <, but adding all the equal signs doesn't really work for me.

void Main()
{
    var input = @"Within this Character Data block I can use double dashes as much as I want (along with <, &, ', and ') *and * % MyParamEntity; will be expanded to the text 'Has been expanded'...however, I can't use the CEND sequence(if I need to use it I must escape one of the brackets or the greater-than sign).";

    var htmlDoc = WrapContentInHtml(input);

    htmlDoc.DocumentNode.OuterHtml.ToString().Dump();
}

private HtmlDocument WrapContentInHtml(string content)
{
    var htmlBuilder = new StringBuilder();
    htmlBuilder.AppendLine("<!DOCTYPE html>");
    htmlBuilder.AppendLine("<html>");
    htmlBuilder.AppendLine("<head>");
    htmlBuilder.AppendLine("<title></title>");
    htmlBuilder.AppendLine("</head>");
    htmlBuilder.AppendLine("<body><div id='sagsfremstillingContainer'>");
    htmlBuilder.AppendLine(content); 
    htmlBuilder.AppendLine("</div></body></html>");

    var htmlDocument = new HtmlDocument();
    htmlDocument.OptionOutputOptimizeAttributeValues = true;
    var htmlDoc = htmlBuilder.ToString();

    htmlDocument.LoadHtml(htmlDoc);

    return htmlDocument;
}

Does anybody have an idea how I can solve this problem?

The closest question I can find is this: Losing the 'less than' sign in HtmlAgilityPack loadhtml

There, the asker actually complains about the < disappearing, which would be fine for me. Of course, fixing the parsing error would be the best solution.

EDIT: I am using HtmlAgilityPack 1.4.9

Accepted Answer

Your content is blatantly wrong. This is not about "strictness", it's really about the fact that you're pretending a piece of text is valid HTML. In fact, the results you are getting are exactly because the parser is not strict.

When you need to insert plain text into HTML, you need to encode it first, so that all the various HTML control characters are converted to HTML properly - for example, < must be changed to &lt; and & to &amp;.

One way to handle this is to use the DOM - use InnerText on the target div, instead of slapping strings together and pretending they're HTML. Another is to use some explicit encoding method - for example HttpUtility.HtmlEncode.
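As a minimal sketch of the explicit-encoding route, assuming the same skeleton as the question's WrapContentInHtml: the helper below encodes the user text with System.Net.WebUtility.HtmlEncode before appending it. The class name EncodedHtmlWrapper is made up for illustration, and the method returns the markup as a string so the sketch runs without HtmlAgilityPack.

```csharp
using System;
using System.Net;
using System.Text;

// Hypothetical helper class, not part of HtmlAgilityPack or the original question.
public static class EncodedHtmlWrapper
{
    // Builds the same page skeleton as the question's WrapContentInHtml,
    // but HTML-encodes the user-supplied text before embedding it.
    public static string WrapContentInHtml(string content)
    {
        var htmlBuilder = new StringBuilder();
        htmlBuilder.AppendLine("<!DOCTYPE html>");
        htmlBuilder.AppendLine("<html>");
        htmlBuilder.AppendLine("<head>");
        htmlBuilder.AppendLine("<title></title>");
        htmlBuilder.AppendLine("</head>");
        htmlBuilder.AppendLine("<body><div id='sagsfremstillingContainer'>");

        // The only change: <, >, &, ' and " become entities,
        // so the parser sees text here rather than markup.
        htmlBuilder.AppendLine(WebUtility.HtmlEncode(content));

        htmlBuilder.AppendLine("</div></body></html>");
        return htmlBuilder.ToString();
    }
}
```

The returned string can then be passed to htmlDocument.LoadHtml exactly as in the question; the < from the textarea arrives as &lt; and no longer starts a bogus tag.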


Popular Answer

You can use System.Net.WebUtility.HtmlEncode, which works even without a reference to System.Web.dll (the assembly that provides HttpServerUtility.HtmlEncode).

var input = @"Within this Character Data block I can use double dashes as much as I want (along with <, &, ', and ') *and * % MyParamEntity; will be expanded to the text 'Has been expanded'...however, I can't use the CEND sequence(if I need to use it I must escape one of the brackets or the greater-than sign).";
var htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(System.Net.WebUtility.HtmlEncode(input));
Debug.Assert(!htmlDocument.ParseErrors.Any());

Result:

Within this Character Data block I can use double dashes as much as I want (along with &lt;, &amp;, &#39;, and &#39;) *and * % MyParamEntity; will be expanded to the text &#39;Has been expanded&#39;...however, I can&#39;t use the CEND sequence(if I need to use it I must escape one of the brackets or the greater-than sign).
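The entities in that result are only the markup representation; anything that renders or decodes the HTML turns them back into the original characters. A small sketch to check that nothing is lost, using System.Net.WebUtility.HtmlDecode (the class and method names here are hypothetical):

```csharp
using System;
using System.Net;

// Hypothetical helper class for illustration only.
public static class EncodingRoundTrip
{
    // Returns true when decoding the encoded text gives back the original input.
    public static bool SurvivesRoundTrip(string text)
    {
        string encoded = WebUtility.HtmlEncode(text);    // e.g. "<" becomes "&lt;"
        string decoded = WebUtility.HtmlDecode(encoded); // and "&lt;" becomes "<" again
        return decoded == text;
    }
}
```

Calling EncodingRoundTrip.SurvivesRoundTrip with the sample input from the question should return true, since the special characters it contains (<, & and ') each have an unambiguous entity form.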


Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow