HtmlAgilityPack treats everything after < (less than sign) as attributes

c# html-agility-pack

Question

I take some textarea-based input and turn it into an HTML page, which is then processed into a PDF document.

When a user enters the less-than symbol (<), my whole HTML document breaks. HtmlAgilityPack suddenly treats everything after the less-than symbol as attributes. See the result:

Within this Character Data block I can use double dashes as much as I want (along with <, &,="" ',="" and="" ')="" *and="" *="" %="" myparamentity;="" will="" be="" expanded="" to="" the="" text="" 'has="" been="" expanded'...however,="" i="" can't="" use="" the="" cend="" sequence(if="" i="" need="" to="" use="" it="" i="" must="" escape="" one="" of="" the="" brackets="" or="" the="" greater-than="" sign).="">

It gets a little better if I just add this setting:

htmlDocument.OptionOutputOptimizeAttributeValues = true;

which results in:

Within this Character Data block I can use double dashes as much as I want (along with <, &,= ',= and= ')= *and= *= %= myparamentity;= will= be= expanded= to= the= text= 'has= been= expanded'...however,= i= can't= use= the= cend= sequence(if= i= need= to= use= it= i= must= escape= one= of= the= brackets= or= the= greater-than= sign).=>

I have tried every option on the HtmlDocument, and none of them lets me tell the parser not to be strict. On the other hand, I could live with the < being stripped, but adding all those equal signs does not work for me.

void Main()
{
    var input = @"Within this Character Data block I can use double dashes as much as I want (along with <, &, ', and ') *and * % MyParamEntity; will be expanded to the text 'Has been expanded'...however, I can't use the CEND sequence(if I need to use it I must escape one of the brackets or the greater-than sign).";

    var htmlDoc = WrapContentInHtml(input);

    htmlDoc.DocumentNode.OuterHtml.ToString().Dump();
}

private HtmlDocument WrapContentInHtml(string content)
{
    var htmlBuilder = new StringBuilder();
    htmlBuilder.AppendLine("<!DOCTYPE html>");
    htmlBuilder.AppendLine("<html>");
    htmlBuilder.AppendLine("<head>");
    htmlBuilder.AppendLine("<title></title>");
    htmlBuilder.AppendLine("</head>");
    htmlBuilder.AppendLine("<body><div id='sagsfremstillingContainer'>");
    htmlBuilder.AppendLine(content); 
    htmlBuilder.AppendLine("</div></body></html>");

    var htmlDocument = new HtmlDocument();
    htmlDocument.OptionOutputOptimizeAttributeValues = true;
    var htmlDoc = htmlBuilder.ToString();

    htmlDocument.LoadHtml(htmlDoc);

    return htmlDocument;
}

Does anybody have a suggestion on how I may approach this issue?

This is the closest question I could find: 'less than' indication disappearing from HtmlAgilityPack loadhtml

There, the asker complains about the < disappearing, which I would actually be fine with. The ideal solution, of course, is to fix the parsing error.

EDIT: HtmlAgilityPack 1.4.9 is what I'm using.

1
1
5/23/2017 12:09:33 PM

Accepted Answer

Your input is simply not valid HTML. This is not about being "strict"; it is about the fact that you are passing off a piece of plain text as legitimate HTML. In fact, you are getting these results precisely because the parser is not strict.

When you insert plain text into HTML, you must first encode it so that the HTML control characters are properly turned into HTML: < must be changed to &lt; and & to &amp;.
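As a small illustration of that character mapping, here is a minimal sketch using System.Net.WebUtility from the base class library. The class and method names (HtmlEncodingDemo, EncodeForHtml) are made up for this example and are not part of any library.

```csharp
using System;
using System.Net;

public static class HtmlEncodingDemo
{
    // Hypothetical helper: encodes user-supplied plain text so that it can be
    // placed inside HTML markup without being read as tags or entities.
    public static string EncodeForHtml(string plainText)
    {
        return WebUtility.HtmlEncode(plainText);
    }

    public static void Main()
    {
        string userInput = "a < b & c";
        string encoded = EncodeForHtml(userInput);

        Console.WriteLine(encoded); // a &lt; b &amp; c

        // Decoding gives back the original text, so nothing is lost.
        Console.WriteLine(WebUtility.HtmlDecode(encoded) == userInput); // True
    }
}
```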

One way to handle this is to use the DOM: set InnerText on the target div instead of just concatenating strings and calling the result HTML. Another option is to use an explicit encoding method, such as HttpUtility.HtmlEncode.
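Applied to the WrapContentInHtml method from the question, the explicit-encoding option could look roughly like this. It is only a sketch: it assumes the HtmlAgilityPack package is referenced, the wrapper class name SafeHtmlWrapper is made up, and it uses System.Net.WebUtility.HtmlEncode rather than HttpUtility.HtmlEncode so that no reference to System.Web.dll is needed.

```csharp
using System.Net;
using System.Text;
using HtmlAgilityPack;

public static class SafeHtmlWrapper
{
    // Variant of the question's WrapContentInHtml. The only functional change
    // is that the user text is HTML-encoded before it is appended.
    public static HtmlDocument WrapContentInHtml(string content)
    {
        var htmlBuilder = new StringBuilder();
        htmlBuilder.AppendLine("<!DOCTYPE html>");
        htmlBuilder.AppendLine("<html>");
        htmlBuilder.AppendLine("<head>");
        htmlBuilder.AppendLine("<title></title>");
        htmlBuilder.AppendLine("</head>");
        htmlBuilder.AppendLine("<body><div id='sagsfremstillingContainer'>");

        // "<" becomes "&lt;" and "&" becomes "&amp;", so the parser sees
        // text content rather than the start of a tag.
        htmlBuilder.AppendLine(WebUtility.HtmlEncode(content));

        htmlBuilder.AppendLine("</div></body></html>");

        var htmlDocument = new HtmlDocument();
        htmlDocument.LoadHtml(htmlBuilder.ToString());
        return htmlDocument;
    }
}
```

With this change, the div should contain the encoded text (for example &lt; in place of <) instead of a run of stray attributes.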

3
6/15/2016 11:30:34 AM

Popular Answer

You can use System.Net.WebUtility.HtmlEncode, which works even without a reference to System.Web.dll, the assembly that also provides HttpServerUtility.HtmlEncode.

var input = @"Within this Character Data block I can use double dashes as much as I want (along with <, &, ', and ') *and * % MyParamEntity; will be expanded to the text 'Has been expanded'...however, I can't use the CEND sequence(if I need to use it I must escape one of the brackets or the greater-than sign).";
var htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(System.Net.WebUtility.HtmlEncode(input));
Debug.Assert(!htmlDocument.ParseErrors.Any());

Result:

Within this Character Data block I can use double dashes as much as I want (along with &lt;, &amp;, &#39;, and &#39;) *and * % MyParamEntity; will be expanded to the text &#39;Has been expanded&#39;...however, I can&#39;t use the CEND sequence(if I need to use it I must escape one of the brackets or the greater-than sign).


Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow