HtmlAgilityPack LoadHtml strips the 'less than' sign

c# html html-agility-pack


I have only recently started experimenting with the HtmlAgilityPack. I am not familiar with all of its options, so I suspect I am doing something wrong.

I have a string that has the following data:

string s = "<span style=\"color: #0000FF;\"><</span>";

As you can see, there is a 'less than' sign inside my span. I process this string with the following code:

HtmlDocument htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(s);

However, when I take a quick and dirty look inside the span afterwards, it appears to be empty.

Which option do I need to set in order to keep the 'less than' sign? I have already tried this:

htmlDocument.OptionAutoCloseOnEnd = false;
htmlDocument.OptionCheckSyntax = false;
htmlDocument.OptionFixNestedTags = false;

but without any luck.

I know that the HTML is invalid. I am using the HtmlAgilityPack to fix invalid HTML and to HTML-encode the 'less than' signs.

Could you point me in the right direction? Thank you.

3/24/2011 8:14:33 PM

Accepted Answer

The HTML Agility Pack treats this as an error and creates an instance of the HtmlParseError class for it. You can read all errors through the ParseErrors property of the HtmlDocument class. If you run this code:

    string s = "<span style=\"color: #0000FF;\"><</span>";
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(s);

    // Write out the document as the parser understood it.
    doc.Save(Console.Out);

    // List every problem the parser recorded while loading.
    foreach (HtmlParseError err in doc.ParseErrors)
    {
        Console.WriteLine(" code=" + err.Code);
        Console.WriteLine(" reason=" + err.Reason);
        Console.WriteLine(" text=" + err.SourceText);
        Console.WriteLine(" line=" + err.Line);
        Console.WriteLine(" pos=" + err.StreamPosition);
        Console.WriteLine(" col=" + err.LinePosition);
    }

It prints the following (the corrected markup first, followed by the details of the error):

<span style="color: #0000FF;"></span>

 reason=End tag </> is not required

Since you have all the information you need (including the line, column, and stream position), you can try to work around the error yourself. However, fixing (as opposed to detecting) errors in HTML is generally very hard.
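As a rough sketch of what such a workaround could look like: the helper below takes a list of character offsets and HTML-encodes the `<` found at each one. `ParseErrorFixer` and `EncodeLessThanAt` are hypothetical names, and the code assumes the offsets are 0-based indexes into the original string. Whether the `StreamPosition` values reported by the parser can be used directly as such offsets is an assumption you should verify against your own input first.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

public static class ParseErrorFixer
{
    // Replaces the '<' found at each given offset with "&lt;".
    // ASSUMPTION: offsets are 0-based indexes into the original string.
    public static string EncodeLessThanAt(string html, IEnumerable<int> offsets)
    {
        var sb = new StringBuilder(html);

        // Work from the highest offset down so that the remaining offsets
        // stay valid after each replacement lengthens the string.
        foreach (int offset in offsets.Distinct().OrderByDescending(o => o))
        {
            // Only touch the position if it really holds a '<'.
            if (offset >= 0 && offset < sb.Length && sb[offset] == '<')
            {
                sb.Remove(offset, 1).Insert(offset, "&lt;");
            }
        }

        return sb.ToString();
    }

    public static void Main()
    {
        string s = "<span style=\"color: #0000FF;\"><</span>";

        // 30 is the index of the stray '<' in s. In practice the offsets
        // would come from the parser's error report (assumed, not verified here).
        Console.WriteLine(EncodeLessThanAt(s, new[] { 30 }));
        // prints: <span style="color: #0000FF;">&lt;</span>
    }
}
```

The check on `sb[offset]` keeps the helper from damaging the markup if an offset turns out to point somewhere else.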

4/18/2011 7:29:21 AM

Popular Answer

As mentioned in another answer, the best solution I could find was to pre-parse the HTML and convert orphaned < symbols to their HTML-encoded value &lt;.

return Regex.Replace(html, "<(?![^<]+>)", "&lt;");
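For context, here is a minimal, self-contained sketch built around that one-liner. The class and method names (`HtmlPreParser`, `EncodeOrphanedLessThan`) are made up for illustration; only the regular expression comes from the line above. The pattern matches a `<` that is not followed by one or more non-`<` characters and a closing `>`, in other words a `<` that does not look like the start of a tag.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HtmlPreParser
{
    // A '<' that is NOT followed by "some non-'<' characters and then '>'",
    // i.e. a '<' that does not appear to open a tag.
    private static readonly Regex OrphanedLessThan = new Regex("<(?![^<]+>)");

    // Returns the input with every orphaned '<' replaced by "&lt;".
    public static string EncodeOrphanedLessThan(string html)
    {
        return OrphanedLessThan.Replace(html, "&lt;");
    }

    public static void Main()
    {
        string s = "<span style=\"color: #0000FF;\"><</span>";
        Console.WriteLine(EncodeOrphanedLessThan(s));
        // prints: <span style="color: #0000FF;">&lt;</span>
    }
}
```

The cleaned string can then be passed to LoadHtml. Note that this is a heuristic: in text such as `1 < 2 and 3 > 2` the `<` is followed by a later `>`, so the pattern treats it as a tag start and leaves it alone.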

Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow