Losing the 'less than' sign in HtmlAgilityPack LoadHtml

c# html html-agility-pack

Question

I recently started experimenting with the HtmlAgilityPack. I am not familiar with all of its options, so I think I am doing something wrong.

I have a string with the following content:

    string s = "<span style=\"color: #0000FF;\"><</span>";

As you can see, the span contains a 'less than' sign. I process this string with the following code:

    HtmlDocument htmlDocument = new HtmlDocument();
    htmlDocument.LoadHtml(s);

But when I take a quick look inside the span like this:

    htmlDocument.DocumentNode.ChildNodes[0].InnerHtml

I see that the span is empty.

Which option do I need to set to preserve the 'less than' sign? I have already tried this:

    htmlDocument.OptionAutoCloseOnEnd = false;
    htmlDocument.OptionCheckSyntax = false;
    htmlDocument.OptionFixNestedTags = false;

but with no success.

I know this is invalid HTML. I am using the HtmlAgilityPack to fix invalid HTML and to HTML-encode the 'less than' signs.

Please point me in the right direction. Thanks in advance.

Accepted Answer

The Html Agility Pack detects this as an error and creates an HtmlParseError instance for it. You can read all errors through the ParseErrors property of the HtmlDocument class. So, if you run this code:

    string s = "<span style=\"color: #0000FF;\"><</span>";
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(s);
    doc.Save(Console.Out);

    Console.WriteLine();
    Console.WriteLine();

    foreach (HtmlParseError err in doc.ParseErrors)
    {
        Console.WriteLine("Error");
        Console.WriteLine(" code=" + err.Code);
        Console.WriteLine(" reason=" + err.Reason);
        Console.WriteLine(" text=" + err.SourceText);
        Console.WriteLine(" line=" + err.Line);
        Console.WriteLine(" pos=" + err.StreamPosition);
        Console.WriteLine(" col=" + err.LinePosition);
    }

It will display this (the corrected text first, followed by details about the error):

<span style="color: #0000FF;"></span>

Error
 code=EndTagNotRequired
 reason=End tag </> is not required
 text=<
 line=1
 pos=30
 col=31

So you can try to fix this error, since you have all the required information (including line, column, and stream position), but the general process of fixing (as opposed to detecting) errors in HTML is very complex.
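As a rough illustration of using that information, here is a minimal sketch that encodes each '<' the parser reported. The class and method names are hypothetical. It assumes that StreamPosition is a zero-based index into the original string and that SourceText is "<" for a stray sign, as the output above suggests for this input; other inputs or library versions may report errors differently.

```csharp
using System.Linq;
using System.Text;
using HtmlAgilityPack;

public static class StrayLessThanFixer
{
    // Hypothetical helper: encodes every '<' that the parser reported as an error.
    // Assumption: err.StreamPosition is a zero-based index into the input string.
    public static string EncodeReportedLessThan(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var sb = new StringBuilder(html);

        // Work from the end so earlier positions stay valid after each replacement.
        foreach (var err in doc.ParseErrors.OrderByDescending(e => e.StreamPosition))
        {
            int pos = err.StreamPosition;

            // Only touch positions that really hold a lone '<' in the source.
            if (pos >= 0 && pos < sb.Length && sb[pos] == '<' && err.SourceText == "<")
            {
                sb.Remove(pos, 1).Insert(pos, "&lt;");
            }
        }

        return sb.ToString();
    }
}
```

Calling StrayLessThanFixer.EncodeReportedLessThan(s) and passing the result to LoadHtml should then keep the sign inside the span as &lt;. Treat it as a starting point for this one kind of error, not a general repair routine.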


Popular Answer

As mentioned in another answer, the best solution I found was to pre-parse the HTML and convert orphaned < symbols to their HTML-encoded value, &lt;.

    return Regex.Replace(html, "<(?![^<]+>)", "&lt;");
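For context, here is a self-contained sketch of that replacement wrapped in a helper; the class and method names are made up for illustration. The negative lookahead leaves a < alone when it is followed by one or more non-< characters and then a >, that is, when it looks like the start of a tag, and encodes it otherwise.

```csharp
using System.Text.RegularExpressions;

public static class LessThanEncoder
{
    // Hypothetical helper name; the pattern is the one from the answer above.
    // A '<' is replaced only if it is NOT followed by
    // "one or more non-'<' characters, then '>'".
    private static readonly Regex OrphanedLessThan = new Regex("<(?![^<]+>)");

    public static string EncodeOrphanedLessThan(string html)
    {
        return OrphanedLessThan.Replace(html, "&lt;");
    }
}
```

With the string from the question, LessThanEncoder.EncodeOrphanedLessThan(s) returns <span style="color: #0000FF;">&lt;</span>, which can then be passed to LoadHtml. One limitation: in text such as 1 < 2 and 3 > 2, the < is followed by a later >, so it is taken for a tag and left unencoded.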



Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow