C# HTMLAgilityPack HTML to Text - Parse Errors

c# html-agility-pack html-parsing

Question

I need to extract text from an HTML file using C#. I am trying to use HTMLAgilityPack but I am seeing some parse errors (tags not closed). I am using these two options:

        htmlDoc.OptionFixNestedTags = true;
        htmlDoc.OptionAutoCloseOnEnd = true;

Is there any "Fix all" type option. I don't care about the errors, I just want the content or close.

Accepted Answer

Maybe this is workaround but once I had to extract text from HTML I used regex:

result = Regex.Replace(result, @"<(.|\n)*?>", String.Empty);
result = Regex.Replace(result, @"^\n*", String.Empty, RegexOptions.Singleline | RegexOptions.IgnoreCase);
result = Regex.Replace(result, @"\n*$", String.Empty, RegexOptions.Singleline | RegexOptions.IgnoreCase);
result = result.Replace("\n", " ");



Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow
Is this KB legal? Yes, learn why
Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow
Is this KB legal? Yes, learn why