HTML to Text - Parse Errors in C# HTMLAgilityPack

c# html-agility-pack html-parsing

Question

I need to extract the text from an HTML file in C#. I'm trying to use HtmlAgilityPack, but I keep getting parse errors (tags not closed). These are the two options I'm setting:

        htmlDoc.OptionFixNestedTags = true;
        htmlDoc.OptionAutoCloseOnEnd = true;

Is there some kind of "fix all" option? I don't care about the errors; I just want the text, or something close to it.

9/27/2010 9:35:21 AM

Accepted Answer

When I needed to extract text from HTML, I used regular expressions to strip the tags, which may work for you as a workaround:

// Requires: using System.Text.RegularExpressions;
// Remove anything that looks like a tag; Singleline lets '.' match newlines too.
result = Regex.Replace(result, @"<.*?>", String.Empty, RegexOptions.Singleline);
// Drop leading and trailing newlines, then turn the remaining ones into spaces.
result = result.Trim('\n');
result = result.Replace("\n", " ");
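For reference, the same idea can be wrapped in a small self-contained helper. This is only a sketch under a few assumptions: the class and method names (`HtmlTextSketch`, `StripTags`) are made up for illustration, and the entity decoding and whitespace collapsing steps are additions that are not in the snippet above. It also shares the usual limits of regex stripping: the contents of `<script>` and `<style>` blocks are kept as text, and a literal `<` in the text can be mistaken for the start of a tag.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlTextSketch
{
    // Hypothetical helper; not part of HtmlAgilityPack or of the snippet above.
    public static string StripTags(string html)
    {
        // Remove anything that looks like a tag; Singleline lets '.' match newlines.
        string text = Regex.Replace(html, @"<.*?>", string.Empty, RegexOptions.Singleline);

        // Decode entities such as &amp; (an addition to the original approach).
        text = WebUtility.HtmlDecode(text);

        // Collapse every run of whitespace, newlines included, into a single space.
        return Regex.Replace(text, @"\s+", " ").Trim();
    }
}
```

Because nothing is parsed, unclosed tags are not a problem: `HtmlTextSketch.StripTags("<p>Hello <b>world\n<p>Tom &amp; Jerry")` returns `Hello world Tom & Jerry`.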
9/27/2010 9:42:21 AM


Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow