KeyNotFoundException with using HtmlEntity.DeEntitize() method

c# html-agility-pack keynotfoundexception

Question

I am currently working on a scraper written in C# 4.0. I use variety of tools, including the built-in WebClient and RegEx features of .NET. For a part of my scraper I am parsing a HTML document using HtmlAgilityPack. I got everything to work as I desired and went through some cleanup of the code.

I am using the HtmlEntity.DeEntitize() method to clean up the HTML. I made a few tests and the method seemed to work great. But when I implemented the method in my code I kept getting KeyNotFoundException. There are no further details so I'm pretty lost. My code looks like this:

WebClient client = new WebClient();
string html = HtmlEntity.DeEntitize(client.DownloadString(path));
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);

The HTML downloaded is UTF-8 encoded. How can I get around the KeyNotFound exception?

Popular Answer

Four years later and I have the same problem with some encoded characters (version 1.4.9.5). In my case, there is a limited set of characters that might generate the problem, so I have just created a function to perform the replacements:

// to be called before HtmlEntity.DeEntitize
public static string ReplaceProblematicHtmlEntities(string str)
{
    var sb = new StringBuilder(str);
    //TODO: add other replacements, as needed
    return sb.Replace(".", ".")
        .Replace("ă", "ă")
        .Replace("â", "â")
        .ToString();
}

In my case, the string contains both html-encoded characters and UTF-8 characters, but the problem is related to some encoded characters only.

This is not an elegant solution, but a quick fix for all those text with a limited (and known) amount of problematic encoded characters.



Related

Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow
Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow