I am currently working on a scraper written in C# 4.0. I use a variety of tools, including the built-in WebClient and Regex features of .NET. For part of my scraper I am parsing an HTML document using HtmlAgilityPack. I got everything working as I wanted and then went through some cleanup of the code.
I am using the HtmlEntity.DeEntitize() method to clean up the HTML. I ran a few tests and the method seemed to work great. But when I used the method in my code, I kept getting a KeyNotFoundException. There are no further details, so I'm pretty lost. My code looks like this:
    WebClient client = new WebClient();
    string html = HtmlEntity.DeEntitize(client.DownloadString(path));
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html);
The downloaded HTML is UTF-8 encoded. How can I get around the KeyNotFoundException?
My HTML had a block of text like so:
... found in sections: 233.9 & 517.3; ...
Despite the space and the decimal point, DeEntitize was interpreting & 517.3; as an entity for a Unicode character.
Simply HTML-encoding the raw text before calling DeEntitize fixed the problem for me.
    string raw = "sections: 233.9 & 517.3;";
    // turn '&' into '&amp;', etc., before de-entitizing
    string encoded = System.Web.HttpUtility.HtmlEncode(raw);
    string deEntitized = HtmlEntity.DeEntitize(encoded);
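One thing to keep in mind with the approach above: HtmlEncode also escapes the ampersands of genuine entities (for example, &eacute; becomes &amp;eacute;), so DeEntitize then hands back the original text with those entities still encoded. If the real entities need to be decoded as well, a possible variation is to escape only the ampersands that do not start something shaped like an entity. The following is a minimal sketch under that assumption, not a definitive fix; the class EntityPrep and the helper EscapeStrayAmpersands are hypothetical names, not part of HtmlAgilityPack, and the pattern only checks the shape of a reference, so an unknown name such as &foo; would still be passed through untouched.

```csharp
using System;
using System.Text.RegularExpressions;

static class EntityPrep
{
    // Matches an '&' that is NOT followed by something shaped like a named
    // (&amp;), decimal (&#38;) or hexadecimal (&#x26;) reference ending in ';'.
    private static readonly Regex StrayAmpersand = new Regex(
        @"&(?![A-Za-z][A-Za-z0-9]*;|#[0-9]+;|#[xX][0-9A-Fa-f]+;)");

    // Hypothetical helper: escapes only the stray ampersands and leaves
    // well-formed references alone.
    public static string EscapeStrayAmpersands(string html)
    {
        return StrayAmpersand.Replace(html, "&amp;");
    }

    static void Main()
    {
        string raw = "sections: 233.9 & 517.3; caf&eacute; &#169;";
        string prepared = EscapeStrayAmpersands(raw);

        Console.WriteLine(prepared);
        // prints: sections: 233.9 &amp; 517.3; caf&eacute; &#169;

        // With HtmlAgilityPack referenced, the prepared string could then be
        // passed on for decoding:
        // string text = HtmlEntity.DeEntitize(prepared);
    }
}
```

In the question's code, this would mean calling the helper on the result of client.DownloadString(path) before handing it to HtmlEntity.DeEntitize.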