Encoding question: How can I export from HtmlAgilityPack to a StringWriter while maintaining the encoding?

.net c# encoding html-agility-pack


I am reading HTML in with HtmlAgilityPack, editing it, then outputting it to a StreamWriter. The HtmlAgilityPack encoding is Latin1, and the StreamWriter uses UnicodeEncoding.

I am losing some characters in the conversion, and I do not want to be.

I don't seem to be able to change the Encoding of a StreamWriter. What is the best way around this problem?

7/12/2009 12:26:47 PM

Accepted Answer

If the web page is really Latin-1 (ISO-8859-1), it can't have any curly quotes in it; Latin-1 has no mappings for those characters. If you can see curly quotes when you open the page in your browser, they could be in the form of HTML entities (&ldquo; and &rdquo; or &#8220; and &#8221;). But I suspect the page's encoding is really windows-1252 despite what the headers and embedded declarations say.

windows-1252 is identical to Latin-1 except that it replaces the control characters in the \x80..\x9F range (decimal 128..159) with more useful (or at least prettier) printing characters. If HtmlAgilityPack is taking the page at its word and decoding it as ISO-8859-1, it will convert \x93 to the control character \u0093, which will look like garbage if you can get it to display at all. The browser, meanwhile, will convert it to \u201C, the Unicode code point for the Left Double Quotation Mark.
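To make the difference concrete, here is a minimal sketch that decodes the single byte 0x93 both ways. The class and method names are made up for illustration. It assumes .NET Core or .NET 5+, where code page 1252 only becomes available after registering `CodePagesEncodingProvider`; on the classic .NET Framework that line can usually be dropped.

```csharp
using System;
using System.Text;

public static class DecodeDemo
{
    // Decodes one byte with the given encoding and returns the resulting char.
    public static char DecodeByte(byte value, Encoding encoding)
    {
        return encoding.GetString(new[] { value })[0];
    }

    public static void Main()
    {
        // Assumption: running on .NET Core / .NET 5+, where legacy code pages
        // such as 1252 must be registered before GetEncoding can find them.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        Encoding latin1 = Encoding.GetEncoding("ISO-8859-1");
        Encoding cp1252 = Encoding.GetEncoding(1252);

        // Same byte, two different characters.
        Console.WriteLine("U+{0:X4}", (int)DecodeByte(0x93, latin1)); // U+0093 (control character)
        Console.WriteLine("U+{0:X4}", (int)DecodeByte(0x93, cp1252)); // U+201C (left double quotation mark)
    }
}
```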

I'm not familiar with HtmlAgilityPack and I can't find any docs for it, but I would try to force it to use windows-1252. For example, you could create a windows-1252 (or "ANSI") StreamReader and have HAP use that.
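A rough sketch of the StreamReader idea, using only the base class library so it runs without HAP installed. `ForcedReader` and `ReadAsWindows1252` are hypothetical names. I believe `HtmlDocument.Load` has an overload that takes a `TextReader`, so the reader built here could be handed to it instead of calling `ReadToEnd`, but treat that as an assumption and check it against your HAP version.

```csharp
using System;
using System.IO;
using System.Text;

public static class ForcedReader
{
    // Reads the whole stream as windows-1252, ignoring any byte order mark sniffing.
    public static string ReadAsWindows1252(Stream input)
    {
        // Assumption: .NET Core / .NET 5+, where code page 1252 needs this provider.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding cp1252 = Encoding.GetEncoding(1252);

        // The third argument (false) disables detection from byte order marks.
        using (var reader = new StreamReader(input, cp1252, false))
        {
            // With HtmlAgilityPack, the reader could presumably be passed to
            // HtmlDocument.Load(TextReader) here instead (unverified assumption).
            return reader.ReadToEnd();
        }
    }

    public static void Main()
    {
        // "<p>", 0x93, "Hi", 0x94, "</p>" as raw bytes, the way a windows-1252 page would store them.
        byte[] page = { 0x3C, 0x70, 0x3E, 0x93, 0x48, 0x69, 0x94, 0x3C, 0x2F, 0x70, 0x3E };
        string html = ReadAsWindows1252(new MemoryStream(page));
        Console.WriteLine(html == "<p>\u201CHi\u201D</p>"); // True
    }
}
```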

7/13/2009 2:19:21 AM

Expert Answer

At a guess: write to a Stream (not a string). If you write to a string (including StringWriter/StringBuilder), you are implicitly using .NET's UTF-16 string representation.
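As a sketch of the difference, the snippet below writes the same text through a StreamWriter with an explicit encoding and compares that with what a StringWriter reports. `EncodedOutput` and `WriteWithEncoding` are illustrative names only. HAP's `HtmlDocument.Save` should accept a TextWriter or a Stream plus an Encoding, though that part is an assumption to verify against your version.

```csharp
using System;
using System.IO;
using System.Text;

public static class EncodedOutput
{
    // Writes text through a StreamWriter with an explicit encoding
    // and returns the bytes that ended up in the stream.
    public static byte[] WriteWithEncoding(string text, Encoding encoding)
    {
        using (var stream = new MemoryStream())
        {
            using (var writer = new StreamWriter(stream, encoding))
            {
                writer.Write(text);
            } // disposing the writer flushes it; ToArray still works on a closed MemoryStream
            return stream.ToArray();
        }
    }

    public static void Main()
    {
        string text = "\u201Cquoted\u201D";

        // A StringWriter always reports UTF-16, because it writes into a .NET string.
        var stringWriter = new StringWriter();
        stringWriter.Write(text);
        Console.WriteLine(stringWriter.Encoding.WebName); // utf-16

        // A StreamWriter encodes with whatever it was constructed with.
        byte[] utf8 = WriteWithEncoding(text, new UTF8Encoding(false)); // false = no BOM
        Console.WriteLine(utf8.Length); // 12 (3 bytes per curly quote + 6 ASCII letters)
    }
}
```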

If you just want to tweak the reported encoding (but use a string), then look at Jon's answer here.
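One common way to tweak the reported encoding is to subclass StringWriter and override its `Encoding` property. This is only a sketch of that general technique and may or may not match what the linked answer does; `StringWriterWithEncoding` is a hypothetical name. Note that it only changes what the writer reports: the underlying string is still UTF-16.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper: a StringWriter that reports a caller-chosen encoding.
public sealed class StringWriterWithEncoding : StringWriter
{
    private readonly Encoding _encoding;

    public StringWriterWithEncoding(Encoding encoding)
    {
        _encoding = encoding;
    }

    // Consumers that inspect writer.Encoding see this value instead of UTF-16.
    public override Encoding Encoding
    {
        get { return _encoding; }
    }
}

public static class ReportedEncodingDemo
{
    public static void Main()
    {
        var writer = new StringWriterWithEncoding(Encoding.UTF8);
        writer.Write("\u201Cquoted\u201D");

        Console.WriteLine(writer.Encoding.WebName);   // utf-8
        Console.WriteLine(writer.ToString().Length);  // 8 (the content is still an ordinary .NET string)
    }
}
```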

5/23/2017 12:19:14 PM

Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow