I am reading HTML in with HtmlAgilityPack, editing it, then outputting it to a StreamWriter. The HtmlAgilityPack Encoding is Latin1, and the StreamWriter is UnicodeEncoding.
I am losing some characters in the conversion, and I do not want to be.
I don't seem to be able to change the Encoding of a StreamWriter. What is the best way around this problem?
If the web page is really Latin-1 (ISO-8859-1), it can't have any curly quotes in it; Latin-1 has no mappings for those characters. If you can see curly quotes when you open the page in your browser, they could be in the form of HTML entities (such as &rdquo;). But I suspect the page's encoding is really windows-1252 despite what the headers and embedded declarations say.
windows-1252 is identical to Latin-1 except that it replaces most of the control characters in the \x80..\x9F range (decimal 128..159) with more useful (or at least prettier) printing characters. If HtmlAgilityPack is taking the page at its word and decoding it as ISO-8859-1, it will convert \x93 to the control character \u0093, which will look like garbage if you can get it to display at all. The browser, meanwhile, will convert it to \u201C, the Unicode code point for the Left Double Quotation Mark.
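To see the difference for yourself, here is a minimal sketch that decodes the single byte \x93 both ways. The class and method names are made up for illustration, and it assumes .NET Core or .NET 5+, where windows-1252 only becomes available after registering the code-pages provider (on the classic .NET Framework that registration line should not be needed).

```csharp
using System;
using System.Text;

public static class QuoteDecodingDemo
{
    // Hypothetical helper: decodes one byte with the named encoding
    // and returns the resulting code point formatted as "U+XXXX".
    public static string DecodeToCodePoint(byte value, string encodingName)
    {
        // Assumption: running on .NET Core / .NET 5+, where legacy code pages
        // such as windows-1252 must be registered before use.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        string text = Encoding.GetEncoding(encodingName).GetString(new[] { value });
        return "U+" + ((int)text[0]).ToString("X4");
    }

    public static void Main()
    {
        Console.WriteLine(DecodeToCodePoint(0x93, "iso-8859-1"));   // U+0093 (control character)
        Console.WriteLine(DecodeToCodePoint(0x93, "windows-1252")); // U+201C (left double quote)
    }
}
```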
I'm not familiar with HtmlAgilityPack and I can't find any docs for it, but I would try to force it to use windows-1252. For example, you could create a windows-1252 (or "ANSI") StreamReader and have HAP use that.
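A rough sketch of that idea, using only the standard library so it runs on its own: wrap the raw bytes in a StreamReader that is told to decode as windows-1252, whatever the page claims about itself. The Windows1252Reader class is a hypothetical name, the code-pages registration again assumes .NET Core or .NET 5+, and the commented-out HtmlAgilityPack lines assume its HtmlDocument.Load overload that takes a TextReader, so check that against the version you have.

```csharp
using System;
using System.IO;
using System.Text;

public static class Windows1252Reader
{
    // Hypothetical helper: returns a reader that decodes the stream as
    // windows-1252 regardless of what the document declares.
    public static StreamReader Open(Stream rawHtml)
    {
        // Assumption: .NET Core / .NET 5+, where code page 1252 must be registered first.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding windows1252 = Encoding.GetEncoding(1252);

        // false: do not let a byte-order mark override the chosen encoding.
        return new StreamReader(rawHtml, windows1252, false);
    }

    public static void Main()
    {
        // A curly-quoted "Hi" as windows-1252 bytes.
        byte[] page = { 0x93, 0x48, 0x69, 0x94 };

        using (var reader = Open(new MemoryStream(page)))
        {
            // With HtmlAgilityPack referenced, the reader could be handed over instead:
            //   var doc = new HtmlAgilityPack.HtmlDocument();
            //   doc.Load(reader);
            string html = reader.ReadToEnd();
            Console.WriteLine(html == "\u201CHi\u201D"); // True
        }
    }
}
```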
At a guess: write to a Stream (not a string). If you write to a StringBuilder, you are implicitly using .NET's UTF-16 string.
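On the writing side, a StreamWriter's encoding is fixed when it is constructed, so the usual way to "change" it is to pass the encoding you want to the constructor. A minimal sketch, with a made-up ExplicitEncodingWriter helper and a MemoryStream standing in for whatever stream you are really writing to:

```csharp
using System;
using System.IO;
using System.Text;

public static class ExplicitEncodingWriter
{
    // Hypothetical helper: writes text to an in-memory stream with the
    // given encoding and returns the bytes that were produced.
    public static byte[] Write(string text, Encoding encoding)
    {
        using (var stream = new MemoryStream())
        {
            // The encoding is chosen here, at construction time.
            using (var writer = new StreamWriter(stream, encoding))
            {
                writer.Write(text);
            }
            // ToArray still works after the writer has closed the stream.
            return stream.ToArray();
        }
    }

    public static void Main()
    {
        // UTF8Encoding(false) omits the byte-order mark.
        byte[] utf8 = Write("\u201C", new UTF8Encoding(false));
        Console.WriteLine(BitConverter.ToString(utf8)); // E2-80-9C
    }
}
```

Both UTF-8 and UTF-16 (UnicodeEncoding) can represent the curly quote, so if characters are still being lost with a writer like this, the damage is more likely happening when the page is read.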
If you just want to tweak the reported encoding (but use a string), then look at Jon's answer here.