Question about Encodings: How can I output from HtmlAgilityPack to a StringWriter and keep the encoding?

.net c# encoding html-agility-pack

Question

I am reading HTML in with HtmlAgilityPack, editing it, then outputting it to a StringWriter. The HtmlAgilityPack Encoding is Latin1, and the StringWriter's is UnicodeEncoding.

I am losing some characters in the conversion, and I do not want to.

I don't seem to be able to change the Encoding of a StringWriter. What is the best way around this problem?

Accepted Answer

If the web page is really Latin-1 (ISO-8859-1), it can't have any curly quotes in it; Latin-1 has no mappings for those characters. If you can see curly quotes when you open the page in your browser, they could be in the form of HTML entities (&ldquo; and &rdquo;, or &#8220; and &#8221;). But I suspect the page's encoding is really windows-1252 despite what the headers and embedded declarations say.

windows-1252 is identical to Latin-1 except that it replaces the control characters in the \x80..\x9F range (decimal 128..159) with more useful (or at least prettier) printing characters. If HtmlAgilityPack is taking the page at its word and decoding it as ISO-8859-1, it will convert \x93 to the control character \u0093, which will look like garbage if you can get it to display at all. The browser, meanwhile, will convert it to \u201C, the Unicode code point for the Left Double Quotation Mark.
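The difference is easy to check without HtmlAgilityPack at all. The following is a minimal sketch that decodes the same bytes both ways; the class and method names are made up for illustration, and the RegisterProvider call assumes .NET Core or .NET 5+ (or .NET Framework 4.6+), where windows-1252 comes from the code-pages provider.

```csharp
using System;
using System.Text;

public static class QuoteByteDemo
{
    // Decodes raw bytes with the named encoding and returns the resulting string.
    public static string Decode(byte[] raw, string encodingName)
    {
        // On .NET Core / .NET 5+, windows-1252 is only available once the
        // code-pages provider has been registered.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(encodingName).GetString(raw);
    }

    public static void Main()
    {
        // 0x93 and 0x94 are the windows-1252 curly quotes around "Hi".
        byte[] raw = { 0x93, 0x48, 0x69, 0x94 };

        foreach (string name in new[] { "iso-8859-1", "windows-1252" })
        {
            string text = Decode(raw, name);
            Console.WriteLine("{0}: first char is U+{1:X4}", name, (int)text[0]);
        }
        // iso-8859-1: first char is U+0093
        // windows-1252: first char is U+201C
    }
}
```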

I'm not familiar with HtmlAgilityPack and I can't find any docs for it, but I would try to force it to use windows-1252. For example, you could create a windows-1252 (or "ANSI") StreamReader and have HAP use that.
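As a rough sketch of that suggestion, the reader can do the decoding and hand already-decoded text to HAP. This assumes the HtmlAgilityPack NuGet package and its HtmlDocument.Load overload that takes a TextReader; Cp1252Loader is a made-up name, and behaviour may differ between HAP versions, so treat it as a starting point rather than a confirmed fix.

```csharp
using System.IO;
using System.Text;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is referenced

public static class Cp1252Loader
{
    // Loads an HTML file, decoding its bytes as windows-1252 regardless of
    // what the page's own headers or meta tags declare.
    public static HtmlDocument Load(string path)
    {
        // Needed on .NET Core / .NET 5+ before windows-1252 can be looked up.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding cp1252 = Encoding.GetEncoding("windows-1252");

        var doc = new HtmlDocument();
        using (var reader = new StreamReader(path, cp1252))
        {
            // The StreamReader turns bytes into characters; HAP parses the text.
            doc.Load(reader);
        }
        return doc;
    }
}
```

Calling Cp1252Loader.Load("page.html") (a placeholder path) should then give a document in which byte \x93 has become \u201C rather than the control character \u0093.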


Expert Answer

At a guess: write to a Stream (not a string). If you write to a string (including StringWriter/StringBuilder), you are implicitly using .NET's UTF-16 string representation.
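To illustrate the distinction, here is a small sketch using only the base class library. A StringWriter always reports UTF-16, whereas a StreamWriter over a Stream produces bytes in whatever encoding you pass it. StreamOutputDemo and Encode are made-up names; with HtmlAgilityPack you would presumably pass such a writer or stream to one of the HtmlDocument.Save overloads, which is worth checking against the version you use.

```csharp
using System;
using System.IO;
using System.Text;

public static class StreamOutputDemo
{
    // Writes the text through a StreamWriter so the resulting bytes use the
    // requested encoding, then returns those bytes.
    public static byte[] Encode(string html, Encoding encoding)
    {
        using (var stream = new MemoryStream())
        {
            using (var writer = new StreamWriter(stream, encoding))
            {
                writer.Write(html);
            } // disposing the writer flushes it into the stream

            return stream.ToArray();
        }
    }

    public static void Main()
    {
        string html = "<p>caf\u00E9</p>";

        // A StringWriter has no byte encoding to choose; it reports UTF-16.
        var viaString = new StringWriter();
        Console.WriteLine(viaString.Encoding.WebName); // utf-16

        // Latin-1 stores each of the 11 characters as a single byte.
        byte[] latin1 = Encode(html, Encoding.GetEncoding("iso-8859-1"));
        Console.WriteLine(latin1.Length); // 11
    }
}
```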

If you just want to tweak the reported encoding (but use a string), then look at Jon's answer here.
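The linked answer is not reproduced in this copy, so the following is only a guess at the kind of approach meant: a common trick is to subclass StringWriter and override its Encoding property. StringWriterWithEncoding is a hypothetical helper name, and only the reported encoding changes; the contents remain an ordinary UTF-16 .NET string.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper: a StringWriter that reports a caller-chosen encoding.
public sealed class StringWriterWithEncoding : StringWriter
{
    private readonly Encoding _encoding;

    public StringWriterWithEncoding(Encoding encoding)
    {
        _encoding = encoding;
    }

    // Only the reported value changes; the buffer is still a UTF-16 string.
    public override Encoding Encoding
    {
        get { return _encoding; }
    }
}

public static class ReportedEncodingDemo
{
    public static void Main()
    {
        using (var writer = new StringWriterWithEncoding(Encoding.UTF8))
        {
            writer.Write("<p>\u201Cquoted\u201D</p>");
            Console.WriteLine(writer.Encoding.WebName); // utf-8
        }
    }
}
```

This mainly helps when a serializer consults the writer's Encoding to decide what to put in a declaration; it does not convert any characters.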




Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow