Encoding in HTML using HtmlAgilityPack

encoding html-agility-pack

Question

I have a question about Chinese encoding and saving back to a file. I am using HtmlAgilityPack to parse an HTML file, modify it, and save it back to the same file. I am having a problem with encodings such as Chinese Simplified (GB2312). When I open the file, I read its encoding, and then I save the document back with HtmlAgilityPack using that encoding:

doc.Save(this._filePath, reader.CurrentEncoding);

but the Chinese characters get completely mangled. Any ideas on how I can save back to the same file and keep its current encoding? I also tried getting the encoding from HtmlAgilityPack like this:

FileStream fs = new FileStream(this._filePath, FileMode.Open);

StreamReader reader = new StreamReader(fs);

HtmlDocument doc = new HtmlDocument();
doc.Load(reader);

Encoding enc = doc.DeclaredEncoding;

fs.Close();

doc.Save(this._filePath, enc);

but that didn't work either. Any ideas?

Accepted Answer

After some work, I got it working by reading the declared encoding out of the meta tag. I initially thought the tag was badly formed, but it was actually correct, and DeclaredEncoding did read the encoding from it.

When the file was saved, it was still saved in ANSI format, and I couldn't seem to change that. However, the encoding in the meta tag did seem to keep the file in check when it rendered in the browser. Hope that helps someone.
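
For anyone who lands here with the same problem, a possible cause is that a StreamReader created from a stream without an explicit encoding decodes as UTF-8 (unless it finds a byte order mark), so the GB2312 bytes may already be damaged before the parser sees them, and reader.CurrentEncoding then reports UTF-8 rather than the file's real encoding. Below is a minimal sketch of loading and saving with the same encoding. It is not the code from the question or the answer: the class and method names are hypothetical, the fallback to "gb2312" is an assumption based on the file described in the question, and the RegisterProvider call is only relevant on .NET Core / .NET 5+ with the System.Text.Encoding.CodePages package available.

```csharp
using System.Text;
using HtmlAgilityPack;

// Hypothetical helper for illustration; not part of HtmlAgilityPack.
public static class HtmlEncodingRoundTrip
{
    public static void RewriteInPlace(string filePath)
    {
        // Assumption: on .NET Core / .NET 5+, legacy code pages such as GB2312
        // are only available after registering this provider.
        // On .NET Framework this line should not be needed.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // Ask HtmlAgilityPack for the encoding declared in the file's meta tag.
        // DetectEncoding can return null when nothing is declared, so fall back
        // to GB2312 here (an assumption that matches the file in the question).
        var probe = new HtmlDocument();
        Encoding encoding = probe.DetectEncoding(filePath)
                            ?? Encoding.GetEncoding("gb2312");

        // Load with that encoding so the bytes are decoded correctly up front,
        // instead of going through a default (UTF-8) StreamReader.
        var doc = new HtmlDocument();
        doc.Load(filePath, encoding);

        // ... modify doc.DocumentNode here ...

        // Save with the same encoding that was used to read the file.
        doc.Save(filePath, encoding);
    }
}
```

If this works for your file, the saved bytes should be GB2312 again. Note that editors such as Notepad tend to label any legacy code page as "ANSI", so seeing "ANSI" after saving does not necessarily mean the encoding was lost.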




Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow
Is this KB legal? Yes, learn why