Encoding issue with C# and HtmlAgilityPack

c# encoding html-agility-pack

Question

WebClient GodLikeClient = new WebClient();
HtmlAgilityPack.HtmlDocument GodLikeHTML = new HtmlAgilityPack.HtmlDocument();

GodLikeHTML.Load(GodLikeClient.OpenRead("http://www.alfa.lt"));

So this code returns the page title with the Lithuanian letters garbled, instead of the expected "Skaitytojo klausimas psichologui: kas lemia homosexualum... - Naujien... portalas Alfa.lt".

This website is encoded in code page 1257 (Baltic), but textBox1.Text = GodLikeHTML.DocumentNode.OuterHtml; returns garbled text, with the Baltic diacritics turned into strange, long strings of characters :(

I have tried the HtmlAgilityPack forums, too. They are awful.

P.S. I am not a programmer, but I am working on a community project and I need to get this code working. Thank you.

8/10/2010 6:51:48 PM

Accepted Answer

Actually, the page is encoded in UTF-8.

GodLikeHTML.Load(GodLikeClient.OpenRead("http://www.alfa.lt"), Encoding.UTF8);

will work.
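To see why the wrong encoding produces those long strings, here is a minimal sketch that decodes UTF-8 bytes with a single-byte encoding. The class and method names are made up for illustration. ISO-8859-1 stands in for the Baltic code page only because it is available on every .NET runtime without registering extra encoding providers.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of HtmlAgilityPack or WebClient.
public static class MojibakeDemo
{
    // Encodes the text as UTF-8, then decodes those bytes with another
    // encoding. This mimics reading a UTF-8 page with the wrong charset.
    public static string DecodeUtf8BytesAs(string text, Encoding assumed)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
        return assumed.GetString(utf8Bytes);
    }
}
```

For example, MojibakeDemo.DecodeUtf8BytesAs("\u0173", Encoding.GetEncoding("iso-8859-1")) turns the single Lithuanian letter U+0173 into two unrelated characters, because that letter takes two bytes in UTF-8 and a single-byte encoding maps each byte to its own character. Passing Encoding.UTF8 instead returns the letter unchanged.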

Or you could use the code in my SO answer, which detects the encoding from the HTTP headers or meta tags and re-encodes correctly. (It also supports gzip to cut down on download size.)
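As a rough illustration of the header half of that idea only, the sketch below picks an Encoding from a Content-Type header value. It is not the HttpDownloader class from the linked answer: the CharsetSniffer name and the fallback parameter are assumptions, and meta-tag detection and gzip are left out.

```csharp
using System;
using System.Net.Http.Headers;
using System.Text;

// Hypothetical helper; the names here are illustrative only.
public static class CharsetSniffer
{
    // Returns the encoding named by the charset parameter of a Content-Type
    // value such as "text/html; charset=utf-8", or the fallback when the
    // header is missing, malformed, or names an unsupported charset.
    public static Encoding FromContentType(string contentType, Encoding fallback)
    {
        if (string.IsNullOrWhiteSpace(contentType))
            return fallback;

        if (!MediaTypeHeaderValue.TryParse(contentType, out MediaTypeHeaderValue parsed))
            return fallback;

        // Some servers quote the charset value, so strip the quotes.
        string charset = parsed.CharSet?.Trim('"');
        if (string.IsNullOrEmpty(charset))
            return fallback;

        try
        {
            return Encoding.GetEncoding(charset);
        }
        catch (ArgumentException)
        {
            // Unknown or unsupported charset name.
            return fallback;
        }
    }
}
```

The resulting Encoding could then be passed as the second argument to HtmlDocument.Load, in place of the hard-coded Encoding.UTF8 above.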

With the HttpDownloader class from that answer, your code would look like this:

HttpDownloader downloader = new HttpDownloader("http://www.alfa.lt",null,null);
GodLikeHTML.LoadHtml(downloader.GetPage());
5/23/2017 12:25:34 PM

Popular Answer

I had a similar encoding problem. With the most recent version of HTML Agility Pack, I was able to fix it by adding the following to my HtmlWeb setup.

var htmlWeb = new HtmlWeb();
htmlWeb.OverrideEncoding = Encoding.UTF8;
var doc = htmlWeb.Load("http://www.alfa.lt");



Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow