Why am I picking up foreign characters and how can I remove them?

c# html html-agility-pack string

Question

I am picking up extra characters (Â) compared to the source when I grab the InnerText of an H3 tag using the HTML Agility Pack.

I am not sure where these characters are coming from or how to remove them.

Extracted String:

Â Week 1

HTML Source:

<h3>
<span> </span>Week 1</h3>

Current Code:

private void getWeekNumber(string url)
{
    HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();

    htmlDoc.Load(new System.IO.StringReader(url));

    foreach (HtmlAgilityPack.HtmlNode h3 in htmlDoc.DocumentNode.SelectNodes("//h3"))
    {
        MessageBox.Show(h3.InnerText);
    }
}

Current Workaround (Stolen from somewhere on stackoverflow, lost the link):

HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);

request.Method = "GET";

string result;

// Read the response body, decoding the bytes explicitly as UTF-8.
using (var response = request.GetResponse())
using (var stream = response.GetResponseStream())
using (var reader = new System.IO.StreamReader(stream, Encoding.UTF8))
{
    result = reader.ReadToEnd();
}

HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();

htmlDoc.Load(new System.IO.StringReader(result));

foreach (HtmlAgilityPack.HtmlNode h3 in htmlDoc.DocumentNode.SelectNodes("//h3"))
{
    MessageBox.Show(h3.InnerText);
}

Accepted Answer

You need to set the encoding at the point where you load the document:

htmlDoc.Load(new System.IO.StringReader(url), Encoding.UTF8);

This tells the HTML Agility Pack that the characters are UTF-8 rather than some other encoding.

The reason you need to do it here is that this is the point at which the text is parsed incorrectly. After this, you are storing the literal Â characters.
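To see where the Â comes from, here is a minimal sketch that reproduces it without the HTML Agility Pack. It assumes the span in the page contains a non-breaking space (U+00A0) and that the bytes were decoded with a single-byte encoding such as ISO-8859-1. Neither assumption is confirmed by the question, and the class and method names are made up for illustration.

```csharp
using System;
using System.Text;

static class MojibakeDemo
{
    // Encodes text as UTF-8, then decodes those bytes with the wrong
    // single-byte encoding (ISO-8859-1) to reproduce the stray character.
    public static string DecodeUtf8AsLatin1(string text)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
        return Encoding.GetEncoding("ISO-8859-1").GetString(utf8Bytes);
    }

    static void Main()
    {
        // Assumption: the <span> in the source holds a non-breaking space (U+00A0).
        string original = "\u00A0Week 1";
        string garbled = DecodeUtf8AsLatin1(original);

        // Print code points so that invisible characters show up.
        foreach (char c in garbled)
            Console.Write("U+{0:X4} ", (int)c);
        Console.WriteLine();
    }
}
```

A non-breaking space is two bytes in UTF-8 (0xC2 0xA0). Read one byte at a time, 0xC2 becomes Â and 0xA0 stays a non-breaking space, so the garbled string starts with U+00C2 U+00A0. This is why the fix has to happen where the bytes are first turned into text.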

The related question "Characters in string changed after downloading HTML from the internet" may also be of interest.


Popular Answer

It may be your character encoding; try setting the encoding to UTF-8.




Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow
Is this KB legal? Yes, learn why