I am picking up extra characters (Â) compared to the source when I grab the InnerText of an H3 tag using the HTML Agility Pack.
I am not sure where these characters are coming from or how to remove them.
Extracted String:
Â Week 1
HTML Source:
<h3>
<span> </span>Week 1</h3>
Current Code:
private void getWeekNumber(string url)
{
    HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
    htmlDoc.Load(new System.IO.StringReader(url));
    foreach (HtmlAgilityPack.HtmlNode h3 in htmlDoc.DocumentNode.SelectNodes("//h3"))
    {
        MessageBox.Show(h3.InnerText);
    }
}
Current Workaround (Stolen from somewhere on stackoverflow, lost the link):
string result;
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
request.Method = "GET";
// Download the page and decode the response bytes explicitly as UTF-8.
using (var stream = request.GetResponse().GetResponseStream())
using (var reader = new System.IO.StreamReader(stream, Encoding.UTF8))
{
    result = reader.ReadToEnd();
}
// Parse the already-decoded HTML text.
HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
htmlDoc.Load(new System.IO.StringReader(result));
foreach (HtmlAgilityPack.HtmlNode h3 in htmlDoc.DocumentNode.SelectNodes("//h3"))
{
    MessageBox.Show(h3.InnerText);
}
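You need to set the encoding at the point where the document is loaded from its raw bytes, for example when loading from the response stream:
htmlDoc.Load(stream, Encoding.UTF8);
This tells the Agility Pack that the bytes are UTF-8 rather than some other encoding. A StringReader already holds decoded text, so the encoding has to be supplied where the bytes are read.
The reason you need to do it here is that this is the point at which the text is parsed incorrectly. After this you are storing the literal Â characters.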
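To see where the Â can come from, here is a minimal sketch that reproduces the effect without any network access or the Agility Pack. It assumes (the question does not confirm this) that the span holds a non-breaking space, U+00A0, and that the page was decoded as ISO-8859-1 instead of UTF-8. The class and method names are made up for illustration.

```csharp
using System;
using System.Text;

static class MojibakeDemo
{
    // Decode raw bytes using the named encoding (hypothetical helper).
    public static string DecodeAs(byte[] bytes, string encodingName)
    {
        return Encoding.GetEncoding(encodingName).GetString(bytes);
    }

    static void Main()
    {
        // Assumed source text: a non-breaking space followed by "Week 1".
        string source = "\u00A0Week 1";

        // In UTF-8 the non-breaking space is the two bytes 0xC2 0xA0.
        byte[] bytes = Encoding.UTF8.GetBytes(source);

        // Wrong encoding: 0xC2 becomes "Â" (U+00C2), 0xA0 stays a non-breaking space.
        string wrong = DecodeAs(bytes, "ISO-8859-1");

        // Right encoding: the original text comes back unchanged.
        string right = DecodeAs(bytes, "utf-8");

        Console.WriteLine(wrong.StartsWith("\u00C2\u00A0", StringComparison.Ordinal)); // True
        Console.WriteLine(right == source);                                            // True
    }
}
```

Once the bytes have been decoded the wrong way, the Â is an ordinary character in the string, which is why the encoding has to be right at load time rather than patched afterwards.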
The question "Characters in string changed after downloading HTML from the internet" may also be of interest.