如何在C#中获得HTML编码?

c# encoding html html-agility-pack webclient

我正试图从网络词典中获取某些单词的发音。例如,在下面的代码中,我想从http://collinsdictionary.com获得good发音

(此处使用HTTP Agility Pack

static void test()
{
    String url = "http://www.collinsdictionary.com/dictionary/english/good";
    WebClient client = new WebClient();
    client.Encoding = System.Text.Encoding.UTF8;
    String html = client.DownloadString(url);
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(html);
    HtmlAgilityPack.HtmlNode node = doc.DocumentNode.SelectSingleNode("//*[@id=\"good_1\"]/div[1]/h2/span/text()[1]");
    if (node == null)
    {
        Console.WriteLine("XPath not found.");
    }
    else
    {
        Console.WriteLine(node.WriteTo());
    }
}

我在期待

static void test()
{
    String url = "http://www.collinsdictionary.com/dictionary/english/good";
    WebClient client = new WebClient();
    client.Encoding = System.Text.Encoding.UTF8;
    String html = client.DownloadString(url);
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(html);
    HtmlAgilityPack.HtmlNode node = doc.DocumentNode.SelectSingleNode("//*[@id=\"good_1\"]/div[1]/h2/span/text()[1]");
    if (node == null)
    {
        Console.WriteLine("XPath not found.");
    }
    else
    {
        Console.WriteLine(node.WriteTo());
    }
}

但我最多能得到的是

static void test()
{
    String url = "http://www.collinsdictionary.com/dictionary/english/good";
    WebClient client = new WebClient();
    client.Encoding = System.Text.Encoding.UTF8;
    String html = client.DownloadString(url);
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(html);
    HtmlAgilityPack.HtmlNode node = doc.DocumentNode.SelectSingleNode("//*[@id=\"good_1\"]/div[1]/h2/span/text()[1]");
    if (node == null)
    {
        Console.WriteLine("XPath not found.");
    }
    else
    {
        Console.WriteLine(node.WriteTo());
    }
}

如何做到对了?

一般承认的答案

问题不在于解析文本,而是控制台输出存在问题。如果从命令行应用程序执行此操作,则可以将控制台的输出编码设置为unicode:

Console.OutputEncoding = System.Text.Encoding.Unicode;

您还需要确保控制台中的字体是具有unicode支持的字体。有关详细信息,请参阅此答案


热门答案

如果您知道页面编码(例如System.Text.Encoding.UTF8);

string html = DownloadSmallFiles_String(url, System.Text.Encoding.UTF8, 20000);

或使用自动编码检测(取决于服务器响应)

string html = DownloadSmallFiles_String(url, System.Text.Encoding.UTF8, 20000);

最后加载html

string html = DownloadSmallFiles_String(url, System.Text.Encoding.UTF8, 20000);

试试下面的代码

string html = DownloadSmallFiles_String(url, System.Text.Encoding.UTF8, 20000);



许可下: CC-BY-SA with attribution
不隶属于 Stack Overflow
这个KB合法吗? 是的,了解原因
许可下: CC-BY-SA with attribution
不隶属于 Stack Overflow
这个KB合法吗? 是的,了解原因