C# - Html Agility Pack - can't read from web

c# html-agility-pack

Question

I'm trying to write a small program that reads content from a Wikipedia page. To get the HTML, I found this code elsewhere on SO:

        HtmlDocument doc = new HtmlDocument();
        StringBuilder output = new StringBuilder();

        doc.LoadHtml("http://en.wikipedia.org/wiki/The Metamorphosis of Prime Intellect");
        var text = doc.DocumentNode.SelectNodes("//body//text()").Select(node => node.InnerText);

        foreach (string line in text)
            output.AppendLine(line);

        string textOnly = HttpUtility.HtmlDecode(output.ToString());

        Console.WriteLine(textOnly);

However, I'm getting a runtime error "ArgumentNullException was unhandled", and this line is highlighted:

        var text = doc.DocumentNode.SelectNodes("//body//text()").Select(node => node.InnerText);

Does anyone see the problem?

9/2/2013 9:59:59 PM

Popular Answer

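doc.LoadHtml takes an HTML string, not a URL, so the URL itself is parsed as if it were markup. The resulting document has no body element, SelectNodes returns null when nothing matches, and calling Select on that null is what throws the ArgumentNullException. To download the page, you can use the HtmlAgilityPack.HtmlWeb class: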

var web = new HtmlAgilityPack.HtmlWeb();
var doc = web.Load("http://en.wikipedia.org/wiki/The Metamorphosis of Prime Intellect");

var text = doc.DocumentNode.SelectNodes("//body//text()").Select(node => node.InnerText);
var output = String.Join("\n", text);

SelectNodes returns 622 items in my test.
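
Because SelectNodes returns null rather than an empty collection when the XPath matches nothing, it may be worth guarding the call. The following is a minimal sketch, not part of the original answer: the BodyText class and its Extract method are hypothetical names introduced for illustration, and WebUtility.HtmlDecode is used here in place of HttpUtility.HtmlDecode on the assumption that System.Web is not referenced.

```csharp
using System;
using System.Linq;
using System.Net;
using HtmlAgilityPack;

static class BodyText
{
    // Hypothetical helper: returns the decoded text found under <body>,
    // or an empty string when the XPath matches nothing.
    public static string Extract(HtmlDocument doc)
    {
        // SelectNodes returns null (not an empty collection) on no match.
        var nodes = doc.DocumentNode.SelectNodes("//body//text()");
        if (nodes == null)
            return string.Empty;

        var lines = nodes.Select(node => WebUtility.HtmlDecode(node.InnerText));
        return string.Join("\n", lines);
    }

    static void Main()
    {
        // The mistake from the question: a URL passed to LoadHtml is parsed as markup.
        var fromUrlString = new HtmlDocument();
        fromUrlString.LoadHtml("http://en.wikipedia.org/wiki/The Metamorphosis of Prime Intellect");
        Console.WriteLine(Extract(fromUrlString).Length); // no <body>, so nothing is extracted

        // Real markup passed to LoadHtml works without any network access.
        var fromMarkup = new HtmlDocument();
        fromMarkup.LoadHtml("<html><body><p>Hello &amp; goodbye</p></body></html>");
        Console.WriteLine(Extract(fromMarkup));
    }
}
```

With this guard in place, the question's original mistake would produce empty output instead of an exception, which can make the underlying cause (no page was actually downloaded) easier to notice. A document returned by HtmlWeb.Load can be passed to the same helper.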

9/2/2013 10:13:31 PM


Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow