I'm trying to write a small program that reads content from a Wikipedia page. To get the HTML, I found this code elsewhere on SO:
HtmlDocument doc = new HtmlDocument();
StringBuilder output = new StringBuilder();
doc.LoadHtml("http://en.wikipedia.org/wiki/The Metamorphosis of Prime Intellect");
var text = doc.DocumentNode.SelectNodes("//body//text()").Select(node => node.InnerText);
foreach (string line in text)
    output.AppendLine(line);
string textOnly = HttpUtility.HtmlDecode(output.ToString());
Console.WriteLine(textOnly);
However, I'm getting a runtime error "ArgumentNullException was unhandled", and this line is highlighted:
var text = doc.DocumentNode.SelectNodes("//body//text()").Select(node => node.InnerText);
Does anyone see the problem?
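doc.LoadHtml takes an HTML string, not a URL. Your code parses the URL text itself as markup, so there is no body element, SelectNodes returns null, and calling Select on that null throws the ArgumentNullException. To download the page, you can use the HtmlAgilityPack.HtmlWeb class: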
var web = new HtmlAgilityPack.HtmlWeb();
var doc = web.Load("http://en.wikipedia.org/wiki/The Metamorphosis of Prime Intellect");
var text = doc.DocumentNode.SelectNodes("//body//text()").Select(node => node.InnerText);
var output = String.Join("\n", text);
SelectNodes returns 622 items in my test.
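Since SelectNodes returns null rather than an empty collection when nothing matches, it may be worth guarding against that case. Below is a minimal sketch, assuming HtmlAgilityPack is referenced; the PageText class and BodyText helper are hypothetical names introduced for illustration and are not part of the library. Trimming and skipping blank lines is my own choice, so the number of lines printed will differ from the raw node count.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using HtmlAgilityPack;

static class PageText
{
    // Hypothetical helper (not part of HtmlAgilityPack): returns the decoded,
    // non-blank text nodes under <body>, or an empty sequence if none match.
    public static IEnumerable<string> BodyText(HtmlDocument doc)
    {
        // SelectNodes returns null, not an empty collection, when no node matches.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//body//text()");
        if (nodes == null)
            return Enumerable.Empty<string>();

        return nodes
            .Select(node => WebUtility.HtmlDecode(node.InnerText).Trim())
            .Where(line => line.Length > 0);
    }

    static void Main()
    {
        // LoadHtml parses its argument as markup, so a URL passed here is
        // treated as plain text with no <body>; the guard yields no lines.
        var wrong = new HtmlDocument();
        wrong.LoadHtml("http://en.wikipedia.org/wiki/The Metamorphosis of Prime Intellect");
        Console.WriteLine(BodyText(wrong).Count());

        // HtmlWeb.Load downloads the page before parsing it.
        var web = new HtmlWeb();
        HtmlDocument page = web.Load("http://en.wikipedia.org/wiki/The Metamorphosis of Prime Intellect");
        Console.WriteLine(string.Join("\n", BodyText(page)));
    }
}
```

With the guard in place, a page without a body produces empty output instead of an exception, which makes the LoadHtml/Load mix-up easier to spot.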