Question about HTML Agility Pack (Attempting to parse string from source)

c# html html-agility-pack html-parsing


I am attempting to use the HTML Agility Pack to parse certain bits of information from various pages. I am worried that it might be overkill for what I need; if that is the case, feel free to let me know. I am trying to parse a page from Motley Fool to get the name of a company based on its ticker. I will be parsing several pages to get stock info in a similar way.

The HTML that I want to parse looks like:

<h1 class="subHead"> 
    Microsoft Corp <span>(NASDAQ:MSFT)</span>
</h1>

Also, the page I want to parse is:

So my question is: how do I get just "Microsoft Corp" from the HTML, and should I even be using the Agility Pack for something like this?

Edit: Current code

public String getStockName(String ticker)
{
    String text ="";
    HtmlAgilityPack.HtmlWeb web = new HtmlAgilityPack.HtmlWeb();
    HtmlAgilityPack.HtmlDocument doc = web.Load("" + ticker + ".aspx");

    var node = doc.DocumentNode.SelectSingleNode("/h1[@class='subHead']");
    text = node.FirstChild.InnerText.Trim();
    return text;
}
4/10/2011 9:30:45 PM

Accepted Answer

This gives you a list of all stock names; for your sample HTML, that is just Microsoft:

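HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
// Load the page markup into doc before querying it (for example with doc.Load or doc.LoadHtml).

var nodes = doc.DocumentNode.SelectNodes("//h1[@class='subHead']");
foreach (var node in nodes)
{
    // FirstChild is the text node in front of the <span>; Trim removes the surrounding whitespace.
    string text = node.FirstChild.InnerText.Trim(); //output: "Microsoft Corp"
    string textAll = node.InnerText.Trim(); //output: "Microsoft Corp (NASDAQ:MSFT)"
}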

Edit based on updated question - this should work for you:

string text = "";
HtmlWeb web = new HtmlWeb();

string url = string.Format("{0}.aspx", ticker);
HtmlAgilityPack.HtmlDocument doc = web.Load(url);

var node = doc.DocumentNode.SelectSingleNode("//h1[@class='subHead']");
text = node.FirstChild.InnerText.Trim();
return text;
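
If the page layout changes or the ticker is invalid, SelectSingleNode returns null and the code above throws a NullReferenceException. A minimal sketch of a null-safe variant follows. The class name StockNameParser and the method ExtractName are hypothetical names chosen for illustration, and the HTML is passed in as a string (rather than fetched with HtmlWeb) so the example can be tried without network access:

```csharp
using System;
using HtmlAgilityPack;

public static class StockNameParser
{
    // Hypothetical helper (not part of the original answer): returns the text
    // in front of the <span> in the first <h1 class="subHead">, or null if
    // no such heading exists.
    public static string ExtractName(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectSingleNode returns null when the XPath matches nothing.
        var node = doc.DocumentNode.SelectSingleNode("//h1[@class='subHead']");
        if (node == null || node.FirstChild == null)
        {
            return null;
        }

        // Decode entities such as &amp; and strip the surrounding whitespace.
        return HtmlEntity.DeEntitize(node.FirstChild.InnerText).Trim();
    }

    public static void Main()
    {
        // Sample markup taken from the question.
        string html = "<h1 class=\"subHead\">\n    Microsoft Corp <span>(NASDAQ:MSFT)</span></h1>";
        Console.WriteLine(ExtractName(html));

        // A page without the heading yields null instead of an exception.
        Console.WriteLine(ExtractName("<p>no heading</p>") ?? "(not found)");
    }
}
```

Returning null leaves it to the caller to decide what a missing heading means; throwing a more descriptive exception would be an equally reasonable choice.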
4/10/2011 9:16:57 PM

Popular Answer

Use an XPath expression to select the element, then pick up its text.

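// Note: this selects the <span> inside the heading, so for the sample HTML it prints "(NASDAQ:MSFT)".
foreach (var element in doc.DocumentNode.SelectNodes("//h1[@class='subHead']/span"))
    Console.WriteLine(element.InnerText);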

Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow