HTML Agility Pack Question (Attempting to parse string from source)

c# html html-agility-pack html-parsing

Question

I am attempting to use the Agility pack to parse certain bits of info from various pages. I am kind of worried that using this might be overkill for what I need, if that is case feel free to let me know. Anyway, I am attempting to parse a page from motley fool to get the name of a company based on the ticker. I will be parsing several pages to get stock info in a similar way.

The HTML that I want to parse looks like:

<h1 class="subHead"> 
    Microsoft Corp <span>(NASDAQ:MSFT)</span>
</h1>

Also, the page I want to parse is: http://caps.fool.com/Ticker/MSFT.aspx

So, I guess my question is how do I simply get the Microsoft Corp from the html and should I even be using the agility pack to do things like this?

Edit: Current code

public String getStockName(String ticker)
{
    String text ="";
    HtmlAgilityPack.HtmlWeb web = new HtmlAgilityPack.HtmlWeb();
    HtmlAgilityPack.HtmlDocument doc = web.Load("http://caps.fool.com/Ticker/" + ticker + ".aspx");

    var node = doc.DocumentNode.SelectSingleNode("/h1[@class='subHead']");
    text = node.FirstChild.InnerText.Trim();
    return text;
}

Accepted Answer

This would give you a list of all stock names, for your sample Html just of Microsoft:

HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.Load("test.html");

var nodes = doc.DocumentNode.SelectNodes("//h1[@class='subHead']");
foreach (var node in nodes)
{
    string text = node.FirstChild.InnerText; //output: "Microsoft Corp"
    string textAll = node.InnerText; //output: "Microsoft Corp (NASDAQ:MSFT)"
}

Edit based on updated question - this should work for you:

string text = "";
HtmlWeb web = new HtmlWeb();

string url = string.Format("http://caps.fool.com/Ticker/{0}.aspx", ticker);
HtmlAgilityPack.HtmlDocument doc = web.Load(url);

var node = doc.DocumentNode.SelectSingleNode("//h1[@class='subHead']");
text = node.FirstChild.InnerText.Trim();
return text;

Popular Answer

Use an xpath expression to select the element then pickup the text.

 foreach (var element in doc.DocumentNode.SelectNodes("//h1[@clsss='subHead']/span"))
 {
    Console.WriteLine (element.InnerText);
 } 



Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow
Is this KB legal? Yes, learn why
Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow
Is this KB legal? Yes, learn why