Question about HTML Agility Pack (Attempting to parse string from source)

c# html html-agility-pack html-parsing

Question

I'm trying to parse certain pieces of information from different sites using the HTML Agility Pack. I'm a little concerned that it may be overkill for what I need; if so, please let me know. For now, I'm trying to extract a company's name from a Motley Fool page based on its ticker. I will parse several other websites in a similar way to get stock information.

I want to parse HTML that looks like this:

<h1 class="subHead"> 
    Microsoft Corp <span>(NASDAQ:MSFT)</span>
</h1>

I also want to parse the following page: http://caps.fool.com/Ticker/MSFT.aspx

So, I suppose my question is: how do I extract just "Microsoft Corp" from the HTML, and should I even be using the Agility Pack for this?

Edit: current code

public String getStockName(String ticker)
{
    String text ="";
    HtmlAgilityPack.HtmlWeb web = new HtmlAgilityPack.HtmlWeb();
    HtmlAgilityPack.HtmlDocument doc = web.Load("http://caps.fool.com/Ticker/" + ticker + ".aspx");

    var node = doc.DocumentNode.SelectSingleNode("/h1[@class='subHead']");
    text = node.FirstChild.InnerText.Trim();
    return text;
}
4/10/2011 9:30:45 PM

Accepted Answer

For your example HTML alone, this gives you the stock name from every matching h1 element:

HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.Load("test.html");

var nodes = doc.DocumentNode.SelectNodes("//h1[@class='subHead']");
foreach (var node in nodes)
{
    // Trim() removes the whitespace and newlines around the text nodes.
    string text = node.FirstChild.InnerText.Trim(); //output: "Microsoft Corp"
    string textAll = node.InnerText.Trim(); //output: "Microsoft Corp (NASDAQ:MSFT)"
}

Based on your updated question, this should work for you:

string text = "";
HtmlWeb web = new HtmlWeb();

string url = string.Format("http://caps.fool.com/Ticker/{0}.aspx", ticker);
HtmlAgilityPack.HtmlDocument doc = web.Load(url);

var node = doc.DocumentNode.SelectSingleNode("//h1[@class='subHead']");
text = node.FirstChild.InnerText.Trim();
return text;
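
As a minimal sketch of the same approach that can be tried without a network call, the snippet below parses the sample HTML from the question as a string and guards against a missing element, since SelectSingleNode returns null when the XPath matches nothing. The class name StockNameParser and the method ExtractName are hypothetical names chosen for illustration, and the code assumes the HtmlAgilityPack package is referenced.

```csharp
using System;
using HtmlAgilityPack;

public static class StockNameParser
{
    // Hypothetical helper (not part of the original answer): returns the company
    // name from the first <h1 class="subHead">, or null if no such element exists.
    public static string ExtractName(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectSingleNode returns null when the XPath matches nothing.
        var heading = doc.DocumentNode.SelectSingleNode("//h1[@class='subHead']");
        if (heading == null || heading.FirstChild == null)
        {
            return null;
        }

        // The first child is the text node that precedes the <span>.
        return heading.FirstChild.InnerText.Trim();
    }

    public static void Main()
    {
        string html =
            "<h1 class=\"subHead\">\n" +
            "    Microsoft Corp <span>(NASDAQ:MSFT)</span>\n" +
            "</h1>";

        Console.WriteLine(ExtractName(html));
        Console.WriteLine(ExtractName("<p>none</p>") ?? "(not found)");
    }
}
```

Returning null for a missing heading is only one option; throwing an exception or returning an empty string may suit the calling code better.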
4/10/2011 9:16:57 PM

Popular Answer

Use an XPath expression to select the element, then read its text.

foreach (var element in doc.DocumentNode.SelectNodes("//h1[@class='subHead']/span"))
{
    // For the sample HTML, the span holds the ticker: "(NASDAQ:MSFT)"
    Console.WriteLine(element.InnerText);
}

Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow