Parsing HTML page with HtmlAgilityPack using LINQ

c# html-agility-pack linq

Question

How can i parse html using Linq on a webpage and add values to a string. I am using the HtmlAgilityPack on a metro application and would like to bring back 3 values and add them to a string.

here is the url = http://explorer.litecoin.net/address/Li7x5UZqWUy7o1tEC2x5o6cNsn2bmDxA2N

I would like to get the values from the following see "belwo"

"Balance:", "Transactions in", "Received"

WebResponse x = await req.GetResponseAsync();
HttpWebResponse res = (HttpWebResponse)x;
if (res != null)
{
    if (res.StatusCode == HttpStatusCode.OK)
    {
        Stream stream = res.GetResponseStream();
        using (StreamReader reader = new StreamReader(stream))
        {
            html = reader.ReadToEnd();
        }
        HtmlDocument htmlDocument = new HtmlDocument();
        htmlDocument.LoadHtml(html);

        string appName = htmlDocument.DocumentNode.Descendants // not sure what t
        string a = "Name: " + WebUtility.HtmlDecode(appName);
    }
}

Accepted Answer

Please try the following. You might also consider pulling the table apart as it is a little better formed than the free-text in the 'p' tag.

Cheers, Aaron.

// download the site content and create a new html document
// NOTE: make this asynchronous etc when considering IO performance
var url = "http://explorer.litecoin.net/address/Li7x5UZqWUy7o1tEC2x5o6cNsn2bmDxA2N";
var data = new WebClient().DownloadString(url);
var doc = new HtmlDocument();
doc.LoadHtml(data);

// extract the transactions 'h3' title, the node we want is directly before it
var transTitle = 
    (from h3 in doc.DocumentNode.Descendants("h3")
     where h3.InnerText.ToLower() == "transactions"
     select h3).FirstOrDefault();

// tokenise the summary, one line per 'br' element, split each line by the ':' symbol
var summary = transTitle.PreviousSibling.PreviousSibling;
var tokens = 
    (from row in summary.InnerHtml.Replace("<br>", "|").Split('|')
     where !string.IsNullOrEmpty(row.Trim())
     let line = row.Trim().Split(':')
     where line.Length == 2
     select new { name = line[0].Trim(), value = line[1].Trim() });

// using linqpad to debug, the dump command drops the currect variable to the output
tokens.Dump();

'Dump()', is a LinqPad command that dumps the variable to the console, the following is a sample of the output from the Dump command:

  • Balance: 5 LTC
  • Transactions in: 2
  • Received: 5 LTC
  • Transactions out: 0
  • Sent: 0 LTC

Popular Answer

the document you have to parse is not the most well formed for parsing many elements are missing the class or at least id attribute but what you want to get is a second p tag content in it

you can try this

HtmlDocument htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(html);



var pNodes = htmlDocument.DocumentNode.SelectNodes("//p")
[1].InnerHtml.ToString().Split(new string[] { "<br />" }, StringSplitOptions.None).Take(3);

 string vl="Balance:"+pNodes[0].Split(':')[1]+"Transactions in"+pNodes[1].Split(':')[1]+"Received"+pNodes[2].Split(':')[1];



Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow
Is this KB legal? Yes, learn why
Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow
Is this KB legal? Yes, learn why