HtmlAgilityPack - Extract links from a specific table

.net c# html-agility-pack

Question

I'm having trouble finding the right way to extract links from a site. Using Firebug, the table's exact XPath is:

/html/body/div/form/table/tbody/tr/td/table/tbody/tr/td/table/tbody/tr[1]/td/table/tbody/tr/td/table/tbody/tr[2]/td/table/tbody/tr[1]/td/div/table/tbody/tr[3]/td/div/table/tbody/tr/td/div/table

It also has an id of 'ctl00_cp1_GridView1' (which hasn't exactly been helpful).

All I want to do is find all of the links in the first column and add them to a list.

Here's my current code snippet (with some help from this post):

protected void btnSubmitURL_Click(object sender, EventArgs e)
{
    try
    {
        List<string> siteList = new List<string>();
        int counter = 1;

        var web = new HtmlWeb();
        var doc = web.Load(txtURL.Text);
        var table = doc.DocumentNode.SelectSingleNode("html/body/div/form/table/tbody/tr/td/table/tbody/tr/td/table/tbody/tr[1]/td/table/tbody/tr/td/table/tbody/tr[2]/td/table/tbody/tr[1]/td/div/table/tbody/tr[3]/td/div/table/tbody/tr/td/div/table[@id='ctl00_cp1_GridView1']/tbody");
        HtmlNodeCollection rows = table.SelectNodes("./tr");
        if (rows != null)
        {
            for (int i = 0; i < rows.Count; i++)
            {
                HtmlNodeCollection cols = rows[i].SelectNodes("./td[1]");
                if (cols != null)
                {
                    for (int j = 0; j < cols.Count; j++)
                    {
                        // Index with j (the inner loop variable); cols[i] can run past the end of cols.
                        HtmlNode aTags = cols[j].SelectSingleNode("./a[@id='NormalColoredFont']");
                        if (aTags != null)
                        {
                            siteList.Add(counter + ". " + aTags.InnerHtml + " - " + aTags.Attributes["href"].Value);
                            counter++;
                        }
                    }
                }
            }
        }

        lblOutput.Text = siteList.Count.ToString();
    }

    catch (Exception ex)
    {
        MessageBox.Show(ex.ToString());
    }
}

I keep getting a NullReferenceException right at the HtmlNodeCollection rows line because it can't find that specific table. I've tried searching via the table id, but that hasn't helped either.

Any help with getting to that table would be appreciated.

Accepted Answer

I was finally able to extract all of the links using an example from Scott Mitchell. His example is as follows:

var linksOnPage = from lnks in document.DocumentNode.Descendants()
              where lnks.Name == "a" && 
                   lnks.Attributes["href"] != null && 
                   lnks.InnerText.Trim().Length > 0
              select new
              {
                 Url = lnks.Attributes["href"].Value,
                 Text = lnks.InnerText
              };
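
Note that the query above collects every link on the page, not only those in the table from the question. One likely reason the long path fails is that browsers add tbody elements to the DOM, so a path copied from Firebug includes them, while HtmlAgilityPack parses the markup as served, where they may be absent. The following is a minimal sketch, not a tested fix for the original site, that looks the table up by id and avoids naming tbody at all. The class and method names (TableLinkExtractor, ExtractLinks) are hypothetical, and it assumes the id 'ctl00_cp1_GridView1' really appears in the served HTML.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class TableLinkExtractor
{
    // Hypothetical helper: returns "text - href" for each link found in the
    // first cell of every row of the table with the given id.
    public static List<string> ExtractLinks(string html, string tableId)
    {
        var results = new List<string>();

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Find the table by id anywhere in the document instead of
        // spelling out the full path from the root.
        HtmlNode table = doc.DocumentNode.SelectSingleNode("//table[@id='" + tableId + "']");
        if (table == null)
        {
            return results; // table not present in the served markup
        }

        // ".//tr" matches rows whether or not a tbody element exists.
        // Caveat: it also matches rows of any tables nested inside this one.
        HtmlNodeCollection anchors = table.SelectNodes(".//tr/td[1]//a[@href]");
        if (anchors == null)
        {
            return results; // SelectNodes returns null when nothing matches
        }

        foreach (HtmlNode a in anchors)
        {
            results.Add(a.InnerText.Trim() + " - " + a.GetAttributeValue("href", ""));
        }

        return results;
    }
}
```

With a live page, the same XPath expressions could be run against the document returned by web.Load(txtURL.Text) instead of a string passed to LoadHtml.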

Thanks to jessehouwing and casperOne for responding quickly!




Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow