Unable to get Child Categories inside <ul> using HtmlAgilityPack C# ASP.NET

asp.net c# html-agility-pack web-scraping


I am new to web scraping and am trying to get data from a website with HtmlAgilityPack in ASP.NET C#. The HTML structure I am trying to parse is:

<li class='subsubnav' id='new-women-clothing'>
    <span class='cat-name'>CLOTHING</span>
    <ul>
        <li><a href="/womenswear/womens-just-in" id="just-in">Just In</a></li>
        <li><a href="/womenswear/new-season-exclusives" id="exclusives">Exclusives</a></li>
        <li><a href="/womenswear/new-season-dresses" id="dresses-&-gowns">Dresses & Gowns</a></li>
        <li><a href="/womenswear/new-season-coats" id="coats">Coats</a></li>
        <li><a href="/womenswear/new-season-jackets" id="jackets">Jackets</a></li>
        <li><a href="/womenswear/new-season-shirts-and-blouses" id="shirts-&-blouses">Shirts & Blouses</a></li>
        <li><a href="/womenswear/new-season-tops" id="tops">Tops</a></li>
        <li><a href="/womenswear/new-season-knitwear" id="knitwear">Knitwear</a></li>
        <li><a href="/womenswear/new-season-sweatshirts" id="sweatshirts">Sweatshirts</a></li>
        <li><a href="/womenswear/new-season-skirts-and-shorts" id="skirts-&-shorts">Skirts & Shorts</a></li>
        <li><a href="/womenswear/new-season-trousers" id="trousers">Trousers</a></li>
        <li><a href="/womenswear/new-season-jumpsuits" id="jumpsuits">Jumpsuits</a></li>
        <li><a href="/womenswear/new-season-jeans" id="jeans">Jeans</a></li>
        <li><a href="/womenswear/new-season-swimwear" id="swimwear">Swimwear</a></li>
        <li><a href="/womenswear/new-season-lingerie" id="lingerie">Lingerie</a></li>
        <li><a href="/womenswear/new-season-nightwear" id="nightwear">Nightwear</a></li>
        <li><a href="/womenswear/sportswear" id="sportswear">Sportswear</a></li>
        <li><a href="/womenswear/ski-wear" id="ski-wear">Ski Wear</a></li>
    </ul>
</li>



I am getting the parent categories (CLOTHING in this case) correctly, but I am unable to get the elements inside the ul.

Here is my C# code:

var html = new HtmlDocument();
html.LoadHtml(new WebClient().DownloadString("http://www.harrods.com/men/t-shirts?icid=megamenu_MW_clothing_t_shirts"));
var root = html.DocumentNode;
var nodes = root.Descendants();
var totalNodes = nodes.Count();
var dt = root.Descendants().Where(n => n.GetAttributeValue("class", "").Equals("cat-name"));

foreach (var x in dt)
{
    foreach (var element in x.Descendants("ul"))
    {
        child_data.Add(new cat_childs(element.InnerText));
    }

    data.Add(new Categories(x.InnerText, child_data));
}

test.DataSource = data;

So how can I get the link and text of anchor tags inside <ul>?

3/15/2016 7:36:24 AM

Accepted Answer

If you want to base the iteration on the span with class='cat-name', then the target ul is a following sibling of the span, not a descendant. You can use SelectNodes() to get the following-sibling elements from the current span, like so:

foreach (var x in dt)
{
    foreach (var element in x.SelectNodes("following-sibling::ul/li/a"))
    {
        child_data.Add(new cat_childs(element.InnerText));
    }

    data.Add(new Categories(x.InnerText, child_data));
}
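
To make the descendant versus following-sibling distinction concrete, here is a minimal sketch run against a trimmed-down, hypothetical copy of the markup from the question (the sample string and the AxisDemo class name are assumptions for illustration, not part of the original code). It assumes the HtmlAgilityPack NuGet package is referenced.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

class AxisDemo
{
    static void Main()
    {
        // Hypothetical sample that mirrors the shape of the menu markup in the question.
        const string sample = @"
<ul>
  <li class='subsubnav' id='new-women-clothing'>
    <span class='cat-name'>CLOTHING</span>
    <ul>
      <li><a href='/womenswear/womens-just-in' id='just-in'>Just In</a></li>
      <li><a href='/womenswear/new-season-coats' id='coats'>Coats</a></li>
    </ul>
  </li>
</ul>";

        var doc = new HtmlDocument();
        doc.LoadHtml(sample);

        var span = doc.DocumentNode.SelectSingleNode("//span[@class='cat-name']");

        // The span only wraps the text "CLOTHING", so searching its descendants
        // for <ul> elements finds nothing.
        Console.WriteLine("ul descendants of span: " + span.Descendants("ul").Count());

        // The <ul> comes after the span under the same parent <li>.
        // SelectNodes returns null (not an empty collection) when nothing matches.
        var anchors = span.SelectNodes("following-sibling::ul/li/a");
        Console.WriteLine("anchors via following-sibling: " + (anchors?.Count ?? 0));
    }
}
```

With this sample, the first count should come out as zero and the second should match the number of links in the list, which is the behaviour described above.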


It seems that the actual problem is the child_data variable being declared outside the outer loop, which means that you keep adding items to the same child_data instance. Try declaring it inside the outer loop, right after foreach (var x in dt){. Alternatively, you can write the entire code as a LINQ expression, something like this:

var data = (from x in dt
            let child_data = x.SelectNodes("following-sibling::ul/li/a")
                              .Select(o => new cat_childs(o.InnerText))
            select new Categories(x.InnerText, child_data));
3/9/2016 12:08:56 PM
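
As a rough sketch of the "declare the list inside the outer loop" advice, the snippet below builds a fresh child list for every category and keeps both the link text and the href, which is what the question asks for. ChildLink, Category, and MenuParser are hypothetical stand-ins for the question's cat_childs and Categories classes, whose definitions are not shown in the post.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

// Hypothetical stand-ins for the question's cat_childs and Categories classes.
class ChildLink
{
    public string Text { get; set; }
    public string Href { get; set; }
}

class Category
{
    public string Name { get; set; }
    public List<ChildLink> Children { get; set; }
}

static class MenuParser
{
    public static List<Category> Parse(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new List<Category>();
        var spans = doc.DocumentNode.SelectNodes("//span[@class='cat-name']");
        if (spans == null) return result; // SelectNodes returns null when nothing matches

        foreach (var span in spans)
        {
            // A new list on every iteration, so categories do not share children.
            var children = new List<ChildLink>();

            var anchors = span.SelectNodes("following-sibling::ul/li/a");
            if (anchors != null)
            {
                foreach (var a in anchors)
                {
                    children.Add(new ChildLink
                    {
                        Text = a.InnerText.Trim(),
                        Href = a.GetAttributeValue("href", "")
                    });
                }
            }

            result.Add(new Category { Name = span.InnerText.Trim(), Children = children });
        }

        return result;
    }
}
```

Assuming pageHtml holds the downloaded page source, the result of MenuParser.Parse(pageHtml) could then be assigned to a data-bound control in the same way the question assigns data to test.DataSource.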

Popular Answer

Use this XPath. It selects every <li> that contains a <span> with class='cat-name', and then picks all the <a> elements enclosed by the nested <li> elements.

//If the span has no influence on what you want you can simply use: 
//HtmlNodeCollection hNC = htmlDoc.DocumentNode.SelectNodes("//ul/li/a");

HtmlNodeCollection hNC = htmlDoc.DocumentNode.SelectNodes("//li/span[@class='cat-name']/parent::*/ul/li/a");
foreach (HtmlNode h in hNC)
{
    Console.Write(h.InnerText + " ");
    Console.WriteLine(h.GetAttributeValue("href", ""));
}
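
A document-wide query like this returns a flat list of links, so the category each link belongs to is no longer obvious. The sketch below is one possible way to recover it, under the assumption that each category <li> contains exactly one cat-name span. It uses the predicate form of the same XPath, guards against SelectNodes returning null, and walks back up from each anchor with the ancestor axis. The sample markup and the FlatQueryDemo class name are hypothetical.

```csharp
using System;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

class FlatQueryDemo
{
    static void Main()
    {
        // Hypothetical sample with the same shape as the markup in the question.
        const string sample = @"
<ul>
  <li class='subsubnav'>
    <span class='cat-name'>CLOTHING</span>
    <ul>
      <li><a href='/womenswear/new-season-tops' id='tops'>Tops</a></li>
      <li><a href='/womenswear/new-season-jeans' id='jeans'>Jeans</a></li>
    </ul>
  </li>
</ul>";

        var htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(sample);

        // Predicate form: every <a> in a <ul> whose parent <li> has a cat-name <span>.
        var anchors = htmlDoc.DocumentNode.SelectNodes("//li[span[@class='cat-name']]/ul/li/a");
        if (anchors == null)
        {
            // SelectNodes returns null, not an empty collection, when nothing matches.
            Console.WriteLine("No category links found.");
            return;
        }

        foreach (var a in anchors)
        {
            // Walk up to the nearest category <li>, then down to its name span.
            var category = a.SelectSingleNode(
                "ancestor::li[span[@class='cat-name']][1]/span[@class='cat-name']");

            var name = category == null ? "" : category.InnerText.Trim();
            var text = a.InnerText.Trim();
            var href = a.GetAttributeValue("href", "");

            Console.WriteLine(name + " | " + text + " | " + href);
        }
    }
}
```

Iterating over the spans first, as in the accepted answer, avoids the upward lookup entirely, so this variant is mainly useful when a flat list of links is the starting point.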



Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow