Unable to get subcategory <ul> using HtmlAgilityPack in C# ASP.NET

asp.net c# html-agility-pack web-scraping

I'm new to web scraping and am trying to get data from a website using HtmlAgilityPack in ASP.NET C#. The HTML structure I want to parse is:

<li class='subsubnav' id='new-women-clothing'>
    <span class='cat-name'>CLOTHING</span>

    <ul>
        <li><a href="/womenswear/womens-just-in" id="just-in">Just In</a></li>

        <li><a href="/womenswear/new-season-exclusives" id="exclusives">Exclusives</a></li>

        <li><a href="/womenswear/new-season-dresses" id="dresses-&-gowns">Dresses & Gowns</a></li>

        <li><a href="/womenswear/new-season-coats" id="coats">Coats</a></li>

        <li><a href="/womenswear/new-season-jackets" id="jackets">Jackets</a></li>

        <li><a href="/womenswear/new-season-shirts-and-blouses" id="shirts-&-blouses">Shirts & Blouses</a></li>

        <li><a href="/womenswear/new-season-tops" id="tops">Tops</a></li>

        <li><a href="/womenswear/new-season-knitwear" id="knitwear">Knitwear</a></li>

        <li><a href="/womenswear/new-season-sweatshirts" id="sweatshirts">Sweatshirts</a></li>

        <li><a href="/womenswear/new-season-skirts-and-shorts" id="skirts-&-shorts">Skirts & Shorts</a></li>

        <li><a href="/womenswear/new-season-trousers" id="trousers">Trousers</a></li>

        <li><a href="/womenswear/new-season-jumpsuits" id="jumpsuits">Jumpsuits</a></li>

        <li><a href="/womenswear/new-season-jeans" id="jeans">Jeans</a></li>

        <li><a href="/womenswear/new-season-swimwear" id="swimwear">Swimwear</a></li>

        <li><a href="/womenswear/new-season-lingerie" id="lingerie">Lingerie</a></li>

        <li><a href="/womenswear/new-season-nightwear" id="nightwear">Nightwear</a></li>

        <li><a href="/womenswear/sportswear" id="sportswear">Sportswear</a></li>

        <li><a href="/womenswear/ski-wear" id="ski-wear">Ski Wear</a></li>

    </ul>

</li>

I get the parent category (CLOTHING in this case) perfectly, but I can't get the elements inside the ul.

Here is my C# code (the snippet is missing from this copy; the answers below refer to its dt, child_data, and data variables).

So how do I get the links and text of the anchor tags inside the <ul>?

Accepted answer

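If you want to base the iteration on the span with class='cat-name', then the target ul is a sibling of that span, not a descendant. You can use SelectNodes() with the following-sibling axis to get the ul that follows the current span, like this: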

foreach (var x in dt)
{
    foreach (var element in x.SelectNodes("following-sibling::ul/li/a"))
    {
        child_data.Add(new cat_childs(element.InnerText));
    }

    data.Add(new Categories(x.InnerText,child_data));
}
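The sibling relationship can be checked in isolation with a short program. This is only a minimal sketch, assuming the HtmlAgilityPack NuGet package is referenced; the trimmed HTML string and the names SiblingDemo and GetLinks are illustrative and do not come from the question. It also reads the href attribute, since the question asks for both the link and the text.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class SiblingDemo
{
    // Trimmed copy of the markup from the question (illustrative only).
    public const string Html = @"<li class='subsubnav' id='new-women-clothing'>
    <span class='cat-name'>CLOTHING</span>
    <ul>
        <li><a href='/womenswear/womens-just-in' id='just-in'>Just In</a></li>
        <li><a href='/womenswear/new-season-coats' id='coats'>Coats</a></li>
    </ul>
</li>";

    // Returns (text, href) pairs for every <a> in the <ul> that follows a cat-name span.
    public static List<KeyValuePair<string, string>> GetLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var links = new List<KeyValuePair<string, string>>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var spans = doc.DocumentNode.SelectNodes("//span[@class='cat-name']");
        if (spans == null) return links;

        foreach (var span in spans)
        {
            // The <ul> is a sibling of the span, so walk the following-sibling axis.
            var anchors = span.SelectNodes("following-sibling::ul/li/a");
            if (anchors == null) continue;

            foreach (var a in anchors)
            {
                links.Add(new KeyValuePair<string, string>(
                    a.InnerText.Trim(),
                    a.GetAttributeValue("href", "")));
            }
        }
        return links;
    }

    public static void Main()
    {
        foreach (var link in GetLinks(Html))
            Console.WriteLine(link.Key + " -> " + link.Value);
    }
}
```

The null checks matter in practice: a page whose markup differs from the sample would otherwise throw a NullReferenceException inside the foreach.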

Update:

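It seems the actual problem is that the child_data variable is declared outside the outer loop, so every category keeps adding items to the same child_data instance. Try declaring it inside the outer loop, right after foreach (var x in dt){. Alternatively, you can write the whole thing as a LINQ expression, along these lines:

// Requires: using System.Linq;
// Assumes Categories accepts a List<cat_childs>, as in the loop above.
var data = (from x in dt
            let child_data = x.SelectNodes("following-sibling::ul/li/a")
                              .Select(a => new cat_childs(a.InnerText))
                              .ToList()
            select new Categories(x.InnerText, child_data)).ToList();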
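The shared-list problem described in the update can be avoided by creating a fresh list on every pass of the outer loop. The sketch below assumes the HtmlAgilityPack NuGet package; Category, CategoryParser, and Parse are hypothetical stand-ins for the asker's Categories and cat_childs types, which are not shown in the question.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical stand-in for the asker's Categories type.
public class Category
{
    public string Name { get; }
    public List<string> Children { get; }

    public Category(string name, List<string> children)
    {
        Name = name;
        Children = children;
    }
}

public static class CategoryParser
{
    public static List<Category> Parse(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var data = new List<Category>();
        var spans = doc.DocumentNode.SelectNodes("//span[@class='cat-name']");
        if (spans == null) return data; // no categories found

        foreach (var span in spans)
        {
            // Declared inside the loop, so each category owns its own list
            // instead of sharing one instance across all categories.
            var children = new List<string>();

            var anchors = span.SelectNodes("following-sibling::ul/li/a");
            if (anchors != null)
            {
                foreach (var a in anchors)
                    children.Add(a.InnerText.Trim());
            }

            data.Add(new Category(span.InnerText.Trim(), children));
        }
        return data;
    }
}
```

With the list declared outside the loop, every Category would hold a reference to the same list and therefore show the children of all categories combined.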

Popular answer

Use this XPath. It finds every <li> that contains a <span> with class='cat-name', then selects all the <a> elements inside that <li>'s nested list.

// If the span has no influence on what you want, you can simply use:
// HtmlNodeCollection hNC = htmlDoc.DocumentNode.SelectNodes("//ul/li/a");

// Select the <a> elements (not the <li>) so that href can be read from each node.
HtmlNodeCollection hNC = htmlDoc.DocumentNode.SelectNodes("//li/span[@class='cat-name']/parent::*/ul/li/a");
if (hNC != null) // SelectNodes returns null when nothing matches
{
    foreach (HtmlNode h in hNC)
    {
        Console.Write(h.InnerText + " ");
        Console.WriteLine(h.GetAttributeValue("href", ""));
    }
}



Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow