Parsing dl with HtmlAgilityPack

asp.net c# html-agility-pack screen-scraping

Question

This is the sample HTML I am trying to parse with Html Agility Pack in ASP.NET (C#):

<div class="content-div">
    <dl>
        <dt>
            <b><a href="1.html" title="1">1</a></b>
        </dt>
        <dd> First Entry</dd>
        <dt>
            <b><a href="2.html" title="2">2</a></b>
        </dt>
        <dd> Second Entry</dd>
        <dt>
            <b><a href="3.html" title="3">3</a></b>
        </dt>
        <dd> Third Entry</dd>
    </dl>
</div>

The values I want are:

  • The hyperlink -> 1.html
  • The anchor text -> 1
  • Inner text of dd -> First Entry

(I have used the first entry as the example here, but I want these values for every entry in the list.)

This is the code I am currently using:

var webGet = new HtmlWeb();
var document = webGet.Load(url2);
var parsedValues=
   from info in document.DocumentNode.SelectNodes("//div[@class='content-div']")
   from content in info.SelectNodes("dl//dd")
   from link in info.SelectNodes("dl//dt/b/a")
       .Where(x => x.Attributes.Contains("href"))
   select new 
   {
       Text = content.InnerText,
       Url = link.Attributes["href"].Value,
       AnchorText = link.InnerText,
   };

GridView1.DataSource = parsedValues;
GridView1.DataBind();

The problem is that I get the link and the anchor text correctly, but the inner text of the dd is wrong: the first entry's text is repeated once for every link, then the second entry's text is repeated the same way, and so on. In case that is not clear, here is a sample of the output I get with this code:

First Entry     1.html  1
First Entry     2.html  2
First Entry     3.html  3
Second Entry    1.html  1
Second Entry    2.html  2
Second Entry    3.html  3
Third Entry     1.html  1
Third Entry     2.html  2
Third Entry     3.html  3

Whereas I am trying to get

First Entry      1.html     1
Second Entry     2.html     2
Third Entry      3.html     3

I am pretty new to HAP and have very little knowledge of XPath, so I am sure I am doing something wrong here, but I couldn't make it work even after spending hours on it. Any help would be much appreciated.

Accepted Answer
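
As a side note on why the query in the question returns nine rows: two independent from clauses in a LINQ query behave like a cross join, so every dd is paired with every link. A minimal sketch of that effect, using plain string arrays as stand-ins for the nodes (the arrays and names here are made up for illustration, not taken from HAP):

```csharp
using System;
using System.Linq;

class CrossJoinDemo
{
    static void Main()
    {
        // Stand-ins for the dd texts and the link hrefs from the sample HTML
        var texts = new[] { "First Entry", "Second Entry", "Third Entry" };
        var urls = new[] { "1.html", "2.html", "3.html" };

        // Neither from clause depends on the other, so each text
        // is combined with each url: 3 * 3 combinations
        var rows =
            from text in texts
            from url in urls
            select text + " " + url;

        Console.WriteLine(rows.Count()); // prints 9
    }
}
```

Both solutions below avoid this by pairing each dt with its own dd instead of querying the two lists independently.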

Solution 1

I have defined a function that given a dt node will return the next dd node after it:

private static HtmlNode GetNextDDSibling(HtmlNode dtElement)
{
    var currentNode = dtElement;

    while (currentNode != null)
    {
        currentNode = currentNode.NextSibling;

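        // NextSibling is null after the last sibling, so check before dereferencing
        if (currentNode != null && currentNode.NodeType == HtmlNodeType.Element && currentNode.Name == "dd")
            return currentNode;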
    }

    return null;
}

and now the LINQ code can be transformed to:

var parsedValues =
    from info in document.DocumentNode.SelectNodes("//div[@class='content-div']")
    from dtElement in info.SelectNodes("dl/dt")
    let link = dtElement.SelectSingleNode("b/a[@href]")
    let ddElement = GetNextDDSibling(dtElement)
    where link != null && ddElement != null
    select new
    {
        Text = ddElement.InnerHtml,
        Url = link.GetAttributeValue("href", ""),
        AnchorText = link.InnerText
    };
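
If you would rather not write the helper, the same lookup can probably be expressed with the XPath following-sibling axis, which HAP's XPath support accepts. This is an alternative sketch, not part of the original solution; it assumes the HtmlAgilityPack NuGet package is referenced and uses an inline copy of the sample HTML instead of loading a URL:

```csharp
using System;
using HtmlAgilityPack; // assumption: HtmlAgilityPack NuGet package is installed

class FollowingSiblingDemo
{
    static void Main()
    {
        // Shortened inline copy of the HTML from the question
        const string html = @"<div class=""content-div""><dl>
            <dt><b><a href=""1.html"" title=""1"">1</a></b></dt><dd> First Entry</dd>
            <dt><b><a href=""2.html"" title=""2"">2</a></b></dt><dd> Second Entry</dd>
        </dl></div>";

        var document = new HtmlDocument();
        document.LoadHtml(html);

        var dts = document.DocumentNode.SelectNodes("//div[@class='content-div']/dl/dt");
        if (dts == null)
            return; // SelectNodes returns null when nothing matches

        foreach (var dt in dts)
        {
            var link = dt.SelectSingleNode("b/a[@href]");
            // First dd element that follows this dt at the same level
            var dd = dt.SelectSingleNode("following-sibling::dd[1]");
            if (link == null || dd == null)
                continue;

            var text = dd.InnerText.Trim();
            var url = link.GetAttributeValue("href", "");
            Console.WriteLine("{0} | {1} | {2}", text, url, link.InnerText);
        }
    }
}
```

Like the helper, this picks the next dd even if another dt sits in between, so it relies on the dt/dd pairs being well formed.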

Solution 2

Without additional functions:

var infoNode = 
        document.DocumentNode.SelectSingleNode("//div[@class='content-div']");

var dts = infoNode.SelectNodes("dl/dt");
var dds = infoNode.SelectNodes("dl/dd");

var parsedValues = dts.Zip(dds,
    (dt, dd) => new
    {
        Text = dd.InnerHtml,
        Url = dt.SelectSingleNode("b/a[@href]").GetAttributeValue("href", ""),
        AnchorText = dt.SelectSingleNode("b/a[@href]").InnerText
    });
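
One thing to keep in mind with this version: Zip pairs elements purely by position and stops at the end of the shorter sequence, so a dt without a matching dd (or the reverse) drops or misaligns entries without any error. A small sketch of that behaviour, with made-up string arrays standing in for the node collections:

```csharp
using System;
using System.Linq;

class ZipDemo
{
    static void Main()
    {
        // Three "dt" values but only two "dd" values (hypothetical data)
        var dts = new[] { "1.html", "2.html", "3.html" };
        var dds = new[] { "First Entry", "Second Entry" };

        // Zip combines items by index and ends with the shorter input
        var pairs = dts.Zip(dds, (dt, dd) => dd + " " + dt).ToList();

        Console.WriteLine(pairs.Count); // prints 2
    }
}
```

If the markup may be irregular, Solution 1 is the safer choice because it looks up the dd for each dt individually.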

Popular Answer

Just an example of how you can parse some elements using Html Agility Pack:

public string ParseHtml()
{
    string output = null;
    HtmlDocument htmldocument = new HtmlDocument();
    htmldocument.LoadHtml(YourHTML); // YourHTML: the HTML string you want to parse

    HtmlNode node = htmldocument.DocumentNode;

    HtmlNodeCollection dds = node.SelectNodes("//dd"); // Select all dd tags
    HtmlNodeCollection anchors = node.SelectNodes("//b/a[@href]"); // Select all 'a' tags that contain an href attribute

    // SelectNodes returns null when nothing matches
    if (dds == null || anchors == null)
        return output;

    // Stop at the shorter collection so the indexes stay in range
    for (int i = 0; i < dds.Count && i < anchors.Count; i++)
    {
        string text = dds[i].InnerText;
        string url = anchors[i].GetAttributeValue("href", null);
        string anchorText = anchors[i].InnerText;

        //Your code...
    }
    return output;
}



Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow
Is this KB legal? Yes, learn why