node.Descendants(0) seems to return all child nodes instead of first level

.net html-agility-pack

Question

I am using HtmlAgilityPack to traverse through a document tree one level at a time. However, it seems that calling node.Descendants(0) returns the entire node tree.

Note: I tried pasting in my verbatim HTML string, but the SE parser didn't like it, so I added it as a snippet.

<html>
    <head>
    <meta name="generator"
    content="HTML Tidy for HTML5 (experimental) for Windows https://github.com/w3c/tidy-html5/tree/c63cc39" />
    <title></title>
    </head>
    <body>
    <p id="p1" class="newline">
        <span id="span1" class="bold">
        <span id="span2" class="literal">BOLD TEXT</span>
        </span>
    </p>
    </body>
</html>

var doc = new HtmlAgilityPack.HtmlDocument();

doc.LoadHtml(html);

var lines = doc.DocumentNode.Descendants().Where(x => x.HasClass("newline")).ToArray();

Console.WriteLine(string.Join("\r\n", lines[0].Descendants(0)
    .Select(x => $"{x.Name} {x.Id} {(x as HtmlTextNode)?.Text}")));

The code above gets the descendants of the first p tag. Whether I pass 0 or 1 as the argument, it returns the entire node tree and prints the output below. However, the text node containing BOLD TEXT is nested three levels below the p tag, so I would expect the call to return only a text node, span1, and then another text node.

What am I doing wrong in my call to .Descendants?

#text

span span1
#text

span span2
#text  BOLD TEXT
#text

#text

Edit: A temporary workaround is to keep only the descendants whose parent is the current node. I am still looking for a more practical solution, though.

Console.WriteLine(string.Join("\r\n", lines[0].Descendants(0)
    .Where(x => x.ParentNode == lines[0])
    .Select(x => $"{x.Name} {x.Id} {(x as HtmlTextNode)?.Text}")));

Popular Answer

I ran into the same issue, started googling, and found your question :). Then I decided to ask the developers directly. Here is a short version of their answer:

According to the source code, the method does not behave the way you expect:

/// <summary>
/// Gets all Descendant nodes in enumerated list
/// </summary>
/// <returns></returns>
public IEnumerable<HtmlNode> Descendants(int level)
{
    if (level > HtmlDocument.MaxDepthLevel)
    {
        throw new ArgumentException(HtmlNode.DepthLevelExceptionMessage);
    }

    foreach (HtmlNode node in ChildNodes)
    {
        yield return node;

        foreach (HtmlNode descendant in node.Descendants(level + 1))
        {
            yield return descendant;
        }
    }
}

It returns all descendants: for each child it recurses, incrementing the level by one, until there are no more descendants or the maximum level (int.MaxValue) is reached. In other words, the level argument acts only as a recursion-depth guard, not as a limit on how deep the results go. However, I agree with you: it should probably return descendants only until the specified level is reached. Unfortunately, for backward compatibility, we will probably leave this method unchanged so that current applications are not affected.

However, in your case ChildNodes can be used instead of Descendants(0). The code will look like:

Console.WriteLine(string.Join("\r\n", lines[0].ChildNodes
    .Select(x => $"{x.Name} {x.Id} {(x as HtmlTextNode)?.Text}")));
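
If you need more than one level but not the whole subtree, a depth-limited traversal can be written by hand. The following is only a minimal sketch under stated assumptions: Tree and DescendantsUpTo are hypothetical names, not part of HtmlAgilityPack, and the helper is kept generic so that it does not depend on the library. The caller supplies a function that returns a node's children.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper; nothing here is part of HtmlAgilityPack.
public static class Tree
{
    // Yields the descendants of root in document order, going at most
    // maxDepth levels below it. maxDepth = 1 yields direct children only,
    // maxDepth = 2 also yields grandchildren, and so on.
    public static IEnumerable<T> DescendantsUpTo<T>(
        T root, Func<T, IEnumerable<T>> getChildren, int maxDepth)
    {
        if (maxDepth < 1)
        {
            yield break; // depth budget used up: stop descending
        }

        foreach (T child in getChildren(root))
        {
            yield return child;

            // Recurse with one level less of remaining depth.
            foreach (T descendant in DescendantsUpTo(child, getChildren, maxDepth - 1))
            {
                yield return descendant;
            }
        }
    }
}
```

With HtmlAgilityPack, a call could look like Tree.DescendantsUpTo(lines[0], n => n.ChildNodes, 1), which should yield the same nodes as lines[0].ChildNodes; passing 2 would also include the children of span1.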



Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow