Select all DOM elements with HTMLAgilityPack

.net c# dom html html-agility-pack

Question

I've looked through related questions and searched online, but I can't find a solution. I'm trying to select every DOM element and collect them into an ArrayList or something similar.

Right now I have:

public void Parse()
    {
        HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();

        // There are various options, set as needed
        //htmlDoc.OptionFixNestedTags = true;

        // Load the HTML from a file ("Test.html" here)
        htmlDoc.Load("Test.html");

        // To load from a string instead, use: htmlDoc.LoadHtml(htmlString);

        // ParseErrors is a collection of any errors found while loading (Count() needs System.Linq)
        if (htmlDoc.ParseErrors != null && htmlDoc.ParseErrors.Count() > 0)
        {
            Console.WriteLine("There was an error parsing the HTML file");
        }
        else
        {
            if (htmlDoc.DocumentNode != null)
            {
                htmlDoc.DocumentNode.Descendants();

                Console.WriteLine("document node not null");
                //HtmlAgilityPack.HtmlNode bodyNode = htmlDoc.DocumentNode.SelectSingleNode("//body");

                foreach (HtmlNode node in htmlDoc.DocumentNode.Descendants())
                {
                    Console.WriteLine(node.Name);
                }
            }
        }
    }

The code prints the name of each node (html, title, picture, etc.), but the closing tags are printed as "#text". I assume this is because closing tags begin with a "/". How can I read all of the DOM elements correctly?

3/21/2014 10:53:37 PM

Accepted Answer

Closing tags are not represented in the DOM as separate nodes, and the name of every text node is "#text". For example, this HTML:

<div><span>foo</span> bar</div>

gives you a tree like:

div
   span
      #text:foo
   #text:bar
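
As a rough illustration of the tree above, here is a minimal sketch that walks the parsed document recursively and prints each node's name, indented by depth. The class name TreeDump and the helper Dump are made up for this example; the HtmlAgilityPack members used (LoadHtml, DocumentNode, ChildNodes, NodeType, InnerText) are the ones I would expect to be available, but treat this as a sketch rather than a definitive implementation.

```csharp
using System;
using HtmlAgilityPack;

class TreeDump
{
    // Hypothetical helper: print a node's name, then recurse into its children.
    static void Dump(HtmlNode node, int depth)
    {
        string indent = new string(' ', depth * 3);

        if (node.NodeType == HtmlNodeType.Text)
        {
            // Text nodes are all named "#text"; show their content as well.
            Console.WriteLine(indent + node.Name + ":" + node.InnerText.Trim());
        }
        else
        {
            Console.WriteLine(indent + node.Name);
        }

        foreach (HtmlNode child in node.ChildNodes)
        {
            Dump(child, depth + 1);
        }
    }

    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<div><span>foo</span> bar</div>");

        // DocumentNode is the synthetic "#document" root, so start from its children.
        foreach (HtmlNode child in doc.DocumentNode.ChildNodes)
        {
            Dump(child, 0);
        }
    }
}
```

Note that no node is produced for </span> or </div>: the only "#text" entries are the actual text content.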
3/21/2014 11:40:06 PM

Popular Answer

I believe the #text items you saw are line breaks rather than closing tags. For example, given the following HTML input:

<div>
    <a href="http://example.org"></a>
</div>

If I use your code, I get:

div
#text   <- line break between <div> and <a>
a
#text   <- line break between </a> and </div>

To get only the element nodes and skip the text nodes (including those unwanted line breaks), you can use the following XPath query:

foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//*"))
{
    Console.WriteLine(node.Name);
}

The XPath expression //* selects all descendant elements with any name (*).
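
Since the question asks for the elements in a list, the following sketch shows one possible way to collect them without XPath, by filtering Descendants() on NodeType. The class name ElementList and the inline HTML string are assumptions for the example. It also guards against SelectNodes returning null when nothing matches, which is the behaviour I have seen in HtmlAgilityPack; check this against the version you use.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

class ElementList
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<div>\n    <a href=\"http://example.org\"></a>\n</div>");

        // Keep only element nodes; text nodes (such as line breaks) and comments are dropped.
        List<HtmlNode> elements = doc.DocumentNode
            .Descendants()
            .Where(n => n.NodeType == HtmlNodeType.Element)
            .ToList();

        foreach (HtmlNode element in elements)
        {
            Console.WriteLine(element.Name);
        }

        // The XPath route: treat a null result as "no matches" before using it.
        HtmlNodeCollection viaXPath = doc.DocumentNode.SelectNodes("//*");
        int xpathCount = viaXPath == null ? 0 : viaXPath.Count;
        Console.WriteLine("XPath matched " + xpathCount + " element(s)");
    }
}
```

The LINQ filter and the XPath query should select the same element nodes here; which one to use is mostly a matter of taste.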




Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow