Get all nodes and their content using HtmlDocument/HtmlAgilityPack

c# html html-agility-pack uwp

Question

I need to extract all nodes from an HTML file, after which I need to extract the text and sub-nodes from those nodes, and finally the same thing from those sub-sub-nodes. As an example, I have this HTML:

<p>This <b>is a <a href="">Link</a></b> with <b>bold</b></p>

So I need a way to get the p node, then the unformatted text (This), the text that is only bold (is a), the bold link (Link), and then the rest of the content, both formatted and unformatted.

I know it is possible to select all nodes and sub-nodes in an HTML page, but how? I need to build the rendered version of the HTML ("This is a Link with bold"), so I need to get the text before a sub-node, then the sub-node itself, and then its text and sub-nodes, in document order.

Please be aware that the sample above is a simple one. The real HTML will include more complicated elements such as lists, frames, numbered lists, triple-formatted text, etc. Also keep in mind that rendering is not the issue: I have already done that part, just in a different way. I only need the part that retrieves the nodes and their contents. I also can't filter by anything, since I can't ignore any node. The main node may be a p, div, frame, ul, etc.

1/16/2017 3:52:57 PM

Accepted Answer

After looking at the HtmlDocument and its properties, and thanks to @HungCao's observation, I found a simple way to read the HTML.

I'll share a simplified version of my code since it is a bit too complicated to offer as an example.

The htmlDoc has to be loaded first. This can be done in any function:

HtmlDocument htmlDoc = new HtmlDocument();
string html = @"<p>This <b>is a <a href="""">Link</a></b> with <b>bold</b></p>";
htmlDoc.LoadHtml(html);

Then, for each "main" node, we check its type (in this example, p), process it accordingly, and call a recursive function (InterNode) on each of its children:

HtmlNodeCollection nodes = htmlDoc.DocumentNode.ChildNodes;

foreach (HtmlNode node in nodes)
{
    if (node.Name.ToLower() == "p") // lowercase the name just in case
    {
        Paragraph newPPara = new Paragraph();
        foreach(HtmlNode childNode in node.ChildNodes)
        {
            InterNode(childNode, ref newPPara);
        }
        richTextBlock.Blocks.Add(newPPara);
    }
}

Please be aware that the "NodeType" property does not return the tag type: it only distinguishes Document, Element, Comment and Text nodes. Use the "Name" property to get the tag name instead (also note that the Name property of HtmlNode is not the same as the name attribute in HTML).
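To make the difference between NodeType and Name concrete, here is a minimal sketch that lists both for every node of a small fragment. The class name NodeInspector and the method Describe are made up for this illustration; only HtmlDocument, LoadHtml, DocumentNode, Descendants, NodeType and Name come from HtmlAgilityPack.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper, not part of the original answer.
public static class NodeInspector
{
    // Returns one "NodeType: Name" line per node, in document order.
    public static List<string> Describe(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var lines = new List<string>();
        foreach (HtmlNode node in doc.DocumentNode.Descendants())
        {
            // NodeType is the kind of node; Name is the tag name,
            // or "#text" for text nodes.
            lines.Add(node.NodeType + ": " + node.Name);
        }
        return lines;
    }

    public static void Main()
    {
        foreach (string line in Describe("<p>This <b>bold</b></p>"))
        {
            Console.WriteLine(line);
        }
    }
}
```

Both the p and the b element report the same NodeType (Element), so only Name tells them apart, while text nodes can be recognised either by NodeType (Text) or by the name "#text".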

Finally, the InterNode function adds the inlines to the paragraph that is passed by reference (ref).

public bool InterNode(HtmlNode htmlNode, ref Paragraph originalPar)
{
    string htmlNodeName = htmlNode.Name.ToLower();

    // Collect the names of all ancestors, because the text can be
    // b(old) and i(talic) at the same time.
    List<string> nodeAttList = new List<string>();
    HtmlNode parentNode = htmlNode.ParentNode;
    while (parentNode != null)
    {
        nodeAttList.Add(parentNode.Name);
        parentNode = parentNode.ParentNode;
    }

    Run newRun = new Run();
    foreach (string nodeAttStr in nodeAttList) // apply every inherited format to the inline
    {
        switch (nodeAttStr)
        {
            case "b":
            case "strong":
                newRun.FontWeight = FontWeights.Bold;
                break;
            case "i":
            case "em":
                newRun.FontStyle = FontStyle.Italic;
                break;
        }
    }

    if (htmlNodeName == "#text") // "#text" means it is a text node, like <i><#text/></i>. Thanks @HungCao
    {
        newRun.Text = htmlNode.InnerText;
        originalPar.Inlines.Add(newRun); // append the formatted run to the paragraph
    }
    else // not a text node: skip its InnerText, because any text it has lives in a #text child node
    {
        foreach (HtmlNode childNode in htmlNode.ChildNodes)
        {
            InterNode(childNode, ref originalPar);
        }
    }

    return true;
}

Note: I said earlier that my program has to render the HTML differently than a WebView does, and I know this example code produces the same result as a WebView. However, this is only a simplified version of my final code. My complete code, for which this is the base, works as expected.
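For readers who want to try the traversal outside UWP, the same idea can be sketched without any XAML types. This is an illustrative variant, not the code above: instead of walking up the parents of every node, it carries the list of open tags down the recursion. The names HtmlFlattener, Flatten and Walk are made up for this example, and decoding entities with HtmlEntity.DeEntitize is an addition the original code does not make.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper, not part of the original answer.
public static class HtmlFlattener
{
    // Returns every text node with the names of its ancestor elements,
    // in document order.
    public static List<(string Text, List<string> Tags)> Flatten(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new List<(string Text, List<string> Tags)>();
        Walk(doc.DocumentNode, new List<string>(), result);
        return result;
    }

    private static void Walk(HtmlNode node, List<string> tags,
                             List<(string Text, List<string> Tags)> result)
    {
        if (node.NodeType == HtmlNodeType.Comment)
        {
            return; // comments carry no rendered text
        }
        if (node.NodeType == HtmlNodeType.Text)
        {
            // Copy the tag list so later changes do not affect this entry.
            result.Add((HtmlEntity.DeEntitize(node.InnerText), new List<string>(tags)));
            return;
        }

        // Only real elements count as formatting context (not the document root).
        bool isElement = node.NodeType == HtmlNodeType.Element;
        if (isElement) tags.Add(node.Name);
        foreach (HtmlNode child in node.ChildNodes)
        {
            Walk(child, tags, result);
        }
        if (isElement) tags.RemoveAt(tags.Count - 1);
    }

    public static void Main()
    {
        string html = @"<p>This <b>is a <a href="""">Link</a></b> with <b>bold</b></p>";
        foreach (var (text, tags) in Flatten(html))
        {
            string path = string.Join(">", tags);
            Console.WriteLine("[" + path + "] \"" + text + "\"");
        }
    }
}
```

For the sample paragraph this yields five text pieces whose concatenation is "This is a Link with bold", each with the tags that apply to it (for example p, b and a for "Link"). A renderer can then turn each tag list into formatting, much as the switch statement in InterNode does.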

1/19/2017 9:53:01 PM



Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow