HtmlAgilityPack: How do I combine HTML elements together into one tag with a class?



Issue: I need to examine some HTML elements using HtmlAgilityPack and combine their tag names. Is it possible to walk each tag, from the parent down to the child, and replace the whole chain with a single span whose class has a name such as “strikeUEmStrong”? The name changes based on which HTML elements are nested.

The order of the names in the class does in fact matter; I realized this through trial and error. The solution needs to get all of the elements and combine them together. It is very possible that a paragraph will have multiple text nodes with various levels of formatting.

This will affect multiple paragraphs.

For example, if I have this HTML code:

<p><strike><u><em><strong>four styles</strong></em></u></strike></p>

How do I convert it to this:

<p><span class="strikeUEmStrong">four styles</span></p>

It's possible to have this type of code as well:

    <p><strike><u><em><strong>four styles</strong></em></u></strike>&nbsp; <strike><u><em>three styles</em></u></strike></p>
    <p><em><strong>two styles</strong></em></p>

The output should look like this:

<p><span class="strikeUEmStrong">four styles</span>&nbsp; <span class="strikeUEm">three styles</span></p><p><span class="emStrong">two styles</span></p>
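
To make the naming rule concrete, here is a minimal sketch of how the tag names could be joined into the class name. `ToClassName` is a hypothetical helper (it is not part of HtmlAgilityPack), and the sketch assumes the tag names are already lower-case and arrive in nesting order, outermost first.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text

Module ClassNameSketch
    ' Hypothetical helper: joins tag names into a camel-cased class name.
    ' The first name is kept as-is; each later name gets an upper-case first letter.
    Function ToClassName(ByVal tagNames As IEnumerable(Of String)) As String
        Dim sb As New StringBuilder()
        For Each tagName As String In tagNames
            ' Ignore empty entries so stray separators do not break the result
            If String.IsNullOrEmpty(tagName) Then Continue For
            If sb.Length = 0 Then
                sb.Append(tagName)
            Else
                sb.Append(Char.ToUpperInvariant(tagName(0)))
                sb.Append(tagName.Substring(1))
            End If
        Next
        Return sb.ToString()
    End Function

    Sub Main()
        Console.WriteLine(ToClassName(New String() {"strike", "u", "em", "strong"}))
        Console.WriteLine(ToClassName(New String() {"em", "strong"}))
    End Sub
End Module
```

With the two sample lists in `Main`, this prints `strikeUEmStrong` and `emStrong`.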


'Retrieve the class name of each format node
Function GetClassName(ByVal n As HtmlNode) As String
    Dim ret As String = String.Empty

    'Text nodes and the enclosing paragraph do not contribute a name
    If (n.Name <> "#text") AndAlso (n.Name <> "p") Then
        ret = n.Name & " "
    End If

    'Get the next node
    For Each child As HtmlNode In n.ChildNodes
        ret &= GetClassName(child)
    Next

    Return ret
End Function

'Create a list of class names
Function GetClassNameList(ByVal classNameList As String) As List(Of String)
    Dim ret As New List(Of String)
    Dim classArr() As String = classNameList.Split(" "c)

    'Copy each name from the split string into the list
    For Each className As String In classArr
        ret.Add(className)
    Next

    Return ret
End Function

'Sort a list of class names and return a merged class string
Function GetSortedClassNameString(ByVal classList As List(Of String)) As String

    Dim sortedMergedClass As String = String.Empty

    'Merge the names in list order
    For Each className As String In classList
        sortedMergedClass &= className
    Next

    Return sortedMergedClass
End Function

'Let's point to the body node
Dim bodyNode As HtmlNode = htmlDoc.DocumentNode.SelectSingleNode("//body")

'Let's create some generic nodes
Dim currPNode As HtmlNode

Dim formatNodes As HtmlNodeCollection

Dim text As String = String.Empty
Dim textSize As Integer = 0

'Make sure the editor has something in it
If editorText <> "" Then

   'Send the text from the editor to the body node
    If bodyNode IsNot Nothing Then
       bodyNode.InnerHtml = editorText
    End If

    Dim pNode = bodyNode.SelectNodes("//p")

    Dim span As HtmlNode = htmlDoc.CreateElement("span")
    Dim tmpBody As HtmlNode = htmlDoc.CreateElement("body")
    Dim textNode As HtmlNode = htmlDoc.CreateTextNode

    Dim pCount As Integer = bodyNode.SelectNodes("//body/p").Count - 1

    For childCountP As Integer = 0 To pCount

        Dim paragraph = HtmlNode.CreateNode(htmlDoc.CreateElement("p").WriteTo)

        'Which paragraph I am at.
        currPNode = pNode.Item(childCountP)

        'For this paragraph get me the collection of html nodes
        formatNodes = currPNode.ChildNodes

        'Count how many Format nodes we have in a paragraph
        Dim formatCount As Integer = currPNode.ChildNodes.Count - 1

       'Go through each node and examine the elements. 
       'Then look at the markup to create classes and then group them under one span
       For child As Integer = 0 To formatCount

           'Iterate through the formatNodes: strike, em, strong, etc.
           Dim currFormatNode = HtmlNode.CreateNode(formatNodes(child).WriteTo)

           'TODO: Handle nested images and links? How do we know what to rip out?

           'First check for format nodes ("#text" never matches any of these names)
           'Note: we can't let it use everything because it will change nested elements as well, e.g. a span within a span.
           If (currFormatNode.Name = "strike") OrElse (currFormatNode.Name = "em") _
               OrElse (currFormatNode.Name = "strong") OrElse (currFormatNode.Name = "u") OrElse (currFormatNode.Name = "sub") _
               OrElse (currFormatNode.Name = "sup") OrElse (currFormatNode.Name = "b") Then

              'strip all tags, just take the inner text
              span.InnerHtml = currFormatNode.InnerText

              'Create a text node with text from the lowest node
              textNode = htmlDoc.CreateTextNode(span.InnerText)

              'Recursively go through the format nodes
              'Create a list from the string
              'Then sort the list and return a string
              'Appending the class to the span
               span.SetAttributeValue("class", GetSortedClassNameString(GetClassNameList(GetClassName(currFormatNode).Trim())))

              'Attach the span before the current format node
              currFormatNode.ParentNode.InsertBefore(span, currFormatNode)

             'Remove the formatted children leaving the above node

             'We need to build a paragraph here
             paragraph.InnerHtml &= span.OuterHtml

             'Let's output something for debugging
             childNodesTxt.InnerText &= span.OuterHtml

             Else 'handle #text and other nodes separately
                  'Create the text node first, then add it to the paragraph being built
                  textNode = htmlDoc.CreateTextNode(currFormatNode.InnerHtml)
                  paragraph.InnerHtml &= textNode.OuterHtml

                  'Let's output something for debugging
                  childNodesTxt.InnerText &= textNode.OuterHtml
             End If

       Next 'End of formats

        'Start adding the new paragraphs to the body node
    Next 'End of paragraphs

    'Clean out body first and replace with new elements

    'Update our body

 End If

 End If


<span class="strikeuemstrong">four styles</span>

Finally getting the right output, after I fixed the ordering issue. Thank you for the help.

11/16/2012 9:53:46 PM

Accepted Answer

This is not a straightforward question to answer. I'll describe how I'd write the algorithm to do this, and include some pseudo code to help.

  1. I'd get my parent tag. I'll assume you want to do this for all "p" tags
  2. I'd iterate over my children tags, taking the tag name and appending it into a class name
  3. I'd recursively iterate children until I get my appended tag name

Pseudo-code. Please excuse any typos, as I'm typing this on the fly.

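using HtmlAgilityPack;

// Recursively append the tag name of n and of every element below it.
static string GetClassName(HtmlNode n)
{
    if (n.NodeType != HtmlNodeType.Element) return ""; // skip "#text" and comment nodes
    var ret = n.Name;
    foreach (var child in n.ChildNodes)
        ret += GetClassName(child);
    return ret;
}

// paragraphs is assumed to be the result of doc.DocumentNode.SelectNodes("//p")
foreach (var p in paragraphs)
{
    // Use a for loop, not foreach: replacing a child while enumerating
    // p.ChildNodes modifies the collection being iterated and blows up.
    for (var i = 0; i < p.ChildNodes.Count; i++)
    {
        var child = p.ChildNodes[i];
        if (child.NodeType != HtmlNodeType.Element) continue; // keep plain text as-is

        var span = p.OwnerDocument.CreateElement("span");
        span.InnerHtml = child.InnerText; // strip all tags, just take the inner text
        span.SetAttributeValue("class", GetClassName(child));
        p.ReplaceChild(span, child);
    }
}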
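
For the VB.NET code in the question, the same idea might look like the sketch below. It assumes the HtmlAgilityPack package is referenced. `FormatTags` and the sample markup are illustrative choices, and only the listed formatting tags are replaced, so links or images directly inside a paragraph are left untouched. The class name is the plain concatenation of tag names, as in the pseudo-code.

```vbnet
Imports System
Imports System.Collections.Generic
Imports HtmlAgilityPack

Module SpanMergeSketch
    ' Illustrative whitelist of formatting tags (taken from the check in the question).
    Private ReadOnly FormatTags As New HashSet(Of String)(
        New String() {"strike", "u", "em", "strong", "sub", "sup", "b"})

    ' Concatenates the element names from n downwards, outermost first.
    Function GetClassName(ByVal n As HtmlNode) As String
        ' Text and comment nodes contribute nothing to the class name
        If n.NodeType <> HtmlNodeType.Element Then Return String.Empty
        Dim ret As String = n.Name
        For Each child As HtmlNode In n.ChildNodes
            ret &= GetClassName(child)
        Next
        Return ret
    End Function

    Sub Main()
        Dim doc As New HtmlDocument()
        doc.LoadHtml("<p><strike><u><em><strong>four styles</strong></em></u></strike>" &
                     "&nbsp; <em><strong>two styles</strong></em></p>")

        ' SelectNodes returns Nothing when no node matches
        Dim paragraphs As HtmlNodeCollection = doc.DocumentNode.SelectNodes("//p")
        If paragraphs Is Nothing Then Return

        For Each p As HtmlNode In paragraphs
            ' Index-based loop, because children are replaced while walking them
            For i As Integer = 0 To p.ChildNodes.Count - 1
                Dim child As HtmlNode = p.ChildNodes(i)
                If Not FormatTags.Contains(child.Name) Then Continue For

                Dim span As HtmlNode = doc.CreateElement("span")
                span.SetAttributeValue("class", GetClassName(child))
                span.InnerHtml = child.InnerText ' strip all tags, keep the text
                p.ReplaceChild(span, child)
            Next
        Next

        Console.WriteLine(doc.DocumentNode.OuterHtml)
    End Sub
End Module
```

Each replacement swaps one child for one span, so the child count stays the same and the loop bound computed at the start remains valid.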

I'd be happy to modify my answer once I get more context. Please review what I'm suggesting and comment on it with more context if you need me to refine my suggestion. If not, I hope this gets you what you need :)

11/8/2012 8:34:49 PM

Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow