Splitting HTML string into two parts with HtmlAgilityPack

c# dom html html-agility-pack parsing

Question

I'm looking for the best way to split an HTML document over some tag in C# using HtmlAgilityPack. I want to preserve the intended markup as I'm doing the split. Here is an example.

If the document is like this:

<p>
<div>
    <p>
        Stuff
    </p>
    <p>
        <ul>
            <li>Bullet 1</li>
            <li><a href="#">link</a></li>
            <li>Bullet 3</li>
        </ul>
    </p>
            <span>Footer</span>
</div>
</p>

Once it's split, it should look like this:

Part 1

<p>
<div>
    <p>
        Stuff
    </p>
    <p>
        <ul>
            <li>Bullet 1</li>
        </ul>
    </p>
</div>
</p>

Part 2

<p>
<div>
    <p>
        <ul>
            <li>Bullet 3</li>
        </ul>
    </p>
            <span>Footer</span>
</div>

</p>

What would be the best way of doing something like that?

Accepted Answer

Here is what I came up with. This does the split and removes the "empty" elements of the element where the split happens.

    private static void SplitDocument()
    {
        var doc = new HtmlDocument();
        doc.Load("HtmlDoc.html");
        var links = doc.DocumentNode.SelectNodes("//a[@href]");
        var firstPart = GetFirstPart(doc.DocumentNode, links[0]).DocumentNode.InnerHtml;
        var secondPart = GetSecondPart(links[0]).DocumentNode.InnerHtml;
    }

    private static HtmlDocument GetFirstPart(HtmlNode currNode, HtmlNode link)
    {
        var nodeStack = new Stack<Tuple<HtmlNode, HtmlNode>>();        
        var newDoc = new HtmlDocument();
        var parent = newDoc.DocumentNode;

        nodeStack.Push(new Tuple<HtmlNode, HtmlNode>(currNode, parent));

        while (nodeStack.Count > 0)
        {
            var curr = nodeStack.Pop();
            var copyNode = curr.Item1.CloneNode(false);
            curr.Item2.AppendChild(copyNode);

            if (curr.Item1 == link)
            {
                var nodeToRemove = NodeAndEmptyAncestors(copyNode);
                nodeToRemove.ParentNode.RemoveChild(nodeToRemove);
                break;
            }

            for (var i = curr.Item1.ChildNodes.Count - 1; i >= 0; i--)
            {
                nodeStack.Push(new Tuple<HtmlNode, HtmlNode>(curr.Item1.ChildNodes[i], copyNode));
            }
        }

        return newDoc;
    }

    private static HtmlDocument GetSecondPart(HtmlNode link)
    {
        var nodeStack = new Stack<HtmlNode>();
        var newDoc = new HtmlDocument();

        var currNode = link;
        while (currNode.ParentNode != null)
        {
            currNode = currNode.ParentNode;
            nodeStack.Push(currNode.CloneNode(false));
        }

        var parent = newDoc.DocumentNode;
        while (nodeStack.Count > 0)
        {
            var node = nodeStack.Pop();
            parent.AppendChild(node);
            parent = node;
        }

        var newLink = link.CloneNode(false);
        parent.AppendChild(newLink);

        currNode = link;
        var newParent = newLink.ParentNode;

        while (currNode.ParentNode != null)
        {
            var foundNode = false;
            foreach (var child in currNode.ParentNode.ChildNodes)
            {
                if (foundNode) newParent.AppendChild(child.Clone());
                if (child == currNode) foundNode = true;
            }

            currNode = currNode.ParentNode;
            newParent = newParent.ParentNode;
        }

        var nodeToRemove = NodeAndEmptyAncestors(newLink);
        nodeToRemove.ParentNode.RemoveChild(nodeToRemove);

        return newDoc;
    }

    private static HtmlNode NodeAndEmptyAncestors(HtmlNode node)
    {
        var currNode = node;
        while (currNode.ParentNode != null && currNode.ParentNode.ChildNodes.Count == 1)
        {
            currNode = currNode.ParentNode;
        }

        return currNode;
    }

Popular Answer

Definitely not by . (Note: this was originally a tag on the question—now removed.) I'm usually not one to jump on The Pony is Coming bandwagon, but this is one case in which regular expressions would be particularly bad.

First, I would write a recursive function that removes all siblings of a node that follow that node—call it RemoveSiblingsAfter(node)—and then calls itself on its parent, so that all siblings following the parent are removed as well (and all siblings following the grandparent, and so on). You can use an XPath to find the node(s) on which you want to split, e.g. doc.DocumentNode.SelectNodes("//a[@href='#']"), and call the function on that node. When done, you'd remove the splitting node itself, and that's it. You'd repeat these steps for a copy of the original document, except you'd implement RemoveSiblingsBefore(node) to remove siblings that precede a node.

In your example, RemoveSiblingsBefore would act as follows:

  1. <a href="#"> has no siblings, so recurse on parent, <li>.
  2. <li> has a preceding sibling—<li>Bullet 1</li>—so remove, and recurse on parent, <ul>.
  3. <ul> has no siblings, so recurse on parent, <p>.
  4. <p> has a preceding sibling—<p>Stuff</p>—so remove, and recurse on parent, <div>.
  5. and so on.



Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow
Is this KB legal? Yes, learn why
Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow
Is this KB legal? Yes, learn why