Best way to combine nodes with Html Agility Pack

c# html-agility-pack

Question

I've converted a large document from Word to HTML. It's close, but I have a bunch of "code" nodes that I'd like to merge into one "pre" node.

Here's the input:

<p>Here's a sample MVC Controller action:</p>
<code>        public ActionResult Index()</code>
<code>        {</code>
<code>            return View();</code>
<code>        }</code>
<p>We'll start by making the following changes...</p>

I want to turn it into this, instead:

<p>Here's a sample MVC Controller action:</p>
<pre class="brush: csharp">        public ActionResult Index()
    {
        return View();
    }</pre>
<p>We'll start by making the following changes...</p>

I ended up writing a brute-force loop that iterates nodes looking for consecutive ones, but this seems ugly to me:

HtmlDocument doc = new HtmlDocument();
doc.Load(file);

var nodes = doc.DocumentNode.ChildNodes;
string contents = string.Empty;

foreach (HtmlNode node in nodes)
{

    if (node.Name == "code")
    {
        contents += node.InnerText + Environment.NewLine;
        if (node.NextSibling.Name != "code" && 
            !(node.NextSibling.Name == "#text" && node.NextSibling.NextSibling.Name == "code")
            )
        {
            node.Name = "pre";
            node.Attributes.RemoveAll();
            node.SetAttributeValue("class", "brush: csharp");
            node.InnerHtml = contents;
            contents = string.Empty;
        }
    }
}

nodes = doc.DocumentNode.SelectNodes(@"//code");
foreach (var node in nodes)
{
    node.Remove();
}

Normally I'd remove the nodes in the first loop, but that doesn't work during iteration since you can't change the collection as you iterate over it.

Better ideas?

Popular Answer

The first approach: select all the <code> nodes, group them, and create a <pre> node per group:

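// Requires "using System.Linq;" and the NextSiblingIsCode extension shown below.
var idx = 0;
// SelectNodes returns null when nothing matches, so fall back to an empty sequence.
var codeNodes = doc.DocumentNode.SelectNodes("//code")
    ?? Enumerable.Empty<HtmlNode>();
var nodes = codeNodes
    .GroupBy(n => new {
        Parent = n.ParentNode,
        // idx advances only after the last node of a run, so a whole run shares one index.
        Index = n.NextSiblingIsCode() ? idx : idx++
    });

foreach (var group in nodes)
{
    var pre = HtmlNode.CreateNode("<pre class='brush: csharp'></pre>");
    pre.AppendChild(doc.CreateTextNode(
        string.Join(Environment.NewLine, group.Select(g => g.InnerText))
    ));
    group.Key.Parent.InsertBefore(pre, group.First());

    foreach (var code in group)
        code.Remove();
}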

The grouping key combines the parent node with a group index, which is incremented each time a run of consecutive <code> nodes ends. I also used the NextSiblingIsCode extension method here:

public static bool NextSiblingIsCode(this HtmlNode node)
{
    return (node.NextSibling != null && node.NextSibling.Name == "code") ||
        (node.NextSibling is HtmlTextNode && 
         node.NextSibling.NextSibling != null && 
         node.NextSibling.NextSibling.Name == "code");
}

It is used to determine whether the next sibling is a <code> node, allowing for a single text node (such as the whitespace between tags) in between.
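The counter inside the GroupBy lambda works because LINQ to Objects evaluates the key selector once per element, in source order. If that side effect feels fragile, the same run-splitting idea can be sketched with an explicit loop. The class and method names below (RunGrouping, SplitIntoRuns) are hypothetical, and the sketch works on any sequence, not just HtmlNode:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class RunGrouping
{
    // Splits a sequence into runs of adjacent items. A new run starts at the
    // first item, and again whenever startsNewRun(previous, current) is true.
    public static List<List<T>> SplitIntoRuns<T>(
        IEnumerable<T> source, Func<T, T, bool> startsNewRun)
    {
        var runs = new List<List<T>>();
        foreach (var item in source)
        {
            // runs.Last().Last() is the most recently added item overall.
            if (runs.Count == 0 || startsNewRun(runs.Last().Last(), item))
                runs.Add(new List<T>());
            runs.Last().Add(item);
        }
        return runs;
    }
}
```

With the nodes from SelectNodes("//code"), a predicate along the lines of (prev, cur) => prev.ParentNode != cur.ParentNode || !prev.NextSiblingIsCode() should reproduce the grouping above, assuming <code> elements are not nested inside each other.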


The second approach: select only the first <code> node of each group, then walk forward from each of these nodes through the following <code> siblings until the first non-<code> element. I used XPath here:

// Select each <code> whose nearest preceding element sibling is not a <code>.
var nodes = doc.DocumentNode.SelectNodes(
    "//code[name(preceding-sibling::*[1])!='code']"
);
// SelectNodes returns null when nothing matches (needs "using System.Linq;").
foreach (var node in nodes ?? Enumerable.Empty<HtmlNode>())
{
    var pre = HtmlNode.CreateNode("<pre class='brush: csharp'></pre>");
    node.ParentNode.InsertBefore(pre, node);
    var content = string.Empty;
    var next = node;
    do
    {
        content += next.InnerText + Environment.NewLine;
        var previous = next;
        // Look up the next <code> before removing the current one.
        next = next.SelectSingleNode("following-sibling::*[1][name()='code']");
        previous.Remove();
    } while (next != null);
    pre.AppendChild(doc.CreateTextNode(
        content.TrimEnd(Environment.NewLine.ToCharArray())
    ));
}
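To try the second approach on input shaped like the sample in the question, it can be wrapped in a small helper. This is only a sketch: CodeRunMerger, MergeCodeRuns and RunOnSample are hypothetical names, the sample markup is shortened, and the code assumes the HtmlAgilityPack NuGet package is referenced. It also guards against SelectNodes returning null when the document has no <code> elements.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class CodeRunMerger
{
    // Hypothetical helper: replaces each run of sibling <code> elements
    // with a single <pre> holding one line per original <code>.
    public static void MergeCodeRuns(HtmlDocument doc)
    {
        var firstOfEachRun = doc.DocumentNode.SelectNodes(
            "//code[name(preceding-sibling::*[1])!='code']");
        if (firstOfEachRun == null) return; // no <code> elements at all

        foreach (var first in firstOfEachRun)
        {
            var pre = HtmlNode.CreateNode("<pre class='brush: csharp'></pre>");
            first.ParentNode.InsertBefore(pre, first);

            var lines = new List<string>();
            var current = first;
            while (current != null)
            {
                lines.Add(current.InnerText);
                // Find the next <code> before detaching the current one.
                var next = current.SelectSingleNode(
                    "following-sibling::*[1][name()='code']");
                current.Remove();
                current = next;
            }
            pre.AppendChild(doc.CreateTextNode(
                string.Join(Environment.NewLine, lines)));
        }
    }

    // Runs the helper on a shortened sample and returns the resulting HTML.
    public static string RunOnSample()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<p>Intro</p>\n<code>a</code>\n<code>b</code>\n<p>Outro</p>");
        MergeCodeRuns(doc);
        return doc.DocumentNode.OuterHtml;
    }
}
```

Collecting the lines in a list and joining them at the end avoids the trailing-newline trim in the loop above. The whitespace text nodes that sat between the original <code> elements are left in place after the <pre>.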


Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow