Split an HTML string into N sections

c# html-agility-pack htmltidy regex

Question

Does anybody have an example of splitting an HTML string into N parts using C#? The string comes from a TinyMCE editor.

I need to divide the string evenly without splitting words.

To deal with the broken tags, I was considering just splitting the HTML and repairing it with HtmlAgilityPack. Ideally, though, the split points should be based on the text alone rather than on the HTML, and I'm not sure how to determine them.

Anyone have suggestions about how to approach this?

UPDATE

Here is an example of the input and the expected output, as requested.

INPUT:

<p><strong>Lorem ipsum dolor sit amet, <em>consectetur adipiscing</em></strong> elit.</p>

OUTPUT (When divided into three columns):

Part1: <p><strong>Lorem ipsum dolor</strong></p>
Part2: <p><strong>sit amet, <em>consectetur</em></strong></p>
Part3: <p><strong><em>adipiscing</em></strong> elit.</p>

2ND UPDATE

I've recently experimented with HTML Tidy and it appears to be effective at repairing broken tags, so this would be a good option if I can find a way to identify the split points.

3RD UPDATE

Using a technique similar to the one in "C#: Truncate string on whole words in .NET", I've now been able to get the list of plain-text words that will make up each section. So, assuming HTML Tidy gives me a valid XML structure for the HTML, does anybody have suggestions on how to split the markup according to this list of words?

4TH UPDATE

Can anybody see a problem with using a regex to find the indices in the HTML in the following way:

Replace every space in the plain-text string "sit amet, consectetur" with the regex "(\s|<(.|\n)+?>)*", which should in theory find that string with any mix of spaces and/or tags between the words.

The broken HTML tags could then be fixed using HTML Tidy, right?
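A minimal sketch of that idea, assuming the intended gap pattern is `(\s|<(.|\n)+?>)*`. The names `PhraseLocator`, `FindPhrase`, and `Gap` are made up for illustration, and each word is escaped so punctuation is matched literally.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class PhraseLocator
{
    // Gap between two words: any mix of whitespace and tags (assumed pattern).
    private const string Gap = @"(\s|<(.|\n)+?>)*";

    // Returns the first match of the plain-text phrase inside the HTML string.
    public static Match FindPhrase(string html, string plainText)
    {
        string[] words = plainText.Split(
            new[] { ' ' },
            StringSplitOptions.RemoveEmptyEntries
        );

        // Escape each word, then join the words with the gap pattern.
        string pattern = string.Join(Gap, words.Select(w => Regex.Escape(w)));
        return Regex.Match(html, pattern);
    }
}
```

With the example input, `FindPhrase(html, "sit amet, consectetur")` should match the span `sit amet, <em>consectetur`, and `Match.Index` and `Match.Length` give its position in the HTML. One likely problem: the lazy `<(.|\n)+?>` can still backtrack past a `>` and swallow text between two tags, so a tighter gap such as `(\s|<[^>]+>)*` may be safer.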

Thank you

Matt

1
6
5/23/2017 12:00:17 PM

Accepted Answer

An Idea for a Solution

Man, this is a curse of mine! Apparently I can't leave a problem alone until I've spent an unreasonable number of hours on it.

I gave it some thought. HTML Tidy crossed my mind; maybe it would work, but I had a hard time wrapping my head around it.

So I wrote my own solution.

I tested this with your input as well as some additional input I put together myself. It seems to work fairly well. It undoubtedly has flaws, but at least it gives you a place to start.

Anyway, here was my strategy:

  1. Represent a single word in an HTML document with a class that records the word's position in the document hierarchy, up to a given "top" node. I've done this in the HtmlWord class below.
  2. Create a class that can write these HTML words out as a single line, adding start-element and end-element tags where they belong. I've done this in the HtmlLine class below.
  3. Write a few extension methods to make these classes quickly and easily accessible from an HtmlAgilityPack.HtmlNode object. I've done this in the HtmlHelper class below.

Am I crazy for doing all this? Probably, yes. But, you know, if you can't figure out any other way, you can give this a try.
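To illustrate the core of step 2, here is a simplified sketch of deciding which tags to close and open between two consecutive words. It compares ancestor stacks by tag name only, which is an assumption made for brevity (two sibling elements with the same name would look identical); `TagStackDiff` and `Diff` are hypothetical names.

```csharp
using System;
using System.Linq;

public static class TagStackDiff
{
    // Given the ancestor tag names (outermost first) of two consecutive words,
    // returns the tags to close after the first word and the tags to open
    // before the second word.
    public static (string[] Close, string[] Open) Diff(string[] previous, string[] next)
    {
        // Count how many outermost ancestors both words share.
        int common = 0;
        while (common < previous.Length
               && common < next.Length
               && previous[common] == next[common])
        {
            common++;
        }

        // Close innermost-first; open outermost-first.
        string[] close = previous.Skip(common).Reverse().ToArray();
        string[] open = next.Skip(common).ToArray();
        return (close, open);
    }
}
```

For the words "adipiscing" (inside p, strong, em) and "elit." (inside p only), `Diff` reports that em and then strong must be closed and that nothing new needs to be opened.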

The process is as follows using your example input:

var document = new HtmlDocument();
document.LoadHtml("<p><strong>Lorem ipsum dolor sit amet, <em>consectetur adipiscing</em></strong> elit.</p>");

var nodeToSplit = document.DocumentNode.SelectSingleNode("p");
// the argument is the number of words per line, not the number of lines
var lines = nodeToSplit.SplitIntoLines(3);

foreach (var line in lines)
    Console.WriteLine(line.ToString());

Output:

<p><strong>Lorem ipsum dolor </strong></p>
<p><strong>sit amet, <em>consectetur </em></strong></p>
<p><strong><em>adipiscing </em></strong>elit. </p>

Here is the code:

HtmlWord class

using System;
using System.Collections.Generic;
using System.Linq;

using HtmlAgilityPack;

public class HtmlWord {
    public string Text { get; private set; }
    public HtmlNode[] NodeStack { get; private set; }

    // convenience property to display list of ancestors cleanly
    // (for ease of debugging)
    public string NodeList {
        get { return string.Join(", ", NodeStack.Select(n => n.Name).ToArray()); }
    }

    internal HtmlWord(string text, HtmlNode node, HtmlNode top) {
        Text = text;
        NodeStack = GetNodeStack(node, top);
    }

    private static HtmlNode[] GetNodeStack(HtmlNode node, HtmlNode top) {
        var nodes = new Stack<HtmlNode>();

        while (node != null && !node.Equals(top)) {
            nodes.Push(node);
            node = node.ParentNode;
        }

        return nodes.ToArray();
    }
}

HtmlLine class

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Xml;

using HtmlAgilityPack;

[Flags()]
public enum NodeChange {
    None = 0,
    Dropped = 1,
    Added = 2
}

public class HtmlLine {
    private List<HtmlWord> _words;
    public IList<HtmlWord> Words {
        get { return _words.AsReadOnly(); }
    }

    public int WordCount {
        get { return _words.Count; }
    }

    public HtmlLine(IEnumerable<HtmlWord> words) {
        _words = new List<HtmlWord>(words);
    }

    private static NodeChange CompareNodeStacks(HtmlWord x, HtmlWord y, out HtmlNode[] droppedNodes, out HtmlNode[] addedNodes) {
        var droppedList = new List<HtmlNode>();
        var addedList = new List<HtmlNode>();

        // traverse x's NodeStack backwards to see which nodes
        // are not in y's NodeStack (and are therefore "finished")
        foreach (var node in x.NodeStack.Reverse()) {
            if (!Array.Exists(y.NodeStack, n => n.Equals(node)))
                droppedList.Add(node);
        }

        // traverse y's NodeStack forwards to see which nodes
        // are not in x's NodeStack (and are therefore "new")
        foreach (var node in y.NodeStack) {
            if (!Array.Exists(x.NodeStack, n => n.Equals(node)))
                addedList.Add(node);
        }

        droppedNodes = droppedList.ToArray();
        addedNodes = addedList.ToArray();

        // flags are combined with bitwise OR
        NodeChange change = NodeChange.None;
        if (droppedNodes.Length > 0)
            change |= NodeChange.Dropped;
        if (addedNodes.Length > 0)
            change |= NodeChange.Added;

        // could maybe use this in some later revision?
        // not worth the effort right now...
        return change;
    }

    public override string ToString() {
        if (WordCount < 1)
            return string.Empty;

        var lineBuilder = new StringBuilder();

        using (var lineWriter = new StringWriter(lineBuilder))
        using (var xmlWriter = new XmlTextWriter(lineWriter)) {
            var firstWord = _words[0];
            foreach (var node in firstWord.NodeStack) {
                xmlWriter.WriteStartElement(node.Name);
                foreach (var attr in node.Attributes)
                    xmlWriter.WriteAttributeString(attr.Name, attr.Value);
            }
            xmlWriter.WriteString(firstWord.Text + " ");

            for (int i = 1; i < WordCount; ++i) {
                var previousWord = _words[i - 1];
                var word = _words[i];

                HtmlNode[] droppedNodes;
                HtmlNode[] addedNodes;

                CompareNodeStacks(
                    previousWord,
                    word,
                    out droppedNodes,
                    out addedNodes
                );

                foreach (var dropped in droppedNodes)
                    xmlWriter.WriteEndElement();
                foreach (var added in addedNodes) {
                    xmlWriter.WriteStartElement(added.Name);
                    foreach (var attr in added.Attributes)
                        xmlWriter.WriteAttributeString(attr.Name, attr.Value);
                }

                xmlWriter.WriteString(word.Text + " ");

                if (i == _words.Count - 1) {
                    foreach (var node in word.NodeStack)
                        xmlWriter.WriteEndElement();
                }
            }
        }

        return lineBuilder.ToString();
    }
}

HtmlHelper static class

using System;
using System.Collections.Generic;
using System.Linq;

using HtmlAgilityPack;

public static class HtmlHelper {
    public static IList<HtmlLine> SplitIntoLines(this HtmlNode node, int wordsPerLine) {
        var lines = new List<HtmlLine>();

        var words = node.GetWords(node.ParentNode);

        for (int i = 0; i < words.Count; i += wordsPerLine) {
            lines.Add(new HtmlLine(words.Skip(i).Take(wordsPerLine)));
        }

        return lines.AsReadOnly();
    }

    public static IList<HtmlWord> GetWords(this HtmlNode node, HtmlNode top) {
        var words = new List<HtmlWord>();

        if (node.HasChildNodes) {
            foreach (var child in node.ChildNodes)
                words.AddRange(child.GetWords(top));
        } else {
            var textNode = node as HtmlTextNode;
            if (textNode != null && !string.IsNullOrEmpty(textNode.Text)) {
                string[] singleWords = textNode.Text.Split(
                    new string[] {" "},
                    StringSplitOptions.RemoveEmptyEntries
                );
                words.AddRange(
                    singleWords.Select(w => new HtmlWord(w, node.ParentNode, top))
                );
            }
        }

        return words.AsReadOnly();
    }
}
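
Note that SplitIntoLines takes a words-per-line count rather than a number of sections. A minimal sketch of how a word list could instead be divided into N consecutive groups of near-equal size is shown here; `EvenSplit` and `Partition` are hypothetical names, and the split is by word count, not by character length.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class EvenSplit
{
    // Splits items into `parts` consecutive groups whose sizes differ by at most one.
    public static List<List<T>> Partition<T>(IList<T> items, int parts)
    {
        if (parts < 1)
            throw new ArgumentOutOfRangeException(nameof(parts));

        var groups = new List<List<T>>();
        int baseSize = items.Count / parts;
        int remainder = items.Count % parts;
        int start = 0;

        for (int i = 0; i < parts; i++)
        {
            // The first `remainder` groups take one extra item.
            int size = baseSize + (i < remainder ? 1 : 0);
            groups.Add(items.Skip(start).Take(size).ToList());
            start += size;
        }

        return groups;
    }
}
```

For the eight words of the example input and three parts, this yields groups of 3, 3, and 2 words. Each group of HtmlWord objects could then presumably be passed to the HtmlLine constructor above.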

Conclusion

Just to be clear, this is a thrown-together solution and I'm sure it has problems. I'm only offering it as a starting point to consider if you can't get the behavior you want by other means.

17
5/1/2010 6:50:23 PM

Popular Answer

This idea is only a workaround; perhaps there is a better approach.

Basically, you want to divide a large block of HTML-formatted text into smaller chunks while keeping the original font and other formatting. I believe it is possible to import the original HTML into an RTF control or a Word object, divide it there into parts that maintain the formatting, and then output the individual HTML.

If HtmlAgilityPack offers a straightforward method of extracting text with formatting information from the source HTML, there may be a way to use it similarly.

Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow