How to fix ill-formed HTML with HTML Agility Pack?

.net c# html html-agility-pack parsing

Question

This HTML is improperly formatted and has overlapping tags.

<p>word1<b>word2</p>
<p>word3</b>word4</p>

The overlapping may also be nested.

How can I use HTML Agility Pack (HAP) to transform it into properly formatted HTML?

I'm trying to get this result:

<p>word1<b>word2</b></p>
<p><b>word3</b>word4</p>

I tried HtmlNode.ElementsFlags["b"] = HtmlElementFlag.Closed | HtmlElementFlag.CanOverlap, but it does not work as expected.

3/28/2014 3:20:00 PM

Accepted Answer

In fact, it is working as designed, just maybe not the way you expected. Here is a sample piece of code (a console program) that shows how to fix this kind of HTML with the library.

The library exposes a ParseErrors collection that you can use to find out which errors were detected while parsing the markup.
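To see what the parser reports for the markup in the question, a minimal sketch like the following can help. It assumes the HtmlAgilityPack package is referenced; the class name ParseErrorDump is made up for illustration, and the exact errors listed depend on the library version and on the element flags in effect.

```csharp
using System;
using HtmlAgilityPack;

class ParseErrorDump
{
    static void Main()
    {
        var doc = new HtmlDocument();

        // The overlapping markup from the question, loaded from a string.
        doc.LoadHtml("<p>word1<b>word2</p>\n<p>word3</b>word4</p>");

        // Each HtmlParseError carries an error code, a position and the offending source text.
        foreach (HtmlParseError error in doc.ParseErrors)
        {
            Console.WriteLine("{0} at line {1}, column {2}: {3} [{4}]",
                error.Code, error.Line, error.LinePosition, error.Reason, error.SourceText);
        }
    }
}
```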

There are really two different kinds of issues here:

1) Unclosed elements. The library fixes these by default, but a flag set on the P element prevents that in this case.

2) Unopened elements. These are harder, because the fix depends on where you want the tag to be opened. In the following example, the element is opened at the nearest preceding text node (by stream position).

using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

class Program
{
    static void Main(string[] args)
    {
        // clear the flags on P so unclosed elements in P will be auto closed.
        HtmlNode.ElementsFlags.Remove("p");

        // load the document
        HtmlDocument doc = new HtmlDocument();
        doc.Load("yourTestFile.htm");

        // build a list of nodes ordered by stream position
        NodePositions pos = new NodePositions(doc);

        // browse all tags detected as not opened
        foreach (HtmlParseError error in doc.ParseErrors.Where(e => e.Code == HtmlParseErrorCode.TagNotOpened))
        {
            // find the text node just before this error
            HtmlTextNode last = pos.Nodes.OfType<HtmlTextNode>().LastOrDefault(n => n.StreamPosition < error.StreamPosition);
            if (last != null)
            {
                // fix the text: prepend the missing opening tag (the closing tag
                // without its "/") and reintroduce the closing tag after the text
                last.Text = error.SourceText.Replace("/", "") + last.Text + error.SourceText;
            }
        }

        // write the repaired document to the console
        doc.Save(Console.Out);
    }
}

public class NodePositions
{
    public NodePositions(HtmlDocument doc)
    {
        AddNode(doc.DocumentNode);
        Nodes.Sort(new NodePositionComparer());
    }

    private void AddNode(HtmlNode node)
    {
        Nodes.Add(node);
        foreach (HtmlNode child in node.ChildNodes)
        {
            AddNode(child);
        }
    }

    private class NodePositionComparer : IComparer<HtmlNode>
    {
        public int Compare(HtmlNode x, HtmlNode y)
        {
            return x.StreamPosition.CompareTo(y.StreamPosition);
        }
    }

    public List<HtmlNode> Nodes = new List<HtmlNode>();
}
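The same idea can be sketched more compactly for in-memory strings. This is only a variant under assumptions, not the answer's original code: it assumes your HAP version provides HtmlNode.Descendants() (used here in place of the NodePositions class), and OverlapFixer and FixUnopenedTags are hypothetical names chosen for illustration.

```csharp
using System;
using System.IO;
using System.Linq;
using HtmlAgilityPack;

static class OverlapFixer
{
    // Hypothetical helper, not part of HAP: applies the fix above to an HTML string.
    public static string FixUnopenedTags(string html)
    {
        // Clear the flags on P so unclosed elements inside P are auto closed.
        // Note: this is a static, process-wide setting.
        HtmlNode.ElementsFlags.Remove("p");

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // All text nodes, ordered by their position in the source stream.
        var textNodes = doc.DocumentNode.Descendants()
            .OfType<HtmlTextNode>()
            .OrderBy(n => n.StreamPosition)
            .ToList();

        foreach (HtmlParseError error in doc.ParseErrors
                     .Where(e => e.Code == HtmlParseErrorCode.TagNotOpened))
        {
            // Nearest text node that starts before the stray closing tag.
            HtmlTextNode last = textNodes.LastOrDefault(n => n.StreamPosition < error.StreamPosition);
            if (last != null)
            {
                // Wrap that text in the missing opening tag and the closing tag.
                last.Text = error.SourceText.Replace("/", "") + last.Text + error.SourceText;
            }
        }

        using (var writer = new StringWriter())
        {
            doc.Save(writer);
            return writer.ToString();
        }
    }
}
```

It could be called as Console.WriteLine(OverlapFixer.FixUnopenedTags("<p>word1<b>word2</p>\n<p>word3</b>word4</p>")); check the result against your own input, since nested overlaps may need a different choice of where to open the tag.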
3/31/2014 10:21:17 PM


Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow