HTMl agility pack error parsing and returning XElement

.net-3.5 c# html-agility-pack html-parsing

Question

I can parse the document and generate an output however the output cannot be parsed into an XElement because of a p tag, everything else within the string is parsed correctly.

My input:

var input = "<p> Not sure why is is null for some wierd reason!<br><br>I have implemented the auto save feature, but does it really work after 100s?<br></p> <p> <i>Autosave?? </i> </p> <p>we are talking...</p><p></p><hr><p><br class=\"GENTICS_ephemera\"></p>";

My code:

public static XElement CleanupHtml(string input)
    {  


    HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();

    htmlDoc.OptionOutputAsXml = true;
    //htmlDoc.OptionWriteEmptyNodes = true;             
    //htmlDoc.OptionAutoCloseOnEnd = true;
    htmlDoc.OptionFixNestedTags = true;

    htmlDoc.LoadHtml(input);

    // ParseErrors is an ArrayList containing any errors from the Load statement
    if (htmlDoc.ParseErrors != null && htmlDoc.ParseErrors.Count() > 0)
    {

    }
    else
    {

        if (htmlDoc.DocumentNode != null)
        {
            var ndoc = new HtmlDocument(); // HTML doc instance
            HtmlNode p = ndoc.CreateElement("body");  

            p.InnerHtml = htmlDoc.DocumentNode.InnerHtml;
            var result = p.OuterHtml.Replace("<br>", "<br/>");
            result = result.Replace("<br class=\"special_class\">", "<br/>");
            result = result.Replace("<hr>", "<hr/>");
            return XElement.Parse(result, LoadOptions.PreserveWhitespace);
        }
    }
    return new XElement("body");

}

My output:

<body>
   <p> Not sure why is is null for some wierd reason chappy!
   <br/>
   <br/>I have implemented the auto save feature, but does it really work after 100s?
   <br/>
   </p> 
   <p> 
   <i>Autosave?? </i> 
   </p> 
   <p>we are talking...</p>
   **<p>**
   <hr/>
   <p>
   <br/>
   </p>
</body>

The bold p tag is the one that did not output correctly... Is there a way around this? Am I doing something wrong with the code?

Accepted Answer

What you are trying to do is basically transform an Html input into an Xml output.

Html Agility Pack can do that when you use the OptionOutputAsXml option, but in this case, you should not use the InnerHtml property, and instead let the Html Agility Pack do the ground work for you, with one of HtmlDocument's Save methods.

Here is a generic function to convert an Html text to an XElement instance:

public static XElement HtmlToXElement(string html)
{
    if (html == null)
        throw new ArgumentNullException("html");

    HtmlDocument doc = new HtmlDocument();
    doc.OptionOutputAsXml = true;
    doc.LoadHtml(html);
    using (StringWriter writer = new StringWriter())
    {
        doc.Save(writer);
        using (StringReader reader = new StringReader(writer.ToString()))
        {
            return XElement.Load(reader);
        }
    }
}

As you see, you don't have to do much work by yourself! Please note that since your original input text has no root element, the Html Agility Pack will automatically add one enclosing SPAN to ensure the output is valid Xml.

In your case, you want to additionnally process some tags, so, here is how to do with your exemple:

    public static XElement CleanupHtml(string input)
    {
        if (input == null)
            throw new ArgumentNullException("input");

        HtmlDocument doc = new HtmlDocument();
        doc.OptionOutputAsXml = true;
        doc.LoadHtml(input);

        // extra processing, remove some attributes using DOM
        HtmlNodeCollection coll = doc.DocumentNode.SelectNodes("//br[@class='special_class']");
        if (coll != null)
        {
            foreach (HtmlNode node in coll)
            {
                node.Attributes.Remove("class");
            }
        }

        using (StringWriter writer = new StringWriter())
        {
            doc.Save(writer);
            using (StringReader reader = new StringReader(writer.ToString()))
            {
                return XElement.Load(reader);
            }
        }
    }

As you see, you should not use raw string function, but instead use the Html Agility Pack DOM functions (SelectNodes, Add, Remove, etc...).


Popular Answer

If you check the documentation comments for OptionFixNestedTags you will see the following:

//     Defines if LI, TR, TH, TD tags must be partially fixed when nesting errors
//     are detected. Default is false.

So I don't think this will help you with unclosed HTML p tags. According to an old SO question C# library to clean up html though HTML Tidy might work for this purpose.




Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow
Is this KB legal? Yes, learn why
Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow
Is this KB legal? Yes, learn why