Error parsing HTML with HTML Agility Pack and returning XElement

.net-3.5 c# html-agility-pack html-parsing

Question

The input can be parsed and I do get output, but the result cannot be converted into an XElement because of one unclosed p tag, even though the rest of the string is processed correctly.

My input:

var input = "<p> Not sure why is is null for some wierd reason!<br><br>I have implemented the auto save feature, but does it really work after 100s?<br></p> <p> <i>Autosave?? </i> </p> <p>we are talking...</p><p></p><hr><p><br class=\"GENTICS_ephemera\"></p>";

My code:

public static XElement CleanupHtml(string input)
    {  


    HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();

    htmlDoc.OptionOutputAsXml = true;
    //htmlDoc.OptionWriteEmptyNodes = true;             
    //htmlDoc.OptionAutoCloseOnEnd = true;
    htmlDoc.OptionFixNestedTags = true;

    htmlDoc.LoadHtml(input);

    // ParseErrors is an ArrayList containing any errors from the Load statement
    if (htmlDoc.ParseErrors != null && htmlDoc.ParseErrors.Count() > 0)
    {

    }
    else
    {

        if (htmlDoc.DocumentNode != null)
        {
            var ndoc = new HtmlDocument(); // HTML doc instance
            HtmlNode p = ndoc.CreateElement("body");  

            p.InnerHtml = htmlDoc.DocumentNode.InnerHtml;
            var result = p.OuterHtml.Replace("<br>", "<br/>");
            result = result.Replace("<br class=\"special_class\">", "<br/>");
            result = result.Replace("<hr>", "<hr/>");
            return XElement.Parse(result, LoadOptions.PreserveWhitespace);
        }
    }
    return new XElement("body");

}

My output:

<body>
   <p> Not sure why is is null for some wierd reason chappy!
   <br/>
   <br/>I have implemented the auto save feature, but does it really work after 100s?
   <br/>
   </p> 
   <p> 
   <i>Autosave?? </i> 
   </p> 
   <p>we are talking...</p>
   **<p>**
   <hr/>
   <p>
   <br/>
   </p>
</body>

The p tag in bold is the one that was not output properly. Is there a way to get around this? Am I using the code incorrectly?

1
6
3/17/2011 4:32:16 PM

Accepted Answer

In essence, what you're trying to do is convert an HTML input into an XML output.

The HTML Agility Pack can do that when you set OptionOutputAsXml. In this case, though, you should not use the InnerHtml property. Instead, let the HTML Agility Pack do the groundwork for you by using one of HtmlDocument's Save methods.

Here is a general method to change a piece of HTML text into an instance of an XElement:

public static XElement HtmlToXElement(string html)
{
    if (html == null)
        throw new ArgumentNullException("html");

    HtmlDocument doc = new HtmlDocument();
    doc.OptionOutputAsXml = true;
    doc.LoadHtml(html);
    using (StringWriter writer = new StringWriter())
    {
        doc.Save(writer);
        using (StringReader reader = new StringReader(writer.ToString()))
        {
            return XElement.Load(reader);
        }
    }
}

As you can see, you don't have to do much work yourself. Please be aware that since your original input text has no root element, the HTML Agility Pack will automatically add an enclosing SPAN element to guarantee that the XML output is valid.
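If a body root is wanted instead of the automatically added SPAN, as in the question, the children can be copied into a new element after loading. The following is only a sketch: RewrapAsBody is a hypothetical helper, not part of the HTML Agility Pack, and the literal string in Main is a stand-in for whatever HtmlToXElement actually returns.

```csharp
using System;
using System.Xml.Linq;

public static class BodyRewrapSketch
{
    // Hypothetical helper (not from the answer): copies the children of the
    // parsed root into a new <body> element, dropping the wrapper itself.
    public static XElement RewrapAsBody(XElement root)
    {
        if (root == null)
            throw new ArgumentNullException("root");

        // Nodes that already have a parent are cloned when added to a new
        // element, so the original tree is left untouched.
        return new XElement("body", root.Nodes());
    }

    public static void Main()
    {
        // Stand-in for the XElement returned by HtmlToXElement; the exact
        // shape of that element is an assumption here.
        XElement span = XElement.Parse("<span><p>we are talking...</p><hr /></span>");
        XElement body = RewrapAsBody(span);
        Console.WriteLine(body.ToString(SaveOptions.DisableFormatting));
    }
}
```

Only System.Xml.Linq is involved in this step, so it works the same way regardless of how the XElement was produced.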

In your case you also want to post-process certain tags, so here is how to do that with your example:

    public static XElement CleanupHtml(string input)
    {
        if (input == null)
            throw new ArgumentNullException("input");

        HtmlDocument doc = new HtmlDocument();
        doc.OptionOutputAsXml = true;
        doc.LoadHtml(input);

        // extra processing, remove some attributes using DOM
        HtmlNodeCollection coll = doc.DocumentNode.SelectNodes("//br[@class='special_class']");
        if (coll != null)
        {
            foreach (HtmlNode node in coll)
            {
                node.Attributes.Remove("class");
            }
        }

        using (StringWriter writer = new StringWriter())
        {
            doc.Save(writer);
            using (StringReader reader = new StringReader(writer.ToString()))
            {
                return XElement.Load(reader);
            }
        }
    }

As you can see, you should not use raw string functions; use the HTML Agility Pack DOM functions instead (SelectNodes, Add, Remove, etc.).
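Note that the br in the sample input carries the class GENTICS_ephemera, so a selector for special_class would not match it. As a hedged variation on the same DOM approach, the sketch below strips the class attribute from every br that has one. StripBrClasses is a hypothetical helper name; it uses only the HTML Agility Pack members already shown above (SelectNodes and Attributes.Remove) and assumes the HtmlAgilityPack assembly is referenced.

```csharp
using System;
using HtmlAgilityPack; // third-party library, the same one used in the answer

public static class BrCleanupSketch
{
    // Hypothetical helper: removes the class attribute from every <br> that
    // has one and returns how many elements were touched.
    public static int StripBrClasses(HtmlDocument doc)
    {
        if (doc == null)
            throw new ArgumentNullException("doc");

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection brs = doc.DocumentNode.SelectNodes("//br[@class]");
        if (brs == null)
            return 0;

        foreach (HtmlNode br in brs)
        {
            br.Attributes.Remove("class");
        }
        return brs.Count;
    }
}
```

Calling it between LoadHtml and Save in CleanupHtml would take the place of the special_class loop.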

9
3/18/2011 8:04:23 AM

Popular Answer

If you look at the documentation comments for OptionFixNestedTags, you'll see the following:

//     Defines if LI, TR, TH, TD tags must be partially fixed when nesting errors
//     are detected. Default is false.

Therefore, I doubt that it will help you with unclosed HTML p tags. According to an earlier SO question, "C# library for HTML cleanup", HTML Clean might work for that purpose, though.
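Whichever cleanup route is taken, it can help to check whether the resulting markup is well-formed before relying on it. The following is a minimal sketch using only System.Xml.Linq; TryParseXml is a hypothetical helper, and the test string merely mimics the shape of the failing output in the question (a p that is never closed).

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

public static class WellFormedCheckSketch
{
    // Hypothetical helper: returns true and the parsed element when the markup
    // is well-formed XML; otherwise returns false and the parser's message.
    public static bool TryParseXml(string markup, out XElement element, out string error)
    {
        try
        {
            element = XElement.Parse(markup, LoadOptions.PreserveWhitespace);
            error = null;
            return true;
        }
        catch (XmlException ex)
        {
            element = null;
            error = ex.Message;
            return false;
        }
    }

    public static void Main()
    {
        XElement parsed;
        string error;

        // A <p> that is opened but never closed, as in the question's output.
        bool ok = TryParseXml("<body><p>we are talking...</p><p><hr/></body>", out parsed, out error);
        Console.WriteLine(ok ? "well-formed" : "not well-formed: " + error);
    }
}
```

The exception message includes the line and position of the mismatch, which makes it easier to locate the tag that was left open.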




Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow