HtmlAgilityPack giving problems with malformed html

c# html-agility-pack

Question

I want to extract meaningful text out of an html document and I was using html-agility-pack for the same. Here is my code:

string convertedContent = HttpUtility.HtmlDecode(
    ConvertHtml(HtmlAgilityPack.HtmlEntity.DeEntitize(htmlAsString))
);

ConvertHtml:

public string ConvertHtml(string html)
{
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html);

    StringWriter sw = new StringWriter();
    ConvertTo(doc.DocumentNode, sw);
    sw.Flush();
    return sw.ToString();
}

ConvertTo:

public void ConvertTo(HtmlAgilityPack.HtmlNode node, TextWriter outText)
{
    string html;
    switch (node.NodeType)
    {
        case HtmlAgilityPack.HtmlNodeType.Comment:
            // don't output comments
            break;

        case HtmlAgilityPack.HtmlNodeType.Document:
            foreach (HtmlNode subnode in node.ChildNodes)
            {
              ConvertTo(subnode, outText);
            }
            break;

        case HtmlAgilityPack.HtmlNodeType.Text:
            // script and style must not be output
            string parentName = node.ParentNode.Name;
            if ((parentName == "script") || (parentName == "style"))
                break;

            // get text
            html = ((HtmlTextNode)node).Text;

            // is it in fact a special closing node output as text?
            if (HtmlNode.IsOverlappedClosingElement(html))
                break;

            // check the text is meaningful and not a bunch of whitespaces
            if (html.Trim().Length > 0)
            {
                outText.Write(HtmlEntity.DeEntitize(html) + " ");
            }
            break;

        case HtmlAgilityPack.HtmlNodeType.Element:
            switch (node.Name)
            {
                case "p":
                    // treat paragraphs as crlf
                    outText.Write("\r\n");
                    break;
            }

            if (node.HasChildNodes)
            {
            foreach (HtmlNode subnode in node.ChildNodes)
             {
              ConvertTo(subnode, outText);
             }
            }
            break;
    }
}

Now in some cases when the html pages are malformed (for example the following page - http://rareseeds.com/cart/products/Purple_of_Romagna_Artichoke-646-72.html has a malformed meta-tag like <meta content="text/html; charset=uft-8" http-equiv="Content-Type">) [Note "uft" instead of utf] my code is puking at the time I am trying to load the html document.

Can someone suggest me how can I overcome these malformed html pages and still extract relevant text out of a html document?

Thanks, Kapil

Accepted Answer

As it is said in the HtmlAgilityPack project page "The parser is very tolerant with 'real world' malformed HTML". But the kind of error you describe is too serious maybe to be corrected. You can set the default encoding with:

 HtmlDocument doc = new HtmlDocument();
 doc.OptionDefaultStreamEncoding = Encoding.UTF8;


Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow
Is this KB legal? Yes, learn why
Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow
Is this KB legal? Yes, learn why