HtmlAgilityPack has issues with malformed HTML

c# html-agility-pack

Question

I am using the HTML Agility Pack to extract relevant content from an HTML page. This is my code:

string convertedContent = HttpUtility.HtmlDecode(
    ConvertHtml(HtmlAgilityPack.HtmlEntity.DeEntitize(htmlAsString))
);

ConvertHtml:

public string ConvertHtml(string html)
{
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html);

    StringWriter sw = new StringWriter();
    ConvertTo(doc.DocumentNode, sw);
    sw.Flush();
    return sw.ToString();
}

ConvertTo:

public void ConvertTo(HtmlAgilityPack.HtmlNode node, TextWriter outText)
{
    string html;
    switch (node.NodeType)
    {
        case HtmlAgilityPack.HtmlNodeType.Comment:
            // don't output comments
            break;

        case HtmlAgilityPack.HtmlNodeType.Document:
            foreach (HtmlNode subnode in node.ChildNodes)
            {
                ConvertTo(subnode, outText);
            }
            break;

        case HtmlAgilityPack.HtmlNodeType.Text:
            // script and style must not be output
            string parentName = node.ParentNode.Name;
            if ((parentName == "script") || (parentName == "style"))
                break;

            // get text
            html = ((HtmlTextNode)node).Text;

            // is it in fact a special closing node output as text?
            if (HtmlNode.IsOverlappedClosingElement(html))
                break;

            // check the text is meaningful and not a bunch of whitespaces
            if (html.Trim().Length > 0)
            {
                outText.Write(HtmlEntity.DeEntitize(html) + " ");
            }
            break;

        case HtmlAgilityPack.HtmlNodeType.Element:
            switch (node.Name)
            {
                case "p":
                    // treat paragraphs as crlf
                    outText.Write("\r\n");
                    break;
            }

            if (node.HasChildNodes)
            {
                foreach (HtmlNode subnode in node.ChildNodes)
                {
                    ConvertTo(subnode, outText);
                }
            }
            break;
    }
}

Occasionally, HTML pages are improperly formatted. For instance, the page http://rareseeds.com/cart/products/Purple_of_Romagna_Artichoke-646-72.html includes an invalid meta tag (note the use of "uft" rather than "utf"):

<meta content="text/html; charset=uft-8" http-equiv="Content-Type">

When I attempt to load the HTML content of such a page, my code throws an exception.

Can someone give me any advice on how to get around these ill-formed HTML pages and still extract useful content from them?

Thanks, Kapil

2/24/2012 9:03:30 AM

Accepted Answer

According to the project website for the HTML Agility Pack, "The parser is quite patient with 'real world' incorrect HTML." However, the sort of problem you describe may be too severe for the parser to recover from on its own. You can set the default encoding with:

 HtmlDocument doc = new HtmlDocument();
 doc.OptionDefaultStreamEncoding = Encoding.UTF8;
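
If you have the raw bytes of the page, another option is to decode them yourself and hand the parser a plain string, so a misspelled charset never has to be resolved during loading. The following is only a minimal sketch using the .NET base class library: the class name `EncodingFallback` and the methods `Resolve` and `Decode` are hypothetical helpers, and falling back to UTF-8 is an assumption that suits this page but not necessarily every page.

```csharp
using System;
using System.Text;

// Hypothetical helper class, not part of the HTML Agility Pack.
public static class EncodingFallback
{
    // Resolve a declared charset name, returning the fallback when .NET
    // does not recognise the name (for example the misspelled "uft-8").
    public static Encoding Resolve(string charsetName, Encoding fallback)
    {
        if (string.IsNullOrWhiteSpace(charsetName))
            return fallback;

        try
        {
            return Encoding.GetEncoding(charsetName.Trim());
        }
        catch (ArgumentException)
        {
            // Encoding.GetEncoding throws ArgumentException for unknown names.
            return fallback;
        }
    }

    // Decode the raw page bytes up front so that only a string is parsed later.
    // Assumption: UTF-8 is an acceptable fallback for the pages being processed.
    public static string Decode(byte[] rawBytes, string declaredCharset)
    {
        Encoding encoding = Resolve(declaredCharset, Encoding.UTF8);
        return encoding.GetString(rawBytes);
    }

    public static void Main()
    {
        byte[] raw = Encoding.UTF8.GetBytes("<p>Purple of Romagna Artichoke</p>");
        string html = Decode(raw, "uft-8"); // bad name, falls back to UTF-8
        Console.WriteLine(html);
    }
}
```

The decoded string can then be passed to `doc.LoadHtml(html)` as in your `ConvertHtml` method. If the exception is raised by the parser itself while it reads the meta tag, setting `doc.OptionReadEncoding = false` before loading may avoid it; that is a guess about the cause, since the question does not show the stack trace.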
5/31/2010 3:03:55 PM


Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow