How to prevent a StackOverflowException in HtmlAgilityPack for very bad HTML

c# html-agility-pack stack-overflow

Question

I am using HtmlAgilityPack in an MVC 5 Web API. 99.99% of the time there are no problems: sites load and I parse them to extract the text I want. My API can be hit several hundred thousand times a day without issues, and it has happily handled over 2 million hits in 24 hours in the past.

Occasionally, however, terribly formed websites cause a 500 error response. After that, all subsequent requests get 500 errors and the site becomes completely unusable. The only solution in this scenario is to restart the web application. The site is hosted on Windows Azure. I have used load-balanced Large instances, and once the CPU spikes it stays high. In the past, this has run fine on a single Medium Azure instance (2 cores/3.5 GB RAM).

The error is a StackOverflowException, which I know I cannot catch.

Note that this code does NOT crash a console application:

HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://nursingandmidwiferycareersni.com/");            
Console.Write(doc.DocumentNode.InnerText);

...but it definitely will crash an MVC web app.

In a simple MVC web application, I can reproduce the stack overflow with a site such as http://nursingandmidwiferycareersni.com/. If you put that URL into https://validator.w3.org, you will even get an Internal Server Error from validator.w3.org!

I will patch the HAP source code if necessary to get around this. At present I am just using the NuGet package.

Is it possible to prevent the stack overflow from happening in HAP?
Or is there a way of checking for awful HTML and preventing the crash from occurring in the first place?

Popular Answer

Give something like this a try, where the ParseHtml method and the ParsedHtml type are just placeholders for you to fill in:

public async Task<ParsedHtml> TryParseHtml(
    string untrustedHtml,
    CancellationToken cancellationToken)
{
    var tcs = new TaskCompletionSource<ParsedHtml>();

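    // Parse on a dedicated thread so that a runaway parse can be aborted.
    var thread = new Thread(() =>
    {
        try
        {
            ParsedHtml result = ParseHtml(untrustedHtml);
            tcs.TrySetResult(result);
        }
        catch (Exception ex)
        {
            // An unhandled exception on this thread would take down the process
            // and leave the awaiting caller hanging, so forward it to the task.
            tcs.TrySetException(ex);
        }
    });
    thread.Start();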

    using (cancellationToken.Register(() => tcs.TrySetCanceled()))
    {
        try
        {
            return await tcs.Task;
        }
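        catch (OperationCanceledException)
        {
            // Thread.Abort works on .NET Framework only; on .NET Core and
            // .NET 5+ it throws PlatformNotSupportedException.
            thread.Abort();
            throw;
        }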
    }
}

The idea could be extended to be more efficient by reusing threads in the successful case, rather than firing up and tearing down a thread for every HTML page.
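
The reuse idea could be sketched roughly as follows. ParserWorker is a hypothetical helper, not part of HtmlAgilityPack or of the answer above: it keeps one long-lived thread that runs queued parse jobs. The maxStackSizeBytes parameter is an extra assumption, added on the guess that the overflow comes from deep recursion on a stack that is smaller under IIS than in a console application. The cancellation and abort handling from the code above is left out to keep the sketch short.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper: one long-lived thread that runs queued parse jobs,
// so a successful parse does not pay for starting and stopping a thread.
public sealed class ParserWorker : IDisposable
{
    private readonly BlockingCollection<Action> _jobs = new BlockingCollection<Action>();
    private readonly Thread _thread;

    // maxStackSizeBytes is an assumption of this sketch; 0 keeps the default stack size.
    public ParserWorker(int maxStackSizeBytes = 0)
    {
        _thread = new Thread(RunJobs, maxStackSizeBytes) { IsBackground = true };
        _thread.Start();
    }

    private void RunJobs()
    {
        // Blocks until a job arrives; the loop ends once CompleteAdding()
        // has been called and the queue is empty.
        foreach (Action job in _jobs.GetConsumingEnumerable())
        {
            job();
        }
    }

    // Queues parse(input) on the worker thread and returns a task for its result.
    public Task<TResult> Run<TResult>(Func<string, TResult> parse, string input)
    {
        // RunContinuationsAsynchronously needs .NET Framework 4.6 or later; it keeps
        // the caller's continuation from running on the worker thread.
        var tcs = new TaskCompletionSource<TResult>(
            TaskCreationOptions.RunContinuationsAsynchronously);
        _jobs.Add(() =>
        {
            try { tcs.TrySetResult(parse(input)); }
            catch (Exception ex) { tcs.TrySetException(ex); }
        });
        return tcs.Task;
    }

    public void Dispose()
    {
        _jobs.CompleteAdding(); // let the worker finish queued jobs and exit
        _thread.Join();
        _jobs.Dispose();
    }
}
```

A caller might create a single new ParserWorker(4 * 1024 * 1024) at startup and then await worker.Run(ParseHtml, untrustedHtml) per request. A larger stack only raises the nesting depth that can be survived; a stack overflow on this thread would still terminate the process, and a single worker serializes all parses, so a busy API would probably want a small pool of them.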




Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow
Is this KB legal? Yes, learn why