How to prevent a StackOverflowException in HtmlAgilityPack for very bad HTML

c# html-agility-pack stack-overflow

Question

I'm using HTML Agility Pack (HAP) in an MVC 5 Web API. In 99.99% of cases, websites load without a hitch and I can parse them to get the content I want. My API is called several hundred thousand times a day without problems, and it has handled more than 2 million hits in a single day.

However, poorly built websites occasionally cause a 500 error response. After that, the whole site becomes unresponsive and every subsequent request also returns a 500 error. The only remedy is to restart the web application. The site is hosted on Windows Azure. I've tried large instances with load balancing, but once the CPU climbs, it stays high. Previously this ran without problems on a single Medium Azure instance (2 cores, 3.5 GB RAM).

The error is a StackOverflowException, which I know I cannot catch.

Note that this code does NOT crash a console application:

HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://nursingandmidwiferycareersni.com/");
Console.Write(doc.DocumentNode.InnerText);

But it reliably crashes an MVC web project.

Using a site such as http://nursingandmidwiferycareersni.com/, I was able to reproduce the stack overflow in a simple MVC web application. Even https://validator.w3.org returns an Internal Server Error if you enter http://nursingandmidwiferycareersni.com/ into it!

For now I am just using the NuGet package, but I will modify the HAP source code if that is what it takes to get past this.

Is it possible to prevent the stack overflow from occurring in HAP?
Alternatively, is there a way to detect bad HTML and avoid the crash before it happens?

5/27/2015 4:40:21 PM

Popular Answer

Try something like this, where the ParseHtml method and the ParsedHtml type are placeholders for you to fill in:

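using System;
using System.Threading;
using System.Threading.Tasks;

public async Task<ParsedHtml> TryParseHtml(
    string untrustedHtml,
    CancellationToken cancellationToken)
{
    var tcs = new TaskCompletionSource<ParsedHtml>();

    // Parse on a dedicated thread so that a runaway parse can be abandoned.
    var thread = new Thread(() =>
    {
        try
        {
            ParsedHtml result = ParseHtml(untrustedHtml);
            tcs.TrySetResult(result);
        }
        catch (Exception ex)
        {
            // An unhandled exception on this thread would take down the process,
            // so hand it to the awaiting caller instead.
            tcs.TrySetException(ex);
        }
    });
    thread.IsBackground = true;
    thread.Start();

    // Cancel the task when the token fires, for example after a timeout.
    using (cancellationToken.Register(() => tcs.TrySetCanceled()))
    {
        try
        {
            return await tcs.Task;
        }
        catch (OperationCanceledException)
        {
            // Thread.Abort works on .NET Framework; on .NET Core and later
            // it throws PlatformNotSupportedException.
            thread.Abort();
            throw;
        }
    }
}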
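
One possible reason the console program survives while the web application crashes is stack size: threads in an IIS-hosted process may get a smaller stack than a console program's main thread. That is an assumption, not something the question confirms. Since the approach above already creates its own thread, the Thread constructor overload that takes a maxStackSize argument could give the parser more headroom. A minimal sketch, where the LargeStack helper and the 16 MB figure in the usage note are hypothetical:

```csharp
using System;
using System.Runtime.ExceptionServices;
using System.Threading;

public static class LargeStack
{
    // Hypothetical helper: runs work on a dedicated thread created with the
    // requested maximum stack size and blocks until it finishes.
    public static T Run<T>(Func<T> work, int maxStackSizeBytes)
    {
        T result = default(T);
        ExceptionDispatchInfo error = null;

        var thread = new Thread(() =>
        {
            try
            {
                result = work();
            }
            catch (Exception ex)
            {
                // Capture the exception so it can be rethrown on the calling thread.
                error = ExceptionDispatchInfo.Capture(ex);
            }
        }, maxStackSizeBytes);

        thread.IsBackground = true;
        thread.Start();
        thread.Join();

        if (error != null)
        {
            error.Throw();
        }
        return result;
    }
}
```

It could be called as LargeStack.Run(() => ParseHtml(untrustedHtml), 16 * 1024 * 1024). A larger stack only raises the recursion limit; it does not make a StackOverflowException catchable, so sufficiently deep markup can still crash the process.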

The idea could be made more efficient by reusing threads in the successful case instead of starting and stopping a thread for every HTML page.
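
As a rough illustration of reusing a thread, the sketch below keeps one long-lived worker that takes parse jobs from a queue. ParseWorker and Enqueue are hypothetical names, and this is only one way the reuse might look, not the answer author's implementation:

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper: a single background thread runs all queued jobs in order.
public sealed class ParseWorker : IDisposable
{
    private readonly BlockingCollection<Action> _jobs = new BlockingCollection<Action>();
    private readonly Thread _thread;

    public ParseWorker()
    {
        _thread = new Thread(Run) { IsBackground = true };
        _thread.Start();
    }

    private void Run()
    {
        // Blocks while the queue is empty; ends once CompleteAdding is called
        // and the remaining jobs have run.
        foreach (Action job in _jobs.GetConsumingEnumerable())
        {
            job();
        }
    }

    public Task<T> Enqueue<T>(Func<T> work)
    {
        // RunContinuationsAsynchronously (.NET Framework 4.6 or later) keeps
        // the caller's continuation from running on the worker thread.
        var tcs = new TaskCompletionSource<T>(
            TaskCreationOptions.RunContinuationsAsynchronously);

        _jobs.Add(() =>
        {
            try
            {
                tcs.TrySetResult(work());
            }
            catch (Exception ex)
            {
                tcs.TrySetException(ex);
            }
        });
        return tcs.Task;
    }

    public void Dispose()
    {
        _jobs.CompleteAdding();
        _thread.Join();
        _jobs.Dispose();
    }
}
```

A caller would write something like await worker.Enqueue(() => ParseHtml(untrustedHtml)). The trade-off is that a shared worker cannot be aborted for one page without also losing the thread, so a job that times out would need the worker to be replaced.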

5/27/2015 5:52:53 PM


Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow