Work-around a StackOverflowException

html-agility-pack stack-overflow

Question

I'm parsing over 200,000 HTML pages using HtmlAgilityPack.

I can't predict what these pages will contain, and one of them makes my application fail with a StackOverflowException. The document contains this HTML:

<ol>
    <li><li><li><li><li><li>...
</ol>

There are around ten thousand nested <li> elements. Because of the way HtmlAgilityPack parses HTML, this results in a StackOverflowException.
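For anyone who wants to reproduce this, a page of the same shape can be generated with a few lines. This is only a sketch: the class and method names are made up for illustration, and the depth of 10,000 is the rough figure quoted above, not an exact count.

```csharp
using System.Text;

// Hypothetical helper that builds an <ol> containing `depth` unclosed <li> tags.
static class NestedListSample
{
    public static string Build(int depth)
    {
        var sb = new StringBuilder("<ol>\n    ");
        for (int i = 0; i < depth; i++)
        {
            sb.Append("<li>"); // each unclosed <li> nests inside the previous one
        }
        sb.Append("\n</ol>");
        return sb.ToString();
    }
}
```

Calling NestedListSample.Build(10000) gives a string that can be passed to HtmlDocument.LoadHtml to check whether a given stack size or HtmlAgilityPack version copes with it.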

Unfortunately, StackOverflowExceptions in .NET 2.0 and later cannot be caught.

I considered increasing the stack size for the thread, but doing so would make my application use a lot more memory (around 50 threads are started for HTML processing, so all of them would have the higher stack size), and it would need manual adjustment if the same circumstance arose again.

Are there any other workarounds I could use?

10/1/2012 12:22:36 AM

Accepted Answer

The ideal long-term fix would be to patch HTML Agility Pack so that it uses a heap-allocated stack rather than the call stack, but I lack the resources to take on such a large project. I have temporarily lost access to my CodePlex account, but as soon as I get it restored I'll file an issue report. Note that a deliberately constructed, deeply nested HTML page would force the w3wp.exe process to terminate, so any website that uses HtmlAgilityPack to sanitize user-submitted HTML may be exposed to a denial-of-service attack.
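To illustrate what "a heap-allocated stack rather than the call stack" means, here is a minimal sketch of a tree walk that keeps its pending work in a Stack<T> instead of recursing. The Node type and the MaxDepth helper are hypothetical and are not HtmlAgilityPack code; the point is only that nesting depth then costs heap memory rather than call-stack frames.

```csharp
using System.Collections.Generic;

// Hypothetical minimal tree node; this is not HtmlAgilityPack's HtmlNode.
sealed class Node
{
    public readonly List<Node> Children = new List<Node>();
}

static class IterativeWalk
{
    // Returns the deepest nesting level of the tree without using recursion.
    public static int MaxDepth(Node root)
    {
        int max = 0;
        // Each entry pairs a node with its depth; the stack lives on the heap.
        var pending = new Stack<KeyValuePair<Node, int>>();
        pending.Push(new KeyValuePair<Node, int>(root, 1));
        while (pending.Count > 0)
        {
            KeyValuePair<Node, int> current = pending.Pop();
            if (current.Value > max)
            {
                max = current.Value;
            }
            foreach (Node child in current.Key.Children)
            {
                pending.Push(new KeyValuePair<Node, int>(child, current.Value + 1));
            }
        }
        return max;
    }
}
```

A recursive version of the same walk would need one call-stack frame per nesting level, which is exactly what fails on the ten-thousand-level <li> page.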

In the meantime, I decided that manually overriding the maximum thread stack size was the best course of action. I was wrong earlier when I said that a larger stack size meant every thread would automatically use that much RAM (it seems memory pages are allocated for a thread stack as it grows, not all at once).

I took a copy of the <ol><li> page and ran several tests on it. My program failed when the maximum stack size was less than 2^21 bytes, but a maximum size of 2^22 bytes succeeded. That is 4 MB and, in my opinion, qualifies as an "acceptable" hack... for the time being.
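A sketch of how the larger stack can be requested, using the Thread constructor overload that takes a maxStackSize argument. The LargeStackRunner class and its Run method are names made up for this example, and 2^22 bytes is simply the value that worked in the tests above, not a general recommendation.

```csharp
using System;
using System.Threading;

// Hypothetical helper that runs a delegate on a thread with a larger stack.
static class LargeStackRunner
{
    // 2^22 bytes = 4 MB, the size that succeeded in the tests described above.
    public const int MaxStackSizeBytes = 1 << 22;

    // Runs `work` on a dedicated thread and returns its result. Ordinary
    // exceptions are rethrown (wrapped) on the calling thread. A
    // StackOverflowException still terminates the process; this only raises
    // the limit, it does not make the overflow catchable.
    public static T Run<T>(Func<T> work, int maxStackSize = MaxStackSizeBytes)
    {
        T result = default(T);
        Exception error = null;
        var thread = new Thread(() =>
        {
            try { result = work(); }
            catch (Exception e) { error = e; }
        }, maxStackSize);
        thread.Start();
        thread.Join();
        if (error != null)
        {
            throw new InvalidOperationException("Work failed on the large-stack thread.", error);
        }
        return result;
    }
}
```

The HtmlDocument.LoadHtml call would go inside the delegate passed to Run, so that parsing happens on the thread with the 4 MB stack.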

10/1/2012 1:00:28 AM

Popular Answer

I just fixed a bug that I believe matches what you're describing. The patch has been uploaded to the HAP project website...

(See the patch from 3/8/2012) http://www.codeplex.com/site/users/view/sjdirect

You can also check out further proof of the problem and the outcome here.

https://code.google.com/p/abot/issues/detail?id=77

The actual fix was to add HtmlDocument.OptionMaxNestedChildNodes. Setting it prevents a large number of nested tags from triggering a StackOverflowException. Instead, an ApplicationException is thrown with a message saying that more than X tags are nested in the document, probably because the page did not close its tags properly.

How I'm using HAP after the patch

HtmlDocument hapDoc = new HtmlDocument();
hapDoc.OptionMaxNestedChildNodes = 5000; // This is the option the patch added
string rawContent = GETTHECONTENTHERE; // placeholder: obtain the page's HTML here
try
{
    hapDoc.LoadHtml(rawContent);
}
catch (Exception e)
{
    // Instead of a StackOverflowException you should end up here now
    hapDoc.LoadHtml("");
    _logger.Error(e);
}


Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow