RAM intensive C# process getting slower after several hours

c# html-agility-pack memory-management multithreading performance


I run a C# process (service) on a server responsible for parsing HTML pages continuously. It relies on HTMLAgilityPack. The symptom is that it becomes slower and slower as time goes by.

When I start the process, it handles n pages/s. After a few hours, the speed goes down to around n/2 pages/s. It can go down to n/10 after a few days. The phenomenon has been observed many times and is rather deterministic. Whenever the process is restarted, things are back to normal.

Very importantly: I can run other calculations in the same process and they are not slowed down: I can reach 100% CPU with anything I want at any time. The process itself is not slow. Only HTML parsing slows down.

I could reproduce it with minimal code (actually the behaviour in the original service is a bit more extreme but still this piece of code reproduces the behaviour):

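using System; using System.Diagnostics; using System.IO; using System.Threading.Tasks; using HtmlAgilityPack;

public static void Main(string[] args) {
    string url = "https://en.wikipedia.org/wiki/History_of_Texas_A%26M_University";
    string html = new HtmlWeb().Load(url).DocumentNode.OuterHtml;
    while (true) {
        // Time one batch of 10,000 parses of the same page (the stopwatch must be started).
        Stopwatch sw = Stopwatch.StartNew();
        Parallel.For(0, 10000, i => new HtmlDocument().LoadHtml(html));
        using (var writer = File.AppendText("c:\\parsing.log")) {
            // Log the timestamp and the elapsed seconds for this batch.
            string text = DateTime.Now.ToString() + ";" + (int) sw.Elapsed.TotalSeconds;
            writer.WriteLine(text);
        }
    }
}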

With this minimal code, the graph shows the speed (pages per second) as a function of the number of hours elapsed since the process was started:

[Figure: parsing speed (pages per second) plotted against hours elapsed since the process was started]

All the obvious causes have been ruled out:

  • the HTML pages are bigger or different (in the minimal code it's the same page)
  • full RAM: the process uses around 500 MB out of 32 GB
  • other processes use CPU or RAM

It could be something about RAM and memory allocation. I know that HTMLAgilityPack makes a lot of small object allocations (HTML nodes and strings). It is clear that memory allocation and multi-threading do not work well together, but I don't understand how the process can become slower and slower.

Do you know of anything in the CLR or Windows that could cause RAM-intensive processing (many allocations) to become slower and slower? For example, something that penalizes threads doing memory allocations in a certain way?

5/15/2018 5:42:17 PM

Accepted Answer

I have noticed similar behaviour using the HTMLAgilityPack.

I have found that when code uses yield to return data, local variables start to leak onto the compiler-generated classes, and that starts to cause problems. As no code is available, here's my first aid kit.

  1. Make sure you set the right GC strategy: changing the garbage collection mode in app.config will help with fragmentation.
  2. Null things out as soon as you no longer need them. Do not wait for the scope to clean up your memory: the body of an IEnumerable method runs as the caller iterates, so method-scoped variables can live far longer than you think! Open your code in ILSpy and look at the compiler-generated classes (named along the lines of <MethodName>d__0). You will see generated statements like d__.X=X; in this case X could hold a fragment or a whole page.
  3. Your local variables are hoisted to the heap, because otherwise they could not be accessed across the IEnumerable iterations.
  4. Locking starts becoming an issue: large items bleed into the large object heap (LOH), which is only collected together with generation 2, and they start blocking the GC. The GC pauses your threads in order to perform garbage collection.
  5. The worst thing about HTMLAgilityPack is that it fragments memory, which ends up being a real issue.
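To check points 1 and 4 from inside the process, one option is to log which GC mode is actually active and how often each generation has been collected. This is a minimal sketch, not part of the original service: the class name GcReport and the method Describe are made up for illustration. In a .NET Framework app.config, the GC mode is normally set with the gcServer and gcConcurrent elements under runtime; which combination helps in this case is something you would have to measure.

```csharp
using System;
using System.Runtime;

// Hypothetical helper (not from the original service): summarizes GC settings and activity.
public static class GcReport
{
    // Builds a one-line summary suitable for appending to the parsing log.
    public static string Describe()
    {
        return "serverGC=" + GCSettings.IsServerGC          // true if server GC is active
             + ";latency=" + GCSettings.LatencyMode         // e.g. Batch or Interactive
             + ";gen0=" + GC.CollectionCount(0)             // collections so far, per generation
             + ";gen1=" + GC.CollectionCount(1)
             + ";gen2=" + GC.CollectionCount(2)
             + ";heapBytes=" + GC.GetTotalMemory(false);    // approximate managed heap size
    }

    public static void Main()
    {
        Console.WriteLine(Describe());
    }
}
```

Writing this line next to each throughput measurement would show whether the slowdown coincides with a rising number of generation 2 collections or a growing heap.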

    I am quite sure that once you start to consider the scope of your HTML fragments, you will find that things start going well. Have a look at your process using WinDbg with the SOS extension: make a dump of your memory and have a look.
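The hoisting described in points 2 and 3 can be illustrated with a small sketch. It does not use HtmlAgilityPack: a large char[] stands in for an HTML fragment, and every name here (HoistingDemo, KeepsPage, ReleasesPage, PageSurvives) is invented for the example. A WeakReference is used only to observe whether the buffer is still reachable while the iterator is suspended at its first yield.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical demo (not from the original service) of iterator locals living on the heap.
public static class HoistingDemo
{
    // 'page' is used after the yield, so the compiler hoists it into a field
    // of the generated iterator class; it stays reachable while the iterator does.
    public static IEnumerable<WeakReference> KeepsPage()
    {
        char[] page = new char[1000000];          // stand-in for a parsed page
        yield return new WeakReference(page);
        Console.WriteLine(page.Length);           // use after the yield
    }

    // Same allocation, but the reference is dropped before the iterator suspends.
    public static IEnumerable<WeakReference> ReleasesPage()
    {
        char[] page = new char[1000000];
        WeakReference tracker = new WeakReference(page);
        page = null;                              // null it as soon as it is not needed
        yield return tracker;
    }

    // Advances to the first yield, forces a collection, and reports
    // whether the buffer is still alive while the iterator is suspended.
    public static bool PageSurvives(IEnumerable<WeakReference> source)
    {
        IEnumerator<WeakReference> it = source.GetEnumerator();
        it.MoveNext();
        WeakReference tracker = it.Current;
        GC.Collect();
        GC.WaitForPendingFinalizers();
        GC.Collect();
        bool alive = tracker.IsAlive;
        GC.KeepAlive(it);                         // keep the suspended iterator reachable
        return alive;
    }

    public static void Main()
    {
        Console.WriteLine("hoisted local kept alive: " + PageSurvives(KeepsPage()));
        Console.WriteLine("nulled local kept alive:  " + PageSurvives(ReleasesPage()));
    }
}
```

In the first method the buffer cannot be collected for as long as a caller holds the enumerator, however slowly it iterates. In the second, nothing in the suspended iterator refers to the buffer any more, so the GC is free to reclaim it.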

How to do that.

  1. Open WinDbg, press F6 and attach to the process (enter the process ID in the field and press OK).
  2. Then load the SOS extension by entering

    .loadby sos clr
  3. Then enter

    !dumpheap -stat

You then get the objects allocated in your application, grouped by type, with the method table (MT) address, the object count and the total size for each type, sorted from smallest to largest total size. You will see something like System.String[] with a massive number in front of it; that's the stuff you would like to investigate first.

Now, to see which objects are of that type, you can type

!dumpheap -mt <MT address>

And you will see the addresses of the objects that have that method table (MT) and the amount of RAM each one uses.

Now it becomes interesting: rather than going through hundreds of lines of code, you type

!gcroot <address>

What it prints is the chain of references that keeps the object alive, from the root (a thread's stack, a static or a handle) down to the object itself, including the compiler-generated class and the field causing you grief.

This is what one could call "production debugging" and works if you have access to the server, which I guess you have.

8/10/2019 10:21:58 PM

Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow