RAM intensive C# process getting slower after several hours

c# html-agility-pack memory-management multithreading performance

Question

I run a C# process (service) on a server responsible for parsing HTML pages continuously. It relies on HTMLAgilityPack. The symptom is that it becomes slower and slower as time goes by.

When I start the process, it handles n pages/s. After a few hours, the speed drops to around n/2 pages/s, and it can fall to n/10 after a few days. The phenomenon has been observed many times and is fairly deterministic. Whenever the process is restarted, things are back to normal.

Very importantly: I can run other calculations in the same process and they are not slowed down: I can reach 100% CPU with anything I want at any time. The process itself is not slow. Only HTML parsing slows down.

I can reproduce it with minimal code (the behaviour in the original service is a bit more extreme, but this snippet reproduces it):

using System;
using System.Diagnostics;
using System.IO;
using System.Threading.Tasks;
using HtmlAgilityPack;

class Program {
    public static void Main(string[] args) {
        // Fetch a single fixed page; every iteration re-parses the same HTML
        string url = "https://en.wikipedia.org/wiki/History_of_Texas_A%26M_University";
        string html = new HtmlWeb().Load(url).DocumentNode.OuterHtml;
        while (true) {
            // Processing: parse the page 10,000 times in parallel
            Stopwatch sw = new Stopwatch();
            sw.Start();
            Parallel.For(0, 10000, i => new HtmlDocument().LoadHtml(html));
            sw.Stop();
            // Logging: append a timestamp and the elapsed seconds per batch
            using (var writer = File.AppendText("c:\\parsing.log")) {
                string text = DateTime.Now.ToString() + ";" + (int) sw.Elapsed.TotalSeconds;
                writer.WriteLine(text);
                Console.WriteLine(text);
            }
        }
    }
}

With this minimal code, here is the speed (pages per second) as a function of the number of hours elapsed since the process was started:

[Graph: parsing speed in pages per second, declining steadily over the hours since the process started]

All the obvious causes have been ruled out:

  • the HTML pages getting bigger or changing (in the minimal code it is always the same page)
  • running out of RAM: the process uses around 500 MB out of 32 GB
  • other processes competing for CPU or RAM

It could be something about RAM and memory allocation. I know that HtmlAgilityPack performs a lot of small object allocations (HTML nodes and strings), and allocation-heavy code is known to interact badly with multithreading (for example through allocator contention and more frequent garbage collections). But I don't understand how that would make the process slower and slower.

Do you know of anything in the CLR or Windows that could cause RAM-intensive (allocation-heavy) processing to become slower and slower over time? For example, something that penalizes threads doing memory allocations in a certain way?
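
One way to check whether garbage collection correlates with the slowdown is to log GC statistics next to the timings. Below is a minimal diagnostic sketch (the GcProbe helper is hypothetical, not part of the original code); it could be called after each Parallel.For batch:

using System;
using System.Runtime;

static class GcProbe {
    // Log GC mode, cumulative collection counts, and approximate heap size
    public static void Report(double elapsedSeconds) {
        Console.WriteLine(
            "{0:F1}s ServerGC={1} Gen0={2} Gen1={3} Gen2={4} Heap={5:N0}B",
            elapsedSeconds,
            GCSettings.IsServerGC,     // false means workstation GC (the default)
            GC.CollectionCount(0),
            GC.CollectionCount(1),
            GC.CollectionCount(2),
            GC.GetTotalMemory(false)); // heap size without forcing a collection
    }
}

If the Gen2 count or the heap size climbs in step with the slowdown, the GC hypothesis becomes much more concrete.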

Accepted Answer

I have noticed similar behaviour using HtmlAgilityPack.

I have found that when one yields data, local variables are space-leaked via the compiler-generated classes, and that starts to cause problems. Since no code is available, here's my first-aid kit:

  1. Make sure you set the right strategy: changing the GC collection strategy in app.config can help with fragmentation (see the app.config sketch after this list).
  2. Null things as soon as you no longer need them; do not wait for the scope to clean up your memory. IEnumerables are evaluated in the calling method, so method variables can live far longer than you think. Open your code in ILSpy and look at the compiler-generated <>d__0 classes: you will see generated assignments like d__.X = X, where X could hold an HTML fragment or a whole page (see the iterator sketch after this list).
  3. Your local variables are hoisted to the heap because the IEnumerable iterations could not access them otherwise.
  4. Locking starts to become an issue: large objects bleed into the higher GC generations and the large object heap, and they eventually start blocking the GC, which pauses your threads in order to perform garbage collection.
  5. The worst thing about HtmlAgilityPack is that it fragments the heap, and that ends up being the real issue.
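
For item 1, a minimal app.config sketch for a .NET Framework service (server GC gives each core its own heap segment, which reduces contention between allocating threads):

<configuration>
  <runtime>
    <!-- One GC heap per core; reduces cross-thread allocation contention -->
    <gcServer enabled="true"/>
    <!-- Perform most collection work concurrently with the application -->
    <gcConcurrent enabled="true"/>
  </runtime>
</configuration>

For items 2 and 3, a sketch of how an iterator can keep a parsed document alive (the NodeSizes method and its names are illustrative, not from the original code):

using System.Collections.Generic;
using HtmlAgilityPack;

static class IteratorExample {
    static IEnumerable<int> NodeSizes(string html) {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        // 'doc' is hoisted onto the compiler-generated state machine, so the
        // whole parsed document stays reachable for as long as the caller
        // holds the enumerator.
        foreach (HtmlNode node in doc.DocumentNode.Descendants())
            yield return node.OuterHtml.Length;
        doc = null; // release the reference as soon as it is no longer needed
    }
}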

    I am quite sure that once you start considering the scope of your HTML fragments, you will find that things start going well. Have a look at your execution using WinDbg with SOS: make a dump of your memory and have a look.

How to do that.

  1. open WinDbg, press F6 and attach to the process (enter the process ID in the field and press OK)
  2. then load the SOS debugging extension for the CLR by entering

    .loadby sos clr

  3. then enter

    !dumpheap -stat

You then get the objects allocated in your application, grouped by type, with the method table address, the object count and the total size, sorted from smallest total size to largest. Near the bottom you will see something like System.String[] with a massive number in front of it; that's the stuff you want to investigate first.

Now, to see who holds that memory, you can type

!dumpheap -mt <MT address>

and you will see the addresses of the objects that share that method table (MT) and the amount of RAM each one uses.

Now it becomes interesting: rather than going through hundreds of lines of code, you type

!gcroot <object address>

and it will print the chain of references that keeps that object alive, including the compiler-generated class and the variable causing you grief, as well as the bytes it holds.
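
Putting the steps together, a typical session looks like this (the MT and object addresses are placeholders, to be replaced with values from your own dump):

0:000> .loadby sos clr
0:000> !dumpheap -stat
0:000> !dumpheap -mt 00007ff8a1b2c3d4
0:000> !gcroot 000001d2e4f5a6b8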

This is what one could call "production debugging", and it works if you have access to the server, which I guess you have.

Hope to have been of help,

Walter



Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow