I run a C# process (service) on a server responsible for parsing HTML pages continuously. It relies on HTMLAgilityPack. The symptom is that it becomes slower and slower as time goes by.
When I start the process, it handles n pages/s. After a few hours, the speed goes down to around n/2 pages/s, and it can fall to n/10 after a few days. The phenomenon has been observed many times and is fairly deterministic. Restarting the process brings things back to normal every time.
Very importantly: I can run other calculations in the same process and they are not slowed down: I can reach 100% CPU with anything I want at any time. The process itself is not slow. Only HTML parsing slows down.
I could reproduce it with minimal code (the behaviour in the original service is a bit more extreme, but this snippet still shows the slowdown):
using System;
using System.Diagnostics;
using System.IO;
using System.Threading.Tasks;
using HtmlAgilityPack;

public static class Program {
    public static void Main(string[] args) {
        // Fetch one page once; every iteration reparses the same HTML string.
        string url = "https://en.wikipedia.org/wiki/History_of_Texas_A%26M_University";
        string html = new HtmlWeb().Load(url).DocumentNode.OuterHtml;
        while (true) {
            // Processing: parse the page 10,000 times across the thread pool.
            Stopwatch sw = new Stopwatch();
            sw.Start();
            Parallel.For(0, 10000, i => new HtmlDocument().LoadHtml(html));
            sw.Stop();
            // Logging: elapsed seconds per 10,000-page batch.
            using (var writer = File.AppendText("c:\\parsing.log")) {
                string text = DateTime.Now.ToString() + ";" + (int) sw.Elapsed.TotalSeconds;
                writer.WriteLine(text);
                Console.WriteLine(text);
            }
        }
    }
}
With this minimal code, plotting the log gives the speed (pages per second) as a function of the number of hours since the process was started; it shows the same steady decay.
All the obvious causes have been ruled out.
It could be something about RAM and memory allocation. I know that HTMLAgilityPack makes a lot of small object allocations (HTML nodes and strings), and heavy allocation and multi-threading can interact badly (for example through contention in the GC or the allocator). But I don't understand how that would make the process slower and slower over time.
Do you know of anything in the CLR or Windows that could cause RAM-intensive (allocation-heavy) processing to become slower and slower? For example, something that penalizes threads doing many allocations in a certain way?
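For what it is worth, the only CLR-level knob I know of in this area is the GC flavour, which is chosen in app.config. A sketch below (assuming .NET Framework; I have not verified that these settings change anything here):

<!-- Illustrative only: switches the CLR between workstation and server GC
     and toggles concurrent/background collection. -->
<configuration>
  <runtime>
    <gcServer enabled="true" />
    <gcConcurrent enabled="true" />
  </runtime>
</configuration>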
I have noticed similar behaviour using HTMLAgilityPack.
I have found that when one yields data (an iterator with yield return), local variables get lifted onto the compiler-generated state-machine class and can be kept alive far longer than expected, which starts to cause problems.
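A minimal sketch of the effect (the iterator and its names are my own illustration, not from any real code):

using System.Collections.Generic;
using HtmlAgilityPack;

static class IteratorLeak {
    // 'doc' is lifted onto the compiler-generated iterator class, so the
    // whole parsed tree stays rooted for as long as the enumerator lives.
    public static IEnumerable<string> NodeNames(string html) {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        foreach (var node in doc.DocumentNode.Descendants())
            yield return node.Name;
    }
}

If a caller keeps such an enumerator around (or abandons it half-way through), the HtmlDocument it references cannot be collected.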
The worst thing about HTMLAgilityPack is that it fragments the heap, and that ends up being a real issue.
I am quite sure that once you start paying attention to the scope of your HTML fragments, you will find that things start going well.
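For illustration, this is the kind of tight scoping I mean (a sketch; ExtractLinks and the idea of copying out plain strings are my own example, not from your code):

using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

static class ScopedParsing {
    // Copy plain strings out of the document instead of keeping HtmlNode
    // references, so the entire parsed tree is collectible on return.
    public static List<string> ExtractLinks(string html) {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        return doc.DocumentNode.Descendants("a")
                  .Select(a => a.GetAttributeValue("href", ""))
                  .ToList();
    }
}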
As no code is available, here's my first-aid kit: have a look at your process with WinDbg and the SOS extension, and take a dump of your memory. Here is how to do that.
Attach WinDbg to the process (or open the memory dump), then load the SOS debugging extension by entering
.loadby sos clr
then enter
!dumpheap -stat
You then get every object type allocated in your application, with its method table address, instance count, and total size, grouped by type and sorted from the smallest total to the largest. Near the bottom you will see something like System.String[] with a massive number in front of it; that's the stuff you want to investigate first.
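The tail of the output has roughly this shape (all values made up purely for illustration):

              MT    Count    TotalSize Class Name
00007ff8d1c2b3a0    52140      5013440 HtmlAgilityPack.HtmlNode
00007ff8d0a1b2c0   210331     48122890 System.String[]
Total 2310456 objects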
Now, to see which instances account for that, you can type
!dumpheap -mt <method table address>
And you will see the addresses of all the instances with that method table (MT) and the amount of RAM each one uses.
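Again, the output shape (made-up values, truncated):

         Address               MT     Size
000001fa11112222 00007ff8d0a1b2c0      532
000001fa11113333 00007ff8d0a1b2c0    16408
...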
Now it becomes interesting: rather than going through hundreds of lines of code, you type
!gcroot <address>
What it will print is the chain of references keeping that object alive: the rooting stack frame (with the file and line of code when symbols are available), the compiler-generated class, and the variable causing you grief, as well as the bytes it holds.
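The root chain looks roughly like this (addresses, names, and paths made up for illustration):

Thread 2f44:
    000000d1a2b3c4d0 00007ff8d1e2f3a4 MyService.Parser.Run() [C:\src\Parser.cs @ 57]
        rbp-20: 000000d1a2b3c500
            ->  000001fa12345678 MyService.Parser+<GetNodes>d__3
            ->  000001fa1234abcd HtmlAgilityPack.HtmlDocument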
This is what one could call "production debugging", and it works if you have access to the server, which I guess you have.