I have built a little crawler, and while trying it out I found that when crawling certain sites it uses 98-99% CPU.
I used dotTrace to see what the problem could be, and it pointed me towards my
HttpWebRequest method. I optimised it a bit with the help of some previous questions here on Stack Overflow, but the problem was still there.
I then looked at which URLs were causing the CPU load and found that it was actually sites that are extremely large in size - go figure :) So, now I am 99% certain it has to do with the following piece of code:
    HtmlAgilityPack.HtmlDocument documentt = new HtmlAgilityPack.HtmlDocument();
    HtmlAgilityPack.HtmlNodeCollection list;
    HtmlAgilityPack.HtmlNodeCollection frameList;

    documentt.LoadHtml(_html);
    list = documentt.DocumentNode.SelectNodes(".//a[@href]");
All I want to do is extract the links on the page, so for large sites... is there any way I can get this to not use so much CPU?
I was thinking maybe to limit what I fetch? What would be my best option here?
Certainly someone must have run into this problem before :)
".//a[@href]" is extremely slow XPath. Tried to replace with "//a[@href]" or with code that simply walks whole document and checks all A nodes.
Why this XPath is slow:

1. "." (the leading dot) means start from the current node.
2. "//" means select all descendant nodes.
3. "a[@href]" means select the a nodes (with an href attribute) among them.

Combined, portions 1 and 2 end up meaning "for every node, select all of its descendant nodes", which is very slow.
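For illustration, both alternatives might look something like this (a sketch reusing the `_html` string from the question; `SelectNodes`, `Descendants`, and `GetAttributeValue` are standard HtmlAgilityPack APIs):

    using System.Linq;
    using HtmlAgilityPack;

    var doc = new HtmlDocument();
    doc.LoadHtml(_html);

    // Option 1: anchor the XPath at the document root instead of ".//"
    // Note: SelectNodes returns null (not an empty collection) when nothing matches.
    var links = doc.DocumentNode.SelectNodes("//a[@href]");

    // Option 2: skip XPath entirely and walk the node tree once
    var hrefs = doc.DocumentNode
        .Descendants("a")
        .Select(a => a.GetAttributeValue("href", null))
        .Where(h => h != null);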
CsQuery is a .NET port of jQuery with a full CSS selector engine; it lets you use CSS selectors as well as the jQuery API to access and manipulate HTML. It's on NuGet as CsQuery.
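A minimal sketch of the same link extraction with CsQuery (assuming the page HTML is in a string `html`):

    using System;
    using CsQuery;

    CQ dom = CQ.Create(html);

    // "a[href]" is a plain CSS selector; iterating the CQ object
    // yields the matched DOM elements.
    foreach (IDomObject a in dom["a[href]"])
        Console.WriteLine(a.GetAttribute("href"));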