I created a little crawler, and after testing it, I discovered that it consumes between 98 and 99 percent CPU while scanning certain websites.
I ran a profiler to determine the potential source of the issue, and it pointed me to my parsing code. With the aid of several earlier questions on Stack Overflow I improved the approach somewhat, but the issue persisted.
When I looked into the URLs that were causing the CPU load, I discovered that they were very large pages. I am now almost certain that the following lines of code are to blame:
    HtmlAgilityPack.HtmlDocument documentt = new HtmlAgilityPack.HtmlDocument();
    HtmlAgilityPack.HtmlNodeCollection list;
    HtmlAgilityPack.HtmlNodeCollection frameList;
    documentt.LoadHtml(_html);
    list = documentt.DocumentNode.SelectNodes(".//a[@href]");
I only want to extract the links from the page, so is there any way to make this process less CPU-intensive for large sites?
Should I perhaps limit how much of the page I fetch? What would be the best course of action here?
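On the "restrict the information I fetch" idea: if partial link coverage is acceptable, one option is to cap how many bytes of HTML you download before parsing. A minimal sketch under that assumption — the 512 KB cap and the helper names are my own illustration, not from the question:

```csharp
using System;
using System.IO;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

static class BoundedFetch
{
    // Read at most maxBytes from the stream and decode as UTF-8.
    // (A real crawler should honour the response charset instead of assuming UTF-8.)
    public static async Task<string> ReadAtMostAsync(Stream stream, int maxBytes)
    {
        var buffer = new byte[maxBytes];
        int total = 0, read;
        while (total < maxBytes &&
               (read = await stream.ReadAsync(buffer, total, maxBytes - total)) > 0)
        {
            total += read;
        }
        return Encoding.UTF8.GetString(buffer, 0, total);
    }

    public static async Task<string> FetchHtmlAsync(HttpClient client, string url)
    {
        // ResponseHeadersRead returns as soon as the headers arrive,
        // so the whole body of a huge page is never buffered in memory.
        using (var response = await client.GetAsync(url, HttpCompletionOption.ResponseHeadersRead))
        using (var stream = await response.Content.ReadAsStreamAsync())
        {
            return await ReadAtMostAsync(stream, 512 * 1024); // arbitrary 512 KB cap
        }
    }
}
```

This trades completeness for bounded work: links past the cap are simply never seen, which may or may not be acceptable for a crawler.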
Surely someone has encountered this issue before :)
The XPath ".//a[@href]" is really slow. Try "//a[@href]" instead, or code that simply traverses the whole document and checks every A node.
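The traversal alternative can be sketched with Html Agility Pack's own Descendants API, which walks the parsed tree once without invoking the XPath engine. The sample HTML below is a stand-in for the question's _html, and the LINQ shape is my own:

```csharp
using System.Linq;

// Stand-in for the downloaded page (the question's _html variable).
string _html = "<html><body><a href='http://example/a'>a</a><a name='anchor'>no href</a></body></html>";

var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(_html);

// Walk every <a> element once and keep only those with an href.
var links = doc.DocumentNode
    .Descendants("a")
    .Select(a => a.GetAttributeValue("href", null))
    .Where(href => href != null)
    .ToList();
```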
What makes this XPath so slow: "." selects the context node and "//" selects all of its descendants, so parts 1 and 2 combine into the very slow instruction "for every node, select all of its descendant nodes".
Try using CsQuery if you aren't too invested in Html Agility Pack. It builds an index while parsing the page, and selectors run much faster than in Html Agility Pack.
CsQuery is a complete CSS selector engine and .NET jQuery port that lets you access and manipulate HTML using both CSS selectors and the jQuery API. CsQuery is available on NuGet.
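A minimal sketch of the CsQuery route, assuming the CsQuery NuGet package is referenced; the sample HTML is illustrative:

```csharp
using System.Collections.Generic;
using CsQuery;

string html = "<html><body><a href='http://example/a'>a</a><a>no href</a></body></html>";

// CQ.Create parses the document and builds its selector index up front.
CQ dom = CQ.Create(html);

// jQuery-style selector: all anchors that carry an href attribute.
var links = new List<string>();
foreach (IDomObject a in dom["a[href]"])
{
    links.Add(a.GetAttribute("href"));
}
```

The up-front indexing is the trade-off: parsing costs a little more, but repeated selector queries against the same document become much cheaper.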