HtmlAgilityPack and large HTML Documents

c# html-agility-pack httpwebrequest


I have built a little crawler and now when trying it out i found that when crawling certain sites my crawler uses 98-99% CPU.

I used dotTrace to see what the problem could be and it pointed me towards my httpwebrequest method - i optimised it a bit with the help of some previous questions here on stackoverflow.. but the problem was still there.

I then went to see what URLs that were causing the CPU load and found that it was actually sites that are extremely large in size - go figure :) So, now i am 99% certain it has to do with the following piece of code:

HtmlAgilityPack.HtmlDocument documentt = new HtmlAgilityPack.HtmlDocument();
HtmlAgilityPack.HtmlNodeCollection list;
HtmlAgilityPack.HtmlNodeCollection frameList;

list = documentt.DocumentNode.SelectNodes(".//a[@href]");

All that i want to do is to extract the links on the page, so for large sites.. is there anyway i can get this to not use so much CPU?

I was thinking maybe limit what i fetch? What would be my best option here?

Certainly someone must have run into this problem before :)

2/24/2016 11:00:18 AM

Accepted Answer

".//a[@href]" is extremely slow XPath. Tried to replace with "//a[@href]" or with code that simply walks whole document and checks all A nodes.

Why this XPath is slow:

  1. "." starting with a node
  2. "//" select all descendent nodes
  3. "a" - pick only "a" nodes
  4. "@href" with href.

Portion 1+2 ends up with "for every node select all its descendant nodes" which is very slow.

10/9/2012 4:27:48 PM

Popular Answer

If you aren't heavily invested in Html Agility Pack, try using CsQuery instead. It builds an index when parsing the documents, and selectors are much faster than HTML Agility Pack. See a comparison.

CsQuery is a .NET jQuery port with a full CSS selector engine; it lets you use CSS selectors as well as the jQuery API to access and manipulate HTML. It's on nuget as CsQuery.

Related Questions


Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow
Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow