HtmlAgilityPack and large HTML Documents

c# html-agility-pack httpwebrequest

Question

I created a little crawler, and after testing it, I discovered that it consumes between 98 and 99 percent CPU while scanning certain websites.

I used dotTrace to find the likely source of the problem, and it pointed me to my HttpWebRequest code. With the help of several earlier questions on Stack Overflow I improved that part somewhat, but the problem persisted.

When I looked into the URLs that were causing the CPU load, I discovered that they were very large pages. I am now almost certain that the following code is to blame:

HtmlAgilityPack.HtmlDocument documentt = new HtmlAgilityPack.HtmlDocument();
HtmlAgilityPack.HtmlNodeCollection list;
HtmlAgilityPack.HtmlNodeCollection frameList;

documentt.LoadHtml(_html);
list = documentt.DocumentNode.SelectNodes(".//a[@href]");

I only want to extract the links from the page, so is there any way to make this less CPU-intensive for huge pages?

Maybe I should restrict the information I fetch? What would be the best course of action in this situation?

Surely someone has encountered this issue before:)

2/24/2016 11:00:18 AM

Accepted Answer

The XPath ".//a[@href]" is very slow. Try replacing it with "//a[@href]", or with code that simply walks the whole document and checks every A node.

Why this XPath is so slow:

  1. "." starts from the current node
  2. "//" selects all descendant nodes
  3. "a" keeps only the "a" nodes
  4. "[@href]" keeps only those that have an href attribute

Steps 1 and 2 together amount to "for every node, select all of its descendant nodes", which is very slow.
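
The "walk the whole document and check every A node" alternative could look roughly like the sketch below. It uses HtmlAgilityPack's Descendants method in place of XPath. The class and method names (LinkExtractor, ExtractLinks) are made up for illustration, and whether this is actually faster on your pages is something to confirm with a profiler.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper, not part of HtmlAgilityPack.
public static class LinkExtractor
{
    // Parses the HTML, walks the tree once, and collects the href value
    // of every <a> element that has one.
    public static List<string> ExtractLinks(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        var links = new List<string>();
        foreach (HtmlNode anchor in document.DocumentNode.Descendants("a"))
        {
            // The attribute indexer returns null when the attribute is absent.
            HtmlAttribute href = anchor.Attributes["href"];
            if (href != null)
            {
                links.Add(href.Value);
            }
        }
        return links;
    }
}
```

In the question's code this would replace the LoadHtml and SelectNodes calls with something like: var list = LinkExtractor.ExtractLinks(_html);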

10/9/2012 4:27:48 PM

Popular Answer

Try using CsQuery if you aren't too invested in HTML Agility Pack. It builds an index when parsing the pages, and selectors run much faster than in HTML Agility Pack.

CsQuery is a complete CSS selector engine and jQuery port for .NET that lets you access and manipulate HTML using CSS selectors and the jQuery API. CsQuery is available on NuGet.
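
As a rough sketch of what link extraction might look like with CsQuery, assuming the CsQuery NuGet package is referenced: the helper names (CsQueryLinkExtractor, ExtractLinks) are hypothetical, and the calls used here (CQ.Create, the selector indexer, GetAttribute) should be checked against the CsQuery version you install.

```csharp
using System.Collections.Generic;
using CsQuery;

// Hypothetical helper, not part of CsQuery.
public static class CsQueryLinkExtractor
{
    // Parses the HTML and uses the CSS selector "a[href]" to pick
    // anchor elements that carry an href attribute.
    public static List<string> ExtractLinks(string html)
    {
        CQ dom = CQ.Create(html);

        var links = new List<string>();
        foreach (IDomObject anchor in dom["a[href]"])
        {
            links.Add(anchor.GetAttribute("href"));
        }
        return links;
    }
}
```

The selector "a[href]" is the CSS counterpart of the XPath "//a[@href]" used in the question.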




Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow