C# HTMLAgilityPack VS regular expressions for extracting links from HTML

c# html-agility-pack html-parsing regex

Question

I'm writing a C# web crawler and when I run the profiling I can see that HTMLAgilityPack's LoadHTML method is using 10% of the programs overall CPU usage. I'd like to try and lower this.

I'm sure a regular expression would be faster but as I look at link extracting examples on SO I see everyone saying this method should be avoided in favour of a html parser like HTMLAgilityPack.

As all I need to do is extract links from HTML is using HTMLAgilityPack over kill?

Are the reasons for favouring a HTML parser applicable to my case as I'm only using it for extracting links?


Downloaded HTML with WebClient then compared.

Using href\\s*=\\s*(?:[\"'](?<1>[^\"']*)[\"']|(?<1>\\S+)) (then trimming and adding to a list) is way faster than HTMLAgilityPack.

43 milliseconds compared to 3 consistently.


See my code on pastebin

Accepted Answer

Are the reasons for favouring a HTML parser applicable to my case as I'm only using it for extracting links?

In your case the HTML parser is overkill as your tests have shown.

People who answer on SO use that as a rote answer to all regex questions. One should use the tool if one actually needs to parse the domain of the HTML in a more robust fashion.


Bias against Regular Expressions are found by people who feel that they are too slow or cumbersome [to learn]. There is some merit on what is proposed by them for certain operations, in that specific optimized text for finding utilities do perform better. Sure I agree, but to dismiss regex out of hand, well that is par for the course on StackOverflow.

Why is that? Sometimes the analysis is simply flawed because the pattern provided introduces a lot of unnecessary backtracking and is not optimized. That handicaps regex out of the gate. One does have to learn the regex language and understand what it is doing to tune the engine of regex to not pollute.

For example I took your same C# code test, but I used an optimized pattern of yours and my own and was able to get it down to 1 millisecond consistently!

Most people learn basic pattern matching by doing searches with a *. When they first learn regex they use * with the . such as .*. That step along with indiscriminate usage of the * will most likely will doom any non beginning pattern to the hell of backtracking and slow responses.

Unless you know empirically that there are no items, use the + instead.


Back in 2009 I wrote about this subject on my blog Are C# .Net Regular Expressions Fast Enough for You?



Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow
Is this KB legal? Yes, learn why
Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow
Is this KB legal? Yes, learn why