C# HtmlAgilityPack vs. regular expressions for extracting links from HTML

c# html-agility-pack html-parsing regex


When I profile the C# web crawler I'm building, I can see that HtmlAgilityPack's LoadHtml method accounts for 10% of the application's CPU usage. I'd like to reduce this.

Although a regular expression would undoubtedly be quicker, every link-extraction example on SO seems to recommend using an HTML parser such as HtmlAgilityPack instead.

Since all I need to do is extract links from HTML, is HtmlAgilityPack overkill?

Given that I'm only using the HTML parser to extract links, do the arguments in favor of HTML parsers apply to me?

I used WebClient to fetch the HTML, then compared the two approaches.

Using href\\s*=\\s*(?:[\"'](?<1>[^\"']*)[\"']|(?<1>\\S+)) (then trimming each match and adding it to a list) is faster than HtmlAgilityPack.
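For readers who want to try this, here is a minimal sketch of how the pattern above might be applied. The original test code lives on Pastebin and is not reproduced here, so the helper name ExtractLinks, the RegexOptions, and the sample HTML are all assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// The pattern from the question, written as a verbatim string so the
// backslashes are not doubled. Group 1 captures the href value, whether
// it is quoted or unquoted.
// Assumption: IgnoreCase and Compiled are illustrative choices, not
// options the question states it used.
var hrefPattern = new Regex(
    @"href\s*=\s*(?:[""'](?<1>[^""']*)[""']|(?<1>\S+))",
    RegexOptions.IgnoreCase | RegexOptions.Compiled);

// Hypothetical helper: trim each captured value and add it to a list.
List<string> ExtractLinks(string html)
{
    var links = new List<string>();
    foreach (Match m in hrefPattern.Matches(html))
    {
        links.Add(m.Groups[1].Value.Trim());
    }
    return links;
}

// Hypothetical sample input; the question fetched real pages with WebClient.
var sample = "<a href=\"https://example.com/a\">A</a> "
           + "<a href='/b'>B</a> "
           + "<a href=/c class=x>C</a>";

var found = ExtractLinks(sample);
foreach (var link in found)
{
    Console.WriteLine(link);
}
```

One thing to watch with this pattern: the unquoted branch (\S+) runs to the next whitespace, so an unquoted value directly followed by > would capture the > and whatever follows it as well.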

Consistently about 43 milliseconds for HtmlAgilityPack vs. 3 milliseconds for the regex.

Check out my code at pastebin.

5/9/2017 9:42:00 PM

Accepted Answer

Are the reasons for favouring a HTML parser applicable to my case as I'm only using it for extracting links?

The HTML parser is excessive in this situation, as your experiments have shown.

Recommending a parser is a rote response that people answering on SO give to any regex question. One should use the parser if one genuinely needs to parse HTML in a more rigorous manner.

Some people find regular expressions too slow or too difficult [to learn]. There is some merit to their position for particular operations, in that certain optimized string-search utilities do perform better. I agree with that, but dismissing regex out of hand is standard practice on StackOverflow.

Why is that? Sometimes the comparison is simply unfair because the pattern is not optimized and causes a lot of unnecessary backtracking. That puts regex at a disadvantage from the start. To tune a pattern so that the regex engine doesn't waste effort, one does need to learn the regex language and understand what the engine is doing.

Example: I took your same C# test code and ran it with a pattern that combines yours with my own optimizations, and I was consistently able to get the time down to 1 millisecond!

Most people first learn basic pattern matching by searching with a * wildcard. When they learn regex, they use the * with the ., that is, .* . That habit, along with indiscriminate use of the *, will almost certainly consign any pattern beyond the beginner level to a purgatory of backtracking and slow responses.

Use the + instead, unless you have a demonstrated reason not to.
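The answer does not show its optimized pattern, so the following is only an illustrative sketch of the general point. It compares a greedy .* with a negated character class quantified by +. Both patterns and the sample HTML are assumptions made up for this example.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical sample input with two links on one line.
var html = "<a href=\"/one\">1</a> <a href=\"/two\">2</a>";

// Greedy: .* first runs to the end of the line, then backtracks one
// character at a time until it finds the LAST double quote.
var greedy = new Regex(@"href=""(.*)""");

// Tight: [^""]+ can never cross a double quote, so the engine stops at
// the first closing quote and does not need to backtrack past it.
var tight = new Regex(@"href=""([^""]+)""");

var greedyMatches = greedy.Matches(html);
var tightMatches = tight.Matches(html);

// The greedy pattern finds one oversized match that swallows both links.
Console.WriteLine("greedy: " + greedyMatches.Count + " match");
Console.WriteLine("  " + greedyMatches[0].Groups[1].Value);

// The tight pattern finds each link separately.
Console.WriteLine("tight: " + tightMatches.Count + " matches");
foreach (Match m in tightMatches)
{
    Console.WriteLine("  " + m.Groups[1].Value);
}
```

On input this small the timing difference is negligible. The extra backtracking from .* grows with the length of the text it has to scan, which is where a crawler processing whole pages would feel it.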

I wrote about this topic on my blog, Do You Find C#.Net Regular Expressions to be Quick Enough?, in 2009.

2/5/2018 7:19:56 PM

Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow