c# .net4 - regex vs html agility pack

c# html-agility-pack memory regex

Question

What's faster? I just made a web scraper that uses HTML Agility Pack, and it's consuming massive amounts of memory.

Profiling it with a memory profiler, I found that the HtmlDocument, HtmlNode, etc. instances are taking up the most memory.

I feel like it might be faster and more memory-efficient to use regex. Am I wrong?

Accepted Answer

A regex will be a lot faster than HTML Agility Pack.

But remember that HTML is not always well formed, and browsers are very forgiving about mistakes. Searching for the data you want using only regex may fail on such pages.

Agility Pack is a great tool. It provides a lot of features in exchange for the memory it consumes.


Popular Answer

Depending on what exactly you are doing, regex really could speed things up and free some memory. The question is how rigid and well-formed the pages you are extracting data from are. Regex is much more easily confused by perfectly valid but unexpected HTML constructs that you might encounter in the wild.
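To make that concrete, here is a minimal sketch of how a regex can miss valid HTML. The pattern, the class name RegexLinkDemo, and the sample markup are all made up for illustration; they are not from the question. The pattern assumes href is the first attribute and is double-quoted, so a link that puts another attribute first and uses single quotes slips through.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; the names here are illustrative only.
public static class RegexLinkDemo
{
    // Naive pattern: assumes href is the first attribute and is double-quoted.
    private static readonly Regex NaiveLink =
        new Regex("<a href=\"([^\"]*)\"", RegexOptions.IgnoreCase);

    // Counts how many anchors the naive pattern finds in the given markup.
    public static int CountLinks(string html)
    {
        return NaiveLink.Matches(html).Count;
    }

    public static void Main()
    {
        // The shape the pattern was written for.
        string expected = "<a href=\"/one\">one</a>";

        // Equally valid HTML: another attribute comes first, single quotes are used.
        string unexpected = "<a class='x' href='/two'>two</a>";

        Console.WriteLine(CountLinks(expected));   // prints 1
        Console.WriteLine(CountLinks(unexpected)); // prints 0
    }
}
```

You can keep widening the pattern to cover more cases, but each new variation in the pages costs another tweak. That is the trade-off against the parser's memory use.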




Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow