Better option for web scraping (HTMLAgilityPack or Python+beautifulsoup) for C# programmer

beautifulsoup c# html-agility-pack python

Question

I'm a .NET programmer and I need to work on a web scraping project. I'd like to get a sense of how HTMLAgilityPack compares with BeautifulSoup.

Many people say BeautifulSoup is much better than HTMLAgilityPack, but to use it I would need to learn Python.

So my question is: is it reasonable for me to learn Python and BeautifulSoup, or should I continue with C# and HTMLAgilityPack?

Any other suggestion is warmly welcomed.

Accepted Answer

In the C# .NET world, I would recommend the HTMLAgilityPack because it is very flexible. It lets you manipulate poorly formed HTML as if it were well formed XML, so you can use XPath or just iterate over nodes.
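As a rough illustration of the two access styles mentioned above, the sketch below loads a deliberately malformed fragment and reads it once with XPath and once by walking the nodes. The sample markup and the class and variable names are made up for illustration, and it assumes the HtmlAgilityPack NuGet package is referenced.

```csharp
using System;
using HtmlAgilityPack;

class XPathSketch
{
    static void Main()
    {
        // Hypothetical sample markup: the <li> and <p> tags are never closed.
        string html = "<ul><li><a href='/a'>First</a><li><a href='/b'>Second</a></ul><p>tail";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Style 1: XPath query. SelectNodes returns null when nothing matches.
        var links = doc.DocumentNode.SelectNodes("//a[@href]");
        if (links != null)
        {
            foreach (HtmlNode link in links)
            {
                Console.WriteLine(link.GetAttributeValue("href", "") + " : " + link.InnerText);
            }
        }

        // Style 2: walk the tree and print the name of each element node.
        foreach (HtmlNode node in doc.DocumentNode.Descendants())
        {
            if (node.NodeType == HtmlNodeType.Element)
            {
                Console.WriteLine(node.Name);
            }
        }
    }
}
```

The null check matters in practice: pages that lack the element you expect are common when scraping, and an unguarded `foreach` over a null result throws.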

BeautifulSoup is a great choice for HTML scraping, but from a developer's perspective it is not easy to become productive in a completely new technology. So I would strongly recommend HTMLAgilityPack if you are a .NET developer.

You can have great success with the combination of HTML Agility Pack, regular expressions, and XDocument (LINQ to XML).
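To give a feel for the LINQ side of that combination, here is a minimal sketch that filters links with ordinary LINQ operators. It covers only the HTML Agility Pack plus LINQ part (no regular expressions or XDocument), and the markup, class name, and filter rule are hypothetical.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

class LinqSketch
{
    static void Main()
    {
        // Hypothetical sample markup, not taken from any real site.
        string html = "<div><a href='http://example.com/x'>X</a><a href='/local'>Local</a><a>No href</a></div>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Descendants("a") yields an IEnumerable<HtmlNode>, so LINQ operators apply directly.
        var absoluteLinks = doc.DocumentNode
            .Descendants("a")
            .Select(a => a.GetAttributeValue("href", ""))
            .Where(href => href.StartsWith("http", StringComparison.OrdinalIgnoreCase))
            .ToList();

        foreach (string href in absoluteLinks)
        {
            Console.WriteLine(href);
        }
    }
}
```

Passing a default value to `GetAttributeValue` keeps anchors without an `href` from producing nulls in the pipeline.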

It's extremely powerful. "LINQ and lambda (part 3) - HTML Agility Pack" is a blog post by Vijay Santhanam that got me hooked on it.


Popular Answer

CsQuery, a library I created, is a relatively new alternative to Html Agility Pack. It offers the following advantages:

  • Complete CSS3 selector support, which most people already know from client-side coding and find much easier than XPath
  • The jQuery API, for the same reasons
  • Uses the validator.nu HTML parser, a fully HTML5-compliant parser. This is the same code base used by Gecko-based browsers (Firefox), meaning it should produce the exact same DOM as web browsers even for typically bad/invalid markup.
  • Indexes documents, making selectors extremely fast even on very large documents. HAP must traverse the full document tree for each selector, making it very slow for complex selectors and large documents.
  • Extensive unit test coverage - all the tests from jQuery and Sizzle (the jQuery CSS selection engine) have been ported to C#.

Disadvantages:

  • Right now it only compiles against the .NET 4+ full framework, whereas HAP has builds for most .NET environments.

You can get it from NuGet: Install-Package CsQuery.
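For comparison with XPath, a CSS-selector query in CsQuery might look roughly like the sketch below. The markup, selector, and class name are hypothetical, and the calls shown (`CQ.Create`, the selector indexer, `Length`, `Text()`, `GetAttribute`) should be checked against the version of the package you install.

```csharp
using System;
using CsQuery;

class CsQuerySketch
{
    static void Main()
    {
        // Hypothetical sample markup.
        string html = "<ul id='menu'><li class='item'><a href='/a'>First</a></li>"
                    + "<li class='item'><a href='/b'>Second</a></li></ul>";

        CQ dom = CQ.Create(html);

        // Indexing the CQ object with a CSS selector plays the role of jQuery's $(...).
        CQ links = dom["#menu li.item > a"];

        Console.WriteLine("Matched: " + links.Length);
        Console.WriteLine("Combined text: " + links.Text());

        // A CQ object can be enumerated as a sequence of DOM elements.
        foreach (IDomObject link in links)
        {
            Console.WriteLine(link.GetAttribute("href"));
        }
    }
}
```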




Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow