I'm a .NET programmer and I need to work on a web scraping project. I'd like to get an idea of how HTMLAgilityPack compares with BeautifulSoup.
Many people say BeautifulSoup is much better than HTMLAgilityPack, but to use it I would need to learn Python.
So my question is: is it reasonable for me to learn Python and BeautifulSoup, or should I continue with C# and HTMLAgilityPack?
Any other suggestion is warmly welcomed.
In the C# .NET world, I would recommend the HTMLAgilityPack because it is very flexible. It lets you manipulate poorly formed HTML as if it were well formed XML, so you can use XPath or just iterate over nodes.
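To make the "XPath or just iterate over nodes" point concrete, here is a minimal sketch. It assumes the HtmlAgilityPack NuGet package is referenced; the HTML string is made up for illustration, and how the parser repairs the unclosed tags may vary between library versions.

```csharp
using System;
using HtmlAgilityPack;

class ScrapeSketch
{
    static void Main()
    {
        // Hypothetical, deliberately sloppy markup: the <li> tags are never closed.
        const string html =
            "<html><body><ul>" +
            "<li><a href='/first'>First</a>" +
            "<li><a href='/second'>Second</a>" +
            "</ul></body></html>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Option 1: query with XPath. SelectNodes returns null when nothing matches.
        HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");
        if (links != null)
        {
            foreach (HtmlNode link in links)
            {
                string href = link.GetAttributeValue("href", "");
                Console.WriteLine(href + " -> " + link.InnerText);
            }
        }

        // Option 2: walk the tree and look at every element node.
        foreach (HtmlNode node in doc.DocumentNode.Descendants())
        {
            if (node.NodeType == HtmlNodeType.Element)
            {
                Console.WriteLine("element: " + node.Name);
            }
        }
    }
}
```

To fetch a live page instead of parsing a string, the library also provides `HtmlWeb`, whose `Load(url)` method returns an `HtmlDocument`.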
BeautifulSoup is a great way to go for HTML scraping, but from a developer's perspective it is not easy to get hands-on with a completely new technology. So I would strongly recommend HTMLAgilityPack if you are a .NET guy.
You can have great success with the combination of HTML Agility Pack, regular expressions, and XDocument (LINQ to XML).
It's extremely powerful. "LINQ and lambda (part 3) - HTML Agility Pack", a blog post by Vijay Santhanam, is what got me hooked on it.
CsQuery, a library I created, is a relatively new alternative to Html Agility Pack. It offers the following advantages:
Disadvantages:
You can get it from NuGet: Install-Package CsQuery