Is the Html Agility Pack still the best .NET HTML parser?

.net c# html html-agility-pack parsing

Question

Html Agility Pack was given as the answer to a StackOverflow question some time ago, is it still the best option? What other options should be considered? Is there something more lightweight?

Accepted Answer

There is a spreadsheet with the comparisons.

In summary:

CsQuery Performance vs. Html Agility Pack and Fizzler I put together some performance tests to compare CsQuery to the only practical alternative that I know of (Fizzler, an HtmlAgilityPack extension). I tested against three different documents:

  • The sizzle test document (about 11 k)
  • The wikipedia entry for "cheese" (about 170 k)
  • The single-page HTML 5 spec (about 6 megabytes)

The overall results are:

  • HAP is faster at loading the string of HTML into an object model. This makes sense, since I don't think Fizzler builds an index (or perhaps it builds only a relatively simple one). CsQuery takes anywhere from 1.1 to 2.6x longer to load the document. More on this below.
  • CsQuery is faster for almost everything else. Sometimes by factors of 10,000 or more. The one exception is the "*" selector, where sometimes Fizzler is faster. For all tests, the results are completely enumerated; this case just results in every node in the tree being enumerated. So this doesn't test the selection engine so much as the data structure.
  • CsQuery did a better job at returning the same results as a browser. Each of the selectors here was verified against the same document in Chrome using jQuery 1.7.2, and the numbers match those returned by CsQuery. This is probably because HtmlAgilityPack handles optional (missing) tags differently. Additionally, nth-child is not implemented completely in Fizzler - it only supports simple values (not formulae).

Popular Answer

When it comes to HTML parsing, there's no comparison to the real thing. This is a C# port of the validator.nu parser. This is the same code base used by Gecko-based browsers (e.g. Firefox). There repo looks a bit dusty but don't be fooled.. the port is outstanding. It's just been overlooked. I integrated it into CsQuery about a month ago. It passes all the CsQuery tests (which include most of the jQuery and Sizzle tests ported to C#).

I'm not aware of any other HTML5 parsers written in C#, or even any that come remotely close to doing a good job in terms of missing, optional, and invalid tag handling. This doesn't just do a great job though - it's standards compliant.

The repo I linked to above is the original port, it includes a basic wrapper that produces an XML node tree. CsQuery versions 1.3 and higher use this parser.




Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow
Is this KB legal? Yes, learn why
Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow
Is this KB legal? Yes, learn why