Is the Html Agility Pack still the best .NET HTML parser available?

.net c# html html-agility-pack parsing

Question

Is the Html Agility Pack, which was previously suggested as the answer to a Stack Overflow question, still the best choice? What alternatives should be considered? Is there anything more lightweight?

1
57
5/23/2017 12:18:22 PM

Accepted Answer

A spreadsheet with the comparisons is available.

To sum up:

CsQuery Performance vs. Html Agility Pack and Fizzler

I put together some performance tests to compare CsQuery to the only practical alternative that I know of (Fizzler, an HtmlAgilityPack extension). I tested against three different documents:

  • The sizzle test document (about 11 k)
  • The Wikipedia entry for "cheese" (about 170 k)
  • The single-page HTML 5 spec (about 6 megabytes)

The overall results are:

  • HAP is faster at loading the string of HTML into an object model. This makes sense, since I don't think Fizzler builds an index (or perhaps it builds only a relatively simple one). CsQuery takes anywhere from 1.1 to 2.6x longer to load the document. More on this below.
  • CsQuery is faster for almost everything else. Sometimes by factors of 10,000 or more. The one exception is the "*" selector, where sometimes Fizzler is faster. For all tests, the results are completely enumerated; this case just results in every node in the tree being enumerated. So this doesn't test the selection engine so much as the data structure.
  • CsQuery did a better job at returning the same results as a browser. Each of the selectors here was verified against the same document in Chrome using jQuery 1.7.2, and the numbers match those returned by CsQuery. This is probably because HtmlAgilityPack handles optional (missing) tags differently. Additionally, nth-child is not implemented completely in Fizzler - it only supports simple values (not formulae).
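
To make the nth-child point concrete: CSS allows :nth-child to take either a simple value such as 3 or a formula of the form an+b, which matches every 1-based sibling position equal to a*n + b for some integer n >= 0. The sketch below is only an illustration of that matching rule in plain C#. It is not code from CsQuery or Fizzler, and the names NthChildSketch, Matches, and Select are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper illustrating the CSS an+b rule; not part of any library.
public static class NthChildSketch
{
    // True when the 1-based sibling position equals a*n + b for some integer n >= 0.
    public static bool Matches(int a, int b, int position)
    {
        int diff = position - b;
        if (a == 0) return diff == 0;          // simple value, e.g. :nth-child(3)
        return diff % a == 0 && diff / a >= 0; // n = diff / a must be a non-negative integer
    }

    // 1-based positions among `count` siblings that the formula selects.
    public static List<int> Select(int a, int b, int count) =>
        Enumerable.Range(1, count).Where(p => Matches(a, b, p)).ToList();

    public static void Main()
    {
        Console.WriteLine(string.Join(",", Select(0, 3, 10)));  // simple value 3
        Console.WriteLine(string.Join(",", Select(2, 1, 10)));  // formula 2n+1 (odd positions)
        Console.WriteLine(string.Join(",", Select(-1, 3, 10))); // formula -n+3 (first three)
    }
}
```

An engine that supports only simple values handles the a = 0 case alone, so selectors such as :nth-child(2n+1) or :nth-child(-n+3) would not return the elements a browser returns.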
53
8/28/2014 8:15:39 PM

Popular Answer

When it comes to HTML parsing, there is no substitute for the real thing. This is a C# port of the validator.nu parser, the same code base used by Gecko-based browsers (e.g. Firefox). Do not be put off by the repo's appearance; the port is excellent, it has just been neglected. It was recently integrated into CsQuery. It passes every CsQuery test, including the vast majority of the jQuery and Sizzle tests ported to C#.

I'm not aware of any other HTML5 parsers written in C#, or of any parser that comes close to handling missing, optional, and invalid tags well. This one does more than just a great job, though: it is standards-compliant.

The repository that I previously linked to has the original port and a simple wrapper that generates an XML node tree. This parser is used by CsQuery versions 1.3 and above.



Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow