Can Html Agility Pack be used to parse HTML fragments?

.net c# html html-agility-pack parsing

Question

I need to get LINK and META elements from ASP.NET pages, user controls and master pages, grab their contents and then write back updated values to these files in a utility I'm working on.

I could try using regular expressions to grab just these elements but there are several issues with that approach:

  • I expect many of the input files to contain broken HTML (missing / out-of-sequence elements, etc.)
  • SCRIPT elements that contain comments and/or VBScript/JavaScript that looks like valid elements, etc.
  • I need to be able to special-case IE conditional comments and META and LINK elements inside IE conditional comments
  • Not to mention that HTML is not a regular language, so regular expressions cannot parse it reliably

I researched HTML parsers for .NET, and many SO posts and blogs recommend the HTML Agility Pack. I've never used it before and I don't know whether it can parse broken HTML and HTML fragments. (For example, imagine a user control that contains only a HEAD element with some content in it, and no HTML or BODY.) I know I could read the documentation, but it would save me quite a bit of time if someone could advise. (Most SO posts involve parsing full HTML pages.)

Accepted Answer

Absolutely, that is what it excels at.

In fact, many web pages you'll find in the wild could be described as HTML fragments, due to missing <html> tags, or improperly closed tags.

The HtmlAgilityPack simulates what the browser has to do: try to make sense of what is sometimes a jumble of mismatched tags. It is an imperfect science, but HtmlAgilityPack does it very well.
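As a rough illustration of the fragment case from the question, here is a minimal sketch that loads a HEAD-only snippet with no closing tag, reads the LINK and META elements, and rewrites their attributes. The sample markup, the class name and the "old"/"new" values are made up for the example; it assumes the HtmlAgilityPack NuGet package is referenced.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

class FragmentDemo
{
    static void Main()
    {
        // Hypothetical user-control content: a HEAD with no HTML or BODY,
        // and no closing </head>, to simulate broken markup.
        string fragment =
            "<head>" +
            "<meta name=\"description\" content=\"old description\">" +
            "<link rel=\"stylesheet\" href=\"old.css\">";

        var doc = new HtmlDocument();
        doc.LoadHtml(fragment);

        // Descendants(name) walks the whole parsed tree, so it does not
        // matter that there is no <html> wrapper around the fragment.
        foreach (HtmlNode link in doc.DocumentNode.Descendants("link").ToList())
        {
            string href = link.GetAttributeValue("href", "");
            link.SetAttributeValue("href", href.Replace("old", "new"));
        }

        foreach (HtmlNode meta in doc.DocumentNode.Descendants("meta").ToList())
        {
            string content = meta.GetAttributeValue("content", "");
            meta.SetAttributeValue("content", content.Replace("old", "new"));
        }

        // Serialize the (modified) tree back to markup.
        Console.WriteLine(doc.DocumentNode.OuterHtml);
    }
}
```

For real files, `doc.Load(path)` and `doc.Save(path)` could replace the string round trip. How ASP.NET directives and server tags survive that round trip is something to verify on your own files before relying on it.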


Popular Answer

An alternative to Html Agility Pack is CsQuery, a C# jQuery port of which I am the primary author. It lets you use CSS selectors and the full jQuery API to access and manipulate the DOM, which for many people is easier than XPath. Additionally, its HTML parser is designed with a variety of purposes in mind, and there are several options for parsing HTML: as a full document (missing html and body tags will be added, and any orphaned content moved inside the body); as a content block (it won't be wrapped as a full document, but optional tags such as tbody that are still mandatory in the DOM are added automatically, the same as browsers do); and as a true fragment where no tags are created (e.g. in case you're just working with building blocks).

See creating a new DOM for details.
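The three parsing modes described above can be compared side by side. This is only a sketch, assuming the CsQuery NuGet package is referenced; the sample table markup and the class name are invented for the example, and the rendered output is printed rather than asserted, since it depends on the library version.

```csharp
using System;
using CsQuery;

class ParseModesDemo
{
    static void Main()
    {
        // Sample markup with no html/body wrapper and no tbody.
        string html = "<table><tr><td>cell</td></tr></table>";

        // Full document: missing html/body tags are added.
        CQ asDocument = CQ.CreateDocument(html);

        // Content block: not wrapped as a document, but tags the DOM
        // requires (such as tbody) may still be generated.
        CQ asContent = CQ.Create(html);

        // True fragment: intended for building blocks, no tags created.
        CQ asFragment = CQ.CreateFragment(html);

        // Render() serializes each DOM so the differences are visible.
        Console.WriteLine(asDocument.Render());
        Console.WriteLine(asContent.Render());
        Console.WriteLine(asFragment.Render());
    }
}
```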

Additionally, CsQuery's HTML parser has been designed to honor the HTML5 spec for optional closing tags. For example, closing p tags are optional, but there are specific rules that determine when the block should be closed. In order to produce the same DOM that a browser does, the parser needs to implement the same rules. CsQuery does this to provide a high degree of compatibility with browser DOM for a given source.
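To see the optional-closing-tag rule in action, the following sketch parses two paragraphs that are never closed explicitly. Under the HTML5 rules, a new p start tag closes an open p, so a browser builds two sibling paragraphs; if the parser follows the spec as described, the selector should match two elements. The markup and class name are made up for the example.

```csharp
using System;
using CsQuery;

class OptionalCloseDemo
{
    static void Main()
    {
        // Neither <p> has a closing tag.
        CQ dom = CQ.Create("<div><p>first<p>second</div>");

        // Number of p elements the parser produced.
        Console.WriteLine(dom["p"].Length);

        // Serialized DOM, showing where the parser closed each paragraph.
        Console.WriteLine(dom.Render());
    }
}
```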

Using CsQuery is very straightforward, e.g.

CQ docFromString = CQ.Create(htmlString);
CQ docFromWeb = CQ.CreateFromUrl(someUrl);

// there are other methods for asynchronous web gets, creating from files, streams, etc.

// css selector: the indexer [] is like jQuery $(..)
CQ lastCellInFirstRow = docFromString["table tr:first-child td:last-child"];

// Text() is a jQuery method returning text contents of selection
string textOfCell = lastCellInFirstRow.Text();
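Applied to the LINK and META scenario from the question, the same API might be used roughly as follows. This is a sketch rather than a recommended workflow: the sample markup, the class name and the replacement value "new.css" are assumptions for illustration.

```csharp
using System;
using CsQuery;

class HeadElementsDemo
{
    static void Main()
    {
        // Hypothetical head content from a user control.
        string html =
            "<meta name=\"description\" content=\"old description\">" +
            "<link rel=\"stylesheet\" href=\"old.css\">";

        CQ dom = CQ.CreateFragment(html);

        // A selector group matches both element types in document order.
        foreach (IDomObject el in dom["link, meta"])
        {
            string value = el.GetAttribute("href") ?? el.GetAttribute("content");
            Console.WriteLine(el.NodeName + ": " + value);
        }

        // jQuery-style setter: updates href on every matched element.
        dom["link[rel=stylesheet]"].Attr("href", "new.css");

        // Serialize the updated markup for writing back to the file.
        Console.WriteLine(dom.Render());
    }
}
```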

Finally, CsQuery indexes documents on class, id, attribute, and tag, which makes selectors extremely fast compared to Html Agility Pack.




Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow
Is this KB legal? Yes, learn why