Can Html Agility Pack be used to parse HTML fragments?

.net c# html html-agility-pack parsing


In a tool I'm developing, I need to get LINK and META elements from user controls, master pages, and ASP.NET pages, read their contents, and then write updated values back to these files.

I could attempt to use regular expressions to grab just these elements, but there are a number of problems with that approach:

  • I anticipate that many of the input files will contain invalid HTML (e.g., missing or out-of-order elements).
  • SCRIPT elements may contain comments, or VBScript/JavaScript that looks like legitimate markup, etc.
  • In rare cases I also need to handle META and LINK elements inside IE conditional comments.
  • Not to mention the fact that HTML is not a regular language.

I looked at HTML parsers for .NET and saw that several SO posts and blogs recommend the HTML Agility Pack. It may be able to parse broken HTML and HTML fragments, but I haven't used it before to find out. (For instance, consider a user control that contains only a HEAD element with content, and no HTML or BODY element.) I could study the source myself, but it would save me a lot of time if someone could offer some guidance. (SO posts mostly cover parsing whole HTML pages.)

11/27/2017 1:27:44 PM

Accepted Answer

Yes, that is what it is best at.

In fact, many of the pages you'll encounter in the wild could be described as HTML fragments, since they lack essential tags such as <html>, or have incorrectly closed tags.

The HtmlAgilityPack replicates what a browser must do: attempt to make sense of what is sometimes a disorganized mess of tags. It's an imperfect science, but HtmlAgilityPack excels at it.
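For illustration, here is a minimal sketch of parsing a fragment with the Html Agility Pack (assuming the HtmlAgilityPack NuGet package; the sample fragment and variable names are made up, not from the question):

```csharp
using System;
using HtmlAgilityPack;

// A fragment with no <html> or <body> wrapper, as described in the question
string fragment =
    "<head><meta charset=\"utf-8\">" +
    "<link rel=\"stylesheet\" href=\"site.css\"></head>";

var doc = new HtmlDocument();
doc.LoadHtml(fragment);   // LoadHtml accepts fragments as well as full documents

// Select LINK and META elements with an XPath union expression
var nodes = doc.DocumentNode.SelectNodes("//link|//meta");
if (nodes != null)        // SelectNodes returns null when nothing matches
{
    foreach (var node in nodes)
    {
        Console.WriteLine(node.Name);
    }
}
```

Note that the parser does not invent the missing HTML or BODY elements here; the fragment's own structure is preserved in the resulting DOM.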

9/21/2012 2:42:05 PM

Popular Answer

I am the primary author of CsQuery, a C# jQuery port that serves as an alternative to Html Agility Pack. For many people, CSS selectors and the full jQuery API are easier to use for accessing and manipulating the DOM than XPath. CsQuery also has an HTML parser that was designed with several goals in mind, and it offers several options for parsing HTML:

  • As a full document: missing html and body tags will be added, and any orphaned content will be moved inside the body.
  • As a content block: it won't be wrapped as a full document, but optional tags that are still mandatory in the DOM, such as tbody, are added automatically, just as browsers do.
  • As a true fragment: no tags are created at all (for instance, if you're just working with building blocks).

Details are described in the documentation on creating a new DOM.

The HTML parser in CsQuery has also been designed to honor the HTML5 spec for optional closing tags. For example, closing p tags is optional, but there are rules that determine when the block should be considered closed. In order to produce the same DOM that a browser does, the parser must apply those same rules. CsQuery does this to provide a high degree of DOM compatibility with what a browser would produce for a given source.
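As a sketch of the three parsing modes described above (assuming CsQuery's CQ.CreateDocument and CQ.CreateFragment factory methods in addition to CQ.Create; the markup is made up for illustration):

```csharp
using CsQuery;

// Full document: missing html/body tags are added automatically,
// and orphaned content is moved inside the body
CQ asDocument = CQ.CreateDocument("<div>orphaned content</div>");

// Content block: not wrapped in html/body, but DOM-mandated tags
// such as tbody are still inserted, as a browser would do
CQ asContent = CQ.Create("<table><tr><td>cell</td></tr></table>");

// True fragment: no tags are generated at all
CQ asFragment = CQ.CreateFragment("<td>just a building block</td>");
```

Which mode you want depends on whether you are round-tripping whole pages or, as in the question, partial files such as user controls.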

It is really simple to use CsQuery, for instance.

CQ docFromString = CQ.Create(htmlString);
CQ docFromWeb = CQ.CreateFromUrl(someUrl);

// there are also methods for asynchronous web gets, creating from files, streams, etc.

// CSS selector: the indexer [] works like jQuery $(..)
CQ lastCellInFirstRow = docFromString["table tr:first-child td:last-child"];

// Text() is a jQuery method that returns the text contents of the selection
string textOfCell = lastCellInFirstRow.Text();

Finally, selectors are very fast compared to HTML Agility Pack, thanks to CsQuery's indexing of documents on class, id, attribute, and tag.

Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow