I've used HtmlAgilityPack in the past to parse HTML in .NET, but I don't like the fact that it only offers a DOM-based model.
On large documents and/or those with heavy levels of nesting, it is possible to hit stack overflow or out-of-memory exceptions. In general, a DOM-based parser also uses significantly more memory than a streaming one, because it holds the whole document in memory even though the code consuming the HTML may only need a few elements at a time.
Does anyone know of a decent HTML parser for .NET that lets you parse HTML in a manner similar to the XmlReader class, i.e. in a forward-only, streaming manner?
I usually use SgmlReader for this: https://github.com/MindTouch/SGMLReader
Like others have said, HTML doesn't follow the well-formedness rules of XML, so it is inherently difficult to parse, but SgmlReader usually does a pretty good job.
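
SgmlReader derives from XmlReader, so you use it the same forward-only way. Here is a rough sketch of what that can look like, assuming the SgmlReader package is referenced and exposes the Sgml namespace with the DocType, CaseFolding and InputStream properties (that is how I remember the API, so check it against the version you install). The sample markup and the choice to print link targets are just for illustration.

```csharp
using System;
using System.IO;
using System.Xml;
using Sgml; // from the SgmlReader package (assumed to be referenced)

class Program
{
    static void Main()
    {
        // Illustrative input with unclosed <p> tags, as real-world HTML often has.
        const string html =
            "<html><body><p>One<p>Two <a href='x.html'>link</a></body></html>";

        using (var reader = new SgmlReader())
        {
            reader.DocType = "HTML";                    // use the built-in HTML DTD
            reader.CaseFolding = CaseFolding.ToLower;   // normalise tag names to lower case
            reader.InputStream = new StringReader(html);

            // Forward-only loop: only the current node is available at any time.
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "a")
                {
                    Console.WriteLine(reader.GetAttribute("href"));
                }
            }
        }
    }
}
```

For a real document you would pass a StreamReader over the file or response stream instead of a StringReader, so the input is never fully loaded into memory.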