Using C# to create clean HTML

html html-agility-pack malformed screen-scraping


How can I clean up malformed HTML using C#? An answer with an HTML Agility Pack example would be ideal!

I'm scraping a website (for legitimate use). Its HTML is mostly fine, but it has a few annoying problems.

One route I could take is regular expressions. I used Expression Web to analyze the problems and work out the regular expressions that would fix them, so one approach would be to generate the C# code for those regular expressions with a tool like RegexBuddy.

However, the recommended tool for processing malformed HTML in C# is the HTML Agility Pack (HAP). Also, I've only looked at a small number of pages, so I'm worried that later pages will contain patterns I haven't seen yet, and I'd rather not get into the business of finding and fixing the problems in each new batch of pages. So it would be great if HAP already offered a robust, general solution. The trouble is that, apart from a few comments here on SO, I couldn't find any user guide for it beyond the object-by-object API help file.

Is there a simple way to do this before I spend money and effort learning RegexBuddy (whose evaluation version is paid) or wading through the HAP API documentation? A HAP sample would be helpful.

4/18/2011 3:10:46 PM

Accepted Answer

What I took away from the responses: 1) If you're scraping a website that you don't own, your scraper will regularly go into maintenance mode, because you have to repair it whenever the design of the page changes. 2) Since you are tied to this one known site anyway, why not write your scraper to fix that site's specific problems?

So going into maintenance mode should be as easy as possible when it's needed. With that in mind, here is my workflow:

  1. To identify scenes in web pages, I use SWExplorerAutomation (SWEA) by Webius. A scene is a set of conditions that you define for Internet Explorer. When a web page loads, IE checks whether the conditions are met (e.g., the page title is "Account Login" and the page contains a "Login" text box and a "Password" text box). If all the conditions for a scene are met, IE reports that the scene has been detected. This gives an abstraction layer, so that some web page changes translate into changes in the scene file instead of the code. It also shields me from IE's event-driven model: I call "scene. I'm still evaluating this product and not sure I'll use it, mostly because of the poor documentation. Another option is WatiN, and an article accusing SWEA's author of spamming against WatiN is another reason I haven't bought SWEA yet.
  2. Once I have the web page, I use Expression Web to run compatibility checks and find the errors.
  3. I use RegexMagic to remove and fix the errors. I love this tool. It sometimes makes you furious because it won't let you do things that ought to be simple, but it is still a pleasure to use, and the documentation is excellent.
  4. Once all the errors I know about are fixed, I use HTML Agility Pack to convert to XHTML, to "cross the t's and dot the i's": lowercase tag names, quoted attribute values, and so on.
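
For anyone who wants to see what step 4 might look like, here is a minimal sketch. It assumes the HtmlAgilityPack NuGet package is referenced; the names `HtmlCleaner` and `ToXml` are made up for illustration, and this is not necessarily the code used above. HAP's repairs are heuristic, so treat the output as something to inspect rather than guaranteed valid XHTML.

```csharp
using System;
using System.IO;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

// Hypothetical helper class, named for illustration only.
public static class HtmlCleaner
{
    // Parses possibly malformed HTML and re-serializes it as XML-style markup.
    public static string ToXml(string html)
    {
        var doc = new HtmlDocument
        {
            OptionFixNestedTags = true,   // ask HAP to repair some badly nested tags
            OptionAutoCloseOnEnd = true,  // close tags still open at the end of the input
            OptionOutputAsXml = true      // write XML-style output when saving
        };
        doc.LoadHtml(html);

        // List what the parser flagged; handy when going into "maintenance mode".
        foreach (HtmlParseError error in doc.ParseErrors)
        {
            Console.Error.WriteLine($"line {error.Line}: {error.Reason}");
        }

        using (var writer = new StringWriter())
        {
            doc.Save(writer);
            return writer.ToString();
        }
    }
}
```

A call such as `HtmlCleaner.ToXml("<div><b>bold</div>")` returns the re-serialized markup as a string. Checking `doc.ParseErrors` on a handful of real pages is a cheap way to see which problems HAP already notices before writing any regular expressions.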

Hope this is useful!


12/26/2009 10:43:45 PM

Popular Answer

Could you describe the annoying problems you're running into?
In any case, HAP lets you access the elements of malformed HTML using XPath queries, so you don't need regular expressions to clean the HTML first.
To get at the HTML elements you want, you essentially need to learn XPath.
How you query depends mostly on the kind of HTML you are parsing with HAP,
but there are several ways to select elements,
such as by id, by class, or even by position, for example the element that follows another element containing a certain text such as "name:".
The W3Schools XPath tutorial is a good introduction to XPath.
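
As a rough sketch of those three kinds of query (by id, by class, and "the element after the one that says name:"), assuming the HtmlAgilityPack NuGet package is referenced. The sample HTML, the class name `XPathExamples`, and the helper `TextOrNull` are invented for illustration.

```csharp
using System;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

// Hypothetical example class, named for illustration only.
public static class XPathExamples
{
    // Returns the trimmed text of the first node matching the XPath, or null if nothing matches.
    public static string TextOrNull(HtmlDocument doc, string xpath)
    {
        HtmlNode node = doc.DocumentNode.SelectSingleNode(xpath);
        return node == null ? null : node.InnerText.Trim();
    }

    public static void Main()
    {
        // Invented sample: unquoted attribute values and a <b> that is never closed.
        string html =
            "<div id=profile><span class=label>name:</span><span>Alice</span><b>unclosed</div>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // By id.
        Console.WriteLine(TextOrNull(doc, "//div[@id='profile']"));

        // By class.
        Console.WriteLine(TextOrNull(doc, "//span[@class='label']"));

        // The first span that follows the span containing the text "name:".
        Console.WriteLine(TextOrNull(doc,
            "//span[contains(text(), 'name:')]/following-sibling::span[1]"));

        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");
        Console.WriteLine(links == null ? 0 : links.Count);
    }
}
```

The null checks matter in practice: both `SelectSingleNode` and `SelectNodes` return null when a query matches nothing, which is exactly what happens when the scraped site changes its layout.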

Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow