HtmlAgilityPack issue when reading HTML

.net c# html html-agility-pack parsing

Question

I am reading websites in C# and getting their contents as a string. Some sites do not have well-formed HTML.

I am using HtmlAgilityPack, which gives me problems in that case.

Can anyone suggest what to use so that I can read the whole string and extract useful information?

Here is my code:

    htmlDoc.LoadHtml(s);
    if (htmlDoc.ParseErrors != null && htmlDoc.ParseErrors.Count() > 0)

Why is this if condition true in my case?

Accepted Answer

What is the error you're getting? Is it throwing an exception or are you just wanting to see the error? Hard to tell what your actual question is.

You can see the markup errors in the HTML by iterating through the htmlDoc.ParseErrors property. Each entry gives you the line number, the error code, and the reason for the error.

You can see more info about this property here: https://stackoverflow.com/a/5367455/235644

Edit

OK, so you've updated your question since my reply. You can see the specific errors that make your if condition true by looping through .ParseErrors, as described above.

Second Edit

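You can loop through the errors like so:

    // Debug is in System.Diagnostics; output goes to the debugger's output window.
    foreach (var error in htmlDoc.ParseErrors)
    {
        Debug.WriteLine(error.Line);   // line number where the problem was found
        Debug.WriteLine(error.Reason); // human-readable description of the problem
    }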
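To make the above concrete, here is a minimal, self-contained sketch. The sample markup and the class name are made up for illustration, and the exact errors reported (if any) depend on the HtmlAgilityPack version. It shows that a non-empty ParseErrors collection is only a report about the markup: LoadHtml generally does not throw on malformed input, and the document can still be queried afterwards.

```csharp
using System;
using HtmlAgilityPack;

class ParseErrorDemo
{
    static void Main()
    {
        // Hypothetical malformed input: <b> and <p> are never closed.
        string s = "<html><body><div><b>Hello</div><p>World</body></html>";

        var htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(s);

        // Each HtmlParseError describes one markup problem the parser noticed.
        foreach (HtmlParseError error in htmlDoc.ParseErrors)
        {
            Console.WriteLine(
                $"Line {error.Line}, col {error.LinePosition}: [{error.Code}] {error.Reason}");
        }

        // The tree is still built, so normal queries keep working.
        HtmlNode div = htmlDoc.DocumentNode.SelectSingleNode("//div");
        Console.WriteLine(div?.InnerText);
    }
}
```

If all you need is the content, it is usually reasonable to log the errors and carry on rather than treat the if condition as a failure.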

Popular Answer

If your HTML comes from an external source and you can't fix it, you can first run it through a cleanup preprocessor and then parse the result with HtmlAgilityPack.

This will attempt to fix as many issues as possible automatically before HtmlAgilityPack gets to see it. The most popular HTML cleanup tool is Tidy. See the .NET version here:

http://sourceforge.net/projects/tidynet/
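The clean-then-parse flow described above could be wired up roughly as follows. This is only a sketch: CleanUp is a hypothetical placeholder that returns its input unchanged, and it is where a call into Tidy (or any other cleanup tool) would go. The Tidy API itself is deliberately not shown here.

```csharp
using System;
using HtmlAgilityPack;

static class CleanThenParse
{
    // Hypothetical placeholder: replace the body with a call to your cleanup tool.
    static string CleanUp(string rawHtml) => rawHtml;

    static HtmlDocument Load(string rawHtml)
    {
        // Step 1: repair the markup as far as possible.
        string cleaned = CleanUp(rawHtml);

        // Step 2: hand the repaired markup to HtmlAgilityPack.
        var doc = new HtmlDocument();
        doc.LoadHtml(cleaned);
        return doc;
    }

    static void Main()
    {
        // Made-up sample input for illustration.
        HtmlDocument doc = Load("<div><p>Hello<p>World</div>");
        Console.WriteLine(doc.DocumentNode.InnerText);
    }
}
```

Keeping the cleanup step behind a single method makes it easy to compare ParseErrors with and without preprocessing for a given site.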



Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow