I am reading websites in C# and getting their contents as a string. Some sites do not have well-formed HTML.
I am using HtmlAgilityPack, which gives me issues in that case.
Can you suggest what to use so that it can read the whole string and I can extract useful information?
Here is my code:
htmlDoc.LoadHtml(s);
if (htmlDoc.ParseErrors != null && htmlDoc.ParseErrors.Count() > 0)
Why is this if condition true in my case?
What is the error you're getting? Is it throwing an exception or are you just wanting to see the error? Hard to tell what your actual question is.
You can see the markup errors in the HTML by iterating through the htmlDoc.ParseErrors property. Each entry gives you the line number, the error code and the reason for the error.
You can see more info about this property here: https://stackoverflow.com/a/5367455/235644
Edit
OK, so you've updated your question since my reply. You can see the specific errors that make your if condition true by looping through .ParseErrors, as described above.
Second Edit
You can loop through the errors like so:
foreach (var error in htmlDoc.ParseErrors)
{
    Debug.WriteLine(error.Line);
    Debug.WriteLine(error.Reason);
}
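For completeness, here is a minimal self-contained sketch of the same idea. The sample markup and the class name ParseErrorDemo are made up for illustration, and it assumes the HtmlAgilityPack NuGet package is referenced. As far as I know, ParseErrors is only informational, so the document should normally still be queryable after LoadHtml even when errors are reported:

```csharp
using System;
using HtmlAgilityPack;

class ParseErrorDemo
{
    static void Main()
    {
        // Hypothetical sample input: the <b> tag is never closed.
        string s = "<html><body><p>Hello <b>world</p></body></html>";

        var htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(s);

        // Report whatever the parser complained about, if anything.
        foreach (HtmlParseError error in htmlDoc.ParseErrors)
        {
            Console.WriteLine(
                $"{error.Code} at line {error.Line}, position {error.LinePosition}: {error.Reason}");
        }

        // Query the document regardless of the reported errors.
        HtmlNode body = htmlDoc.DocumentNode.SelectSingleNode("//body");
        Console.WriteLine(body?.InnerText);
    }
}
```

So a non-empty ParseErrors collection does not have to mean you must stop; you can log the errors and still try to extract what you need.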
If your HTML is external and you can't fix it, you can first run it through a cleanup preprocessor, then parse it with HtmlAgilityPack.
This will attempt to fix as many issues as possible automatically before HtmlAgilityPack gets to see it. The most popular HTML cleanup tool is Tidy. See the .NET version here: