How to check whether a page is a 404 error page (page does not exist) using HtmlAgilityPack

c# html-agility-pack

Question

I am trying to read URLs and get the images on each page. I need to exclude a page if it returns 404, so that no images are collected from a 404 error page. How can I do this using HtmlAgilityPack? Here is my code:

var document = new HtmlWeb().Load(completeurl);
var urls = document.DocumentNode.Descendants("img")
          .Select(e => e.GetAttributeValue("src", null))
          .Where(s => !String.IsNullOrEmpty(s)).ToList();

Accepted Answer

You'll need to register a PostRequestHandler event on the HtmlWeb instance. It is raised after each downloaded document and gives you access to the HttpWebResponse object, which has a StatusCode property.

HtmlWeb web = new HtmlWeb();
HttpStatusCode statusCode = HttpStatusCode.OK;

// Remember the status code of the most recent response.
web.PostRequestHandler += (request, response) =>
{
    if (response != null)
    {
        statusCode = response.StatusCode;
    }
};

var doc = web.Load(completeurl);
if (statusCode == HttpStatusCode.OK)
{
    // received a real document
}

Looking at the HtmlAgilityPack source code on GitHub, it's even simpler: HtmlWeb has a StatusCode property that is set to the status code of the last response:

var web = new HtmlWeb();
var document = web.Load(completeurl);

if (web.StatusCode == HttpStatusCode.OK)
{
    var urls = document.DocumentNode.Descendants("img")
          .Select(e => e.GetAttributeValue("src", null))
          .Where(s => !String.IsNullOrEmpty(s)).ToList();
}
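
As a minimal sketch, the check above could be wrapped in a small helper so that callers get an empty list for a 404 (or any other non-OK status) instead of images scraped from an error page. The class name ImageScraper and the methods ExtractImageUrls and GetImageUrls are hypothetical names chosen for illustration; only HtmlWeb.Load, HtmlWeb.StatusCode and the LINQ query come from the answer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using HtmlAgilityPack;

public static class ImageScraper
{
    // Hypothetical helper: collects the non-empty src values of every <img> element.
    public static List<string> ExtractImageUrls(HtmlDocument document)
    {
        return document.DocumentNode.Descendants("img")
            .Select(e => e.GetAttributeValue("src", null))
            .Where(s => !String.IsNullOrEmpty(s))
            .ToList();
    }

    // Hypothetical helper: returns an empty list unless the server answered 200 OK.
    public static List<string> GetImageUrls(string completeurl)
    {
        var web = new HtmlWeb();
        var document = web.Load(completeurl);

        if (web.StatusCode != HttpStatusCode.OK)
        {
            // 404 and every other non-OK status: skip the page entirely.
            return new List<string>();
        }

        return ExtractImageUrls(document);
    }
}
```

Calling ImageScraper.GetImageUrls(completeurl) for each URL then skips error pages without any extra checks at the call site. Note that comparing against HttpStatusCode.OK rejects every non-200 response, not only 404; compare against HttpStatusCode.NotFound instead if only 404 pages should be excluded.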

Update

There has been an update to the HtmlAgilityPack API: the hook is now the PostResponse field. The trick is still the same:

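var htmlWeb = new HtmlWeb();
var lastStatusCode = HttpStatusCode.OK;

// Remember the status code of the most recent response.
htmlWeb.PostResponse = (request, response) =>
{
    if (response != null)
    {
        lastStatusCode = response.StatusCode;
    }
};

var document = htmlWeb.Load(completeurl);
if (lastStatusCode == HttpStatusCode.OK)
{
    // received a real document
}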

Popular Answer

Be aware of the version you use!

I am using HtmlAgilityPack v1.5.1 and there is no PostRequestHandler event.

In v1.5.1 you have to use the PostResponse field instead. See the example below.

var htmlWeb = new HtmlWeb();
var lastStatusCode = HttpStatusCode.OK;

htmlWeb.PostResponse = (request, response) =>
{
    if (response != null)
    {
        lastStatusCode = response.StatusCode;
    }
};

The differences are small, but they are there.

Hope this saves someone some time.
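
Because the name of the hook depends on the library version, one way to sidestep the difference is to download the page yourself and use HtmlAgilityPack only for parsing. The following is a sketch of that alternative, not the approach from the answers above: it swaps HtmlWeb for System.Net.Http.HttpClient, and the names VersionIndependentScraper, ImagesIfOk and GetImageUrlsAsync are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;
using HtmlAgilityPack;

public static class VersionIndependentScraper
{
    // One shared HttpClient instance for all requests.
    private static readonly HttpClient Client = new HttpClient();

    // Hypothetical helper: parses the markup only when the status is 200 OK.
    public static List<string> ImagesIfOk(HttpStatusCode status, string html)
    {
        if (status != HttpStatusCode.OK)
        {
            // 404 and every other non-OK status: return nothing.
            return new List<string>();
        }

        var document = new HtmlDocument();
        document.LoadHtml(html);

        return document.DocumentNode.Descendants("img")
            .Select(e => e.GetAttributeValue("src", null))
            .Where(s => !String.IsNullOrEmpty(s))
            .ToList();
    }

    // Hypothetical helper: downloads the page with HttpClient instead of HtmlWeb.
    // GetAsync does not throw for a 404 response; it throws HttpRequestException
    // for connection-level failures such as an unresolvable host.
    public static async Task<List<string>> GetImageUrlsAsync(string completeurl)
    {
        using (var response = await Client.GetAsync(completeurl))
        {
            var html = await response.Content.ReadAsStringAsync();
            return ImagesIfOk(response.StatusCode, html);
        }
    }
}
```

The status check here relies only on HttpResponseMessage.StatusCode, so it does not change when the HtmlWeb hooks are renamed between HtmlAgilityPack versions.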



Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow