I am trying to read URLs and collect the images on each page. I need to skip a page if it returns a 404, so that no images are collected from the 404 error page. How can I do this with HtmlAgilityPack? Here is my code:
var document = new HtmlWeb().Load(completeurl);
var urls = document.DocumentNode.Descendants("img")
.Select(e => e.GetAttributeValue("src", null))
.Where(s => !String.IsNullOrEmpty(s)).ToList();
You'll need to register a PostRequestHandler event on the HtmlWeb instance. It is raised after each downloaded document and gives you access to the HttpWebResponse object, which has a StatusCode property.
HtmlWeb web = new HtmlWeb();
HttpStatusCode statusCode = HttpStatusCode.OK;
web.PostRequestHandler += (request, response) =>
{
if (response != null)
{
statusCode = response.StatusCode;
}
};
var doc = web.Load(completeurl);
if (statusCode == HttpStatusCode.OK)
{
// received a real document
}
Looking at the HtmlAgilityPack source on GitHub, it's even simpler: HtmlWeb has a StatusCode property that is set to the status code of the response:
var web = new HtmlWeb();
var document = web.Load(completeurl);
if (web.StatusCode == HttpStatusCode.OK)
{
var urls = document.DocumentNode.Descendants("img")
.Select(e => e.GetAttributeValue("src", null))
.Where(s => !String.IsNullOrEmpty(s)).ToList();
}
There has been an update to the AgilityPack API. The trick is still the same:
var htmlWeb = new HtmlWeb();
var lastStatusCode = HttpStatusCode.OK;
htmlWeb.PostResponse = (request, response) =>
{
if (response != null)
{
lastStatusCode = response.StatusCode;
}
};
Be aware of the version you use!
I am using HtmlAgilityPack v1.5.1, and there is no PostRequestHandler event.
In v1.5.1 one has to use the PostResponse field instead. See the example below.
var htmlWeb = new HtmlWeb();
var lastStatusCode = HttpStatusCode.OK;
htmlWeb.PostResponse = (request, response) =>
{
if (response != null)
{
lastStatusCode = response.StatusCode;
}
};
The differences are small, but they do exist.
Hope this saves someone some time.
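The PostResponse snippet above only records the status code, so here is a rough sketch of how it might be combined with the load and the image extraction from the question. The class name ImageScraper and the method GetImageUrls are made up for illustration, and the sketch assumes a HtmlAgilityPack build that exposes the PostResponse field (such as the v1.5.1 mentioned above). Treat it as a starting point, not as the library's recommended pattern.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using HtmlAgilityPack;

// Hypothetical helper class, not part of HtmlAgilityPack.
static class ImageScraper
{
    // Returns the img src values of the page, or an empty list when the
    // server did not answer with 200 OK (for example, a 404 page).
    public static List<string> GetImageUrls(string pageUrl)
    {
        var htmlWeb = new HtmlWeb();

        // Stays null if the callback never sees a response.
        HttpStatusCode? lastStatusCode = null;

        // Assumption: this HtmlAgilityPack version has the PostResponse field.
        htmlWeb.PostResponse = (request, response) =>
        {
            if (response != null)
            {
                lastStatusCode = response.StatusCode;
            }
        };

        var document = htmlWeb.Load(pageUrl);

        // Skip error pages so their images are never collected.
        if (lastStatusCode != HttpStatusCode.OK)
        {
            return new List<string>();
        }

        return document.DocumentNode.Descendants("img")
            .Select(e => e.GetAttributeValue("src", null))
            .Where(s => !String.IsNullOrEmpty(s))
            .ToList();
    }
}
```

Calling ImageScraper.GetImageUrls(completeurl) for each URL would then give an empty list for pages that did not come back with 200 OK. Starting from null instead of HttpStatusCode.OK is a deliberate choice in this sketch: if the callback is never invoked, the page is treated as failed instead of silently passing the check.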