C# HTMLAgilityPack Website Blocked my IP-Address

c# html-agility-pack ip proxy

Question

I was using HTMLAgilityPack to get the HTML from following Website: http://tennis.wettpoint.com/en/

It worked fine, but now, after about an hour, it doesn't work anymore!

First I tried to change my code for how I retrieve the HTML:

string url = "http://tennis.wettpoint.com/en/";
HtmlWeb hw = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = hw.Load(url);
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
   //Code..
}

Like I said, that always worked fine.. until the site seemed "down" for me. So I changed the code to:

using (WebClient wc = new WebClient())
{
    wc.Headers.Add("user-agent", "Mozilla/5.0 (Windows; Windows NT 5.1; rv:1.9.2.4) Gecko/20100611 Firefox/3.6.4");
    string html = wc.DownloadString("http://en.wikipedia.org/wiki/United_States");
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html);
}

(That didn't work for my site, but it worked for another site.)

And at last I have this, which also works, but not for my site:

HtmlAgilityPack.HtmlDocument doc = GetHTMLDocumentByURL(url);

public HtmlAgilityPack.HtmlDocument GetHTMLDocumentByURL(string url)
{
    var htmlDoc = new HtmlAgilityPack.HtmlDocument();
    htmlDoc.OptionReadEncoding = false;
    var request = (HttpWebRequest)WebRequest.Create(url);
    request.UserAgent = @"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5";
    request.Method = "GET";
    using (var response = (HttpWebResponse)request.GetResponse())
    {
        using (var stream = response.GetResponseStream())
        {
            htmlDoc.Load(stream, Encoding.UTF8);
        }
    }
    return htmlDoc;
}

Well, at first I believed the site was down, because I couldn't access it with any browser either. So I asked friends, and they were able to access the site. That means my IP has been blocked, for whatever reason. What can I do? Do I need to change my IP (how?), or use proxies (how?)? I have no clue, as I didn't expect that this would happen :( Hope someone can help me.

Accepted Answer

Wikipedia monitors the number of requests it gets from an IP address and will ban IPs that aggressively scrape its content. Scraping Google search results will have the same effect.

Initially Wikipedia will only ban you for 24 hours, but if you carry on "offending", your IP will be banned permanently.

You can either use proxies in your HttpWebRequest to change your IP address, or slow down your requests.
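For example, here is a minimal sketch of the proxy route, re-using the helper from the question. The proxy address and port are placeholders; you would have to plug in a proxy you actually have access to (WebProxy also accepts credentials if the proxy requires them):

using System;
using System.Net;
using System.Text;

public class ProxiedLoader
{
    public HtmlAgilityPack.HtmlDocument GetHTMLDocumentByURL(string url)
    {
        var htmlDoc = new HtmlAgilityPack.HtmlDocument();
        htmlDoc.OptionReadEncoding = false;

        var request = (HttpWebRequest)WebRequest.Create(url);
        request.UserAgent = "Mozilla/5.0 (Windows NT 6.1; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5";
        request.Method = "GET";

        // Placeholder proxy: the target site then sees the proxy's IP, not yours.
        request.Proxy = new WebProxy("203.0.113.10", 8080);

        using (var response = (HttpWebResponse)request.GetResponse())
        using (var stream = response.GetResponseStream())
        {
            htmlDoc.Load(stream, Encoding.UTF8);
        }
        return htmlDoc;
    }
}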


Popular Answer

First rule of crawling: politeness!

Any time you crawl a website you have to ensure that your crawler abides by the rules in their robots.txt file: http://tennis.wettpoint.com/robots.txt

User-agent: msnbot 
Crawl-delay: 1

User-agent: MJ12bot
Disallow: /

User-agent: sistrix
Disallow: /

User-agent: TurnitinBot
Disallow: /

User-agent: Raven
Disallow: /

User-agent: dotbot
Disallow: /

This means that msnbot is explicitly allowed to crawl the website with a delay of 1 second. MJ12bot, sistrix, TurnitinBot, Raven and dotbot are explicitly NOT allowed to crawl any part of the website. This is the very first line of defense that you will see from a website, and their most polite way of protecting it from accidental abuse. For more info on robots.txt, see here: http://www.robotstxt.org/meta.html
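As a first step, you can simply download and read that file before you crawl. A minimal sketch (properly parsing the rules is left out here):

using System;
using System.Net;

public static class RobotsFetcher
{
    // Downloads the site's robots.txt so you can inspect the rules before crawling.
    public static string Fetch(string siteRoot)
    {
        using (var wc = new WebClient())
        {
            return wc.DownloadString(new Uri(new Uri(siteRoot), "/robots.txt"));
        }
    }
}

// e.g. string rules = RobotsFetcher.Fetch("http://tennis.wettpoint.com/");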

You should implement some reasonable crawl delay (1-10 seconds) and see if they allow you to crawl again.
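For example, a minimal sketch of a crawl loop that waits between page loads; the 5-second delay and the way the URLs are gathered are just illustrative:

using System;
using System.Collections.Generic;
using System.Threading;
using HtmlAgilityPack;

public static class PoliteCrawler
{
    // Illustrative delay; pick something in the 1-10 second range.
    private static readonly TimeSpan CrawlDelay = TimeSpan.FromSeconds(5);

    public static void CrawlPages(IEnumerable<string> urls)
    {
        var web = new HtmlWeb();
        foreach (string url in urls)
        {
            HtmlDocument doc = web.Load(url);

            // ... extract the links/data you need from doc here ...

            // Wait before the next request so the site is not hammered.
            Thread.Sleep(CrawlDelay);
        }
    }
}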

Rule number two: watch out for bot traps!

This doesn't apply to you at the moment, but you should be aware of it in general. One way to catch bots which are not polite is to put an explicit rule in the robots.txt which bans all robots from going to a specific directory, such as:

User-agent: *
Disallow: /the/epic/robot/trap/path

Then somewhere in the HTML there is a link, which is not visible to humans, but visible to bots:

<a href="www.mydomain.com/the/epic/robot/trap/path/gotcha.html"></a>

Clearly, no human using a browser will ever see or click on this link, and no bot that follows the robots.txt rules will ever go to /the/epic/robot/trap/path. However, bots that don't abide by the robots.txt rules and collect internal links for crawling purposes will eventually end up in that directory, and what awaits them there is certain death! The operator of the website is most likely collecting and blocking the IPs of everyone who visits that link.
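On the crawler side, a simple way to stay out of such traps is to keep the Disallow prefixes that apply to your user agent and skip any harvested link that starts with one of them. This helper and its parameters are made up for illustration and are not a full robots.txt parser:

using System;
using System.Collections.Generic;
using System.Linq;

public static class RobotsFilter
{
    // 'disallowedPrefixes' would hold the paths from the Disallow: lines that
    // apply to your user agent, e.g. "/the/epic/robot/trap/path".
    public static bool IsAllowed(Uri link, IEnumerable<string> disallowedPrefixes)
    {
        return !disallowedPrefixes.Any(prefix =>
            link.AbsolutePath.StartsWith(prefix, StringComparison.OrdinalIgnoreCase));
    }
}

You would run every link you collect from //a[@href] through a check like this before queuing it for crawling.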



Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow