HttpWebRequest, WebResponse and WebBrowser Differences

c# html-agility-pack httpwebresponse webbrowser-control

Question

I have a WinForms application that scrapes HTML. Sometimes Google redirects me to a captcha page for verification.

The problem starts here. I am using HtmlAgilityPack and fetching the HTML like this:

    // Requires: using System.IO; using System.Net; using System.Text;
    try
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        request.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36";
        request.Timeout = 10000; // milliseconds
        // Dispose the response so the underlying connection is released.
        using (WebResponse response = request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream(), Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }
    catch (WebException e)
    {
        // Here I am getting the captcha page.
        // e.Response is null when no HTTP response arrived (e.g. a timeout).
        if (e.Response == null)
            throw;
        using (var sr = new StreamReader(e.Response.GetResponseStream()))
            return sr.ReadToEnd();
    }

After loading the HTML into my HtmlDocument, I look for a captcha. If the HTML contains one, I open a WebBrowser and navigate to the same URL again. I solve the captcha and that's it, Google opens. But if I try to fetch the HTML again 30 seconds later, I get the captcha page again. I tested it: the WebBrowser no longer shows the captcha page, but my request still gets it. Why? Both requests come from the same local machine, on the same Wi-Fi.

    var webBrowser1 = new WebBrowser
    {
        ScriptErrorsSuppressed = true,
        AllowNavigation = true,
        Dock = DockStyle.Fill
    };
    BrowserSettings(webBrowser1);

    webBrowser1.Refresh(WebBrowserRefreshOption.Completely);
    // Here I am NOT getting the captcha page
    webBrowser1.Navigate(searchUrl);
    if (DialogForms == null)
    {
        DialogForms = new Form
        {
            WindowState = FormWindowState.Maximized,
            TopMost = true
        };
    }
    DialogForms.Controls.Add(webBrowser1);
    DialogForms.ShowDialog();

Popular Answer

Somewhat quick non-answer: what you're doing is pretty much what reCAPTCHA exists to mitigate and/or prevent (emphasis mine):

reCAPTCHA uses an advanced risk analysis engine and adaptive CAPTCHAs to keep automated software from engaging in abusive activities on your site.

...it uses advanced risk analysis techniques, considering the user’s entire engagement with the CAPTCHA, and evaluates a broad range of cues that distinguish humans from bots.


Update:

Q:

But my question is, how can reCAPTCHA tell which method my request uses? For example, I get the HTML either through the WebBrowser or through a request/response and read it from the stream. It doesn't show reCAPTCHA for the WebBrowser, but it does for the request/response.

A:

  • The "bot check" runs based on its own determination of when to invoke it.

  • I also assumed that the site you're scraping is implementing Google's reCAPTCHA specifically - that's my mistake. The site could very well be behind a WAF (Web Application Firewall) service which will invoke bot checks that offer some challenge based on CAPTCHA (or outright reject the request).
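
As a rough illustration of how a server-side check could tell the two clients apart (this is my assumption about the kind of signals involved, not a description of what Google or any particular WAF actually inspects): a bare HttpWebRequest carries no cookie jar and only the headers you set yourself, whereas a browser control keeps its own cookies and sends a fuller header set. The sketch below only inspects a request locally and sends nothing over the network; `RequestFingerprintSketch` and `Describe` are hypothetical names, and `https://example.com/` in the usage note is a placeholder.

```csharp
using System;
using System.Net;

// Hypothetical helper, for illustration only.
public static class RequestFingerprintSketch
{
    // Summarizes what an HttpWebRequest would present to a server,
    // based purely on its local configuration (no network traffic).
    public static string Describe(HttpWebRequest request)
    {
        // CookieContainer is null unless the caller assigns one,
        // in which case no cookies are sent or stored.
        bool hasCookieJar = request.CookieContainer != null;

        // Headers that were never set come back as null.
        string accept = request.Accept ?? "(none)";
        string language = request.Headers[HttpRequestHeader.AcceptLanguage] ?? "(none)";
        string userAgent = request.UserAgent ?? "(none)";

        return "cookies=" + (hasCookieJar ? "jar attached" : "none")
             + "; accept=" + accept
             + "; accept-language=" + language
             + "; user-agent=" + userAgent;
    }
}
```

Calling `RequestFingerprintSketch.Describe((HttpWebRequest)WebRequest.Create("https://example.com/"))` after setting only `UserAgent`, as in the question, should report no cookie jar and no Accept or Accept-Language header, which is already quite different from what a real browser sends.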

Hth...




Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow