Good day! I'm trying to scrape Google, and the scraping itself would be no problem if it weren't for the captcha. The situation is as follows: instead of search results, Google redirects to http://www.google.com/sorry/. I take the captcha id from that page and download the image from http://www.google.com/sorry/image?id={$id}&hl=en. The image is successfully recognized by antigate and I get the code. From this point on the problems begin: Google persistently refuses to accept the captcha, even after I send it to http://www.google.com/sorry/Captcha with the id, the recognized captcha text, and a continue URL of http://www.google.com/ as query parameters. It just shows the captcha input page again.

I have already rechecked 20 times which cookies the browser gets and where they come from, what the link looks like, and so on, but it just won't work... Please tell me what the pitfall might be; surely many people have dealt with scraping Google. Thanks in advance!

P.S. I'm not attaching the code. The point here is not the code, but that I need to understand how Google determines that I'm trying to scrape it.

  • And what are you scraping with? You're not doing it from a browser, are you? A browser identifies itself when it connects to the server, so maybe that is where you get caught. In general, I'll tell you a secret: Google has serious checks in place, ever since the scandal with a small company starting with the letter M, whose search engine, instead of producing its own results, pulled them from Google. - JEcho

4 answers

This is perfectly normal behavior for Google. Why should your program pretend to be a browser? Is there a good reason for that?

Google is not obliged to love bots. Why not do the right thing and use the Google search API, which was created for roughly this kind of task?

https://developers.google.com/custom-search/v1/overview

As for how Google detects you: quite possibly by load, if the number of your requests is suspiciously high for an average user. Try, for example, googling manually through Tor and you will see the same thing.
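To make the API suggestion concrete, here is a minimal sketch of a query against the Custom Search JSON API using only the Python standard library. It assumes you have already created an API key and a search engine ID; `build_search_url`, `extract_links` and `search` are names made up for this illustration, not part of any Google library.

```python
import json
import urllib.parse
import urllib.request

API_ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def build_search_url(query, api_key, engine_id, start=1):
    """Build the request URL; api_key and engine_id are your own credentials."""
    params = {"key": api_key, "cx": engine_id, "q": query, "start": start}
    return API_ENDPOINT + "?" + urllib.parse.urlencode(params)


def extract_links(payload):
    """Pull (title, link) pairs out of a decoded JSON response."""
    return [(item.get("title"), item.get("link"))
            for item in payload.get("items", [])]


def search(query, api_key, engine_id):
    """Run one query and return its (title, link) pairs."""
    url = build_search_url(query, api_key, engine_id)
    with urllib.request.urlopen(url) as resp:
        return extract_links(json.load(resp))
```

Usage would look like `search("delphi ide", "YOUR_KEY", "YOUR_ENGINE_ID")`. There is no captcha on this path, but the API has its own quota, so check the overview page linked above before relying on it for volume.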

  • The results from the API are completely different from the results of Google itself. - crystallon
  • That's logical, because Google tailors its results to the specific user. For the query "Delphi", one person gets results about the well-known IDE, another about oils, another about ancient Greece and oracles, and someone else gets frivolous pictures. - KoVadim
  • Did you want to say "surprisingly"? 0. Let me stress once more a thought that is not mine: "Google tailors the answer to a specific user." The API is more objective today. 1. First, it works. No problems with the captcha. 2. Second, even if you manage to beat the captcha, Google is not stupid and still won't let you run many search queries from a single IP. Tor users know this well: since many users hide behind a small number of IP addresses, it is almost impossible to use Google through Tor. - Softa
  • It's a pity Vladimir Ivanovich Dal died long ago. Otherwise one more word could have gone into his dictionary: "coordinately". - DreamChild

The thing is that when the captcha image is loaded, Google sends cookies, and without them nothing will work. If you load the image with curl and save the cookies so that they are sent back, everything works fine.
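The same idea without curl, as a rough sketch: keep one cookie jar for the whole exchange, so that whatever cookie arrives with the image is sent back automatically with the next request. This uses only the Python standard library; `make_session`, `fetch_then_submit` and `build_query` are hypothetical names for this illustration, and the exact query parameters the form expects are left to the caller because the thread does not spell them out.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def make_session():
    """Return (opener, jar); every request made through opener shares jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def fetch_then_submit(opener, image_url, submit_url, build_query):
    """Fetch the image, then send the answer through the SAME opener.

    build_query is a caller-supplied (hypothetical) callable that takes the
    image bytes and returns a dict of query parameters for the submit URL.
    """
    # Cookies set by this response are stored in the opener's jar.
    with opener.open(image_url) as resp:
        image_bytes = resp.read()
    query = urllib.parse.urlencode(build_query(image_bytes))
    # The jar adds the stored cookies to this second request.
    with opener.open(submit_url + "?" + query) as resp:
        return resp.read()
```

The point is only that both requests go through one opener. If the image is downloaded by one client and the code is submitted by another, the cookie never makes it back.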

    Pause between requests. For Yandex, 3-4 seconds between requests is enough; for Google, 2-3 seconds. Then they don't even ask for a captcha.
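A pause like that is easy to enforce in one place. Below is a minimal sketch of a throttle that guarantees a minimum interval between calls; the `Throttle` class is a made-up helper for this illustration, and the 2.5 second figure in the usage note is just the middle of the range mentioned above, not a documented limit.

```python
import time


class Throttle:
    """Block so that consecutive wait() calls are at least min_interval apart."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable for testing
        self._sleep = sleep      # injectable for testing
        self._last = None        # time of the previous wait(), if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                # Too soon since the previous call: sleep off the difference.
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Usage: create `throttle = Throttle(2.5)` once and call `throttle.wait()` right before every request. The first call returns immediately; later calls sleep only as long as needed.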

      The thing is that along with the captcha (the image, not the page) Google sends a separate cookie, and this cookie must be sent back together with the code.