Parsing Google search results. Problem with captcha recognition

Question

I am writing a parser for Google search results. It seemed easy: similar parsers for Yandex and Mail.ru work. But I ran into a problem when downloading the captcha image: the request stubbornly returns 403 Forbidden.

I make a request to Google:

$url = 'https://www.google.ru/search?complete=1&hl=ru&q='.urlencode($query)...

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_HEADER, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie_file_path);
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie_file_path);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.0.3705; .NET CLR 1.1.4322; Media Center PC 4.0)');
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_PROXY, $host.":".$port);
curl_setopt($ch, CURLOPT_PROXYUSERPWD, $login.":".$pass);
$content = curl_exec($ch);

// Then, if I got a redirect:
if (curl_getinfo($ch, CURLINFO_HTTP_CODE) == 302) {
    // .................
    // Extract the captcha image and try to download it
    // .................
    $captcha_image = 'http://ipv4.google.com/'.$image;
    $fh = fopen($captcha_file, 'w');
    curl_setopt($ch, CURLOPT_URL, $captcha_image);
    curl_setopt($ch, CURLOPT_FILE, $fh);
    curl_setopt($ch, CURLOPT_HEADER, 1);
    curl_exec($ch);
    fclose($fh);
    // But instead of the image I get 403 forbidden.
}
curl_close($ch);

I tried to re-initialize the cURL session when requesting a picture. The result is the same.

Contents of $cookie_file_path:

# Netscape HTTP Cookie File
# http://curl.haxx.se/rfc/cookie_spec.html
# This file was generated by libcurl! Edit at your own risk.

#HttpOnly_.google.ru TRUE / FALSE 1490269969 NID 87=m8iayuoh4X_H9kTM4zNlYrVmavd0qd7X6Bj1mbyZwrn23e-BQyA-GlNYBsV9iKq5cVj1ZrB9770cWf036kdakSC3tvlDIu_KVpf8yN5ilKkUk8iHAMbi_QZqD7Inlxs3
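One detail in this file may be worth checking: the only stored cookie is scoped to .google.ru, while the captcha image is requested from ipv4.google.com. Under the usual cookie domain-matching rules, a cookie for .google.ru would not be sent to that host. Whether this contributes to the 403 is not established in this thread. The sketch below only illustrates the check; parseCookieLine and cookieDomainMatches are hypothetical helper names, and the cookie value is shortened for readability.

```php
<?php
// Parse one line of a Netscape-format cookie file into an associative array.
// Returns null for blank lines, ordinary comments, and malformed lines.
function parseCookieLine(string $line): ?array
{
    $line = trim($line);
    $httpOnly = false;
    // libcurl marks HttpOnly cookies with a "#HttpOnly_" prefix before the domain.
    if (strpos($line, '#HttpOnly_') === 0) {
        $httpOnly = true;
        $line = substr($line, strlen('#HttpOnly_'));
    } elseif ($line === '' || $line[0] === '#') {
        return null;
    }
    // The real file is tab-separated; splitting on any whitespace
    // also handles copies pasted with spaces.
    $f = preg_split('/\s+/', $line, 7);
    if (count($f) < 7) {
        return null;
    }
    return [
        'domain'     => $f[0],
        'subdomains' => $f[1] === 'TRUE',
        'path'       => $f[2],
        'secure'     => $f[3] === 'TRUE',
        'expires'    => (int) $f[4],
        'name'       => $f[5],
        'value'      => $f[6],
        'httponly'   => $httpOnly,
    ];
}

// Hypothetical helper: would a cookie stored for $cookieDomain be sent to $host?
function cookieDomainMatches(string $cookieDomain, string $host): bool
{
    $cookieDomain = strtolower(ltrim($cookieDomain, '.'));
    $host = strtolower($host);
    if ($host === $cookieDomain) {
        return true;
    }
    $suffix = '.' . $cookieDomain;
    return substr($host, -strlen($suffix)) === $suffix;
}

$cookie = parseCookieLine("#HttpOnly_.google.ru\tTRUE\t/\tFALSE\t1490269969\tNID\t87=abc");
var_dump(cookieDomainMatches($cookie['domain'], 'www.google.ru'));   // bool(true)
var_dump(cookieDomainMatches($cookie['domain'], 'ipv4.google.com')); // bool(false)
```

If the second check is false for the host that serves the image, the image request goes out without NID, and any cookie the captcha page needs would have to be set by that host itself.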

Use a sniffer to see which headers the browser sends. You need to reproduce them, possibly all of them.
The headers I set:
'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8';
'Accept-Language: ru-RU,ru;q=0.8,en-US;q=0.5,en;q=0.3';
'User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.0.3705; .NET CLR 1.1.4322; Media Center PC 4.0)';
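As a minimal sketch of how such headers could be attached to the image request: CURLOPT_AUTOREFERER only fills in the Referer when cURL follows a redirect, so a browser-like Referer for the captcha image would have to be set explicitly. The function names buildBrowserHeaders and prepareImageRequest are hypothetical, and treating the captcha page URL as the Referer is an assumption of this sketch, not something confirmed in the thread.

```php
<?php
// Build request headers resembling what a browser sends.
// The Accept values are the ones quoted above; $referer is an assumption.
function buildBrowserHeaders(?string $referer = null): array
{
    $headers = [
        'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language: ru-RU,ru;q=0.8,en-US;q=0.5,en;q=0.3',
    ];
    if ($referer !== null) {
        $headers[] = 'Referer: ' . $referer;
    }
    return $headers;
}

// Configure (but do not execute) an existing cURL handle for the image request.
// $pageUrl is the URL of the page that embeds the captcha image.
function prepareImageRequest($ch, string $imageUrl, string $pageUrl, $fh): void
{
    curl_setopt($ch, CURLOPT_URL, $imageUrl);
    curl_setopt($ch, CURLOPT_HTTPHEADER, buildBrowserHeaders($pageUrl));
    // With CURLOPT_HEADER enabled, response headers are written into the
    // output as well, which would corrupt the saved image file.
    curl_setopt($ch, CURLOPT_HEADER, false);
    curl_setopt($ch, CURLOPT_FILE, $fh);
}
```

After the first request, curl_getinfo($ch, CURLINFO_EFFECTIVE_URL) returns the last URL cURL actually fetched, which is one possible source for $pageUrl.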
You are simply sending requests too quickly. Only manual captcha-recognition services will help you here.
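If the captcha is triggered by request rate, as this reply seems to suggest, spacing requests out with a randomized pause is one common mitigation. The sketch below is illustrative only: randomDelay and fetchAllThrottled are hypothetical names, $fetch stands for whatever function performs a single search request, and the 5 to 15 second defaults are arbitrary rather than known limits.

```php
<?php
// Return a randomized pause, in seconds, between $min and $max.
function randomDelay(float $min, float $max): float
{
    return $min + ($max - $min) * (mt_rand() / mt_getrandmax());
}

// Run $fetch once per query, sleeping a random interval between calls.
function fetchAllThrottled(array $queries, callable $fetch, float $min = 5.0, float $max = 15.0): array
{
    $results = [];
    foreach (array_values($queries) as $i => $query) {
        if ($i > 0) {
            // usleep() takes microseconds.
            usleep((int) (randomDelay($min, $max) * 1000000));
        }
        $results[$query] = $fetch($query);
    }
    return $results;
}
```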
