I am parsing robots.txt files with PHP cURL. On some sites the request fails with:

cURL Error (28): Operation timed out after 30001 milliseconds with 0 bytes received 

Increasing the timeout does not solve the problem. The parsing code itself:

 $ch = curl_init();
 curl_setopt($ch, CURLOPT_URL, $url);
 curl_setopt($ch, CURLOPT_HEADER, 1);
 curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
 curl_setopt($ch, CURLOPT_TIMEOUT, 60);
 curl_setopt($ch, CURLOPT_REFERER, $url);
 curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:63.0) Gecko/20100101 Firefox/63.0');
 curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
 curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
 curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false);
 $robots = curl_exec($ch);
 $curl_errno = curl_errno($ch);
 $curl_error = curl_error($ch);
 curl_close($ch);

An example URL: https://www.adidas.ru/robots.txt

    3 Answers

    Add this line to your code:

    curl_setopt($ch, CURLOPT_ENCODING, 'gzip');

    https://www.adidas.ru/robots.txt is served gzip-compressed by default, as, most likely, are the other robots.txt files you are having trouble with.
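    A minimal sketch of where this option goes, assuming the same handle setup as in the question; passing an empty string to CURLOPT_ENCODING advertises every encoding the local cURL build supports and lets cURL decompress the response transparently, so 'gzip' does not have to be hard-coded:

        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_TIMEOUT, 60);
        // '' = accept any encoding this cURL build supports and decode it automatically;
        // use 'gzip' instead to request gzip explicitly.
        curl_setopt($ch, CURLOPT_ENCODING, '');
        $robots = curl_exec($ch);
        curl_close($ch);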

    UPD

    The second option is to ask the server not to send the content gzip-compressed by passing the header:

    curl_setopt($ch, CURLOPT_HTTPHEADER, ['accept-encoding: deflate, br']);

    This is for the case where a particular cURL build or version has a problem with gzip.

    • This partially solved the problem: if a Windows browser user agent is passed in CURLOPT_USERAGENT, everything works, but with a MacBook browser user agent the error stays the same. What could it be? - Alex
    • @Alex, what is the problem with always sending a Windows user agent for this task? As for why there is an error with the MacBook user agent, you would probably have to look at the headers and dig into it, but I don't see the point of that for this task. - Nsk
    • Indeed, for now sending a Windows agent is enough, but looking to the future, the same technique that is now applied to MacBooks could well be applied to everyone, so I would like to be sure and understand what the reason is. - Alex
    • @Alex post the URL that does not parse under the Mac user agent, and the exact user agent string that produces the error. - Nsk
    • @Alex That particular URL does not parse even under a Windows UA, but it is solved by curl_setopt($ch, CURLOPT_HTTPHEADER, ['accept-language: ru-RU,ru;q=0.9,en-US;q=0.8,en;q=0.7']); Try to fake as many headers as possible and look at what the browser actually sends; pay particular attention to REFERER, USERAGENT, ACCEPT, cookies and the various policy headers (see the sketch after this comment thread). With a lot of experience working with cURL under PHP, I can't remember a case where I could not get the file in the end: it comes down to the headers, bypassing protections, or sometimes the crooked logic of the CMS being parsed. This probably deserves a separate question with a number of example URLs that your cURL could not parse. - Nsk
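    A minimal sketch of that "fake as many browser headers as possible" idea, assuming the same $ch and $url as in the question; the header values are only an example of what a desktop Firefox might send, not something these sites are known to require:

        // Hypothetical browser-like header set; copy the real values from the
        // network tab of your own browser's developer tools.
        $headers = [
            'accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'accept-language: ru-RU,ru;q=0.9,en-US;q=0.8,en;q=0.7',
            'upgrade-insecure-requests: 1',
        ];
        curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
        curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:63.0) Gecko/20100101 Firefox/63.0');
        curl_setopt($ch, CURLOPT_REFERER, $url);
        // accept-encoding is left to CURLOPT_ENCODING so cURL also decodes the body.
        curl_setopt($ch, CURLOPT_ENCODING, '');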
     // One of the possible causes:

    CURL - FAQ - 4.1 Problems connecting to SSL servers.

    Sometimes curl has problems connecting to SSL servers when using SSLeay or OpenSSL v0.9+.

    Many older SSL servers do not work with SSLv3 requests. To fix this problem, add the --sslv2 parameter to the curl command line.

    There have also been cases where the remote server did not like the SSLv2 request and SSLv3 had to be used instead: the command-line option is --sslv3.

      In Ubuntu 16.04 there are restrictions on SSLv3 connections; apparently the server refuses the other version: https://ubuntugeeks.com/questions/33156/simple-way-of-enabling-sslv2-and-sslv3-in-openssl
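      If the SSL/TLS version really is the cause, the PHP-side counterpart of those command-line flags is CURLOPT_SSLVERSION. A hedged sketch; which constant actually works is an assumption that depends on the server and on what the local OpenSSL build still allows (SSLv2/SSLv3 are disabled in most current builds, so forcing a modern TLS version is the more likely fix):

        // Force a specific SSL/TLS version on the existing handle.
        curl_setopt($ch, CURLOPT_SSLVERSION, CURL_SSLVERSION_TLSv1_2);
        // Older equivalents of the --sslv2 / --sslv3 flags quoted from the FAQ:
        // curl_setopt($ch, CURLOPT_SSLVERSION, CURL_SSLVERSION_SSLv2);
        // curl_setopt($ch, CURLOPT_SSLVERSION, CURL_SSLVERSION_SSLv3);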