I need to fetch the HTML from the marketwatch.com website. I do it this way:

    from random import choice
    import requests

    # The next two lines are not important: they pick a random User-Agent and use it in the request.
    useragents = open('/home/ubuntu/bot/useragents.txt').read().split('\n')
    useragent = {'User-Agent': choice(useragents)}
    response = requests.get(url, verify=True, headers=useragent)
    return response.text

This used to work, but now the site has apparently added some kind of protection, and all I get back is this:

    <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
    <meta http-equiv="cache-control" content="max-age=0" />
    <meta http-equiv="cache-control" content="no-cache" />
    <meta http-equiv="expires" content="0" />
    <meta http-equiv="expires" content="Tue, 01 Jan 1980 1:00:00 GMT" />
    <meta http-equiv="pragma" content="no-cache" />
    <meta http-equiv="refresh" content="10; url=/distil_r_captcha.html?requestId=3130ab7e-7dc1-4353-bfe0-15e09c163fc9&httpReferrer=%2Finvesting%2Ffuture%2Fdjia%2520futures" />
    <script type="text/javascript">
    (function(window){ try { if (typeof sessionStorage !== 'undefined'){ sessionStorage.setItem('distil_referrer', document.referrer); } } catch (e){} })(window);
    </script>
    <script type="text/javascript" src="/lxwtsparqmgdowhx.js" defer></script>
    <style type="text/css">#d__fFH{position:absolute;top:-5000px;left:-5000px}#d__fF{font-family:serif;font-size:200px;visibility:hidden}#fwqssyztxufxfzwwduebdqwxedwrzazqaaavux{display:none!important}</style></head>
    <body>
    <div id="distilIdentificationBlock">&nbsp;</div>
    </body>
    </html>

How can I get past the protection and reach the content?

  • I have a nagging suspicion: are you getting a captcha there, by any chance? - Vladimir Klykov
  • No, there are no captchas - Maxon
  • Are you sure? This is what is shown for your request: marketwatch.com/… - Vladimir Klykov
  • Hmm... So yes, the site somehow detects that the client is a bot and redirects it to a captcha. Are there any options? - Maxon
  • Your IP has ended up on a "black list" - Vladimir Klykov
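
As the comments point out, the page that comes back is not the real content but a stub that redirects to a captcha. A minimal sketch of how the script could at least recognise this case instead of treating the stub as data is shown below. It relies only on the two markers visible in the response above (the `distil_r_captcha` refresh target and the `distilIdentificationBlock` element); the function names `find_block_redirect` and `looks_blocked` are hypothetical, and other block pages may look different.

```python
from html.parser import HTMLParser


class _MetaRefreshParser(HTMLParser):
    """Remembers the URL of the first <meta http-equiv="refresh"> tag, if any."""

    def __init__(self):
        super().__init__()
        self.refresh_url = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.refresh_url is not None:
            return
        attr = {name.lower(): (value or "") for name, value in attrs}
        if attr.get("http-equiv", "").lower() != "refresh":
            return
        # The content attribute looks like "10; url=/distil_r_captcha.html?..."
        _, _, rest = attr.get("content", "").partition(";")
        key, _, value = rest.strip().partition("=")
        if key.strip().lower() == "url":
            self.refresh_url = value.strip()


def find_block_redirect(html_text):
    """Return the captcha redirect URL if the page refreshes to one, else None."""
    parser = _MetaRefreshParser()
    parser.feed(html_text)
    url = parser.refresh_url
    if url and "distil_r_captcha" in url:
        return url
    return None


def looks_blocked(html_text):
    """Heuristic: True if the page carries either marker seen in the stub above."""
    return (
        find_block_redirect(html_text) is not None
        or "distilIdentificationBlock" in html_text
    )
```

With this in place, the fetching code could call `looks_blocked(response.text)` before parsing and, for example, log the event or stop, rather than continue with an empty page.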
