How can I validate several sites at the same time, on several validation services at once? Will a site be validated if I send a plain GET request to https://validator.w3.org/check?uri=[link]?

r = requests.get('https://validator.w3.org/check?uri=http://shost-craft.su') 

Or do I need to send special headers, or a special kind of request, to tell the service that the site should be validated? Something like:

    r = requests.get('https://validator.w3.org/check?uri=http://shost-craft.su', headers={'something here for validation'})
    print(r.headers)

The headers that came back:

 {'Accept-Encoding': 'gzip', 'Access-Control-Allow-Headers': 'content-type', 'Access-Control-Allow-Origin': '*', 'Cache-Control': 'no-cache', 'Content-Encoding': 'gzip', 'Content-Type': 'text/html; charset=UTF-8', 'Date': 'Sun, 07 May 2017 20:17:42 GMT', 'Expires': 'Thu, 01 Jan 1970 00:00:00 GMT', 'Public-Key-Pins': 'pin-sha256="cN0QSpPIkuwpT6iP2YjEo1bEwGpH/yiUn6yhdy+HNto="; ' 'pin-sha256="WGJkyYjx1QMdMe0UqlyOKXtydPDVrk7sl2fV+nNm1r4="; ' 'pin-sha256="LrKdTxZLRTvyHM4/atX2nquX9BeHRZMCxg3cf4rhc2I="; ' 'max-age=864000', 'Server': 'Jetty(9.2.9.v20150224)', 'Strict-Transport-Security': 'max-age=15552015; preload', 'Transfer-Encoding': 'chunked', 'Vary': 'Accept-Encoding, User-Agent', 'X-Frame-Options': 'deny', 'X-XSS-Protection': '1; mode=block'} 

But I don't see anything useful in these headers. Or is there something in them that says whether the site passed validation?

There is also r.text itself: <!DOCTYPE html> <html lang="en"><head><link href="icon.png" rel="icon"><link href="style.css" rel="stylesheet"><title>Showing results for http://shost-craft.su/ - Nu Html Checker</title><meta name="viewport" content="width=device-width, initial-scale=1"></head><body><div id="banner"><h1 id="title"><a href="."><span>Nu Html Checker</span></a></h1></div><p class="disclaimer">This tool is an ongoing experiment in better HTML checking, and its behavior remains subject to change</p><h2 id="top">Showing results for http://shost-craft.su/</h2><form method="get"><fieldset><legend>Checker Input</legend><p class="checkboxes">Show <span class="checkboxgroup"><label title="Display the markup source of the input document." for="showsource"><input type="checkbox" name="showsource" id="showsource" value="yes">source</label><label title="Display an outline of the input document." for="showoutline"><input type="checkbox" name="showoutline" id="showoutline" value="yes">outline</label><label title="Display a report about the textual alternatives for images." for="showimagereport"><input type="checkbox" name="showimagereport" id="showimagereport" value="yes">image report</label></span><input id="show_options" type="button" value="Options…"><span class="extraoptions hidden"><span class="checkboxgroup"><label title="Check the content of all responses, including (non-200) error responses"><input type="checkbox" name="checkerrorpages" id="checkerrorpages" value="yes">check error pages</label></span><label id="user-agent-label" title="Specify the user-agent string to send in the document request">User-Agent <input name="useragent" list="useragents" value="Validator.nu/LV http://validator.w3.org/services"></label><datalist id="useragents"></datalist></span></p><div id="inputregion"><label id="inputlabel" for="doc">Document URL:</label><input type="url" name="doc" id="doc" pattern="(?:(?:https?://.+)|(?:data:.+))?" 
title="Absolute IRI (http, https or data only) of the document to be checked." tabindex="0" autofocus="autofocus" value="http://shost-craft.su/"></div><p><input value="Check" type="submit" id="submit"></p></fieldset></form><script src="script.js"></script><div id="results"><ol><li class="error"><p><strong>Error</strong>: <span><code>style</code> element between <code>head</code> and <code>body</code>.</span></p><p class="location">From line <span class="first-line">14</span>, column <span class="first-col">5</span>; to line <span class="last-line">14</span>, column <span class="last-col">11</span></p><p class="extract"><code>head&gt;<span class="lf" title="Line break">↩</span> <b>&lt;style&gt;</b><span class="lf" title="Line break">↩</span> b</code></p></li><li class="error fatal"><p><strong>Fatal Error</strong>: <span>Cannot recover after last error. Any further errors will be ignored.</span></p><p class="location">From line <span class="first-line">14</span>, column <span class="first-col">5</span>; to line <span class="last-line">14</span>, column <span class="last-col">11</span></p><p class="extract"><code>head&gt;<span class="lf" title="Line break">↩</span> <b>&lt;style&gt;</b><span class="lf" title="Line break">↩</span> b</code></p></li></ol><p class="failure">There were errors.</p><div class="details"><p class="msgschema">Used the schema for HTML with SVG 1.1, MathML 3.0, RDFa 1.1, and ITS 2.0 support.</p><p class="msgmediatype">Used the HTML parser. 
Externally specified character encoding was UTF-8.</div><p class="stats">Total execution time 340 milliseconds.</p></div><hr><div id="about"><p><a href="about.html">About this checker</a> • <a href="about.html#issues">Report an issue</a> • <span class="version">Version: 17.5.7</span></p></div></body></html>
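
The verdict appears to be in the body rather than in the headers: the markup above contains `<li class="error">` items and a `<p class="failure">` paragraph. A minimal sketch of reading those markers from `r.text` with the standard library follows. The names `CheckerResultParser` and `summarize_results_page` are made up for illustration, and the sketch assumes the results page keeps the class names seen in this output (the page itself warns that its behavior may change).

```python
from html.parser import HTMLParser


class CheckerResultParser(HTMLParser):
    """Collects error items and the overall verdict from a results page."""

    def __init__(self):
        super().__init__()
        self.error_count = 0  # <li class="error"> and <li class="error fatal">
        self.failed = False   # set once <p class="failure"> is seen

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        classes = (dict(attrs).get("class") or "").split()
        if tag == "li" and "error" in classes:
            self.error_count += 1
        elif tag == "p" and "failure" in classes:
            self.failed = True


def summarize_results_page(html_text):
    parser = CheckerResultParser()
    parser.feed(html_text)
    return {"failed": parser.failed, "errors": parser.error_count}


# Shortened sample built from the markup in the question.
sample = (
    '<div id="results"><ol>'
    '<li class="error"><p><strong>Error</strong>: ...</p></li>'
    '<li class="error fatal"><p><strong>Fatal Error</strong>: ...</p></li>'
    '</ol><p class="failure">There were errors.</p></div>'
)
print(summarize_results_page(sample))  # {'failed': True, 'errors': 2}
```

In real use the argument would be `r.text` from the request in the question.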

  • Did you somehow fail to find the answer to your question in the response the validator sent back? - andreymal
  • @andreymal I've edited the question. To be honest, I didn't find anything there that shows whether validation passed - you have no pass
  • You've printed the headers; now print the actual body of the response :) - andreymal
  • @andreymal <Response [200]>, is that it or not? - you have no pass
  • Open the request URL in the browser's address bar. You will land on the results page; find the validation verdict there. Then view the page source in the browser. You will see that it matches the output of r.text in your code. So the task becomes: find in r.text the elements that correspond to what you see in the browser. For further reading, I recommend Ryan Mitchell's "Web Scraping with Python"; it answers all of these questions. - ivan_susanin
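
As an alternative to picking elements out of the HTML, the Nu Html Checker can, as far as I know, return its report as JSON when `out=json` is added to the query. The sketch below assumes the `https://validator.w3.org/nu/` endpoint with a `doc` parameter and a report shaped like `{"messages": [{"type": ...}, ...]}`. The function names and the User-Agent string are hypothetical, and the standard library's `urllib` is used instead of `requests` so the block has no external dependencies.

```python
import json
import urllib.parse
import urllib.request

CHECKER = "https://validator.w3.org/nu/"  # assumed Nu Html Checker endpoint


def fetch_report(site_url, timeout=30):
    """Ask the checker for a JSON report on site_url (makes a network call)."""
    query = urllib.parse.urlencode({"doc": site_url, "out": "json"})
    request = urllib.request.Request(
        CHECKER + "?" + query,
        headers={"User-Agent": "validation-sketch/0.1"},  # hypothetical UA
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def summarize(report):
    """Treat a page as valid when the report carries no error messages."""
    messages = report.get("messages", [])
    # "non-document-error" is assumed to mean the checker could not
    # fetch or process the page, so it is counted as a failure too.
    errors = [m for m in messages
              if m.get("type") in ("error", "non-document-error")]
    return {"valid": not errors, "errors": len(errors), "messages": len(messages)}


# Offline example with a hand-written report in the assumed shape.
sample_report = {
    "messages": [
        {"type": "error", "lastLine": 14,
         "message": "style element between head and body."},
        {"type": "info", "message": "Example informational message."},
    ]
}
print(summarize(sample_report))  # {'valid': False, 'errors': 1, 'messages': 2}
```

With network access, `summarize(fetch_report("http://shost-craft.su"))` would give the same kind of summary for the site in the question.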

1 answer

How about doing the validation like this, with aiohttp and asyncio? Code:

    import asyncio

    from aiohttp import ClientSession


    async def fetch(url, session):
        # Request one validator URL and return the response body as bytes.
        async with session.get(url) as response:
            date = response.headers.get("Date")
            print("{}: {} -> HTTP {}".format(date, response.url, response.status))
            return await response.read()


    async def bound_fetch(sem, url, session):
        # The semaphore caps how many requests are in flight at once.
        async with sem:
            return await fetch(url, session)


    async def run(limit):
        site = "http://shost-craft.su"
        # Each row is assumed to be a validator URL prefix; the site is appended to it.
        with open("C:\\cruelnetwork\\cruel.need\\wolfs.txt") as werewolves:
            urls = [row.strip() + site for row in werewolves if row.strip()]
        sem = asyncio.Semaphore(limit)
        # One session is shared by all requests, one request per validator URL.
        async with ClientSession() as session:
            tasks = [bound_fetch(sem, url, session) for url in urls]
            return await asyncio.gather(*tasks)


    number = 10  # maximum number of simultaneous requests
    responses = asyncio.run(run(number))

I found this at https://pawelmhm.imtqy.com/asyncio/python/aiohttp/2016/04/22/asyncio-aiohttp.html . Or is there something here that should be fixed? :)
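
One thing that may be worth adding is keeping each result together with the URL it belongs to, so the verdicts can be inspected afterwards. Here is a minimal sketch of that using only the standard library, assuming Python 3.9+ for `asyncio.to_thread`. `check_all` and `fake_check` are hypothetical names; `fake_check` stands in for a real blocking request to a validator and returns a made-up verdict.

```python
import asyncio


async def check_all(urls, check, limit=5):
    """Run the blocking check(url) for every URL, at most `limit` at a time."""
    sem = asyncio.Semaphore(limit)

    async def bounded(url):
        async with sem:
            # to_thread keeps the event loop free while check() blocks.
            return url, await asyncio.to_thread(check, url)

    # gather returns results in the same order as the input URLs.
    pairs = await asyncio.gather(*(bounded(u) for u in urls))
    return dict(pairs)


def fake_check(url):
    # Stand-in for a real validator request; the verdict is invented.
    return "ok" if url.startswith("https://") else "errors"


results = asyncio.run(
    check_all(["https://a.example", "http://b.example"], fake_check)
)
print(results)  # {'https://a.example': 'ok', 'http://b.example': 'errors'}
```

Any function that takes a URL and returns a verdict could be passed in place of `fake_check`.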