Hello, I am writing a parser using asyncio + aiohttp and I need to download images.

At the moment I am downloading the image during the parsing:

    import re
    import time

    import httplib2

    def download_image(url):
        start = time.time()
        # Only handle URLs ending in e.g. 12345.jpg
        text = re.search(r"[0-9]+\.jpg$", url)
        if text is not None:
            h = httplib2.Http('cache')
            response, content = h.request(url)
            # Save the image under its numeric file name
            with open('images/' + text.group(0), 'wb') as out:
                out.write(content)
        print("Download time:", time.time() - start)

I want to simply hand the image URL off to the event loop as a task, so that downloading does not hold up the parsing (while parsing, if we find the link we need, we return it so that a new task is created on the event loop).
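Roughly, the idea looks like this (a sketch; download_image, img_url and the futures list here are placeholder names, not code from my project):

    import asyncio

    async def download_image(client, img_url):
        # Fetch the raw image bytes without blocking the parsing coroutine
        async with client.get(img_url) as resp:
            return await resp.read()

    # For every image link found while parsing: schedule the download as a
    # separate task and continue parsing immediately.
    # task = asyncio.ensure_future(download_image(client, img_url))
    # futures.append(task)  # so asyncio.wait(futures) also covers downloads

My crawling code currently looks like this: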

    import asyncio
    from concurrent.futures import ThreadPoolExecutor

    import aiohttp

    async def crawl(future, client, pool):
        futures = []
        # Get the links out of the future
        urls = await future
        # Fetch the page markup for every link
        for request_future in asyncio.as_completed([request(client, url) for url in urls]):
            # Hand parsing of the markup off to the thread pool
            parse_future = loop.run_in_executor(pool, parse, (await request_future))
            # Recursively call ourselves to crawl the next batch of links
            futures.append(asyncio.ensure_future(crawl(parse_future, client, pool)))
        # This is only needed so that we know
        # when to stop the event loop
        if futures:
            await asyncio.wait(futures)

    async def start_main(root_urls):
        loop = asyncio.get_event_loop()
        # loop.set_debug(True)
        # Create a thread pool sized to the number of processors
        # with ThreadPoolExecutor(max_workers=os.cpu_count()) as pool:
        with ThreadPoolExecutor(count_thread) as pool:
            conn = aiohttp.TCPConnector(ssl=False)
            # Create a client session
            async with aiohttp.ClientSession(connector=conn) as client:
                # Create the root future
                initial_future = loop.create_future()
                # Put the links we start parsing from into it
                initial_future.set_result(root_urls)
                # Pass that future to the crawling coroutine
                # together with the thread pool and the client session
                await crawl(initial_future, client, pool)
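For reference, a coroutine like start_main is driven from the event loop roughly like this (note that crawl references a module-level loop, so it has to exist before crawl runs; count_thread is defined elsewhere in my code):

    loop = asyncio.get_event_loop()
    loop.run_until_complete(start_main(['https://example.com/catalog']))
    loop.close()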

To send requests I use this method:

    async def request(client, url):
        global limit, headers, len_count_product, count_request_product, proxy_auth
        async with limit:
            for i in range(30):
                try:
                    async with client.get(url, headers=headers, proxy=get_proxy(),
                                          proxy_auth=proxy_auth) as r:
                        log.info('Request: %s', url)
                        log.info("Status: %s", r.status)
                        if r.status == 404:
                            break
                        if r.status == 200:
                            count_request_product = count_request_product + 1
                            log.info("Request count: %s", str(count_request_product))
                            return await r.content.read()
                        else:
                            log.info("Status error: %s", r.status)
                            log.info("Delay, attempt: %s", i)
                            await asyncio.sleep(1)
                except Exception as e:
                    print(e)
                    log.info("Delay, attempt: %s", i)
                    await asyncio.sleep(1)
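The global limit used in async with limit: is an async context manager, presumably an asyncio.Semaphore capping how many requests are in flight at once, created somewhere like this:

    import asyncio

    # Presumed definition of the global `limit`: at most 10 requests in flight
    limit = asyncio.Semaphore(10)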

How can this be modified to download images? And is it worth doing at all? (Will it slow the parsing down?)

    1 answer

    Parsing speed will increase and its total time will decrease, since the downloads will no longer block the parsing itself.

        from os import path
        from urllib.parse import urlparse

        def save_file(name, content):
            # Blocking disk write; meant to be run in the thread pool
            with open(path.join('images', name), 'wb') as fh:
                fh.write(content)

        async def download_image(client, url, pool, loop):
            # client is the aiohttp ClientSession created in start_main
            # Take the file name from the last segment of the URL path
            file_name = path.basename(urlparse(url).path)
            async with client.get(url) as r:
                content = await r.read()
            # Push the blocking file write off the event loop
            loop.run_in_executor(pool, save_file, file_name, content)
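    The call site in the crawler could then look like this (a sketch; image_urls stands for whatever your parse returns, and appending the task to futures lets asyncio.wait(futures) cover the downloads too):

        # Inside crawl, once parsing has produced image links:
        for img_url in image_urls:
            futures.append(asyncio.ensure_future(
                download_image(client, img_url, pool, loop)))

    The network read stays on the event loop while the blocking file write goes to the thread pool, so neither side stalls the other.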

    P.S. Using global variables in concurrent code is a very bad idea!
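    To see why: a line like count_request_product = count_request_product + 1 is a read-modify-write. On the single-threaded event loop it happens to be safe, but as soon as the same counter is touched from ThreadPoolExecutor workers, two threads can both read the old value and one increment is lost. A lock-guarded counter passed around explicitly (a sketch, not code from the question) avoids both the race and the global:

        import threading

        class Counter:
            # Explicit shared state guarded by a lock instead of a bare global
            def __init__(self):
                self._value = 0
                self._lock = threading.Lock()

            def increment(self):
                with self._lock:
                    self._value += 1
                    return self._value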

    • My url is defined in the parse method, which already runs in the executor (via the crawl method). Should I call it directly in parse via await download_image? And why shouldn't global variables be used? - danilshik
    • You can do it directly in parse, but it's worth checking all the nuances of thread safety. - Sergey Gornostaev
    • Why are global variables evil? - Sergey Gornostaev
    • I thought you were pointing out that system performance is lost; I already know it's better not to use them. - danilshik
    • I can't quite work out how to fit your approach into mine. In my parse method the input parameter is parse_text, i.e. await r.content.read(). Suppose I can somehow still get hold of the loop, but what about the pool, which only exists in start_main, inside with ThreadPoolExecutor(count_thread) as pool:? - danilshik
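    One way to get client, loop and pool into parse without globals, sketched on top of the code above (find_image_links is a placeholder for the part of parse that extracts image links), is to bind them with functools.partial at the run_in_executor call. Since parse then runs in a worker thread, coroutines have to be scheduled onto the event loop with asyncio.run_coroutine_threadsafe:

        import asyncio
        import functools

        def parse(parse_text, client=None, loop=None, pool=None):
            # Runs in a worker thread: schedule coroutines onto the event loop
            # with run_coroutine_threadsafe, not ensure_future.
            for img_url in find_image_links(parse_text):  # placeholder helper
                asyncio.run_coroutine_threadsafe(
                    download_image(client, img_url, pool, loop), loop)

        # In crawl, the extra arguments are bound before handing parse over:
        # parse_future = loop.run_in_executor(
        #     pool,
        #     functools.partial(parse, client=client, loop=loop, pool=pool),
        #     await request_future)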