The problem is very simple: I can't fetch the contents of a page whose URL contains Cyrillic characters, for example a page from Russian Wikipedia. I tried it with urllib, but I keep running into an exception

    from urllib.request import urlopen
    from urllib.parse import quote

    def get_content(name):
        print(urlopen('http://ru.wikipedia.org/wiki/' + quote(name)).readall().decode('utf-8'))

    get_content('лес')

of this type:

 UnicodeEncodeError: 'charmap' codec can't encode character '\xb2' in position 14187: character maps to <undefined> 

I have read similar questions in other discussions, but no matter what I do with quote, the result is the same. Maybe I'm doing something stupid, but so far I can't even fetch a single page from the wiki.

2 answers

You just need to add a coding declaration at the top of the file:

    # coding=utf-8
    from urllib import urlopen, quote

    def get_content(name):
        return urlopen('http://ru.wikipedia.org/wiki/' + quote(name)).read()

    print get_content('лес')

  • No. As noted in the comments earlier, the problem is in the console output, not in the encoding. I use PyCharm, and its console (terminal) not only differs from the Windows console itself but also behaves strangely. - Lescott
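
To illustrate the point made in the comment: the UnicodeEncodeError is raised when print() tries to encode the page text for the console, not by urlopen or quote. Below is a minimal sketch for Python 3 that replaces characters the console cannot encode with escape sequences before printing. The helper name to_console_safe is hypothetical, and cp1251 is only an assumed console code page; the actual one may differ.

```python
import sys


def to_console_safe(text, encoding=None):
    """Return text with characters the target encoding lacks replaced by escapes."""
    # Use the console's own encoding when none is given; fall back to ASCII
    # if the stream does not report one (an assumption made for this sketch).
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "ascii"
    # backslashreplace turns unencodable characters into \xNN or \uNNNN escapes
    # instead of raising UnicodeEncodeError.
    return text.encode(encoding, errors="backslashreplace").decode(encoding)


# '\u00b2' is the superscript two ('\xb2') reported in the traceback.
sample = "лес, 10 м\u00b2"

# Simulate a cp1251 console (assumed): Cyrillic survives, '\xb2' gets escaped.
safe_for_cp1251 = to_console_safe(sample, "cp1251")

# Printing with the real console encoding never raises, whatever that encoding is.
print(to_console_safe(sample))
```

Another option, if the console itself can display UTF-8, is to set the PYTHONIOENCODING environment variable to utf-8 before starting the script, so that print() no longer uses the charmap codec.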

Perhaps this will help:

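    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    from urllib.request import urlopen
    from urllib.parse import quote

    name = 'лес'
    # read() instead of readall(): the latter is not available on the
    # response object in newer Python 3 releases
    print(urlopen(u'http://ru.wikipedia.org/wiki/' + quote(name)).read().decode('utf-8'))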
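
If the console cannot show the text at all, one way around it is to keep fetching and output separate and write the page to a UTF-8 file instead of printing it. This is only a sketch for Python 3; the names build_url, fetch_page and save_page are hypothetical, and the base URL is the one from the question.

```python
from urllib.parse import quote
from urllib.request import urlopen

BASE_URL = 'http://ru.wikipedia.org/wiki/'  # taken from the question


def build_url(name):
    """Build the page URL; quote() percent-encodes the UTF-8 bytes of the title."""
    return BASE_URL + quote(name)


def fetch_page(name):
    """Download the page and return it as str (needs network access)."""
    with urlopen(build_url(name)) as response:
        return response.read().decode('utf-8')


def save_page(text, path):
    """Write the text as UTF-8, bypassing the console encoding entirely."""
    with open(path, 'w', encoding='utf-8') as f:
        f.write(text)


# The URL itself is plain ASCII, so printing it is safe on any console.
print(build_url('лес'))
```

With network access, save_page(fetch_page('лес'), 'les.html') would store the page on disk, where it can be opened in an editor or browser that understands UTF-8.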