Cannot parse a URL with Cyrillic characters

    from bs4 import BeautifulSoup
    from urllib import request

    html_doc = request.urlopen('http://кто.рф/').read()
    soup = BeautifulSoup(html_doc)
    title = soup.title.string
    print(title)

I constantly see the same error.

 UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-2: ordinal not in range(256) 

Calling encode/decode does not help.
Python 3.4, BeautifulSoup 4.3.2.

  • You need to convert the address to Punycode. At the DNS level there are no Cyrillic domain names. - etki
  • @etki thanks for the tip - pnoob
  • @pnoob, please post this as an answer. - awesoon
  • @soon, corrected, thanks for the comment. - pnoob
  • @soon, it says I can accept my own answer in two days. I will do it on Monday. - pnoob

2 answers

To avoid implementing IDN handling yourself, you can use the requests library:

    #!/usr/bin/env python3
    import requests  # $ pip install requests
    from bs4 import BeautifulSoup  # $ pip install beautifulsoup4

    r = requests.get('http://кто.рф')
    soup = BeautifulSoup(r.content, from_encoding=r.encoding)
  • Excellent solution with less code. It just runs about a second slower than my handful of lines, in case timing matters to anyone. - pnoob
  • @pnoob: a second is too large a difference. The code is most likely bound by network latency, so no client-side library will make a single request faster. On the other hand, requests reuses connections through its connection pool when you go through a Session, so several requests to the same host should complete faster. - jfs
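As far as I know, the connection reuse mentioned in the comment above applies across calls when they share one `requests.Session`. A minimal sketch under that assumption follows; `fetch_all` and its `session` parameter are hypothetical names introduced here for illustration, and the 10-second timeout is an arbitrary choice.

```python
def fetch_all(urls, session=None):
    """Fetch every URL through one session so connections can be reused.

    fetch_all is a hypothetical helper, not part of requests.
    """
    if session is None:
        import requests  # $ pip install requests
        session = requests.Session()
    bodies = []
    for url in urls:
        # One shared session keeps the underlying connection open per host.
        response = session.get(url, timeout=10)
        response.raise_for_status()
        bodies.append(response.content)
    return bodies
```

Calling `fetch_all(['http://кто.рф/', 'http://кто.рф/'])` would send both requests through the same session; the raw bodies can then be passed to BeautifulSoup as in the answer above.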

I solved it as follows:

    from urllib.parse import urlsplit, urlunsplit
    from bs4 import BeautifulSoup
    from urllib import request

    url = 'http://кто.рф/'
    parts = list(urlsplit(url))
    parts[1] = parts[1].encode('idna').decode('ascii')
    url = urlunsplit(parts)
    html_doc = request.urlopen(url).read()
    soup = BeautifulSoup(html_doc)
    title = soup.title.string
    print(title)
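The code above encodes the whole network-location part, which works when that part is only a host name. Below is a minimal sketch of a reusable variant, assuming the URL may also carry a port or user info. `to_ascii_url` is a hypothetical helper name, not part of urllib, and IPv6 literals and non-ASCII paths are deliberately not handled.

```python
from urllib.parse import urlsplit, urlunsplit


def to_ascii_url(url):
    """Return url with only its host name IDNA-encoded (hypothetical helper)."""
    parts = urlsplit(url)
    if parts.hostname is None:
        return url  # nothing to encode, e.g. a relative URL
    # Encode just the host; the port and user info must stay untouched.
    host = parts.hostname.encode('idna').decode('ascii')
    userinfo, at, _ = parts.netloc.rpartition('@')
    port = '' if parts.port is None else ':%d' % parts.port
    return urlunsplit(parts._replace(netloc=userinfo + at + host + port))


print(to_ascii_url('http://кто.рф/'))  # http://xn--j1ail.xn--p1ai/
```

The result can be passed to `request.urlopen` exactly like the `url` variable in the answer.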