BeautifulSoup cannot parse a URL with Cyrillic characters

Question

I can't fetch and parse a page whose URL contains Cyrillic characters:

    from bs4 import BeautifulSoup
    from urllib import request

    html_doc = request.urlopen('http://кто.рф/').read()
    soup = BeautifulSoup(html_doc)
    title = soup.title.string
    print(title)

I keep getting the same error:

 UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-2: ordinal not in range(256)

Encoding and decoding the string does not help.
Python 3.4, BeautifulSoup 4.3.2.

You need to convert the address to Punycode. There are no Cyrillic domains on the Internet. - etki 2:23 pm
@soon, it says I can accept my own answer in two days. I'll mark it on Monday. - pnoob
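
The Punycode conversion mentioned in the comment above can be tried in isolation. The following is a minimal sketch, not taken from the thread: it assumes the traceback comes from the host name being encoded as latin-1 somewhere in the HTTP layer, and it uses Python's built-in `idna` codec to produce the ASCII form. The variable names `host`, `error_text`, and `ascii_host` are illustrative only.

```python
host = 'кто.рф'

# Cyrillic text cannot be represented in latin-1, the codec named in the traceback.
try:
    host.encode('latin-1')
    error_text = None
except UnicodeEncodeError as exc:
    error_text = str(exc)

# The built-in 'idna' codec converts each dot-separated label
# to its ASCII (Punycode) form, prefixed with 'xn--'.
ascii_host = host.encode('idna').decode('ascii')

print(error_text)
print(ascii_host)
```

The ASCII form is what a client can safely pass on as the host name; decoding it with the same codec gives back the Cyrillic name.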

jfs · Accepted Answer · 2015-07-03T17:59:04

To avoid implementing IDN handling yourself, you can use the requests library:

    #!/usr/bin/env python3
    import requests  # $ pip install requests
    from bs4 import BeautifulSoup  # $ pip install beautifulsoup4

    r = requests.get('http://кто.рф')
    soup = BeautifulSoup(r.content, from_encoding=r.encoding)

Excellent solution and less code; it just runs about a second slower than my pile of lines, in case timing matters to anyone.
The code is most likely bound by network latency, so no client-side library will make a single request faster.
On the other hand, requests pools connections automatically when a Session is reused, so several requests to the same host should complete faster.

Answer 2 · 2015-07-03T15:10:02

I solved it the following way:

    from urllib.parse import urlsplit, urlunsplit
    from bs4 import BeautifulSoup
    from urllib import request

    url = 'http://кто.рф/'
    parts = list(urlsplit(url))
    parts[1] = parts[1].encode('idna').decode('ascii')
    url = urlunsplit(parts)
    html_doc = request.urlopen(url).read()
    soup = BeautifulSoup(html_doc)
    title = soup.title.string
    print(title)
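
The same idea can be wrapped in a small reusable helper. This is only a sketch under stated assumptions, not part of the original answer: the function name `to_ascii_url` is hypothetical, and it encodes only the host name, on the assumption that encoding the whole network location may not give the intended result when it also contains a port or user info. It does not handle bracketed IPv6 addresses or non-ASCII characters in the path.

```python
from urllib.parse import urlsplit, urlunsplit


def to_ascii_url(url):
    """Return url with its host name converted to IDNA (Punycode) form.

    Hypothetical helper: only the host part of the network location is
    encoded; user info, port, path, query and fragment are left untouched.
    """
    parts = urlsplit(url)
    # Separate optional "user:password@" from "host:port".
    userinfo, at, hostport = parts.netloc.rpartition('@')
    # Separate the optional ":port" (bracketed IPv6 hosts are not supported).
    host, colon, port = hostport.partition(':')
    ascii_host = host.encode('idna').decode('ascii')
    netloc = userinfo + at + ascii_host + colon + port
    return urlunsplit(parts._replace(netloc=netloc))


if __name__ == '__main__':
    print(to_ascii_url('http://кто.рф:8080/'))
```

The returned string contains an ASCII-only host and can be passed to `request.urlopen` in place of the original URL.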
