UnicodeEncodeError while parsing page

Question

When I try to parse a web page, I get the following error:

    Traceback (most recent call last):
      File "C:\Users\Butooz\Desktop\untitled\test.py", line 10, in <module>
        print(soup.findAll('a'))
      File "C:\Users\Butooz\AppData\Local\Programs\Python\Python35-32\lib\encodings\cp866.py", line 19, in encode
        return codecs.charmap_encode(input,self.errors,encoding_map)[0]
    UnicodeEncodeError: 'charmap' codec can't encode character '\xab' in position 3987: character maps to <undefined>

The script looks like this:

    import urllib.request
    from bs4 import BeautifulSoup

    url = 'https://aheku.net/news/'
    html = urllib.request.urlopen(url)
    soup = BeautifulSoup(html, 'html5lib')
    print(soup.findAll('a'))

How can this be fixed?

Try using 'needle' instead of 'request', after loading it first: var needle = require('needle');
The error does not seem to occur during parsing, but when printing to the console.
To the author: please also attach the stack trace, so that we can see where the error occurred :)
Look, the problem is in print: it writes to the console, and the Windows console uses a limited encoding, cp866, while the page is probably UTF-8. If I'm not mistaken, you need to encode the text into a cp866 byte string before printing, replacing the characters cp866 cannot represent, and then the problem should not occur.
In similar cases I did this: on output, if I caught a UnicodeEncodeError, I printed the value as a byte string instead.
Given that I'm a beginner in Python, I can hardly imagine how to implement that in code...

Accepted Answer · 2016-05-24T19:37:31

If you just want to see what came back, I suggest printing a byte string: no console encoding is applied to it, so the error will not occur, but non-ASCII characters will appear as hex escapes:

    for a in soup.findAll('a'):
        try:
            print(a)
        except UnicodeEncodeError:
            print(a.encode('utf-8'))

In the example above, we try to print each tag to the console and, if that fails, print it as a byte string instead.
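If hex escapes are too hard to read, another option, along the lines of the cp866 suggestion in the comments, is to replace only the characters the console cannot show. The following is a minimal sketch, not part of the original answer; the helper name `console_safe` is hypothetical, and it assumes that losing the unsupported characters (they become `?`) is acceptable.

```python
import sys


def console_safe(text, encoding=None):
    """Return text with characters the target encoding cannot represent replaced by '?'."""
    # Use the console's own encoding when none is given; fall back to UTF-8 if it is unknown.
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "utf-8"
    # Encoding with errors="replace" substitutes '?' instead of raising UnicodeEncodeError;
    # decoding again yields a str that the console is able to print.
    return text.encode(encoding, errors="replace").decode(encoding)
```

With the script from the question, this would be used as `print(console_safe(str(a)))` inside the loop. Cyrillic text survives on a cp866 console, while characters such as `«` are shown as `?`.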

If you only want to work with the data (processing, parsing, and so on), there will be no problems: the error occurs only when printing to the console. Linux terminals typically use UTF-8, so the problem usually does not appear there. If you need to collect the data, you can write it to a file instead of printing it to the console. In that case, specify the UTF-8 encoding explicitly when opening the file; otherwise the system default encoding may be used, and it may be a different one.
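Writing to a file as described above could look roughly like this. It is only a sketch: the function name `save_links` and the file name `links.txt` are made up for illustration. The important part is the explicit `encoding="utf-8"` argument to `open`.

```python
def save_links(links, path):
    """Write each link on its own line to a UTF-8 encoded text file."""
    # Without encoding="utf-8", open() uses the system default encoding,
    # which on Windows may be unable to represent some characters.
    with open(path, "w", encoding="utf-8") as f:
        for link in links:
            # str() also accepts BeautifulSoup tags and returns their markup.
            f.write(str(link) + "\n")
```

For the script in the question, the call would be `save_links(soup.findAll('a'), 'links.txt')`.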
