Python: 'charmap' codec can't decode byte 0x98

Question

Good afternoon. I read a utf-8 file and print it to the console. When I try to print the letter "И", an error occurs:

    File "I:\ProgramFile\Anaconda\lib\encodings\cp1251.py", line 15, in decode
      return codecs.charmap_decode(input,errors,decoding_table)
    UnicodeDecodeError: 'charmap' codec can't decode byte 0x98 in position 1: character maps to <undefined>

Reproduced by this example:

    test_text_1 = "Задача\n"
    test_text_2 = "Итератор"
    file = open('temp.txt', 'w', encoding='utf-8')
    file.write(test_text_1)
    file.write(test_text_2)
    file.close()
    text = open('temp.txt', 'rb')
    for byte_code in text:
        print(byte_code.strip())
        test_text = byte_code.decode('cp1251')
        print(test_text.strip())

The first word is printed without an error, but the second one raises the error. I cannot find a way around the problem.

UPD: Apparently I described the problem too broadly. To narrow it down:

How do I convert "И" from utf-8 to cp1251? For "А" everything works, but for "И" it does not.

Code:

    byte1 = 'А'.encode('utf-8')
    byte2 = 'И'.encode('utf-8')
    print(byte1, byte2)
    test1 = byte1.decode('cp1251')
    print(test1)
    test2 = byte2.decode('cp1251')
    print(test2)

So far, judging by your code, you are trying to turn normal text into mojibake ("krakozyabry").
Converting text from Python's internal representation to cp1251 is very simple: 'текст'.encode('cp1251').
The result is bytes, which you can then use wherever you need them.
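
A minimal sketch of what the comments above describe, assuming the goal is to get cp1251 bytes from utf-8 bytes. In utf-8, "И" is the two bytes D0 98, and byte 0x98 has no character assigned in Python's cp1251 codec, which is why decoding fails. "А" is D0 90, which decodes without an error but produces the wrong characters. The helper name utf8_bytes_to_cp1251 is hypothetical and used only for illustration.

```python
def utf8_bytes_to_cp1251(data: bytes) -> bytes:
    """Re-encode utf-8 bytes as cp1251 (hypothetical helper for illustration)."""
    text = data.decode('utf-8')     # bytes -> str, using the encoding the bytes were written in
    return text.encode('cp1251')    # str -> bytes in the target encoding


utf8_a = 'А'.encode('utf-8')   # b'\xd0\x90'
utf8_i = 'И'.encode('utf-8')   # b'\xd0\x98'

# Decoding utf-8 bytes as cp1251 succeeds for "А" only by accident:
# the result is the two-character mojibake 'Рђ', not the original letter.
mojibake = utf8_a.decode('cp1251')

# For "И" the second byte, 0x98, is undefined in cp1251, so decoding raises.
try:
    utf8_i.decode('cp1251')
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# Decoding with the real source encoding first gives the expected single byte.
cp1251_i = utf8_bytes_to_cp1251(utf8_i)   # b'\xc8'
print(utf8_a, utf8_i, decode_failed, cp1251_i)
```

The design point is that bytes are never converted directly from one encoding to another: they are first decoded to str with the encoding they were written in, and only then encoded to the target encoding.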

Community spirit ♦ one · Answer 1 · 2016-04-25T20:48:45

To print a file containing utf-8 text to the console in Python (similar to type filename in cmd.exe):

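    #!/usr/bin/env python3
    import shutil
    import sys

    filename = 'temp.txt'  # assumed here: the file from the question

    # Text mode decodes the file as utf-8; sys.stdout encodes for the console.
    with open(filename, encoding='utf-8') as file:
        shutil.copyfileobj(file, sys.stdout)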

If you want to print Unicode characters that cannot be represented in the chcp encoding (the OEM code page), see "How can I output a Unicode string to a Windows console from Python?"
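
One possible workaround for characters the console encoding cannot represent is to replace them with escape sequences before printing. This is only a sketch and not necessarily what the linked question recommends. The helper name printable_for is hypothetical, and cp866 appears in the usage note only as an example of an OEM code page.

```python
import sys


def printable_for(text: str, encoding: str) -> str:
    """Return text with characters that `encoding` cannot represent
    replaced by backslash escapes (hypothetical helper)."""
    return text.encode(encoding, errors='backslashreplace').decode(encoding)


# Fall back to utf-8 if the stream does not report an encoding.
console_encoding = getattr(sys.stdout, 'encoding', None) or 'utf-8'
print(printable_for('Итератор ☃', console_encoding))
```

With cp866, for instance, the Cyrillic letters pass through unchanged and only the snowman is replaced by the literal text \u2603, so print no longer raises UnicodeEncodeError.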

Answer 2 · 2016-04-25T17:49:30

If your file is written in utf-8, then you need to decode it as utf-8:

    ...
    for byte in text:
        print(byte.strip())
        text = byte.decode('utf-8')
        print(text.strip())

Result:

    b'\xd0\x97\xd0\xb0\xd0\xb4\xd0\xb0\xd1\x87\xd0\xb0'
    Задача
    b'\xd0\x98\xd1\x82\xd0\xb5\xd1\x80\xd0\xb0\xd1\x82\xd0\xbe\xd1\x80'
    Итератор

When you write text to a file in some encoding, you turn the internal representation of the text into bytes in that encoding. To decode those bytes back into the internal representation correctly, you must specify the same encoding that was used when writing.
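
A minimal sketch of that round trip, assuming the two words from the question. It writes to a temporary directory instead of temp.txt so that nothing is left behind, and that choice is an assumption of this example, not part of the original code.

```python
import os
import tempfile

words = ["Задача", "Итератор"]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'temp.txt')

    # Encoding happens on write: str -> utf-8 bytes.
    with open(path, 'w', encoding='utf-8') as f:
        f.write("\n".join(words))

    # Decoding happens on read: the same encoding turns the bytes back into str.
    with open(path, encoding='utf-8') as f:
        read_back = [line.strip() for line in f]

    # The raw bytes on disk, read in binary mode without any decoding.
    with open(path, 'rb') as f:
        raw_lines = [line.strip() for line in f]

print(read_back == words)
print(raw_lines)
```

Opening the file in text mode with an explicit encoding lets Python do the decoding, so there is no need to read bytes and call decode by hand.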

I do not know about you, but when I decode as utf-8, it outputs: b'\xd0\x97\xd0\xb0\xd0\xb4\xd0\xb0\xd1\x87\xd0\xb0' b'\xd0\x98\xd1\x82\xd0\xb5\xd1\x80\xd0\xb0\xd1\x82\xd0\xbe\xd1\x80'
@Kavaru, regarding "when I decode in utf-8, it outputs": decoding goes from bytes encoded in utf-8 (or another encoding) to Python's internal representation. Encoding goes the other way, from the internal representation to bytes in the specified encoding.
