Python Decoding bytes in UTF-8

Question

I have a function that reads a file in binary mode. I need to decode the file's contents as UTF-8. I tried to read the file with this code:

    def readTags(filepath):
        with open(filepath, 'rb') as f:
            byte = f.read()
            print(byte)
            while byte:
                byte = f.read()
                try:
                    print(byte.decode('utf-8'))
                except Exception as e:
                    continue

But the bytes are still printed in their escaped form, i.e.

 \xd0\xa1\xd0\xbf\xd0\xb0\xd1\x81\xd0\xb8\xd0\xb1\xd0\xbe \xd0\xb7\xd0\xb0 \xd0\xbf\xd0\xbe\xd0\xbc\xd0\xbe\xd1\x89\xd1\x8c

How can I convert these bytes to a string?

Obviously, replace print(byte) with print(byte.decode('utf-8')).
Then the script raises an error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
This means either that a different encoding is used, or that your bytes are not valid text and cannot be decoded at all.
However, the "standard form" you showed contains no 0xff, so there is a third possibility: you are leaving something out.
I'm trying to hide a message in an image by appending my text to the end of the file.
Well then, obviously the picture cannot be decoded, because a picture is not UTF-8 text.
You need to find a way to separate the picture from the text, take only the text, and decode just that part.
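One way to do that separation, as a minimal sketch: it assumes the message was appended after a delimiter you chose yourself when writing the file. The `MARKER` value and the `split_payload` helper are hypothetical names for illustration; nothing in the question says which layout is actually used.

```python
# Hypothetical delimiter written between the image data and the hidden text.
MARKER = b"--HIDDEN-TEXT--"


def split_payload(data, marker=MARKER):
    """Return (image_bytes, text) for data laid out as image + marker + text."""
    image, found, tail = data.rpartition(marker)
    if not found:
        raise ValueError("marker not found: no appended text in this data")
    # Only the tail is UTF-8 text; the image part must stay as raw bytes.
    return image, tail.decode("utf-8")


# Stand-in for real file contents: a few JPEG-like bytes, then the message.
fake_image = b"\xff\xd8\xff\xe0\x00\x10JFIF"
blob = fake_image + MARKER + "Спасибо за помощь".encode("utf-8")

image_part, message = split_payload(blob)
print(message)
```

With a real file, `blob` would come from `f.read()` on a file opened in 'rb' mode. The image bytes are never passed to `decode`, which is what avoids the UnicodeDecodeError on 0xff.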

extrn · Answer 1 · 2019-04-08T18:21:03

 byte = f.read()

in this case reads the entire file, so it is better to decode the whole file at once:

    with open(filepath, 'rb') as f:
        bs = f.read()
    print(bs.decode('utf-8'))

But in that case it is better to open the file in text mode right away:

    with open(filepath, 'r', encoding='utf-8') as f:
        cs = f.read()
    print(cs)

If, however, you really need to read the file one byte at a time, keep in mind that UTF-8 is a multibyte encoding: each character takes from one to four bytes. So the bytes still cannot be decoded strictly one by one; an incremental decoder buffers incomplete sequences for you:

    import codecs

    decoder = codecs.getincrementaldecoder('utf-8')()
    with open('utf.txt', 'rb') as f:
        while True:
            bt = f.read(1)
            if bt == b'':
                break
            print('byte:', bt, 'chars:', decoder.decode(bt))
    decoder.decode(b'', True)

    byte: b'\xd0' chars:
    byte: b'\xa1' chars: С
    byte: b'\xd0' chars:
    byte: b'\xbf' chars: п
    byte: b'\xd0' chars:
    byte: b'\xb0' chars: а
    byte: b'\xd1' chars:
    byte: b'\x81' chars: с
    byte: b'\xd0' chars:
    byte: b'\xb8' chars: и
    byte: b'\xd0' chars:
    byte: b'\xb1' chars: б
    byte: b'\xd0' chars:
    byte: b'\xbe' chars: о
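The same incremental decoder also works with larger reads, which is usually more practical than one byte at a time. The following is only a sketch under that assumption: `iter_decoded` and its `chunk_size` parameter are hypothetical names, and `io.BytesIO` stands in for a file opened in 'rb' mode.

```python
import codecs
import io


def iter_decoded(stream, chunk_size=4096, encoding="utf-8"):
    """Yield decoded text from a binary stream, chunk by chunk."""
    decoder = codecs.getincrementaldecoder(encoding)()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        # Bytes of a character split across chunks are buffered by the decoder.
        text = decoder.decode(chunk)
        if text:
            yield text
    # final=True raises UnicodeDecodeError if the stream ended mid-character.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail


# chunk_size=3 deliberately cuts through two-byte characters.
raw = "Спасибо".encode("utf-8")
result = "".join(iter_decoded(io.BytesIO(raw), chunk_size=3))
print(result)
```

Because the decoder keeps the incomplete bytes between calls, the chunk boundaries do not have to line up with character boundaries.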

S. Nick · Answer 2 · 2019-04-08T17:42:37

    data = b'\xd0\xa1\xd0\xbf\xd0\xb0\xd1\x81\xd0\xb8\xd0\xb1\xd0\xbe \xd0\xb7\xd0\xb0 \xd0\xbf\xd0\xbe\xd0\xbc\xd0\xbe\xd1\x89\xd1\x8c'
    print(data.decode())  # decode() defaults to UTF-8

    Спасибо за помощь

And if the text is not known in advance, how do I put a variable holding the text inside b''?
@Cucumber the byte variable in your code is already a bytes object, so nothing needs to be "inserted".
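To illustrate that reply: the b'' prefix is only the syntax for writing a bytes literal in source code. A value obtained at runtime is the same type and decodes the same way. A small sketch, with `literal` and `runtime` as illustrative names:

```python
# A bytes literal written directly in the source code.
literal = b'\xd0\xa1\xd0\xbf\xd0\xb0\xd1\x81\xd0\xb8\xd0\xb1\xd0\xbe'

# Bytes produced at runtime (here via str.encode; f.read() on a file
# opened in 'rb' mode returns the same type) need no b'' prefix.
runtime = "Спасибо".encode("utf-8")

print(type(runtime) is bytes)
print(runtime == literal)
print(runtime.decode("utf-8"))
```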
