I have a function that reads a file in binary mode, and I need to decode its contents as UTF-8. I tried the following code to read the file.

    def readTags(filepath):
        with open(filepath, 'rb') as f:
            byte = f.read()
            print(byte)
            while byte:
                byte = f.read()
                try:
                    print(byte.decode('utf-8'))
                except Exception as e:
                    continue

But the bytes are still printed in their escaped form, i.e.

 \xd0\xa1\xd0\xbf\xd0\xb0\xd1\x81\xd0\xb8\xd0\xb1\xd0\xbe \xd0\xb7\xd0\xb0 \xd0\xbf\xd0\xbe\xd0\xbc\xd0\xbe\xd1\x89\xd1\x8c 

How can I convert these bytes to a string?

  • The obvious fix: replace print(byte) with print(byte.decode('utf-8')) - andreymal
  • Then the script fails with a syntax error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte - Cucumber
  • First, that is not a syntax error. Second, byte 0xff really is invalid in UTF-8, so either a different encoding is in use or your bytes are corrupt and cannot be decoded at all. Third, the output you showed contains no 0xff, which means you are leaving something out - andreymal
  • Let me describe the whole program. There is a picture, say in .png format. I'm trying to hide a message in the image by appending my text to the end of the file. When I read the file back, the picture comes out as a bytes object. I have searched the whole Internet and cannot find anything - Cucumber
  • Well then, obviously the picture cannot be decoded, because a picture is not UTF-8 text. You need to find a way to separate the picture from the text, extract only the text, and decode just that.
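The separation step from the last comment could look roughly like the sketch below. It assumes the message was simply appended after a well-formed PNG, and it relies on the PNG layout: an 8-byte signature followed by chunks, each made of a 4-byte big-endian length, a 4-byte type, the data, and a 4-byte CRC, with `IEND` as the final chunk. The function name `extract_trailing_text` is hypothetical, and the CRCs are not verified.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def extract_trailing_text(data: bytes, encoding: str = "utf-8") -> str:
    """Return whatever follows the IEND chunk, decoded as text.

    Assumes the text was appended after a well-formed PNG stream.
    """
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    pos = len(PNG_SIGNATURE)
    # Walk the chunks instead of searching for b"IEND", so that the
    # same four bytes inside image data or the message cannot mislead us.
    while pos + 8 <= len(data):
        length, chunk_type = struct.unpack(">I4s", data[pos:pos + 8])
        pos += 8 + length + 4  # chunk header + chunk data + CRC
        if chunk_type == b"IEND":
            return data[pos:].decode(encoding)
    raise ValueError("IEND chunk not found")
```

To use it, read the whole file in binary mode (`open(filepath, 'rb')`) and pass the result of `f.read()` to the function. Only the tail after the image is decoded, so the 0xff bytes inside the picture never reach the UTF-8 decoder.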

2 answers

 byte = f.read() 

in this case reads the entire file, so it is simpler to decode the whole file at once.

    with open(filepath, 'rb') as f:
        bs = f.read()
    print(bs.decode('utf-8'))

But in that case it is better to open the file in text mode right away:

    with open(filepath, 'r', encoding='utf-8') as f:
        cs = f.read()
    print(cs)

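If, however, you really need to read byte by byte, keep in mind that UTF-8 is a multibyte encoding: each character takes from one to four bytes. So the bytes cannot be decoded strictly one at a time and independently of each other; an incremental decoder keeps the unfinished bytes until the character is complete.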

    import codecs

    decoder = codecs.getincrementaldecoder('utf-8')()
    with open('utf.txt', 'rb') as f:
        while True:
            bt = f.read(1)
            if bt == b'':
                break
            # Returns '' until the bytes of a character are complete
            print('byte:', bt, 'chars:', decoder.decode(bt))
    # final=True: raises if the data ended in the middle of a character
    decoder.decode(b'', True)

Output:

    byte: b'\xd0' chars:
    byte: b'\xa1' chars: С
    byte: b'\xd0' chars:
    byte: b'\xbf' chars: п
    byte: b'\xd0' chars:
    byte: b'\xb0' chars: а
    byte: b'\xd1' chars:
    byte: b'\x81' chars: с
    byte: b'\xd0' chars:
    byte: b'\xb8' chars: и
    byte: b'\xd0' chars:
    byte: b'\xb1' chars: б
    byte: b'\xd0' chars:
    byte: b'\xbe' chars: о
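Reading one byte at a time is mostly useful for illustration. The same incremental decoder also works with larger chunks, because it holds back a partial character at a chunk boundary until the rest arrives. A minimal sketch, where the helper name `iter_decoded` and the chunk size are assumptions made for the example:

```python
import codecs
import io


def iter_decoded(stream, chunk_size=4096, encoding="utf-8"):
    """Yield decoded text from a binary stream, one chunk at a time."""
    decoder = codecs.getincrementaldecoder(encoding)()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        text = decoder.decode(chunk)
        if text:  # may be '' if the chunk held only part of a character
            yield text
    # Flush with final=True; raises UnicodeDecodeError on a truncated character
    tail = decoder.decode(b"", True)
    if tail:
        yield tail


# A chunk size of 3 deliberately splits the two-byte Cyrillic characters.
sample = io.BytesIO("Спасибо".encode("utf-8"))
print("".join(iter_decoded(sample, chunk_size=3)))
```

For a real file, pass the object returned by `open(path, 'rb')` instead of the `io.BytesIO` sample.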
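    # Named `data` rather than `bytes` so the built-in type is not shadowed
    data = b'\xd0\xa1\xd0\xbf\xd0\xb0\xd1\x81\xd0\xb8\xd0\xb1\xd0\xbe \xd0\xb7\xd0\xb0 \xd0\xbf\xd0\xbe\xd0\xbc\xd0\xbe\xd1\x89\xd1\x8c'
    print(data.decode())  # decode() defaults to UTF-8

Output:

    Спасибо за помощь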
    • Thanks. And if the text is not known in advance, how can I put a variable with the text inside b''? - Cucumber
    • @Cucumber the byte variable in your code is already "inserted" automatically - andreymal
    • @Cucumber you are feeding in the picture itself - Alexander
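On the question of getting a variable "inside b''": a `b'...'` literal is only one way to write a bytes object. Text held in a `str` variable becomes bytes through `str.encode`, and `bytes.decode` reverses it. A small sketch of the writing side, assuming the message is appended to the end of an existing file as described in the question; the function name `append_text` is hypothetical:

```python
def append_text(filepath, text, encoding="utf-8"):
    """Append the encoded text to the end of an existing file."""
    payload = text.encode(encoding)  # str -> bytes, no b'' literal needed
    with open(filepath, "ab") as f:  # "ab": append in binary mode
        f.write(payload)
    return payload
```

Calling `append_text(path, "Спасибо")` writes the bytes `b'\xd0\xa1\xd0\xbf\xd0\xb0\xd1\x81\xd0\xb8\xd0\xb1\xd0\xbe'` after the current contents of the file.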