There is a dictionary file whose encoding is unknown. Here is a fragment, with the file open in vim:
As you can see, the transcription is displayed correctly. The task: parse the file and write it out to another file that is also readable, i.e. with the transcription preserved. I load this file in Python and read it line by line:
```python
# -*- coding: utf-8 -*-
...
hf = open(sys.argv[1])
for line in hf:
    ...
```

Something does load; I look in the debugger:
After writing "this" to another file, the transcription is lost: some characters remain readable but not as they should be, and others come out as non-printable bytes like <8a>, <97>, ^B, and so on. Notably, vim does not display the transcription in the new file correctly either: it shows either the wrong characters or an inverted question mark.
Fiddling with encoding and transcoding got me nowhere. Tools like ftfy and chardet did not help either; in particular, ftfy does replace the non-printable characters with printable ones, but with the wrong ones.
This is how the source file looks in HEX:
The transcription symbol is highlighted. What can be done about this?
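One way to narrow down an unknown encoding is to look at the first bytes of the file for a byte-order mark. A minimal sketch (the helper name `sniff_encoding` is my own, not from the original post):

```python
import codecs

def sniff_encoding(path):
    """Guess a file's encoding from its byte-order mark, if it has one."""
    with open(path, "rb") as f:
        head = f.read(4)
    # Check UTF-32 first: its LE BOM starts with the same bytes as the
    # UTF-16LE BOM (FF FE), so the order of the tests matters.
    if head.startswith(codecs.BOM_UTF32_LE):
        return "utf-32-le"
    if head.startswith(codecs.BOM_UTF32_BE):
        return "utf-32-be"
    if head.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if head.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    if head.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    return None  # no BOM; fall back to chardet or trial decoding
```

A BOM-less file still needs heuristics, but a leading FF FE in the hex dump (as in this case) is a strong sign of UTF-16LE.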
UPDATE:
I have since determined that the original encoding is UTF-16LE with a BOM. I didn't detect it myself; the Geany editor did )) But now I don't know what to do with it in Python. This:
```python
hf = os.open(file, encoding="utf-16LE")
```

does not help: an encoding exception is raised on every line that contains an awkward transcription character.
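For reference, `os.open` is the low-level OS wrapper and accepts no `encoding` argument at all; the built-in `open` does. A minimal sketch, assuming the file really is UTF-16 with a BOM (the helper name `read_lines` is mine):

```python
def read_lines(path):
    # The codec name "utf-16" (without an LE/BE suffix) reads the BOM and
    # strips it; "utf-16-le" would leave a stray \ufeff at the start of
    # the first line.
    with open(path, encoding="utf-16") as hf:
        return [line.rstrip("\n") for line in hf]
```

If the exception fires mid-file rather than on the first line, the file may contain stray bytes; `open(path, encoding="utf-16", errors="replace")` will get past them at the cost of replacement characters.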
I solved the problem by re-saving the file with Geany. A crutch, but I don't know how to solve it in a Pythonic way.
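The Geany re-save can be reproduced in Python: decode the source as UTF-16 and write it back out as UTF-8. A sketch (function name and encodings are placeholders, assuming the source is UTF-16 with a BOM):

```python
def reencode(src, dst, src_enc="utf-16", dst_enc="utf-8"):
    # Stream line by line so a large dictionary never sits fully in memory.
    with open(src, encoding=src_enc) as fin:
        with open(dst, "w", encoding=dst_enc) as fout:
            for line in fin:
                fout.write(line)
```

The resulting UTF-8 file keeps the transcription characters and opens readably in vim and Python alike.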



Comments:

`hf = open(sys.argv[1], encoding="utf-8")` - Mikhail Murugov

`import io` plus `hf = io.open(sys.argv[1], encoding="utf-8")`? - gil9red