How to determine low and high byte in UTF-16

Question

The vast majority of processors and communication channels use byte addressing. Given UTF-16 data, how can one determine which byte is the low byte and which is the high byte of a character? With UTF-8 this is very simple.

I did not mean the order of bytes in a symbol but its boundary in a byte array.
Obviously, the boundary will be at a multiple of 2 bytes from the beginning of the array, taking the byte order into account (which must be known in advance).
Do you have some unknown piece of memory, and in it you need to find a UTF-16 string?

Vyacheslav · Answer 1 · 2016-10-05T04:05:27

No. It is not possible to unambiguously determine character boundaries in UTF-16, and establishing the byte order requires a BOM. For example, the three characters D700, D700, D700, when read with a shift of one byte, give XXD7, 00D7, 00D7, which are also characters from the first range. You can certainly resort to analyzing the data, but the result will be ambiguous.
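The shift described above can be reproduced in a few lines of Python. This is only an illustrative sketch: the neighbouring bytes 0x41 and 0x42 are assumptions standing in for the unknown XX, and big-endian order is assumed for the original text.

```python
# Bytes D7 00 D7 00 D7 00 (three U+D700 in big-endian order), with one
# arbitrary neighbouring byte on each side. 0x41 and 0x42 are assumptions
# made for this illustration; the answer leaves them as XX.
buf = bytes.fromhex("41" + "D700" * 3 + "42")

aligned = buf[1:7].decode("utf-16-be")  # start on the true boundary
shifted = buf[0:6].decode("utf-16-be")  # start one byte too early

# Both reads succeed: nothing in the data marks the shifted one as wrong.
print([hex(ord(c)) for c in aligned])  # ['0xd700', '0xd700', '0xd700']
print([hex(ord(c)) for c in shifted])  # ['0x41d7', '0xd7', '0xd7']
```

Both decodes complete without error, which is why the boundary cannot be recovered from the bytes alone.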

Nick Volynkin ♦ · Answer 2 · 2017-01-09T19:35:59

You can use the same heuristic that is used to determine the byte order in the absence of a BOM: count spaces (0x0020). The space is the most common character in texts in most languages (I suspect that this is not the case for languages with hieroglyphic writing).

So, there are two parameters:

Whether we start reading from an even or an odd byte
Whether the byte order is direct or reversed

You need to go through all four combinations. The correct one is very likely the one that yields the most occurrences of 0x0020.
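A minimal sketch of this four-way check in Python might look as follows. The function name `guess_utf16_layout` and its return shape are hypothetical, chosen only for illustration, and the result is a guess, not a proof.

```python
def guess_utf16_layout(buf: bytes):
    """Try both start offsets and both byte orders; return the
    (offset, byte_order, space_count) with the most 0x0020 units."""
    # How the code unit 0x0020 looks in memory for each byte order.
    patterns = {"big": b"\x00\x20", "little": b"\x20\x00"}
    best = None
    for offset in (0, 1):
        for order, space in patterns.items():
            # Look only at 2-byte units aligned to this offset.
            count = sum(
                1
                for i in range(offset, len(buf) - 1, 2)
                if buf[i:i + 2] == space
            )
            if best is None or count > best[0]:
                best = (count, offset, order)
    return best[1], best[2], best[0]


# Example: little-endian text preceded by one stray byte.
sample = b"\xff" + "раз два три четыре".encode("utf-16-le")
print(guess_utf16_layout(sample))  # (1, 'little', 3)
```

Note that for text whose characters all have a zero high byte (plain ASCII, for instance), a little-endian reading at one offset and a big-endian reading at the neighbouring offset see the same spaces, so their counts tie and the heuristic cannot separate them.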
