The vast majority of processors and communication channels use byte addressing. For a UTF-16 code unit in a byte stream, how can one determine whether a given byte is its low byte or its high byte? In UTF-8 this is very simple.

  • The easiest way is by the BOM. - PashaPash ♦
  • I did not mean the order of bytes within a character, but the character's boundary in a byte array. - Vyacheslav
  • Obviously, it will be at a multiple of 2 bytes from the beginning of the array, taking the byte order into account (which must be known in advance). What is the actual problem? Do you have some unknown piece of memory in which you need to find a UTF-16 string? - PashaPash ♦
  • And if the beginning is unknown? There is no problem with ASCII and UTF-8. - Vyacheslav
  • If the beginning is unknown - it means no luck. - PashaPash ♦

2 answers

No. It is not possible to unambiguously determine character boundaries in UTF-16, and establishing the byte order requires a BOM. For example, the three characters D700, D700, D700, when shifted by one byte, give XXD7, 00D7, 00D7, which are also valid characters from the first range. You can certainly resort to analyzing the data, but the result will be ambiguous.
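The ambiguity above can be reproduced in a few lines. This is only an illustrative sketch: it assumes big-endian UTF-16 without a BOM and models the shift by starting one byte late (dropping the trailing odd byte), so the unknown XX byte does not appear.

```python
# Three copies of U+D700 as big-endian UTF-16, no BOM: D7 00 D7 00 D7 00
encoded = "\ud700\ud700\ud700".encode("utf-16-be")

# Correctly aligned read
aligned = encoded.decode("utf-16-be")

# Start one byte late and drop the trailing odd byte: 00 D7 00 D7
shifted = encoded[1:-1].decode("utf-16-be")

print([hex(ord(c)) for c in aligned])  # ['0xd700', '0xd700', '0xd700']
print([hex(ord(c)) for c in shifted])  # ['0xd7', '0xd7']
```

Both reads decode without any error, so the decoder itself gives no hint about which alignment is the right one.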

    You can use the same heuristic that is used to determine the byte order when there is no BOM: count spaces ( 0x0020 ). The space is the most common character in texts in most languages (I suspect this is not the case for languages with hieroglyphic writing).

    So, there are two parameters:

    • Whether we start reading at an even or an odd byte
    • Whether the byte order is direct or reversed

    You need to go through all four options. The correct one is very likely the one that yields the most occurrences of 0x0020 .
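The four-way search described above could be sketched roughly as follows. The function name `guess_utf16_alignment` and its return shape are made up for this illustration, and the space count is only a heuristic, not a reliable detector.

```python
def guess_utf16_alignment(buf: bytes):
    """Try both start offsets and both byte orders; return the reading
    with the most 0x0020 code units as (offset, byteorder, space_count)."""
    best = None
    for offset in (0, 1):                    # start at an even or an odd byte
        for byteorder in ("big", "little"):  # the two possible byte orders
            count = 0
            for i in range(offset, len(buf) - 1, 2):
                if int.from_bytes(buf[i:i + 2], byteorder) == 0x0020:
                    count += 1
            # Ties keep the earlier candidate, so a tie means "undecided".
            if best is None or count > best[2]:
                best = (offset, byteorder, count)
    return best


# Hypothetical sample: one stray byte followed by little-endian UTF-16 text.
sample = b"\x99" + "привет мир как дела".encode("utf-16-le")
print(guess_utf16_alignment(sample))  # (1, 'little', 3)
```

Note that for pure ASCII text the heuristic cannot separate all four readings: every second byte is 0x00, so reading little-endian text one byte off as big-endian (or the other way round) yields almost the same characters and the same number of spaces. In that case the counts tie and additional analysis is needed.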