I read that UTF-16 comes in two different byte orders (endiannesses) because two different byte orders exist in processor architectures:
systems compatible with x86 processors are called little-endian, while systems with m68k and SPARC processors are big-endian.
That is, the same number 0x12345678 is encoded by different byte sequences:

```
# little endian:
78 56 34 12
# big endian:
12 34 56 78
```
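To make the two layouts concrete, here is a minimal sketch in Python (struct.pack's `<` and `>` format prefixes select the byte order explicitly):

```python
import struct

n = 0x12345678
print(struct.pack('<I', n).hex(' '))  # little endian: 78 56 34 12
print(struct.pack('>I', n).hex(' '))  # big endian:    12 34 56 78
```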
Accordingly, when decoding a sequence of bytes back into a sequence of numbers (or Unicode code points), you must take into account the byte order that was used during encoding. (This is a somewhat amateurish way to put it, but I still cannot formulate it better.)
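For instance (a small sketch using Python's built-in utf-16-be and utf-16-le codecs), the same two bytes decode to completely different characters depending on the assumed byte order:

```python
data = bytes([0x04, 0x1F])
print(data.decode('utf-16-be'))  # 'П' (U+041F)
print(data.decode('utf-16-le'))  # 'ἄ' (U+1F04)
```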
For example, if we encode "Привет 😃" (Russian for "Hello 😃") in UTF-16:
```
 П     р     и     в     е     т    ( )     😃
# big endian:
04 1F 04 40 04 38 04 32 04 35 04 42 00 20 D8 3D DE 03
# little endian:
1F 04 40 04 38 04 32 04 35 04 42 04 20 00 3D D8 03 DE
```

It all seems obvious. We associate a code point with a certain number according to the encoding algorithm, and then write that number using the byte order accepted on the system.
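Both dumps can be reproduced with Python's standard codecs (a minimal sketch; the utf-16-be and utf-16-le codecs emit no BOM, so the output matches the tables above exactly):

```python
s = 'Привет 😃'
print(s.encode('utf-16-be').hex(' '))
# 04 1f 04 40 04 38 04 32 04 35 04 42 00 20 d8 3d de 03
print(s.encode('utf-16-le').hex(' '))
# 1f 04 40 04 38 04 32 04 35 04 42 04 20 00 3d d8 03 de
```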
Now UTF-8:
```
 П     р     и     в     е     т  ( )       😃
D0 9F D1 80 D0 B8 D0 B2 D0 B5 D1 82 20 F0 9F 98 83

# in binary:
11010000 10011111
11010001 10000000
11010000 10111000
11010000 10110010
11010000 10110101
11010001 10000010
# the first bit immediately shows that this code point is encoded with a single byte:
00100000
# and here the first byte starts with four ones, so there will be 3 trailing bytes:
11110000 10011111 10011000 10000011
```

The encoding algorithm has changed, but the processor architecture remains the same! We still get a number that takes from 1 to 4 bytes. Why doesn't it bother us with UTF-8 that the bytes might be written like this?
```
 П     р     и     в     е     т  ( )       😃
9F D0 80 D1 B8 D0 B2 D0 B5 D0 82 D1 20 83 98 9F F0
```
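Indeed, the encoded bytes come out the same on any platform, which is easy to check (a sketch; Python exposes only a single utf-8 codec, with no -be/-le variants, because UTF-8 output does not depend on the host byte order):

```python
import sys

s = 'Привет 😃'
print(sys.byteorder)               # 'little' on x86, 'big' on e.g. s390x
print(s.encode('utf-8').hex(' '))  # the same bytes on either architecture:
# d0 9f d1 80 d0 b8 d0 b2 d0 b5 d1 82 20 f0 9f 98 83
```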
Addition:

Asking this question, I already knew that UTF-8 uses single-byte code units and UTF-16 uses two-byte ones. I will try to clarify what I meant.
Take the symbol "😃". When it is encoded with the UTF-8 algorithm, the byte sequence F0 9F 98 83 is obtained. This is also a number, a four-byte word, and it can be used to compare or sort strings encoded in UTF-8 (although there is little use for such sorting). In the form above it has big-endian order, which means systems with a big-endian architecture could gain an advantage when working with it. But what about little-endian ones? How will the comparison work? For example, let's compare "😃" (F0 9F 98 83) and "😐" (F0 9F 98 90). I have two hypotheses:
- Big-endian systems work with characters encoded in UTF-8 as with 1-, 2-, 3-, or 4-byte words and gain an advantage in the speed of operations. That is, for them it is enough to compare F0 9F 98 83 and F0 9F 98 90 as four-byte words. Little-endian systems are forced to compare byte by byte, or to byte-swap both words first.
- Any architecture works with characters encoded in UTF-8 strictly as byte sequences, never using words larger than 1 byte. That is, pairs of bytes are compared: F0 and F0, 9F and 9F, 98 and 98, 83 and 90. The potential advantage of comparing two words at once is lost, but the algorithm works the same way on any architecture (see the sketch after this list).
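In either case the result is the same, because byte-wise lexicographic order of UTF-8 coincides with code point order. A small sketch (the int.from_bytes call stands in for "reading the word as big-endian" and is purely illustrative):

```python
a = '😃'.encode('utf-8')  # F0 9F 98 83
b = '😐'.encode('utf-8')  # F0 9F 98 90

# byte-by-byte lexicographic comparison (what memcmp would do):
print(a < b)  # True

# the same answer as comparing big-endian four-byte words:
print(int.from_bytes(a, 'big') < int.from_bytes(b, 'big'))  # True

# and both agree with code point order:
print(ord('😃') < ord('😐'))  # True
```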
The UTF-8 BOM is 0xEF, 0xBB, 0xBF, and it is optional. – Nick Volynkin ♦
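The bytes from that comment are easy to confirm (a sketch using Python's codecs module and the utf-8-sig codec, the variant that writes the optional BOM):

```python
import codecs

print(codecs.BOM_UTF8.hex(' '))           # ef bb bf
print('hi'.encode('utf-8-sig').hex(' '))  # ef bb bf 68 69
```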