I read that UTF-16 has two different byte orders (endianness) because two different byte orders exist in processor architectures:

Systems built on x86-compatible processors are little-endian, while systems on m68k and SPARC processors are big-endian.

That is, the same number 0x1234ABCD is encoded by a sequence of bytes:

  • little endian: CD AB 34 12
  • big endian: 12 34 AB CD

Accordingly, when decoding a sequence of bytes into a sequence of numbers (or Unicode code points), the byte order used during encoding must be taken into account. (This is a somewhat amateurish way to put it, but I cannot formulate it better.)

For example, if we encode "Привет 😃" (Russian for "Hello 😃") in UTF-16:

 # big endian:     П     р     и     в     е     т   (space)   😃
 04 1F 04 40 04 38 04 32 04 35 04 42 00 20 D8 3D DE 03

 # little endian:  П     р     и     в     е     т   (space)   😃
 1F 04 40 04 38 04 32 04 35 04 42 04 20 00 3D D8 03 DE

So far everything seems obvious: we map a code point to a number according to the encoding algorithm, and then write that number down using the byte order adopted on the system.
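A minimal Python sketch reproducing these byte sequences (using the standard utf-16be / utf-16le codecs):

 >>> s = 'Привет 😃'
 >>> ' '.join(f'{b:02X}' for b in s.encode('utf-16be'))
 '04 1F 04 40 04 38 04 32 04 35 04 42 00 20 D8 3D DE 03'
 >>> ' '.join(f'{b:02X}' for b in s.encode('utf-16le'))
 '1F 04 40 04 38 04 32 04 35 04 42 04 20 00 3D D8 03 DE'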

Now UTF-8:

 # П  р  и  в  е  т  (space)  😃
 D0 9F D1 80 D0 B8 D0 B2 D0 B5 D1 82 20 F0 9F 98 83

 # in binary:
 11010000 10011111 11010001 10000000 11010000 10111000 11010000 10110010 11010000 10110101 11010001 10000010
 # the first bit immediately shows that this code point is encoded with a single byte:
 00100000
 # here the first byte starts with four ones, so there will be 3 trailing bytes:
 11110000 10011111 10011000 10000011

The encoding algorithm has changed, but the processor architecture is still the same! We still get a number that occupies from 1 to 4 bytes. Why doesn't it bother us that with UTF-8 the bytes might end up written like this?

 # П  р  и  в  е  т  (space)  😃
 9F D0 80 D1 B8 D0 B2 D0 B5 D0 82 D1 20 83 98 9F F0
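A minimal Python sketch of this thought experiment: the per-character byte reversal above would not just be "another byte order", it would not be valid UTF-8 at all, because 9F is a continuation byte and cannot start a character.

 data = 'Привет 😃'.encode('utf-8')
 print(' '.join(f'{b:02X}' for b in data))
 # D0 9F D1 80 D0 B8 D0 B2 D0 B5 D1 82 20 F0 9F 98 83

 # the hypothetical per-character byte reversal from above
 swapped = bytes.fromhex('9F D0 80 D1 B8 D0 B2 D0 B5 D0 82 D1 20 83 98 9F F0')
 try:
     swapped.decode('utf-8')
 except UnicodeDecodeError as e:
     print(e)   # invalid start byte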

Addition:

When asking this question, I already knew that UTF-8 uses single-byte code units and UTF-16 uses two-byte ones. I will try to clarify what I meant.

There is a character "😃". When it is encoded with the UTF-8 algorithm, we obtain the byte sequence F0 9F 98 83. This is also a number, a four-byte word, and it can be used to compare or sort strings encoded in UTF-8 (although there is little use for such sorting). In the form written above it is in big-endian order, which means systems with a big-endian architecture could gain an advantage when working with it. But what about little-endian ones? How would the comparison go? For example, let us compare "😃" (F0 9F 98 83) and "😐" (F0 9F 98 90). I have two hypotheses:

  1. Big-endian systems treat characters encoded in UTF-8 as 1-, 2-, 3-, or 4-byte words and gain a speed advantage: for them it is enough to compare F09F9883 and F09F9890 as four-byte words. Little-endian systems are forced to compare byte by byte or to reverse each word.
  2. Any architecture works with UTF-8-encoded characters strictly as byte sequences, never using units larger than 1 byte. That is, the bytes are compared pairwise: F0 and F0, 9F and 9F, 98 and 98, 83 and 90. The potential advantage of comparing two whole words is lost, but the algorithm works the same way on any architecture. (See the sketch after this list.)
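A small Python sketch of that comparison (bytes objects are compared lexicographically byte by byte, so the result does not depend on the machine's endianness):

 a = '😃'.encode('utf-8')   # F0 9F 98 83
 b = '😐'.encode('utf-8')   # F0 9F 98 90
 # byte-wise lexicographic comparison agrees with code point order here
 print(a < b, ord('😃') < ord('😐'))   # True True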
  • Because in UTF-8, as I understand it, we know exactly which byte is the first, and so on: it is a stream of bytes, not of two-byte values. - Harry
  • UTF-8 has a BOM (byte order mark), which is responsible for endianness; if it is absent at the beginning of the stream, some default applies, though I do not remember which. - etki
  • @Harry yes, one really can distinguish a leading byte from a trailing byte, and the number of trailing bytes can even be determined from the number of ones in the leading byte. But judging by the documentation, nothing of the sort happens at all: the order is always strictly big-endian. - Nick Volynkin
  • @Etki BOM for UTF-8 is 0xEF, 0xBB, 0xBF and it is optional. - Nick Volynkin
  • unicode.org/faq/utf_bom.html#bom5 @Harry is right - etki

5 answers

The point is that UTF-8 and UTF-16 text is usually kept in memory undecoded, in the same form in which it arrives in the stream (for example, from a file). [And if it is decoded, this consideration matters at the moment of decoding.]

Storage in itself, of course, does not create any problem. The problem is created by processing, for example, comparing characters.

In UTF-8, you read the input stream byte by byte and interpret the bytes sequentially. The resulting code point value is therefore unambiguous and does not depend on the machine's byte order: the conversion to a code point is uniquely determined, and that value is what gets used in comparisons.

But in UTF-16 you read the input stream two bytes at a time, and in the usual case you do not need to compute the code point at all for a comparison. If you have a two-byte word in the native byte order that is not part of a surrogate pair (and this is the main, most frequent case), you can simply use its value for the comparison: it is equal to its code point. But if the byte order is not native, you will have to swap the bytes.

If UTF-16 prescribed one specific order for the bytes of a two-byte word (that is, fixed the endianness), then the platforms for which that order is not native would lose out: they would have to do extra work (swap bytes) when reading and writing a stream. With two encoding variants, applications can use the format that is native on their platform and thereby gain speed.

Keeping bytes in memory in a non-native order is a bad idea: they turn out to be much more expensive to sort and compare. With the native order, the usual case only needs a check for a surrogate pair; with a non-native order, byte swapping is needed as well. For example, to compare 1C 55 and 1B 77 in the big-endian sense, a little-endian system cannot do without swapping bytes: if it compares without rearranging, it ends up comparing 0x551C with 0x771B and gets the wrong result. The same goes for sorting.
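A rough sketch of that comparison in Python, using struct to mimic how the words would be read as 16-bit values:

 import struct

 a = b'\x1c\x55'   # 1C 55 as it appears in a big-endian UTF-16 stream
 b = b'\x1b\x77'   # 1B 77
 # correct: interpret the words as big-endian
 print(struct.unpack('>H', a)[0] > struct.unpack('>H', b)[0])   # True: 0x1C55 > 0x1B77
 # wrong: a little-endian machine comparing without swapping
 print(struct.unpack('<H', a)[0] > struct.unpack('<H', b)[0])   # False: 0x551C < 0x771B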


Updating the answer for the updated question.

As far as I understand, when processing UTF-8 we do not know in advance how many bytes a particular character will occupy, so we are forced to work with a stream of bytes rather than a stream of native words. If we knew that every character were encoded with exactly four bytes, we could either compare them natively or, when the byte order is not the native one, copy both four-byte words into temporary variables, byte-swap them, and compare them with a native four-byte comparison. But this is also hampered by the fact that our four bytes sit at an arbitrary position in the stream, so they are most likely not aligned to a 4-byte boundary. On many architectures (though not on x86) such unaligned access is not allowed, and the bytes would have to be "dug out" piecemeal. So it is simpler and more efficient to just compare the bytes one by one.

In UTF-16, by the way, there are fewer cases to handle: a character consists either of one code unit, which can be compared natively (or with one swap if the byte order did not match), or of two (where it is probably again better to compare by two-byte words).

  • If everything is so simple and reading the input stream byte by byte is enough to avoid byte-order problems, why was it necessary to invent UTF-16BE and UTF-16LE? Just read bytes. Why was this obstacle insurmountable for UTF-16? - Nick Volynkin
  • @NickVolynkin: The idea behind UTF-16 is to work with two-byte "words". When reading, they could be rearranged into the native order, but that would be inefficient. - VladD
  • @NickVolynkin: And keeping bytes in a non-native order is inconvenient for purposes like comparing and sorting. - VladD
  • It seems we understand big endian differently. Isn't the order in it 12 34? - Nick Volynkin
  • @NickVolynkin: Well, if the processor has a particular endianness, then UTF-16 in that same endianness is simpler, more convenient and faster for it. So the developers chose what was more convenient for them (that is, the endianness native to their architecture). But once the format becomes cross-platform, other platforms have to read a non-native endianness. - VladD

Why is there a byte order problem in UTF-16, but not in UTF-8?

Because the code unit is 8 bits (one byte) in UTF-8 and 16 bits (two bytes) in UTF-16. Depending on the byte order inside the code unit, there are the utf-16le and utf-16be encodings, and both can be used on the same computer regardless of the CPU endianness:

 >>> 'я'.encode('utf-16le').hex()
 '4f04'
 >>> 'я'.encode('utf-16be').hex()
 '044f'

The character я (U+44F) is encoded in UTF-16 as a single 16-bit number: 1103 == 0x44f, which for utf-16 coincides with the character's number (its Unicode code point) for BMP characters. The 16-bit number itself can be represented in memory by two bytes: 4f 04 (low-order byte first) or 04 4f (high-order byte first).

 >>> 'я'.encode('utf-8').hex()
 'd18f'

я (U+44F) is encoded in UTF-8 using two 8-bit numbers: 209 == 0xd1 and 143 == 0x8f. In general, UTF-8 uses from 1 to 4 octets (8-bit numbers) per character (Unicode code point).

 >>> '😂'.encode('utf-16le').hex()
 '3dd802de'
 >>> '😂'.encode('utf-16be').hex()
 'd83dde02'
 >>> '😂'.encode('utf-8').hex()
 'f09f9882'

The character 😂 (U+1F602) is encoded in utf-16 using two 16-bit words (utf-16 code units): 0xd83d and 0xde02 (a utf-16 surrogate pair). How each word is represented as bytes depends on the chosen byte order (le, be), but the order of the words themselves does not change.

😂 (U+1F602) is encoded in utf-8 using four octets (utf-8 code units): 0xf0, 0x9f, 0x98, 0x82. The representation of an octet as an 8-bit byte obviously does not depend on any byte order (one octet is one byte).

The sequence of code units (octets for utf-8, 16-bit words for utf-16) used to encode a given character is uniquely determined by the chosen encoding; in particular, the order of the code units cannot be changed in either utf-16 or utf-8.


Both items in the addendum to your question are incorrect. Do not confuse how the result is presented as bytes when exchanging data with the outside world or between parts of a program (writing to disk, sending over the network, calling an API) with which instructions the CPU uses to process the data while executing a particular algorithm. The fact that the octets in the utf-8 result cannot be rearranged does not mean that the actual algorithms cannot operate on larger units. For example, memcpy() obviously preserves byte order, yet its implementation may work with whole words (for example, 64-bit words).
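A short sketch of the first point (assuming CPython; sys.byteorder reports the CPU byte order):

 import sys

 print(sys.byteorder)                   # 'little' or 'big', depending on the CPU
 # the encoded results are the same byte sequences on either kind of machine
 print('😂'.encode('utf-8').hex())      # 'f09f9882'
 print('😂'.encode('utf-16be').hex())   # 'd83dde02'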

  • I have clarified the question a little, please take a look. - Nick Volynkin
  • @NickVolynkin at least this should be clear from my answer: 1) the result of the encoding does not depend on the CPU endianness; the code samples keep working whatever CPU you have; 2) "the order of code units cannot be changed": for utf-8 the code unit is an octet, so you cannot change the order of the octets, and therefore you cannot change the order of the bytes (in C a byte cannot be smaller than 8 bits). - jfs
  • Thanks! Sorry for not replying to the last comment: I am buried in work and missing everything. I will be sure to read the addition on Monday. - Nick Volynkin

The endianness problem itself arises because the generally accepted ways of picturing values in processor registers and in memory differ.

Here is how the bytes in the processor register are usually numbered:

  [ 12 34 AB CD ]
  byte      3  2  1  0

The higher-order positions are written on the left, as is traditional in mathematics.

At the same time, the same mathematics has a tradition of drawing axes, segments and other sets from left to right. And everyone pictures memory as a long, endless array of bytes.

  [ ?? ?? ?? ?? ... ]
  address 0  1  2  3

And processor developers have two solutions:

  • store byte 0 of the value at address 0, byte 1 at address 1, and so on, sacrificing the nice visualization but gaining speed and simplicity for operations like "read the 0th byte from memory into the 0th byte of the register" (little-endian)
  • reverse the value when storing it, for debugging convenience (big-endian)

Within a single system there is no problem. The main thing is to write values into memory the same way you read them back, and then you need not think about endianness.
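A tiny Python illustration of the two layouts (int.to_bytes shows the bytes in order of increasing address):

 x = 0x1234ABCD
 print(x.to_bytes(4, 'big').hex())      # '1234abcd': the high byte sits at address 0
 print(x.to_bytes(4, 'little').hex())   # 'cdab3412': the low byte sits at address 0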


And everything goes fine until you need to transfer the string to another computer (directly, over the network, or indirectly, as a file). The creators of network protocols agreed in advance on how individual bytes are transferred. That is, if you send bytes 1, 2, 3, 4, 5, 6 from an x86 machine over the network, any other machine will receive them in the order 1, 2, 3, 4, 5, 6. This is hard-wired into the standards at every level, from TCP/IP down to Ethernet.

But there is no such agreement about transmitting pairs or quadruples of bytes.
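A small sketch of why the order of a multi-byte word has to be spelled out explicitly whenever one is transmitted (in Python's struct, '!' forces big-endian "network order", while '=' uses whatever the machine does natively):

 import struct

 print(struct.pack('!H', 0x041F).hex())   # '041f' on any machine
 print(struct.pack('=H', 0x041F).hex())   # '1f04' on a little-endian machine, '041f' on a big-endian one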


UTF-8 works with a stream of bytes. Suppose you want to write "Привет" to disk or send it over the network. From the encoder's point of view this looks like:

  • pass D0
  • pass 9F
  • pass D1
  • pass 80 ...

The receiving side is guaranteed (by the standard!) to read them in the same order: D0 9F D1 80 ...

Again, when writing one byte at a time to memory, no reversal occurs, and in memory this value is laid out as

 [ D0 9F D1 80 ] 

This works out only because in Russian (and English) letters are customarily written from left to right, which happens to coincide with the conventional way of picturing memory.

Therefore a UTF-8 string is simply written to memory, and it is already ready for transmission. This is the result of the agreements at the level of network and disk protocols: they, too, work at the byte level, not at the bit level.
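A minimal sketch of that round trip (the file name is just an example):

 data = 'Привет'.encode('utf-8')
 print(' '.join(f'{b:02X}' for b in data))   # D0 9F D1 80 D0 B8 D0 B2 D0 B5 D1 82

 # writing and reading back is a byte-for-byte copy; nothing gets reordered
 with open('hello.txt', 'wb') as f:
     f.write(data)
 with open('hello.txt', 'rb') as f:
     assert f.read().decode('utf-8') == 'Привет'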


OK, now we want to send the same "Привет", but in UTF-16:

 0x041F 0x0440 0x0438 0x0432 0x0435 0x0442 

The UTF-16 encoder does not concern itself with byte order and hands these two-byte words over to the network / memory / disk one word at a time, expecting the first word to be transmitted or recorded first, the second one second, and so on.

  • pass 0x041F
  • pass 0x0440
  • pass 0x0438

How they get recorded or transmitted depends on the endianness. On a little-endian machine the processor does not bother:

  [ 1F, 04 ] [ 40, 04 ] [ 38, 04 ]
  address 0   1    2   3    4   5

On a big-endian machine, it diligently turns each word around:

  [ 04, 1F ] [ 04, 40 ] [ 04, 38 ]
  address 0   1    2   3    4   5

There is no agreement about transmitting two-byte words over the network, nor about storing them on disk. That is why UTF-16 needs a BOM.
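A small Python sketch of what the BOM buys the reader (the decoder picks the byte order from the first two bytes and strips them):

 be = b'\xfe\xff' + 'Привет'.encode('utf-16be')   # FE FF = big-endian BOM
 le = b'\xff\xfe' + 'Привет'.encode('utf-16le')   # FF FE = little-endian BOM
 print(be.decode('utf-16'))   # Привет
 print(le.decode('utf-16'))   # Привет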


In fact, the same problem exists at the bit level with UTF-8 (and indeed when transferring any bytes anywhere). Say you want to send the byte D0 over a network. It is 11010000. Will you transmit it as 0, 0, 0, 0, 1, 0, 1, 1? Or as 1, 1, 0, 1, 0, 0, 0, 0?

You do not face this problem for several reasons:

  • there is no need to visualize individual bits in storage;
  • access to the actual physical storage format is hidden (neither memory nor disk lets you address and read individual bits);
  • strict standardization: each network protocol agrees on the order of bit transmission within a byte in advance, which lets you work at the byte level.

Take any transmission method where the order of bits is visible (for example, try building a piece of hardware that talks over a COM port) and the problem shows up immediately.

  • I have clarified the question a little, please take a look. - Nick Volynkin

That is, the same number 0x1234ABCD is encoded by a sequence of bytes:

  • little endian: CD AB 34 12
  • big endian: 12 34 AB CD

Accordingly, when decoding ...

There is no decoding. There are two representations of numbers in memory.

http://ideone.com/wsOkXK

 #include <cstdio>

 int main() {
     volatile int x = 0x1234ABCD;
     const unsigned char *p = (const unsigned char *)&x;
     // the value as a number
     printf("%0*X\n", (int)(sizeof(int) * 2), x);
     // the same value byte by byte, in memory order
     for (unsigned q = 0; q < sizeof(int); ++q)
         printf("%02X", p[q]);
     printf("\n");
     return 0;
 }
 1234ABCD
 CDAB3412

With UTF-16 people really wanted to be able to say "read it into memory exactly as written in the file and treat it as an array of int16", and that gives two possible layouts. In UTF-8 the characters have different lengths and do not map onto integer types at all, so there was never a reason for two representations: they are just bytes. And anyway, parsing a variable-length sequence from the end would be some kind of perversion :)
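A rough illustration of that "read it straight into an int16 array" idea in Python (assuming the bytes came from a UTF-16LE file; on a big-endian machine the array would need an extra byte-swapping pass):

 import array, sys

 a = array.array('H')                        # native-endian 16-bit code units
 a.frombytes('Привет'.encode('utf-16le'))    # the bytes exactly as a UTF-16LE file stores them
 if sys.byteorder == 'big':
     a.byteswap()                            # non-native order: one extra pass over the data
 print([hex(u) for u in a])   # ['0x41f', '0x440', '0x438', '0x432', '0x435', '0x442']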

  • I have clarified the question a little, please take a look. - Nick Volynkin
  • "And in UTF-8 the characters have different lengths and do not map onto integer types at all" - this sounds like an answer to my addition. That is, it is algorithmically unprofitable, for example, to map every character onto a four-byte integer? Or to map each onto the minimum sufficient size and then end up comparing, say, a one-byte value with a two-byte one? - Nick Volynkin
  • @NickVolynkin, you can map them, but if you store a single-byte character in a 4-byte integer, you still have to write 1 byte to the file, not 4; that is, simply dumping memory as-is will not work. - Qwertiy
  • @NickVolynkin mapping onto a four-byte integer is possible, but that is already UTF-32. - Pavel Mayorov

In both cases the numbers that encode a character are fetched from memory in the same order. In the case of UTF-8 these numbers are single bytes, and the question of high versus low bytes obviously never arises. With UTF-16 it turns out that the order in which the bytes are written to memory differs between architectures: on Intel architecture the low byte comes first, while on big-endian architectures (such as the SPARC or m68k mentioned in the question) the high byte comes first.