I've just started a project and want to follow best practices from the start.

I know I should use either utf8_general_ci or utf8_unicode_ci, and I'm leaning toward utf8_unicode_ci.

But I came across information that these are already somewhat out of date and that it is better to use utf8mb4_general_ci or utf8mb4_unicode_ci.

Which encoding should I choose for the database?

  • stackoverflow.com/a/766996/5441700 Bottom line: use utf8mb4_unicode_ci. The connection must also be in utf8mb4 mode. - Visman
  • Thanks, got it. - Dilun7495
  • +1 for utf8mb4; modern frameworks are gradually moving to it. For example, Laravel 5.4 switched to utf8mb4 and noted that it "supports emoji storage in the database". Where would the modern world be without emoji, eh? )) Even GitHub added them. The question is very popular, so perhaps I will translate it into Russian. - AK

1 answer

Free translation of the question: What's the difference between utf8_general_ci and utf8_unicode_ci?

Both of these collations (utf8_general_ci and utf8_unicode_ci) work with UTF-8 characters; the difference is in how they sort and compare strings.

Note: since MySQL 5.5.3 it is preferable to use utf8mb4 rather than utf8. Both refer to the UTF-8 encoding, but the older utf8 has a MySQL-specific limitation that prevents it from storing characters above 0xFFFD (that is, characters that need four bytes, such as emoji).
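To make the note above concrete, here is a minimal Python sketch that checks whether a string contains characters needing four bytes in UTF-8 (code points above U+FFFF), which the legacy utf8 character set cannot store. The helper names `needs_utf8mb4` and `utf8_byte_lengths` are hypothetical, made up for this illustration; nothing here talks to MySQL.

```python
def needs_utf8mb4(text: str) -> bool:
    """Return True if any character lies outside the Basic Multilingual Plane.

    Such characters take 4 bytes in UTF-8, so they do not fit into
    MySQL's legacy 3-byte utf8 character set.
    """
    return any(ord(ch) > 0xFFFF for ch in text)


def utf8_byte_lengths(text: str) -> list:
    """Return (character, UTF-8 byte count) pairs for each character."""
    return [(ch, len(ch.encode("utf-8"))) for ch in text]


if __name__ == "__main__":
    samples = ["hello", "Привет", "Straße", "smile \U0001F600"]
    for sample in samples:
        print(repr(sample), needs_utf8mb4(sample))
```

Only the last sample, which contains the emoji U+1F600, is flagged; Cyrillic and "ß" fit into two bytes per character and are fine in either character set.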

Comparison of individual parameters.

Accuracy

  • utf8mb4_unicode_ci is based on the Unicode standard for sorting and comparing strings, which sorts accurately across a wide range of languages and alphabets.

  • utf8mb4_general_ci does not implement all of the Unicode sorting rules, which leads to undesirable results in some situations for certain languages or characters.

Performance

  • utf8mb4_general_ci is faster at comparing and sorting because it takes a number of performance shortcuts.

    On modern servers this speed gain is still there, but it is very small. The optimizations were devised at a time when servers were far less powerful than they are today.

  • utf8mb4_unicode_ci, which uses the Unicode rules for sorting and comparing, relies on more sophisticated algorithms to sort correctly across a wide range of languages and when special characters are involved. These rules take language-specific conventions into account; sorting does not always follow what we would call "alphabetical" order.

For the so-called "European" languages there is not much difference between strict Unicode sorting and the simplified sorting of utf8mb4_general_ci, but there are still a few differences:

For example, the Unicode collation sorts "ß" like "ss" and "Œ" like "OE", as people who use those characters would expect, while utf8mb4_general_ci sorts them as single characters (presumably like "s" and "e" respectively).

Some Unicode characters are defined as ignorable, which means that they should not affect the sort order and the comparison should move on to the next character. utf8mb4_unicode_ci handles these characters correctly.

For non-European languages, such as Asian languages or languages with a different alphabet, there are many more differences between Unicode sorting and the simplified sorting of utf8mb4_general_ci. How well utf8mb4_general_ci works depends on the particular language. For some languages it may be quite inadequate.
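The two behaviours described above (expanding "ß" to "ss" and skipping ignorable characters) can be imitated in a few lines of Python. This is only a rough sketch of the idea: it is not the Unicode Collation Algorithm and not how MySQL implements utf8mb4_unicode_ci. The function name `rough_compare_key` is hypothetical, and treating the Unicode category "Cf" (format characters) as "ignorable" is an assumption made for the illustration.

```python
import unicodedata


def rough_compare_key(text: str) -> str:
    """Build a crude comparison key for case-insensitive matching.

    1. Drop format characters (Unicode category "Cf"), such as the
       soft hyphen U+00AD, so they do not influence the comparison.
    2. Apply full case folding, which maps "ß" to "ss".
    """
    kept = "".join(ch for ch in text if unicodedata.category(ch) != "Cf")
    return kept.casefold()


if __name__ == "__main__":
    # "ß" compares equal to "ss" after case folding.
    print(rough_compare_key("Straße") == rough_compare_key("STRASSE"))
    # A soft hyphen inside a word is skipped.
    print(rough_compare_key("co\u00adop") == rough_compare_key("coop"))
```

Python's case folding does not expand "Œ" to "oe", so this sketch covers only part of what a real Unicode collation does; for actual sorting, rely on the database collation itself.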

What to use?

It makes little sense to prefer utf8mb4_general_ci for performance reasons, because on modern processors the difference will not be a bottleneck.

There may be a measurable performance difference in some highly specialized situations; if that applies to you, you should be aware of it.

Previously, some experts recommended using utf8mb4_general_ci except where accurate sorting was important enough to justify the performance cost. Today, accurate internationalization support matters more than a slight drop in performance.

I'll add that even if your application only needs to support English, users may still enter people's names, and names often contain characters found in other languages, so it is important to use correct sorting rules. Using Unicode rules wherever possible will help you build better applications.

  • Fine. But along the way a question came up: I can't set this encoding in PHP: setlocale (LC_ALL, 'en_RU.utf8mb4'); mb_internal_encoding ('UTF-8 mb4'); What should I do? - Dilun7495
  • @Dilun7495, the functions for working with multibyte strings handle 4-byte UTF-8 fine with mb_internal_encoding('UTF-8'); You can also issue further commands: mb_internal_encoding('UTF-8'); mb_http_output('UTF-8'); mb_http_input('UTF-8'); mb_language('uni'); mb_regex_encoding('UTF-8'); ;) setlocale(LC_ALL, 'ru_RU.UTF-8'); should work similarly (but there may be a catch: different OSes may require different locale names). It is just that MySQL's UTF-8 used to be truncated, and now they have made it more complete (4-byte). - Visman
  • Visman, thank you. I will try - Dilun7495