Programming languages provide string methods that deal with the case of characters. As a rule, you can:

  • Check whether a character is uppercase or lowercase.
  • Convert a string to all uppercase or all lowercase.

And then there are regular expressions with the case-insensitive flag:

 # for example:
 '/[aeiouy]/i'
 # or like this:
 '(?i:[aeiouy])'
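
To make the two forms above concrete, here is a minimal sketch of how they might look in Python's `re` module. The helper name `starts_with_vowel` and the sample words are my own, purely for illustration; the Cyrillic pattern is there only to show that the flag is not limited to ASCII.

```python
import re

# Two ways to request case-insensitive matching in Python's re module.
flagged = re.compile(r"[aeiouy]", re.IGNORECASE)  # like /[aeiouy]/i
inline = re.compile(r"(?i:[aeiouy])")             # scoped inline flag (Python 3.6+)

# For str patterns the flag also covers non-ASCII letters.
cyrillic = re.compile("[\u0430\u0435\u0438\u043e\u0443]", re.IGNORECASE)  # а е и о у


def starts_with_vowel(word, pattern=flagged):
    """Hypothetical helper: True if the first character of `word` matches `pattern`."""
    return pattern.match(word) is not None


print(starts_with_vowel("Apple"))
print(starts_with_vowel("apple", inline))
print(starts_with_vowel("Pear"))
print(starts_with_vowel("\u0410\u043d\u043d\u0430", cyrillic))  # "Анна"
```

The last call only matches because the regex engine knows that U+0410 and U+0430 are the same letter in different cases, which is exactly the correspondence the question asks about.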

How does this even work? How do we know the correspondence between the characters?


This question came up while writing an answer to another question: "A regular expression for determining a vowel or a consonant letter at the beginning of a line".

  • and for non-Unicode? :-) - Grundy
  • @Grundy probably there is something similar in other encodings. - Nick Volynkin
  • @Grundy, but for non-Unicode there is, for example, a bit table of character attributes (man isupper, etc.), and a bunch of other external sources that can even change dynamically depending on the locale :) - PinkTux
  • @avp would you like to write an answer about locales? - Nick Volynkin
  • @NickVolynkin, alas, I am not versed in this deeply enough to teach others. The idea is certainly a good one and, as they say, "a fruitful opening idea". If I find the time, I'll figure it out and write something (for now, I'll keep this topic in mind by starring it). - avp

1 answer

This information is an integral part of the Unicode standard.

Most of the information is in the UnicodeData.txt file. It is a table of values separated by semicolons (i.e. almost CSV). According to the documentation on its structure, it has 15 columns, numbered 0 to 14. We are interested in the following ones:

  • 0 Code value. The character's code point as a hexadecimal number.
  • 1 Character name.
  • 12 Uppercase Mapping. The corresponding uppercase character.
  • 13 Lowercase Mapping. The corresponding lowercase character.
  • 14 Titlecase Mapping. The corresponding titlecase character. Titlecase is a separate case for characters that have a special form when they are the capital letter of a word otherwise written in lowercase. Example: ǲ (this is one character!).

Example records:

 0410;CYRILLIC CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0430;
 0430;CYRILLIC SMALL LETTER A;Ll;0;L;;;;;N;;;0410;;0410

For А, the corresponding lowercase character is а. Accordingly, for а, the corresponding uppercase and titlecase character is А.
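
As a rough illustration of how a program could read these columns, here is a small Python sketch that parses the two sample records above. The function name `parse_case_mappings` and the constant names are hypothetical; the field indexes are the ones from the column list, and a real implementation would read the whole UnicodeData.txt file rather than an inline string.

```python
# Field indexes taken from the column list in the answer.
CODE, NAME, UPPER, LOWER, TITLE = 0, 1, 12, 13, 14

# The two sample records quoted in the answer.
SAMPLE = """\
0410;CYRILLIC CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0430;
0430;CYRILLIC SMALL LETTER A;Ll;0;L;;;;;N;;;0410;;0410
"""


def parse_case_mappings(text):
    """Map each character to its (upper, lower, title) mappings; None where a field is empty."""

    def to_char(field):
        # Mapping fields hold a hex code point or are empty.
        return chr(int(field, 16)) if field else None

    table = {}
    for line in text.splitlines():
        fields = line.split(";")
        if len(fields) != 15:
            continue  # skip blank or malformed lines
        char = chr(int(fields[CODE], 16))
        table[char] = tuple(to_char(fields[i]) for i in (UPPER, LOWER, TITLE))
    return table


mappings = parse_case_mappings(SAMPLE)
for char, (upper, lower, title) in mappings.items():
    print(f"U+{ord(char):04X}: upper={upper!r} lower={lower!r} title={title!r}")
```

An empty field means the character has no mapping of that kind, so the capital А has only a lowercase mapping, and the small а has uppercase and titlecase mappings.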

And here is the aforementioned ǲ. All three of its mappings are different. (The line is wrapped for readability.)

 01F2;LATIN CAPITAL LETTER D WITH SMALL LETTER Z;
     Lt;0;L;<compat> 0044 007A;;;;N;;;01F1;01F3;01F2
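
The three mappings of U+01F2 can also be observed from a language runtime. The sketch below uses Python's built-in string methods and the standard `unicodedata` module, on the assumption that they are driven by the same Unicode data. The constant name `DZ_TITLE` is just an illustrative label.

```python
import unicodedata

DZ_TITLE = "\u01F2"  # LATIN CAPITAL LETTER D WITH SMALL LETTER Z (titlecase form)

print(unicodedata.name(DZ_TITLE))
print(unicodedata.category(DZ_TITLE))  # general category, column 2 of the record

# Each conversion should land on a different code point.
upper_cp = f"{ord(DZ_TITLE.upper()):04X}"
lower_cp = f"{ord(DZ_TITLE.lower()):04X}"
title_cp = f"{ord(DZ_TITLE.title()):04X}"
print(upper_cp, lower_cp, title_cp)
```

The three printed code points correspond to columns 12, 13 and 14 of the record.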

All mappings in the UnicodeData.txt file are simple, that is, one character maps to exactly one character. For the exceptions, where the result is more than one character, there is SpecialCasing.txt. Additional information for case conversions is in CaseFolding.txt.
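
A well-known example of a mapping that does not fit the one-to-one scheme is the German ß. The short Python sketch below shows it; as far as I know, Python's `str.upper()` and `str.casefold()` apply the full mappings, but treat that as an assumption about the runtime rather than something the Unicode files themselves guarantee.

```python
sharp_s = "\u00DF"  # LATIN SMALL LETTER SHARP S

upper = sharp_s.upper()      # one character becomes two
folded = sharp_s.casefold()  # case folding, intended for caseless comparison

print(len(sharp_s), len(upper), upper)
print(folded)

# Case folding makes the two spellings compare equal.
same = "Stra\u00DFe".casefold() == "STRASSE".casefold()
print(same)
```

This is why a case-insensitive comparison is usually done on case-folded strings rather than on lowercased or uppercased ones.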

More information can be found in the Unicode FAQ: Character Properties, Case Mappings & Names.