How do you know the correspondence between uppercase and lowercase characters?

Question

In programming languages, there are string methods that work with character case. As a rule, you can:

Check whether a character is uppercase or lowercase.
Convert a string to all uppercase or all lowercase.

And then there are regular expressions with the case-insensitive flag:

# for example:
'/[aeiouy]/i'
# or:
'(?i:[aeiouy])'
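As a minimal illustration, here is how both forms might look in Python (chosen only as an example language; the sample strings are made up). The flag and the inline group behave the same way, and the matching also covers non-Latin letters:

```python
import re

# Case-insensitive matching via a flag and via an inline group.
pattern_flag = re.compile(r"[aeiouy]", re.IGNORECASE)
pattern_inline = re.compile(r"(?i:[aeiouy])")

text = "Apple ORANGE"  # hypothetical sample input
vowels_flag = pattern_flag.findall(text)
vowels_inline = pattern_inline.findall(text)
print(vowels_flag)
print(vowels_inline)

# The same works for Cyrillic: lowercase "я" in the pattern matches "Я".
cyrillic_match = re.fullmatch(r"(?i:я)", "Я") is not None
print(cyrillic_match)
```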

How does this even work? How do we know the correspondence between the characters?

The question arose while I was writing an answer to another question: A regular expression for determining a vowel or a consonant letter at the beginning of a line

@Grundy probably there is something similar in other encodings.
@Grundy, but for non-Unicode encodings there is, for example, a table of character-class bits (see man isupper and so on), plus a bunch of other external sources that can even change dynamically depending on the locale :)
@NickVolynkin, alas, I am not versed in this deeply enough to teach others.
The idea is certainly a good one; as they say, it is a fruitful opening idea.
If there is time, I'll figure it out and write (for now, I will remember this topic, marking it with an asterisk).

Accepted Answer · 2016-07-20T13:42:28

This information is an integral part of the Unicode standard.

Most of the information is in the UnicodeData.txt file. It is a table of values separated by semicolons (that is, almost CSV). From the documentation on its structure, we are interested in the following columns (there are 15 of them, numbered 0 to 14):

0 Code value. The character's code point as a hexadecimal number.
1 Character name.
12 Uppercase Mapping. The uppercase character that corresponds to this (lowercase) character.
13 Lowercase Mapping. The lowercase character that corresponds to this (uppercase) character.
14 Titlecase Mapping. The titlecase character that corresponds to this character. Titlecase is a special case form for characters that have a distinct spelling when they start a capitalized word whose remaining letters are lowercase. Example: ǲ (this is one character!).

Example lines:

 0410;CYRILLIC CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0430;
 0430;CYRILLIC SMALL LETTER A;Ll;0;L;;;;;N;;;0410;;0410

For А, the corresponding lowercase character is а. Accordingly, for а, the corresponding uppercase and titlecase character is А.
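Reading these mappings out of a record takes only a few lines. The following is a minimal Python sketch, not an official parser: the function name parse_case_mappings and the choice to represent an empty field as None are assumptions made for illustration, while the column indices come from the list above.

```python
# Column indices of the case mappings in a UnicodeData.txt record.
UPPER, LOWER, TITLE = 12, 13, 14


def parse_case_mappings(line):
    """Return (char, upper, lower, title) for one UnicodeData.txt record."""
    fields = line.strip().split(";")

    def to_char(hex_code):
        # An empty field means the record lists no mapping in that column.
        return chr(int(hex_code, 16)) if hex_code else None

    return (
        to_char(fields[0]),
        to_char(fields[UPPER]),
        to_char(fields[LOWER]),
        to_char(fields[TITLE]),
    )


capital = parse_case_mappings("0410;CYRILLIC CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0430;")
small = parse_case_mappings("0430;CYRILLIC SMALL LETTER A;Ll;0;L;;;;;N;;;0410;;0410")
print(capital)
print(small)
```

An empty mapping field is usually read as "the character maps to itself", so a real converter would fall back to the original character instead of None.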

And here is the aforementioned ǲ. All three of its mappings are different. (The line is broken for readability.)

 01F2;LATIN CAPITAL LETTER D WITH SMALL LETTER Z;
 Lt;0;L;<compat> 0044 007A;;;;N;;;01F1;01F3;01F2
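The same three mappings can be observed from a language whose string methods are built on the Unicode data. A small Python sketch, assuming only the standard str methods and the unicodedata module (the variable names are illustrative):

```python
import unicodedata

dz_title = "\u01F2"  # ǲ LATIN CAPITAL LETTER D WITH SMALL LETTER Z

# General category from column 2 of the record: "Lt" means titlecase letter.
category = unicodedata.category(dz_title)

# Code points of the three case forms, formatted as in UnicodeData.txt.
mappings = {
    "upper": f"{ord(dz_title.upper()):04X}",
    "lower": f"{ord(dz_title.lower()):04X}",
    "title": f"{ord(dz_title.title()):04X}",
}
print(category, mappings)
```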

All mappings in the UnicodeData.txt file are simple one-to-one mappings: a single character maps to a single character. The exceptions, where a character maps to more than one character, are listed in SpecialCasing.txt. Additional data for case conversion, used for case-insensitive comparison (case folding), is in CaseFolding.txt.
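Two well-known cases of a one-to-many mapping are ß and the ligature ﬁ. The sketch below uses Python's built-in str.upper() and str.casefold(), which apply full case mappings and case folding; the sample words are chosen only for illustration.

```python
sharp_s = "\u00DF"  # ß LATIN SMALL LETTER SHARP S
fi_ligature = "\uFB01"  # ﬁ LATIN SMALL LIGATURE FI

# Uppercasing expands one character into two.
sharp_s_upper = sharp_s.upper()
fi_upper = fi_ligature.upper()
print(sharp_s_upper, len(sharp_s_upper))
print(fi_upper, len(fi_upper))

# Case folding is meant for comparison, not display:
# both spellings fold to the same string.
same_word = "STRASSE".casefold() == "straße".casefold()
print(same_word)
```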

More information can be found in the Unicode FAQ "Character Properties, Case Mappings & Names".
