This information is an integral part of the Unicode standard.
Most of the information is in the UnicodeData.txt file. It is a table of semicolon-separated values (i.e. almost CSV). From the documentation on its structure, we are interested in the following columns (there are 15 of them, numbered 0 to 14):
- 0 Code value. The character's code point as a hexadecimal number.
- 1 Character name.
- 12 Uppercase Mapping. The uppercase character corresponding to this (lowercase) character.
- 13 Lowercase Mapping. The lowercase character corresponding to this (uppercase) character.
- 14 Titlecase Mapping. The corresponding titlecase character. Titlecase is a special case for letters that have a distinct form when they start a word whose remaining letters are lowercase. Example: ǲ (this is one character!).
Example lines:
0410;CYRILLIC CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0430;
0430;CYRILLIC SMALL LETTER A;Ll;0;L;;;;;N;;;0410;;0410
For А the corresponding lowercase character is а. Accordingly, for а the corresponding uppercase and titlecase character is А.
And here is the aforementioned ǲ. All three of its mappings are different:
01F2;LATIN CAPITAL LETTER D WITH SMALL LETTER Z;Lt;0;L;<compat> 0044 007A;;;;N;;;01F1;01F3;01F2
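To make the column layout concrete, here is a minimal Python sketch that pulls columns 0, 1, 12, 13 and 14 out of the three example lines. The names `CaseRecord`, `parse_line` and `_to_char` are made up for illustration, and the sketch assumes every line has exactly 15 fields; it is not a full UnicodeData.txt parser.

```python
from typing import NamedTuple, Optional

# Column indices in UnicodeData.txt, as listed in the answer.
CODE, NAME, UPPER, LOWER, TITLE = 0, 1, 12, 13, 14


class CaseRecord(NamedTuple):  # hypothetical helper type
    char: str
    name: str
    upper: Optional[str]
    lower: Optional[str]
    title: Optional[str]


def _to_char(hex_code: str) -> Optional[str]:
    """Turn a hex code point such as '0430' into a character; empty field -> None."""
    return chr(int(hex_code, 16)) if hex_code else None


def parse_line(line: str) -> CaseRecord:
    """Split one semicolon-separated record and keep only the case columns."""
    fields = line.strip().split(";")
    if len(fields) != 15:
        raise ValueError(f"expected 15 fields, got {len(fields)}")
    return CaseRecord(
        char=chr(int(fields[CODE], 16)),
        name=fields[NAME],
        upper=_to_char(fields[UPPER]),
        lower=_to_char(fields[LOWER]),
        title=_to_char(fields[TITLE]),
    )


SAMPLE = """\
0410;CYRILLIC CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0430;
0430;CYRILLIC SMALL LETTER A;Ll;0;L;;;;;N;;;0410;;0410
01F2;LATIN CAPITAL LETTER D WITH SMALL LETTER Z;Lt;0;L;<compat> 0044 007A;;;;N;;;01F1;01F3;01F2
"""

records = [parse_line(line) for line in SAMPLE.splitlines()]
for r in records:
    print(f"U+{ord(r.char):04X} {r.name}: "
          f"upper={r.upper!r} lower={r.lower!r} title={r.title!r}")
```

An empty column comes back as `None`, which here means the file lists no mapping of that kind for the character.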
All mappings in the UnicodeData.txt file are one-to-one, that is, a single character maps to a single character. The exceptions, where the result is more than one character, are in SpecialCasing.txt. Additional information for case conversions is in CaseFolding.txt.
More information can be found in the Unicode FAQ: Character Properties, Case Mappings & Names.
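The difference between one-to-one and one-to-many mappings is easy to see from Python 3, whose built-in `str` methods (as far as I know) apply the full Unicode case mappings. This is only a quick illustration using Python as a convenient probe, not a description of how any particular C library behaves.

```python
# One-to-one mappings, as in the UnicodeData.txt line for U+01F2.
dz_title = "\u01F2"  # LATIN CAPITAL LETTER D WITH SMALL LETTER Z
dz_upper = dz_title.upper()   # U+01F1
dz_lower = dz_title.lower()   # U+01F3
print(f"U+{ord(dz_upper):04X} U+{ord(dz_lower):04X}")

# One-to-many mapping: the kind of exception SpecialCasing.txt exists for.
sharp_s = "\u00DF"  # LATIN SMALL LETTER SHARP S
sharp_s_upper = sharp_s.upper()
print(sharp_s_upper, len(sharp_s_upper))  # one character becomes two

# Case folding, intended for caseless comparison rather than display.
sharp_s_folded = sharp_s.casefold()
print(sharp_s_folded)
```

A table with one code point per column cannot express the second case, which is why that data lives in a separate file.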
man isupper
etc), and a bunch of other external sources that can even dynamically change depending on the locale :) - PinkTux