Suppose I have the string "hello" in UTF-8, stored as a C string, char *str.
I want to know how to determine whether, for example, the first character of this string is a letter (A-Za-zА-Яа-я).
In my other question I seem to have sorted out how to find out how many bytes a UTF-8 character in a char * string occupies, which leads to a simple branch:
1) if the character is single-byte, it is simply checked with
if ('A' <= currentByte && currentByte <= 'z') {}
2) if the character is two-byte, it needs to be checked against the Russian letters, and this is where the question begins: I can cast this 2-byte character to a 2-byte number and compare it with all the known hexadecimal codes of the Russian alphabet.
3) if the character is longer than 2 bytes, ignore it, since the Latin and Russian alphabets fit into 1 and 2 bytes in UTF-8.
What I don't understand is how to check the hexadecimal code of a two-byte character against all the hexadecimal codes of the Russian alphabet quickly and compactly.
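One possible way to do the check, given as a minimal sketch rather than a definitive answer: decode the two UTF-8 bytes into a Unicode code point and compare that against a range, instead of comparing raw byte pairs. The function names below are hypothetical. The sketch assumes "Russian alphabet" means U+0410..U+044F (А..я) plus Ё (U+0401) and ё (U+0451), and it also splits the single-byte check into two ASCII ranges.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode a 2-byte UTF-8 sequence (110xxxxx 10xxxxxx)
   into its code point. The caller must already have checked both bytes. */
static uint32_t decode_utf8_2(unsigned char b0, unsigned char b1)
{
    return ((uint32_t)(b0 & 0x1F) << 6) | (uint32_t)(b1 & 0x3F);
}

/* ASCII letters only: two separate ranges, because '[', '\\', ']', '^',
   '_' and '`' sit between 'Z' and 'a'. */
static bool is_ascii_letter(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
}

/* Assumption: the Russian alphabet is U+0410..U+044F plus U+0401 and U+0451. */
static bool is_russian_letter(uint32_t cp)
{
    return (cp >= 0x0410 && cp <= 0x044F) || cp == 0x0401 || cp == 0x0451;
}

/* Hypothetical entry point: is the character starting at s[0]
   a Latin or Russian letter? len is the number of bytes available. */
static bool starts_with_letter(const char *s, size_t len)
{
    if (len == 0)
        return false;

    unsigned char b0 = (unsigned char)s[0];
    if (b0 < 0x80)                          /* 1-byte (ASCII) character */
        return is_ascii_letter(b0);

    if ((b0 & 0xE0) == 0xC0 && len >= 2) {  /* lead byte of a 2-byte sequence */
        unsigned char b1 = (unsigned char)s[1];
        if ((b1 & 0xC0) != 0x80)            /* not a continuation byte */
            return false;
        return is_russian_letter(decode_utf8_2(b0, b1));
    }

    return false;                           /* 3+ bytes or invalid: ignored */
}
```

With this, the bytes d0 b0 decode to U+0430 and d1 8f to U+044F, so both fall inside the range and no table of byte pairs is needed.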
I would be grateful for any help. Thanks.
Here is my naive attempt:
    const char *bytes = source.bytes; // holds a UTF-8 string of two characters: "ая"
    for (int i = 0; i < source.length; i++) {
        unsigned char currentByte = (unsigned char)bytes[i];
        size_t charLength = utf8_char_length[currentByte];
        if (charLength == 1) {
            printf("character %c %d\n", currentByte, currentByte);
            if ('A' <= currentByte && currentByte <= 'z') {
                printf("latin %c\n", currentByte);
            }
        } else if (charLength == 2) {
            unsigned short twobytesymbol = *(unsigned short *)(bytes + i);
            printf("(2 bytes) %X\n", twobytesymbol);
            i++;
        } else {
            continue;
        }
    }

Output:

    (2 bytes) B0D0
    (2 bytes) 8FD1
I don't understand why, in the Unicode code table, the hexadecimal codes of the letters "а" and "я" have their bytes swapped relative to mine:
"а": d0 b0, "я": d1 8f
Because of this I can't check the whole range of the Russian alphabet in any sane way. I suspect there is something obvious here that I don't understand yet.
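A likely explanation, stated as an assumption about the machine: on a little-endian CPU such as x86, reading two bytes through an unsigned short pointer puts the first byte in memory into the low-order half of the number, so the bytes d0 b0 print as B0D0. The table lists the UTF-8 bytes in the order they appear in the string. A minimal sketch of combining the bytes explicitly, so the result does not depend on byte order (the helper name is hypothetical):

```c
#include <stdint.h>

/* Hypothetical helper: pack two bytes in the order they appear in memory.
   The shift makes the first byte the high-order one on any CPU, and reading
   byte by byte avoids the unaligned unsigned short access. */
static uint16_t pack_bytes_in_order(const char *p)
{
    unsigned char first  = (unsigned char)p[0];
    unsigned char second = (unsigned char)p[1];
    return (uint16_t)((first << 8) | second);
}
```

For the string "ая" this gives 0xD0B0 for the first character and 0xD18F for the second, matching the order in the table.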
The check 'A' <= currentByte && currentByte <= 'z' is not quite right: there are non-letter characters between the uppercase and lowercase letters. - VladD