How to determine if a UTF8 character is alphabetic?

Question

Suppose I have the string "hello" encoded in UTF-8, stored as a C string in char *str.

I want to know how I can determine whether, for example, the first character of this string is alphabetic (A-Za-zА-Яа-я).

In my other question I seem to have settled how to find out how many bytes a UTF-8 character in a char * string occupies, which leads to a simple branch:

1) if the character is single-byte, it can simply be checked with

 if ('A' <= currentByte && currentByte <= 'z') {}

2) if the character is two bytes long, it needs to be checked against the Russian letters, and this is where the question begins: I can cast this 2-byte character to a 2-byte number and compare it with all the known hexadecimal codes of the Russian alphabet.

3) if the character is longer than 2 bytes, ignore it, since the Latin and Russian alphabets fit into 1 and 2 bytes in UTF-8.

What I don't understand is how to check, quickly and compactly, whether the hexadecimal code of a two-byte character matches one of the codes of the Russian alphabet.

I would be grateful for any help. Thanks.

Here is my naive attempt:

    const char *bytes = source.bytes; // holds a two-character UTF-8 string: "ая"

    for (int i = 0; i < source.length; i++) {
        unsigned char currentByte = (unsigned char)bytes[i];
        size_t charLength = utf8_char_length[currentByte];

        if (charLength == 1) {
            printf("character %c %d\n", currentByte, currentByte);
            if ('A' <= currentByte && currentByte <= 'z') {
                printf("latin %c\n", currentByte);
            }
        } else if (charLength == 2) {
            unsigned short twobytesymbol = *(unsigned short *)(bytes + i);
            printf("(2 bytes) %X\n", twobytesymbol);
            i++;
        } else {
            continue;
        }
    }

Output:

    (2 bytes) B0D0
    (2 bytes) 8FD1

I do not understand why, in the Unicode code table, the hexadecimal codes of the letters "а" and "я" are byte-swapped relative to mine:

"а": d0 b0, "я": d1 8f

Because of this, I cannot sensibly check the whole range of the Russian alphabet. I suspect there is something obvious here that I do not understand yet.

@Stanislaw Pankevich, if you are only interested in the letters of a particular language (or language family), there are no special problems.
The letters of most scripts (Cyrillic, for example) occupy a contiguous Unicode range.
Convert UTF-8 to UCS-4 (UTF-32) and check whether the code point falls into the desired range (look up the alphabet you are interested in in the Unicode table beforehand), plus Ё and ё for strictly Russian text.
Of course, you can try to enable the desired locale (setlocale()) and call iswalpha(), but locales are not always configured on a given machine.
@avp: And also äöüÄÖÜß for German, French for French, çęıöşÇĞİÖŞ for Turkish, ąćęłńńóśźżĄĆĘŁŃÓŚŹŻ for Polish ...
@VladD, as far as I remember, all these Latin-1/2/3 characters also sit in contiguous runs (like Greek).
For example, in Cyrillic all the uppercase letters come first, then the lowercase ones, whereas in the extended Latin blocks (if memory serves) the even codes are lowercase and the odd ones uppercase (or vice versa).
Now the question for me is how to extract the hexadecimal code from the two bytes of a character so that it matches the corresponding code in the Unicode table and is not byte-swapped (as it is now).
I am obviously confusing something and would be glad of any corrective advice.
@Stanislaw Pankevich: By the way, the 'A' <= currentByte && currentByte <= 'z' check is not quite correct: there are non-letter characters between the uppercase and lowercase letters.

Accepted Answer · 2014-02-24T20:42:39

@Stanislaw Pankevich, here is a simple starting point:

    #include <stdio.h>
    #include <stdlib.h>

    /* Indexed by the first byte of a character:
       len  = utf8len[c] & 0x7  (sequence length, 0 = not a valid first byte)
       cont = utf8len[c] & 0x8  (the byte is a continuation byte) */
    char utf8len[256] = {
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 0 - 15
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 16 - 31
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 32 - 47
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 48 - 63
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 64 - 79
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 80 - 95
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 96 - 111
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 112 - 127
        8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, // 80 - 8f
        8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, // 90 - 9f
        8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, // a0 - af
        8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, // b0 - bf
        2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, // c0 - cf
        2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, // d0 - df
        3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, // e0 - ef
        4, 4, 4, 4, 4, 4, 4, 4,                         // f0 - f7
        5, 5, 5, 5,                                     // f8, f9, fa, fb
        6, 6,                                           // fc, fd
        0, 0                                            // fe, ff
    };

    #define UTF8LEN(c)  (utf8len[(unsigned char)(c)] & 0x7)
    #define UTF8CONT(c) (utf8len[(unsigned char)(c)] & 0x8)

    int main (int ac, char *av[])
    {
        const char *s = "Б№1АГД";

        while (*s) {
            int ucode;

            printf ("[%s] %d\n", s, UTF8LEN(*s));
            if ((UTF8LEN(*s) == 2) && UTF8CONT(s[1])) {
                /* 5 payload bits from the first byte, 6 from the second */
                ucode = ((*s & 0x1f) << 6) | (s[1] & 0x3f);
                printf ("ucode = 0x%x\n", ucode);
                s++;
            }
            s++;
        }
        return 0;
    }

Everything is based on a table that gives the length of a UTF-8 character from its first byte, plus a check that each following byte of the character is a continuation byte, i.e. has the form 10xxxxxx (0x80 to 0xbf).

    avp@avp-ubu1:~/hashcode$ gcc utf.c
    avp@avp-ubu1:~/hashcode$ ./a.out
    [Б№1АГД] 2
    ucode = 0x411
    [№1АГД] 3
    [  1АГД] 0
    [ 1АГД] 0
    [1АГД] 1
    [АГД] 2
    ucode = 0x410
    [ГД] 2
    ucode = 0x413
    [Д] 2
    ucode = 0x414
    avp@avp-ubu1:~/hashcode$

I do not check ASCII here and print only the Russian codes (more precisely, the codes of any 2-byte characters).

You can add a few useful macros and get:

    /* Uses the utf8len[] table and the includes from the listing above.
       ({ ... }) is a GNU C statement expression (gcc, clang). */
    #define UTF8LEN(c)  (utf8len[(unsigned char)(c)] & 0x7)
    #define UTF8CONT(c) (utf8len[(unsigned char)(c)] & 0x8)

    /* Code point of the 2-byte character at s */
    #define RUSUCODE(s) ({ const char *_s = (s); \
        (((*_s & 0x1f) << 6) | (_s[1] & 0x3f)); })

    /* А..я (0x410 - 0x44f) plus Ё (0x401) and ё (0x451) */
    #define ISRUSUC(c) ({ int _uc = (c); \
        ((0x410 <= _uc && _uc <= 0x44f) || _uc == 0x401 || _uc == 0x451); })

    #define ISRUS(s) (ISRUSUC(RUSUCODE(s)))

    int main (int ac, char *av[])
    {
        const char *rus = "АяёЁ";
        int Uaz = RUSUCODE(rus), Lya = RUSUCODE(rus + 2),
            Ujo = RUSUCODE(rus + 6), Ljo = RUSUCODE(rus + 4);

        printf ("А 0x%x я 0x%x ё 0x%x Ё 0x%x\n", Uaz, Lya, Ljo, Ujo);

        const char *s = "Б№1АГД";

        while (*s) {
            int ucode;

            printf ("[%s] %d\n", s, UTF8LEN(*s));
            if ((UTF8LEN(*s) == 2) && UTF8CONT(s[1])) {
                ucode = RUSUCODE(s);
                printf ("ucode = 0x%x %s ", ucode, ISRUS(s) ? "Yes" : "No");
                if (ISRUSUC(ucode))
                    puts("rus letter");
                else
                    puts(" ???");
                s++;
            }
            s++;
        }
        return 0;
    }

Still, it is better to aim at handling arbitrary UTF-8 characters from the start.

(As a friend of mine, a mechanic, used to say: "You have to do things well; doing them badly will happen by itself.")

In fact, once you have the length of a UTF-8 character from UTF8LEN(), you can advance s by that length right away (or by 1, reporting an error, if the length is zero).

I think everything is already clear, though.

I need some time to digest all this :) Thank you so much!
I nevertheless tried swapping the bytes of the two-byte characters, casting them to a 2-byte unsigned value, and it turned out that all the ranges from the table of hex codes can be set up and checked for Russian characters perfectly well.
But it also turned out that your solution with conversion to UCS works a little faster: ucode = ((*s & 0x1f) << 6) | …

Answer 2 · 2014-02-24T15:06:59

Convert to wchar_t and use iswalpha? It recognizes any letters, not only Russian and English ones.

(In C++ there is also an isalpha that takes a locale as a parameter; is there such a thing in pure C?)

I'm trying to figure out how to do it by hand, purely through character codes.
This is partly an exercise and partly helper code for a test assignment I want to complete.

Alexei Averchenko · Answer 3 · 2015-06-10T13:15:19

The answer by @avp is incredibly cool, but it is still safer to use a library:

glib is a rather heavy but useful library with a bit of everything, including Unicode support; it is popular on Linux because it serves as the foundation of GNOME.
icu is a library dedicated specifically to Unicode; Apple, for example, uses it under the hood of NSString.

ICU, by the way, is probably the only current solution that claims to be complete.
