Finding the UTF-8 code of a character

Question

This may be a very stupid question, but I could not find a function that returns the UTF-8 code of a character.

I know that we can get the ASCII code with the following code:

    char our_symbol = 'v';
    int number;
    number = (int) our_symbol;

How do I get the code of a character in UTF-8?

Clarification: there is a sequence of character codes that we want to recognize.

Suppose the character is 'а', a Cyrillic letter.

If we convert its bytes to decimal, we get: byte 1 is 208, byte 2 is 176.

@VladD, no, I am interested in the character code in UTF-8. I understand that characters in this encoding can be multibyte, which is why I am asking how to find the character code in UTF-8.
Do you need to know the code of one character once or do you need to process the byte stream of these characters and find their codes?
@MaximPro: Well, there is no such thing as “character code in UTF8”.
There is a Unicode codepoint number and there are bytes by which this very codepoint is encoded in UTF8.
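To make the distinction in the comment above concrete, here is a minimal sketch that turns the bytes from the question (208, 176) back into the code point number. The function name `first_code_point` is hypothetical and not from the thread, and the sketch assumes well-formed UTF-8 input, so it does no validation.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper (not from the thread): decodes the first code point
// of a UTF-8 string. Assumes well-formed input; no validation is performed.
std::uint32_t first_code_point(const std::string& s) {
    if (s.empty()) return 0;
    const unsigned char lead = static_cast<unsigned char>(s[0]);
    std::uint32_t cp;
    int extra;  // number of continuation bytes that follow the lead byte
    if (lead < 0x80)      { cp = lead;        extra = 0; }  // 0xxxxxxx
    else if (lead < 0xE0) { cp = lead & 0x1F; extra = 1; }  // 110xxxxx
    else if (lead < 0xF0) { cp = lead & 0x0F; extra = 2; }  // 1110xxxx
    else                  { cp = lead & 0x07; extra = 3; }  // 11110xxx
    for (int i = 1; i <= extra && i < static_cast<int>(s.size()); ++i) {
        // Each continuation byte (10xxxxxx) contributes its low 6 bits.
        cp = (cp << 6) | (static_cast<unsigned char>(s[i]) & 0x3F);
    }
    return cp;
}
```

For the bytes 0xD0 0xB0 (208, 176) this yields 0x430, that is U+0430, the Cyrillic small letter 'а'.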

Accepted Answer · 2016-11-24T08:27:56

Here is a simple and refined option:

    std::string utf8Symbol = u8"Ф";
    for (const auto& byte : utf8Symbol)
        std::cout << std::hex << (byte & 0xFF) << ' ';  // mask so a negative char is not sign-extended
    std::cout << '\n';

There are practically no differences from single-byte encodings: if you already have a method for extracting individual characters, you simply take each of them and extract all the bytes, one by one.

@Qwertiy, because std::string doesn't produce anything at all.
Contrary to its name, std::string is not a string in the sense we are all used to.

Answer 2 · 2016-11-24T09:59:19

http://ideone.com/5IdxoG

    #include <cstdio>

    int main()
    {
        const char *str = u8"Я строка в UTF-8. がダウンロードできません";
        printf("%s", str);
        // A continuation byte (10xxxxxx) is printed on the current line;
        // any other byte begins a new character, so it starts a new line.
        for (unsigned char *p = (unsigned char *)str; *p; ++p)
            printf(*p >> 6 == 2 ? " %.02X" : "\n%.02X", *p);
        return 0;
    }

Here we use the fact that in a multibyte sequence every byte except the first one has 10 as its two most significant bits.

Cool, but it would be nice to have a comment explaining the code *p >> 6 == 2.
@VladD, Wikipedia: "For the remaining octets, the two most significant bits are 10 (10xxxxxx)."
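The same bit test can also be used to count characters instead of printing them. The sketch below is only an illustration: `count_code_points` is a hypothetical name, and it assumes the input is valid UTF-8. It counts every byte that is not a continuation byte, which gives the number of code points.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in a UTF-8 string by counting
// every byte that is NOT a continuation byte (10xxxxxx).
std::size_t count_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char b : s) {
        // (b & 0xC0) == 0x80 is the same test as b >> 6 == 2.
        if ((b & 0xC0) != 0x80) ++n;
    }
    return n;
}
```

Note that this counts code points, not what a user sees as one character: combining sequences still count as several.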

Majestio · Answer 3 · 2016-11-24T10:12:30

And as an addition: UCS-2 -> UTF-8 -> codes :)

After all, strings can be stored in code in different ways ... http://ideone.com/bkNiH5 :

    #include <iostream>
    #include <string>
    #include <locale>
    #include <codecvt>
    #include <iomanip>
    #include <cstdint>

    int main()
    {
        // wide characters
        std::wstring wstr = L"Я строка в UCS-2. がダウンロードできません";
        // wide characters to UTF-8
        std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
        std::cout << "Chr | UTF-8\n============\n";
        for (const auto &c : wstr) {
            std::string u8str = conv.to_bytes(c);
            std::cout << u8str << " : ";
            for (const uint8_t &i : u8str)
                std::cout << std::hex << std::setfill('0') << std::setw(2)
                          << static_cast<int>(i) << ' ';
            std::cout << std::dec << '\n';
        }
        return 0;
    }

There is no guarantee that an L"" literal yields UCS-2. For example, in gcc sizeof(wchar_t) == 4, so the encoding is most likely UCS-4.
But, IMHO, the choice depends on the situation: UTF-8 gives a compact representation, while UCS-2 / UCS-4 gives faster processing.
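One way to sidestep the wchar_t width question is to keep code points in char32_t and encode them to UTF-8 by hand. This also avoids std::wstring_convert and std::codecvt_utf8, which are deprecated since C++17. The following is a minimal sketch, not the answer author's method: `encode_utf8` is a hypothetical helper, and it assumes the argument is a valid Unicode scalar value (at most 0x10FFFF and not a surrogate).

```cpp
#include <string>

// Hypothetical helper: encodes one Unicode code point as UTF-8 bytes.
// Assumes cp is a valid scalar value (<= 0x10FFFF, not a surrogate).
std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {                       // 1 byte:  0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: 11110xxx then three 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For example, encode_utf8(0x430) produces the two bytes 0xD0 0xB0, the same 208 and 176 mentioned in the question.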
