This may be a very stupid question, but I could not find a function that returns the UTF-8 code of a character.

I know that we can get the ASCII code with the following code:

    char our_symbol = 'v';
    int number;
    number = (int) our_symbol;

How do I get the code of a character in UTF-8?

Clarification: there is a character whose code sequence we want to find out.

Suppose the character is 'а', the Cyrillic letter.

If we convert its bytes to decimal, we get: byte 1 is 208, byte 2 is 176.

  • Do you mean the Unicode number? UTF8 is just a Unicode character encoding method. - VladD
  • @VladD no, I am interested in the character code in UTF-8. I understand that characters in this encoding can be multibyte, which is why I am asking how to find out the character code in UTF-8. - MaximPro
  • Where do you get the symbols from? Do you need to know the code of one character once or do you need to process the byte stream of these characters and find their codes? - tutankhamun
  • @MaximPro: Well, there is no such thing as a "character code in UTF-8". There is a Unicode code point number, and there are the bytes with which that code point is encoded in UTF-8. - VladD
  • Describe your task more broadly. I think this is an XY problem. - tutankhamun
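
To illustrate the distinction drawn in the comments above: the bytes 208 176 are the UTF-8 encoding, while the code point is the number those bytes encode. Below is a minimal sketch of recovering it. The names decode_code_point and kInvalidCodePoint are hypothetical, not standard library functions, and the sketch is deliberately simplified: overlong forms and surrogates are not rejected.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Sentinel returned for malformed input (it is not a valid code point).
constexpr std::uint32_t kInvalidCodePoint = 0xFFFFFFFF;

// Hypothetical helper: decodes the UTF-8 sequence starting at utf8[pos]
// and returns its Unicode code point, or kInvalidCodePoint on error.
std::uint32_t decode_code_point(const std::string& utf8, std::size_t pos = 0) {
    assert(pos < utf8.size());  // precondition: pos must index a byte
    const unsigned char lead = static_cast<unsigned char>(utf8[pos]);

    std::size_t length = 0;  // total number of bytes in the sequence
    std::uint32_t cp = 0;    // payload bits collected so far
    if (lead < 0x80)              { length = 1; cp = lead; }         // 0xxxxxxx
    else if ((lead >> 5) == 0x06) { length = 2; cp = lead & 0x1F; }  // 110xxxxx
    else if ((lead >> 4) == 0x0E) { length = 3; cp = lead & 0x0F; }  // 1110xxxx
    else if ((lead >> 3) == 0x1E) { length = 4; cp = lead & 0x07; }  // 11110xxx
    else return kInvalidCodePoint;  // stray continuation byte or invalid lead

    if (utf8.size() - pos < length) return kInvalidCodePoint;  // truncated
    for (std::size_t i = 1; i < length; ++i) {
        const unsigned char byte = static_cast<unsigned char>(utf8[pos + i]);
        if ((byte >> 6) != 0x02) return kInvalidCodePoint;  // must be 10xxxxxx
        cp = (cp << 6) | (byte & 0x3F);  // append six payload bits
    }
    return cp;
}
```

For the 'а' from the question, decode_code_point("\xD0\xB0") yields 1072 (U+0430): (0xD0 & 0x1F) << 6 gives 1024, and 0xB0 & 0x3F adds 48.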

3 answers

Here is a simple and refined option:

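    // Note: since C++20 a u8"" literal has type char8_t[] and no longer
    // initializes a std::string directly.
    std::string utf8Symbol = u8"Ф";
    // Mask with 0xFF so that a signed char prints as an unsigned byte value.
    for (const auto& byte : utf8Symbol)
        std::cout << std::hex << (byte & 0xFF) << ' ';
    std::cout << '\n';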

There are practically no differences from single-byte encodings: if you already have a method for extracting individual characters, you simply take each of them and extract all the bytes, one by one.

  • std::string utf8Symbol = u8"Ф"; Is this a typo at the end? - MaximPro
  • @MaximPro, in what sense? There are no typos there. - ixSci
  • @MaximPro, it denotes a UTF-8 string literal. - ixSci
  • @MaximPro, I have an article on this topic. - ixSci
  • @Qwertiy, because std::string doesn't produce anything at all. std::string is little more than an array of char. Contrary to its name, std::string is not a string in the sense we are all used to. - ixSci
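
If no method for extracting individual characters is available yet, the lead byte of each sequence tells how many bytes belong to the character. Here is a rough sketch under that assumption. The helper names sequence_length and split_utf8 are hypothetical, and the input is assumed to be valid UTF-8 (continuation bytes are not checked).

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Length in bytes of the UTF-8 sequence introduced by `lead`.
// Invalid lead bytes (including stray continuation bytes) are treated
// as one-byte units so that the caller always makes progress.
std::size_t sequence_length(unsigned char lead) {
    if ((lead >> 7) == 0x00) return 1;  // 0xxxxxxx
    if ((lead >> 5) == 0x06) return 2;  // 110xxxxx
    if ((lead >> 4) == 0x0E) return 3;  // 1110xxxx
    if ((lead >> 3) == 0x1E) return 4;  // 11110xxx
    return 1;
}

// Hypothetical helper: splits a UTF-8 string into one std::string per
// encoded character. Continuation bytes are not validated.
std::vector<std::string> split_utf8(const std::string& utf8) {
    std::vector<std::string> characters;
    std::size_t pos = 0;
    while (pos < utf8.size()) {
        std::size_t len = sequence_length(static_cast<unsigned char>(utf8[pos]));
        assert(len >= 1 && len <= 4);  // contract of sequence_length
        if (len > utf8.size() - pos) len = utf8.size() - pos;  // truncated tail
        characters.push_back(utf8.substr(pos, len));
        pos += len;
    }
    return characters;
}
```

Each element of the result can then be fed to the byte-printing loop from the answer, giving one group of bytes per character.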

http://ideone.com/5IdxoG

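    #include <cstdio>

    int main()
    {
        const char *str = u8"Я строка в UTF-8. がダウンロードできません";
        printf("%s", str);

        // A continuation byte (10xxxxxx) stays on the current line;
        // any other byte starts a new character and therefore a new line.
        for (const unsigned char *p = (const unsigned char *)str; *p; ++p)
            printf(*p >> 6 == 2 ? " %.02X" : "\n%.02X", *p);

        return 0;
    }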

This relies on the fact that in UTF-8 every byte of a multibyte sequence except the first has 10 as its two most significant bits.

  • Cool, but I would like a comment on the code *p >> 6 == 2. - VladD
  • @VladD, Wikipedia: "For the remaining octets, the two most significant bits are 10 (10xxxxxx)." :) - Qwertiy
  • Yeah, maybe add that to the answer? - VladD
  • @VladD added. - Qwertiy
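
The same bit test is useful for more than line breaks, for instance counting characters rather than bytes. Here is a small sketch, assuming valid UTF-8 input. The names is_continuation_byte and count_code_points are made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// True for bytes of the form 10xxxxxx, i.e. every byte of a multibyte
// sequence except the first.
constexpr bool is_continuation_byte(unsigned char byte) {
    return (byte >> 6) == 0x02;
}

// Hypothetical helper: counts code points by counting the bytes that
// start a sequence. Assumes the input is valid UTF-8.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t count = 0;
    for (const char c : utf8) {
        if (!is_continuation_byte(static_cast<unsigned char>(c))) ++count;
    }
    assert(count <= utf8.size());  // a code point takes at least one byte
    return count;
}
```

The cast to unsigned char matters: where char is signed, shifting a negative value right would not give the expected two-bit pattern.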

Well, as an addition: UCS-2 -> UTF-8 -> codes :)

After all, strings can be stored in code in different ways... http://ideone.com/bkNiH5 :

    #include <iostream>
    #include <string>
    #include <locale>
    #include <codecvt>
    #include <iomanip>
    #include <cstdint>

    int main()
    {
        // wide characters
        std::wstring wstr = L"Я строка в UCS-2. がダウンロードできません";
        // converter from wide characters to UTF-8
        std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;

        std::cout << "Chr | UTF-8\n============\n";
        for (const auto &c : wstr) {
            std::string u8str = conv.to_bytes(c);
            std::cout << u8str << " : ";
            for (const uint8_t &i : u8str)
                std::cout << std::hex << std::setfill('0') << std::setw(2)
                          << static_cast<int>(i) << ' ';
            std::cout << std::dec << '\n';
        }
        return 0;
    }

  • There is no guarantee that the literal L"" gives UCS-2; for example, in gcc sizeof(wchar_t) == 4, so the most likely encoding is UCS-4. Use u"" if you need a guarantee of UCS-2. - ixSci
  • I agree; one solution is to use std::u32string. But, IMHO, the choice depends on the situation: UTF-8 gives a compact representation, while UCS-2 / UCS-4 gives faster processing. Getting everything at once is not for this life :) - Majestio
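
For completeness: std::wstring_convert and the <codecvt> facets have been deprecated since C++17, so the conversion can also be written by hand from a std::u32string, where every element is one code point. The sketch below assumes the input contains only valid Unicode scalar values. The names append_utf8 and to_utf8 are hypothetical.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: appends the UTF-8 encoding of one code point.
// Precondition: cp is a Unicode scalar value (<= 0x10FFFF, not a surrogate).
void append_utf8(std::string& out, char32_t cp) {
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    if (cp < 0x80) {                                        // 1 byte
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {                                // 2 bytes
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {                              // 3 bytes
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                                                // 4 bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Converts a whole UTF-32 string to UTF-8, one code point at a time.
std::string to_utf8(const std::u32string& text) {
    std::string out;
    for (const char32_t cp : text) append_utf8(out, cp);
    return out;
}
```

With this, to_utf8(U"\u042F") produces the two bytes D0 AF for 'Я', which can be printed with the same hex loop as in the answer above.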