Convert UTF-8 to UCS

Question

I need to read a line from a file and convert it from UTF-8 to UTF-16LE manually. As I understand it, I must first convert it to UCS. The assignment requires doing this by hand. I searched but found nothing useful, though I may have overlooked something. Can anyone suggest an algorithm for converting to UCS?

The UTF-8 format is described here, and the UTF-16LE format, with examples of converting to UTF-32 and back, is here.

Answer 1 · 2016-04-15T16:26:03

If we are talking only about the BMP (Basic Multilingual Plane), then UTF-16 is effectively UCS-2, and you can simply read the code point from UTF-8 and write it out immediately as a single UTF-16 code unit, because a BMP code point never exceeds 16 bits. If you need to handle the whole Unicode range, first decode each UTF-8 sequence into its code point value (that is, into UTF-32), then transcode UTF-32 into UTF-16, creating surrogate pairs for code points above 0xFFFF.

The algorithm is as follows:

Read the first byte of the sequence.

Determine the length of the encoded code point from that byte:

    #include <cstdint>

    // Sequence length indexed by the lead byte, 16 entries per row.
    // 5 and 6 are the obsolete pre-RFC 3629 forms.
    constexpr const uint8_t utf8_length_data[256] = {
        0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 0x00
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 0x10
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 0x20
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 0x30
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 0x40
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 0x50
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 0x60
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 0x70
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 0x80
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 0x90
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 0xA0
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 0xB0
        2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, // 0xC0
        2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, // 0xD0
        3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, // 0xE0
        4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 1, 1  // 0xF0
    };

    inline constexpr uint8_t utf8DecodeLength(uint8_t c) { return utf8_length_data[c]; }
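The same lengths can also be derived from the high bits of the lead byte, which shows where the table values come from. The following is only a sketch of that alternative, not the answer's method: `utf8SequenceLength` is a hypothetical name, and unlike the table it returns 0 for continuation bytes and for the obsolete 5- and 6-byte leads. It does not reject overlong encodings.

```cpp
#include <cstdint>

// Hypothetical helper (not from the answer): sequence length from the lead
// byte's high bits. Returns 0 for a byte that cannot start a sequence here.
inline std::uint8_t utf8SequenceLength(std::uint8_t lead) {
    if (lead < 0x80) return 1;            // 0xxxxxxx: single-byte (ASCII)
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;                             // 10xxxxxx continuation or 5/6-byte lead
}
```

For example, the lead byte 0xE2 of the euro sign matches the `1110xxxx` pattern, so the sequence is three bytes long.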

Determine the value mask for the first byte (for the continuation bytes it is always 0x3f):

    // Mask that selects the payload bits of the lead byte.
    // A switch inside a constexpr function requires C++14 or later.
    static constexpr inline uint8_t utf8DecodeMask(uint8_t codePointLength) {
        switch (codePointLength) {
        case 1: return 0x7f;
        case 2: return 0x1f;
        case 3: return 0x0f;
        case 4: return 0x07;
        case 5: return 0x03;
        case 6: return 0x01;
        }
        return 0;
    }

Calculate the value of the code point by processing codePointLength bytes of the source string, one byte at a time.

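    #include <cstddef>
    #include <string>

    // The caller must ensure that offset + codePointLength <= str.size().
    // Not constexpr: std::string is not a literal type before C++20.
    static inline char32_t utf8Decode(const std::string &str, size_t offset, uint8_t codePointLength, uint8_t mask) {
        // Cast to uint8_t so a signed char is never sign-extended.
        char32_t ret = static_cast<uint8_t>(str[offset]) & mask;
        for (uint8_t c = 1; c < codePointLength; ++c) {
            ret <<= 6;  // make room for the next 6 payload bits
            ret |= (static_cast<uint8_t>(str[offset + c]) & 0x3f);
        }
        return ret;
    }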
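The three steps above (length, mask, accumulation) can be combined into one routine that also guards against truncated or malformed input. This is a minimal sketch under stated assumptions: `decodeUtf8At` is a hypothetical name, it accepts only 1- to 4-byte sequences, and it does not reject overlong encodings or encoded surrogates.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not from the answer): decodes the sequence starting at
// `offset` and advances `offset` past it. Returns false on truncated input,
// a byte that cannot start a sequence, or a malformed continuation byte.
inline bool decodeUtf8At(const std::string &str, std::size_t &offset, char32_t &codePoint) {
    if (offset >= str.size()) return false;
    const std::uint8_t lead = static_cast<std::uint8_t>(str[offset]);

    std::size_t length;
    std::uint8_t mask;
    if (lead < 0x80)                { length = 1; mask = 0x7f; }
    else if ((lead & 0xE0) == 0xC0) { length = 2; mask = 0x1f; }
    else if ((lead & 0xF0) == 0xE0) { length = 3; mask = 0x0f; }
    else if ((lead & 0xF8) == 0xF0) { length = 4; mask = 0x07; }
    else return false;  // continuation byte or obsolete 5/6-byte lead

    if (str.size() - offset < length) return false;  // sequence is cut off

    char32_t value = lead & mask;
    for (std::size_t i = 1; i < length; ++i) {
        const std::uint8_t next = static_cast<std::uint8_t>(str[offset + i]);
        if ((next & 0xC0) != 0x80) return false;  // must look like 10xxxxxx
        value = (value << 6) | (next & 0x3f);     // append 6 payload bits
    }

    codePoint = value;
    offset += length;
    return true;
}
```

Calling it in a loop until `offset` reaches `str.size()` yields the code points of the line one by one; for the bytes E2 82 AC it produces 0x20AC.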

If we return UCS-2, we simply discard the two high bytes and write the two low bytes to the result (low byte first for UTF-16LE).
If we return full UTF-16, we check whether the value is less than or equal to 0xFFFF. If it is, we write it to the result directly; otherwise we create a surrogate pair of two UTF-16 code units. Values from 0xD800 to 0xDFFF are reserved for surrogates and are not valid code points on their own. Building the pair is easy; see the UTF-16 article on Wikipedia.
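The encoding step can be sketched as follows. The names `appendUnitLE` and `appendUtf16LE` are hypothetical, and the output is assumed to be collected as raw bytes in a `std::string`. A code point above 0xFFFF has 0x10000 subtracted from it, and the remaining 20 bits are split into two 10-bit halves that are added to 0xD800 and 0xDC00.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: appends one 16-bit code unit in little-endian order.
inline void appendUnitLE(std::string &out, std::uint16_t unit) {
    out.push_back(static_cast<char>(unit & 0xff));         // low byte first
    out.push_back(static_cast<char>((unit >> 8) & 0xff));  // then high byte
}

// Hypothetical helper: encodes one code point as UTF-16LE. Returns false for
// values that cannot be encoded (surrogates or anything above 0x10FFFF).
inline bool appendUtf16LE(std::string &out, char32_t codePoint) {
    if (codePoint >= 0xD800 && codePoint <= 0xDFFF) return false;
    if (codePoint > 0x10FFFF) return false;

    if (codePoint <= 0xFFFF) {
        // BMP: one code unit holds the code point directly.
        appendUnitLE(out, static_cast<std::uint16_t>(codePoint));
        return true;
    }

    // Supplementary planes: split the 20-bit offset into a surrogate pair.
    const std::uint32_t v = static_cast<std::uint32_t>(codePoint) - 0x10000;
    appendUnitLE(out, static_cast<std::uint16_t>(0xD800 + (v >> 10)));    // high surrogate
    appendUnitLE(out, static_cast<std::uint16_t>(0xDC00 + (v & 0x3ff)));  // low surrogate
    return true;
}
```

For U+1F600 the offset is 0xF600, which gives the pair 0xD83D 0xDE00, written as the bytes 3D D8 00 DE.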

PS I apologize for the C++ code; I don't know Java well.

If you add code to the list, you must use eight spaces instead of four.

Viacheslav

Answer 2 · 2016-04-15T15:40:50

Read the lines from the file stream as UTF-8, then encode each line as UTF-16LE. The resulting bytes are the line in UTF-16LE. Example of changing the string encoding:

    String g2 = "text in UTF-8";
    // Requires: import java.nio.charset.StandardCharsets;
    byte[] utf16le = g2.getBytes(StandardCharsets.UTF_16LE);
