I need to read a line from a file and convert it from UTF-8 to UTF-16LE manually. As I understand it, I first have to translate it into UCS. The assignment requires that the conversion be done by hand. I searched but did not find anything sensible, though perhaps I simply overlooked it. Can anyone suggest an algorithm for converting to UCS?
2 answers
If we are talking only about the BMP (Basic Multilingual Plane), then UTF-16 effectively becomes UCS-2, and you can simply read the code point from UTF-8 and immediately encode it as a single UTF-16 code unit: within the BMP a character never exceeds 16 bits. If you need to handle the entire Unicode range, first decode to UTF-32, simply by reading the code point value from UTF-8, and then transcode UTF-32 into UTF-16, creating surrogate pairs for code points above 0xFFFF. The range 0xD800 to 0xDFFF is reserved for the surrogates themselves, so it never appears as a code point in valid input.
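To make the surrogate-pair step concrete, here is a minimal sketch in Java (chosen because the question targets Java). The names `Utf16Encoder` and `encode` are hypothetical; the arithmetic is the standard UTF-16 rule: subtract 0x10000, then split the remaining 20 bits into two 10-bit halves.

```java
final class Utf16Encoder {
    /** Encodes one Unicode code point as one or two UTF-16 code units. */
    static char[] encode(int codePoint) {
        boolean isSurrogate = codePoint >= 0xD800 && codePoint <= 0xDFFF;
        if (codePoint < 0 || codePoint > 0x10FFFF || isSurrogate) {
            throw new IllegalArgumentException(
                    "Not a Unicode scalar value: 0x" + Integer.toHexString(codePoint));
        }
        if (codePoint <= 0xFFFF) {
            // BMP: the code point is its own 16-bit code unit.
            return new char[] { (char) codePoint };
        }
        // Supplementary planes: 20 bits remain after subtracting 0x10000.
        int v = codePoint - 0x10000;
        char high = (char) (0xD800 | (v >> 10));    // top 10 bits
        char low = (char) (0xDC00 | (v & 0x3FF));   // bottom 10 bits
        return new char[] { high, low };
    }
}
```

For example, `Utf16Encoder.encode(0x1F600)` yields the two code units 0xD83D and 0xDE00, the same pair that the standard library's `Character.toChars(0x1F600)` returns.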
The algorithm is as follows:
- Read the first byte of the character.
- Find out the length of the code point:

        #include <cstdint>

        // Sequence length indexed by the lead byte.
        // Notes: the entry for 0x00 is 0, so a caller that advances by this
        // length must handle NUL separately; continuation bytes (0x80-0xBF)
        // and 0xFE/0xFF map to 1; lengths 5 and 6 come from the obsolete
        // original UTF-8 definition (modern UTF-8 stops at 4 bytes).
        constexpr uint8_t utf8_length_data[256] = {
            0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 0x00
            1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 0x10
            1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 0x20
            1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 0x30
            1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 0x40
            1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 0x50
            1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 0x60
            1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 0x70
            1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 0x80
            1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 0x90
            1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 0xA0
            1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 0xB0
            2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, // 0xC0
            2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, // 0xD0
            3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, // 0xE0
            4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 1, 1  // 0xF0
        };

        inline constexpr uint8_t utf8DecodeLength(uint8_t c) {
            return utf8_length_data[c];
        }

- Determine the value mask for the first byte (for the remaining bytes it is always 0x3f):

        // A switch inside a constexpr function requires C++14 or later.
        static constexpr inline uint8_t utf8DecodeMask(uint8_t codePointLength) {
            switch (codePointLength) {
            case 1: return 0x7f;
            case 2: return 0x1f;
            case 3: return 0x0f;
            case 4: return 0x07;
            case 5: return 0x03;
            case 6: return 0x01;
            }
            return 0;
        }

- Compute the value of the code point, processing codePointLength bytes of the source string one by one:

        #include <cstddef>
        #include <string>

        // Not constexpr: std::string is not a literal type before C++20.
        // No bounds checking: the caller must ensure that
        // offset + codePointLength <= str.size().
        static inline char32_t utf8Decode(const std::string &str, size_t offset,
                                          uint8_t codePointLength, uint8_t mask) {
            char32_t ret = static_cast<uint8_t>(str[offset]) & mask;
            for (uint8_t c = 1; c < codePointLength; ++c) {
                ret <<= 6;
                ret |= static_cast<uint8_t>(str[offset + c]) & 0x3f;
            }
            return ret;
        }

- If we are producing UCS-2, simply discard the two high bytes and write the two low bytes to the result (for UTF-16LE, the low byte first).
- If we are producing full UTF-16, check whether the value is less than or equal to 0xFFFF. If it is, write it to the result directly; otherwise create a surrogate pair of two UTF-16 code units. It's easy; you can take a look at Wikipedia for the formula.
P.S. I apologize for the C++ code; I don't know Java well.
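For readers who want the steps above in Java, they might be sketched as follows. This is an illustration under stated assumptions, not a port of the C++ snippets: the class `ManualTranscoder` and its methods are hypothetical names, the whole input is assumed to be already in memory as a byte array, and sequences longer than 4 bytes are rejected, as modern UTF-8 requires. Malformed input raises `IllegalArgumentException`.

```java
import java.io.ByteArrayOutputStream;

final class ManualTranscoder {
    /** Sequence length for a lead byte (0..255), or -1 if it cannot start a sequence. */
    static int sequenceLength(int lead) {
        if (lead < 0x80) return 1;                   // 0xxxxxxx
        if (lead >= 0xC2 && lead <= 0xDF) return 2;  // 110xxxxx (0xC0, 0xC1 would be overlong)
        if (lead >= 0xE0 && lead <= 0xEF) return 3;  // 1110xxxx
        if (lead >= 0xF0 && lead <= 0xF4) return 4;  // 11110xxx
        return -1;                                   // continuation byte or invalid lead
    }

    /** Converts UTF-8 bytes to UTF-16LE bytes without using the charset library. */
    static byte[] utf8ToUtf16Le(byte[] in) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(in.length * 2);
        int i = 0;
        while (i < in.length) {
            int lead = in[i] & 0xFF;
            int len = sequenceLength(lead);
            if (len < 0 || i + len > in.length) {
                throw new IllegalArgumentException("Malformed UTF-8 at byte " + i);
            }
            // Mask for the first byte: 0x1F, 0x0F, 0x07 for lengths 2, 3, 4.
            int cp = (len == 1) ? lead : lead & (0xFF >> (len + 1));
            for (int k = 1; k < len; k++) {
                int next = in[i + k] & 0xFF;
                if ((next & 0xC0) != 0x80) {
                    throw new IllegalArgumentException("Bad continuation byte at " + (i + k));
                }
                cp = (cp << 6) | (next & 0x3F);      // each continuation byte adds 6 bits
            }
            boolean overlong = (len == 3 && cp < 0x800) || (len == 4 && cp < 0x10000);
            boolean surrogate = cp >= 0xD800 && cp <= 0xDFFF;
            if (overlong || surrogate || cp > 0x10FFFF) {
                throw new IllegalArgumentException("Invalid code point at byte " + i);
            }
            i += len;
            if (cp <= 0xFFFF) {
                writeUnit(out, cp);                  // BMP: one code unit
            } else {
                int v = cp - 0x10000;                // supplementary: surrogate pair
                writeUnit(out, 0xD800 | (v >> 10));
                writeUnit(out, 0xDC00 | (v & 0x3FF));
            }
        }
        return out.toByteArray();
    }

    /** Writes one 16-bit code unit in little-endian order: low byte first. */
    private static void writeUnit(ByteArrayOutputStream out, int unit) {
        out.write(unit & 0xFF);
        out.write((unit >> 8) & 0xFF);
    }
}
```

The sketch writes no byte order mark, and a UTF-8 BOM at the start of the input is passed through as U+FEFF like any other character. For valid input the result should match what `new String(bytes, StandardCharsets.UTF_8).getBytes(StandardCharsets.UTF_16LE)` produces, which makes the library a convenient cross-check for a hand-written converter.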
- Why didn't my indents before the code work? - SBKarr
- This is a feature of the editor. If you add code to the list, you must use eight spaces instead of four. - Nicolas Chabanovsky ♦
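Read the lines from the file stream, decode their bytes as UTF-8, and re-encode the text as UTF-16LE. Keep in mind that a Java String has no encoding of its own (internally it is always UTF-16); the encoding only matters when converting to or from bytes. Example of changing the encoding:

    import java.nio.charset.StandardCharsets;

    byte[] utf8Bytes = "text in UTF-8".getBytes(StandardCharsets.UTF_8); // stand-in for bytes read from the file
    String text = new String(utf8Bytes, StandardCharsets.UTF_8);         // decode the UTF-8 bytes
    byte[] utf16leBytes = text.getBytes(StandardCharsets.UTF_16LE);      // encode as UTF-16LE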
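Applied line by line to a stream, the library-based approach might look like the sketch below. Note that it relies on the standard charset classes rather than a manual conversion, so it is mainly useful for checking a hand-written converter. The class `LibraryTranscoder` and its method are hypothetical names, and normalizing every line ending to `\n` is an assumption of this sketch.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

final class LibraryTranscoder {
    /**
     * Reads UTF-8 text line by line from {@code in} and writes it to
     * {@code out} as UTF-16LE. Both streams are closed when done.
     */
    static void utf8ToUtf16Le(InputStream in, OutputStream out) {
        try (BufferedReader reader = new BufferedReader(
                     new InputStreamReader(in, StandardCharsets.UTF_8));
             Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_16LE)) {
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(line);  // readLine() strips the line terminator
                writer.write('\n');  // so one is written back explicitly
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With files, the two streams could come from `Files.newInputStream(source)` and `Files.newOutputStream(target)`. The `UTF_16LE` charset does not emit a byte order mark, and a final line without a terminator gains one in the output.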