I decided to check how String's getBytes() method behaves when a character's code point is greater than 127. As an example, I took a string with Latin and Cyrillic characters and converted it to a byte array:
byte[] bytes = "abcdefghijz<аБ".getBytes();
I got the following bytes:
[97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 122, 60, -32, -63]
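These values depend on which charset getBytes() uses: with no arguments it takes the platform default. Here is a minimal sketch, assuming the default on the machine in question is windows-1251 (the class name EncodedBytes is only for illustration). It passes the charset explicitly and writes the two Cyrillic letters as Unicode escapes, so the result does not depend on the source file's encoding:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EncodedBytes {
    // Encodes the text with the given charset instead of the platform default.
    static byte[] encode(String text, Charset charset) {
        return text.getBytes(charset);
    }

    public static void main(String[] args) {
        // Cyrillic small letter a (U+0430) and Cyrillic capital letter Be (U+0411)
        String text = "\u0430\u0411";
        // One byte per letter in windows-1251
        System.out.println(Arrays.toString(encode(text, Charset.forName("windows-1251")))); // [-32, -63]
        // Two bytes per letter in UTF-8
        System.out.println(Arrays.toString(encode(text, StandardCharsets.UTF_8))); // [-48, -80, -48, -111]
    }
}
```

Note that the bytes are encoded text rather than Unicode code points: the code points of а and Б are 1072 and 1041.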
As can be seen, because the byte type overflows, the codes а=224 and Б=193 have shifted into the negative range, to -32 and -63 respectively.
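The negative numbers and the values 224 and 193 are the same eight bits read two ways. A small sketch of how the unsigned reading could be recovered (the class and method names are made up for this example):

```java
import java.util.Arrays;

public class UnsignedBytes {
    // Widens each signed byte to int and masks off the sign extension,
    // giving the 0..255 reading of the same bit pattern.
    static int[] toUnsigned(byte[] bytes) {
        int[] out = new int[bytes.length];
        for (int i = 0; i < bytes.length; i++) {
            out[i] = bytes[i] & 0xFF; // same result as Byte.toUnsignedInt(bytes[i])
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] bytes = {-32, -63};
        System.out.println(Arrays.toString(toUnsigned(bytes))); // [224, 193]
    }
}
```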
Next, I convert the byte array to a char array:
char[] chars = new char[bytes.length];
for (int i = 0; i < bytes.length; i++) chars[i] = (char) bytes[i];
I get:
[a, b, c, d, e, f, g, h, i, j, z, <, ?, ?]
As expected, the chars for the negative values are undefined, and the IDE draws ? in their place.
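For what it is worth, the cast (char) on a negative byte does produce a definite value: the byte is sign-extended to int and then truncated to 16 bits, so -32 becomes U+FFE0 rather than U+0430. The sketch below compares the cast with decoding through a charset; windows-1251 is assumed here, and the class and method names are hypothetical:

```java
import java.nio.charset.Charset;

public class CastVersusDecode {
    // Plain cast: byte -> int (sign-extended) -> char (low 16 bits).
    static char castByte(byte b) {
        return (char) b;
    }

    // Decoding: interprets the bytes as text in the named charset.
    static char[] decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName)).toCharArray();
    }

    public static void main(String[] args) {
        byte[] bytes = {-32, -63};
        // The cast yields U+FFE0 and U+FFC1, which have nothing to do with Cyrillic.
        System.out.println(Integer.toHexString(castByte(bytes[0]))); // ffe0
        System.out.println(Integer.toHexString(castByte(bytes[1]))); // ffc1
        // Decoding yields the Cyrillic letters U+0430 and U+0411.
        char[] decoded = decode(bytes, "windows-1251");
        System.out.println(Integer.toHexString(decoded[0])); // 430
        System.out.println(Integer.toHexString(decoded[1])); // 411
    }
}
```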
Next, I check how Java handles this data when writing to a file. The first way:
FileOutputStream fos = new FileOutputStream("chars.txt");
BufferedOutputStream bos = new BufferedOutputStream(fos);
bos.write(bytes, 0, bytes.length);
bos.close();
With FileOutputStream, the Cyrillic letters come out correctly: the text file contains the letters а and Б instead of question marks, and a hex editor shows the correct byte values 224 and 193 rather than the negative numbers the program originally printed.
The second way to write to the file:
FileWriter fos = new FileWriter("chars.txt");
fos.write(chars, 0, chars.length);
fos.close();
With FileWriter, question marks appear in place of the Cyrillic characters in the text file, and a hex editor shows each question mark as byte 63 (not 224 and 193, which were in the string originally).
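The two outcomes can be reproduced side by side. The sketch below is only an illustration under stated assumptions: it uses OutputStreamWriter with an explicit windows-1251 charset in place of FileWriter's platform default, writes to a temporary file, and the class and method names are invented. Chars obtained by casting are not representable in windows-1251, so the encoder substitutes ? (byte 63); chars obtained by decoding the bytes round-trip unchanged.

```java
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

public class WriterRoundTrip {
    static final Charset CP1251 = Charset.forName("windows-1251");

    // Writes the chars through a Writer that encodes to windows-1251,
    // then returns the bytes that actually ended up in the file.
    static byte[] writeAndReadBack(char[] chars) {
        try {
            Path file = Files.createTempFile("chars", ".txt");
            try (Writer w = new OutputStreamWriter(Files.newOutputStream(file), CP1251)) {
                w.write(chars, 0, chars.length);
            }
            byte[] written = Files.readAllBytes(file);
            Files.delete(file);
            return written;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = {-32, -63};

        // Chars produced by a plain cast (U+FFE0, U+FFC1) have no windows-1251 mapping.
        char[] cast = new char[bytes.length];
        for (int i = 0; i < bytes.length; i++) cast[i] = (char) bytes[i];
        System.out.println(Arrays.toString(writeAndReadBack(cast))); // [63, 63]

        // Chars produced by decoding the bytes encode back to the original values.
        char[] decoded = new String(bytes, CP1251).toCharArray();
        System.out.println(Arrays.toString(writeAndReadBack(decoded))); // [-32, -63]
    }
}
```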
This raises the following questions:
1) I expected FileWriter to write the Cyrillic characters correctly, since in the standard Windows encoding (windows-1251) each of them fits in one byte. Why does it get them wrong? I know that FileWriter is a character stream, not a byte stream. But character streams also operate on bytes underneath. If the byte for the character а=224 was there originally, shouldn't the program write that same byte to the file at the low level? A text editor might show something else if the wrong encoding is chosen and byte 224 maps to some hieroglyph. But why does FileWriter write a completely different byte?
2) Why does the String class have a getBytes() method at all if byte in Java has a different range from the one traditionally used in computing? It seems nobody in computing uses negative numbers for byte values?
Java's primitive byte data type is always defined as consisting of 8 bits and being a signed data type, holding values from −128 to 127. - Eugene Krivenja
With the getBytes method in String, we want to get the code points of the characters in the standard encoding, but when the code point goes beyond 127, something else is returned, and it is not clear how to work with it. - Dmitry