I decided to check how the String getBytes() method behaves when the code point of a letter is greater than 127. As an example, I took a string with Latin and Cyrillic characters and converted it into an array of bytes:

byte[] bytes = "abcdefghijz<аБ".getBytes();

I got the following bytes:

[97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 122, 60, -32, -63]

As can be seen, because the byte type overflows, the code points а=224 and Б=193 shifted into the negative range, to -32 and -63 respectively.
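For reference, the same bytes can be viewed in the 0 to 255 range by masking them with 0xFF (or with Byte.toUnsignedInt). A minimal sketch; the class name UnsignedBytes and the helper toUnsigned are made up for illustration:

```java
import java.util.Arrays;

public class UnsignedBytes {
    // Hypothetical helper: maps each signed byte (-128..127) to its unsigned value (0..255).
    static int[] toUnsigned(byte[] bytes) {
        int[] result = new int[bytes.length];
        for (int i = 0; i < bytes.length; i++) {
            result[i] = bytes[i] & 0xFF; // same result as Byte.toUnsignedInt(bytes[i])
        }
        return result;
    }

    public static void main(String[] args) {
        byte[] bytes = {97, 60, -32, -63};
        System.out.println(Arrays.toString(toUnsigned(bytes))); // [97, 60, 224, 193]
    }
}
```

The bit pattern stored in the byte is the same either way; only the decimal rendering differs.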

Next, I convert the byte array to a char array.

 char[] chars = new char[bytes.length];
 for (int i = 0; i < bytes.length; i++) {
     chars[i] = (char) bytes[i];
 }

I get

[a, b, c, d, e, f, g, h, i, j, z, <, ?, ?]

As expected, the chars for negative values are undefined, and the IDE draws ? characters in their place.
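A note on what the cast does: converting a negative byte to char sign-extends it, so (char) -32 becomes the code unit 0xFFE0 rather than the letter а. To turn encoded bytes into characters, they have to be decoded with a charset. A minimal sketch, assuming the bytes are in windows-1251 and that the JDK in use ships this charset; the class and method names are hypothetical:

```java
import java.nio.charset.Charset;

public class DecodeBytes {
    // Decodes bytes with the named charset instead of casting each byte to char.
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        byte[] bytes = {97, -32, -63};

        // Plain cast: the byte is sign-extended before being narrowed to char.
        char cast = (char) bytes[1];
        System.out.println(Integer.toHexString(cast)); // ffe0

        // Charset-aware decoding: prints the code point of every decoded char.
        for (char c : decode(bytes, "windows-1251").toCharArray()) {
            System.out.println("U+" + Integer.toHexString(c));
        }
    }
}
```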

Next, I check how Java handles these chars when writing to a file. The first way:

 FileOutputStream fos = new FileOutputStream("chars.txt");
 BufferedOutputStream bos = new BufferedOutputStream(fos);
 bos.write(bytes, 0, bytes.length);
 bos.close();

With FileOutputStream, the Cyrillic letters come out correctly: the text file contains the letters а and Б instead of question marks, and a hex editor shows the correct byte values 224 and 193 rather than the negative numbers the program printed originally.

The second way to write to the file:

 FileWriter fos = new FileWriter("chars.txt");
 fos.write(chars, 0, chars.length);
 fos.close();

With FileWriter, question marks appear in place of the Cyrillic characters in the text file, and a hex editor shows each question mark as byte 63 (not 224 and 193, which were in the string originally).
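The 63 bytes can be reproduced without a file. A writer encodes chars into bytes, and the chars produced by the cast (0xFFE0 and 0xFFC1) have no mapping in windows-1251, so the encoder substitutes '?'. A minimal sketch, assuming windows-1251 is the default encoding as in the question (it is passed explicitly here); the class and method names are hypothetical:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.util.Arrays;

public class WriterEncoding {
    // Encodes chars the way a Writer does, but into memory instead of a file.
    static byte[] encode(char[] chars, String charsetName) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, Charset.forName(charsetName))) {
            writer.write(chars, 0, chars.length);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray(); // the writer is closed (and flushed) at this point
    }

    public static void main(String[] args) {
        char[] fromCast = {(char) (byte) -32, (char) (byte) -63}; // 0xFFE0, 0xFFC1
        char[] realLetters = {'\u0430', '\u0411'};                // а, Б

        System.out.println(Arrays.toString(encode(fromCast, "windows-1251")));    // [63, 63]
        System.out.println(Arrays.toString(encode(realLetters, "windows-1251"))); // [-32, -63]
    }
}
```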

This raises the following questions:

1) I expected FileWriter to write the Cyrillic characters correctly, since in the standard Windows encoding (windows-1251) each of them fits in one byte. Why does it get them wrong? I know that FileWriter is a character stream, not a byte stream. But character streams ultimately operate on bytes too. If the byte for the character а was originally 224, shouldn't the program write that same byte to the file at the low level? A text editor might show something else if the wrong encoding is chosen and byte 224 maps to some hieroglyph. But why does FileWriter write a completely different byte?

2) Why does the String class have a getBytes() method if byte in Java has a reduced range rather than the one traditionally used in computer technology? It seems that nobody uses negative numbers in computer technology?

  • Eight bits as a data type can be defined differently in different programming languages. If you write in Java, you should know that Java's primitive byte type is always 8 bits and signed, holding values from -128 to 127. - Eugene Krivenja
  • May I ask: what does "reduced" mean? - Olexiy Morenets
  • @OleksiyMorenets maybe I misunderstand how to work with bytes in Java, but they do not have the full range of values of the byte used in computer science. Instead of 0 to 255, for some reason -128 to 127 is used. It is logical that when we call the getBytes method on a String, we want to get the code points of the characters in the standard encoding. But when the code point goes beyond 127, something else is returned, and it is not clear how to work with it - Dmitry
  • Yes, and what about the answer to my question? - Oleksiy Morenets
  • @OlexiyMorenets reduced means incomplete, cut down. Because byte in Java is signed, it can hold only half as many positive values - Dmitry

1 answer

 byte[] bytes = "abcdefghijz<аБ".getBytes(); 

This is not quite the full version, the full version is this:

 byte[] bytes = "abcdefghijz<аБ".getBytes(charset); // charset - the string's encoding

That is, the bytes returned depend on the encoding used. In the first case, the system's default encoding is used (usually Win-1251).
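To illustrate the dependence on the encoding, here is a minimal sketch that encodes the two Cyrillic letters with two explicitly named charsets. The letters are written as \u escapes so that the result does not depend on the encoding of the source file; the class and method names are hypothetical, and the sketch assumes the JDK provides windows-1251:

```java
import java.nio.charset.Charset;
import java.util.Arrays;

public class ExplicitCharset {
    // Encodes the text with an explicitly named charset instead of the platform default.
    static byte[] bytesOf(String text, String charsetName) {
        return text.getBytes(Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        String letters = "\u0430\u0411"; // а and Б

        System.out.println(Arrays.toString(bytesOf(letters, "windows-1251"))); // [-32, -63]
        System.out.println(Arrays.toString(bytesOf(letters, "UTF-8")));        // [-48, -80, -48, -111]
    }
}
```

The same two characters give two bytes in one encoding and four in the other, which is why calling getBytes() without an argument ties the result to the machine it runs on.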

What encoding is the string abcdefghijz<аБ written in? Apparently UTF-8, and that is where your differing interpretations come from.

On the second question:

It seems that nobody uses negative numbers in computer technology?

I'll leave that on your conscience. Bytes are bytes, even in Africa, and whether they are displayed as positive or negative numbers (in decimal) is just a matter of presentation.

Update

An OutputStream works with a ready-made raw byte array. In contrast, an OutputStreamWriter encodes the chars it is given into bytes according to the Charset specified when it was created. As for FileWriter, which is a subclass of OutputStreamWriter, the documentation says:

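The constructors of this class assume that the default character encoding and the default byte-buffer size are acceptable. To specify these values yourself, construct an OutputStreamWriter on a FileOutputStream.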
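What the documentation suggests might look roughly like this. It is only a sketch: the helper writeToTempFile, the temporary file, and the choice of windows-1251 are assumptions made for illustration, not part of the original code:

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

public class ExplicitWriter {
    // Writes the text to a temporary file with an explicit charset and returns the raw file bytes.
    static byte[] writeToTempFile(String text, String charsetName) {
        try {
            Path file = Files.createTempFile("chars", ".txt");
            try {
                try (Writer writer = new OutputStreamWriter(
                        new FileOutputStream(file.toFile()), Charset.forName(charsetName))) {
                    writer.write(text);
                }
                return Files.readAllBytes(file);
            } finally {
                Files.deleteIfExists(file); // clean up the temporary file
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // а and Б as real characters, not as bytes cast to char
        for (byte b : writeToTempFile("\u0430\u0411", "windows-1251")) {
            System.out.println(b & 0xFF); // 224, then 193
        }
    }
}
```

Given real characters and a named charset, the writer produces the same bytes 224 and 193 that FileOutputStream wrote directly.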

P.S. I advise you to think carefully about my remark:

What encoding is the string abcdefghijz<аБ written in?

One more update

char and byte are not the same thing. To make it clear, take the purely Russian letter Ж (the uppercase one):

  • in Unicode (UTF-16), this is 0x04 0x16 (two bytes)
  • in Win-1251 - 0xC6 (1 byte)
  • in KOI-8 - 0xF6 (1 byte; 0xD6 is the lowercase ж)

etc.
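The byte values for a letter in different encodings can be checked directly. A minimal sketch, assuming the JDK in use provides the windows-1251 and KOI8-R charsets; the class name and the hex helper are hypothetical:

```java
import java.nio.charset.Charset;

public class LetterEncodings {
    // Returns the bytes of the text in the named charset as space-separated hex pairs.
    static String hex(String text, String charsetName) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(Charset.forName(charsetName))) {
            sb.append(String.format("%02X ", b & 0xFF)); // mask to print 00..FF
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String letter = "\u0416"; // uppercase Ж, written as an escape to avoid source-encoding issues
        for (String name : new String[]{"UTF-16BE", "UTF-8", "windows-1251", "KOI8-R"}) {
            System.out.println(name + ": " + hex(letter, name));
        }
    }
}
```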

Now imagine there is a set of bytes and we read it. The key question: how do we know which characters this set of bytes represents? That's right, without prior knowledge of the encoding we cannot know (or rather, we can only guess through certain deductions and constructions, like here). So we have to supply a table that maps bytes to characters (chars). This is called transcoding.
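Transcoding in this sense can be sketched as two steps: decode the bytes into chars with the encoding they were written in, then encode those chars with the target encoding. The class and method names below are hypothetical, and the sketch assumes the JDK provides windows-1251 and KOI8-R:

```java
import java.nio.charset.Charset;

public class Transcode {
    // Step 1: bytes -> chars using the source charset. Step 2: chars -> bytes using the target charset.
    static byte[] transcode(byte[] input, String fromCharset, String toCharset) {
        String text = new String(input, Charset.forName(fromCharset));
        return text.getBytes(Charset.forName(toCharset));
    }

    public static void main(String[] args) {
        byte[] cp1251 = {(byte) 0xC6}; // the letter Ж in windows-1251

        for (String target : new String[]{"KOI8-R", "UTF-8"}) {
            StringBuilder sb = new StringBuilder(target + ":");
            for (byte b : transcode(cp1251, "windows-1251", target)) {
                sb.append(String.format(" %02X", b & 0xFF));
            }
            System.out.println(sb);
        }
    }
}
```

If the source charset passed to the first step is wrong, the chars come out wrong as well, and no choice of target encoding restores them.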

  • "that is, the bytes produced depend on the encoding used" - the bytes that the toString() method returns suit me. They were correctly converted to text with FileOutputStream, right? So the problem lies in the FileWriter class. For some reason, it substitutes bytes. - Dmitry
  • Let me explain: it doesn't matter to me which characters are displayed in the *.txt file. I open this file in a hex editor and look only at the bytes written. Mapping a byte to a specific character is already shamanism with encodings, which is not under discussion right now. So, what we have: bytes obtained using the toString() method. The FileOutputStream class copies the bytes to the file correctly, unchanged. The FileWriter class takes the original bytes and writes OTHER bytes to the file. The question is why? - Dmitry
  • see the answer update - Barmaley
  • To me, the mechanism for converting bytes to char is incomprehensible. There is no concept of char in the computer. Information is stored there in bits (bytes, if you like). And bytes are bytes, even in Africa; how to display them is a secondary issue. - Dmitry
  • Another update - Barmaley