I have a method that reads a large text file, but the code does not look right to me, and I would like to get rid of the if (amountData != -1) check. That check exists because of the following problem: when reading reached the end of the file and the last loop iteration ran, the buffer was much larger than the number of bytes left to read, so the variable holding the file's contents ended up with a long run of squares (undecoded characters) at the end. As I understand it, those squares are the null bytes in the part of the buffer that was never filled, because the file ran out of data. To get rid of them I started checking how many bytes were actually read. And it works.

I would like to drop this workaround and use the tools of the standard API instead, but I do not know how. Please help.

    private void read() {
        String path = "/Users/pavel/Desktop/test/target_text.txt";
        try (BufferedInputStream in = new BufferedInputStream(new FileInputStream(path))) {
            byte[] bytes = new byte[1024];
            int amountData = in.read(bytes, 0, 1024);
            while (amountData != -1 && amountData == 1024) {
                sb.append(new String(bytes, "UTF8"));
                amountData = in.read(bytes, 0, 1024);
            }
            if (amountData != -1) {
                byte[] residue = new byte[amountData];
                System.arraycopy(bytes, 0, residue, 0, residue.length);
                sb.append(new String(residue, "UTF8"));
            }
            System.out.println(sb);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    4 answers

    In general, colleague, you have two problems:

    1. You already have a buffered stream, so why build a second layer of buffering on top of it with a byte array?
    2. The bigger problem is that you have a text file encoded in UTF-8 and you read it as a byte stream. In UTF-8 a character takes a variable number of bytes: one byte for ASCII, two for Cyrillic, and up to four in general. So when you read the file as bytes with a 1024-byte buffer and decode each chunk separately, sooner or later a chunk boundary falls in the middle of a character, and that is where the squares come from.
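The second point can be reproduced without any file at all. The sketch below is only an illustration: the class name Utf8SplitDemo and the helper decodePerChunk are made up for this example, and the chunk sizes are chosen so that one run splits a character and the other does not.

```java
import java.nio.charset.StandardCharsets;

public class Utf8SplitDemo {
    // Decodes every fixed-size chunk on its own, the way the loop in the question does.
    static String decodePerChunk(byte[] data, int chunkSize) {
        StringBuilder sb = new StringBuilder();
        for (int off = 0; off < data.length; off += chunkSize) {
            int len = Math.min(chunkSize, data.length - off);
            sb.append(new String(data, off, len, StandardCharsets.UTF_8));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Six Cyrillic letters, written as escapes; each one takes 2 bytes in UTF-8.
        String text = "\u043f\u0440\u0438\u0432\u0435\u0442";
        byte[] data = text.getBytes(StandardCharsets.UTF_8); // 12 bytes

        // Chunk boundaries at 4 and 8 fall between characters.
        System.out.println(decodePerChunk(data, 4).equals(text)); // prints true
        // The boundary at 5 falls inside the third character.
        System.out.println(decodePerChunk(data, 5).equals(text)); // prints false
    }
}
```

In the second run the two halves of the split character are each decoded as the replacement character U+FFFD, which most fonts draw as a square or a question mark.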

    The file should be read as a stream of UTF-8 characters; then there will be no such problems:

        final static int BUFFER_SIZE = 1024;

        char[] buffer = new char[BUFFER_SIZE];
        StringBuffer sb = new StringBuffer();
        int size;
        InputStreamReader in = new InputStreamReader(new FileInputStream(path), "UTF-8");
        do {
            size = in.read(buffer, 0, BUFFER_SIZE);
            if (size > 0) sb.append(buffer, 0, size);
        } while (size == BUFFER_SIZE);

    This is exactly the case where a post-condition (do-while) loop is called for.

    • With this loop you lose the tail of the file if the number of characters is not a multiple of 1024 (and if the stream pulls data from the network, a full buffer is not guaranteed at all). There is the standard idiom while ((size = in.read(buffer)) != -1) { sb.append(buffer, 0, size); }, so why invent something else? - zRrr
    • @zRrr, are you sure the tail of the file is lost? Sit down, that's a failing grade :))) - Barmaley
    • Fine, I'll go sit down. It is actually the other way round: for a file whose character count is a multiple of 1024 (including an empty file), sb.append will throw an exception because size will be -1. The remark about losing the tail when the buffer is only partially filled still stands. - zRrr
    • @zRrr You are right about sb.append(); I am neither a compiler nor a tester, after all. I sketched the general idea off the cuff. But about losing the tail of the file you are still wrong. - Barmaley
    • You have a FileInputStream wrapped in a BufferedInputStream: that is the first buffer, and the second buffer is your bytes array. - Barmaley
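The idiom zRrr quotes in the comments can be written out as a complete program. This is a minimal sketch, not the answer author's code: the class name ReaderLoop and the method readAll are hypothetical, and the demo reads from an in-memory stream so it runs anywhere. For a real file you would pass a FileInputStream instead.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class ReaderLoop {
    // Reads the whole stream as UTF-8 text and closes it afterwards.
    static String readAll(InputStream source) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buffer = new char[1024];
        try (Reader in = new InputStreamReader(source, StandardCharsets.UTF_8)) {
            int size;
            // read() returns -1 only at end of stream; a partly filled buffer
            // just means fewer characters were available on this call.
            while ((size = in.read(buffer)) != -1) {
                sb.append(buffer, 0, size);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] sample = "\u043f\u0440\u0438\u0432\u0435\u0442, world"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(readAll(new ByteArrayInputStream(sample)));
    }
}
```

Because the loop condition tests only for -1, it does not matter whether a read fills the buffer completely, and the separate handling of the last chunk from the question is no longer needed.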

    If the method only works with text files, I suggest using character I/O streams. In your case it would look like this:

        String path = "/Users/pavel/Desktop/test/target_text.txt";
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(path), Charset.forName("UTF-8")))) {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line);
                sb.append("\n");
            }
            System.out.println(sb);
        } catch (IOException e) {
            e.printStackTrace();
        }

    Character I/O streams wrap byte streams and spare you such problems.

    • Concatenating strings is faster than append. - And
    • @And, yes, you are right. Thanks. - iGreetYou

    If you need to work with bytes, then the following code will suit you:

        private void read() {
            StringBuilder sb = new StringBuilder();
            String path = "/Users/pavel/Desktop/test/target_text.txt";
            try (BufferedInputStream in = new BufferedInputStream(new FileInputStream(path))) {
                int bufferSize = 1024;
                byte[] bytes = new byte[bufferSize];
                int amountData;
                // decode only the bytes actually read in this pass
                while ((amountData = in.read(bytes, 0, bufferSize)) != -1)
                    sb.append(new String(bytes, 0, amountData, StandardCharsets.UTF_8));
                System.out.println(sb.toString());
            } catch (IOException e) {
                e.printStackTrace();
            }
        }

    But if your task is only to print the lines to the screen, I would recommend a higher-level API, for example streams:

        String path = "/Users/pavel/Desktop/test/target_text.txt";
        String content = Files
                .lines(Paths.get(path), StandardCharsets.UTF_8)
                .collect(Collectors.joining(System.lineSeparator()));
        System.out.println(content);

    The Files.lines method is lazy: it does not load the entire file at once, but reads lines as they are needed.
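One practical consequence of that laziness is that the stream keeps the file open until it is closed, so Files.lines is usually wrapped in try-with-resources. A minimal sketch, assuming a hypothetical class LinesDemo with a helper readLines; the demo writes a temporary file so that it is self-contained:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class LinesDemo {
    // Joins all lines of the file with the platform line separator.
    static String readLines(Path path) throws IOException {
        // The stream holds an open file handle; try-with-resources releases it.
        try (Stream<String> lines = Files.lines(path, StandardCharsets.UTF_8)) {
            return lines.collect(Collectors.joining(System.lineSeparator()));
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("lines-demo", ".txt");
        Files.write(tmp, "first\nsecond\nthird".getBytes(StandardCharsets.UTF_8));
        System.out.println(readLines(tmp));
        Files.delete(tmp);
    }
}
```

Note that collecting into one String still puts the whole text in memory at the end. The laziness pays off when each line is processed and then discarded, for example with lines.forEach(System.out::println).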

    • UTF-8 is a multibyte encoding, so the bytes of a single character can end up in different loop passes and will then not be decoded correctly. It is better to use InputStreamReader or CharsetDecoder. - zRrr
    • Yes, that can happen; I didn't think of it. - Artem Konovalov
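If the data really has to be pulled in as byte chunks, the CharsetDecoder route mentioned by zRrr might look roughly like the sketch below. It is an illustration under assumptions, not a reference implementation: the class name ChunkedDecoder and the method decode are invented here, and malformed input is replaced with U+FFFD rather than reported, which is a choice made for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class ChunkedDecoder {
    static String decode(InputStream source, int chunkSize) throws IOException {
        if (chunkSize < 4) {
            // A UTF-8 character can take 4 bytes and must fit into the buffer.
            throw new IllegalArgumentException("chunkSize must be at least 4");
        }
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        ByteBuffer bytes = ByteBuffer.allocate(chunkSize);
        // UTF-8 never decodes to more chars than it has bytes, so this is enough.
        CharBuffer chars = CharBuffer.allocate(chunkSize);
        StringBuilder sb = new StringBuilder();
        try (ReadableByteChannel channel = Channels.newChannel(source)) {
            while (channel.read(bytes) != -1) {
                bytes.flip();
                decoder.decode(bytes, chars, false); // false: more input may follow
                bytes.compact();                     // keep bytes of an unfinished character
                drain(chars, sb);
            }
            bytes.flip();
            decoder.decode(bytes, chars, true);      // true: this is the end of the input
            decoder.flush(chars);
            drain(chars, sb);
        }
        return sb.toString();
    }

    // Moves the decoded characters into the builder and empties the buffer.
    private static void drain(CharBuffer chars, StringBuilder sb) {
        chars.flip();
        sb.append(chars);
        chars.clear();
    }

    public static void main(String[] args) throws IOException {
        String text = "\u043f\u0440\u0438\u0432\u0435\u0442";
        byte[] data = text.getBytes(StandardCharsets.UTF_8);
        // A 5-byte chunk splits a character, but the decoder carries it over.
        System.out.println(decode(new ByteArrayInputStream(data), 5).equals(text));
    }
}
```

The point of the design is that the decoder keeps the leftover bytes of an incomplete character between passes, which new String(bytes, 0, amountData, ...) cannot do. InputStreamReader does the same bookkeeping internally, so for plain text files it remains the simpler option.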

    Off the top of my head, I can offer this option:

        int amountData = in.available() / 1024;
        for (int n = 0; n < amountData; n++) {
            in.read(bytes, 0, 1024);
        }
        in.read(bytes, 0, in.available());
    • And what does available() return? And why the percent sign before 1024, as in % 1024? - Pavel
    • 1) It returns how many bytes can be read from the file. 2) It is integer division. In effect, I work out how many full 1024-byte blocks can be read, read them, and then read the remainder. - Riĥard Brugekĥaim
    • 1) It is called twice over the whole process. 2) It is the same thing as in the File Properties window, where the file size is shown. - Riĥard Brugekĥaim
    • @Riĥard Brugekĥaim, actually, % is the modulo operator. @Pavel, the available method reports how many bytes are still unread, so with the modulo at most 1024 bytes are taken at a time. - And
    • @And, yes, I mixed them up. - Riĥard Brugekĥaim
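A caveat on this approach: the InputStream documentation describes available() as an estimate of the bytes that can be read without blocking, not as the file size, and read() may return fewer bytes than requested, which the snippet above does not check. If the goal is simply "all bytes, then decode once", and the file fits in memory (an assumption that may not hold for the large file in the question), the standard library already has a call for it. A minimal sketch with a hypothetical class ReadWhole:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ReadWhole {
    // Loads the entire file into memory and decodes it in one step,
    // so no character can be split between chunks.
    static String readWhole(Path path) throws IOException {
        byte[] all = Files.readAllBytes(path);
        return new String(all, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("read-whole", ".txt");
        Files.write(tmp, "\u043f\u0440\u0438\u0432\u0435\u0442".getBytes(StandardCharsets.UTF_8));
        System.out.println(readWhole(tmp).length()); // prints 6
        Files.delete(tmp);
    }
}
```

For files too large to hold in memory, a Reader-based loop is the safer choice.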