This is a fairly involved question, since the answer depends on the OS, the hardware, and so on. Still, there are some general recommendations that help:
Input/output operations are best done:
- Asynchronously. This reduces the number of context switches (and the time threads spend blocked while waiting for the OS to respond).
- With a limited number of threads. When a large amount of data is read in parallel, the disk itself cannot keep up, so extra time is again spent on context switching, synchronization, and so on, only this time on the disk controller's side.
- As a stream, that is, avoiding calls that read a whole file into memory at once (File.lines and the like). The main reason: ten 1 GB files loaded this way can take up about 10 GB of memory. If the files are parsed on the fly, the amount of memory required drops. In addition, a large array (or a large string, it makes no difference) is allocated directly in the oldest generation so that it does not have to be moved later, which hurts both memory consumption and performance.
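The streaming point can be illustrated with a minimal sketch. It assumes the goal is something simple like counting space-separated words; `countWords` is a hypothetical helper, not part of the original answer. Kotlin's `File.useLines` hands out lines lazily and closes the file afterwards, so only the current line has to be held in memory.

```kotlin
import java.io.File

// Hypothetical helper: counts space-separated words in a file
// without ever holding the whole file in memory.
fun countWords(file: File): Long =
    file.useLines { lines ->
        // `lines` is a lazy Sequence<String>; each line is read, processed, and dropped.
        lines.sumOf { line -> line.split(" ").count { it.isNotEmpty() }.toLong() }
    }

fun main() {
    // Small temporary file, used only to demonstrate the call.
    val sample = File.createTempFile("sample", ".txt").apply {
        writeText("one two three\nfour five\n")
        deleteOnExit()
    }
    println(countWords(sample))
}
```

By contrast, `File.readLines()` or `File.readText()` would build the full content in memory before any processing starts.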
In short: use the answer above, slightly modified (there is no point in processing the lines themselves in a separate parallel block). Also, concurrency must be limited, and the right limit can only be found by testing on the actual hardware.
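import java.io.File
import java.nio.file.Files
import kotlin.streams.asStream

// parallel() takes no argument; the limit N has to be enforced separately
File("/tmp").walkTopDown().filter { it.isFile }.asStream().parallel()
    .flatMap { Files.lines(it.toPath()) }  // lines are read lazily
    .flatMap { it.split(" ").stream() }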
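One way to make the concurrency limit explicit is sketched below. It swaps the parallel stream for a fixed-size thread pool, which is a different technique from the snippet above, and it assumes the work per file is a simple word count. The names `countWordsBounded` and `maxParallelReads` are hypothetical; the right value for the limit still has to come from measurements on the target hardware.

```kotlin
import java.io.File
import java.nio.file.Files
import java.util.concurrent.Callable
import java.util.concurrent.Executors

// Hypothetical helper: counts words in all files under `root`,
// reading at most `maxParallelReads` files at the same time.
fun countWordsBounded(root: File, maxParallelReads: Int): Long {
    val pool = Executors.newFixedThreadPool(maxParallelReads)
    try {
        // One task per file; each task streams its file line by line.
        val tasks = root.walkTopDown()
            .filter { it.isFile }
            .map { file ->
                Callable<Long> {
                    file.useLines { lines ->
                        lines.sumOf { line ->
                            line.split(" ").count { it.isNotEmpty() }.toLong()
                        }
                    }
                }
            }
            .toList()
        // invokeAll blocks until every task has finished.
        return pool.invokeAll(tasks).map { it.get() }.sum()
    } finally {
        pool.shutdown()
    }
}

fun main() {
    // Temporary directory with two small files, used only for the demonstration.
    val dir = Files.createTempDirectory("io-demo").toFile()
    File(dir, "a.txt").writeText("one two\nthree\n")
    File(dir, "b.txt").writeText("four five\n")
    println(countWordsBounded(dir, maxParallelReads = 2))
    dir.deleteRecursively()
}
```

The pool size is the only knob here, so it is easy to rerun the same job with different limits and compare timings.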
In detail: you need to apply all of the points above. In effect, you will end up with a study of IO in Java, similar to this one for .NET