How to optimize the operation of reading from a file and splitting a line into words

Question

I have the following code:

    folderEntries.forEach { entry ->
        entry.listFiles().filter { it.isFile }.forEach { files.add(it) }
    }
    files.subList((files.size * threads / threadCount), (files.size * (threads + 1) / threadCount))
        .forEach { input.addAll(input.lastIndex + 1, Files.readAllLines(it.toPath())) }
    for (str in input) {
        buf = str.split(" ")
        outBuf.add("y = ${(Math.atan((buf[2].toDouble() / 4)) - buf[3].toInt() * 62) / (buf[0].toInt() * buf[0].toInt() - buf[1].toInt())}")
    }

It runs slower than I would like. I would be grateful for help with optimizing it.

Manushin Igor · Accepted Answer · 2018-03-04T17:44:08

This is a fairly broad question, since the answer depends on the OS, the hardware, and so on. However, some general recommendations can help:

I/O operations are best done:

Asynchronously. This reduces the number of context switches (a thread that blocks while waiting for the OS to respond causes one).
With a bounded number of threads. When a large amount of data is read in parallel, the disk itself cannot keep up, so extra time again goes to context switches, synchronization, and so on, only now on the disk controller side.
As a stream, that is, avoiding Files.readAllLines and similar calls that load a whole file at once. The main reason: ten files of 1 GB each can take up about 10 GB of memory. If the files are parsed on the fly, the required amount of memory shrinks. Moreover, a large array (or string, it makes no difference) is allocated directly in the oldest garbage-collector generation (to avoid moving it in memory), which hurts both memory consumption and performance.
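
The streaming point can be illustrated with a minimal sketch, assuming the input format from the question (four space-separated fields per line). The helper names evaluateLine and processFile are hypothetical and not part of the original code. File.useLines from the Kotlin standard library reads the file lazily and closes it afterward, so only the current line is held in memory.

```kotlin
import java.io.File

// Hypothetical helper: evaluates the question's formula for one line "a b c d",
// where a, b and d are integers and c is a decimal number.
fun evaluateLine(line: String): Double {
    val buf = line.split(" ")
    val a = buf[0].toInt()
    val b = buf[1].toInt()
    val c = buf[2].toDouble()
    val d = buf[3].toInt()
    return (Math.atan(c / 4) - d * 62) / (a * a - b)
}

// Hypothetical helper: streams the file line by line instead of loading it whole.
// Each result is handed to sink right away, so no intermediate list of lines exists.
fun processFile(file: File, sink: (String) -> Unit) {
    file.useLines { lines ->
        for (line in lines) {
            if (line.isNotBlank()) sink("y = ${evaluateLine(line)}")
        }
    }
}
```

A caller could pass a sink that writes straight to a BufferedWriter, which keeps memory use roughly constant regardless of file size.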

In short: use the parallel-stream answer by asm0dey, slightly modified: there is no point in processing the lines themselves in a separate parallel block. Moreover, the degree of parallelism must be limited, and the right limit can only be found by testing on the actual hardware.

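    import java.io.File
    import java.nio.file.Files
    import kotlin.streams.asStream

    // N is the parallelism limit. Stream.parallel() takes no argument, so the limit
    // has to be enforced separately, for example with a dedicated thread pool.
    File("/tmp")
        .walkTopDown()
        .filter { it.isFile }                  // skip directories; Files.lines fails on them
        .asStream()
        .parallel()
        .flatMap { Files.lines(it.toPath()) }  // lazy, line-by-line read
        .flatMap { it.split(" ").stream() }    // split in the same stage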

In detail: you need to apply all of the points above. In effect, you will end up doing a study of I/O in Java, similar to this one for .NET.
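
One way to limit concurrency, as recommended above, is sketched below. It swaps the parallel stream for a fixed thread pool, which is a different technique from the one in the snippet but makes the limit explicit. The function name countWords and the parameter maxParallelReads are hypothetical, and counting words merely stands in for whatever per-line work is needed.

```kotlin
import java.io.File
import java.util.concurrent.Callable
import java.util.concurrent.Executors

// Hypothetical helper: counts the space-separated words in every regular file
// under root, reading at most maxParallelReads files at the same time.
fun countWords(root: File, maxParallelReads: Int): Long {
    val pool = Executors.newFixedThreadPool(maxParallelReads)
    try {
        // One task per file; each task streams its file lazily.
        val tasks = root.walkTopDown()
            .filter { it.isFile }
            .map { file ->
                Callable<Long> {
                    file.useLines { lines ->
                        lines.fold(0L) { acc, line ->
                            acc + line.split(" ").count { it.isNotEmpty() }
                        }
                    }
                }
            }
            .toList()
        // invokeAll blocks until every task has finished.
        var total = 0L
        for (future in pool.invokeAll(tasks)) {
            total += future.get()
        }
        return total
    } finally {
        pool.shutdown()
    }
}
```

The best value for maxParallelReads depends on the disk and has to be measured; starting from 1 and increasing it while timing the run on the target machine is a reasonable first experiment.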

asm0dey · Answer 2 · 2018-02-13T04:31:17

This calls for a solution with parallel streams, such as:

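    import java.io.File
    import java.nio.file.Files
    import kotlin.streams.asStream

    File("/tmp")
        .walkTopDown()
        .filter { it.isFile }   // skip directories; Files.lines fails on them
        .asStream()
        .parallel()
        .map { Files.lines(it.toPath()) }
        .flatMap { it }
        .parallel()
        .map { it.split(" ").stream() }
        .flatMap { it }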
