There are 400,000 XML files, most of them no larger than 2 KB. A Java application has to read them from disk, process them (currently with a StAX parser), and load the results into various collections.

How many threads should I use for this? Some people say it is inefficient to read from a disk with more than one thread; others, on the contrary, use quite a lot of threads.


Added:

@Arhad @KoVadim @Monk If I run the program for the first time (i.e., with no warm-up), the whole process takes as long as an hour on my machine; on other machines it takes no more than 15 minutes. After several runs, processing takes me about 15 minutes as well. Yesterday I tried a ForkJoinPool (each folder goes to a separate thread) and a FixedThreadPool (simply submitting all the files one after another as Runnable tasks), with zero effect.

I set 4 threads (as many as the processor has cores). Maybe the amount of data fed from disk into memory in one go can somehow be increased; the files are defragmented, so, if I understand correctly, the disk can read them sequentially.

About separating the logic: I thought about that at the very beginning, but I am not sure it can be applied to a StAX parser.

  • I think the producer-consumer pattern is right for you: one thread reads files and puts them into a thread-safe queue, and other threads take them from the queue and parse them. - Alexander Petrov
  • @Alexander Petrov Yes, it would be nice to use it, but I don't know whether it is applicable to a StAX parser? - TheSN pm
  • With such small file sizes, it is easier to read them into a DOM and process them there. - Alexander Petrov
  • Producer/consumer does not care at all what logic it contains. The main thing is that there are well-defined inputs and outputs; the "transformation" logic can be anything. - andreycha
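To the question of whether producer-consumer is applicable to StAX: it is, because StAX can parse from any `InputStream`, including an in-memory one. Below is a minimal sketch under that assumption; the class name `StaxPipeline` and the element-counting "processing" are placeholders for illustration only — in the real application the workers would fill your collections instead.

```java
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import java.io.ByteArrayInputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;

public class StaxPipeline {

    // Poison pill: the same array instance signals end-of-input to each worker.
    private static final byte[] POISON = new byte[0];

    // StAX is happy with in-memory bytes: the single producer does all the
    // disk I/O, each consumer parses from a ByteArrayInputStream.
    // Here we just count START_ELEMENT events as stand-in "processing".
    public static int countElements(byte[] xml) throws Exception {
        XMLStreamReader r = XMLInputFactory.newFactory()
                .createXMLStreamReader(new ByteArrayInputStream(xml));
        int elements = 0;
        while (r.hasNext()) {
            if (r.next() == XMLStreamConstants.START_ELEMENT) elements++;
        }
        r.close();
        return elements;
    }

    public static long run(List<Path> files, int workers) throws Exception {
        // Bounded queue: with ~2 KB files, 1024 slots is only ~2 MB in flight,
        // and it stops the reader from racing far ahead of the parsers.
        BlockingQueue<byte[]> queue = new ArrayBlockingQueue<>(1024);
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        LongAdder total = new LongAdder();

        for (int i = 0; i < workers; i++) {
            pool.submit(() -> {
                try {
                    for (byte[] xml = queue.take(); xml != POISON; xml = queue.take()) {
                        total.add(countElements(xml));
                    }
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            });
        }

        // Single reader thread (here: the caller) keeps disk access sequential.
        for (Path p : files) queue.put(Files.readAllBytes(p));
        for (int i = 0; i < workers; i++) queue.put(POISON);

        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return total.sum();
    }
}
```

Note that reading each file fully into a `byte[]` before parsing is what decouples the disk from the parsers; the StAX cursor itself never touches the disk.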

3 answers

In short, there is no universal answer to this question. Arm yourself with a profiler and look for the maximum throughput across various numbers of simultaneously open files.

The fact is that:

  • on the one hand, an HDD or SSD cannot be polled simultaneously from several threads, even if those threads belong to different processes;
  • on the other hand, the operating system typically does read-ahead caching into a buffer in an otherwise unused area of RAM; however, the success of such caching is almost impossible to predict because of the huge number of constantly changing factors involved.
  • 3
    I don't think you even need a profiler here. Loading 400,000 files is unlikely to take less than a few seconds, so any measurement tool will do. As for the number of threads, I expect the optimum to be either (number of cores + 1) or (number of cores / 2 + 1). - KoVadim

Separate the logic: load files in one thread and process the loaded files in another.

The disk will not sit idle while processing happens, as it would if you worked in a single thread, but you also will not try to read files from several threads at once, which rarely makes sense without low-level tuning.
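A minimal two-thread version of this idea might look as follows. The class name `TwoThreadLoader` and the end-of-input marker are illustrative choices, and the actual StAX parsing step is left as a comment; note also that this sketch assumes the reader thread does not fail mid-run (a production version would also propagate its errors to the consumer).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.stream.Stream;

public class TwoThreadLoader {

    // One dedicated reader thread keeps the disk busy;
    // the calling thread does the processing.
    public static int processAll(Path dir) throws Exception {
        BlockingQueue<byte[]> queue = new LinkedBlockingQueue<>(256);
        byte[] done = new byte[0]; // end-of-input marker (compared by identity)

        Thread reader = new Thread(() -> {
            try (Stream<Path> files = Files.walk(dir)) {
                for (Path p : (Iterable<Path>) files.filter(Files::isRegularFile)::iterator) {
                    queue.put(Files.readAllBytes(p));
                }
                queue.put(done);
            } catch (IOException | InterruptedException e) {
                throw new RuntimeException(e);
            }
        });
        reader.start();

        int processed = 0;
        for (byte[] xml = queue.take(); xml != done; xml = queue.take()) {
            // parse `xml` with StAX here; this sketch only counts files
            processed++;
        }
        reader.join();
        return processed;
    }
}
```

If one processing thread turns out to be the bottleneck, this degenerates gracefully into the producer-consumer variant from the comments: keep the single reader and hand the queue to a pool of parsers instead of the main thread.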

    Many threads are no good, and a single thread is no good either. You should implement a pipeline: (1) one pair of threads reads, that is, waits until the operating system delivers the data from the disk; (2) a second pair works with the data delivered by the first. (You also still have decoding; I think it should be attached to the first pair.)