Task: read files from a folder, perform some processing on each, and write the result for each file to a database, which we will assume lives on another machine. The performance of the algorithm needs to be improved by parallelizing the work across the processor cores. It is also worth noting that the processing step invokes a third-party application whose result is read through its return code, though that is not the only action performed to obtain the result.
My reasoning is as follows. The task breaks down into three operations:
- reading a file
- computing the result from what was read
- writing the result to the database
Only the second operation is worth parallelizing: the first is limited by the disk subsystem, the third by the network, and the second mainly by the CPU. Moreover, most of the time is spent on the second operation; writing to the database and reading a file are much faster.
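To check that assumption before settling on a design, the three stages can be timed separately for a single file. A minimal sketch, assuming Java (the question does not name a language); the file name and the `process`/`writeToDatabase` bodies are placeholders:

```java
import java.nio.file.*;

public class StageTiming {
    public static void main(String[] args) throws Exception {
        Path file = Paths.get("sample-input.dat"); // hypothetical sample file

        long t0 = System.nanoTime();
        byte[] data = Files.readAllBytes(file);    // stage 1: disk read
        long t1 = System.nanoTime();
        Object result = process(data);             // stage 2: CPU-heavy processing (placeholder)
        long t2 = System.nanoTime();
        writeToDatabase(result);                   // stage 3: network write (placeholder)
        long t3 = System.nanoTime();

        System.out.printf("read %d ms, process %d ms, write %d ms%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, (t3 - t2) / 1_000_000);
    }

    static Object process(byte[] data) { /* CPU-bound work, incl. the external tool */ return data.length; }
    static void writeToDatabase(Object result) { /* insert over the network */ }
}
```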
So I end up with two producer-consumer queues (a sketch of the whole layout follows the list):
- producer: a thread that reads files from the folder; consumers: the threads that compute results from what was read
- producers: the threads that compute results; consumer: a thread that writes the results to the database
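A minimal sketch of that layout, assuming Java (the question does not name a language). It uses two bounded ArrayBlockingQueues and poison-pill sentinels for shutdown; the folder name, `processFile`, and `writeToDatabase` are placeholders for the real work:

```java
import java.nio.file.*;
import java.util.concurrent.*;

public class Pipeline {
    // Poison pills used to signal "no more items" downstream.
    private static final Path END_OF_FILES = Paths.get("");
    private static final String END_OF_RESULTS = "\0end";

    public static void main(String[] args) throws Exception {
        BlockingQueue<Path> files = new ArrayBlockingQueue<>(100);      // queue 1: files to process
        BlockingQueue<String> results = new ArrayBlockingQueue<>(100);  // queue 2: computed results
        int workers = Runtime.getRuntime().availableProcessors();

        // Stage 1: single producer that lists the folder and feeds queue 1.
        Thread reader = new Thread(() -> {
            try (DirectoryStream<Path> dir = Files.newDirectoryStream(Paths.get("input-folder"))) {
                for (Path p : dir) files.put(p);
            } catch (Exception e) { e.printStackTrace(); }
            finally {
                for (int i = 0; i < workers; i++) {
                    try { files.put(END_OF_FILES); } catch (InterruptedException ignored) {}
                }
            }
        });

        // Stage 2: pool of workers, consumers of queue 1 and producers for queue 2.
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int i = 0; i < workers; i++) {
            pool.submit(() -> {
                try {
                    for (Path p; (p = files.take()) != END_OF_FILES; ) {
                        results.put(processFile(p)); // CPU-heavy part, incl. the external tool
                    }
                } catch (InterruptedException ignored) {}
                return null;
            });
        }

        // Stage 3: single consumer that writes results to the database.
        Thread writer = new Thread(() -> {
            try {
                for (String r; !(r = results.take()).equals(END_OF_RESULTS); ) {
                    writeToDatabase(r);
                }
            } catch (InterruptedException ignored) {}
        });

        reader.start();
        writer.start();
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.DAYS);
        results.put(END_OF_RESULTS); // all workers are done, release the writer
        writer.join();
    }

    static String processFile(Path p) { /* placeholder for the real processing */ return p.toString(); }
    static void writeToDatabase(String r) { /* placeholder for the database insert */ }
}
```

The bounded queues give backpressure: if processing falls behind, the reader blocks instead of loading the whole folder listing into memory, and likewise the workers block if the database writer cannot keep up.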
The first question concerns the number of consumer threads for the first queue. In a similar question, "How is it better to parallelize a task?", it was recommended to set the number of threads doing the main work equal to the number of processor cores. But shouldn't that number be reduced in order to "free up" a core for the threads handling the first and third operations?
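For reference, the two sizing options being weighed look like this in code (Java assumed; availableProcessors() reports logical cores):

```java
int cores = Runtime.getRuntime().availableProcessors();

// Option A: one worker thread per core; the reader and writer threads
// mostly block on I/O and rarely compete for CPU time.
int workersA = cores;

// Option B (what the question asks about): keep one core free for the
// reader and writer threads.
int workersB = Math.max(1, cores - 1);
```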
The second question is that the processing step involves calling a third-party application. Will that undermine the effectiveness of parallelizing the task?
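For context, the call to the third-party application inside a worker would look roughly like this (Java assumed; the command name is hypothetical, and only the return-code part of the result is shown):

```java
import java.nio.file.Path;

class ExternalTool {
    // Launch the third-party application on one input file; the command name is hypothetical.
    static int run(Path input) throws Exception {
        Process p = new ProcessBuilder("third-party-tool", input.toString())
                .redirectErrorStream(true)             // merge stderr into stdout so the pipe cannot fill
                .start();
        p.getInputStream().transferTo(System.out);     // drain output so the child never blocks on a full pipe
        return p.waitFor();                            // blocks the calling worker thread until the tool exits
    }
}
```

While a worker thread blocks in waitFor() it uses almost no CPU; the external tool runs as a separate OS process and is scheduled onto whichever core is free.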