Task: read the files from a folder, perform some actions on each, and write the result for each file to a database which, let's assume, lives on another machine. I need to improve the performance of the algorithm by parallelizing the work across the processor cores. Note also that while the algorithm runs, a third-party application is invoked and its result is read via return codes; but that is not the only action performed to obtain the result.


My thinking is as follows. The task splits into three operations:

  1. reading a file
  2. computing the result from what was read
  3. writing the result to the database

Only the second operation is worth parallelizing: the first is bound by the disk subsystem, the third by the network, while the second is mostly CPU-bound. Moreover, most of the time is spent on the second operation; writing to the database and reading a file are much faster.

So I'll end up with two producer-consumer queues:

  1. producer --- a thread that reads files from the folder. Consumers --- threads that compute results from what was read
  2. producers --- threads that compute results from what was read. Consumer --- a thread that writes the results to the database
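The two queues above can be sketched with `BlockingCollection<T>`, the standard .NET producer-consumer container. This is only a sketch under assumptions: the file names are generated instead of read from a real folder, `ProcessFile` is a hypothetical stand-in for the actual work, and the queue capacities are arbitrary.

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading.Tasks;

class Pipeline
{
    // Hypothetical stand-in for the real "compute result from the read" step.
    static string ProcessFile(string path) => path + " processed";

    public static void Main()
    {
        // Queue 1: reader thread -> processing threads.
        var filesQueue = new BlockingCollection<string>(boundedCapacity: 16);
        // Queue 2: processing threads -> DB-writer thread.
        var resultsQueue = new BlockingCollection<string>(boundedCapacity: 16);

        // Producer: reads file names (here simply generated) from the folder.
        var reader = Task.Run(() =>
        {
            foreach (var f in Enumerable.Range(0, 20).Select(i => $"{i}.txt"))
                filesQueue.Add(f);
            filesQueue.CompleteAdding();   // signal: no more files
        });

        // Consumers/producers: one CPU-bound worker per core.
        var workers = Enumerable.Range(0, Environment.ProcessorCount)
            .Select(_ => Task.Run(() =>
            {
                foreach (var f in filesQueue.GetConsumingEnumerable())
                    resultsQueue.Add(ProcessFile(f));
            }))
            .ToArray();

        // Once every worker finishes, close the results queue.
        Task.WhenAll(workers).ContinueWith(_ => resultsQueue.CompleteAdding());

        // Consumer: a single thread "writing to the database".
        foreach (var r in resultsQueue.GetConsumingEnumerable())
            Console.WriteLine($"Saving {r}");

        reader.Wait();
    }
}
```

The bounded capacities keep memory in check: if the database writer lags, the processing threads block instead of piling up results.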

The first question concerns the number of consumer threads for the first stage. In a similar question, How is it better to parallelize a task?, it was recommended to set the number of threads doing the main work equal to the number of processor cores. But shouldn't this number be reduced in order to "free up" one core for the threads of the first and third operations?

The second question: the operation that computes a result involves calling a third-party application. Will that undermine the effectiveness of parallelizing the task?

  • An effective variant of parallel processing, as I see it, is to organize a pipeline with the stages of that pipeline executing in parallel. Actually, writing to the database can be parallelized further as well. - rdorn
  • Reading from disk mostly reduces to waiting on I/O, so the first operation is basically waiting. The second operation depends entirely on the third-party application; maybe it is multithreaded itself? In general, threads are hardly needed here: you should run several instances of that application (one per core), not several threads/tasks. The third operation again comes down to waiting on a database query; if you use an asynchronous API, everything can be done in one thread. - Alexander Petrov
  • @4per If writing the result means executing a single query, then of course there is nothing to parallelize; but if there are several queries and they are independent of each other, they can run in parallel, which will speed things up. - rdorn
  • "the number of threads performing the main work equal to the number of processor cores" --- keep in mind that if the main work waits on some kind of response, i.e. most of the time is spent waiting, the number of threads can be increased without serious consequences. - Trymount
  • @4per You can figure it out empirically. For example, suppose you make a request to a web site and can expect the answer in 2-10 seconds. If the number of threads equals the number of cores, each thread sits idle for those 2-10 seconds. Increasing the number of threads reduces the idle time, but keep in mind that switching between threads itself costs resources. When tuning the thread count, open the task manager and watch the CPU load; if it is above 80-90%, it is probably time to stop. - Trymount

2 Answers

If your interest is academic, or you know exactly which hardware this will run on and have access to it, then it may be worth running benchmarks. But if this is a task for a real application, I see no point: today the application runs on a weak machine, tomorrow on a strong one, and the day after tomorrow new processors are purchased and everything changes. So I see no reason not to trust the standard scheduler instead of reinventing the wheel. Consider the following very simple pseudocode:

    void Main()
    {
        var files = Enumerable.Range(0, 100).Select(x =>
        {
            var fname = $"{x}.txt";
            Console.WriteLine($"Reading {fname}");
            return fname;
        });

        WriteToDb(files.AsParallel().Select(x => Process(x)));

        // If you want to start writing to the DB before all the files have been
        // processed. I would pick this variant.
        //WriteToDb(files.AsParallel().WithMergeOptions(ParallelMergeOptions.NotBuffered).Select(x => Process(x)));

        // If you really want to control the number of threads. I don't recommend
        // it; better worry about memory instead.
        //WriteToDb(files.AsParallel().WithDegreeOfParallelism(Environment.ProcessorCount).Select(x => Process(x)));
    }

    string Process(string fname)
    {
        Console.WriteLine($"processing {fname}");
        for (var i = 0; i < 1000000000; i++) { }
        return $"{fname} processed";
    }

    void WriteToDb(IEnumerable<string> results)
    {
        Console.WriteLine("Start writing to DB");
        foreach (var result in results)
        {
            Console.WriteLine($"Saving {result}");
        }
    }

Watch the CPU load; I am sure it will climb to 100% on all processors/cores. This is not only simpler than synchronizing everything manually, it should also (in theory) be more efficient than manually creating a heap of threads, since the standard scheduler can reuse threads already created in the pool.

UPD. Let me add an explanation of what is happening.

 WriteToDb(files.AsParallel().Select(x=>Process(x))); 

The process is as follows: we read several files in parallel, process them in parallel, and accumulate the results. When the results for all files are ready (or not all of them, PLINQ decides), we write to the database. The number of threads working in parallel is determined by the scheduler.

    // If you want to start writing to the DB before all the files have been
    // processed. I would pick this variant.
    //WriteToDb(files.AsParallel().WithMergeOptions(ParallelMergeOptions.NotBuffered).Select(x => Process(x)));

The process is as follows: we read several files in parallel, process them in parallel, and write the results to the database in parallel. The number of threads working in parallel is determined by the scheduler.

    // If you really want to control the number of threads. I don't recommend
    // it; better worry about memory instead.
    //WriteToDb(files.AsParallel().WithDegreeOfParallelism(Environment.ProcessorCount).Select(x => Process(x)));

The process is as follows: we read several files in parallel, process them in parallel, and accumulate the results. When the results for all files are ready (or not all of them, PLINQ decides), we write to the database. The number of parallel workers here is set by WithDegreeOfParallelism.

These options can also be combined.
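For example, the two commented-out variants can be applied to the same query: cap the worker threads at the core count and stream results to the database writer as soon as they appear. `Process` and `WriteToDb` here are simplified stand-ins for the versions in the answer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class Combined
{
    // Simplified stand-in for the CPU-bound processing step.
    static string Process(string fname) => $"{fname} processed";

    // Simplified stand-in for the database writer.
    static void WriteToDb(IEnumerable<string> results)
    {
        foreach (var r in results)
            Console.WriteLine($"Saving {r}");
    }

    public static void Main()
    {
        var files = Enumerable.Range(0, 100).Select(x => $"{x}.txt");

        // Both options combined: bounded degree of parallelism AND unbuffered
        // merge, so results reach WriteToDb without waiting for the whole batch.
        WriteToDb(files.AsParallel()
                       .WithDegreeOfParallelism(Environment.ProcessorCount)
                       .WithMergeOptions(ParallelMergeOptions.NotBuffered)
                       .Select(x => Process(x)));
    }
}
```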

  • Did I understand your code correctly? We read a batch, process it, read a batch, process it. Batch sizes depend on how many threads Parallel LINQ allocates. Every hundred files we stop to write to the database. - 4per
  • @4per "We read a batch, process it, read a batch, process it. Batch sizes depend on how many threads Parallel LINQ allocates" - yes. "Every hundred files we stop to write to the database" - no, that's wrong. I updated the answer. - tym32167

If you don't know where the program will run, under what conditions and on which hardware, and only know roughly what is going on (which stage can be parallelized), then you should run a benchmark to find the optimal number of threads.

There are several strategies for such a benchmark.

  1. Run a benchmark on the first run (or on every run), recording the optimal number of threads. You can save time by storing the optimal value for the known hardware characteristics (the amount of memory and the processor), recalculating the optimum only when the hardware changes.

  2. If exclusive use of the hardware is not intended, so the available computing resources are not constant, it is best to determine the optimal number of threads dynamically. An approximate algorithm: start one process and record its execution time. Start a second one and record the time again. Look at the difference: did it stay the same? Start one more process. Measure again. Start another. Measure again. Did the time grow? Stop and remove the extra process. After a while, try starting one additional process again.
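The dynamic strategy from point 2 can be sketched roughly as follows. Everything here is a placeholder under assumptions: DoUnitOfWork stands in for the real CPU-bound stage, and the 10% tolerance and the upper bound on workers are arbitrary; a real implementation would measure the throughput of the actual processing.

```csharp
using System;
using System.Diagnostics;
using System.Linq;
using System.Threading.Tasks;

class AdaptiveDegree
{
    // Hypothetical stand-in for one unit of the real CPU-bound work.
    static void DoUnitOfWork()
    {
        for (var i = 0; i < 10_000_000; i++) { }
    }

    // Measure how long one batch takes with the given number of parallel workers.
    static TimeSpan MeasureBatch(int workers)
    {
        var sw = Stopwatch.StartNew();
        Task.WaitAll(Enumerable.Range(0, workers)
                               .Select(_ => Task.Run(DoUnitOfWork))
                               .ToArray());
        return sw.Elapsed;
    }

    public static void Main()
    {
        var workers = 1;
        var best = MeasureBatch(workers);

        // Keep adding workers while the per-batch time does not grow noticeably
        // (the 10% tolerance is an arbitrary placeholder).
        while (workers < Environment.ProcessorCount * 2)
        {
            var next = MeasureBatch(workers + 1);
            if (next.TotalMilliseconds > best.TotalMilliseconds * 1.1)
                break;          // adding a worker made the batch slower: stop
            workers++;
            best = next;
        }

        Console.WriteLine($"Optimal worker count: {workers}");
    }
}
```

As the text notes, this probe would be re-run periodically, since other processes on a shared machine change the available resources over time.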

Obviously, these strategies can and should be combined. For example, if you know for sure that with 16 GB of memory it makes no sense to start more than 8 processes, then the dynamic algorithm should stop at 8 processes.

  • "if you don't know" - I do know that processing takes the most time; the speed of a processor core is the limiter, and this holds both for the processing in my code and in the invoked third-party application. Reading from disk and writing to the database take much less time, but you cannot keep all the files in memory. - 4per
  • Please add this to the question. It would also be good to remove from it everything that is not related to the essence of the matter. - sanmai
  • And what does not relate to the essence of the question? - 4per
  • Everything that is not specific. Questions should be clear. They should show approaches that turned out to be inappropriate. It should be clear from them that you have done all the preparatory work but have not found a solution. From your question it is currently unclear whether you have tried to solve the problem somehow, and how. What didn't work out? For now your thoughts are visible, but by themselves they are not an attempt to practically solve the problem. - sanmai
  • If I hadn't thought through anything specific, I would not have added it to the question. That does not mean I am not mistaken, but I have looked at my question again and still do not see anything non-specific in it. My question also covers "Share the result of your search and tell us what you found and why the answers you found did not suit you", and it includes a plan for my solution. "What did not work out?" - I do not know how to choose the correct number of threads for the main operation, and I do not know how the third-party application call affects this. - 4per