I decided to try a simple task: perform some operations on a large block of data, fully parallelized. This can be done in two ways:

1. Call a short procedure in parallel for each element. But then a lot of time is spent handing control to another core. Is that so? Well, it makes no sense to parallelize the operation 2 + 2.
2. Process blocks of data in parallel. It is logical to assume that when the parallel procedures are heavyweight, the benefit of parallelism will be much higher.

However, I got very strange results. For some reason, the effect of parallelism is actually negative. Here is the code for a simple program that demonstrates this clearly:

    using System;
    using System.Diagnostics;
    using System.Threading.Tasks;
    using System.Windows.Forms;

    namespace WindowsFormsApplication2
    {
        public partial class Form1 : Form
        {
            public Form1()
            {
                InitializeComponent();
            }

            double[] _dArray = new double[10000000];
            double[] _dArray2 = new double[10000000];
            int _iThreads = 8;
            int _iSizeBlock;

            private void button1_Click(object sender, EventArgs e)
            {
                _iSizeBlock = _dArray.Length / _iThreads; // block size

                // fill the array with random values
                Random r = new Random();
                for (int i = 0; i < _dArray.Length; i++)
                {
                    _dArray[i] = r.NextDouble();
                    _dArray2[i] = _dArray[i];
                }

                richTextBox1.Text = "1 iteration:\r\n";
                for (int i = 1; i <= 8; i++)
                {
                    ParallelOptions options = new ParallelOptions { MaxDegreeOfParallelism = i };
                    Stopwatch st1 = new Stopwatch();
                    st1.Start();
                    Parallel.For(0, _dArray.Length, options, parallelOne);
                    st1.Stop();
                    richTextBox1.Text += i.ToString() + " threads, time: " + st1.Elapsed.TotalSeconds.ToString() + "\r\n";
                }

                richTextBox1.Text += "Block iterations:\r\n";
                for (int i = 1; i <= 8; i++)
                {
                    ParallelOptions options = new ParallelOptions { MaxDegreeOfParallelism = i };
                    Stopwatch st1 = new Stopwatch();
                    st1.Start();
                    Parallel.For(0, i, options, ParallelBlock);
                    st1.Stop();
                    richTextBox1.Text += i.ToString() + " threads, time: " + st1.Elapsed.TotalSeconds.ToString() + "\r\n";
                }
            }

            private void ParallelBlock(int iIndex)
            {
                // iIndex is the block number
                int iStart = iIndex * _iSizeBlock;
                int iEnd = iStart + _iSizeBlock;
                for (int i = iStart; i < iEnd; i++)
                {
                    _dArray[i] = Someoperations(_dArray[i]);
                }
            }

            private void parallelOne(int iIndex)
            {
                _dArray[iIndex] = Someoperations(_dArray[iIndex]);
            }

            private double Someoperations(double dInput)
            {
                double Result = Math.Sin(dInput) * Math.Log(dInput + 10);
                Result = Math.Pow(Result, 10);
                Result += Math.Abs(Math.Cos(Result));
                Result += Math.Sqrt(Result);
                Result = Math.Pow(Result, 2);
                return Result;
            }
        }
    }

And here is the result.

    1 iteration:
    1 thread, time: 2.5947303
    2 threads, time: 1.5046816
    3 threads, time: 1.2435103
    4 threads, time: 1.1743574
    5 threads, time: 1.8177255
    6 threads, time: 1.8564871
    7 threads, time: 1.7038264
    8 threads, time: 1.7404472
    Block iterations:
    1 thread, time: 1.2824387
    2 threads, time: 1.2592897
    3 threads, time: 1.3303499
    4 threads, time: 1.3710368
    5 threads, time: 1.4195757
    6 threads, time: 1.4460356
    7 threads, time: 1.5213963
    8 threads, time: 1.6072681

As you can see, in the second case the result is very bad: parallelizing actually makes it slower. Why is that? Logically, the second way of parallelizing should be better than the first. I found this on the Internet:

    Thread.BeginThreadAffinity();
    foreach (ProcessThread pt in Process.GetCurrentProcess().Threads)
    {
        int utid = GetCurrentThreadId();
        if (utid == pt.Id)
        {
            pt.ProcessorAffinity = (IntPtr)(_iCPUThread); // set affinity for this thread
            AllIterations();
        }
    }
    Thread.EndThreadAffinity();

Maybe this approach could solve my problem more optimally and speed the code up? Can you tell? Otherwise I am at a dead end. Is it possible to parallelize the code I wrote above using the example I just quoted? The main difficulty lies in passing the index to the thread, since I process the array by index. Thank you.
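As for passing an index to a manually started thread: the usual trick is to copy the loop variable into a local before capturing it in the lambda, so each thread sees its own index. This is a minimal sketch with hypothetical names; the ProcessorAffinity part is deliberately omitted, since pinning managed threads is usually unnecessary.

```csharp
using System;
using System.Threading;

class IndexPassingDemo
{
    // Starts one thread per index; each thread receives its own index
    // through a captured local copy of the loop variable.
    public static int[] RunThreads(int threadCount)
    {
        var threads = new Thread[threadCount];
        var results = new int[threadCount];
        for (int i = 0; i < threadCount; i++)
        {
            int index = i; // capture a copy, not the loop variable itself
            threads[i] = new Thread(() => results[index] = index * index);
            threads[i].Start();
        }
        foreach (var t in threads) t.Join();
        return results;
    }

    static void Main()
    {
        Console.WriteLine(string.Join(",", RunThreads(4))); // 0,1,4,9
    }
}
```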

  • How many processor cores do you have? What about hyper-threading? - Pavel Mayorov
  • My processor has 12 logical cores, 6 of them physical: an i7-3930K. - Dmitry
  • Thread.BeginThreadAffinity does the wrong thing: it pins the logical thread to the system thread. This is not actually necessary at the moment, since in almost all CLR implementations a managed thread is implemented on top of a system thread. - VladD

2 answers

You are right that when the operations are light and fast, it is preferable to process them in blocks. But the calculations in the Someoperations method are actually heavy. If you replace the body of the method with return 0; , the benefit of block processing becomes apparent.
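The overhead difference is easy to see by counting delegate invocations instead of measuring time (a sketch with hypothetical names): the per-element form pays the delegate call once per element, while the block form pays it only once per block, with a plain for loop inside.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class OverheadDemo
{
    // One delegate invocation per array element.
    public static int PerElementCalls(int n)
    {
        int calls = 0;
        Parallel.For(0, n, i => Interlocked.Increment(ref calls));
        return calls;
    }

    // One delegate invocation per block; elements inside the block are
    // handled by a plain for loop with no per-element delegate cost.
    public static int PerBlockCalls(int n, int blocks)
    {
        int calls = 0;
        int size = n / blocks;
        Parallel.For(0, blocks, b =>
        {
            Interlocked.Increment(ref calls);
            for (int i = b * size; i < (b + 1) * size; i++) { /* element work */ }
        });
        return calls;
    }

    static void Main()
    {
        Console.WriteLine(PerElementCalls(100000)); // 100000
        Console.WriteLine(PerBlockCalls(100000, 8)); // 8
    }
}
```

When the per-element work is trivial, those 100000 extra delegate calls dominate the runtime, which is why block processing only shows its advantage for cheap bodies.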

And most importantly, you have a logical error in the code in the line:

 Parallel.For(0, i, options, ParallelBlock); 

It is not surprising that the processing time grows steadily: the first iteration processes only part of the array (one eighth, in your case), the second iteration processes two parts, and so on. The parameter i needs to be replaced with _iThreads .
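The error can be made concrete by counting processed elements instead of timing (a self-contained sketch with hypothetical names): with Parallel.For(0, i, ...) only i of the 8 blocks are touched, so each "i threads" measurement processes a different amount of data, while the fixed version always processes the whole array and limits concurrency through MaxDegreeOfParallelism alone.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class BlockCountDemo
{
    // Counts how many array elements actually get processed when we run
    // Parallel.For over `blocksRun` blocks out of `totalBlocks`.
    public static int ProcessedElements(int n, int totalBlocks, int blocksRun, int maxDop)
    {
        int sizeBlock = n / totalBlocks;
        int processed = 0;
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxDop };
        Parallel.For(0, blocksRun, options, block =>
        {
            int start = block * sizeBlock;
            for (int i = start; i < start + sizeBlock; i++)
                Interlocked.Increment(ref processed);
        });
        return processed;
    }

    static void Main()
    {
        const int N = 800000, Blocks = 8;
        // Buggy: Parallel.For(0, i, ...) with i = 2 touches only 2 of 8 blocks.
        Console.WriteLine(ProcessedElements(N, Blocks, blocksRun: 2, maxDop: 2)); // 200000
        // Fixed: Parallel.For(0, _iThreads, ...) touches all 8 blocks.
        Console.WriteLine(ProcessedElements(N, Blocks, blocksRun: Blocks, maxDop: 2)); // 800000
    }
}
```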

    As indicated in @Alexander Petrov's answer, you have a logical error in your code, so the results are not quite correct. In the corrected code, the processing time decreases as the number of threads grows.

    On my machine, the following results are obtained:

     1 iteration:
     1 thread, time: 1.7375645
     2 threads, time: 0.9127861
     3 threads, time: 0.6447709
     4 threads, time: 0.5280516
     5 threads, time: 0.5156717
     6 threads, time: 0.5069659
     7 threads, time: 0.4636803
     8 threads, time: 0.4298237
     unlimited threads, time: 0.4348061
     Block iterations:
     1 thread, time: 2.6115381
     2 threads, time: 1.3137321
     3 threads, time: 0.9390005
     4 threads, time: 0.6965802
     5 threads, time: 0.6166681
     6 threads, time: 0.5237621
     7 threads, time: 0.4599443
     8 threads, time: 0.4131483

    For your case (a lot of data, a simple short processing step per element), Microsoft advises using Partitioner.Create , which splits your loop into ranges and hands each range to your code for processing.

    For your case, it looks like this:

     var rangePartitioner = Partitioner.Create(0, _dArray.Length);
     Parallel.ForEach(rangePartitioner, (range, loopState) => ParallelBlock2(range.Item1, range.Item2));

     void ParallelBlock2(int iStart, int iEnd)
     {
         //Console.WriteLine($"[{iStart}, {iStart}+{iEnd-iStart})");
         for (int i = iStart; i < iEnd; i++)
         {
             _dArray[i] = Someoperations(_dArray[i]);
         }
     }

    This code runs in 0.470345 seconds, which is not bad at all. It can be accelerated further by hinting at the preferred range size:

     var rangePartitioner = Partitioner.Create(0, _dArray.Length, _dArray.Length / Environment.ProcessorCount); 

    In this case the Partitioner apparently does not spend time rebalancing the chunks and gives the best result; in my test it was 0.4100153 seconds.

    PLINQ also does a good job:

     ParallelEnumerable.Range(0, _dArray.Length).ForAll(ParallelOne); 

    runs on my machine in 0.4405042 seconds.

    Full test code:

     class Program
     {
         static void Main(string[] args) => new Program().Run();

         double[] _dArray = new double[10000000];
         int _iSizeBlock;

         void Run()
         {
             // warm-up
             ParallelBlock(0);
             ParallelBlock2(0, 1);
             ParallelOne(0);

             // fill the array with random values
             Random r = new Random();
             for (int i = 0; i < _dArray.Length; i++)
             {
                 _dArray[i] = r.NextDouble();
             }

             Stopwatch sw;

             Console.WriteLine("1 iteration:");
             for (int i = 1; i <= 8; i++)
             {
                 ParallelOptions options = new ParallelOptions { MaxDegreeOfParallelism = i };
                 sw = Stopwatch.StartNew();
                 Parallel.For(0, _dArray.Length, options, ParallelOne);
                 sw.Stop();
                 Console.WriteLine($"{i} threads, time: {sw.Elapsed.TotalSeconds}");
             }

             sw = Stopwatch.StartNew();
             Parallel.For(0, _dArray.Length, ParallelOne);
             sw.Stop();
             Console.WriteLine($"unlimited threads, time: {sw.Elapsed.TotalSeconds}");

             Console.WriteLine("Block iterations:");
             for (int i = 1; i <= 8; i++)
             {
                 ParallelOptions options = new ParallelOptions { MaxDegreeOfParallelism = i };
                 lock (this) _iSizeBlock = _dArray.Length / i;
                 sw = Stopwatch.StartNew();
                 Parallel.For(0, i, options, ParallelBlock);
                 sw.Stop();
                 Console.WriteLine($"{i} threads, time: {sw.Elapsed.TotalSeconds}");
             }

             Console.WriteLine("PLINQ:");
             sw = Stopwatch.StartNew();
             ParallelEnumerable.Range(0, _dArray.Length).ForAll(ParallelOne);
             sw.Stop();
             Console.WriteLine($"PLINQ, time: {sw.Elapsed.TotalSeconds}");

             sw = Stopwatch.StartNew();
             var rangePartitioner = Partitioner.Create(0, _dArray.Length);
             Parallel.ForEach(rangePartitioner, (range, loopState) => ParallelBlock2(range.Item1, range.Item2));
             sw.Stop();
             Console.WriteLine($"Simple partitioning, time: {sw.Elapsed.TotalSeconds}");

             sw = Stopwatch.StartNew();
             var rangePartitioner2 = Partitioner.Create(0, _dArray.Length, _dArray.Length / Environment.ProcessorCount);
             Parallel.ForEach(rangePartitioner2, (range, loopState) => ParallelBlock2(range.Item1, range.Item2));
             sw.Stop();
             Console.WriteLine($"Core count guided partitioning, time: {sw.Elapsed.TotalSeconds}");

             Console.ReadKey(intercept: true);
         }

         void ParallelBlock(int iIndex)
         {
             // iIndex is the block number
             int iSizeBlock;
             lock (this) iSizeBlock = _iSizeBlock;
             int iStart = iIndex * iSizeBlock;
             int iEnd = iStart + iSizeBlock;
             for (int i = iStart; i < iEnd; i++)
             {
                 _dArray[i] = Someoperations(_dArray[i]);
             }
         }

         void ParallelBlock2(int iStart, int iEnd)
         {
             //Console.WriteLine($"[{iStart}, {iStart}+{iEnd-iStart})");
             for (int i = iStart; i < iEnd; i++)
             {
                 _dArray[i] = Someoperations(_dArray[i]);
             }
         }

         void ParallelOne(int iIndex)
         {
             _dArray[iIndex] = Someoperations(_dArray[iIndex]);
         }

         double Someoperations(double dInput)
         {
             double Result = Math.Sin(dInput) * Math.Log(dInput + 10);
             Result = Math.Pow(Result, 10);
             Result += Math.Abs(Math.Cos(Result));
             Result += Math.Sqrt(Result);
             Result = Math.Pow(Result, 2);
             return Result;
         }
     }

    Its result (x64, Release, outside Visual Studio):

     1 iteration:
     1 thread, time: 1.7375645
     2 threads, time: 0.9127861
     3 threads, time: 0.6447709
     4 threads, time: 0.5280516
     5 threads, time: 0.5156717
     6 threads, time: 0.5069659
     7 threads, time: 0.4636803
     8 threads, time: 0.4298237
     unlimited threads, time: 0.4348061
     Block iterations:
     1 thread, time: 2.6115381
     2 threads, time: 1.3137321
     3 threads, time: 0.9390005
     4 threads, time: 0.6965802
     5 threads, time: 0.6166681
     6 threads, time: 0.5237621
     7 threads, time: 0.4599443
     8 threads, time: 0.4131483
     PLINQ:
     PLINQ, time: 0.4405042
     Simple partitioning, time: 0.470345
     Core count guided partitioning, time: 0.4100153