I decided to try a simple task: perform some operations on a large block of data, fully parallelized. This can be done in two ways:

1. Call a short procedure in parallel for each element. But then a lot of time is spent handing control to another core. Is that so? Well, it makes no sense to parallelize the operation 2 + 2.
2. Process blocks of data in parallel. It is logical to assume that when the parallel procedures are heavyweight, the benefit of parallelism will be much higher.

However, I got very strange results. For some reason, the effect of parallelism is actually negative. Here is the code for a simple program that demonstrates this clearly:

    using System;
    using System.Diagnostics;
    using System.Threading.Tasks;
    using System.Windows.Forms;

    namespace WindowsFormsApplication2
    {
        public partial class Form1 : Form
        {
            public Form1()
            {
                InitializeComponent();
            }

            double[] _dArray = new double[10000000];
            double[] _dArray2 = new double[10000000];
            int _iThreads = 8;
            int _iSizeBlock;

            private void button1_Click(object sender, EventArgs e)
            {
                _iSizeBlock = _dArray.Length / _iThreads; // block size

                // fill the array with random values
                Random r = new Random();
                for (int i = 0; i < _dArray.Length; i++)
                {
                    _dArray[i] = r.NextDouble();
                    _dArray2[i] = _dArray[i];
                }

                richTextBox1.Text = "1 iteration:\r\n";
                for (int i = 1; i <= 8; i++)
                {
                    ParallelOptions options = new ParallelOptions { MaxDegreeOfParallelism = i };
                    Stopwatch st1 = new Stopwatch();
                    st1.Start();
                    Parallel.For(0, _dArray.Length, options, parallelOne);
                    st1.Stop();
                    richTextBox1.Text += i.ToString() + " threads, time: " + st1.Elapsed.TotalSeconds.ToString() + "\r\n";
                }

                richTextBox1.Text += "Block iterations:\r\n";
                for (int i = 1; i <= 8; i++)
                {
                    ParallelOptions options = new ParallelOptions { MaxDegreeOfParallelism = i };
                    Stopwatch st1 = new Stopwatch();
                    st1.Start();
                    Parallel.For(0, i, options, ParallelBlock);
                    st1.Stop();
                    richTextBox1.Text += i.ToString() + " threads, time: " + st1.Elapsed.TotalSeconds.ToString() + "\r\n";
                }
            }

            private void ParallelBlock(int iIndex)
            {
                // iIndex is the block number
                int iStart = iIndex * _iSizeBlock;
                int iEnd = iStart + _iSizeBlock;
                for (int i = iStart; i < iEnd; i++)
                {
                    _dArray[i] = Someoperations(_dArray[i]);
                }
            }

            private void parallelOne(int iIndex)
            {
                _dArray[iIndex] = Someoperations(_dArray[iIndex]);
            }

            private double Someoperations(double dInput)
            {
                double Result = Math.Sin(dInput) * Math.Log(dInput + 10);
                Result = Math.Pow(Result, 10);
                Result += Math.Abs(Math.Cos(Result));
                Result += Math.Sqrt(Result);
                Result = Math.Pow(Result, 2);
                return Result;
            }
        }
    }

And here is the result.

    1 iteration:
    1 thread, time: 2.5947303
    2 threads, time: 1.5046816
    3 threads, time: 1.2435103
    4 threads, time: 1.1743574
    5 threads, time: 1.8177255
    6 threads, time: 1.8564871
    7 threads, time: 1.7038264
    8 threads, time: 1.7404472
    Block iterations:
    1 thread, time: 1.2824387
    2 threads, time: 1.2592897
    3 threads, time: 1.3303499
    4 threads, time: 1.3710368
    5 threads, time: 1.4195757
    6 threads, time: 1.4460356
    7 threads, time: 1.5213963
    8 threads, time: 1.6072681

As you can see, in the second case the result is very bad: parallelizing actually makes it slower. Why is that? Logically, the second way of parallelizing should be better than the first. I found this on the Internet:

    Thread.BeginThreadAffinity();
    foreach (ProcessThread pt in Process.GetCurrentProcess().Threads)
    {
        int utid = GetCurrentThreadId();
        if (utid == pt.Id)
        {
            pt.ProcessorAffinity = (IntPtr)(_iCPUThread); // set affinity for this thread
            AllIterations();
        }
    }
    Thread.EndThreadAffinity();

Maybe this approach could solve my problem more optimally and speed the code up? Can you tell? Otherwise I am at a dead end. Is it possible to parallelize the code I wrote above using the example I just quoted? The main difficulty lies in passing the index to the thread, since I process the array by index. Thank you.
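As for passing an index to a manually started thread: the usual trick is to copy the loop variable into a local before capturing it in the lambda, so each thread sees its own index. This is a minimal sketch with hypothetical names; the ProcessorAffinity part is deliberately omitted, since pinning managed threads is usually unnecessary.

```csharp
using System;
using System.Threading;

class IndexPassingDemo
{
    // Starts one thread per index; each thread receives its own index
    // through a captured local copy of the loop variable.
    public static int[] RunThreads(int threadCount)
    {
        var threads = new Thread[threadCount];
        var results = new int[threadCount];
        for (int i = 0; i < threadCount; i++)
        {
            int index = i; // capture a copy, not the loop variable itself
            threads[i] = new Thread(() => results[index] = index * index);
            threads[i].Start();
        }
        foreach (var t in threads) t.Join();
        return results;
    }

    static void Main()
    {
        Console.WriteLine(string.Join(",", RunThreads(4))); // 0,1,4,9
    }
}
```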

  • How many processor cores do you have? What about hyper-threading? - Pavel Mayorov
  • My processor has 12 logical cores, 6 of them physical: an i7-3930K. - Dmitry
  • Thread.BeginThreadAffinity does the wrong thing: it pins the logical thread to the system thread. This is not actually necessary at the moment, since in almost all CLR implementations a managed thread is implemented on top of a system thread. - VladD

2 answers

You are right that when the operations are light and fast, it is preferable to process them in blocks. But the calculations in the Someoperations method are actually heavy. If you replace the body of the method with return 0; , the benefit of block processing becomes apparent.
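The overhead difference is easy to see by counting delegate invocations instead of measuring time (a sketch with hypothetical names): the per-element form pays the delegate call once per element, while the block form pays it only once per block, with a plain for loop inside.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class OverheadDemo
{
    // One delegate invocation per array element.
    public static int PerElementCalls(int n)
    {
        int calls = 0;
        Parallel.For(0, n, i => Interlocked.Increment(ref calls));
        return calls;
    }

    // One delegate invocation per block; elements inside the block are
    // handled by a plain for loop with no per-element delegate cost.
    public static int PerBlockCalls(int n, int blocks)
    {
        int calls = 0;
        int size = n / blocks;
        Parallel.For(0, blocks, b =>
        {
            Interlocked.Increment(ref calls);
            for (int i = b * size; i < (b + 1) * size; i++) { /* element work */ }
        });
        return calls;
    }

    static void Main()
    {
        Console.WriteLine(PerElementCalls(100000)); // 100000
        Console.WriteLine(PerBlockCalls(100000, 8)); // 8
    }
}
```

When the per-element work is trivial, those 100000 extra delegate calls dominate the runtime, which is why block processing only shows its advantage for cheap bodies.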

And most importantly, you have a logical error in the code in the line:

 Parallel.For(0, i, options, ParallelBlock); 

It is not surprising that the processing time grows steadily: the first iteration processes only part of the array (one eighth, in your case), the second iteration processes two parts, and so on. The parameter i needs to be replaced with _iThreads .
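The error can be made concrete by counting processed elements instead of timing (a self-contained sketch with hypothetical names): with Parallel.For(0, i, ...) only i of the 8 blocks are touched, so each "i threads" measurement processes a different amount of data, while the fixed version always processes the whole array and limits concurrency through MaxDegreeOfParallelism alone.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class BlockCountDemo
{
    // Counts how many array elements actually get processed when we run
    // Parallel.For over `blocksRun` blocks out of `totalBlocks`.
    public static int ProcessedElements(int n, int totalBlocks, int blocksRun, int maxDop)
    {
        int sizeBlock = n / totalBlocks;
        int processed = 0;
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxDop };
        Parallel.For(0, blocksRun, options, block =>
        {
            int start = block * sizeBlock;
            for (int i = start; i < start + sizeBlock; i++)
                Interlocked.Increment(ref processed);
        });
        return processed;
    }

    static void Main()
    {
        const int N = 800000, Blocks = 8;
        // Buggy: Parallel.For(0, i, ...) with i = 2 touches only 2 of 8 blocks.
        Console.WriteLine(ProcessedElements(N, Blocks, blocksRun: 2, maxDop: 2)); // 200000
        // Fixed: Parallel.For(0, _iThreads, ...) touches all 8 blocks.
        Console.WriteLine(ProcessedElements(N, Blocks, blocksRun: Blocks, maxDop: 2)); // 800000
    }
}
```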

    As indicated in @Alexander Petrov's answer, you have a logical error in your code, so the results are not quite correct. In the corrected code, the processing time decreases as the number of threads grows.

    On my machine, the following results are obtained:

     1 iteration:
     1 thread, time: 1.7375645
     2 threads, time: 0.9127861
     3 threads, time: 0.6447709
     4 threads, time: 0.5280516
     5 threads, time: 0.5156717
     6 threads, time: 0.5069659
     7 threads, time: 0.4636803
     8 threads, time: 0.4298237
     unlimited threads, time: 0.4348061
     Block iterations:
     1 thread, time: 2.6115381
     2 threads, time: 1.3137321
     3 threads, time: 0.9390005
     4 threads, time: 0.6965802
     5 threads, time: 0.6166681
     6 threads, time: 0.5237621
     7 threads, time: 0.4599443
     8 threads, time: 0.4131483

    For your case (a lot of data, a simple short processing step per element), Microsoft advises using Partitioner.Create , which splits your loop into ranges and hands each range to your code for processing.

    For your case, it looks like this:

     var rangePartitioner = Partitioner.Create(0, _dArray.Length);
     Parallel.ForEach(rangePartitioner, (range, loopState) => ParallelBlock2(range.Item1, range.Item2));

     void ParallelBlock2(int iStart, int iEnd)
     {
         //Console.WriteLine($"[{iStart}, {iStart}+{iEnd-iStart})");
         for (int i = iStart; i < iEnd; i++)
         {
             _dArray[i] = Someoperations(_dArray[i]);
         }
     }

    This code runs in 0.470345 seconds, which is not bad at all. It can be accelerated further by hinting at the preferred range size:

     var rangePartitioner = Partitioner.Create(0, _dArray.Length, _dArray.Length / Environment.ProcessorCount); 

    In this case the Partitioner apparently does not spend time rebalancing the chunks and gives the best result; in my test it was 0.4100153 seconds.

    PLINQ also does a good job:

     ParallelEnumerable.Range(0, _dArray.Length).ForAll(ParallelOne); 

    runs on my machine in 0.4405042 seconds.

    Full test code:

     class Program
     {
         static void Main(string[] args) => new Program().Run();

         double[] _dArray = new double[10000000];
         int _iSizeBlock;

         void Run()
         {
             // warm-up
             ParallelBlock(0);
             ParallelBlock2(0, 1);
             ParallelOne(0);

             // fill the array with random values
             Random r = new Random();
             for (int i = 0; i < _dArray.Length; i++)
             {
                 _dArray[i] = r.NextDouble();
             }

             Stopwatch sw;

             Console.WriteLine("1 iteration:");
             for (int i = 1; i <= 8; i++)
             {
                 ParallelOptions options = new ParallelOptions { MaxDegreeOfParallelism = i };
                 sw = Stopwatch.StartNew();
                 Parallel.For(0, _dArray.Length, options, ParallelOne);
                 sw.Stop();
                 Console.WriteLine($"{i} threads, time: {sw.Elapsed.TotalSeconds}");
             }

             sw = Stopwatch.StartNew();
             Parallel.For(0, _dArray.Length, ParallelOne);
             sw.Stop();
             Console.WriteLine($"unlimited threads, time: {sw.Elapsed.TotalSeconds}");

             Console.WriteLine("Block iterations:");
             for (int i = 1; i <= 8; i++)
             {
                 ParallelOptions options = new ParallelOptions { MaxDegreeOfParallelism = i };
                 lock (this) _iSizeBlock = _dArray.Length / i;
                 sw = Stopwatch.StartNew();
                 Parallel.For(0, i, options, ParallelBlock);
                 sw.Stop();
                 Console.WriteLine($"{i} threads, time: {sw.Elapsed.TotalSeconds}");
             }

             Console.WriteLine("PLINQ:");
             sw = Stopwatch.StartNew();
             ParallelEnumerable.Range(0, _dArray.Length).ForAll(ParallelOne);
             sw.Stop();
             Console.WriteLine($"PLINQ, time: {sw.Elapsed.TotalSeconds}");

             sw = Stopwatch.StartNew();
             var rangePartitioner = Partitioner.Create(0, _dArray.Length);
             Parallel.ForEach(rangePartitioner, (range, loopState) => ParallelBlock2(range.Item1, range.Item2));
             sw.Stop();
             Console.WriteLine($"Simple partitioning, time: {sw.Elapsed.TotalSeconds}");

             sw = Stopwatch.StartNew();
             var rangePartitioner2 = Partitioner.Create(0, _dArray.Length, _dArray.Length / Environment.ProcessorCount);
             Parallel.ForEach(rangePartitioner2, (range, loopState) => ParallelBlock2(range.Item1, range.Item2));
             sw.Stop();
             Console.WriteLine($"Core count guided partitioning, time: {sw.Elapsed.TotalSeconds}");

             Console.ReadKey(intercept: true);
         }

         void ParallelBlock(int iIndex)
         {
             // iIndex is the block number
             int iSizeBlock;
             lock (this) iSizeBlock = _iSizeBlock;
             int iStart = iIndex * iSizeBlock;
             int iEnd = iStart + iSizeBlock;
             for (int i = iStart; i < iEnd; i++)
             {
                 _dArray[i] = Someoperations(_dArray[i]);
             }
         }

         void ParallelBlock2(int iStart, int iEnd)
         {
             //Console.WriteLine($"[{iStart}, {iStart}+{iEnd-iStart})");
             for (int i = iStart; i < iEnd; i++)
             {
                 _dArray[i] = Someoperations(_dArray[i]);
             }
         }

         void ParallelOne(int iIndex)
         {
             _dArray[iIndex] = Someoperations(_dArray[iIndex]);
         }

         double Someoperations(double dInput)
         {
             double Result = Math.Sin(dInput) * Math.Log(dInput + 10);
             Result = Math.Pow(Result, 10);
             Result += Math.Abs(Math.Cos(Result));
             Result += Math.Sqrt(Result);
             Result = Math.Pow(Result, 2);
             return Result;
         }
     }

    Its result (x64, Release, outside Visual Studio):

     1 iteration:
     1 thread, time: 1.7375645
     2 threads, time: 0.9127861
     3 threads, time: 0.6447709
     4 threads, time: 0.5280516
     5 threads, time: 0.5156717
     6 threads, time: 0.5069659
     7 threads, time: 0.4636803
     8 threads, time: 0.4298237
     unlimited threads, time: 0.4348061
     Block iterations:
     1 thread, time: 2.6115381
     2 threads, time: 1.3137321
     3 threads, time: 0.9390005
     4 threads, time: 0.6965802
     5 threads, time: 0.6166681
     6 threads, time: 0.5237621
     7 threads, time: 0.4599443
     8 threads, time: 0.4131483
     PLINQ:
     PLINQ, time: 0.4405042
     Simple partitioning, time: 0.470345
     Core count guided partitioning, time: 0.4100153