I need to perform calculations on a three-dimensional array; the most expensive parts are an FFT and a median search.

An attempt to introduce multi-threaded processing gave no speedup, even though the program does not even write out a result (which, presumably, could have slowed the threads down). Apparently a lot of CPU time is spent on the threading machinery itself.

Are there any critical errors in how the threads are organized in this program, or is it impossible to get any performance gain from threads for such a task at all?

    from queue import Queue
    from threading import Thread

    import numpy as np

    _size = 256
    # create an array of complex numbers of size _size x _size x _size
    arr = np.random.rand(_size, _size, _size) \
        + np.random.rand(_size, _size, _size) * 1j

    def single(arr):
        # function that runs in a single thread
        for fd in range(arr.shape[0]):
            for sd in range(arr.shape[1]):
                spec = np.fft.fftshift(abs(np.fft.fft(arr[fd, sd, :])))
                amax = spec.argmax()
                val = 20*np.log10(spec[amax]) - 20*np.log10(np.median(spec))

    # number of threads
    nwork = 4

    def multith(arr):
        # function that runs in nwork threads
        def selffun(arr):
            spec = np.fft.fftshift(abs(np.fft.fft(arr)))
            amax = spec.argmax()
            val = 20*np.log10(spec[amax]) - 20*np.log10(np.median(spec))

        def worker():
            while True:
                item = _queue.get()
                selffun(item)
                _queue.task_done()

        def source(arr):
            # task generator
            for fd in range(arr.shape[0]):
                for sd in range(arr.shape[1]):
                    yield arr[fd, sd, :]

        _queue = Queue()
        for i in range(nwork):
            th = Thread(target=worker)
            th.setDaemon(True)
            th.start()
        for item in source(arr):
            _queue.put(item)
        _queue.join()

Result:

    %timeit single(arr)
    1 loop, best of 3: 4.61 s per loop

    nwork = 4
    %timeit multith(arr)
    1 loop, best of 3: 7.45 s per loop

    nwork = 2
    %timeit multith(arr)
    1 loop, best of 3: 6.31 s per loop
  • look at the numexpr library. This can also be interesting. Can you show how your numpy is built, print(np.show_config())? If your numpy uses Intel's MKL, then, in my opinion, it should parallelize the calculations by itself (a minimal check snippet is shown after these comments). - MaxU
  • blas_mkl_info: NOT AVAILABLE. And numexpr has neither median nor a Fourier transform. - mkkik
  • if you are working on windows try these builds ? - MaxU
  • @MaxU using linux - mkkik
  • look at Anaconda ... here's another - MaxU
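
A minimal way to check which BLAS/LAPACK backend the installed numpy was built against, as suggested in the comment above; it only uses numpy's own np.show_config() and is not specific to this question:

    # print numpy's build information; an MKL-backed build lists library
    # details under blas_mkl_info instead of "NOT AVAILABLE"
    import numpy as np

    np.show_config()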

2 answers

Using threads will not speed up computational code, because Python has the GIL. To speed up the code, use Process from the multiprocessing module or ProcessPoolExecutor from the concurrent.futures module.
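
A sketch of what that replacement might look like for the per-row computation from the question; splitting the work by the first index and the names process_plane / multiproc are illustrative assumptions, not code from the question or the answer:

    # sketch: the per-row FFT/median computation farmed out to worker
    # processes; one task per 2-D plane keeps pickling overhead small
    from concurrent.futures import ProcessPoolExecutor

    import numpy as np

    _size = 256
    arr = np.random.rand(_size, _size, _size) \
        + np.random.rand(_size, _size, _size) * 1j

    def process_plane(plane):
        # plane is the 2-D slice arr[fd, :, :]; return one value per row
        out = np.empty(plane.shape[0])
        for sd in range(plane.shape[0]):
            spec = np.fft.fftshift(abs(np.fft.fft(plane[sd, :])))
            amax = spec.argmax()
            out[sd] = 20*np.log10(spec[amax]) - 20*np.log10(np.median(spec))
        return out

    def multiproc(arr, nwork=4):
        with ProcessPoolExecutor(max_workers=nwork) as ex:
            planes = (arr[fd] for fd in range(arr.shape[0]))
            return np.vstack(list(ex.map(process_plane, planes)))

    if __name__ == '__main__':
        multiproc(arr)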

  • Thanks. So threads in Python are needed only for tasks with input/output or web requests? - mkkik
  • No, you're using a queue. In the case of ProcessPoolExecutor it will return a list to you. If you need to write out a large amount of data, you create a process that reads the data from the queue and writes it to a file. - Avernial
  • @mkkik numpy can release the GIL, so some calculations (such as A=B+C) can be sped up by using multiple threads. Cython is also simple and convenient for this (the nogil construct). You can also avoid unnecessary copying by using a shared array ( multiprocessing.Array ), even when several processes are used (a small sketch of this idea follows these comments). Numpy can also use OpenMP (non-Python) threads, and you can set CPU affinity. If the goal is to speed up the code, ask about that directly (parallelization is not required to speed up code in the general case, regardless of the language used). - jfs
  • @jfs, yes, of course, the goal is to speed up the code. Can all of the above be read somewhere in a systematic form? - mkkik
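A minimal sketch of the shared-array idea from the comment above, using mp.RawArray (the lock-free variant of multiprocessing.Array); it assumes a fork-based start method (Linux), where child processes inherit the buffer, and the names shared / fill_row are illustrative:

    # sketch: a shared buffer that worker processes write into directly,
    # so nothing has to be pickled back to the parent
    import ctypes
    import multiprocessing as mp

    import numpy as np

    n = 8
    # unlocked shared memory for n*n doubles; with fork, children inherit it
    shared = mp.RawArray(ctypes.c_double, n * n)

    def fill_row(row):
        # each worker reinterprets the same buffer as an n x n array
        view = np.frombuffer(shared, dtype=np.float64).reshape(n, n)
        view[row, :] = row

    if __name__ == '__main__':
        procs = [mp.Process(target=fill_row, args=(r,)) for r in range(n)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        print(np.frombuffer(shared, dtype=np.float64).reshape(n, n))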

This is not an answer to the question; I just want to write down the results that were achieved thanks to the information in the comments to the question and to the answer. Perhaps it will be useful to someone.

  1. numpy with MKL

    The current versions of the Anaconda distribution include a numpy compiled with MKL support and the mkl-service package. However, the execution time of my function (median, FFT) without parallelization stayed the same as without MKL. I have seen articles describing how to build python + numpy + scipy specifically for MKL, but that requires the Intel compiler, which is paid.

  2. concurrent.futures and multiprocessing

    When the threads are replaced with ProcessPoolExecutor from concurrent.futures, without changing the rest of the program, the execution time with max_workers = 2 approaches the single-threaded result: 4.57 s. Increasing the number of processes made the execution time grow.

    With a similar replacement using multiprocessing.Pool:

    multiprocessing.Pool (2): 2.78 s

    multiprocessing.Pool (4): 1.74 s

    multiprocessing.Pool (8): 1.3 s

  3. multiprocessing.Process and shared multiprocessing.RawArray

    Based on this example.

     import ctypes, itertools
     import multiprocessing as mp

     import numpy as np

     _size = 256
     arr = np.random.rand(_size, _size, _size) \
         + np.random.rand(_size, _size, _size) * 1j

     def selffun(arr, sl, arrD):
         # view the shared buffer as a 2-D result array and fill one cell
         d = np.reshape(np.frombuffer(arrD), (_size, _size))
         spec = np.fft.fftshift(abs(np.fft.fft(arr[sl[0], sl[1], :])))
         amax = spec.argmax()
         d[sl[0], sl[1]] = 20*np.log10(spec[amax]) - 20*np.log10(np.median(spec))

     def worker(arr, q, arrD):
         while True:
             item = q.get()
             if item is None:
                 break
             selffun(arr, item, arrD)
             q.task_done()
         q.task_done()  # account for the None sentinel

     def main(arr):
         a, b = arr.shape[:-1]
         # shared output buffer: one double per (first, second) index pair
         arrD = mp.RawArray(ctypes.c_double, a*b)
         nCPU = mp.cpu_count()
         queue = mp.JoinableQueue()
         for item in itertools.product(range(a), range(b)):
             queue.put(item)
         for i in range(nCPU):
             queue.put(None)  # one stop sentinel per worker
         workers = []
         for i in range(nCPU):
             _worker = mp.Process(target=worker, args=(arr, queue, arrD))
             workers.append(_worker)
             _worker.start()
         queue.join()
         return np.reshape(np.frombuffer(arrD), (a, b))

     if __name__ == '__main__':
         main(arr)

Result: 1.83 s (nCPU = 8, Intel® Core™ i7-3770 CPU @ 3.40GHz × 8)

The best time was obtained with multiprocessing.Pool, but I have not yet figured out how to use Pool.map together with a shared array; a possible approach is sketched below.
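
One possible way to combine Pool.map with a shared array, sketched under two assumptions: the RawArray is handed to each worker through the Pool initializer, and a fork-based start method (Linux) lets the workers inherit the input array arr; the names _shared, init_worker and compute_cell are illustrative, not from the answer:

    import ctypes, itertools
    import multiprocessing as mp

    import numpy as np

    _size = 256
    arr = np.random.rand(_size, _size, _size) \
        + np.random.rand(_size, _size, _size) * 1j

    _shared = None  # set in every worker process by init_worker

    def init_worker(shared_buf):
        # runs once in each worker; stores the shared output buffer
        global _shared
        _shared = shared_buf

    def compute_cell(sl):
        # only the index pair sl is pickled; the result goes straight
        # into the shared buffer, arr is inherited via fork
        d = np.reshape(np.frombuffer(_shared), (_size, _size))
        spec = np.fft.fftshift(abs(np.fft.fft(arr[sl[0], sl[1], :])))
        amax = spec.argmax()
        d[sl[0], sl[1]] = 20*np.log10(spec[amax]) - 20*np.log10(np.median(spec))

    def main(arr):
        a, b = arr.shape[:-1]
        arrD = mp.RawArray(ctypes.c_double, a * b)
        with mp.Pool(processes=mp.cpu_count(),
                     initializer=init_worker, initargs=(arrD,)) as pool:
            pool.map(compute_cell, itertools.product(range(a), range(b)))
        return np.reshape(np.frombuffer(arrD), (a, b))

    if __name__ == '__main__':
        main(arr)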

  • great comparison! Surely it will come in handy for numpy users working with large data arrays ... - MaxU
  • In this case the data array is not large, and for other tasks this result may be useless. I wrote it up to make it clear where one could start trying. - mkkik