A question about OpenACC. For some reason this piece of code:

    #include <openacc.h>
    ...
    #define NSZ (1<<16)
    ...
    //#pragma acc kernels
    for (i=0; i<NSZ; i++)
        C[i]=A[i]+B[i];

With the ... kernels directive enabled, it runs 15..20% slower than without it.

It is compiled with gcc using the -fopenacc -msse2 options ...

Roughly the same code using OpenCL runs 1.5..2 times faster. The compiler version is 5.1.0, the video card is an NVIDIA GT 950. Am I doing something wrong?

I would also appreciate links to Russian-language documents on using GPUs, with examples (OpenACC, OpenCL and others ...)

    1 answer

    Your program tries to do something like the following.

    1. Transfer 2 arrays to the video card
    2. Quickly add them on the video card
    3. Transfer 1 array back from it
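The three steps above can be made explicit with OpenACC data clauses. This is only a sketch: the function name add_arrays_acc and the int element type are assumptions, since the question does not show how A, B and C are declared.

```c
#include <stddef.h>

/* Hypothetical helper: element type int is an assumption,
   the question does not show the declarations of A, B and C. */
void add_arrays_acc(const int *restrict A, const int *restrict B,
                    int *restrict C, size_t n)
{
    /* copyin  -> step 1: A and B are copied host -> device
       copyout -> step 3: C is copied device -> host afterwards */
    #pragma acc kernels copyin(A[0:n], B[0:n]) copyout(C[0:n])
    for (size_t i = 0; i < n; i++)
        C[i] = A[i] + B[i];   /* step 2: one addition per element */
}
```

For every addition, three elements cross the bus between host and device, so the copies are likely to dominate the run time. When built without -fopenacc, the pragma is ignored and the loop simply runs on the CPU.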

    As you can see, in steps 1 and 3 the arrays are still processed sequentially. As a result, the only saving comes from speeding up the addition itself, but on the CPU addition is already very fast (the only thing faster than addition is a shift by 1).

    It turns out that overhead costs may even exceed the benefits of using a video card.

    To get a real speedup, you need to do more complex calculations or transfer less data.
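One way to "transfer less data" is to keep the arrays on the device across several loops with an OpenACC data region, so each array crosses the bus only once. A minimal sketch under assumptions: the function sum_then_update, the int element type, and the extra update loop are invented here purely for illustration and are not part of the question.

```c
#include <stddef.h>

/* Hypothetical example: C = A + B, followed by `steps` more passes
   over C, all inside a single data region. */
void sum_then_update(const int *restrict A, const int *restrict B,
                     int *restrict C, size_t n, int steps)
{
    /* Arrays are transferred once for the whole region,
       not once per loop. */
    #pragma acc data copyin(A[0:n], B[0:n]) copyout(C[0:n])
    {
        #pragma acc kernels present(A[0:n], B[0:n], C[0:n])
        for (size_t i = 0; i < n; i++)
            C[i] = A[i] + B[i];

        for (int s = 0; s < steps; s++) {
            /* C is already on the device, so no copy happens here */
            #pragma acc kernels present(C[0:n])
            for (size_t i = 0; i < n; i++)
                C[i] = C[i] * 2 + 1;
        }
    }
}
```

The more work is done per transferred element, the better the chance that the GPU pays off. Whether it actually does on a particular card has to be measured.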

    And for such simple loops it is better to use vectorization (MMX, SSE, etc.). Perhaps you got the speedup with OpenCL precisely because of these techniques (unfortunately, I am not an expert here).

    For example, _mm_add_epi32 might suit you.
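A minimal sketch of such a loop with SSE2 intrinsics. It assumes the arrays hold 32-bit int values (the question does not say) and that the target is an x86 CPU with SSE2. The function name add_arrays_sse2 is hypothetical.

```c
#include <emmintrin.h>  /* SSE2 intrinsics, including _mm_add_epi32 */
#include <stddef.h>

/* Hypothetical helper: adds two int arrays four elements at a time. */
void add_arrays_sse2(const int *A, const int *B, int *C, size_t n)
{
    size_t i = 0;

    /* Main loop: each iteration adds four 32-bit integers at once.
       Unaligned loads/stores are used, so no alignment is required. */
    for (; i + 4 <= n; i += 4) {
        __m128i a = _mm_loadu_si128((const __m128i *)(A + i));
        __m128i b = _mm_loadu_si128((const __m128i *)(B + i));
        _mm_storeu_si128((__m128i *)(C + i), _mm_add_epi32(a, b));
    }

    /* Scalar tail for the remaining 0..3 elements */
    for (; i < n; i++)
        C[i] = A[i] + B[i];
}
```

With NSZ equal to 1<<16 the tail loop never runs, but it keeps the function correct for any length. Compilers can often vectorize such a simple loop on their own at higher optimization levels, so it is worth comparing against the plain loop before keeping the intrinsics.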