I am taking my first steps with OpenCL (I am using an Nvidia GPU, although that should not matter here), and I have run into a question.

I have a very simple program that works on a one-dimensional array. The kernel computes each value from only a few cells of this array: for example, the array may have 1000 elements while each kernel invocation uses only 9 of them. I pass the kernel a global input buffer that contains the entire array. The task is to go through every cell and compute the values for a new array of the same size.
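To make the access pattern concrete, here is a rough CPU sketch in plain C of what the kernel does for each cell. The 9-cell window (the cell itself plus 4 neighbours on each side), the wrap-around at the edges, and the sum are all placeholders assumed for illustration, and the names `compute_cell` and `step` are hypothetical; the real rule is different.

```c
#include <assert.h>

#define RADIUS 4   /* assumption: 9 cells = the cell itself plus 4 on each side */

/* Placeholder rule: sum of the 9-cell window around cell i. */
static int compute_cell(const int *input, int i, int n)
{
    int sum = 0;
    for (int d = -RADIUS; d <= RADIUS; ++d) {
        int j = ((i + d) % n + n) % n;   /* assumption: wrap around at the edges */
        sum += input[j];
    }
    return sum;
}

/* What the kernel does for every work-item i, written as a plain loop:
   each output cell is computed from a small window of the input array. */
static void step(const int *input, int *output, int n)
{
    assert(n > 0);
    for (int i = 0; i < n; ++i)
        output[i] = compute_cell(input, i, n);
}
```

Every call reads 9 input cells, and neighbouring cells read almost the same 9, which is why re-reading the global buffer each time looks wasteful.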

My question is: when launching the computation, how can I pass each work-group only the elements it needs (a part of the array), so that it works with __local memory instead of reading the __global buffer every time? Or, alternatively, how can I pass each work-item only the elements it needs through __private memory, and thereby potentially speed up the program?
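As a sketch of the idea being asked about, the following plain C code emulates on the CPU what one work-group would do: first copy its slice of the array (plus a border on each side) into a small tile, then compute only from that tile. This is not OpenCL code. The tile stands in for a __local array, the group size of 64, the 9-cell window sum, the wrap-around at the edges, and the names `run_group` and `step_tiled` are all assumptions made for illustration.

```c
#include <assert.h>

#define RADIUS 4        /* assumption: 9-cell window as a stand-in rule */
#define GROUP_SIZE 64   /* hypothetical work-group size */

/* CPU emulation of one work-group: stage a tile plus border, then compute from it. */
static void run_group(const int *input, int *output, int n, int group)
{
    int tile[GROUP_SIZE + 2 * RADIUS];   /* stands in for a __local array */
    int base = group * GROUP_SIZE;

    /* Phase 1: copy the needed slice out of "global" memory.
       In a real kernel the work-items would share this copy and then
       wait at a barrier before anyone reads the tile. */
    for (int t = 0; t < GROUP_SIZE + 2 * RADIUS; ++t) {
        int j = ((base + t - RADIUS) % n + n) % n;   /* wrap at the edges (assumption) */
        tile[t] = input[j];
    }

    /* Phase 2: every work-item reads only the tile, never the full array. */
    for (int lid = 0; lid < GROUP_SIZE && base + lid < n; ++lid) {
        int sum = 0;   /* placeholder rule: window sum */
        for (int d = -RADIUS; d <= RADIUS; ++d)
            sum += tile[lid + RADIUS + d];
        output[base + lid] = sum;
    }
}

/* Run all groups one after another; on a GPU they would run in parallel. */
static void step_tiled(const int *input, int *output, int n)
{
    assert(n > 0);
    int groups = (n + GROUP_SIZE - 1) / GROUP_SIZE;   /* round up */
    for (int g = 0; g < groups; ++g)
        run_group(input, output, n, g);
}
```

With these numbers each group reads 72 cells from the full array once instead of 64 x 9 reads, which is the saving I am hoping for; whether it actually helps on a given GPU is exactly what I do not know.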

Here is the part of the program that launches the kernel:

    // Get the work-group size
    size_t workgroup_size;
    err = clGetKernelWorkGroupInfo(kernel, device_id, CL_KERNEL_WORK_GROUP_SIZE,
                                   sizeof(size_t), &workgroup_size, NULL);
    if (err != CL_SUCCESS) {
        log_error("Unable to get kernel work-group size");
    }

    // Copy the array (massive) into the device input buffer
    err = clEnqueueWriteBuffer(queue, input, CL_TRUE, 0, sizeof(massive),
                               massive, 0, NULL, NULL);
    if (err != CL_SUCCESS) {
        log_error("Unable to enqueue buffer");
    }

    // Run the kernel on every element of the array
    err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &board_size,
                                 &workgroup_size, 0, NULL, NULL);
    if (err != CL_SUCCESS) {
        log_error("Unable to enqueue kernel");
    }

The kernel function declaration looks like this:

 __kernel void life(constant int* input, global int* output, const unsigned int massive_size) 

Please go easy on me, I am a novice.
