I have one three-dimensional block (or several, but that doesn't matter, since they are independent). I need to synchronize the threads in this block as follows: first all threads with coordinate z = 0 must execute, then those with z = 1, and so on. In other words, the threads must be synchronized layer by layer along the z axis. Several people have already suggested the following code as an example:
```cuda
__global__ void Kernel() {
    for (int z = 0; z < zmax; z++) {
        if (threadIdx.z == z) {
            // code
            __syncthreads();
        }
    }
}
```
For clarity, if the loop is unrolled, it becomes something like this:
```cuda
if (threadIdx.z == 0) {
    __syncthreads();
}
if (threadIdx.z == 1) {
    __syncthreads();
}
// and so on
```
In theory, such code guarantees that threads with the same z are synchronized with each other, but it does not guarantee the order of the layers. After all, all threads with z = 1 could reach their synchronization point first, in which case the layer with z = 1 would execute before the layer with z = 0.
Is my reasoning correct? If so, how can I get the behavior I want, or is it impossible?
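For comparison, here is a minimal sketch of a variant that moves `__syncthreads()` out of the branch, so every thread in the block reaches the same barrier on every loop iteration. This matters because the CUDA Programming Guide allows `__syncthreads()` in conditional code only when the condition evaluates identically across the whole thread block, and `threadIdx.z == z` does not. The block shape (4 x 4 x 8), the names `layeredKernel`, `layersRanInOrder`, `NX`/`NY`/`NZ`, and the "each layer reads the layer below it" workload are all hypothetical, chosen only to make the ordering observable. It covers one block only; separate blocks are still not ordered relative to each other.

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical block shape for illustration: NX x NY x NZ threads in one block.
constexpr int NX = 4, NY = 4, NZ = 8;

// Layer z reads the value written by layer z - 1, so the result is only
// correct if the layers really run in increasing z order.
__global__ void layeredKernel(int *out) {
    __shared__ int buf[NZ][NY][NX];
    const int x = threadIdx.x, y = threadIdx.y;

    for (int z = 0; z < NZ; ++z) {      // loop bound is uniform for the whole block
        if (threadIdx.z == z) {
            // Per-layer work: depends on the finished previous layer.
            buf[z][y][x] = (z == 0) ? 1 : buf[z - 1][y][x] + 1;
        }
        __syncthreads();                // executed by ALL threads, outside the branch
    }

    // After the loop every layer has been written; copy results out.
    out[(threadIdx.z * NY + y) * NX + x] = buf[threadIdx.z][y][x];
}

// Launches one block and checks that layer z ended up with the value z + 1.
bool layersRanInOrder() {
    const int n = NX * NY * NZ;
    int *d_out = nullptr;
    if (cudaMalloc(&d_out, n * sizeof(int)) != cudaSuccess) return false;

    layeredKernel<<<1, dim3(NX, NY, NZ)>>>(d_out);

    int h_out[NX * NY * NZ];
    // cudaMemcpy waits for the kernel and also reports launch/execution errors.
    cudaError_t err = cudaMemcpy(h_out, d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < n; ++i) {
        int z = i / (NX * NY);          // index layout: (z * NY + y) * NX + x
        if (h_out[i] != z + 1) return false;
    }
    return true;
}

int main() {
    bool ok = layersRanInOrder();
    std::printf("%s\n", ok ? "layers ran in order" : "ordering violated");
    return ok ? 0 : 1;
}
```

It can be built with something like `nvcc layers.cu -o layers`. Threads outside layer z simply skip the work on that iteration but still wait at the barrier, so layer z + 1 cannot start until every thread of layer z has passed it.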