I ported this piece of code to CUDA:

    if(_layersCount > 1) {
        for(int i = _layersCount - 2; i >= 0; i--) {
            for(int j = 0; j < _neuronsPerLayerCount[i]; j++) { // cuda kernel
                localGradients[indexByLayerAndNeuron(i, j)] = 0;
                for(int k = 0; k < _neuronsPerLayerCount[i+1]; k++) {
                    localGradients[indexByLayerAndNeuron(i, j)] +=
                        _neuronsInputsWeights[indexByLayerNeuronAndInput(i+1, k, j)] *
                        localGradients[indexByLayerAndNeuron(i+1, k)];
                }
                localGradients[indexByLayerAndNeuron(i, j)] *= derivatives[indexByLayerAndNeuron(i, j)];
            }
        }
    }
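
The indexing helpers are not shown, so here is a plausible sketch of what they compute, inferred from the flattened offsets used in the kernel below; the prefix-sum arrays _neuronsInPreviousLayers, _inputsInPreviousLayers and _inputsInCurrentLayer are assumptions on my part:

    // Assumed flat layout: neurons (and weights) of all layers are stored back to back,
    // with per-layer prefix sums giving each layer's starting offset.
    int indexByLayerAndNeuron(int layer, int neuron)
    {
        return _neuronsInPreviousLayers[layer] + neuron;
    }

    int indexByLayerNeuronAndInput(int layer, int neuron, int input)
    {
        // Every neuron of `layer` has _inputsInCurrentLayer[layer] incoming weights.
        return _inputsInPreviousLayers[layer]
               + neuron * _inputsInCurrentLayer[layer]
               + input;
    }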

Result:

    if(_layersCount > 1) {
        for(int i = _layersCount - 2; i >= 0; i--) {
            // calculateLocalGradientsForAnotherLayers
            blocksCount = floor((double) _neuronsPerLayerCount[i] / threads.x) + 1;
            blocks = dim3(blocksCount, 1);
            calculateLocalGradientsForAnotherLayers<<<blocks, threads>>>(
                deviceLocalGradients, _neuronsInputsWeights, deviceDerivatives,
                _neuronsPerLayerCount[i], _neuronsInPreviousLayers[i],
                _neuronsInPreviousLayers[i+1], _neuronsPerLayerCount[i+1],
                _inputsInPreviousLayers[i], _inputsInCurrentLayer[i]);
        }
    }
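
As an aside, the usual way to size the grid is an integer ceil-division; it avoids the double-precision floor and the extra block launched when the count is an exact multiple of threads.x (harmless here because of the idx < neuronsCount guard). A sketch using the same variable names:

    // Ceil-division grid sizing: enough threads to cover the layer, no float math.
    blocksCount = (_neuronsPerLayerCount[i] + threads.x - 1) / threads.x;
    blocks = dim3(blocksCount, 1);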

Kernel calculateLocalGradientsForAnotherLayers:

    __global__ void calculateLocalGradientsForAnotherLayers(
        double *localGradients, double *neuronsInputsWeights, double *derivatives,
        int neuronsCount, int neuronsInPreviousLayers,
        int neuronsInPreviousLayersWithCurrent, int neuronsInNextLayer,
        int inputsInPreviousLayers, int inputsInCurrentLayer)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;

        if(idx < neuronsCount) {
            int neuron = neuronsInPreviousLayers + idx;
            localGradients[neuron] = 0;

            // TODO: move this loop into its own kernel, then reduce localGradients.
            for(int k = 0; k < neuronsInNextLayer; k++) {
                localGradients[neuron] +=
                    neuronsInputsWeights[inputsInPreviousLayers + k*inputsInCurrentLayer + idx] *
                    localGradients[neuronsInPreviousLayersWithCurrent + k];
            }

            localGradients[neuron] *= derivatives[neuron];
        }
    }
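
Not part of the original code, but when hunting mismatches like the one described below, a generic launch check built from standard CUDA runtime calls helps rule out configuration or execution errors; call it right after the <<<...>>> launch:

    #include <cstdio>
    #include <cuda_runtime.h>

    // Generic post-launch check: catches launch-configuration errors immediately
    // and surfaces asynchronous kernel faults before the results are read back.
    inline void checkKernel(const char *name)
    {
        cudaError_t err = cudaGetLastError();
        if (err == cudaSuccess)
            err = cudaDeviceSynchronize();
        if (err != cudaSuccess)
            printf("%s failed: %s\n", name, cudaGetErrorString(err));
    }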

But I see differences in the results of the calculations, starting at the second decimal place. What could be the problem? All the other kernels work fine; only this one differs.

My card is a GF555M; it supports double precision.


    I found the problem. It turned out to be a trivial typo. Instead of the line:

        calculateLocalGradientsForAnotherLayers<<<blocks, threads>>>(
            deviceLocalGradients, _neuronsInputsWeights, deviceDerivatives,
            _neuronsPerLayerCount[i], _neuronsInPreviousLayers[i],
            _neuronsInPreviousLayers[i+1], _neuronsPerLayerCount[i+1],
            _inputsInPreviousLayers[i], _inputsInCurrentLayer[i]);

    I should have written:

        calculateLocalGradientsForAnotherLayers<<<blocks, threads>>>(
            deviceLocalGradients, _neuronsInputsWeights, deviceDerivatives,
            _neuronsPerLayerCount[i], _neuronsInPreviousLayers[i],
            _neuronsInPreviousLayers[i+1], _neuronsPerLayerCount[i+1],
            _inputsInPreviousLayers[i+1], _inputsInCurrentLayer[i+1]);
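
    A host-side check of the following kind makes such indexing typos easy to catch: run the CPU reference and the kernel on the same inputs, copy the GPU results back, and compare element-wise with a tolerance. The names cpuLocalGradients, gpuLocalGradients and totalNeurons are illustrative, not from the original code:

        #include <cmath>
        #include <cstdio>

        // Compare the CPU reference against the values copied back from the GPU.
        // Returns the number of elements whose relative error exceeds `tol`.
        int compareGradients(const double *cpuLocalGradients,
                             const double *gpuLocalGradients,
                             int totalNeurons, double tol = 1e-9)
        {
            int mismatches = 0;
            for (int n = 0; n < totalNeurons; n++) {
                double ref = cpuLocalGradients[n];
                double got = gpuLocalGradients[n];
                double err = std::fabs(got - ref) / (std::fabs(ref) > 1.0 ? std::fabs(ref) : 1.0);
                if (err > tol) {
                    if (mismatches < 10)  // print only the first few offenders
                        std::printf("neuron %d: cpu=%.15g gpu=%.15g\n", n, ref, got);
                    mismatches++;
                }
            }
            return mismatches;
        }
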
    • I got it! - nolka

    I haven't dug into the code in detail yet, but judging by all of your related questions, if the discrepancies really are in the hundredths, the issue may be that the floating-point parameters of the computation (the weights, say) frequently migrate between CPU and GPU memory. Try to keep them on the GPU side at all times.

    If that does not help, you can experiment with the rounding options:

    1. either only on the GPU, if rounding can be controlled there,
    2. or on the CPU side (in both versions of the library), i.e. "correct" the coefficients (reset the accumulated error) after each iteration of the computation, for the values that migrate in the GPU version (a sketch of the CPU-side option follows this list).
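
    A minimal sketch of the CPU-side experiment, assuming the standard <cfenv> facilities; runWithRoundingMode and the spot where the CPU computation is invoked are illustrative, not from the original library:

        #include <cfenv>
        #include <cstdio>

        // Run the CPU reference under a given FPU rounding mode, then restore the old one.
        // Try FE_TONEAREST, FE_UPWARD, FE_DOWNWARD, FE_TOWARDZERO and compare each run
        // against the GPU output.
        void runWithRoundingMode(int mode)
        {
            int saved = std::fegetround();
            if (std::fesetround(mode) != 0) {
                std::printf("rounding mode %d is not supported\n", mode);
                return;
            }
            // ... run the CPU version of the gradient computation here ...
            std::fesetround(saved);
        }
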
    • What do you mean, migrate? All the data the kernel works with is loaded into device memory at the very beginning and read back only at the very end. The only thing sent from the host at launch time is the integer constants. On Stack Overflow I was given a hint about a race condition in this kernel, but I simply don't see one. - Robotex
    • > All the data the kernel works with is loaded into device memory at the very beginning and read back only at the very end. Then the problem may well be rounding. > The only thing sent from the host at launch time is the integer constants. The calculations are done in double. When converting between double and int, the GPU and the FPU may disagree. You can experiment with the FPU rounding mode (there are only 3 options) and try to achieve identical results. - mega