There is a neural network with the following structure (see the attached diagram). I train the network sequentially, feeding a set of training data through it, namely:

1) Forward pass of the signal (inputs) through each weight matrix (yes, yes, the weighted sum, the activation function, everything that matters...)

2) Calculation of the output-layer error (comparison of the outputs with the targets)

3) Calculation of the hidden-layer errors (the derivative of the activation function, taking into account how the weight matrix propagates the error back, all as it should be)

4) Correction of the weights (taking the computed error into account)

During training these steps are performed N times (100 ... 1,000,000).
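For reference, here is a minimal sketch of what one such iteration might look like for a single sigmoid layer. The sizes IN and OUT, the learning rate lr and the function names are hypothetical, not taken from the question; step 3 is omitted because it only appears once there is more than one layer.

#include <cmath>
#include <cstddef>

// Hypothetical sizes and learning rate, just for illustration.
const std::size_t IN = 4, OUT = 2;
const float lr = 0.1f;

float sigmoid(float x) { return 1.0f / (1.0f + std::exp(-x)); }

// One training step for a single sigmoid layer (steps 1, 2 and 4 above).
void train_step(const float inputs[IN], const float targets[OUT],
                float weights[IN + 1][OUT])              // last row = bias
{
    float outputs[OUT], delta[OUT];

    // 1) forward pass: weighted sum plus bias, then activation
    for (std::size_t o = 0; o < OUT; ++o) {
        float sum = weights[IN][o];                      // bias
        for (std::size_t i = 0; i < IN; ++i)
            sum += inputs[i] * weights[i][o];
        outputs[o] = sigmoid(sum);
    }

    // 2) output-layer error, scaled by the sigmoid derivative
    for (std::size_t o = 0; o < OUT; ++o)
        delta[o] = (targets[o] - outputs[o]) * outputs[o] * (1.0f - outputs[o]);

    // 4) weight correction (one gradient-descent step)
    for (std::size_t o = 0; o < OUT; ++o) {
        for (std::size_t i = 0; i < IN; ++i)
            weights[i][o] += lr * inputs[i] * delta[o];
        weights[IN][o] += lr * delta[o];                 // bias update
    }
}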

The question: how can I organize these calculations in parallel? The training cycle for each set of input values works with the current values of the weight matrices. To update the weights for one set of inputs (x1 ... xn), you have to compute the outputs (Y1 ... Ym) with those weights, apply the correction, and only then feed in the next inputs (x...) together with the corresponding set of targets (y...), of course.

// The neural network is already written in C/C++; the plan is to use several threads to speed up the work. (Please don't throw CUDA links at me; I know it is similar to C and that everything works there, but I can't understand what exactly is computed in parallel and how those partial results affect each other to produce the resulting set of weight matrices.)

UPD: roughly how it works now (an abstract layer):

for (out = 0; out < outputsNeuerons; out++) {
    float sum = 0.0;
    // inputs  - what comes into the layer
    for (inp = 0; inp < inputsNeuerons; inp++) {
        sum += inputs[inp] * weightsMatrix[inp][out];
    }
    sum += weightsMatrix[inputsNeuerons][out];   // bias row
    // outputs - what comes out of the layer
    outputs[out] = sigmoid(sum);
}
  • An obvious option would be to parallelize the calculations within each layer: give each thread n / (number of threads) neurons. - VTT
  • That is, while simultaneously computing the forward pass for, say, 2 input vectors (say, 2 threads), we can simultaneously compute the error at the outputs and the error on each layer. But what about the correction? There is one set of matrices and already 2 sets of corrections. Is it correct to compute the corrections for different inputs (x ... xN) without applying the correction from the "previous calculation"? Will gradient descent still work when correcting the coefficients? - Gunik 2:44 pm
  • I have never mentioned the simultaneous processing of two input vectors. - VTT
  • As I wrote: distribute the calculations for the neurons of each layer among n threads. Calculations for each subsequent layer begin only after all calculations for the previous layer have finished. - VTT
  • I don't know how else to explain it... In the picture you have several circles belonging to the first layer, say 6 of them. The first thread processes circles 1, 2 and 3 of the first layer; the second thread processes circles 4, 5 and 6. Parallelism! - VTT (a sketch of this idea follows these comments)
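For illustration, one hedged way to implement what VTT describes is an OpenMP parallel for over the output neurons of a layer. The function signature and the assumption that weightsMatrix has inputsNeuerons + 1 rows with the biases in the last row are taken from the snippet in the question; everything else is an illustrative sketch, not a definitive implementation.

#include <cmath>

static float sigmoid(float x) { return 1.0f / (1.0f + std::exp(-x)); }

// Forward pass of one layer, with the output neurons shared among threads
// as VTT describes. Compile with -fopenmp; without it the pragma is
// simply ignored and the loop runs single-threaded.
void layer_forward(const float* inputs, int inputsNeuerons,
                   float* outputs, int outputsNeuerons,
                   float** weightsMatrix)
{
    // Each iteration writes only outputs[out], so the iterations are
    // independent; the implicit barrier at the end of the parallel for
    // means the next layer starts only after this one is fully computed.
    #pragma omp parallel for
    for (int out = 0; out < outputsNeuerons; out++) {
        float sum = 0.0f;
        for (int inp = 0; inp < inputsNeuerons; inp++)
            sum += inputs[inp] * weightsMatrix[inp][out];
        sum += weightsMatrix[inputsNeuerons][out];    // bias row
        outputs[out] = sigmoid(sum);
    }
}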

1 answer

(Image: block diagram of neurons being computed in parallel by different threads.)

We divide the input vector into 2 parts (nominally); each half is processed in its own thread, then the resulting vectors are summed and passed through the activation function.

// Discussion is in the comments (and yes, the sum of the two vectors in the picture is written incorrectly)

For one thread:

for (out = 0; out < outputsNeuerons; out++) {
    outputs[out] = 0.0;              // this thread's own partial vector
    // first half of the inputs
    for (inp = 0; inp < inputsNeuerons / 2; inp++) {
        outputs[out] += inputs[inp] * weightsMatrix[inp][out];
    }
    outputs[out] += weightsMatrix[inputsNeuerons][out];   // bias, added only here
}

For the other thread:

for (out = 0; out < outputsNeuerons; out++) {
    outputs[out] = 0.0;              // this thread's own partial vector
    // second half of the inputs
    for (inp = inputsNeuerons / 2; inp < inputsNeuerons; inp++) {
        outputs[out] += inputs[inp] * weightsMatrix[inp][out];
    }
    // note: the bias row weightsMatrix[inputsNeuerons][out] is added only
    // in the first thread, so it is not counted twice when the partial
    // vectors are summed
}

// if this is nonsense, at least let me know
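To make the combining step concrete, here is a minimal sketch of how the two loops above could be driven with std::thread. The use of std::vector, the function names and the addBias flag are my assumptions, not part of the answer; weightsMatrix is assumed to have inputsNeuerons + 1 rows, the last one holding the biases, and the bias is added in only one of the threads so it is not counted twice when the partial vectors are summed.

#include <cmath>
#include <functional>
#include <thread>
#include <vector>

static float sigmoid(float x) { return 1.0f / (1.0f + std::exp(-x)); }

// One thread accumulates the weighted sums over [inpBegin, inpEnd) of the
// inputs into its own partial vector. addBias should be true for exactly
// one of the threads so the bias row is counted only once.
static void partial_sums(const std::vector<float>& inputs,
                         const std::vector<std::vector<float>>& weightsMatrix,
                         int inpBegin, int inpEnd, bool addBias,
                         std::vector<float>& partial)
{
    const int inputsNeuerons  = static_cast<int>(inputs.size());
    const int outputsNeuerons = static_cast<int>(partial.size());
    for (int out = 0; out < outputsNeuerons; out++) {
        partial[out] = addBias ? weightsMatrix[inputsNeuerons][out] : 0.0f;
        for (int inp = inpBegin; inp < inpEnd; inp++)
            partial[out] += inputs[inp] * weightsMatrix[inp][out];
    }
}

// Launch the two halves in parallel, then sum the partial vectors and
// apply the activation, as the answer describes.
void layer_forward_2threads(const std::vector<float>& inputs,
                            const std::vector<std::vector<float>>& weightsMatrix,
                            std::vector<float>& outputs)
{
    const int inputsNeuerons  = static_cast<int>(inputs.size());
    const int outputsNeuerons = static_cast<int>(outputs.size());
    std::vector<float> part1(outputsNeuerons), part2(outputsNeuerons);

    std::thread t1(partial_sums, std::cref(inputs), std::cref(weightsMatrix),
                   0, inputsNeuerons / 2, true, std::ref(part1));
    std::thread t2(partial_sums, std::cref(inputs), std::cref(weightsMatrix),
                   inputsNeuerons / 2, inputsNeuerons, false, std::ref(part2));
    t1.join();
    t2.join();

    for (int out = 0; out < outputsNeuerons; out++)
        outputs[out] = sigmoid(part1[out] + part2[out]);
}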

  • "we sum the resulting vectors" - no. 1) Inside the current layer there is a certain number of perceptrons (neurons, roughly speaking), whose outputs are written into a vector that is passed to the next layer as its input. 2) The value of each element of that vector depends only on the previous layer and does not depend on its neighbours. 3) Thanks to this independence from the neighbours, we can split the vector into pieces and send each piece to its own thread to be computed. Since the threads work only with their own elements and don't touch their neighbours', we get a speed-up of roughly as many times as we have threads. - ߊߚߤߘ 7:03 pm
  • 4) That is, the layer's vector of output values comes out exactly the same as in the single-threaded version; its elements are just computed not one after another but (almost) simultaneously. - ߊߚߤߘ 7:05 pm
  • (answer to a fresh revision) Yes, that's right. - ߊߚߤߘ
  • @Arhad did I understand correctly that we cut the vector into N equal parts (N being the number of threads), and each part is computed in parallel against its own section of the matrices, which the other threads don't touch (as in the sketch below)? - Gunik 7:18 pm
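And a hedged sketch of the variant discussed in these comments: the layer's output vector is cut into N contiguous slices, one per thread, and each thread writes only its own slice, so the result is identical to the single-threaded version. The names, the std::vector layout and the chunking scheme are assumptions for illustration, not taken from the thread.

#include <algorithm>
#include <cmath>
#include <functional>
#include <thread>
#include <vector>

static float sigmoid(float x) { return 1.0f / (1.0f + std::exp(-x)); }

// Each thread computes the slice [outBegin, outEnd) of the output vector.
// It reads the shared inputs and its own columns of weightsMatrix and
// writes only its own output elements, so no synchronization is needed
// beyond joining the threads.
static void forward_slice(const std::vector<float>& inputs,
                          const std::vector<std::vector<float>>& weightsMatrix,
                          std::vector<float>& outputs, int outBegin, int outEnd)
{
    const int inputsNeuerons = static_cast<int>(inputs.size());
    for (int out = outBegin; out < outEnd; out++) {
        float sum = weightsMatrix[inputsNeuerons][out];          // bias row
        for (int inp = 0; inp < inputsNeuerons; inp++)
            sum += inputs[inp] * weightsMatrix[inp][out];
        outputs[out] = sigmoid(sum);
    }
}

void layer_forward_parallel(const std::vector<float>& inputs,
                            const std::vector<std::vector<float>>& weightsMatrix,
                            std::vector<float>& outputs, int numThreads)
{
    const int outputsNeuerons = static_cast<int>(outputs.size());
    const int chunk = (outputsNeuerons + numThreads - 1) / numThreads;
    std::vector<std::thread> threads;
    for (int t = 0; t < numThreads; t++) {
        const int begin = t * chunk;
        const int end   = std::min(begin + chunk, outputsNeuerons);
        if (begin >= end) break;
        threads.emplace_back(forward_slice, std::cref(inputs),
                             std::cref(weightsMatrix), std::ref(outputs),
                             begin, end);
    }
    for (auto& th : threads) th.join();   // barrier before the next layer starts
}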