I'll answer your questions one by one.
And what is shown in the picture?
Forget about this picture. It is the most incomprehensible explanation of how a convolutional neural network (CNN) works that I have ever seen. In general, I advise you not to rely on pictures; I personally only understood how these networks work from a video, which, by the way, is the best step-by-step explanation I know of (if you don't know English, turn on the subtitles).
We take a 28 * 28 layer, take only a 5 * 5 matrix from it, build a new 24 * 24 layer from that, take a 5 * 5 matrix again, and so on until the last layer with the answer?
Not exactly. We take NxN matrices called filters (usually no larger than 8 on a side, and not necessarily square) and process the image with each filter. The output is again an image, only slightly modified (that is the whole point of how filters work). We do this several times. How many times depends on how deep we want to go; in other words, each task has its own value.
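To make the sizes from your question concrete, here is a minimal pure-Python sketch of applying one filter to an image without padding. The function name `convolve2d_valid`, the checkerboard test image, and the averaging filter are my own illustrative choices, not something a library gives you in this form. Note that, like most CNN libraries, it does not flip the filter, so strictly speaking it computes a cross-correlation.

```python
def convolve2d_valid(image, kernel):
    """Slide the kernel over the image without padding ("valid" mode)."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    # Each output pixel needs a full kernel-sized window, so the result shrinks.
    oh, ow = ih - kh + 1, iw - kw + 1
    out = [[0.0] * ow for _ in range(oh)]
    for r in range(oh):
        for c in range(ow):
            # Weighted sum of the window whose top-left corner is (r, c).
            out[r][c] = sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(kh)
                for j in range(kw)
            )
    return out


# Hypothetical 28x28 input: a checkerboard of 0s and 1s.
image = [[float((r + c) % 2) for c in range(28)] for r in range(28)]
# Hypothetical 5x5 filter: a simple averaging (blur) filter.
kernel = [[1.0 / 25] * 5 for _ in range(5)]

feature_map = convolve2d_valid(image, kernel)
print(len(feature_map), len(feature_map[0]))  # 24 24
```

So a 5 * 5 filter over a 28 * 28 image does give 24 * 24 (28 - 5 + 1), but the result is a filtered image, not a new layer built from a single 5 * 5 patch. In a real network the filter values are not chosen by hand like here; they are learned during training.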
So how do I create it?
No need to reinvent the wheel. Use ready-made libraries. TensorFlow works very well with convolutional networks. Here you can learn how to use it very quickly.
In essence, the point of such a network is to create as many neural networks as there are layers, right?
No. In fact, there are two networks: a convolutional one and, at the very end, a feedforward (fully connected) one. The point is to "shrink" the photo by successively applying various filters. The deeper the layer, the more abstract the features its filters respond to.
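As a rough illustration of those two parts, here is a small sketch using the Keras API that ships with TensorFlow. The layer counts, filter counts (8 and 16), the pooling layers, the 64-unit dense layer, and the 10 output classes are all my assumptions for a 28 * 28 grayscale input; treat it as a starting point, not as the right architecture for your task.

```python
import tensorflow as tf


def build_cnn(num_classes=10):
    """Hypothetical small CNN: a convolutional part followed by a feedforward part."""
    return tf.keras.Sequential([
        tf.keras.Input(shape=(28, 28, 1)),  # 28x28 grayscale image
        # Convolutional part: filters are applied and the image shrinks step by step.
        tf.keras.layers.Conv2D(8, kernel_size=5, activation="relu"),   # 28x28 -> 24x24
        tf.keras.layers.MaxPooling2D(pool_size=2),                     # 24x24 -> 12x12
        tf.keras.layers.Conv2D(16, kernel_size=5, activation="relu"),  # 12x12 -> 8x8
        tf.keras.layers.MaxPooling2D(pool_size=2),                     # 8x8 -> 4x4
        # Feedforward part: flatten the feature maps and classify them.
        tf.keras.layers.Flatten(),                                     # 4 * 4 * 16 = 256 values
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])


model = build_cnn()
model.summary()
```

The pooling layers are not something I described above; they are a common extra way to shrink the feature maps between convolutions, and you can drop them if the filters alone reduce the image enough for your case.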
In any case, I advise you to watch the video I mentioned first; after that, what I wrote here will be much clearer.