I understand speech recognition using neural networks. I found quite a lot of information about the device of the networks themselves, but with examples of their work it is more complicated.
Suppose I have some kind of network with N inputs and M outputs. And there is a beep with some text. How to make it all work together?
We received, say, using the Fourier transform of the spectrum. Well, we can somehow cut the sound signal into some small pieces. Again, how do you choose their piece length? And what to do next? What to apply to the network input? And what does the output network usually give? Letters corresponding to sounds or something else? (Assume the task is to get the spoken sentence in text form)
I am interested in the algorithm itself, as it all happens from a person uttering a phrase before receiving this phrase in text form. The information I found usually covers either the work of the neural network or the Fourier transform. But how it all works together is not very clear.
