I'm trying to understand speech recognition with neural networks. I've found quite a lot of information about the structure of the networks themselves, but examples of how they actually work are harder to come by.

Suppose I have some network with N inputs and M outputs, and a sound recording of some spoken text. How do I make it all work together?

Say we have obtained the spectrum using the Fourier transform. We can also somehow cut the sound signal into small pieces, but how do you choose their length? What do you do next? What do you feed to the network's input? And what does the network usually produce at the output: letters corresponding to sounds, or something else? (Assume the task is to get the spoken sentence in text form.)

I am interested in the algorithm itself, i.e. everything that happens from the moment a person utters a phrase to the moment that phrase appears in text form. The information I found usually covers either the neural network or the Fourier transform, but how it all works together is not very clear.

    1 answer

    The topic is too broad for a short answer; your question could be the subject of serious research in its own right. Still, nothing prevents us from putting forward some hypotheses.

    A neural network must accept a feature vector of fixed size as input, so you need to somehow normalize the length of the input. In the question you mentioned that after the transformation you have the Fourier coefficients of the signal. But spoken texts vary greatly in length, and to get the same number of coefficients the duration of the signal would also have to be the same. Besides, finding a training corpus of whole texts would be a problem: such a corpus would have monstrous dimensions and an unacceptably long training time. Instead of whole texts you can take individual words, splitting the speech on the pauses between words. Following this logic, you can go down even lower and recognize individual syllables or sounds; a sketch of such splitting is shown below.
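
    To make the pause-based splitting concrete, here is a minimal sketch in Python with NumPy (the frame length, energy threshold, and minimum segment duration are assumptions of mine that would have to be tuned on real recordings):

        import numpy as np

        def split_by_pauses(signal, rate, frame_ms=20, threshold=0.05, min_segment_ms=60):
            """Cut a mono signal into chunks separated by low-energy pauses."""
            frame = int(rate * frame_ms / 1000)
            # Normalize the amplitude so the threshold does not depend on volume.
            signal = signal / (np.max(np.abs(signal)) + 1e-12)
            # Short-time RMS energy, one value per frame.
            n_frames = len(signal) // frame
            energy = np.array([np.sqrt(np.mean(signal[i * frame:(i + 1) * frame] ** 2))
                               for i in range(n_frames)])
            voiced = energy > threshold            # True where something is being said
            segments, start = [], None
            for i, v in enumerate(voiced):
                if v and start is None:
                    start = i * frame              # speech begins
                elif not v and start is not None:  # a pause begins, close the segment
                    if (i * frame - start) * 1000 / rate >= min_segment_ms:
                        segments.append(signal[start:i * frame])
                    start = None
            if start is not None:                  # the signal ended while voiced
                if (len(signal) - start) * 1000 / rate >= min_segment_ms:
                    segments.append(signal[start:])
            return segments

    On a clean recording this should return roughly one chunk per syllable; how well it works in practice depends on the noise floor and the chosen threshold.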

    Here is an example for the word EXACTLY:

    [waveform of the word, with a visible pause between the two syllables]

    The signal above has a normalized amplitude, and I also trimmed the pause at the beginning of the recording. As you can see, there is a very noticeable pause between the syllables, and the whole utterance can be cut into such segments one after another; in my opinion, it is easier to work with such small pieces of information than with whole words. These pieces can then be aligned in duration (by the number of samples), the coefficients calculated, and the result used for training or recognition. Training is a separate big question in itself, as is recognition, or more precisely what to do after recognition: for example, when a word is recognized incorrectly and is not in the dictionary.
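
    Here is how the alignment-plus-coefficients step could look, continuing the sketch above (the target length of 2048 samples and the 64 kept coefficients are arbitrary assumptions, not recommended values):

        import numpy as np

        def features(segment, n_samples=2048, n_coeffs=64):
            """Turn a variable-length segment into a fixed-size feature vector."""
            # Align duration: linearly resample the segment to n_samples points.
            x_old = np.linspace(0.0, 1.0, num=len(segment))
            x_new = np.linspace(0.0, 1.0, num=n_samples)
            resampled = np.interp(x_new, x_old, segment)
            # Magnitude spectrum via the Fourier transform; keep the first n_coeffs.
            spectrum = np.abs(np.fft.rfft(resampled))[:n_coeffs]
            # Normalize so loud and quiet recordings give comparable vectors.
            return spectrum / (np.max(spectrum) + 1e-12)

    Every segment now yields exactly n_coeffs numbers, so the vectors can be stacked into a training matrix (X = np.vstack([features(s) for s in segments])) and fed to a network with n_coeffs inputs and one output per syllable or sound class in the dictionary.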