I am looking for a library for automatic speech recognition (ASR) of Russian in audio recordings of up to 30 minutes. It has to work offline (i.e., without calling any API services).
What I found and the problems I ran into:
- Kaldi, more specifically its Python wrapper, pykaldi. Honestly, I could not figure it out. As I understand it, it implements the mathematical tools for audio processing, but I know those only superficially, so I would be glad of a good usage guide.
- PocketSphinx. Here the problem was recognition quality: it was terrible. This raised some questions. Does this library work with long audio at all? (I have seen countless examples that work with a limited set of commands, for example for a smart home.) The tutorial describes "adaptation" of the acoustic model; would that improve recognition quality?
So, are there any other options? I don't rule out that I missed something obvious.
P.S. I have an extensive data set of audio recordings paired with their transcripts, which could perhaps be used to improve accuracy (for example, for the Russian model in pocketsphinx).
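
To compare candidate libraries on such a data set with something more objective than "it sounds bad", one common measure is word error rate (WER): the word-level edit distance between the reference transcript and the recognizer output, divided by the number of reference words. Below is a minimal sketch in plain Python. The function name `word_error_rate` is my own, not taken from any of the libraries above, and it assumes simple lowercasing and whitespace tokenization (no punctuation stripping or other normalization).

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: lowercasing + whitespace splitting is good enough
    as normalization; real transcripts may need more cleanup.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # prev[j] = edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing (deletion)
                curr[j - 1] + 1,     # extra hypothesis word (insertion)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

Averaging this over a held-out part of the data set, once per recognizer and again after any adaptation, would show whether the adaptation actually helps. Note that the value can exceed 1.0 when the hypothesis contains many extra words.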