I am looking for a library for Russian speech recognition (ASR) on audio recordings of up to 30 minutes. It has to work offline (i.e., without calling any API services).

What I found, and the problems I ran into:

  1. Kaldi, more specifically its Python wrapper, pykaldi. Honestly, I could not figure it out. As far as I understand, it implements the mathematical tools for audio processing, but I know those only superficially, so I would be glad to get a good usage guide.
  2. PocketSphinx. Here the problem was recognition quality: it was terrible. This raised some questions. Does the library work with long audio at all? (I have seen countless examples that use a limited set of commands, for example for a smart home.) The tutorial describes "Adaptation" of the acoustic model; will that affect recognition quality?

So, are there any other options? I can't rule out that I missed something obvious.

P.S. I have an extensive dataset of audio recordings paired with their transcripts, which could perhaps be used to improve accuracy (for example, for the Russian model in pocketsphinx).

    1 answer

    A Russian model for Kaldi can be downloaded here.

    To decode a long file, first split it into speech segments with pywebrtcvad, then feed each segment to Kaldi via os.system.
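The splitting step might look roughly like the sketch below. It is only an illustration under several assumptions: the input is 16-bit mono PCM at a sample rate webrtcvad accepts (8, 16, 32 or 48 kHz), the helper names (`frames`, `split_on_silence`, `write_wav`) and the `max_gap_frames` threshold are made up for this example, and `decode.sh` in the usage comment is a hypothetical wrapper script around whatever Kaldi decoding command matches your model. The voice-activity check is passed in as a callable so that `webrtcvad.Vad.is_speech` can be plugged in.

```python
import wave

FRAME_MS = 30  # webrtcvad accepts frames of 10, 20 or 30 ms


def frames(pcm, sample_rate, frame_ms=FRAME_MS):
    """Yield fixed-size frames of 16-bit mono PCM; a short tail is dropped."""
    size = int(sample_rate * frame_ms / 1000) * 2  # 2 bytes per sample
    for start in range(0, len(pcm) - size + 1, size):
        yield pcm[start:start + size]


def split_on_silence(pcm, sample_rate, is_speech, max_gap_frames=10):
    """Group speech frames into segments.

    A segment is closed once more than max_gap_frames consecutive
    non-speech frames follow it; shorter pauses stay inside the segment.
    Trailing silence is trimmed from every segment.
    """
    segments, current, gap = [], [], 0
    for frame in frames(pcm, sample_rate):
        if is_speech(frame, sample_rate):
            current.append(frame)
            gap = 0
        elif current:
            current.append(frame)
            gap += 1
            if gap > max_gap_frames:
                segments.append(b"".join(current[:-gap]))
                current, gap = [], 0
    if current:
        segments.append(b"".join(current[:len(current) - gap]))
    return segments


def write_wav(path, pcm, sample_rate):
    """Write 16-bit mono PCM bytes to a WAV file."""
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(sample_rate)
        wf.writeframes(pcm)


# Usage sketch (requires `pip install webrtcvad`; ./decode.sh is hypothetical):
#
#   import os, webrtcvad
#   vad = webrtcvad.Vad(2)                      # aggressiveness 0..3
#   with wave.open("long.wav", "rb") as wf:     # 16 kHz, mono, 16-bit
#       rate, pcm = wf.getframerate(), wf.readframes(wf.getnframes())
#   for i, seg in enumerate(split_on_silence(pcm, rate, vad.is_speech)):
#       name = f"seg_{i:04d}.wav"
#       write_wav(name, seg, rate)
#       os.system(f"./decode.sh {name}")
```

Keeping short pauses inside a segment avoids cutting words apart; the threshold of 10 frames (about 0.3 s at 30 ms per frame) is an arbitrary starting point and would need tuning on real recordings.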

    You don't need pykaldi; its interface is too complicated. You can try py-kaldi-simple instead.

    • Thanks for the tip. One clarification: the py-kaldi-asr description says it uses "kaldi's online nnet3-chain decoders". Does that mean the library calls the Kaldi API internally? - Revynel
    • Yes, you can use Kaldi's programming interface. - Nikolay Shmyrev