Abstract [eng] |
The aim of this work is to investigate the capabilities and possible applications of the Kaldi and TensorFlow packages in automatic speech recognition systems. The VARDAI_18_3 corpus of Lithuanian name recordings was selected for the research: voice recordings of 21 speakers, each of whom uttered 22 Lithuanian names and 4 nouns 20 times. The total number of audio recordings is 10920 (21 speakers × 26 words × 20 repetitions). The recordings of 18 speakers are used for training and those of the remaining 3 speakers for testing. This was followed by a cross-check using noisy voice recordings with a signal-to-noise ratio of 5 dB. Finally, the work compares the speech recognition results obtained for the same names with the Kaldi and TensorFlow software packages. The Kaldi recognition methods monophone, triphone, LDA + MLLT, SAT, SGMM2, TDNN-pnorm, and TDNN-tanh are analyzed to determine which one yields the lowest word recognition error. Preparation for the research begins with creating the Kaldi data description files spk2gender.txt, utt2spk.txt, text.txt, wav.scp, and corpus.txt, which describe the gender of each speaker, the mapping of recordings to speakers, the textual transcriptions, and the actual audio recordings. The task files are also described: cmd.sh, path.sh, and run.sh, the main program startup file. Six program files are prepared for the TensorFlow package: generate_test_vrd.py, generate_train_eval_new.py, deep_speech.py, deep_speech_model.py, decoder.py, and get_predictions.py. These files are used for data preparation, language modeling, training, decoding, and obtaining recognition results. Experiments with the name recordings and the Kaldi package showed that, without any additional noise, the name recognition error is very similar across the different methods. For this reason, white noise was added to the name recordings at a signal-to-noise ratio of 5 dB. The best test result was obtained with the TDNN-tanh method, with a recognition accuracy of 98.53%. 
The experiments with the TensorFlow package are based on the DeepSpeech2 model and a neural network recognition method. The best recognition accuracy is 87.24%. At the end of the work, the recognition results of Kaldi and TensorFlow are compared. |
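
The noise-injection step mentioned in the abstract can be illustrated with a minimal sketch. It is not the code used in the work: the function names `add_white_noise` and `measured_snr_db`, the fixed random seed, and the synthetic 440 Hz test tone are assumptions made for illustration only. The sketch assumes Gaussian white noise and the usual definition SNR_dB = 10·log10(P_signal / P_noise).

```python
import math
import random


def add_white_noise(signal, snr_db, seed=0):
    """Return signal plus Gaussian white noise scaled to the requested SNR (in dB)."""
    rng = random.Random(seed)
    # Average power of the clean signal.
    signal_power = sum(s * s for s in signal) / len(signal)
    # SNR_dB = 10 * log10(P_signal / P_noise)  =>  P_noise = P_signal / 10^(SNR_dB / 10)
    target_noise_power = signal_power / (10 ** (snr_db / 10))
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    noise_power = sum(n * n for n in noise) / len(noise)
    # Rescale the drawn noise so its measured power matches the target.
    scale = math.sqrt(target_noise_power / noise_power)
    return [s + scale * n for s, n in zip(signal, noise)]


def measured_snr_db(clean, noisy):
    """SNR in dB between a clean signal and its noisy version."""
    signal_power = sum(s * s for s in clean) / len(clean)
    noise_power = sum((y - s) ** 2 for s, y in zip(clean, noisy)) / len(clean)
    return 10 * math.log10(signal_power / noise_power)


# Hypothetical stand-in for a recording: a 440 Hz tone sampled at 16 kHz for 0.1 s.
clean = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(1600)]
noisy = add_white_noise(clean, snr_db=5.0)
print(round(measured_snr_db(clean, noisy), 2))
```

Rescaling the drawn noise to the target power, rather than relying on its nominal variance, keeps the resulting SNR at the requested value even for short recordings.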