Abstract [eng] |
The aim of this project is to assess the capabilities of the Kaldi toolkit for automatic speech recognition. The first speech corpus examined was Medical Terms, recorded by 30 speakers. After this corpus was tested with all available methods, recognition accuracy reached about 99.99%, so it was decided to add 5 dB white noise to the Medical Terms recordings. With the noisy data the results remained close to 100%, so it was decided to move to a larger, open-source speech corpus, LIEPA. In this work, methods for recognizing a speaker's speech are selected and evaluated. The automatic speech recognition process involves several steps, including processing the speaker's voice, extracting features, and speaker training and verification. The main goal of this master's thesis is to investigate the functionality of the Kaldi toolkit using different speech recognition methods. Object and methods of the work: the Kaldi toolkit is trained for automatic speech recognition on the general LIEPA corpus, which contains recordings of 350 Lithuanian speakers. The dictionary consists of 11346 words, and the database consists of 310000 words. A computer running the Ubuntu operating system is used. The experiments are performed using the monophone, triphone, LDA + MLLT, LDA + MLLT + SAT, SGMM2 and TDNN pnorm methods. The Kaldi data preparation files spk2gender.txt, wav.scp, text.txt, utt2spk.txt and corpus.txt, which are required for preparing and testing the framework, were provided on request, together with the initial run.sh script describing the functionality used. For the analysis, the LIEPA corpus was divided into three parts: LIEPA_ZOD, LIEPA_SEK and LIEPA_SAK.
Across each of these parts, whether sequences, words, or sentences are examined, the recognition results measured by word error rate are more accurate than those measured by sentence error rate. |
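The relationship between the two metrics can be illustrated with a minimal sketch. It assumes the standard definitions (word error rate as word-level edit distance divided by the number of reference words, sentence error rate as the share of utterances with at least one error). The function names and the sample sentences are hypothetical and are not taken from the thesis or from Kaldi's own scoring tools.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def wer_and_ser(pairs):
    """Return (WER, SER) for a non-empty list of (reference, hypothesis) strings."""
    if not pairs:
        raise ValueError("at least one utterance is required")
    word_errors = ref_words = wrong_sentences = 0
    for ref, hyp in pairs:
        ref_tokens, hyp_tokens = ref.split(), hyp.split()
        distance = edit_distance(ref_tokens, hyp_tokens)
        word_errors += distance
        ref_words += len(ref_tokens)
        wrong_sentences += distance > 0  # one wrong word fails the whole sentence
    return word_errors / ref_words, wrong_sentences / len(pairs)


# Illustrative data only: two utterances, one misrecognized word in the second.
pairs = [
    ("labas rytas visiems", "labas rytas visiems"),
    ("geros dienos jums", "geros dienas jums"),
]
wer, ser = wer_and_ser(pairs)
print(f"WER = {wer:.3f}, SER = {ser:.3f}")
```

In this made-up example a single substituted word out of six gives a WER of about 0.167, while the SER is 0.5 because one of the two sentences contains an error. This is why the sentence error rate is typically the stricter of the two measures.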