Abstract [eng] |
The amount of text data on the Internet is continuously increasing. However, some online users make mistakes when writing. In the case of the Lithuanian language, the main reason text is grammatically incorrect is the omission of diacritics. Noisy text can make it difficult to achieve satisfactory results in most NLP tasks. This thesis investigates deep learning based diacritics restoration methods for the Lithuanian language. Two models are created: a Sequence to Sequence (Seq2Seq) model (character restoration accuracy of 98.12%) and a transformer-based ByT5 model (character restoration accuracy of 99.65%). These trained models are used to restore diacritics in customer review text data. Sentiment classification is then performed on the clean text and on the text with diacritics restored by these deep learning models. Different dimensionalities for text vectorization are also tested. Logistic regression, random forest, and fine-tuned ByT5 models are used for sentiment classification. Results are compared using the AUC score. The ByT5 model fine-tuned for sentiment classification achieved the highest AUC score (0.975). Diacritics restoration did not significantly improve sentiment classification with the machine learning models at any of the tested dimensionalities. However, the ByT5 model for sentiment classification showed a significant improvement over the machine learning models (p-value close to 0). |