| Abstract [eng] |
In contemporary retail, user-generated content increasingly influences consumer behavior, reflecting both emotional tone and sentiment. To extract strategic insights for business decisions, companies rely on opinion mining, also known as sentiment analysis. However, applying sentiment analysis is challenging in less widely spoken languages, such as Lithuanian, due to linguistic complexity and limited NLP resources. This thesis compares classical and modern vectorization and classification methods and their combinations, evaluating their effectiveness and applicability for cross-domain sentiment analysis within the Lithuanian retail sector. The experiments used a dataset of user text reviews of Lithuanian retail companies, covering the period from 2011 to 2025. The reviews were categorized into seven retail domains (E-Marketplace, E-Tech, E-Niche, Groceries, Clothing, Beauty and Other). The vectorization methods compared included classical latent semantic analysis (LSA) and modern embedding models (Jina, E5, GTE and XLM-RoBERTa); each was combined with three traditional classification methods: regularized Logistic Regression, Support Vector Machines and Random Forest. XLM-RoBERTa was also evaluated separately, as a standalone classifier. The methods were assessed mainly using the ROC AUC and PRC AUC metrics. After grouping the models by their vectorization dimensionality, the overall most effective method for Lithuanian sentiment analysis tasks was identified. The modern E5 vectorization model, paired with the regularized logistic regression classifier, achieved the best classification accuracy, e.g., 92.6% (ROC AUC = 0.978, PRC AUC = 0.998) when trained on the grocery sector and tested on electronics store domains. 
The best-performing method, trained on grocery domain data, also excelled when classifying reviews from other domains, such as apparel and beauty products, and even from less semantically related test domains. These findings confirm that modern vectorization models can be successfully applied to low-resource and morphologically complex languages and to cross-domain retail scenarios. |