Abstract [eng] |
This thesis focuses on e-reviews written by customers of Lithuanian companies from various business fields. To find the best classification combination, a hypothesis about the correlation between accuracy and the dimensionality of the input data variables was tested. To this end, eight vector embedding methods were tried: modifications of the Bag-of-Words method, the Paragraph Vector Distributed Memory method, Latent Semantic Indexing, Latent Dirichlet Allocation, Random Projections, Sent2Vec, and BERT. The main component of the classifier for the sentiment polarity detection task was one of the following machine learning algorithms: Random Forest, Logistic Regression, linear Support Vector Machine, or gradient boosting. All possible classification combinations were compared using the kappa accuracy score. In addition, lexicon-based methods were applied. In most cases, combinations of a machine learning algorithm and a document embedding showed better results than the lexicon-based methods. The most accurate results were achieved by the distributed bag-of-words vectorization method combined with the gradient boosting algorithm. |