Title Lietuviškų naujienų grupavimas pasitelkiant dokumentų vektorizavimus
Translation of Title Clustering of Lithuanian news articles using document embeddings.
Authors Stankevičius, Lukas
Pages 64
Keywords [eng] document clustering ; Lithuanian news articles ; text preprocessing ; document embeddings ; doc2vec
Abstract [eng] This work studies document embeddings for clustering Lithuanian news articles. A total of 82793 news articles were collected from three Lithuanian news websites. The best values of two doc2vec parameters, the number of epochs and the vector size, are estimated. The TF-IDF weighting scheme is compared with doc2vec models trained on the full and partial datasets. Text preprocessing techniques such as the maximum number of features, stop-word removal, and lemmatization are studied separately for the TF-IDF and doc2vec models. A database of more than 2 million word forms and 72587 lemmas was compiled for lemmatization. The work shows that an optimized doc2vec model performs much better than the usual TF-IDF weighting scheme, and it elaborates on the different optimal text preprocessing approach for the doc2vec text representation model.
Dissertation Institution Kauno technologijos universitetas.
Type Master thesis
Language Lithuanian
Publication date 2019