Title |
Lietuviškų naujienų grupavimas pasitelkiant dokumentų vektorizavimus |
Translation of Title |
Clustering of Lithuanian news articles using document embeddings. |
Authors |
Stankevičius, Lukas |
Pages |
64 |
Keywords [eng] |
document clustering ; Lithuanian news articles ; text preprocessing ; document embeddings ; doc2vec |
Abstract [eng] |
In this work, document embeddings are studied for clustering Lithuanian news articles. A total of 82793 news articles from three Lithuanian news websites are collected. The best values for the doc2vec parameters, namely the number of epochs and the vector size, are estimated. The TF-IDF weighting scheme is compared with doc2vec models trained on full and partial datasets. Text preprocessing techniques, such as limiting the maximum number of features, stop word removal and lemmatization, are studied separately for both the TF-IDF and doc2vec models. A database consisting of more than 2 million word forms and 72587 lemmas was compiled for lemmatization. This work shows that an optimized doc2vec performs much better than the usual TF-IDF weighting scheme and describes a different optimal text preprocessing approach for the doc2vec text representation model. |
Dissertation Institution |
Kauno technologijos universitetas. |
Type |
Master thesis |
Language |
Lithuanian |
Publication date |
2019 |