Title |
Clustering of Lithuanian news articles |
Authors |
Pranckaitis, Vilius ; Lukoševičius, Mantas |
Is Part of |
CEUR workshop proceedings: IVUS 2017: International conference on information technology: proceedings of the IVUS international conference on information technology, Kaunas, Lithuania, April 28, 2017 / edited by: R. Damaševičius, T. Krilavičius, A. Lopata, Ch. Napoli, M. Woźniak. Aachen : CEUR-WS. 2017, vol. 1856, p. 27-32. ISSN 1613-0073 |
Keywords [eng] |
document clustering ; feature selection ; k-means ; hierarchical clustering ; Lithuanian news articles |
Abstract [eng] |
Arguably more research has been done on clustering English texts than texts in any other language. This article studies the process of clustering Lithuanian news articles. For text preprocessing, the effects of stemming, term frequency metrics, and feature filtering are investigated. In addition, the following clustering algorithms are compared: k-means, bisecting k-means, and three linkage-method variations of hierarchical clustering. The results show that the k-means algorithm gives the best overall results and that only one of the three hierarchical algorithms produces comparably good results. Term frequency-inverse document frequency (TF-IDF) weighting with stemming significantly increased clustering quality compared with omitting stemming and/or using plain term frequency (TF). Feature filtering by IDF helped to optimize the k-means algorithm but reduced quality when using hierarchical clustering. |
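
As a rough illustration of the TF-IDF weighting and IDF-based feature filtering mentioned in the abstract, the following minimal sketch may help. It is not the authors' implementation: the function name `tf_idf`, the `min_idf` threshold, the plain `log(N / df)` IDF variant, the whitespace tokenization without stemming, and the three toy sentences are all assumptions made for this example. The direction of the filter (dropping low-IDF terms) is likewise assumed.

```python
import math
from collections import Counter


def tf_idf(docs, min_idf=0.0):
    """Return (vocabulary, vectors) of TF-IDF weights.

    Terms whose IDF is below `min_idf` are dropped (hypothetical filter).
    """
    # Naive whitespace tokenization; no stemming is applied in this sketch.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Plain IDF = log(N / df), one of several common variants.
    idf = {term: math.log(n_docs / count) for term, count in df.items()}
    vocab = sorted(term for term, value in idf.items() if value >= min_idf)
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts for this document
        vectors.append([tf[term] * idf[term] for term in vocab])
    return vocab, vectors


# Toy documents invented for illustration only.
docs = [
    "seimas priėmė biudžetą",
    "seimas svarstė įstatymą",
    "krepšinio rungtynės kaune",
]
vocab_all, _ = tf_idf(docs)
vocab_filtered, vectors = tf_idf(docs, min_idf=0.5)
```

Here "seimas" occurs in two of the three documents, so its IDF falls below the assumed threshold and it is removed from the filtered vocabulary. The resulting vectors could then be passed to a clustering routine such as k-means.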
Published |
Aachen : CEUR-WS |
Type |
Conference paper |
Language |
English |
Publication date |
2017 |