Title Clustering of Lithuanian news articles
Authors Pranckaitis, Vilius ; Lukoševičius, Mantas
Is Part of CEUR workshop proceedings: IVUS 2017: International conference on information technology: proceedings of the IVUS international conference on information technology, Kaunas, Lithuania, April 28, 2017 / edited by: R. Damaševičius, T. Krilavičius, A. Lopata, Ch. Napoli, M. Woźniak. Aachen : CEUR-WS. 2017, vol. 1856, p. 27-32. ISSN 1613-0073
Keywords [eng] document clustering ; feature selection ; k-means ; hierarchical clustering ; Lithuanian news articles
Abstract [eng] There is arguably more research done on clustering of English texts than of texts in any other language. In this article, the process of clustering Lithuanian news articles is studied. For text preprocessing, the effects of stemming, term frequency metrics, and feature filtering are investigated. In addition, the following clustering algorithms are compared: k-means, bisecting k-means, and three linkage-method variations of hierarchical clustering. The results show that the k-means algorithm gives the best overall results and that only one of the three hierarchical algorithms produces comparably good results. Term frequency-inverse document frequency (TF-IDF) weighting with stemming significantly increased clustering quality compared to omitting stemming and/or using TF alone. Feature filtering by IDF helped to optimize the k-means algorithm but reduced quality when used with hierarchical clustering.
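Illustration (not from the paper): the abstract mentions TF-IDF weighting and feature filtering by IDF. A minimal Python sketch of both ideas follows. The IDF formula (ln(N / df)), the threshold parameters min_idf and max_idf, the function names, and the toy pre-stemmed tokens are all assumptions chosen for illustration; this record does not state which IDF variant or filtering thresholds the authors used.

```python
import math
from collections import Counter


def tf_idf(docs, min_idf=0.0, max_idf=float("inf")):
    """Build TF-IDF vectors for tokenized docs.

    Only terms whose IDF lies in [min_idf, max_idf] are kept
    (a hypothetical filtering rule, not the paper's exact scheme).
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    # Plain IDF = ln(N / df), one of several common variants.
    idf = {term: math.log(n_docs / count) for term, count in df.items()}
    vocab = sorted(t for t, v in idf.items() if min_idf <= v <= max_idf)
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts within the document
        vectors.append([tf[term] * idf[term] for term in vocab])
    return vocab, vectors


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy documents, already tokenized and "stemmed" (made-up data).
docs = [
    ["seim", "balsav", "istatym"],
    ["seim", "priem", "biudzet"],
    ["komand", "laimej", "rungtyn"],
    ["komand", "pralaimej", "rungtyn"],
]

vocab, vectors = tf_idf(docs)
# Keep only the rarer terms (IDF >= 1.0); the threshold is arbitrary.
filtered_vocab, filtered_vectors = tf_idf(docs, min_idf=1.0)
```

In this toy corpus the first two documents share a stem, so their TF-IDF vectors have a higher cosine similarity than the first and third documents, which share none. Vectors of this kind are what a clustering algorithm such as k-means would take as input.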
Published Aachen : CEUR-WS
Type Conference paper
Language English
Publication date 2017