Clustering of Lithuanian news articles

Vilius Pranckaitis; Mantas Lukoševičius

Title	Clustering of Lithuanian news articles
Authors	Pranckaitis, Vilius ; Lukoševičius, Mantas
Is Part of	CEUR workshop proceedings: IVUS 2017: International conference on information technology: proceedings of the IVUS international conference on information technology, Kaunas, Lithuania, April 28, 2017 / edited by: R. Damaševičius, T. Krilavičius, A. Lopata, Ch. Napoli, M. Woźniak. Aachen : CEUR-WS. 2017, vol. 1856, p. 27-32. ISSN 1613-0073
Keywords [eng]	document clustering ; feature selection ; k-means ; hierarchical clustering ; Lithuanian news articles
Abstract [eng]	There is arguably more research on clustering English texts than on texts in any other language. This article studies the clustering of Lithuanian news articles. For text preprocessing, the effects of stemming, term frequency metrics, and feature filtering are investigated. In addition, the following clustering algorithms are compared: k-means, bisecting k-means, and three linkage variants of hierarchical clustering. The results show that the k-means algorithm gives the best overall results and that only one of the three hierarchical algorithms produces comparably good results. Term frequency-inverse document frequency (TF-IDF) weighting with stemming significantly increased clustering quality compared with omitting stemming and/or using plain term frequency (TF). Feature filtering by IDF helped to optimize the k-means algorithm but reduced quality when hierarchical clustering was used.
Published	Aachen : CEUR-WS
Type	Conference paper
Language	English
Publication date	2017
