Abstract [eng] |
This work applies document clustering to news articles from three major Lithuanian news sites. Several aspects of clustering are studied, including feature selection and a comparison of k‑means and hierarchical clustering algorithms. The study proposes a metric for measuring how well particular words describe the contents of a cluster. In addition, a two-level clustering method that combines the hierarchical and k‑means algorithms is proposed. The results show that TF–IDF with stemming produces significantly better results than simple TF and/or no stemming. K‑means also produced higher-quality clusterings than the hierarchical methods and was less sensitive to feature space reduction. The proposed two-level clustering showed promising results, but its clustering quality did not match that of the k‑means algorithm. |