Title |
Analysis of clustering methods performance across multiple datasets |
Authors |
Lukauskas, Mantas ; Ruzgas, Tomas |
DOI |
10.15388/DAMSS.12.2021 |
ISBN |
9786090706732 |
eISBN |
9786090706749 |
Is Part of |
DAMSS 2021: 12th conference on data analysis methods for software systems, Druskininkai, Lithuania, December 2–4, 2021 / Lithuanian computer society, Vilnius university Institute of data science and digital technologies, Lithuanian academy of sciences. Vilnius : Vilnius university press, 2021. p. 45-46. ISBN 9786090706732. eISBN 9786090706749 |
Abstract [eng] |
As the amount of data grows each year, it becomes increasingly difficult to analyze. A variety of machine learning algorithms have been proposed for data analysis to support decision-making in research and other activities that require solutions. The two most significant types of machine learning are probably supervised learning and unsupervised learning. If there is no prior knowledge of the data classes, unsupervised learning is required. One of the most commonly used forms of unsupervised learning is clustering. Clustering is often described as a process that seeks to find hidden relationships in the data. The classes do not need to be known in advance to find these relationships, which allows the main groups in the data to be distinguished. Data clustering can be performed using various methods, which are generally divided into four main groups: partitioning methods, hierarchical methods, density-based methods, and grid-based methods. Partitioning methods are flexible methods based on the iterative division of data points into clusters and the subsequent redistribution of these points between clusters. The best-known and one of the most widely used partitioning methods is k-means. Hierarchical clustering is a recursive partitioning of a dataset into successively smaller clusters; hierarchical methods work by creating a hierarchy of groups. Density-based clustering is a nonparametric approach in which the clusters are regions where the underlying density p(x) is high. Grid-based clustering is the last class of clustering methods. It works by dividing the entire data space into a grid structure with a certain number of cells. Clustering is then performed on these cells instead of individual points, which can significantly reduce the computation time.
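The grid-based idea described above can be sketched roughly as follows. This is a minimal illustration only, not the method evaluated in this work: the function name `grid_cluster` and the parameters `cell_size` and `min_points` are assumptions introduced here, and merging adjacent dense cells is just one simple way to cluster cells instead of points.

```python
from collections import defaultdict, deque
from itertools import product


def grid_cluster(points, cell_size=1.0, min_points=2):
    """Toy grid-based clustering (hypothetical helper, not from the source).

    Returns one label per point; -1 marks points that fall in sparse cells.
    """
    # 1. Map each point to the integer coordinates of its grid cell.
    cells = defaultdict(list)
    for idx, p in enumerate(points):
        key = tuple(int(c // cell_size) for c in p)
        cells[key].append(idx)

    # 2. Keep only cells that hold at least min_points points.
    dense = {key for key, members in cells.items() if len(members) >= min_points}

    # 3. Merge adjacent dense cells with a breadth-first search over cells,
    #    so the search cost depends on the number of cells, not points.
    labels = [-1] * len(points)
    seen = set()
    cluster_id = 0
    for start in sorted(dense):
        if start in seen:
            continue
        queue = deque([start])
        seen.add(start)
        while queue:
            cell = queue.popleft()
            for idx in cells[cell]:
                labels[idx] = cluster_id
            # Visit every neighbouring cell (including diagonals).
            for offset in product((-1, 0, 1), repeat=len(cell)):
                neighbour = tuple(c + o for c, o in zip(cell, offset))
                if neighbour in dense and neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        cluster_id += 1
    return labels


# Made-up 2-D points: two dense regions and one isolated point.
points = [(0.1, 0.2), (0.4, 0.3), (1.2, 0.5), (1.6, 0.1),
          (8.0, 8.1), (8.3, 8.4), (4.0, 4.0)]
labels = grid_cluster(points, cell_size=1.0, min_points=2)
print(labels)  # [0, 0, 0, 0, 1, 1, -1]
```

In this sketch the result depends strongly on the assumed `cell_size` and `min_points`; a coarser grid merges more points into fewer clusters, while a finer grid marks more points as noise.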
This work aims to compare different groups of clustering methods, and particular methods within them, on different datasets and to evaluate their performance. It also seeks to include both well-known methods and much less commonly used ones. Finally, this work provides some guidance on when specific methods are best suited. |
Published |
Vilnius : Vilnius university press, 2021 |
Type |
Conference paper |
Language |
English |
Publication date |
2021 |