Is Part of
DAMSS 2022: 13th conference on data analysis methods for software systems, Druskininkai, Lithuania, December 1–3, 2022 / Lithuanian computer society, Vilnius university Institute of data science and digital technologies, Lithuanian academy of sciences. Vilnius: Vilnius university press, 2022, p. 56–57. ISBN 9786090707944. eISBN 9786090707951
Abstract [eng]
Data analysis is widely applied in fields such as business, manufacturing, online trade, and consumer services. Because of this broad range of applications, data mining receives a great deal of attention. Data clustering is an unsupervised type of machine learning that is also widely used in data mining. Its main goal is to divide objects into separate, unknown groups so that the objects within each group are as similar as possible. Forming such groups makes it possible to find hidden relationships in the data. Data clustering is applied in areas such as bioinformatics, feature selection, pattern recognition, and others. Although many clustering methods exist, clustering itself remains a complex task: because of differing data structures, individual methods work well only under certain conditions, so the need for new methods remains high. One of the most widely used clustering methods is k-means, which is relatively simple but can be effective under favourable conditions. Most clustering methods perform poorly when outliers are present in the data; k-means suffers from this drawback, as do GMM, BGMM, and some other methods. Recently, researchers have paid considerable attention to various density estimation methods and to their robust modifications, such as soft-constrained neural networks. Given this demand for density estimation, this paper aims to evaluate the accuracy of a new clustering method based on modified inversion formula density estimation. The results show that the developed method is competitive with the currently most popular methods (k-means, GMM, BGMM). Based on the clustering results, the MIDEv2 method performs best on generated data with noise across all datasets (0.5%, 1%, 2%, 4% noise). Notably, the new method can also cluster data that contain no noise/outliers, for example the popular Iris dataset. A possible shortcoming is that the method is difficult to apply in high dimensions (d > 15); this limitation will be addressed in further iterations of the method.
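To make the baseline comparison concrete, the sketch below shows how the reference methods named in the abstract (k-means, GMM, BGMM) could be evaluated on the Iris data with injected outliers at the stated noise levels. It is not the authors' code: scikit-learn, the outlier-injection scheme, and the use of the adjusted Rand index as the accuracy measure are all assumptions made for illustration, and the MIDEv2 method itself is not included because its implementation is not given here.

```python
# Minimal sketch (assumed setup, not the authors' experiment): run k-means,
# GMM, and BGMM on Iris with a small fraction of injected outliers and score
# clustering accuracy with the adjusted Rand index (ARI).
import numpy as np
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture, BayesianGaussianMixture
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
X, y = load_iris(return_X_y=True)
k = len(np.unique(y))

def add_outliers(X, y, fraction):
    """Append uniformly drawn outliers; they receive a separate 'noise' label (-1)."""
    n_out = max(1, int(fraction * len(X)))
    lo, hi = X.min(axis=0), X.max(axis=0)
    outliers = rng.uniform(lo - (hi - lo), hi + (hi - lo), size=(n_out, X.shape[1]))
    return np.vstack([X, outliers]), np.concatenate([y, np.full(n_out, -1)])

# Noise levels taken from the abstract (plus the clean data as a reference).
for frac in (0.0, 0.005, 0.01, 0.02, 0.04):
    Xn, yn = add_outliers(X, y, frac) if frac > 0 else (X, y)
    models = {
        "k-means": KMeans(n_clusters=k, n_init=10, random_state=0),
        "GMM": GaussianMixture(n_components=k, random_state=0),
        "BGMM": BayesianGaussianMixture(n_components=k, random_state=0),
    }
    scores = {name: adjusted_rand_score(yn, m.fit_predict(Xn)) for name, m in models.items()}
    print(f"noise {frac:.1%}: " + ", ".join(f"{n}={s:.3f}" for n, s in scores.items()))
```

A robust density-estimation-based method such as the one described in the abstract would be compared by adding its cluster assignments to the same loop; the declining ARI of the baselines as the noise fraction grows is what the abstract's comparison is measuring.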