Capstone Projects
Browsing Capstone Projects by Subject "Cluster Analysis"
Open Access
Pairwise Overlap and Misclassification in Cluster Analysis (Nazarbayev University School of Science and Technology, 2015)
Akynkozhayev, Birzhan
Separating data into distinct groups is one of the most important tools of learning and a means of obtaining valuable information from data. Cluster analysis studies ways of distributing objects into groups with similar characteristics. Real-world examples of such applications include age separation of a population, loyalty grouping of customers, and classification of living organisms into kingdoms. In particular, cluster analysis is an important objective of data mining, which focuses on extracting key information from data and converting it into a more understandable form. There is no single best algorithm for producing data partitions in cluster analysis, but many perform well in various circumstances (Jain, 2008). Many popular clustering algorithms are based on an iterative partitioning method, in which single items are moved step by step from one cluster to another to optimize some criterion. One such algorithm, discussed in this paper, is the K-means algorithm, which partitions data points by minimizing the sum of squared distances within clusters (MacQueen, 1967). Another large class of algorithms is based on finite mixture models. For example, the stochastic emEM clustering method, which is also covered in this paper, is based on maximum likelihood estimation of the parameters of a statistical model (Melnykov & Maitra). Misclassification of data is not rare in cluster analysis. For instance, several points are misclassified in Figure 1, which compares the true partition (a) with the solution found by the K-means algorithm (b). Various factors lead to misclassification in clustering algorithms.
The main goal of this paper is to analyze the effect of pairwise overlap, the number of dimensions of the data, and the number of clusters on misclassification. The simplest case in which misclassification can occur is that of two clusters. Because the overlap is exact in this case, we used one of the simplest algorithms, K-means. For a higher number of clusters, where the overlap is estimated, we considered the more complex emEM algorithm.
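
The two-cluster case described above can be illustrated with a minimal sketch. It is not the author's code or experimental setup: it assumes two spherical Gaussian clusters with unit variance, uses the distance between their means (the hypothetical `separation` parameter) as a rough stand-in for overlap rather than the pairwise overlap measure studied in the paper, and implements the common batch (Lloyd-style) variant of K-means instead of MacQueen's original sequential update. All function names are illustrative.

```python
import numpy as np


def kmeans(X, k, n_iter=100, seed=0):
    """Batch (Lloyd-style) K-means; returns labels and cluster centers."""
    rng = np.random.default_rng(seed)
    # Initialize centers with k distinct data points chosen at random.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every center: shape (n, k).
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Move each center to the mean of its points (keep it if the cluster is empty).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers


def misclassification_rate_two_clusters(true_labels, found_labels):
    """Proportion of misclassified points for two clusters.

    Cluster labels are arbitrary, so the smaller of the two possible
    label matchings is taken.
    """
    err = np.mean(true_labels != found_labels)
    return min(err, 1.0 - err)


def simulate(separation, n_per_cluster=500, dim=2, seed=0):
    """Assumed setup: two unit-variance Gaussian clusters `separation` apart."""
    rng = np.random.default_rng(seed)
    shift = np.zeros(dim)
    shift[0] = separation  # second cluster is shifted along the first axis
    a = rng.normal(size=(n_per_cluster, dim))
    b = rng.normal(size=(n_per_cluster, dim)) + shift
    X = np.vstack([a, b])
    true_labels = np.repeat([0, 1], n_per_cluster)
    found_labels, _ = kmeans(X, k=2, seed=seed)
    return misclassification_rate_two_clusters(true_labels, found_labels)


if __name__ == "__main__":
    for separation in (1.0, 2.0, 4.0, 8.0):
        print(separation, simulate(separation))
```

Under these assumptions, clusters that lie closer together (and therefore overlap more) should produce a higher proportion of misclassified points, while well-separated clusters should be recovered almost perfectly. The `dim` argument can be varied to explore the effect of dimensionality in the same simplified setting.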