Clustering is an important unsupervised classification method that divides data into groups based on a similarity metric. *K*-means is an increasingly popular clustering method and is widely used in different applications. Centroid initialization is the key step in *K*-means clustering. In general, *K*-means has three efficient initialization strategies to improve its performance, *i.e.*, Random, *K*-means++ and PCA-based *K*-means. In this paper, we design an experiment to evaluate these three strategies on the UCI ML hand-written digits dataset. The experimental results show that the three *K*-means initialization strategies find almost identical cluster centroids and produce almost the same clustering results, but the PCA-based *K*-means strategy significantly reduces running time and is faster than the other two strategies.

Machine learning, in general, is a powerful tool for predicting the properties of unknown data based on a set of training data with or without labels. Generally, there are two types of learning methods: unsupervised learning and supervised learning. In supervised learning, the training data has explicit labels (labeled data). However, in some cases it is difficult to obtain labeled training data. Unsupervised learning is then a natural choice for grouping similar patterns without labeled data. As a result of clustering, data in the same group are more similar to each other than to data in other groups, and each group has a centroid that represents it. In the prediction phase, an unknown data point is assigned to the group whose centroid is nearest to it. The K-means algorithm performs well compared with other clustering algorithms and has good robustness [

In K-means clustering, the number of groups K is predetermined. After K centroids are initialized, the distance between each point and the centroids is calculated, for instance the Euclidean distance shown in Equation (2), and each point is assigned to its nearest centroid. The algorithm then updates the group centroids and repeats the above steps. In this way it greedily minimizes the sum-of-squared-error criterion (SSE) in Equation (1), so that K-means finds the group centroids and assigns unknown data to the nearest group centroid.

$$E = \sum_{k=1}^{K} \sum_{x \in C_k} d^2(x, m_k) \qquad (1)$$

$$d^2(x, m_k) = \sum_{n=1}^{N} (x_n - m_{kn})^2 \qquad (2)$$
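The iteration behind Equations (1) and (2) can be sketched in plain Python. This is a minimal illustration, not the implementation used in the experiment: the function names (`squared_distance`, `assign`, `update`, `kmeans`, `sse`) and the four toy points are assumptions chosen for readability, and empty clusters are simply left where they are.

```python
def squared_distance(x, m):
    """Equation (2): squared Euclidean distance between point x and centroid m."""
    return sum((xn - mn) ** 2 for xn, mn in zip(x, m))


def assign(points, centroids):
    """Return the index of the nearest centroid for every point."""
    return [
        min(range(len(centroids)), key=lambda k: squared_distance(x, centroids[k]))
        for x in points
    ]


def sse(points, centroids, labels):
    """Equation (1): total squared distance of the points to their own centroid."""
    return sum(squared_distance(x, centroids[k]) for x, k in zip(points, labels))


def update(points, labels, centroids):
    """Move each centroid to the mean of its members (empty clusters stay put)."""
    new_centroids = []
    for k, old in enumerate(centroids):
        members = [x for x, label in zip(points, labels) if label == k]
        if members:
            new_centroids.append([sum(col) / len(members) for col in zip(*members)])
        else:
            new_centroids.append(list(old))
    return new_centroids


def kmeans(points, centroids, max_iter=100):
    """Alternate update and assignment until the labels stop changing."""
    labels = assign(points, centroids)
    for _ in range(max_iter):
        centroids = update(points, labels, centroids)
        new_labels = assign(points, centroids)
        if new_labels == labels:
            break
        labels = new_labels
    return centroids, labels


# Toy data (assumed): two well-separated pairs of points.
points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
centroids, labels = kmeans(points, centroids=[[0.0, 0.0], [1.0, 1.0]])
print(labels, sse(points, centroids, labels))
```

The only input that varies between initialization strategies is the `centroids` argument, which is why the choice of starting centroids can change both the number of iterations and the final SSE.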

Studies have shown that the performance of K-means depends strongly on the initialization strategy for the centroid locations [

In this section, we introduce the three dominant K-means initialization strategies. The three strategies influence the clustering results in different ways.

Random Partition initialization method [

However, Celebi et al. [

K-means++ [

The K-means++ initialization strategy not only speeds up convergence but also provides a better solution than random K-means initialization.
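As an illustration, K-means++ seeding is commonly described as D² sampling: the first centroid is drawn uniformly at random, and each further centroid is drawn with probability proportional to its squared distance from the nearest centroid chosen so far. The sketch below follows that common description and may differ in detail from the variant used in the experiment; `kmeans_pp_init` and the toy points are hypothetical names and data.

```python
import random


def squared_distance(x, m):
    """Squared Euclidean distance between two points."""
    return sum((xn - mn) ** 2 for xn, mn in zip(x, m))


def kmeans_pp_init(points, k, rng):
    """Pick k initial centroids by D^2 sampling (illustrative sketch)."""
    # First centroid: uniform random choice among the data points.
    centroids = [list(rng.choice(points))]
    while len(centroids) < k:
        # Squared distance from each point to its closest chosen centroid.
        d2 = [min(squared_distance(x, c) for c in centroids) for x in points]
        if sum(d2) == 0:
            # All remaining points coincide with a centroid: fall back to uniform.
            next_point = rng.choice(points)
        else:
            # Far-away points are proportionally more likely to be picked.
            next_point = rng.choices(points, weights=d2, k=1)[0]
        centroids.append(list(next_point))
    return centroids


# Toy data (assumed): two copies of one point and one distant point.
rng = random.Random(0)
toy_points = [[0.0, 0.0], [0.0, 0.0], [5.0, 5.0]]
seeds = kmeans_pp_init(toy_points, k=2, rng=rng)
print(sorted(seeds))
```

Because a point that coincides with an existing centroid has weight zero, the two seeds in this toy example always land on the two distinct locations, whichever one is drawn first.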

The aforementioned methods search for a better clustering in the raw, high-dimensional space. Recent work [

In this experiment, we compare the above three initialization strategies for K-means in terms of runtime and quality of the results on the UCI ML hand-written digits dataset.

Some digit samples from the UCI ML hand-written digits dataset are shown as

The bitmaps of handwritten digits come from 43 people: 30 contributed to the training set and the other 13 to the test set.

Every digit is a 32 × 32 bitmap, which is divided into non-overlapping blocks of 4 × 4 pixels. The number of "on" pixels is counted in each block.

An 8 × 8 input matrix is thus generated for each digit, and each matrix element is an integer in the range 0 to 16.

Thus, the dataset has 1797 images of 8 × 8 elements, and every image is vectorized into a 64-dimensional feature vector with a ground-truth label.
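The preprocessing described above can be sketched as follows. The function name `block_counts` and the synthetic bitmap are assumptions made for illustration; the real bitmaps come from the dataset itself.

```python
def block_counts(bitmap, block=4):
    """Count the 'on' pixels in each non-overlapping block x block tile."""
    size = len(bitmap)
    return [
        [
            sum(
                bitmap[r][c]
                for r in range(top, top + block)
                for c in range(left, left + block)
            )
            for left in range(0, size, block)
        ]
        for top in range(0, size, block)
    ]


# Hypothetical 32 x 32 bitmap: a vertical stroke covering columns 12 to 17.
bitmap = [[1 if 12 <= c < 18 else 0 for c in range(32)] for _ in range(32)]

features = block_counts(bitmap)                 # 8 x 8 matrix, entries in 0..16
vector = [v for row in features for v in row]   # flattened 64-feature vector
print(features[0])
```

A fully covered 4 × 4 block yields 16 and an empty one yields 0, which is where the 0 to 16 range of each matrix element comes from.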

Since the dataset provides ground-truth labels, we can apply different cluster quality metrics to evaluate how well the cluster labels fit the ground truth, and thereby assess the influence of the K-means initialization strategies.

Inertia, or the within-cluster sum of squared distances, is a key measure of how internally coherent the clusters are. It is the sum of the squared distances between each point and its nearest centroid.

A clustering result satisfies homogeneity if each cluster contains only data points that belong to a single class. The score does not depend on the actual label values, so permuting the class or cluster labels leaves it unchanged. It ranges from 0.0 to 1.0.

Completeness measures how well the K-means algorithm assigns all the data points with a given label to the same group. The score also ranges from 0.0 to 1.0.

V-measure is the harmonic mean of homogeneity and completeness. Its score also ranges from 0.0 to 1.0.
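These three scores are usually defined through conditional entropy: homogeneity is 1 − H(C|K)/H(C), completeness is 1 − H(K|C)/H(K), and V-measure is their harmonic mean, where C denotes the ground-truth classes and K the cluster assignments. The sketch below assumes that standard definition; the function names and the four-point example are illustrative only.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy H(labels) in nats."""
    n = len(labels)
    return -sum(c / n * log(c / n) for c in Counter(labels).values())


def conditional_entropy(a, b):
    """H(a | b): remaining uncertainty about a once b is known."""
    n = len(a)
    joint = Counter(zip(a, b))
    b_counts = Counter(b)
    return -sum(c / n * log(c / b_counts[bv]) for (_, bv), c in joint.items())


def homogeneity_completeness_v(classes, clusters):
    """Return (homogeneity, completeness, V-measure) for two labelings."""
    h_c, h_k = entropy(classes), entropy(clusters)
    homo = 1.0 if h_c == 0 else 1.0 - conditional_entropy(classes, clusters) / h_c
    compl = 1.0 if h_k == 0 else 1.0 - conditional_entropy(clusters, classes) / h_k
    total = homo + compl
    v = 0.0 if total == 0 else 2 * homo * compl / total
    return homo, compl, v


# Every cluster is pure, but class 1 is split over two clusters.
homo, compl, v = homogeneity_completeness_v([0, 0, 1, 1], [0, 0, 1, 2])
print(round(homo, 3), round(compl, 3), round(v, 3))
```

The harmonic-mean relation can be checked against the reported scores: for K-means++, 2 × 0.602 × 0.650 / (0.602 + 0.650) ≈ 0.625, which matches the V-measure listed for that strategy.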

The Silhouette Coefficient for a sample is defined as:

$$\text{silhouette} = \frac{b - a}{\max(a, b)}$$

where a is the mean intra-cluster distance of the sample and b is its mean distance to the nearest other cluster. The coefficient ranges from −1 to 1, where 1 is the best result and −1 is the worst. The higher the Silhouette Coefficient, the better the model matches the defined clusters.
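A per-sample computation can be sketched as below, using the usual convention s = (b − a) / max(a, b), so that 1 is the best score. The function name `silhouette_sample` and the one-dimensional toy data are assumptions for illustration; the sketch requires at least two clusters and returns 0 for a cluster with a single member.

```python
from math import dist


def silhouette_sample(i, points, labels):
    """Silhouette Coefficient of sample i (illustrative sketch)."""
    own = labels[i]
    same = [
        dist(points[i], p)
        for j, p in enumerate(points)
        if labels[j] == own and j != i
    ]
    if not same:
        return 0.0  # convention for a cluster with a single member
    # a: mean distance to the other members of the sample's own cluster.
    a = sum(same) / len(same)
    # b: smallest mean distance to the members of any other cluster.
    b = min(
        sum(dist(points[i], p) for p, lab in zip(points, labels) if lab == other)
        / labels.count(other)
        for other in set(labels)
        if other != own
    )
    return (b - a) / max(a, b)


# Toy data (assumed): two tight, well-separated clusters on a line.
points = [[0.0], [1.0], [10.0], [11.0]]
labels = [0, 0, 1, 1]
scores = [silhouette_sample(i, points, labels) for i in range(len(points))]
print(sum(scores) / len(scores))
```

For the first point, a = 1 and b = 10.5, giving (10.5 − 1) / 10.5 ≈ 0.905. Averaging the per-sample values gives one score for the whole clustering.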

In this experiment, we compare the performance of the three classical initialization strategies based on the above-mentioned criteria. The experiment is run on a PC with an Intel® Core™ i7-6700 CPU @ 3.40 GHz × 8.
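The paper does not state which software was used, so the following is only one way such a comparison could be set up, assuming scikit-learn is available. Standardizing the features with `scale`, the choice of `n_init=10`, the fixed `random_state`, and the helper name `bench` are all assumptions; timings and scores will differ from machine to machine and need not reproduce the reported values.

```python
from time import perf_counter

from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

digits = load_digits()
data = scale(digits.data)   # assumption: features are standardized first
labels = digits.target
n_digits = 10


def bench(name, estimator):
    """Fit one K-means variant and collect the evaluation criteria."""
    start = perf_counter()
    estimator.fit(data)
    elapsed = perf_counter() - start
    pred = estimator.labels_
    return {
        "init": name,
        "time": elapsed,
        "inertia": estimator.inertia_,
        "homo": metrics.homogeneity_score(labels, pred),
        "compl": metrics.completeness_score(labels, pred),
        "v-meas": metrics.v_measure_score(labels, pred),
        "silhouette": metrics.silhouette_score(data, pred, metric="euclidean"),
    }


# PCA-based seeding: the first ten principal components act as initial centroids.
pca = PCA(n_components=n_digits).fit(data)

results = [
    bench("k-means++", KMeans(init="k-means++", n_clusters=n_digits, n_init=10, random_state=0)),
    bench("random", KMeans(init="random", n_clusters=n_digits, n_init=10, random_state=0)),
    # The seeding is deterministic, so a single run (n_init=1) is enough.
    bench("PCA-based", KMeans(init=pca.components_, n_clusters=n_digits, n_init=1)),
]
for row in results:
    print(row)
```

In this setup the PCA-based variant runs K-means once, while the other two restart from ten different seeds, which is one plausible reason for a large gap in running time.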

To show the clustering results in 2D coordinates, we use PCA to reduce the dataset to two dimensions, transforming each 64-dimensional feature vector into the 2D subspace. The reduced dataset is plotted with dot markers, and the cluster centroids are drawn on the figure with different markers, as shown in

From

As shown in

Init | Time | Inertia | Homo | Compl | V-meas | Silhouette |
---|---|---|---|---|---|---|
K-means++ | 0.24 s | 69432 | 0.602 | 0.650 | 0.625 | 0.146 |
Random | 0.17 s | 69694 | 0.669 | 0.710 | 0.689 | 0.147 |
PCA-based | 0.02 s | 71820 | 0.673 | 0.715 | 0.693 | 0.150 |

although the separation distance is small. The homogeneity and completeness scores all lie between 0.0 and 1.0 and are close to each other across the three strategies, which means the results are acceptable. The V-measure values (0.625, 0.689 and 0.693) summarize homogeneity and completeness in a single score and show the same pattern. These four evaluation indicators confirm that the three classical initialization strategies give acceptable clustering results on the test dataset.

One notable difference is the running time. From the evaluation results in

In this work, we design an experiment to evaluate the performance of three classical K-means initialization strategies on the UCI ML hand-written digits dataset: Random, K-means++ and PCA-based K-means. The experimental results show that the three strategies find almost identical cluster centroids and achieve similar clustering accuracy. However, the PCA-based K-means strategy significantly reduces running time (0.02 s, against 0.17 s for Random and 0.24 s for K-means++), which makes it the most efficient of the three for clustering. In further studies, more machine learning models such as neural networks can be investigated and compared with the models used in this paper.

Li, B.Y. (2018) An Experiment of K-Means Initialization Strategies on Handwritten Digits Dataset. Intelligent Information Management, 10, 43-48. https://doi.org/10.4236/iim.2018.102003