Initial Value Filtering Optimizes Fast Global K-Means

The K-means clustering algorithm is an important unsupervised learning algorithm and plays a significant role in big data processing, computer vision and other research fields. However, because it is sensitive to the initial partition, outliers, noise and other factors, its clustering results in data analysis, image segmentation and other fields are unstable and lack robustness. Based on the fast global K-means clustering algorithm, this paper proposes an improved K-means clustering algorithm. Through a neighborhood filtering mechanism, the points in the neighborhood of an already selected initial clustering center do not participate in the selection of the next initial clustering center, which reduces the randomness of the initial partition and improves the efficiency of initial partitioning. Mahalanobis distance is used in the clustering process to better account for the global structure of the data. Compared with the traditional clustering algorithm and other optimized algorithms, the proposed algorithm gives significantly improved results in tests on real data sets.


K-Means Clustering
The K-means algorithm is a classical clustering algorithm with a wide range of applications. This section summarizes the algorithm and the optimized algorithms derived from it.

Traditional K-Means Clustering
The classic K-means algorithm proceeds in the following steps:
Step 1: The user inputs the parameter K [5], the number of initial clustering centers, which is generally chosen from the given data samples based on empirical observation. The algorithm randomly generates K clustering centers $m_1, m_2, \ldots, m_k$, which represent the clusters $c_1, c_2, \ldots, c_k$.
Step 2: Calculate the Euclidean distance from each sample point $x_i$ in data set D to the K clustering centers [6], and put each sample into the cluster whose center is nearest. Euclidean distance represents the degree of similarity between the sample point and the cluster center: the smaller the distance, the higher the similarity. The calculation is shown in formula (1).
$$d(x_i, m_j) = \lVert x_i - m_j \rVert_2 = \sqrt{(x_i - m_j)^{T} (x_i - m_j)} \quad (1)$$
Step 3: Calculate the mean value of all sample points in each cluster, and update all clustering centers from step 1 with the obtained mean values.
Step 4: Repeat step 2 and step 3 until the clustering centers obtained in two consecutive iterations no longer change, then end the clustering.
The traditional K-means clustering algorithm is simple in concept and easy to implement, and is one of the most widely studied and applied clustering algorithms.
However, the random selection of the initial clustering centers makes both the clustering results and the clustering efficiency unstable, and can trap the algorithm in a local optimum [7].
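The four steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the experiments: it assumes the random initial centers are drawn from the samples themselves, and the names `kmeans`, `euclidean`, `seed` and `max_iter` are illustrative choices that do not come from the paper.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def kmeans(samples, k, seed=0, max_iter=100):
    """Plain K-means: random initial centers, then assign/update until stable."""
    rng = random.Random(seed)
    # Step 1 (assumption): pick k distinct samples at random as initial centers.
    centers = [list(p) for p in rng.sample(samples, k)]
    labels = [0] * len(samples)
    for _ in range(max_iter):
        # Step 2: assign every sample to its nearest center.
        labels = [min(range(k), key=lambda j: euclidean(x, centers[j]))
                  for x in samples]
        # Step 3: move each center to the mean of its members.
        new_centers = []
        for j in range(k):
            members = [x for x, lab in zip(samples, labels) if lab == j]
            if members:
                new_centers.append([sum(col) / len(members)
                                    for col in zip(*members)])
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        # Step 4: stop once the centers no longer change.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels
```

Running the sketch with different `seed` values on the same data can give different final centers, which is the instability that the following sections try to remove.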

Fast Global K-Means
The fast global K-means algorithm is an improvement on the traditional K-means algorithm. It finds the initial clustering centers by considering the global data, which reduces the sensitivity of the algorithm to outliers and noise [14] [15]. The algorithm flow chart is shown in Figure 1. The calculation of $b_n$ is shown in formula (2), where N is the total number of samples, $d_{k-1}^{j}$ is the minimum distance between sample point $x_j$ and the initial clustering centers already selected, and $x_n$ ranges over the sample points other than the clustering centers.
This algorithm effectively removes the randomness of the initial clustering centers [8], and it reduces the number of clustering iterations and thus shortens the clustering time. However, during the selection of each clustering center the distances must be recalculated for every sample point, which increases the time complexity of initial value selection.
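As an illustration of why the selection step is costly, the sketch below evaluates $b_n$ for every candidate point. Formula (2) is not reproduced in the text, so the sketch assumes the form commonly given for fast global K-means, $b_n = \sum_{j=1}^{N} \max(d_{k-1}^{j} - \lVert x_n - x_j \rVert^2, 0)$, with $d_{k-1}^{j}$ taken as a squared distance. The function names are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def next_center_fgk(samples, centers):
    """Return the sample that maximizes b_n, given the centers chosen so far."""
    # d_prev[j]: squared distance from x_j to its closest existing center.
    d_prev = [min(sq_dist(x, m) for m in centers) for x in samples]
    best_point, best_b = None, -1.0
    for x_n in samples:
        if any(sq_dist(x_n, m) == 0 for m in centers):
            continue  # skip points that already are centers
        # b_n (assumed form): reduction in error if x_n became a new center.
        b_n = sum(max(d_j - sq_dist(x_n, x_j), 0.0)
                  for d_j, x_j in zip(d_prev, samples))
        if b_n > best_b:
            best_point, best_b = x_n, b_n
    return best_point, best_b
```

Each call compares every candidate with every sample, so the cost of one selection grows with the square of the number of samples. This is the overhead that motivates restricting the candidate set.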

Global K-Means Algorithm
The global K-means algorithm described in [9] replaces the maximum relative distance $b_n$ from the existing clustering centers, used in the fast global K-means algorithm, with the maximum absolute distance. During the selection of an initial cluster center, $d_j$ is calculated only as the distance between the pre-selected cluster center $x_n$ and each other sample point $x_j$, and the values of $d_j$ are summed. Finally, the point with the minimum accumulated value is selected as the clustering center.
This method reduces the number of computational steps needed to select an initial cluster center and, to some extent, the time complexity of the algorithm. However, it ignores the influence of the already selected initial clustering centers on the next one, which loosens the constraints on initial value selection and increases the randomness.

Initial Value Filtering Optimizes Fast Global K-Means
In this paper, the selection of the initial cluster centers is optimized by neighborhood screening. When an initial clustering center is selected, the points within the minimum radius of the existing clustering centers do not participate in the selection of the next initial clustering center, which reduces the time complexity of initial value selection in the fast global K-means algorithm. When the clustering centers are updated, Mahalanobis distance is used instead of Euclidean distance, so that the algorithm takes more account of the global data and is better suited to applications in computer vision.

Neighborhood Filter
In practical applications, the cluster centers lie a certain distance apart, so the next cluster center must lie outside a certain neighborhood of each known cluster center. According to formula (2), no point that maximizes $b_n$ can lie within a certain neighborhood of a known initial cluster center. Therefore, the sample points in that neighborhood need not be evaluated when searching for the next initial cluster center. When the distribution of the samples is unknown, the size of the neighborhood depends largely on the number of clustering centers k. Suppose the first initial cluster center $m_1$ is located at the middle point of the samples, that sample $x_{max}$ is the sample point farthest from $m_1$, and that the distance between them is $d_{mmax}$. In the extreme case, the K initial clustering centers are evenly distributed on the line segment joining $x_{max}$ and $m_1$, with the two endpoints of the segment themselves being initial clustering centers, so the distance between adjacent initial clustering centers is $d_{mmax}/(k-1)$. The radius R is chosen so that enough sample points remain as candidates for the next initial clustering center after each center is determined, while the time complexity of the algorithm is kept as low as possible. In this paper, R is selected as in formula (3), where k is the number of clustering centers and $d_{mmax}$ is the maximum distance between any sample point and the first initial clustering center (i.e., the sample median).
Taking the selection of the second initial clustering center as an example, calculate the distance $d_m$ between each sample in the initial sample set $D = \{x_1, x_2, \ldots, x_n\}$ and the first initial clustering center $m_1$.
The set $D_1$, composed of the sample points with $d_m > R$, is selected. Each sample $x_n$ in $D_1$ is taken in turn as a candidate second clustering center, and $b_n$ is calculated according to formula (2) to determine the second initial clustering center $m_2$.
Then the distance between each sample point in $D_1$ and $m_2$ is calculated, and the points whose distance is greater than the minimum radius R form the set $D_2$. The centers $m_3, m_4, \ldots, m_k$ are obtained in the same way.
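The screening procedure can be sketched as follows. Because formula (3) is not reproduced in the text, the radius is passed in as a plain parameter rather than computed. The form of $b_n$ is the one commonly given for fast global K-means and is an assumption here, as is summing it over all samples rather than only the remaining candidates. The names `select_centers_filtered`, `first_center` and `radius` are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def select_centers_filtered(samples, first_center, k, radius):
    """Pick up to k initial centers, screening out neighbors of each new center."""
    centers = [first_center]
    candidates = list(samples)  # plays the role of D, then D_1, D_2, ...
    while len(centers) < k:
        # Keep only candidates farther than `radius` from the newest center.
        last = centers[-1]
        candidates = [x for x in candidates if sq_dist(x, last) > radius ** 2]
        if not candidates:
            break  # radius too large: no admissible candidate remains
        # Squared distance from every sample to its closest existing center.
        d_prev = [min(sq_dist(x, m) for m in centers) for x in samples]

        def gain(x_n):
            # Assumed form of b_n, evaluated only for surviving candidates.
            return sum(max(d - sq_dist(x_n, x_j), 0.0)
                       for d, x_j in zip(d_prev, samples))

        centers.append(max(candidates, key=gain))
    return centers
```

The saving comes from `gain` being evaluated only for the surviving candidates, so the candidate set shrinks after every selected center. If `radius` is too large the candidate set empties before k centers are found, which is why R has to leave enough sample points for the later centers.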

Mahalanobis Distance
Most current research on the K-means clustering algorithm performs clustering based on Euclidean distance. However, Euclidean distance is only suitable for clusters with a spherical structure, and it considers neither the correlation between variables nor the differences in importance among variables when processing data [10]. It therefore has shortcomings when applied to highly correlated data and to fuzzy image segmentation. Mahalanobis distance is a distance-based similarity measure proposed by the Indian statistician P. C. Mahalanobis. It can be used to measure the degree of difference between two random variables that follow the same distribution with covariance matrix Σ. When the covariance matrix Σ is the identity matrix, the Mahalanobis distance reduces to the Euclidean distance. The Mahalanobis distance between two sample points $x$ and $y$ is shown in formula (4).
$$d_M(x, y) = \sqrt{(x - y)^{T} \Sigma^{-1} (x - y)} \quad (4)$$

Journal of Computer and Communications
Mahalanobis distance takes into account the internal relationship between sample attributes [11], can effectively describe the global relationship between two sample points, and contains more neighborhood and spatial information [12], which allows better analysis in big data processing and image segmentation.
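A minimal sketch of the Mahalanobis distance in plain Python is given below. It assumes that Σ is estimated as the sample covariance of the whole data set, which the text does not specify. It solves a linear system instead of forming $\Sigma^{-1}$ explicitly, and the helper names are illustrative.

```python
import math


def covariance(samples):
    """Sample covariance matrix (rows are samples, columns are attributes)."""
    n, dim = len(samples), len(samples[0])
    mean = [sum(col) / n for col in zip(*samples)]
    return [[sum((x[a] - mean[a]) * (x[b] - mean[b]) for x in samples) / (n - 1)
             for b in range(dim)] for a in range(dim)]


def solve(matrix, rhs):
    """Solve matrix * z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [r] for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (aug[r][n] - tail) / aug[r][r]
    return z


def mahalanobis(x, y, cov):
    """Mahalanobis distance between x and y for covariance matrix cov."""
    diff = [xi - yi for xi, yi in zip(x, y)]
    z = solve(cov, diff)  # z = cov^{-1} * diff without forming the inverse
    return math.sqrt(sum(d * zi for d, zi in zip(diff, z)))
```

With the identity matrix as `cov` the function returns the ordinary Euclidean distance, while a larger variance along one attribute shrinks the contribution of differences in that attribute. The linear solve on every call also hints at why this distance costs more to evaluate than the Euclidean one.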

Average Error
The K-means clustering algorithm usually uses the sum of squared clustering errors D to represent the clustering effect. It is the sum of the distances from each sample to the K cluster centers and is defined in formula (5).
Here $x_i$ represents the $i$th sample, of which there are N, and $m_j$ represents the $j$th clustering center, of which there are K. To make the values easier to compare, this paper uses the average error L to represent the clustering effect, which is defined in formula (6).
For the same data set, the smaller the value of L, the better the clustering effect.
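Formulas (5) and (6) are not reproduced in the text, so the following sketch rests on two assumptions: that D is the usual sum of squared Euclidean distances from each sample to its nearest clustering center, and that L is D divided by the number of samples N. Both function names are illustrative.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def clustering_error(samples, centers):
    """Assumed form of D: squared distance of each sample to its nearest center."""
    return sum(min(sq_dist(x, m) for m in centers) for x in samples)


def average_error(samples, centers):
    """Assumed form of L: the clustering error D averaged over the N samples."""
    return clustering_error(samples, centers) / len(samples)
```

For example, four points at (0, 0), (2, 0), (10, 0) and (12, 0) with centers (1, 0) and (11, 0) give D = 4 and L = 1 under these assumptions.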

Algorithm Steps
Steps of the fast global K-means clustering algorithm based on neighborhood screening and Mahalanobis distance:
Input: K: the number of clusters; D: a data set containing n objects.
Output: the sets and categories of the K clusters.
Method:
(1) Calculate the median value of all samples as the initial cluster center of the first cluster, and set s = 1.
(2) Calculate the distance $d_j$ from each sample point $x_i$ to its clustering center.
(3) Calculate the minimum radius R in set $D_i$, as shown in formula (3).
Experimental data: two-dimensional data sets and the Wine quality-red standard data set from UCI were selected for the experiment. Data source: http://archive.ics.uci.edu/ml/.

Simulation Result
In this paper, the running time of the algorithm and the average sum of squared errors are used as the evaluation criteria for the clustering effect.
Clustering experiments were carried out on the traditional K-means algorithm, FGK-means, RFGK-means and RMFGK-means. The Wine quality-red file contains 1600 rows including the header row, and each row has 12 attributes, of which the quality attribute can be used as the sample label. After the header row and the final quality attribute are removed, 1599 samples remain, and the 11 features of each sample are normalized for clustering. The number of cluster centers was set to 6, and the results were averaged over 10 clustering runs. The clustering results are shown in Table 2 (values rounded to three decimal places).
In this paper, the ratio of the number of correctly classified samples to the total number of samples is defined as the correct classification rate, which is used to test the clustering effect of RFGK-means and RMFGK-means.
According to the 6 quality levels of Wine quality-red, the original samples are divided into classes D1 to D6, and the clustering results are divided into classes DA1 to DA6. Each data sample in DA1 is matched against the samples in D1 to D6 in turn, and the number of identical samples is recorded. The fitting results of RFGK-means are shown in Table 3, and the similar set results in Table 4.
The fitting results and similar set results after RMFGK-means clustering are shown in Table 5 and Table 6 respectively.
Based on the above results, the clustering effects of RFGK-means and RMFGK-means are shown in Table 7 (values rounded to three decimal places).
According to the above data, the correct classification rate of samples obtained by RMFGK-means clustering is higher than that obtained by RFGK-means clustering.
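The correct classification rate can be sketched as below. The text does not spell out how each cluster DA1 to DA6 is paired with a class D1 to D6, so the sketch assumes that every cluster is matched to the class with which it shares the most samples. The function name and the label encodings are illustrative.

```python
from collections import Counter


def classification_rate(true_labels, cluster_labels):
    """Share of samples whose cluster maps to their own class (majority match)."""
    # For each cluster, count how many members carry each true label.
    table = {}
    for truth, cluster in zip(true_labels, cluster_labels):
        table.setdefault(cluster, Counter())[truth] += 1
    # Assumption: a cluster is matched to the class it overlaps with most.
    correct = sum(counts.most_common(1)[0][1] for counts in table.values())
    return correct / len(true_labels)
```

Under this matching rule two clusters may be assigned to the same class. A one-to-one assignment between clusters and classes would be stricter and could give a lower rate.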

Experimental Analysis
When traditional K-means is used for clustering, the clustering time and average error fluctuate greatly. Because the initial values are randomly selected, the clustering time is unstable and the clustering easily falls into a local optimum. The other three algorithms use a global method to find the initial clustering centers and output the clustering centers stably, so they obtain stable clustering results. RFGK-means and RMFGK-means are faster than FGK-means in the selection of initial clustering centers. Using Mahalanobis distance instead of Euclidean distance takes the global distribution of the data into account, which can improve the accuracy of clustering results on real data sets.

Conclusion
The fast global K-means algorithm based on neighborhood screening effectively shortens the time used for the initial value search and enhances the robustness of the algorithm, while its clustering effect is basically consistent with that of the fast global K-means algorithm. Using Mahalanobis distance instead of Euclidean distance in the clustering process takes the data as a whole into account, improves the noise resistance of the algorithm and improves the clustering accuracy. However, because Mahalanobis distance requires a large amount of calculation, the clustering time increases to some extent, which increases the total running time of the algorithm. The RMFGK-means algorithm shows its greatest advantage when clustering highly correlated data.