A new method (the Contrast statistic) for estimating the number of clusters in a data set is proposed. The technique uses the output of the self-organising map clustering algorithm, comparing the dependence of the “Contrast” value on the number of clusters with that expected under a uniform distribution. A simulation study shows that the Contrast statistic can be used successfully both when the variables describing an object in a multi-dimensional space are independent (ideal objects) and when they are dependent (real biological objects).

The first stage of data analysis is its presentation. Cluster analysis can be used to analyse multi-dimensional data sets [

There have been many methods proposed for estimating the number of clusters: gap statistic [

Nowadays this method and similar ones are widely used to solve various biomedical problems. Thus, in paper [

In many biomedical problems we have to deal with very large data sets [

When solving biological and especially medico-biological problems, we often need to determine not only the optimal number of clusters that characterise a given pathology, but also the number of independent factor attributes, i.e. to reduce the dimensionality of the space. One of the reasons for this is the high cost of medical care.

Identifying strongly correlated attributes makes it possible to reduce the number of attributes needed for the analysis, and hence the number of medico-biological parameters that characterise a given pathology. The methods enumerated above do not solve this problem.

In this paper we propose a new method, the Contrast statistic, that allows the optimal number of clusters to be estimated effectively and also makes it possible to estimate the number of independent factor attributes. Since many methods of estimating the optimal number of clusters [9,13] are based on the Gap statistic and the Silhouette statistic, which are now considered classical, we naturally used these two methods for the comparative analysis with our method. The second section describes the method. The third presents the results of its application.

Let X be a set of N points in an m-dimensional data space. The data are distributed in k clusters (O^{1}, O^{2}, …, O^{k} are the centres of these clusters). We examine a point C that belongs to the cluster O^{0}. We then define the Contrast of C (Cr) by Eq. 1

where the first distance is the Euclidean distance from C to the centre of its own cluster, and the second is the Euclidean distance from C to the centre of the nearest other cluster. Points with a large Cr are well clustered, whereas those with a small Cr tend to lie between clusters. We then characterise the quality of the division into k clusters by Eq. 2 (the Contrast index)

Intuitively, when points concentrate around the cluster centres, Contrast (k) is high; when points are distributed uniformly, it is low. This allows us to judge the efficiency of the division into the given number of clusters.
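The two distances that enter the Contrast can be computed directly from the cluster centres. The sketch below is only an illustration: Eq. 1 and Eq. 2 are not reproduced in this text, so the normalised difference used in `contrast_point` and the plain average used in `contrast_index` are assumed stand-ins, and all function names are hypothetical. It also assumes that a point's own cluster is the one with the nearest centre, and that at least two centres are given.

```python
import math


def euclid(p, q):
    """Euclidean distance between two points given as sequences of floats."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def centre_distances(point, centres):
    """Return (d_own, d_other): distance to the nearest centre (taken as the
    point's own cluster) and to the nearest centre besides its own."""
    if len(centres) < 2:
        raise ValueError("at least two cluster centres are required")
    d = sorted(euclid(point, c) for c in centres)
    return d[0], d[1]


def contrast_point(point, centres):
    # ASSUMPTION: stand-in for Eq. 1, not the paper's formula.
    # Close to 1 near the own centre, 0 midway between two centres.
    d_own, d_other = centre_distances(point, centres)
    total = d_own + d_other
    return 0.0 if total == 0.0 else (d_other - d_own) / total


def contrast_index(points, centres):
    # ASSUMPTION: stand-in for Eq. 2, the mean per-point value.
    return sum(contrast_point(p, centres) for p in points) / len(points)
```

With centres at (0, 0) and (10, 0), a point at (1, 0) scores high under this stand-in, whereas a point at (5, 0), midway between the centres, scores zero, which matches the qualitative behaviour described above.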

We generated datasets of 10,000 uniformly distributed points in m-dimensional space. We then divided the data into k clusters using the SOM technique and calculated Contrast (k) for the division.
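A minimal sketch of this simulation set-up follows, assuming a one-dimensional chain of k SOM nodes with a Gaussian neighbourhood and linearly decaying learning rate and radius. The source does not state the SOM topology or training schedule, so these choices, the function names and the default parameters are all illustrative assumptions.

```python
import math
import random


def uniform_points(n, m, seed=0):
    """n points drawn uniformly from the unit cube in m dimensions."""
    rng = random.Random(seed)
    return [[rng.random() for _ in range(m)] for _ in range(n)]


def nearest_node(p, nodes):
    """Index of the best-matching unit (closest node) for point p."""
    return min(range(len(nodes)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(p, nodes[i])))


def train_som(points, k, epochs=5, seed=0):
    """Train a 1-D chain of k nodes; returns the node (cluster centre) vectors.
    ASSUMPTION: topology and decay schedule are not taken from the paper."""
    rng = random.Random(seed)
    dim = len(points[0])
    nodes = [list(rng.choice(points)) for _ in range(k)]
    total = epochs * len(points)
    step = 0
    for _ in range(epochs):
        order = points[:]
        rng.shuffle(order)
        for p in order:
            frac = step / total
            lr = 0.5 * (1.0 - frac) + 0.01             # learning rate decays to 0.01
            radius = max(k / 2.0 * (1.0 - frac), 0.5)  # neighbourhood shrinks
            bmu = nearest_node(p, nodes)
            for i in range(k):
                h = math.exp(-((i - bmu) ** 2) / (2.0 * radius ** 2))
                for j in range(dim):
                    nodes[i][j] += lr * h * (p[j] - nodes[i][j])
            step += 1
    return nodes


def assign(points, nodes):
    """Cluster label of each point: the index of its best-matching unit."""
    return [nearest_node(p, nodes) for p in points]
```

For example, `train_som(uniform_points(10000, 3), k=5)` would give five centres for one (m, k) setting; repeating this over a range of k yields the Contrast (k) curve for the uniform case.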

The following conclusions can be formulated from this analysis (

1) For uniformly distributed points (provided the number of clusters is not large), the dependence of the Contrast index on the number of clusters is described by Eq. 3:

where a and l are constants.

2) l (Eq. 3) is positive and decreases as the dimension of the space m increases.

3) When the number of clusters is large (for the given dimension of the space m), the Contrast index does not depend on k (Contrast = Const).

The following algorithm for computing the optimal number of clusters can be proposed:

1) Cluster the observed data, varying the total number of clusters k = 1, 2, …, and calculating Contrast (k) for each.

2) Obtain the regression line (ln(Contrast) vs ln(k)) that corresponds to the uniform distribution, using the large k values.

3) Estimate the dimension of the space from the slope of the line.

4) Finally, choose as the optimal number of clusters the k_{opt} for which Contrast (k_{opt}) deviates most from the regression line.
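Steps 2 to 4 can be sketched as follows, given Contrast values already computed for a list of k. This is an illustration under assumptions: the number of large-k points used for the fit (`tail`) is a hypothetical parameter, the deviation is taken as the signed upward distance from the line in log space, and the mapping from slope to dimension is not given in this text, so the slope is simply returned.

```python
import math


def fit_loglog(ks, values):
    """Least-squares line ln(value) = intercept + slope * ln(k)."""
    xs = [math.log(k) for k in ks]
    ys = [math.log(v) for v in values]  # values must be strictly positive
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def optimal_k(ks, values, tail=5):
    """Return (k_opt, slope).
    ASSUMPTION: the line is fitted to the last `tail` (largest) k values,
    and k_opt is the k with the largest upward deviation from that line."""
    slope, intercept = fit_loglog(ks[-tail:], values[-tail:])
    deviations = [math.log(v) - (intercept + slope * math.log(k))
                  for k, v in zip(ks, values)]
    best = max(range(len(ks)), key=lambda i: deviations[i])
    return ks[best], slope
```

If the data are expected to deviate from the uniform line in either direction, the absolute deviation could be used instead; the text does not specify which is meant.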

There have been many methods proposed for estimating the optimal number of clusters: Gap statistic [

demonstrated. For application of this approach in the case of k-means the value is calculated by Eq.4,

where the averaged quantity is the Euclidean distance from an object to the centre of its cluster, taken over the objects of the i-th cluster of the distribution under analysis (the reference distribution contains the same number of objects as the distribution under analysis). In [

While computing Silhouette statistics [

where a(j) is the average Euclidean distance from object j to the other objects of the same cluster, and b(j) is the average Euclidean distance from the object to the objects of the nearest cluster to which it does not belong. The optimal number of clusters k is taken as the one for which the average of s(j) over all objects is maximal.
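For concreteness, the sketch below computes a(j) and b(j) as defined above and combines them in the standard silhouette form s(j) = (b(j) - a(j)) / max(a(j), b(j)). The formula is the commonly used definition and is assumed here to be the one intended. Function names are illustrative, singleton clusters are given s(j) = 0 by convention, and at least two clusters are required.

```python
import math


def euclid(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


def silhouette_values(points, labels):
    """Per-object silhouette s(j); requires at least two distinct labels."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("at least two clusters are required")
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a(j): mean distance to the other objects of the same cluster
        # (the distance of p to itself is 0, hence the divisor len(own) - 1)
        a = sum(euclid(p, q) for q in own) / (len(own) - 1)
        # b(j): smallest mean distance to the objects of any other cluster
        b = min(sum(euclid(p, q) for q in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        denom = max(a, b)
        scores.append(0.0 if denom == 0.0 else (b - a) / denom)
    return scores


def mean_silhouette(points, labels):
    """Average s(j) over all objects; maximised over k to choose k."""
    s = silhouette_values(points, labels)
    return sum(s) / len(s)
```

Every object is compared with every other object, so the cost grows quadratically with the number of objects, whereas a statistic based only on distances to cluster centres needs one pass over the data per centre.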

Below we present a comparative analysis of the Contrast statistic, the Gap statistic and the Silhouette statistic as applied to selecting the optimal number of clusters.

To estimate the efficiency of the Contrast statistic we analysed several datasets:

1) Datasets simulated in a multi-dimensional space (independent variables), with small and large numbers of points.

2) Datasets simulated in a multi-dimensional space (dependent variables), with small and large numbers of points.

3) Practical datasets (real medical and biological data, dependent variables), with small and large numbers of points.

We applied three different methods for estimating the optimal number of clusters: Gap statistic, Silhouette statistic, Contrast statistic.

The following results can be formulated:

1) For independent variables the Gap statistic is an efficient method of calculating the optimal number of clusters. The Silhouette statistic and the Contrast statistic give similar results.

2) For dependent variables the Gap statistic is not efficient: generating the reference distribution is problematic because the true dimension of the space is unknown. Both the Silhouette statistic and the Contrast statistic can be used efficiently.

3) The computation time of the Silhouette statistic grows as the square of the cluster size, whereas that of the Contrast statistic grows linearly with the cluster size.

4) In addition, the Contrast statistic allows the dimension of the space to be estimated for dependent variables (from the slope of the regression line fitted to the large k values). A shortcoming is that this estimate is possible only for a small number of independent variables (m < 12).

There are measurement data for 150 iris specimens, in equal parts (50 specimens each) belonging to three species (Iris setosa, Iris versicolor, Iris virginica) [

The analysis performed for this model task indicates that the Gap statistic, the Silhouette statistic and the Contrast statistic all give an optimal number of clusters equal to four.

A substantive analysis of each of the separated clusters clearly demonstrates the biological relevance of the solution obtained. In addition to the optimal number of clusters, the dependence of the Contrast index on k suggests that the number of independent variables in which these objects are distributed is close to 3.

The Contrast statistic method was used to analyse data from the National Insulin-treated Diabetics Register [

The difference in diabetes mortality between the separated clusters is statistically significant at the level p < 0.05 (chi-square test). The difference in the occurrence of disability between these clusters is also significant at p < 0.05. This allows us to conclude that the cluster separation in the 135-dimensional item space is not merely formal but meaningful in content. Following the analysis results, the “less serious cases” fell into the 1^{st} cluster and the “most serious” into the 3^{rd} cluster. The 2^{nd} cluster contained moderate cases [

A new technique, the Contrast statistic, is proposed for estimating the number of clusters in data. The method can be used efficiently for real, large medical and biological datasets. The Contrast statistic additionally allows the dimension of the space to be estimated for dependent variables (only when the dimensionality of the space is small).