Estimating the number of data clusters via the contrast statistic

Abstract

A new method (the Contrast statistic) for estimating the number of clusters in a set of data is proposed. The technique uses the output of a self-organising map clustering algorithm, comparing the change in the dependency of the “Contrast” value on the number of clusters with that expected under a uniform distribution. A simulation study shows that the Contrast statistic can be used successfully both when the variables describing an object in a multi-dimensional space are independent (ideal objects) and when they are dependent (real biological objects).

Share and Cite:

Lyakh, Y., Gurianov, V., Gorshkov, O. and Vihovanets, Y. (2012) Estimating the number of data clusters via the contrast statistic. Journal of Biomedical Science and Engineering, 5, 95-99. doi: 10.4236/jbise.2012.52012.

Conflicts of Interest

The authors declare no conflicts of interest.

Copyright © 2024 by authors and Scientific Research Publishing Inc.

Creative Commons License

This work and the related PDF file are licensed under a Creative Commons Attribution 4.0 International License.