Estimating the number of data clusters via the contrast statistic

Yuriy Lyakh; Vitaliy Gurianov; Oleg Gorshkov; Yuriy Vihovanets

doi:10.4236/jbise.2012.52012

Journal of Biomedical Science and Engineering > Vol.5 No.2, February 2012

Department of Medical Biophysics, Medical Informatics and Biostatistics, National Medical University, Donetsk, Ukraine.

Abstract

A new method, the Contrast statistic, is proposed for estimating the number of clusters in a data set. The technique uses the output of a self-organising map (SOM) clustering algorithm and compares how the “Contrast” value changes with the number of clusters against the change expected under a uniform distribution. A simulation study shows that the Contrast statistic can be used successfully both when the variables describing an object in multi-dimensional space are independent (ideal objects) and when they are dependent (real biological objects).
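
The abstract does not define the Contrast value itself, so the sketch below illustrates only the general idea it describes: a cluster-quality curve computed on the data is compared with the same curve computed on uniformly distributed reference data. It is not the authors' method. As stand-ins, it assumes a plain k-means clustering instead of a SOM and the log within-cluster dispersion (as in the gap statistic) instead of Contrast. The selection rule, taking the largest gap, is also a simplification. All function names are hypothetical.

```python
import numpy as np


def _init_centers(X, k, rng):
    """Farthest-point initialisation: random first centre, then the point
    farthest from all centres chosen so far."""
    centers = [X[rng.integers(len(X))]]
    for _ in range(1, k):
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    return np.array(centers, dtype=float)


def dispersion(X, k, rng, n_init=3, n_iter=100):
    """Smallest within-cluster sum of squares over n_init k-means runs."""
    best = np.inf
    for _ in range(n_init):
        centers = _init_centers(X, k, rng)
        for _ in range(n_iter):
            # Squared distance of every point to every centre.
            d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = d.argmin(axis=1)
            new = centers.copy()
            for j in range(k):
                if np.any(labels == j):  # keep the old centre if a cluster empties
                    new[j] = X[labels == j].mean(axis=0)
            if np.allclose(new, centers):
                break
            centers = new
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        best = min(best, d.min(axis=1).sum())
    return best


def gap_curve(X, k_max=6, n_ref=5, seed=0):
    """Gap between the reference and observed log-dispersion for k = 1..k_max.

    Reference data are drawn uniformly over the bounding box of X.
    """
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    gaps = []
    for k in range(1, k_max + 1):
        observed = np.log(dispersion(X, k, rng))
        reference = [
            np.log(dispersion(rng.uniform(lo, hi, size=X.shape), k, rng))
            for _ in range(n_ref)
        ]
        gaps.append(np.mean(reference) - observed)
    return np.array(gaps)


def estimate_k(gaps):
    """Simplified rule (an assumption of this sketch): the k with the largest gap."""
    return int(np.argmax(gaps)) + 1


# Usage: three well-separated synthetic groups in two dimensions.
data_rng = np.random.default_rng(1)
group_centers = np.array([[0.0, 0.0], [10.0, 0.0], [5.0, 9.0]])
X = np.vstack([c + 0.5 * data_rng.standard_normal((50, 2)) for c in group_centers])
print(estimate_k(gap_curve(X, k_max=5)))
```

Swapping in a different quality measure only requires replacing `dispersion`; the comparison against the uniform reference stays the same.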

Keywords

SOM Neural Network; Clustering; Gap Statistic; Silhouette Statistic

Share and Cite:

Lyakh, Y., Gurianov, V., Gorshkov, O. and Vihovanets, Y. (2012) Estimating the number of data clusters via the contrast statistic. Journal of Biomedical Science and Engineering, 5, 95-99. doi:10.4236/jbise.2012.52012

Conflicts of Interest

The authors declare no conflicts of interest.

