CDV Index: A Validity Index for Better Clustering Quality Measurement

Abstract

This paper presents a cluster validity index, the CDV index, which measures the quality of a clustering result for a data set. The CDV index is the product of three factors: a statistically calculated external diameter factor, a restorer factor that reduces the effect of data dimension, and a penalty factor tied to the number of clusters. The product is computed for a range of cluster counts, and the best clustering result is the one at the minimum of the resulting CDV curve. The experiments use the K-Means clustering method for its simplicity and execution speed. To show the effectiveness of the CDV index, several traditional cluster validity indexes were implemented as a control group: DI, DBI, ADI, and the PBM index, the most effective index of recent years. The data sets were selected to test how well the CDV index generalizes: three real-world data sets and three artificial data sets that simulate real-world data distributions. All six data sets are used to demonstrate the advantages of the CDV index.
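
The abstract gives the selection rule but not the CDV formula, so the sketch below only illustrates the rule: run K-Means for each candidate number of clusters, score each result with a validity index, and keep the setting with the minimum score. The Davies-Bouldin index (DBI), one of the comparison indexes named above and also minimized, stands in for CDV. The function names, the farthest-first seeding, and the toy data are all assumptions made for illustration and are not taken from the paper.

```python
import math


def farthest_first_init(points, k):
    """Deterministic seeding (an assumption, not from the paper): start at the
    first point, then repeatedly add the point farthest from all chosen seeds."""
    seeds = [points[0]]
    while len(seeds) < k:
        seeds.append(max(points, key=lambda p: min(math.dist(p, s) for s in seeds)))
    return seeds


def kmeans(points, k, max_iter=100):
    """Plain Lloyd-style K-Means on tuples of floats; returns (centroids, labels)."""
    centroids = farthest_first_init(points, k)
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points]
        # Update step: each centroid becomes the mean of its members.
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                updated.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                updated.append(centroids[j])  # keep an empty cluster's centroid
        if updated == centroids:
            break
        centroids = updated
    return centroids, labels


def davies_bouldin(points, centroids, labels):
    """Davies-Bouldin index (lower is better); a stand-in for the CDV index."""
    k = len(centroids)
    scatter = []
    for j in range(k):
        members = [p for p, lab in zip(points, labels) if lab == j]
        scatter.append(sum(math.dist(p, centroids[j]) for p in members) / len(members)
                       if members else 0.0)
    total = 0.0
    for i in range(k):
        ratios = []
        for j in range(k):
            if j != i:
                gap = math.dist(centroids[i], centroids[j])
                ratios.append((scatter[i] + scatter[j]) / gap if gap > 0 else math.inf)
        total += max(ratios)  # worst-case similarity of cluster i to any other
    return total / k


def select_k(points, k_values, index=davies_bouldin):
    """Score one K-Means run per candidate k and return (best_k, all_scores)."""
    scores = {}
    for k in k_values:
        centroids, labels = kmeans(points, k)
        scores[k] = index(points, centroids, labels)
    return min(scores, key=scores.get), scores


def three_blobs():
    """Toy data (hypothetical): three 3x3 grids of points around distant centers."""
    offsets = (-0.5, 0.0, 0.5)
    centers = [(0.0, 0.0), (10.0, 0.0), (5.0, 9.0)]
    return [(cx + dx, cy + dy) for cx, cy in centers for dx in offsets for dy in offsets]


if __name__ == "__main__":
    best, scores = select_k(three_blobs(), range(2, 6))
    print(best)  # 3 for this toy data
```

Any index that is minimized at the best clustering, CDV included, could be passed as the `index` argument in place of `davies_bouldin`.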

Share and Cite:

Yeh, J., Joung, F. and Lin, J. (2014) CDV Index: A Validity Index for Better Clustering Quality Measurement. Journal of Computer and Communications, 2, 163-171. doi: 10.4236/jcc.2014.24022.

Conflicts of Interest

The authors declare no conflicts of interest.

Copyright © 2024 by authors and Scientific Research Publishing Inc.

This work and the related PDF file are licensed under a Creative Commons Attribution 4.0 International License.