D-IMPACT: A Data Preprocessing Algorithm to Improve the Performance of Clustering

PP. 639-654
DOI: 10.4236/jsea.2014.78059

ABSTRACT

In this study, we propose a data preprocessing algorithm called D-IMPACT, inspired by the IMPACT clustering algorithm. D-IMPACT iteratively moves data points based on attraction and density to detect and remove noise and outliers and to separate clusters. Our experimental results on two-dimensional datasets and practical datasets show that the algorithm produces new datasets on which the performance of clustering algorithms is improved.
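The abstract does not state D-IMPACT's actual attraction or density formulas, so the following is only a rough sketch of the general idea it describes: drop low-density points as noise, then repeatedly pull the remaining points toward their denser neighbours so that clusters tighten and separate. The function names (`density`, `shrink_step`, `preprocess`), the neighbour-count density, the weighted-centroid move, and all parameter values are assumptions made for illustration and are not taken from the paper.

```python
import math


def density(points, i, radius):
    """Count the other points lying within `radius` of points[i].
    (Assumed density measure; the paper's definition may differ.)"""
    return sum(
        1 for j, q in enumerate(points)
        if j != i and math.dist(points[i], q) <= radius
    )


def shrink_step(points, radius, step):
    """Move every point a fraction `step` toward the density-weighted
    centroid of its neighbours (an assumed stand-in for 'attraction')."""
    dens = [density(points, i, radius) for i in range(len(points))]
    moved = []
    for i, p in enumerate(points):
        nbrs = [j for j, q in enumerate(points)
                if j != i and math.dist(p, q) <= radius]
        total = sum(dens[j] for j in nbrs)
        if total == 0:
            moved.append(p)  # no neighbours: nothing attracts this point
            continue
        centroid = tuple(
            sum(dens[j] * points[j][k] for j in nbrs) / total
            for k in range(len(p))
        )
        moved.append(tuple(a + step * (c - a) for a, c in zip(p, centroid)))
    return moved


def preprocess(points, radius=1.5, step=0.5, iterations=3, min_density=1):
    """Remove low-density points, then shrink the rest iteratively.
    All default values are hypothetical, chosen only for the toy data."""
    pts = [tuple(p) for p in points]
    # Treat points with fewer than `min_density` neighbours as noise.
    pts = [p for i, p in enumerate(pts)
           if density(pts, i, radius) >= min_density]
    for _ in range(iterations):
        pts = shrink_step(pts, radius, step)
    return pts


if __name__ == "__main__":
    data = [(0, 0), (1, 0), (0, 1),        # first group
            (10, 10), (11, 10), (10, 11),  # second group
            (50, 50)]                      # isolated point
    for p in preprocess(data):
        print(p)
```

On the toy data, the isolated point is discarded and each group contracts around its own centre, so the preprocessed set could then be passed to any clustering algorithm in place of the raw data.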

Conflicts of Interest

The authors declare no conflicts of interest.

Cite this paper

Tran, V., Hirose, O., Saethang, T., Nguyen, L., Dang, X., Le, T., Ngo, D., Sergey, G., Kubo, M., Yamada, Y. and Satou, K. (2014) D-IMPACT: A Data Preprocessing Algorithm to Improve the Performance of Clustering. Journal of Software Engineering and Applications, 7, 639-654. doi: 10.4236/jsea.2014.78059.

  
Copyright © 2018 by authors and Scientific Research Publishing Inc.

This work and the related PDF file are licensed under a Creative Commons Attribution 4.0 International License.