Clusters Merging Method for Short Texts Clustering

Abstract

Driven by the mobile Internet, new social media such as microblogs, WeChat, and question answering systems are constantly emerging. They produce huge amounts of short text, which poses new challenges for text clustering. To cope with the large volume and dynamic growth of short texts, a two-stage clustering method is proposed. The method moves a sliding window over the stream of short texts. Within each window, hierarchical clustering is used; between windows, a cluster merging method based on information gain is applied. Experiments indicate that the method is fast and achieves higher accuracy.
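The two-stage idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the exact merge criterion, so the sketch assumes that two clusters are merged when keeping them separate would yield little information gain, measured as the entropy of the pooled term distribution minus the size-weighted entropies of the two clusters. The function names, the whitespace tokenizer, the cosine-based agglomeration inside a window, the non-overlapping windows, and the thresholds `min_sim` and `max_gain` are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (bits) of a term-frequency Counter."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def split_gain(a, b):
    """Information gained by keeping term counts a and b apart instead of pooled.

    Assumed criterion: H(a + b) minus the size-weighted mean of H(a) and H(b).
    It is 0 when both clusters have the same term distribution.
    """
    na, nb = sum(a.values()), sum(b.values())
    n = na + nb
    return entropy(a + b) - (na / n) * entropy(a) - (nb / n) * entropy(b)


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def cluster_window(texts, min_sim=0.3):
    """Stage 1: agglomerative (hierarchical) clustering of one window."""
    bags = [(t, Counter(t.lower().split())) for t in texts]
    clusters = [{"terms": bag, "docs": [t]} for t, bag in bags if bag]  # skip empty texts
    while len(clusters) > 1:
        # Find the most similar pair of clusters in this window.
        sim, i, j = max(
            (cosine(clusters[i]["terms"], clusters[j]["terms"]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if sim < min_sim:
            break  # no remaining pair is similar enough
        merged = {"terms": clusters[i]["terms"] + clusters[j]["terms"],
                  "docs": clusters[i]["docs"] + clusters[j]["docs"]}
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters


def merge_into(global_clusters, window_clusters, max_gain=0.5):
    """Stage 2: fold one window's clusters into the running cluster list."""
    for wc in window_clusters:
        best = min(global_clusters,
                   key=lambda g: split_gain(g["terms"], wc["terms"]),
                   default=None)
        if best is not None and split_gain(best["terms"], wc["terms"]) < max_gain:
            # Separating the two clusters gains little information: merge them.
            best["terms"] = best["terms"] + wc["terms"]
            best["docs"] = best["docs"] + wc["docs"]
        else:
            global_clusters.append(wc)
    return global_clusters


def cluster_stream(texts, window_size=4, min_sim=0.3, max_gain=0.5):
    """Process the text stream window by window (non-overlapping windows assumed)."""
    clusters = []
    for start in range(0, len(texts), window_size):
        window = texts[start:start + window_size]
        merge_into(clusters, cluster_window(window, min_sim), max_gain)
    return clusters


texts = [
    "cheap flight tickets to paris",
    "python list sort example",
    "paris flight tickets sale",
    "sort a python list",
    "book cheap paris flight",
    "python sort list of numbers",
]
for cluster in cluster_stream(texts):
    print(cluster["docs"])
```

Because each window is clustered on its own and only cluster-level term counts are compared across windows, the cost per new window does not depend on re-clustering all earlier texts, which is one plausible reading of why such a scheme suits a growing stream.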

Share and Cite:

Wang, Y., Wu, L. and Shao, H. (2014) Clusters Merging Method for Short Texts Clustering. Open Journal of Social Sciences, 2, 186-192. doi: 10.4236/jss.2014.29032.

Conflicts of Interest

The authors declare no conflicts of interest.

Copyright © 2024 by authors and Scientific Research Publishing Inc.

Creative Commons License

This work and the related PDF file are licensed under a Creative Commons Attribution 4.0 International License.