Short Text Classification Based on Improved ITC

Abstract

Long text classification has achieved considerable success, but short text classification still needs improvement. In this paper, we first explain why we choose the ITC feature selection algorithm over the conventional TFIDF and describe the advantages of ITC over TFIDF. We then summarize the flaws of the conventional ITC algorithm and present an improved ITC feature selection algorithm that is based on the characteristics of short text classification and combines the concepts of Documents Distribution Entropy and Position Distribution Weight. The improved ITC algorithm conforms to the actual situation of short text classification. The experimental results show that performance with the new algorithm was much better than with the traditional TFIDF and ITC.
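As a rough illustration of the baseline comparison, the sketch below contrasts conventional TFIDF with ITC as the two are commonly defined in the text categorization literature: both multiply a term-frequency factor by log(N/n) and apply cosine normalization, but ITC replaces the raw term frequency with log(tf + 1). This is an assumption about the baseline formulas, not the authors' code. The improved algorithm with Documents Distribution Entropy and Position Distribution Weight is not specified in the abstract and is not reproduced here. The function name `term_weights` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def term_weights(docs, scheme="itc"):
    """Return one {term: weight} dict per document, cosine-normalised.

    scheme="tfidf": weight = tf * log(N / n)
    scheme="itc":   weight = log(tf + 1) * log(N / n)
    (baseline definitions assumed from the literature, not from the paper)
    """
    n_docs = len(docs)
    # Document frequency n: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        tf = Counter(doc)
        raw = {}
        for term, freq in tf.items():
            idf = math.log(n_docs / df[term])
            # TFIDF uses the raw count; ITC damps it logarithmically.
            tf_part = math.log(freq + 1.0) if scheme == "itc" else float(freq)
            raw[term] = tf_part * idf
        # Cosine normalisation so each document vector has unit length.
        norm = math.sqrt(sum(w * w for w in raw.values()))
        weighted.append(
            {t: (w / norm if norm > 0 else 0.0) for t, w in raw.items()}
        )
    return weighted


# Toy short texts, already tokenised (hypothetical data).
docs = [
    ["cheap", "phone", "phone", "phone", "deal"],
    ["phone", "battery", "review"],
    ["football", "match", "review"],
]
tfidf = term_weights(docs, scheme="tfidf")
itc = term_weights(docs, scheme="itc")
print("TFIDF:", tfidf[0])
print("ITC:  ", itc[0])
```

In the first toy document the repeated term "phone" receives a smaller share of the weight under ITC than under TFIDF, because the logarithm damps repeated occurrences. In short texts most terms occur only once or twice, so the two schemes differ mainly on the few repeated terms.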

Share and Cite:

Li, L. and Qu, S. (2013) Short Text Classification Based on Improved ITC. Journal of Computer and Communications, 1, 22-27. doi: 10.4236/jcc.2013.14004.

Conflicts of Interest

The authors declare no conflicts of interest.

Copyright © 2024 by authors and Scientific Research Publishing Inc.

This work and the related PDF file are licensed under a Creative Commons Attribution 4.0 International License.