Document Clustering Using Semantic Cliques Aggregation

Abstract

Search engines are indispensable tools for finding information among massive collections of web pages and documents. A good search engine must not only retrieve information quickly but also return results that are relevant to the user's query. Most search engines respond quickly; however, they offer little guarantee of precision, even for highly detailed queries. In such cases, clustering documents by subject and content might improve search results. This paper presents a novel method of document clustering that uses semantic cliques. First, we extracted features from the documents. Next, we defined the associations between frequently co-occurring terms, which we call semantic cliques. Each connected component of the semantic cliques represents a theme. The documents were then clustered by theme using an aggregation algorithm that we designed for this purpose. We evaluated the effectiveness of the aggregation algorithm on four kinds of datasets. The results showed that the semantic clique based document clustering algorithm performed significantly better than traditional clustering algorithms such as Principal Direction Divisive Partitioning (PDDP), k-means, AutoClass, and Hierarchical Agglomerative Clustering (HAC). We found that Semantic Clique Aggregation is a promising model for representing association rules in text and could be highly useful for automatic document clustering.
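
The pipeline outlined in the abstract (co-occurring terms, connected components as themes, documents grouped by theme) can be illustrated with a minimal Python sketch. This is not the authors' aggregation algorithm: it assumes documents are already reduced to term lists, treats two terms as associated when they co-occur in at least `min_support` documents, and assigns each document to the theme with which it shares the most terms. The function names, the `min_support` parameter, and the toy data are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def term_themes(docs, min_support=2):
    """Group terms into themes: connected components of a co-occurrence graph.

    Assumption: a simple document-level support threshold stands in for the
    paper's notion of frequently co-occurring terms.
    """
    # Count in how many documents each unordered term pair co-occurs.
    pair_counts = defaultdict(int)
    for terms in docs:
        for a, b in combinations(sorted(set(terms)), 2):
            pair_counts[(a, b)] += 1

    # Keep only pairs that co-occur in at least min_support documents.
    graph = defaultdict(set)
    for (a, b), count in pair_counts.items():
        if count >= min_support:
            graph[a].add(b)
            graph[b].add(a)

    # Depth-first search to collect the connected components.
    themes, seen = [], set()
    for start in sorted(graph):
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        themes.append(component)
    return themes


def assign_documents(docs, themes):
    """Label each document with the index of the theme sharing the most terms.

    Documents that share no term with any theme are labeled None.
    """
    labels = []
    for terms in docs:
        overlaps = [len(set(terms) & theme) for theme in themes]
        best = max(range(len(themes)), key=overlaps.__getitem__, default=None)
        labels.append(best if best is not None and overlaps[best] > 0 else None)
    return labels


# Hypothetical toy corpus: each document is a list of extracted terms.
docs = [
    ["gene", "protein", "cell"],
    ["gene", "protein", "dna"],
    ["stock", "market", "trade"],
    ["stock", "market", "price"],
]
themes = term_themes(docs)
labels = assign_documents(docs, themes)
print(themes)
print(labels)
```

On this toy corpus, only the pairs (gene, protein) and (market, stock) reach the support threshold, so two themes emerge and the four documents split into two clusters. A fuller treatment would need the paper's own definitions of features, cliques, and aggregation.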

Share and Cite:

Kumar, A. and Chiang, I. (2015) Document Clustering Using Semantic Cliques Aggregation. Journal of Computer and Communications, 3, 28-40. doi: 10.4236/jcc.2015.312004.

Conflicts of Interest

The authors declare no conflicts of interest.

Copyright © 2023 by authors and Scientific Research Publishing Inc.

This work and the related PDF file are licensed under a Creative Commons Attribution 4.0 International License.