Hadoop-Based Similarity Computation System for Composed Documents


Universities accumulate a large number of composed documents in the course of teaching, and most of them must be checked for similarity as part of validation. A similarity computation system is constructed for composed documents that contain both images and text. First, each document is split into two parts: its images and its text. The documents are then compared by computing the similarity of their images and of their text independently. With the Hadoop system, the text contents are separated easily and quickly. Experimental results show that the proposed system is efficient and practical.
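The text half of such a comparison can be illustrated with a small sketch. The abstract does not name the similarity measure or describe the Hadoop job layout, so everything below is an assumption made for illustration: cosine similarity over term-frequency vectors stands in for the text measure, and the hypothetical functions `map_phase`, `reduce_phase`, `term_vectors`, and `cosine_similarity` only imitate a map and reduce step in a single process rather than running on a cluster.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def map_phase(doc_id, text):
    """Emit ((doc_id, term), 1) pairs, in the style of a word-count mapper."""
    for term in tokenize(text):
        yield (doc_id, term), 1


def reduce_phase(pairs):
    """Sum the emitted counts into one term-frequency vector per document."""
    vectors = {}
    for (doc_id, term), count in pairs:
        vectors.setdefault(doc_id, Counter())[term] += count
    return vectors


def term_vectors(docs):
    """Run the in-process map and reduce steps over {doc_id: text}."""
    pairs = (pair for doc_id, text in docs.items()
             for pair in map_phase(doc_id, text))
    return reduce_phase(pairs)


def cosine_similarity(a, b):
    """Cosine similarity of two term-frequency vectors (assumed measure)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


# Hypothetical text parts, as if already separated from their images.
docs = {
    "report_a": "Hadoop splits the text of each composed document",
    "report_b": "Each composed document is split by Hadoop into text",
    "report_c": "An unrelated essay about campus life",
}
vectors = term_vectors(docs)
print(cosine_similarity(vectors["report_a"], vectors["report_b"]))
print(cosine_similarity(vectors["report_a"], vectors["report_c"]))
```

Under these assumptions, a score near 1 indicates near-identical wording and a score of 0 indicates no shared terms. An image similarity score would be computed separately, in line with the independent comparison the abstract describes.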

Share and Cite:

Zhang, X., Qin, Z., Liu, X., Hou, Q., Zhang, B. and Wu, J. (2015) Hadoop-Based Similarity Computation System for Composed Documents. Journal of Computer and Communications, 3, 196-202. doi: 10.4236/jcc.2015.35025.

Conflicts of Interest

The authors declare no conflicts of interest.



Copyright © 2020 by authors and Scientific Research Publishing Inc.

Creative Commons License

This work and the related PDF file are licensed under a Creative Commons Attribution 4.0 International License.