Hadoop-Based Similarity Computation System for Composed Documents

Universities produce a large number of composed documents in the teaching process, and most of them must be checked for similarity to detect duplication. This paper presents a similarity computation system for composed documents that contain both images and text. First, each document is split into two parts: its images and its text. The documents are then compared by computing the similarities of the images and of the text contents independently. Through the Hadoop system, the text contents are word-counted easily and quickly. Experimental results show that the proposed system is efficient and practical.


Introduction
Document similarity computation is a hot research topic in information retrieval, and it is a key issue in automatic document categorization, clustering analysis, fuzzy query, and question answering. At present, research aims mainly to improve accuracy and efficiency with approaches such as the method based on the vector space model [1] and the method based on the Map-Reduce model [2]. Cloud computing platforms with parallel processing ability, such as Hadoop, are recommended for processing large-scale document collections. However, universities hold a large number of composed documents containing many images, tables, and text. These documents may be copied or reworked by students in the teaching process, which creates extra work for teachers who must check for duplication. The computing methods above, however, handle only text, so they are not directly suitable for these composed documents. On the other hand, rewriting all the computing tasks within the Hadoop system would take much time. With the development of data integration, many existing software components are designed to be easily integrated for computation at the data level.
In this paper, we design an integrated system for composed document similarity computation based on the Hadoop platform and outer program interfaces. Our main work covers three aspects: 1) the design of the integrated system and its flow chart; 2) the adopted approaches, including document splitting, image similarity computation with an image processing method, and text similarity computation using the Map-Reduce computation model on the Hadoop platform; 3) experiments that demonstrate the effectiveness of the presented approach and system.

Integration System Design
An integrated system, based on data integration technology across several software components, is presented in Figure 1.

Document Splitting
The system is designed to process complicated documents that embed both images and text. For this reason, the widely used Microsoft Word format, *.doc, is chosen as the analysis target. All the images in a document are extracted automatically and stored in a folder in the operating file system.

Map-Reduce Technology
Map-Reduce is a framework, first presented by Google and used in Google's cloud computing platform, for processing huge datasets for certain kinds of distributable problems across a large number of computers. Many algorithms for huge datasets are built on the Map-Reduce computation model. Map-Reduce increases the computation performance of clusters composed of commodity PCs, and it overcomes the problem that a single PC cannot process a huge dataset with its limited processor and storage resources. In this framework, the processing of a huge dataset is divided into two steps, Map and Reduce; each step takes (key, value) pairs as input and generates (key, value) pairs as output. This technology is therefore adopted for text document similarity computation.
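The Map and Reduce steps described above can be illustrated with a minimal, single-process Java sketch: map emits (word, 1) pairs, a shuffle groups them by key, and reduce sums the values. The class and method names here are illustrative only; a real Hadoop job would subclass the Mapper and Reducer classes of the Hadoop API.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Single-process illustration of the Map-Reduce word-count model.
public class WordCountModel {

    // Map step: split one line of text into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : line.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(token, 1));
            }
        }
        return pairs;
    }

    // Shuffle + Reduce step: group the pairs by key and sum their values.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new HashMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : new String[] {"to be or not to be", "to see"}) {
            pairs.addAll(map(line));
        }
        System.out.println(reduce(pairs).get("to")); // 3
    }
}
```

In a real Hadoop deployment the map and reduce calls run on different nodes and the shuffle is performed by the framework, which is what gives the model its parallelism.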

Design of Similarity Computing Approach
After the document splitting process, its images and text information can be compared independently with other documents one by one.

Image Similarity Comparing Approach
In order to decrease the computation complexity, each image is reduced to a fingerprint for comparison, as shown in Figure 2. The procedure of computation is as follows:
1) Input one of the images in a document.
2) Input an image from another document.
3) Compute the fingerprints of the two images as in Figure 2.
4) Compute their similarity.
5) Go to step 2) for the inner loop.
6) Go to step 1) for the outer loop.
7) Output the computation results.
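The paper does not name the fingerprint algorithm behind Figure 2; as an assumption, the sketch below uses a common 64-bit average hash (scale the image to 8×8 grayscale and threshold each pixel at the mean) and measures similarity as the fraction of matching fingerprint bits.

```java
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;

// Illustrative image fingerprint comparison using an average hash.
// Assumption: the paper's fingerprint method is not specified; this is
// one common choice, used here only as a sketch.
public class ImageFingerprint {

    // Compute a 64-bit average-hash fingerprint of an image.
    static long fingerprint(BufferedImage img) {
        BufferedImage small = new BufferedImage(8, 8, BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D g = small.createGraphics();
        g.drawImage(img, 0, 0, 8, 8, null); // scale down to 8x8 grayscale
        g.dispose();
        int[] gray = new int[64];
        long sum = 0;
        for (int y = 0; y < 8; y++) {
            for (int x = 0; x < 8; x++) {
                gray[y * 8 + x] = small.getRaster().getSample(x, y, 0);
                sum += gray[y * 8 + x];
            }
        }
        long mean = sum / 64, hash = 0;
        for (int i = 0; i < 64; i++) {
            if (gray[i] >= mean) hash |= 1L << i; // one bit per pixel
        }
        return hash;
    }

    // Similarity = fraction of fingerprint bits that agree (1.0 = identical).
    static double similarity(long a, long b) {
        return 1.0 - Long.bitCount(a ^ b) / 64.0;
    }

    public static void main(String[] args) {
        // Synthetic test image: a horizontal gray gradient.
        BufferedImage img = new BufferedImage(64, 64, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < 64; y++)
            for (int x = 0; x < 64; x++)
                img.setRGB(x, y, (x * 4) << 16 | (x * 4) << 8 | (x * 4));
        long h = fingerprint(img);
        System.out.println(similarity(h, h)); // 1.0 (an image matches itself)
    }
}
```

Comparing 64-bit fingerprints instead of raw pixels is what makes the nested loop over all image pairs in the procedure above computationally feasible.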

Document Vector Space Model
The document vector space model is an effective model for document similarity computation. Its core idea is first to extract feature words $(T_1, T_2, \ldots, T_n)$. Each $T_i$ is assigned a weight according to its significance relative to the document, so that document $d_k$ can be represented by the $n$-dimensional vector $(W_{1,k}, W_{2,k}, \ldots, W_{n,k})$, which contains the weight of every feature word. The weight is computed from two parameters: 1) the term frequency (TF) $tf_{i,j}$, the number of times $T_i$ occurs in document $d_j$; 2) the inverse document frequency (IDF) $\log(N/n_i)$, where $N$ denotes the number of documents in the collection and $n_i$ denotes the number of documents that contain $T_i$. So $W_{i,j}$, the weight of $T_i$ in document $d_j$, can be computed according to the following formula:

$$W_{i,j} = tf_{i,j} \times \log(N/n_i) \qquad (1)$$

$W_{i,j}$ is also called the TF-IDF weight. Based on the document vector space model, the similarity of two documents can then be computed according to Equation (2).
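The TF-IDF weighting of Equation (1) can be sketched as follows; the corpus and feature words here are toy data chosen only for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of Equation (1): W(i,j) = tf(i,j) * log(N / n_i), where N is the
// number of documents and n_i is the number of documents containing term Ti.
public class TfIdf {

    static List<String> tokens(String doc) {
        return Arrays.asList(doc.toLowerCase().split("\\W+"));
    }

    // TF-IDF weight of every term in document docs[j].
    static Map<String, Double> weights(String[] docs, int j) {
        int N = docs.length;
        Map<String, Integer> tf = new HashMap<>();
        for (String t : tokens(docs[j])) tf.merge(t, 1, Integer::sum);
        Map<String, Double> w = new HashMap<>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            int ni = 0; // number of documents containing this term
            for (String d : docs) {
                if (tokens(d).contains(e.getKey())) ni++;
            }
            w.put(e.getKey(), e.getValue() * Math.log((double) N / ni));
        }
        return w;
    }

    public static void main(String[] args) {
        String[] docs = {"hadoop map reduce", "hadoop image text", "image text text"};
        // "hadoop" occurs once in doc 0 and in 2 of 3 docs: weight = 1 * log(3/2).
        System.out.printf("%.4f%n", weights(docs, 0).get("hadoop")); // 0.4055
    }
}
```

Terms that appear in every document get weight zero (since $\log(N/N) = 0$), which is why common stop words contribute nothing to the similarity.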

Text Similarity Computation Approach
Following the Map-Reduce model, all the text documents are input to the Hadoop system, which runs the WordCount program. The word-counting results are stored and fed back for similarity computation. The Map-Reduce computation model allows this processing to be performed in parallel.
Next, the output results are transferred to the similarity computing program. Here, all the text information is calculated and compared using Equation (2). The computation functions are implemented as shown in Figure 3.
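A minimal sketch of Equation (2), applied here to word-count vectors such as the WordCount output (in the full system the TF-IDF weights of Equation (1) would be used); the document contents below are toy data.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of Equation (2): cosine similarity between two documents
// represented as sparse term-weight vectors.
public class CosineSimilarity {

    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
            na += e.getValue() * e.getValue();
        }
        for (int v : b.values()) nb += v * v;
        if (na == 0 || nb == 0) return 0;
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // Word counts of a text, standing in for the Hadoop WordCount output.
    static Map<String, Integer> counts(String text) {
        Map<String, Integer> c = new HashMap<>();
        for (String t : text.toLowerCase().split("\\W+")) c.merge(t, 1, Integer::sum);
        return c;
    }

    public static void main(String[] args) {
        // Identical documents score 1.00; disjoint vocabularies score 0.00.
        System.out.printf("%.2f%n", cosine(counts("hadoop map reduce"), counts("hadoop map reduce")));
        System.out.printf("%.2f%n", cosine(counts("image text"), counts("map reduce")));
    }
}
```

Because the vectors are sparse maps, only terms that actually occur in either document contribute to the sums, which keeps pairwise comparison cheap even for a large vocabulary.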

Experimental Analysis
The *.doc document format is chosen. First, each document is split into two parts: a group of images and the text information. This process is implemented in C#. Then, the images of all documents are compared for similarity.

Document Analysis with Hadoop System
The text part separated from the original document is organized as a separate text file. Many such files are input to the Hadoop system for word counting by the Map-Reduce method. Hadoop version 1.2.1 is adopted under the CentOS 6.5 operating system. Three computing nodes and one managing node form the basic Hadoop computing platform. The WordCount program in the Hadoop system is applied to all the text files.

Image Similarity Comparison
The operation program is implemented in Java, as shown in Figure 4. After two images are chosen, their similarity is quickly computed and displayed as a number.

Text Similarity Computation
Ten documents are chosen for the Map-Reduce computation and comparison. After the WordCount process finishes, the separated words of each document are input to the next program for text similarity computation. The main operation interface is shown in Figure 5.

Conclusion
Through system integration technology, the similarity of composed documents containing images and text can be computed easily. In particular, for the text word-counting process, the Hadoop system with its Map-Reduce model is adopted. The image similarity and text similarity approaches show that the proposed integrated system is efficient and practical.

Figure 1. Flow-chart of composed document similarity analysis.

Figure 2. Process of computing the image fingerprint.
The similarity of documents $d_k$ and $d_l$ can be computed as the cosine $\cos(d_k, d_l)$ of the angle between their vectors. The value of $\cos(d_k, d_l)$ is proportional to the similarity: a smaller value means a lower similarity between $d_k$ and $d_l$, and vice versa. The domain of $\cos(d_k, d_l)$ is [0, 1], where 0 denotes that $d_k$ and $d_l$ are completely different and 1 denotes that they are the same. The computation formula of $\cos(d_k, d_l)$ is as follows:

$$\cos(d_k, d_l) = \frac{\sum_{i=1}^{n} W_{i,k} \, W_{i,l}}{\sqrt{\sum_{i=1}^{n} W_{i,k}^2} \, \sqrt{\sum_{i=1}^{n} W_{i,l}^2}} \qquad (2)$$

The results of the similarity computation are shown as a matrix in Figure 6. As the number of words used for text similarity increases, the computation time and the largest similarity value remain stable, as shown in Figure 7. This means that only a small number of the words output by Hadoop are necessary for text similarity computation.

Figure 4. Program operation result for image similarity.

Figure 5. Program operation process for text similarity.

Figure 6. Text similarity presented as a matrix.