Application of Full Text Search Engine Based on Lucene

This paper introduces a full-text search engine based on Lucene and the underlying full-text retrieval technology, including indexing and system architecture. It compares the response time of Lucene full-text search with that of String-based retrieval; the experimental results show that Lucene full-text search is faster.


Introduction
With the rapid development of the Internet and the explosive growth of Web information, how Internet users can discard the irrelevant, retain the essential, and quickly and easily obtain the information they need from a vast ocean of information has become a hot research topic in this field.
The core of information search is full-text retrieval technology. Full-text retrieval provides a tool that retrieves information according to the content of the data rather than its external features, taking a variety of computer data such as text, sound, and images as its processing objects [1]. It builds an index of all the terms that network users may search for, helps people manage and organize extensive information, and enables users to quickly and easily retrieve the information they need. Lucene is a mature, free, open-source software project written in pure Java. In recent years, Lucene has become one of the most highly praised and most popular information retrieval libraries.

Build a Text Database
First, we build a text database that stores all the information users may retrieve, and then determine the text model of the retrieval system. The model should be identifiable and have a low degree of redundancy [2]. Once the model is fixed, it should not be changed further.

Create Indexing
Create an index from the text in the database according to the model. Indexing can greatly improve the speed of information retrieval. Which indexing approach to use depends on the scale of the information retrieval system. Large-scale information retrieval systems such as Google and Baidu use inverted indexes.
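To make the idea of an inverted index concrete, the following is a minimal sketch in plain Java. It is not Lucene's API: the class name InvertedIndexSketch and its methods are hypothetical, and the tokenization rule (lowercase, split on anything that is not a letter or digit) is an assumption chosen for brevity.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Minimal inverted index: maps each term to the sorted set of IDs of the
// documents that contain it. Names are illustrative, not Lucene's API.
final class InvertedIndexSketch {
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Split the text into lowercase terms and record docId under each term.
    public void add(int docId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (term.isEmpty()) {
                continue;
            }
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    // Look up one term; returns an empty set if the term was never indexed.
    public Set<Integer> search(String term) {
        Set<Integer> hits = postings.get(term.toLowerCase(Locale.ROOT));
        return hits == null ? Collections.<Integer>emptySet() : hits;
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(1, "Lucene is a full-text search toolkit");
        index.add(2, "An inverted index speeds up search");
        System.out.println(index.search("search")); // prints [1, 2]
        System.out.println(index.search("lucene")); // prints [1]
    }
}
```

A query then costs one map lookup per term instead of a scan over every document, which is why the approach scales to large collections.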

Search
After the documents are indexed, searching can begin. The user submits a search request; the information retrieval system preprocesses it, searches the index, and returns the matching information to the user.

Filter and Sort the Results
After the information retrieval system finds the information the user needs, it filters or sorts the results according to certain rules and then returns the relevant information to the user [3].

Full Text Search Engine of Lucene
Lucene is an open-source full-text search engine toolkit and one of the Jakarta projects of the Apache Software Foundation. It is not a complete full-text search engine [4] but a full-text search engine framework that provides a complete query engine, a text indexing engine, and part of a text analysis engine. It also provides a simple but powerful API so that developers can conveniently and quickly build search engines.

System Structure of Lucene
Lucene is an excellent full-text search engine toolkit, and its structure is strongly object-oriented. The Lucene source package has seven modules; the five main modules are as follows [5]:
1) org.apache.lucene.analysis (Analyzer): Its primary role is to segment the document and remove stop words, which occur very frequently but do not help retrieval, such as "and" and "ah". It further separates out meaningful search terms such as Chinese phrases, English words, and e-mail addresses. Lucene provides analyzers such as SimpleAnalyzer and StandardAnalyzer.
2) org.apache.lucene.document (Document Management): A Document is similar to a record in a relational database. This package is mainly responsible for managing fields, which are divided into text fields and date fields.
3) org.apache.lucene.index (Indexing Management): This covers creating the index and inserting and deleting records. The indexing package is the core of the information retrieval system. The purpose of full-text search is to build an index from the segmented terms so that a query searches only the indexed terms rather than scanning the full text, which greatly improves retrieval efficiency.
4) org.apache.lucene.search (Search Management): This obtains the retrieval results according to the query.
5) org.apache.lucene.queryParser (QueryParser): This parses the user's query and passes it to the searcher.
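The role of the analysis module, segmenting text and dropping stop words, can be illustrated with a small plain-Java sketch. It does not use Lucene's Analyzer classes; the class name StopWordAnalyzerSketch and the stop list are assumptions made for the example. Unlike StandardAnalyzer, this simple splitter would break an e-mail address into pieces, and it does not segment Chinese text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustrative analyzer: lowercases the text, splits it on characters that
// are not letters or digits, and drops stop words.
// Not Lucene's Analyzer API; the stop list is an example only.
final class StopWordAnalyzerSketch {
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "ah", "the", "is", "of"));

    public static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                terms.add(token);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(analyze("Lucene is a toolkit and the index is the core"));
        // prints [lucene, toolkit, index, core]
    }
}
```

Only the surviving terms would be passed on to the indexing module, which keeps high-frequency, low-value words out of the index.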

The Indexing of Lucene
In Lucene, an index is composed of segments, a segment is made up of documents, a document is composed of fields, and a field is composed of terms. The indexing process of Lucene starts from the addDocument method of IndexWriter, as shown in Figure 1 [6].
Figure 1 introduces a new class, DocumentWriter. In the Lucene API, the main role of IndexWriter is to add documents to the index, and it provides the main interface for indexing. The actual writing of the index, however, is done by DocumentWriter. Splitting the data source, computing the frequency and position of keywords, and writing the index are the most complicated operations in Lucene, and all of them take place in the DocumentWriter class.
Besides adding documents to the index, Lucene also checks certain conditions on the index and then merges index segments where appropriate [7].
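The containment hierarchy described in this section (index, segment, document, field, term) can be pictured with a small data model. The sketch below is hypothetical plain Java and does not reflect Lucene's on-disk format or classes; in particular, merge simply concatenates the documents of two segments, whereas a real merge rewrites index files.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical in-memory model of the hierarchy
// index > segment > document > field > term.
final class IndexModelSketch {
    static final class Field {
        final String name;
        final List<String> terms;
        Field(String name, String... terms) {
            this.name = name;
            this.terms = Arrays.asList(terms);
        }
    }
    static final class Document { final List<Field> fields = new ArrayList<>(); }
    static final class Segment { final List<Document> documents = new ArrayList<>(); }
    static final class Index { final List<Segment> segments = new ArrayList<>(); }

    // Simplified merge: the new segment holds the documents of both inputs.
    public static Segment merge(Segment a, Segment b) {
        Segment merged = new Segment();
        merged.documents.addAll(a.documents);
        merged.documents.addAll(b.documents);
        return merged;
    }

    // Count all terms by walking the hierarchy from the top down.
    public static int countTerms(Index index) {
        int total = 0;
        for (Segment s : index.segments) {
            for (Document d : s.documents) {
                for (Field f : d.fields) {
                    total += f.terms.size();
                }
            }
        }
        return total;
    }

    public static void main(String[] args) {
        Document doc = new Document();
        doc.fields.add(new Field("title", "lucene", "search"));
        doc.fields.add(new Field("body", "full", "text", "retrieval"));
        Segment first = new Segment();
        first.documents.add(doc);
        Segment second = new Segment();
        second.documents.add(new Document());
        Index index = new Index();
        index.segments.add(merge(first, second));
        System.out.println(countTerms(index)); // prints 5
    }
}
```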

Examples of Lucene Retrieval Application
Lucene full-text search consists mainly of three modules: analysis, indexing, and searching. The analysis module preprocesses document information; the indexing module's principal role is to speed up retrieval; the searching module mainly handles interaction with users [8].

Figure 1. Indexing process of Lucene. Flowchart steps: IndexWriter:addDocument(); create a DocumentWriter object; name the segment; call the addDocument method of DocumentWriter to add the document; save the segment information; if there is more than one segment, determine whether to merge, and merge if so.

System Implementation
This paper used the Lucene toolkit to simulate retrieval over two documents in the Eclipse development environment [9]. The Lucene development kit is LUCENE-CORE-2.0.0.JAR and the word segmentation tool is JE-ANALYSIS-1.4.0.JAR. The setup requires a Java runtime environment of JDK 1.6 or later, and the JAR packages must be imported into Eclipse [10].
1) Preprocessing Module: Before using Lucene, the prepared text documents must be preprocessed. The main role of preprocessing is to convert full-width characters into half-width characters. To better demonstrate the use of Lucene, this paper also divides the large documents into small documents and assigns a unique ID number to each. The main code, with only the body of preprocess shown, is as follows:

    public class FilePreprocess {
        public static void preprocess(File file1, String outputDir) {
            splitToSmall(characterProcess(file1, outputDir + "output.all"), outputDir);
        }
        public static File characterProcess(File file1, String destFile)
        private static String replace(String line)
        public static void splitToSmall(File file1, String outputpath)
    }

The replace method creates a HashMap that maps full-width characters to their half-width equivalents and then traverses it; any full-width characters found in the line are replaced, and the converted line is returned. The characterProcess method converts full-width characters into half-width characters and returns the new file. The preprocess method calls characterProcess to perform this replacement and passes the returned file, named "output.all" and stored in outputDir, as the first parameter of splitToSmall.
The splitToSmall method then divides the new file into several small files and stores them in the outputDir directory.
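The full-width to half-width replacement step can be sketched as follows. This is not the authors' code: the class name FullWidthSketch is hypothetical, and the contents of the HashMap are an assumption. The table is filled from the Unicode full-width forms U+FF01 to U+FF5E, which lie at a fixed offset of 0xFEE0 above the ASCII characters '!' to '~', plus the ideographic space U+3000.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a HashMap-based full-width to half-width conversion.
// The mapping table is an assumption; the source does not list its entries.
final class FullWidthSketch {
    private static final Map<Character, Character> TABLE = new HashMap<>();
    static {
        // Full-width '!'..'~' map to ASCII by subtracting the offset 0xFEE0.
        for (char c = '\uFF01'; c <= '\uFF5E'; c++) {
            TABLE.put(c, (char) (c - 0xFEE0));
        }
        TABLE.put('\u3000', ' '); // ideographic (full-width) space
    }

    // Replace every full-width character in the line; leave the rest as is.
    public static String replace(String line) {
        StringBuilder out = new StringBuilder(line.length());
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            Character half = TABLE.get(c);
            if (half != null) {
                out.append(half.charValue());
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The escapes spell "Lucene 2.0" in full-width characters.
        String fullWidth = "\uFF2C\uFF55\uFF43\uFF45\uFF4E\uFF45\u3000\uFF12\uFF0E\uFF10";
        System.out.println(replace(fullWidth)); // prints Lucene 2.0
    }
}
```

Normalizing the text this way before analysis means that a query typed in half-width characters also matches documents that were written with full-width forms.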
2) Indexing Module: After the documents have been preprocessed, Lucene can be used to process them. First, create an index for the processed documents; second, build a query object; last, search the index. To begin, create a new IndexProcessor class for the documents. The main code is as follows [11]:

    writer.addDocument(doc);
    writer.close();
    }

First, an IndexWriter object is created that uses StandardAnalyzer as the analysis tool, in order to generate the index and store it in the index directory. IndexWriter.MaxFieldLength.UNLIMITED indicates that IndexWriter indexes the fields in the Document with no limit on field length. Second, Documents and Fields are created, and fields such as the file name and file contents are added to the Documents. Field.Store.YES indicates that the file name field is stored; Field.Index.NO_NORMS indicates that the file name is indexed but not analyzed; Field.TermVector.YES stores the term vectors of the field; FileReader(files[i]) supplies the value of a Field through a FileReader. Finally, the document is added to the index, and the close method closes the index, writes all data in the cache to disk, and closes all data streams. If the writer is not closed, the experiments show that the index directory contains only one segment file.
3) Searching Module: After indexing, the system defines a search class that provides two methods. The indexSearch method searches the index built by Lucene, whereas the stringSearch method searches the text as a long Java String. A delete method can also be used to delete the preprocessed text documents. The search first takes the index path, then parses the query string and generates a query object to search for the information. These three modules reflect the general process of any information retrieval system.
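The difference between the two search methods can be sketched in plain Java. The method names stringSearch and indexSearch follow the text, but the bodies below are assumptions rather than the authors' implementation, and the index is a simple in-memory map instead of a Lucene index. Note one behavioral difference in this sketch: stringSearch matches substrings, while indexSearch matches whole whitespace-separated terms.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;

// Contrast between scanning text and looking up a prebuilt index.
// Illustrative only; not the paper's search class and not Lucene's API.
final class SearchComparisonSketch {
    // stringSearch: scan the full text of every document for the keyword.
    public static List<Integer> stringSearch(List<String> docs, String keyword) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++) {
            if (docs.get(i).contains(keyword)) {
                hits.add(i);
            }
        }
        return hits;
    }

    // Built once up front: term -> positions (in docs) of documents containing it.
    public static Map<String, List<Integer>> buildIndex(List<String> docs) {
        Map<String, List<Integer>> index = new HashMap<>();
        for (int i = 0; i < docs.size(); i++) {
            // The set removes repeated terms so each document is listed once.
            for (String term : new LinkedHashSet<>(Arrays.asList(docs.get(i).split("\\s+")))) {
                if (term.isEmpty()) {
                    continue;
                }
                index.computeIfAbsent(term, k -> new ArrayList<>()).add(i);
            }
        }
        return index;
    }

    // indexSearch: a single map lookup, independent of the total text size.
    public static List<Integer> indexSearch(Map<String, List<Integer>> index, String keyword) {
        List<Integer> hits = index.get(keyword);
        return hits == null ? Collections.<Integer>emptyList() : hits;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "lucene builds an index",
                "string search scans text",
                "lucene search is fast");
        Map<String, List<Integer>> index = buildIndex(docs);
        System.out.println(stringSearch(docs, "search")); // prints [1, 2]
        System.out.println(indexSearch(index, "search")); // prints [1, 2]
    }
}
```

The scan does work proportional to the size of the whole collection on every query, while the index pays that cost once at build time.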

Experimental Results
Retrieval results were obtained by searching for keywords in two documents and comparing Lucene retrieval with String retrieval [12]. For the first document, of 250,000 words, Lucene retrieval took 75 ms and String retrieval took 1988 ms; when the document size increased to 40 million words, Lucene retrieval took 108 ms and String retrieval took 5688 ms. Lucene's retrieval time is therefore far shorter than that of String retrieval. If String retrieval were applied to a large-scale information retrieval system, the search speed would become intolerable once the stored information reached the TB level.
The example supports indexing and retrieval of documents in txt format. For practical applications, PDFBox can be used to process PDF documents, POI to process Excel and Word documents, and Jacob to handle Word documents; as long as the data source is put into a Document object, the information can be searched.

Conclusion
Lucene is a full-text indexing engine toolkit written in Java that supports multi-user access, offers fast index access, and runs across platforms [13]. This paper analyzes in detail Lucene's three main modules (analysis, indexing, and searching) from the perspective of system architecture, and compares the response time of Lucene full-text search with that of String retrieval. The experimental results show that Lucene full-text search is faster.