Mathematical Expression Extraction in Text Fields of Documents Based on HMM

Aiming at the problem that the mathematical expressions in unstructured text fields of documents are hard to be extracted automatically, rapidly and effectively, a method based on Hidden Markov Model (HMM) is proposed. Firstly, this method trained the HMM model through employing the symbol combination features of mathematical expressions. Then, some preprocessing works such as removing labels and filtering words were carried out. Finally, the preprocessed text was converted into an observation sequence as the input of the HMM model to determine which is the mathematical expression and extracts it. The experimental results show that the proposed method can effectively extract the mathematical expressions from the text fields of documents, and also has the relatively high accuracy rate and recall rate.


Introduction
Extracting the information we need from the text in documents and converting them into structured data for storing in database is the premise of their further utilization [1].At present, there are mainly three kinds of text information extraction technology: dictionary based model [2], rule based model [3] and statistics based model [4].
Hidden Markov Model [5] is one of the statistical models which obtains its model parameters through training instead of prior structural dictionaries and rules.It is established easily with stronger applicability and higher accuracy rate and recall rate.Therefore, it has got more attention of researchers, and widely been used in word segmentation [6], speech recognition [7], information extraction [8] and other fields.Journal of Computer and Communications Yu et al. [9] proposed an algorithm based on Hidden Markov Model for extracting the information of the header and reference of scientific papers in Chinese.At first, the algorithm segmented the paper header into semantic blocks by utilizing semicolon, enter, comma and other punctuation marks, and then extracted information from the semantic blocks.In order to make HMM more practical, the algorithm used the method of vertical merger and horizontal merger to combine HMM states when the model structure was being trained.In this way, an optimized HMM was obtained which had effectively improved the extraction accuracy rate and recall rate.
Liu [10] proposed an extraction method that can extract text information from different fields and data sets based on HMM.Firstly, the initial probability and transition probability parameters were divided into several templates for training.Based on this, a text information extraction algorithm based on multi-template HMM was proposed.It can effectively solve the problem that the training data was diversified and the model parameters were hard to learn, and has higher accuracy rate and recall rate.In addition, a new algorithm of text information extraction based on maximum entropy HMM was proposed through combining rule based method and statistic based method according to the information of context features and words meaning, which obtained a better extraction result.
Liu [11] proposed an information extraction method based on maximum entropy HMM (ME-HMM) and ontology technology for accurately extracting useful information from comprehensive evaluation of books.ME-HMM was used to some pre-processing works, such as nominal marking and named entity recognition.The information extraction model based on ontology technology was used for the extraction of professional sentence and addresses to solve the problems of information association and unclear underlying meaning.The experimental results show that this method can improve the accuracy rate of information extraction.
Du et al. [12] put forward a method of text information extraction based on mixed HMM to get the place names from the metadata of documents.This method considered the dependence of transition probability at a moment on previous and latter states.Two pass decoding of positive sequence and negative sequence were performed by utilizing Viterbi algorithm as decoding.The experimental results show that the method can achieve a better information extraction result.
Shuang and Sun [13] proposed an improved information extraction method based on HMM.The transition probability and output probability of the improved model were not only dependent on the current state, but also can be corrected according to the forward and backward dependency which can further improve the information extraction quality.The experimental results show that the proposed method can further improve the extraction quality.
In a word, the HMM has been widely used in text information extraction and played a tremendous role.However, there are less researches employing HMM on the mathematical expression extraction from text fields.
Because of the complex syntax rules of the mathematical expressions, if we use the rule based method of text field extraction to extract mathematical expressions, complicated rules should be made and the applicability of this method is limited.Based on the above considerations, this paper proposes a method to extract mathematical expressions from the unstructured text fields of documents based on HMM which could effectively avoid these problems.Firstly, the HMM model is constructed according to the symbol combination feature of the mathematical expression.Then, the extracted text is pre-processed.Finally, the trained model is used to extract the mathematical expressions automatically.
Simulation results show that this method can automatically, rapidly and effectively extract mathematical expressions from the text fields.

The Description of the Problem
It is found that the structured web pages express formulae through description language such as LaTeX [14], MathML [15], OpenMath [16] and so on which generally have fixed embedding modes and separating labels.For example, in LaTeX documents, the embedded formulae are labeled as "$…$", "\[…\]", "\begin{math}…\end{math}". While the isolated formulae are labeled as "$$…$$", "\[…\]", "\begin{align}…\end{align}", etc.Therefore, the mathematical expressions in the structured documents can be extracted according to the labels.Cui et al. [17] summarized the characteristics of various kinds of formats of formulae exist on web pages.And put forward a method of mathematical formula extraction based on heuristic rules.Chen et al. [18] proposed a method to automatically analyze and extract the features of LaTeX expressions and filtering formulae according to their features.In this method, the extraction accuracy was improved efficiently.
Different from structured text like web pages, the mathematical expressions existing in unstructured text have no clear labels.We cannot extract mathematical expressions from these kinds of documents directly with labels.The content of text consists of characters, numbers, punctuations and other printable symbols.And the combination modes of these symbols are many and varied.There are also no observable rules to separate mathematical expressions from ordinary text.In addition, the sources of text are varied, the typesetting forms and expression styles of text are also different, and the content of text is very large sometime.All of these reasons make great challenges and difficulties to extract mathematical expressions from ordinary text.So it is necessary to research the method for extracting mathematical expressions from ordinary text without special labels.

Algorithm for Extracting Mathematical Expressions in
Text Field of Documents Based on HMM

The Overall Framework of the Algorithm
According to the characteristics of mathematical expressions in the text fields of Output: HMM model.
Step 1: Get the training samples, that is, the correct mathematical expressions.
Step 2: Set up the HMM status value and observation value for expressions.
Step 3: Use Baum-Welch algorithm [19] to learn the parameters for training data and set up the model.
Step 4: When the training generation number meets the preset value, stop training; otherwise, go to Step 3.

2) Testing model
Input: Trained HMM model and test samples (the text which contains mathematical expressions to be extracted).
Step 1: Obtain the text that needs to be processed.
Figure 1.Overall framework of proposed method.

Construction of HMM Model and Its Parameter Learning
Some symbols of the mathematical expressions in text represented in Unicode codes are shown in Table 1.
Hidden Markov Model (HMM) [20] is the model of a double random process based on statistics.To construct the HMM model, we should first determine the structure of the model, that is, the number of contained states and the conversion between each state.According to the Table 1, we define a total of eight states of HMM as: the variables (numbers, letters, Greek letters), monocular operation (the operator has one operand only), binocular vision (the operator has two operations), punctuation, left and right delimitations, delimiter in a set, superscripts and subscripts, denoted as The observed values of the HMM model are the mathematical symbols shown in the mathematical expressions.This paper defines that the observation values corresponding to each state which are shown in Table 2.
In addition to the above observations, we define the trigger word dictionary as follows: Definition 1: Trigger_W, the trigger word, is an important part of a specific event that can describe its special attribute information.
In mathematical expression extraction, Trigger_W = {sin, cos, Tan, cot, sec, csc, arcsin, arccos, arctan, arccot, arcsec, arccsc}.Trigger_W is employed as the  After determining the model structure, we need to find the transition probability of state and the output probability of observation.In this experiment, the Baum-Welch algorithm was used to calculate the model parameters, and the specific steps are as follows: Step 1: Define two intermediate variables: ( )

G2
End Journal of Computer and Communications corresponding state i in the case of given model and observation sequence [13]: Step 2: Initialize [10]: ( ) , the expected value of the state i S at time Step 3: Iterative compute: Let , , 0, Step 4: The initial value is substituted into the above equations, and the new parameters are calculated.The iteration is repeated until the parameter converges, the method of judging convergence is: where the threshold ε is generally set to 1 6 e − .

Extraction of Mathematical Expressions
Extracting mathematical expressions from text fields of documents can be implemented with the following steps: Step 1: Design a crawler, and the depth of crawler is 20; select website home pages which are relevant to mathematics as the crawler entrance; start the crawler, and save the acquired pages to a local file.
Step 2: For the web pages crawled from the Internet, block the web pages at first [21].Remain the main part and delete the navigation blocks and link blocks.
As there are a lot of HTML labels in web pages, the blocked text will be cleaned through removing the useless noise data in order to reduce the noise in samples and improve the accuracy rate of mathematical expression extraction, which is shown in Table 3. Step 3: Segment the text which have been removed the useless noise data.After segmentation, use the word library of WordNet [22] (containing about 350 thousand words) to filter the text.As shown in Figure 4, the filtered text will be taken as the input of HMM model.
Step 4: Use the HMM model which has been constructed and the Viterbi algorithm to determine whether the expression in text is a mathematical expression or not, if so, the expression will be extracted.
For example, the proposed method analyzes the sentence "There's good reason why mathematics formulas are not patentable!It's really easy {4 ghs/} to come up with new formulas.Like a b c d + = + .I bet you won't find that in any textbook, because 56 + 4 = 60 it's not something really important."through the following steps: Step 1: Initialize: Build training samples and test samples.
Step 2: Train HMM model by training samples.
Step 3: Confirm the model structure, and evaluate parameters by Baum-Welch algorithm.
Step 4: Do some pre-processing works for the text.
Step 5: Take the fmList as the input for HMM model.
Step 6: Use the Viterbi algorithm to determine whether the expression is a mathematical expression.
Step 7: If so, the expressions will be extracted.

Experiment Results and Analysis
In order to prove the availability of proposed method based on HMM model in The experimental results will be evaluated by two evaluation indexes: accuracy rate and recall rate.The calculation methods of two evaluation indexes are expressed as follows: Re call rate N N = where extracted N is the number of the mathematical expressions extracted correctly; marked N is the number of the marked mathematical expressions in the text; all N is the number of all mathematical expressions in the text.
As shown in Table 4, the accuracy rate and recall rate of the proposed algorithm have some slight oscillations in the extraction process, but the whole situation tends to be stable.With the increase of text number, the accuracy rate and recall rate will also increase and tend to be stable finally, but they may decrease when the text number is too large.Because of the limited training samples and various mathematical expressions, it is very difficult to describe all cases by only a model, so the extraction model should be expanded further.
For verifying the performance of our method, we built a simulation system which employed the algorithm proposed in Reference [18].It is worth nothing that due to the differences of experimental environment, experimental data such as the mathematical expression format, and the limitations of our comprehensions on the original paper, the comparison results could not completely meet the results of the original paper and could only roughly reflect the situations of the method.
The comparison results about accuracy rate and recall rate are shown in From Figure 7, we can see that the average extraction time is 0.0011 ms when the number of mathematical expressions in text equals 5 and 0.0053 ms when the number of mathematical expressions in text equals 45 respectively.While the average extraction time of the simulation system is longer than the proposed method.The reason is that the compared method need to locate the mathematical expressions and scan back and forth firstly, and carry out many filtering operations before extraction.The proposed method can avoid these processes and improve the extraction efficiency obviously.These two curves have some overlapping data at the first phase.The reason is that when the number of mathematical expressions is relatively great, the advantage of proposed method is obvious, and the extraction time is less than simulation system.In contrast, the extraction time is about the same when the number of mathematical expressions is small.

Conclusion
To solve the problem that the mathematical expressions in unstructured text fields of documents is hard to be extracted automatically, rapidly and effectively, a novel method based on HMM was proposed.Through considering the characteristics of mathematical expressions, this method can realize the extraction of the mathematical expressions in text fields of documents with relative high accuracy rate and recall rate.However, there are some shortcomings of the mathematical expression extraction method based on HMM.Firstly, this method needs plenty of training samples and takes a lot of training time.Secondly, if the states of HMM is not comprehensive enough, the extracted result will be not good.We shall further improve the observed values and state values of HMM model and optimize the model parameters to make the model be more suitable for more mathematical expressions and achieve better extraction results.
documents, this paper proposes an automatic extraction algorithm of the mathematical expressions based on HMM.The overall framework is shown in Figure 1.The process of extracting mathematical expressions in text fields of documents based on HMM can be divided into two parts described as follows: 1) Training model Input: Training samples (the mathematical expressions).
= − " as an example, the corresponding state value and the observed value are shown in Figure2.When the HMM model is constructed, it is assumed that there are a meaningless start state (start) and a meaningless end state (end) as the preparatory state and final state of the model.Taking the mathematical expressions " an example, it's HMM model structure is shown in Figure3.
ξ is the probability of the state i and the state j corresponding to the model

Figure 2 .
Figure 2. State value and observed value of an expression.

Figure 3 .
Figure 3. Instance diagram of state value observation value correspondence.
Journal of Computer and CommunicationsFor verifying the time efficiency of proposed method, different numbers of mathematical expressions were selected to test the extraction time.The test data is shown in Figure7.

Figure 7 .
Figure 7.Comparison result on the extraction time.

Table 1 .
Special symbols of mathematical expressions.

Table 2 .
Observation values corresponding to each state.

Table 3 .
Noise types and processing methods.