An Improved Indexing and Matching Method for Mathematical Expressions Based on Inter-Relevant Successive Tree ()
1. Introduction
With the rapid increase of the amount of science and technology documents which contain many mathematical formulae with various expressing formats such as LaTeX and MathML in computers and network, finding and obtaining the required information according to the formulae in these documents becomes an urgent task in the fields of information retrieval and searching engine. However, the complex two-dimensional structure of mathematical expressions is often not properly handled in ordinary full-text search engine, which makes it necessary to research and develop special indexing and matching method for mathematical expressions.
At present, there are two kinds of strategies for realizing mathematical retrieval. The first category is to expand the existing text search engine system through transforming two-dimensional formulae into linear strings, such as LeActiveMath [1] [2], MathDex [3], EgoMath [4], DLMFSearch [5] and MIaS [6].
LeActiveMath [1] [2] is a system that enables people to use the searching tool to realize exploratory learning of mathematical knowledge. The system was based on the Lucene engine, and used OMDoc coding. The mathematical expressions are converted into symbolic stream, and the mathematical content is retrieved by matching the phrase in the symbolic flow. Because the query formula can appear in any layer of the matched formulae, the phrase match goes deep into each layer. The system also put forward a ranking mechanism to sort the search results, which makes the search results even better satisfy the needs of learners. EgoMath [4] for Wikipedia researched mathematical retrieval method. Many mathematical symbols in Wikipedia are expressed in TeX format, so EgoMath first considered the processing of mathematical formulae’ text to get a number of phrases. Then, by means of similarity description of the phrases, the standardization of formulae could be realized. The similarity description is similar to multiplication distributive law. Finding equal or similar descriptions in index can achieve similar retrieval. The method used in this system increases the semantic information and reduce the ambiguity of the formulae.
In a word, the method of extending the function of the existing text search engine for math retrieval need to convert formulae into character strings, which cannot provide a complete searching function for formulae. Another strategy of realizing math retrieval that designs the special index and corresponding matching algorithm is still not mature. This paper proposes an improved mathematical expressions index method based on the inter-relevant successive tree [16] [17] [18] and corresponding searching algorithm which integrated the above two strategies through employing full text index method to deal with the two-dimensional structure of mathematical expressions. The characteristics of the method is utilizing the extracted formulae features to form the relationships as predecessor and successor, for constructing math index in the mode of inter-relevant successive tree.
2. The Improved Math Index
The inter-relevant successive tree (ISTR) [16] [17] [18] is a full-text index model which can effectively connect the predecessor and successor nodes for expressing the relationships of characters in text. It has the characteristics of high speed of index creating and matching with high space efficiency. There are two layers in the tree called the root node layer and the leaf node layer. The leaf nodes which express the current character in a tree are linked to the root node of next tree corresponding to next character position, which forms the full text index.
The improved ISTR increases a clustering layer on the original two layers structure to form a mathematical index structure with three layers. These three layers are clustering layer, branch layer and successor layer.
The first layer is the clustering layer which clusters the root nodes of the original ISTR into several classes to avoid the problem that the increasing number of trees in the index structure may cause the large expansion of the index scale. The nodes in the layer are called the cluster nodes.
The second layer is the branch layer which contains the root node in the original ISTR. The nodes in the layer are called the branch nodes.
The third layer is the successor layer which consists of the leaf nodes in the original ISTR which is called the successor nodes now.
The improved index tree structure is shown in Figure 1. The shadow part of the figure indicates the improved new structure, similarly in the following figure.
As can be seen in the Figure 1, each branch of the tree has a number, and the nodes in the tree are orderly arranged. The numbers of the branches in the first layer indicate the positions of the branch nodes in the cluster nodes, which are called the branch node number. The numbers in the second layer branches indicate the positions of the successor nodes in the branch nodes, which are called the successor node number. The successor node not only records its branch node number, but also records the number of the nodes in the cluster node, which lays the foundation for realizing math retrieval.
3. Clustering of Mathematical Expressions Feature
3.1. Feature Extraction of Mathematical Expressions
Suppose that the mathematical expressions are described in LaTeX format. Take the FDS (Formula Description Structure) defined in [20] [21] as the basis of the retrieval feature of mathematical expressions.
Definition 1. EF is a 5-tuple (Key, Oper, Level, Flag, Prestr) which is used as the retrieval feature of formulae. The components of it are defined as follows.
Key is the identifier of the item of mathematical expression. The last one of key is “#” which indicates the end of the expression.
Oper stores the item corresponding to the key. The Oper value is “Null” when the key has no operator.
Level is the layer number of the item in the formulae. The value of layer on the reference layer is 0. The farther the layer from the baseline, the large the Level value has.
Flag stores the relationship between current item and its control item which has the higher level in the formulae. Its value is 1 - 8 means the relationships between two items to be above, superscript, right, subscript, below, contains, left-superscript and left-subscript respectively. The value 0 means the symbol is located in the reference layer.
Prestr indicates the keyword of the control item of current item. When the Flag value is 0, Prestr is “Null” .
Algorithm 1 shows the procedure for obtaining the EF of an expression.
Algorithm 1. Extraction of the EF of formula.
1. Sort the items of formula according to their Level value in ascending order.
2. Sort the items in the same level on the basis of Flag in ascending order.
3. Merge the symbols which have the same values of Level, Flag, Prestr, and extract the operators in these items.
For example, the LaTeX expression of “
” is “\[\frac{{a+b}}{a} = \frac{{{a^2}+ab}}{{{a^2}}}\]”. The EF of it is shown in Table 1. Among them, “\one\frac” and “\two\frac” are the serial number of the same character “\frac” in the expression.
3.2. Feature Clustering of Mathematical Expressions
Rule 1: For key1 and key2, if and only if the oper1 = oper2, key1 and key2 are aggregated in the same class. In the improved ISTR, each tree represents a class, and key1 and key2 are inserted into the same tree when building index.
Take the feature in Table 1 as an example. The value of Key (except “#”) is represented by k1, k2... k7. The value of Oper “(\frac, = \frac)” is described as O1, “(+)” is described as O2 and “(Null)” is described as O3.The clustering results are shown in Figure 2.
4. Construction of Mathematical Index
4.1. Expressing Rules of Formulae in ISTR
The nodes’ positions and tags’ order of operators and operands of formulae in the improved ISTR are as follows.
Rule 1: The operators and operands located at the same level with the same flag and identifier are stored in the same node.
Rule 2: If the operators and operands are at different levels, mark the lower level firstly, and then mark the higher level. The operators and operands are stored in different nodes.
Rule 3: The controlled items (operands) of a controlling item (operator) are at the same level with different flags are stored in different nodes. The operands’ tag order is above, superscript, right, subscript, below, contains, left-superscript and left-subscript successively. When there is no symbol at a flag, it is not marked.
Rule 4: If the operators or operands are located at the same level with the same flag but different identifier, mark them according to the order of their LaTeX description, and their located nodes are also changed.
The above rules can be used not only independently but also synthetically. We make a detailed description of these rules with the example of mathematical expressions.
Left-right structure: the expression is “
”. Its LaTeX description is “\[a − b + c\]”. The position of nodes and the tag sequence are shown in Figure 3.
Up-down structure: the expression is “
”. Its LaTeX description is “\[\frac{x}{y}\]”. The position of nodes and the tag sequence are shown in Figure 4.
Containment structure: the expression is “
” and its LaTeX description is “\[\sum\limits{i=1}^nx \]”. The position of nodes and the tag sequence are shown in Figure 5.
![]()
Figure 3. The index structure of the formula with left-right structure.
![]()
Figure 4. The index structure of the formula with up-down structure.
![]()
Figure 5. The index structure of the formula with containment structure.
Corner structure: the expression is “
” and its LaTeX description is “\[a_i^2\]”. The position of nodes and the tag sequence are shown in Figure 6.
When a variety of these operators appear on the same level of an expression, whether or not the same, the tag sequence will be strictly in accordance with the above rules. Only when the symbols of a flag are completed stored can the next item with different flag be processed.
4.2. Index Information in ISTR
For realizing mathematical expression retrieval with a high performance, we modified the ISTR based math index in [19] through adding a clustering layer to improve the searching efficiency. The index structure is shown in Figure 7.
The distribution of each element of formulae features EF in the improved ISTR is as follows: the first layer of clustering is based on Oper feature. The formulae with the same operator are clustered into the same class. The class of this layer is connected to the branch node in the next layer which has the same value of key in the Oper feature. In the second layer, not only the value of Key feature but also the values of flag and prestr of the key are stored. The third layer only stores the value of Key feature.
4.3. Index Construction Algorithm
The successor nodes store the number information of double key in the next tree. The algorithm for constructing the mathematical expression index is as follows.
Algorithm 2: Improved math index construction
Input: A LaTeX formula
Output: The improved ISTR index
Step 2: According to the Oper in EF, find the cluster node in current index. If Oper exists in current index, go to step 4. Otherwise, go to step 3.
Step 3: Create new cluster node, that is creating a new tree.
![]()
Figure 6. The index structure of the formula with corner structure.
Step 4: Analyze the tree which employ Oper as root node. If it contains the predecessor information of double key in its branch nodes, go to step 6; Otherwise, go to step 5.
Step 5: Create a new branch, and ascertain branch node number.
Step 6: Traverse successor nodes of predecessor as branch node to search the successor of double key. If the successor exists in successor nodes, ascertain successor node number, and go to step 8. Otherwise go to step 7.
Step 7: Insert a new leaf node, and ascertain successor node number.
Step 8: According to the branch node number and successor node number, insert the successor node information.
Step 9: End.
4.4. Case Analysis
The features of the “
” are listed in Table 1, and its ISTR structures are as follows Figure 8.
When the information of current formula already exists in the index structure, a new node is no longer generated.
ISTR structures of “
”, “
”, “
” are shown in Figure 9.
In index structure, the nodes in the tree also store the other feature information of the formulae. The index structure of several formulae is shown in Figure 10.
5. The Algorithm of Mathematical Expression Retrieval
In the improved ISTR which added the clustering layer, the number of search trees is reduced, and the retrieval time is shortened, so the retrieval efficiency is improved.
![]()
Figure 10. Index structure of several formulae.
Algorithm 3: Accurate retrieval algorithm
1: Input: Query math expression
2: Output: Math matching result
3: Define List Q_EF(term_key, term_oper, term_flag, term_prestr) is query math expression feature
Tree_node(cluster_node, branch_node, successor_node, T_flag, T_prestr, document information) is the node of a tree in the database
4: for each termitermj in Q_EF
5: if
then
6: if
then
7: if
then
8: Select document information that conform to ![]()
9: end if
10: end if
11: end if
12: end for
In addition to the accurate retrieval, there are three kinds of matching models, which are compatible matching, sub-expression matching and fuzzy matching. Compatible retrieval is based on the algorithm 3 to cancel the restrictions of the first keyword’ Flag and Prestr values, and the corresponding document information is the result of the compatibility matching. Sub-expression matching is to extract the sub-structure features of the query expression based on extracting the features of the query expression, and to accurate matching for sub-structure features that do not contain serial number. Fuzzy matching is a consistent search for the sub-structure of the query expression.
6. Experiments Results
6.1. Clustering Result
![]()
Figure 11. Distribution of the number of formulae.
6.2. Space Efficiency
The sizes of index files are shown in Table 2. With the growth of the number of indexed formulae, the size of the index files grows. Because of the clustering function of the cluster layer, the same nodes are stored only once. So the growth rate of the index file and the file size are acceptable. Table 3 lists the sizes of the index files in [10].
6.3. Time Efficiency
320 formulae were randomly selected to retrieve the efficiency of the experiment, and
![]()
Table 2. Sizes of index files with an increasing number of indexed formulae
![]()
Table 3. Sizes of index files in [10]
the results are shown in Table 4.The retrieval efficiency data of [10] are shown in Table 5. With the growth of the number of indexed formulae, retrieval time of the method increases. Because the number of the formulae grows, the number of nodes in the tree continues to grow, and the judgment time of the nodes is lengthened, which influences the retrieval time. The experimental data show that the time efficiency of the proposed method is within the acceptable range.
6.4. Contrast Experiment with the Original Index Structure
The contrast experiment between the improved math index and the original one in two aspects of space and time are shown in Figure 13 and Figure 14 respectively. From the sight of space, the improved index model added clustering layer in the original tree structure, so the store information is much more than the original index model, and the index files are also larger than the original index files. From the view of time, because of the added clustering layer, speed up the search speed of the tree. The retrieval efficiency of the algorithm is higher than that of the original index model. From Figure 13 and Figure 14, the original index model improvement has obtained the certain effect in this paper.
7. Conclusions
![]()
Table 4. Retrieval efficiency in this paper
![]()
Table 5. Retrieval efficiency in [10]
of the data diversity is insufficient. Moreover, the characteristic of special formulae needs to be further strengthened.
The future work is to add to the sorting module of the results. It can improve the rationality of the order of search results, and then improve the practical application of the system.
Acknowledgements
This work is supported by the National Natural Science Foundation of China (Grant No. 61375075), the Natural Science Foundation of Hebei Province (Grant No. F2012201020; F2013201134) and the Project of Human Resources and Social Security of Hebei Province (Grant No.JRS-2016-1090).