DBpedia-Based Fuzzy Query Recommendation Algorithm and Its Applications in the Resource-Sharing Platform of Polar Samples

In order to continuously promote the polar sample resource services in China and effectively guide the users to access such information as needed


Introduction
As one of the most important modules of an information platform, almost every search engine offers query recommendation as a standard function of its search module. In particular, when users are not clear about their search objects, the related search results given by the search engine can effectively guide them, step by step, to the information they need [1]. During the past decade, query recommendation technology has …

Finally, a knowledge database such as the following can be obtained:

Skua -> Stercorariidae -> Gull passerine -> Birds
Emperor penguin -> Spheniscidae -> Sphenisciformes -> Birds
In the process of extracting the category tree, setting the height of the tree is particularly important: if it is too high, traversal speed suffers; if it is too low, the matching effect is reduced.

Construct a Proper Height for the Category Tree
A category tree's error rate will be quite high if the tree is constructed too small. On the other hand, if the tree is too big, the apparent error rate obtained from the learning-set test is very small, but its true error rate may still be relatively large. Therefore, we need to construct a tree of appropriate size to minimize the true error rate.
The aim of decision tree learning is to obtain a simple tree with strong predictive capacity. When a tree grows to its full size, its predictive capacity is reduced. To solve this problem, we need to obtain a tree of appropriate size. In general, two methods are available.
Method 1: Define the conditions under which the tree stops growing.
1) Minimum number of partition instances. When the size of the data subset corresponding to the current node is smaller than the specified minimum number of partition instances, no further partition is performed even if the instances do not all belong to the same category.
2) Partition threshold. When the difference between the value obtained by the applied partition method and the value of its parent node is smaller than the specified threshold, no further partition is performed.
3) Maximum tree depth. When a further partition would exceed the maximum tree depth, stop partitioning.
Method 2: Carry out pruning after a complete decision tree has been generated, by evaluating subtrees: if the whole decision tree performs better after a subtree is removed, that subtree is pruned. Specifically, the implementation process in Breiman's CART [9] is as follows:

1) Tree construction
A decision tree is built by partitioning data sets on attribute values, so a measurement of attribute partitions must be defined; according to this measurement, the optimal partitioning attribute for the current data subset can be determined.
When the fuzzy function measuring the cost of a node has been selected, during tree growth we always try to find an optimal bifurcation value to partition the samples in the node so that the cost is minimized. The fuzzy function φ(P) represents the fuzzy degree, or misclassification index, of a tree node t:

φ(P), P = (p_1, p_2, ..., p_c)  (1)

where P is a decision set, c denotes the number of decision-making categories in the decision set, p_i ≥ 0 indicates the proportion of the i-th decision-making category in the decision set D, and Σ_{i=1}^{c} p_i = 1.

In the bifurcation tree of the CART algorithm, the change of fuzzy degree caused by a bifurcation is:

ΔE(t) = E(t) − p_l E(t_l) − p_r E(t_r)  (2)

where t is the bifurcation node; E(t) is the fuzzy degree of node t; E(t_l) and E(t_r) are the fuzzy degrees of the left and right child nodes, respectively; and p_l and p_r denote the proportions of the samples of node t falling into the left and right branches, respectively. For the bifurcation of each internal node t, take the largest change of fuzzy degree among all possible bifurcations of t. For the other nodes, repeat the same search process.
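A minimal sketch of this node-cost and split-gain computation. The Gini index is used here as one common choice for the impurity function φ(P); the exact fuzzy function used in the paper may differ.

```python
# Sketch of the node impurity and split-gain computation described above.
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum_i p_i^2 over class proportions p_i."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def impurity_decrease(labels, left, right):
    """Change of fuzzy degree: E(t) - p_l * E(t_l) - p_r * E(t_r)."""
    n = len(labels)
    p_l, p_r = len(left) / n, len(right) / n
    return gini(labels) - p_l * gini(left) - p_r * gini(right)

# A perfectly separating split removes all impurity.
parent = ["bird", "bird", "fish", "fish"]
print(impurity_decrease(parent, ["bird", "bird"], ["fish", "fish"]))  # 0.5
```

The split chosen at each node is the one maximizing this decrease, matching the "largest change of fuzzy degree" rule above.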

2) Pruning
The above algorithm generates a large tree whose apparent error rate is very small, but whose true error rate may still be relatively large. We must construct a tree with a small true error rate by means of pruning. A certain algorithm is used to prune the branches of this tree continuously. During the pruning process we obtain a list of decreasing trees, forming a sequence of pruned trees; each tree in this sequence has a smaller apparent error rate [9] than the other subtrees of the same size, so the sequence is an optimal one. The bifurcation tree can be pruned on the basis of the minimal cost-complexity principle, as follows. In general, a tree is denoted by T and the subtree rooted at node t by T_t; the pruned subtree T_{t3} shrinks into a terminal node t3, the pruned tree is denoted T − T_{t3}, and T − T_{t3} ⊂ T. Let T̃ denote the set of terminal nodes of the tree T, and |T̃| the number of terminal nodes. The impurity index of the tree T is defined as:

E(T) = Σ_{t ∈ T̃} E(t)  (3)
In Equation (3), E(t) denotes the fuzzy index of the tree node t, or the square error of the data set fitted at node t, and the error index is the fuzzy function E(t). The pruning principle of the decision tree, namely the cost-complexity measurement, is:

E_a(T) = E(T) + a|T̃|  (4)

where E_a(T) denotes a linear combination of the tree impurity index E(T) and its complexity; a is the complexity parameter reflecting the complexity of the tree, and |T̃| indicates the number of terminal nodes of the tree T.
To find the next smallest tree of the tree T: for each internal node t of T, we work out the value of the penalty factor a at which the pruned tree T − T_t becomes preferable, and label this value a_t; it is the ratio between the change of the error index before and after the current subtree is pruned and the change of the number of terminal nodes:

a_t = (E(t) − E(T_t)) / (|T̃_t| − 1)  (5)

The node we need to select is the internal node with the minimal a_t. The whole pruning process is to calculate a_t, seek the smallest a_t, and then select T − T_t as the next pruning object.
For each given value a, a smallest tree T(a) can always be found with respect to the cost-complexity measurement:

E_a(T(a)) = min_{T′ ⊆ T} E_a(T′)  (6)
As a increases, T(a) remains the smallest tree until a reaches a jump point a′, after which the tree T(a′) becomes the new smallest tree.
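The weakest-link calculation above can be sketched on a toy tree: compute a_t for every internal node and prune the node with the smallest a_t first. The node structure and error values here are illustrative, not from the paper.

```python
# Toy sketch of weakest-link pruning:
# a_t = (E(t) - E(T_t)) / (|leaves(T_t)| - 1) for each internal node t.

class Node:
    def __init__(self, error, left=None, right=None):
        self.error = error      # E(t): error if this node were a leaf
        self.left, self.right = left, right

def leaves(node):
    if node.left is None and node.right is None:
        return [node]
    return leaves(node.left) + leaves(node.right)

def subtree_error(node):
    """E(T_t): total error over the leaves under node t."""
    return sum(leaf.error for leaf in leaves(node))

def weakest_link(node, out):
    """Collect (a_t, node) for every internal node."""
    if node.left is None and node.right is None:
        return
    a_t = (node.error - subtree_error(node)) / (len(leaves(node)) - 1)
    out.append((a_t, node))
    weakest_link(node.left, out)
    weakest_link(node.right, out)

root = Node(10.0, Node(6.0, Node(1.0), Node(2.0)), Node(3.0))
candidates = []
weakest_link(root, candidates)
a_min, prune_me = min(candidates, key=lambda p: p[0])
print(a_min)  # 2.0: the root has the smallest penalty factor here
```

Repeating this calculation after each pruning step yields the sequence of nested trees described above.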
After the smallest tree T(a) is determined, its height can be defined as h = l_leaf − l_root + 1, where l_leaf is the layer number of the final leaf nodes and l_root is the layer number of the root node.
For the cases in this paper, the appropriate height of the tree works out to h = 5 according to the above algorithm.

Similarity Algorithm
The similarity degree over a traditional category tree is mainly calculated by two methods: the direct character search method and the vector cosine (included angle) method. However, both methods are oversimplified, which seriously degrades the computed similarity. Therefore, based on the character search method, this paper proposes a fuzzy query algorithm based on DBpedia.

Literal Character Matching Method (CCQ)
The literal character matching method is the simplest: it judges similarity by the proportion of common words between two terms. For example, there is one common word between "adelie penguin" and "emperor penguin", so the matching value is 0.5.
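A minimal sketch of this CCQ measure. Whitespace tokenisation is an assumption for the English examples here; the original method operates on Chinese entry names.

```python
# Literal character matching (CCQ): similarity is the proportion of
# words the two terms have in common.

def ccq_similarity(a, b):
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    common = words_a & words_b
    return len(common) / max(len(words_a), len(words_b))

print(ccq_similarity("adelie penguin", "emperor penguin"))  # 0.5
print(ccq_similarity("skua", "emperor penguin"))            # 0.0
```

The second example shows the weakness motivating the fuzzy algorithm: "skua" and "emperor penguin" share no literal words, so CCQ reports zero similarity despite both being birds.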

Fuzzy Query Algorithm (WIKIFQ)
The search contents are classified according to such attributes of a word or phrase as pronunciation, meaning and relevance; for details, refer to the concrete fuzzy query algorithm [10].

1) Classification of the query contents
Firstly, the samples to be queried are classified appropriately. The query criteria are: the smaller the distance between a sample and the center of its own category, the better; and the larger the distance from the centers of the other categories, the better. According to the attributes of each category, the average value of each category is calculated as the category center v_i, i = 1, 2, ..., c; d_ij is the distance between the sample x_j and the center v_i of the i-th category, and m > 1 is a fuzzy weighted exponent. Distances within a category and among categories are defined so that the former are small and the latter are large. Define the distance within a category as d_ij = ||x_j − v_i|| for samples of category i (7), and the distance among categories as ||v_i − v_k||, i ≠ k (8). Synthesizing Equations (7) and (8), define the objective function:

J(U, V) = Σ_{i=1}^{c} Σ_{j=1}^{n} u_ij^m d_ij^2  (9)

where u_ij is the membership degree of sample x_j in category i, and this applies to all i.
A sample finally needs to be assigned to a certain category according to a certain membership degree, so the objective function is subject to the following constraint:

Σ_{i=1}^{c} u_ij = 1, for all j = 1, 2, ..., n  (10)
According to the objective function, the following conditions should be met: 1) u_ij should be inversely related to d_ij, i.e., u_ij is a monotone decreasing function of d_ij; 2) u_ij is a monotone increasing function of the fuzzy weighted exponent m; 3) u_ij is a membership degree, so 0 ≤ u_ij ≤ 1; moreover, each category must contain at least one sample, but not all samples may belong to the same category, so 0 < Σ_{j=1}^{n} u_ij < n holds; 4) simultaneously, u_ij satisfies Equation (10).
According to 1)-4), u_ij can be defined as follows:

u_ij = 1 / Σ_{k=1}^{c} (d_ij / d_kj)^{2/(m−1)}  (11)
It can be proved that Equation (11) satisfies the conditions 1)-4).
Under the constraint (10), the minimal value of Equation (9) can be obtained by iterating Equation (11) repeatedly to determine the final u_ij. Based on u_ij, the center v_i of each category can be calculated as follows:

v_i = Σ_{j=1}^{n} u_ij^m x_j / Σ_{j=1}^{n} u_ij^m  (12)
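The alternating update just described can be sketched as follows: recompute the memberships u_ij from Equation (11), then the centers v_i from the weighted-mean formula, for a fixed number of iterations. The 1-D data and the initial centers are illustrative.

```python
# Sketch of the iterative minimisation of the classification objective:
# alternate the membership update (Equation (11)) with the centre update.

def fcm(xs, centres, m=1.75, iters=50):
    c = len(centres)
    for _ in range(iters):
        # u_ij = 1 / sum_k (d_ij / d_kj)^(2/(m-1)); guard zero distances
        u = []
        for i in range(c):
            row = []
            for x in xs:
                d_ij = abs(x - centres[i]) or 1e-12
                s = sum((d_ij / (abs(x - centres[k]) or 1e-12)) ** (2 / (m - 1))
                        for k in range(c))
                row.append(1.0 / s)
            u.append(row)
        # v_i = sum_j u_ij^m x_j / sum_j u_ij^m
        centres = [sum(u[i][j] ** m * xs[j] for j in range(len(xs))) /
                   sum(u[i][j] ** m for j in range(len(xs)))
                   for i in range(c)]
    return centres, u

centres, u = fcm([1.0, 1.2, 0.8, 9.0, 9.5, 8.5], [0.0, 10.0])
print([round(v, 1) for v in centres])  # roughly [1.0, 9.0]
```

On well-separated data the centers converge to the two cluster means, and the memberships of each sample sum to 1 across categories, as required by constraint (10).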
2) Character matching query

For a sample x to be queried, calculate the distance between x and each center v_i, and select the k characters or words closest to x, represented by x_1, x_2, ..., x_k respectively. Define the ordered pairs (x_j, f(x_j)), where f(x_j) is the category to which the sample x_j belongs, w_1, w_2, ..., w_n are the categories of the partitioned query content, and n is the number of categories. Then f(x) can be calculated as the majority category among the k nearest neighbors:

f(x) = argmax_{w_i} Σ_{j=1}^{k} I(f(x_j) = w_i)

Selection of the Fuzzy Weighted Exponent m
For Equation (11), when m → 1, each u_ij satisfies u_ij → 0 or u_ij → 1 (when m = 1 there is no weighted value u_ij); when m → +∞, each u_ij satisfies u_ij → 1/c, and the partition is the fuzziest at this point. It is clear that the exponent m directly determines the fuzziness of the classification results.
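These limiting behaviours can be observed numerically. The sketch below evaluates the membership of one sample between two fixed centers, following the membership formula of Equation (11); the distances and m values are a toy setup, not from the paper.

```python
# How the weighted exponent m controls fuzziness: membership of one
# sample for category 1 of c = 2, with distances d1 and d2 to the
# two centres.  As m -> 1 memberships harden towards 0/1; as m grows
# they flatten towards 1/c.

def membership(d1, d2, m):
    """u for category 1: 1 / sum_k (d1/dk)^(2/(m-1))."""
    p = 2 / (m - 1)
    return 1 / ((d1 / d1) ** p + (d1 / d2) ** p)

for m in (1.1, 1.75, 5.0, 50.0):
    print(m, round(membership(1.0, 3.0, m), 3))
```

With d1 = 1 and d2 = 3, the membership drops from nearly 1 at m = 1.1 towards 1/2 at m = 50, illustrating why m must be chosen carefully.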
To classify the search contents, we can use the fuzzy target-recognition methods and algorithms [11]-[13] developed by BEZDEK, Lin Qing, Wei Meia and others, in which the partition fuzzy degree of a classification is defined following BEZDEK [11]. A fuzzy decision-making problem is formed by the intersection of a given fuzzy target G_f and a fuzzy constraint C_f, i.e., D_f = G_f ∩ C_f. In this paper, the fuzzy target of the decision-making problem in which a keyword or word is queried is defined over U*, the set of the final u_ij determined when Equation (9) reaches its minimal value. In addition, while completing the fuzzy classification of the content, the algorithm also requires that the content be partitioned as clearly as possible, in order to correctly distinguish the category membership of each sample. Therefore, the selection of the parameter m is subject to another constraint: the selected value must not make the results of the fuzzy classification algorithm overly fuzzy. The partition fuzzy degree is a good measurement for evaluating the partition fuzziness of a fuzzy classification; accordingly, it supplies the fuzzy constraint on the decision by which the parameter m is chosen. When G_f and C_f are treated as fuzzy sets, they can be characterized by their membership functions. To ensure that the membership functions of the fuzzy target G_f and the fuzzy constraint C_f increase or decrease to the same extent, membership functions μ_{G_f}(m) and μ_{C_f}(m) are defined for G_f and C_f respectively. The membership function of the fuzzy decision can then be expressed as μ_{D_f}(m) = min{μ_{G_f}(m), μ_{C_f}(m)}, and the final decision-making result is the solution maximizing μ_{D_f}(m). Consequently, the optimal weighted exponent m* is the value of m corresponding to the maximum membership degree of the intersection of the fuzzy subsets corresponding to the fuzzy target and the fuzzy constraint. The optimal weighted exponent m* can be obtained by the following formula:

m* = argmax_m min{μ_{G_f}(m), μ_{C_f}(m)}  (19)

The m* obtained from Formula (19) ensures that both the classification objective function and the partition fuzziness are minimized with a large membership degree, so that the fuzzy classification achieved by the algorithm not only expresses the similarity information among samples but also keeps the sample classification clear. A correspondingly better fuzzy classification result is therefore obtained.
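The max-min selection of Formula (19) can be sketched as follows. The two membership curves below are illustrative placeholders (one decreasing, one increasing in m), since the paper's exact definitions depend on the computed objective value and partition fuzziness.

```python
# Schematic max-min selection of m*: m* = argmax_m min(mu_G(m), mu_C(m)).

def mu_G(m):
    """Illustrative fuzzy-target membership, decreasing in m."""
    return 1 / m

def mu_C(m):
    """Illustrative fuzzy-constraint membership, increasing in m."""
    return 1 - 1 / m

candidates = [1.25 + 0.25 * k for k in range(16)]   # grid over m in [1.25, 5.0]
m_star = max(candidates, key=lambda m: min(mu_G(m), mu_C(m)))
print(m_star)  # 2.0 for these placeholder curves
```

With these placeholder curves the optimum lies where the two memberships cross; with the paper's actual membership functions the same grid search applies unchanged.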

1) Construct a DBpedia Database
A large amount of data cannot be indexed or retrieved efficiently if placed directly in documents, so a MySQL database should be built. Download the November 2013 XML dump of the Chinese Wikipedia, extract the three files zhwiki-latest-categorylinks.sql, zhwiki-latest-pages.sql and zhwiki-latest-redirect.sql (1.34 GB in total), and import these files to obtain the DBpedia Chinese entry database, with approximately 3,102,000 page records, 315,000 category records and 7,736,000 categorylinks records.
2) Get N entries from the database of BIRDS [14], N = Random(50 - 100).
3) Extract the category tree from DBpedia and then form a weight matrix [15] [16]. The code for the category tree of an individual entry is shown in Figure 1 and Figure 2.
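The category-tree extraction of step 3) amounts to walking up the category links for an entry, capped at the chosen height h = 5. In the experiment the parent relation sits in the imported MediaWiki tables (page, categorylinks); the sketch below simulates that lookup with an in-memory dict so the walk itself is visible.

```python
# Walk up the (simulated) category links to build the category chain
# for an entry, stopping at the chosen maximum tree height.

PARENTS = {
    "Emperor penguin": "Spheniscidae",
    "Spheniscidae": "Sphenisciformes",
    "Sphenisciformes": "Birds",
    "Skua": "Stercorariidae",
}

def category_chain(entry, parents, max_height=5):
    chain = [entry]
    while len(chain) < max_height and chain[-1] in parents:
        chain.append(parents[chain[-1]])
    return chain

print(" -> ".join(category_chain("Emperor penguin", PARENTS)))
# Emperor penguin -> Spheniscidae -> Sphenisciformes -> Birds
```

Against the real database, the dict lookup would be replaced by a query joining page and categorylinks on the entry's page id.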

Experimental Results
In order to verify the efficiency of this algorithm, we compare it with the traditional literal string matching algorithm and the DBpedia semantics-based algorithm (Table 1). The results indicate that the fuzzy matching algorithm is more accurate than the other algorithms in terms of semantic analysis, and is even capable of detecting certain relationships between two seemingly different words.

Discussion
From the test results shown in Table 1, we can conclude that the WIKIFQA method detects similar data more efficiently and more accurately. Since the deployment of WIKIFQA in 2013, users have received more convenient service, and the PV and IP values have also increased (Figure 3).
However, the algorithm depends too heavily on the wiki database, and its accuracy is mostly affected by the quantity and quality of Wikipedia pages. To improve accuracy, the database has to be updated automatically every quarter.

Conclusion
Based on an analysis of the characteristics of polar sample data and Chinese Wikipedia data, this paper provides a DBpedia-based fuzzy query algorithm, gives the feature extraction method for the eigenvalues of the fuzzy algorithm, and implements the semantic matching algorithm. The experimental results show that, compared with the traditional literal string matching method and the DBpedia semantic similarity algorithm, the WIKIFQA method detects similar data more efficiently and more accurately, improving data accuracy. The application of the algorithm on BIRDS also proves its applicability and convenience.

Foundation
This work is supported by the National Ocean Public Benefit Research Foundation (201305035); Key Laboratory of

Figure 2.
4) Superimpose the fuzzy matching algorithm to obtain the similarity. In the experiment, the fuzzy weighted exponent is taken as m = 1.75.
5) Test environment. CPU: 2.5 GHz × 2 cores; Memory: 4.0 GB; OS: Windows 7, 64-bit.

Figure 1 .
Figure 1. Get category by entry name.

Figure 3 .
Figure 3. Value of PV, IP on BIRDS.

Table 1 .
Comparison of WIKIFQA with CCQ and WIKIQA algorithms.