A New Algorithm for the Acquisition of Knowledge from Scientific Literature in Specific Fields Based on Natural Language Comprehension

The acquisition of knowledge and its representation have always been viewed as the bottleneck in the construction of knowledge-based systems. The traditional methods of acquiring knowledge are based on knowledge engineering and communication with field experts. However, these methods cannot produce systematic knowledge effectively, automatically construct knowledge-based systems, or benefit knowledge reasoning. It has been noted that, in specific professional fields, experts often use fixed patterns to describe their expertise in the scientific articles that they publish. Abstracts and conclusions, for example, are key components of the scientific article, containing abundant field knowledge. This paper suggests a method of acquiring production rules from the abstracts and conclusions of scientific articles in specific fields based on natural language comprehension. First, the causal statements in article abstracts and conclusions are extracted using existing techniques, such as text mining. Next, antecedence and consequence fragments are extracted using causal template matching algorithms. As the final step, production rules are automatically generated from the part-of-speech-tagged speech pair sequence according to a syntax parsing tree. Experiments show that this system not only improves the efficiency of knowledge acquisition but also simultaneously generates systematic knowledge and guarantees the accuracy of acquired knowledge.


Introduction
Knowledge acquisition (KA) has long been perceived as the most difficult bottleneck in the construction of knowledge-based systems (KBS). Over the past decade, knowledge engineers have argued over the best means of constructing an effective and reliable KBS. Many researchers view knowledge acquisition as critical [1][2][3][4][5]. For example, Edward Albert Feigenbaum once said, "There are many important problems to be solved in the use, representation, and acquisition of the knowledge. Of them all, the knowledge acquisition is the most important and critical bottleneck" [6]. It is imperative to build automated knowledge acquisition systems.
In specific professional fields, conclusive knowledge and experience are customarily represented concisely in normalized scientific language when stored in text form. Consequently, massive amounts of specialized field knowledge can be obtained from scientific articles, which calls for techniques that can extract knowledge en masse from them. The issue for knowledge engineers is how to formalize the explicit knowledge in these articles; the difficulty lies in achieving automatic conversion. 1) Existing data mining and analysis techniques, such as text retrieval, text association analysis, and so on, provide effective algorithms to implement the transformation [7].
2) The material processed in this paper comes from a specific scientific field. In addition, the statements in abstracts and conclusions are often short, conclusive, and focused on a single topic. Therefore, a priori knowledge can be used as a guide during the text mining process to improve the effectiveness of KA. In the experiments described in this paper, the range of both field knowledge and topic has been limited in order to improve accuracy.
3) The production rule is a mature method of knowledge representation with strong expressive power. It is easily combined with the Drools engine mechanism that is used as the inference device in the KBS proposed in this paper.
The method described in this paper takes texts that contain abundant field knowledge, such as professional papers, as input. After labeling, causal statements are transformed into production rules that can be executed by machine. The transformation between text knowledge and production rules can be achieved automatically with only necessary and effective manual intervention, which improves the degree of KA automation and realizes computer-aided KA.

Related Work
As the data show, knowledge acquisition techniques are optimized toward three main goals: 1) To improve the efficiency of knowledge acquisition, which means acquiring knowledge from field experts effectively or adopting semi-automatic knowledge acquisition methods.
2) To extend the scope of knowledge that can be accessed and improve the automation of knowledge acquisition.
3) To simplify the process of conversion from acquired knowledge to production rules that can be executed by machines.
As far as improving KA efficiency is concerned, recent studies have mainly focused on two issues. First, there has been the development of methods and assistant tools that shorten the communication cycle with field experts and guarantee the accuracy of acquired knowledge. These include the famous Repertory Grid method of delimiting and identifying field objects, the integrated model of KA, the MRM method, and the KADS model-based method, a comprehensive methodology for KA from multiple knowledge sources [8][9][10][11][12]. Second, there has been the development of techniques that improve KA automation and shorten the acquisition cycle [13][14][15][16]. These include pattern recognition, machine learning, and text mining techniques, such as the automatic KA method based on inductive learning, the incremental approach to discovering knowledge from text, and knowledge discovery [17][18][19]. All of these methods have their own inevitable disadvantages. KA methods that use pattern recognition or machine learning focus primarily on implicit rules contained in mass data and are suitable only for processing data text [20]. The rules obtained are still untested and have to be verified by field experts. KA methods based on text mining are often designed to process large amounts of text, and the knowledge obtained is generally inadequate. Additionally, they are always designed to investigate objects and object hierarchies [21,22]. There is a great gap between these knowledge rules and the production rules used in KBS.
Compared to traditional KA methods, the method suggested in this paper processes causal statements focused on one specific field. The algorithm proposed in this paper, natural language comprehension for rule extraction (NLCRE), is designed to obtain IF-THEN rules from scientific articles by labeling the causal statements in them, extracting antecedence and consequence using causal templates, and generating rules automatically using a syntax parsing tree.

Architecture of the Knowledge-Based System Based on NLCRE
Using the NLCRE algorithm, we developed a new knowledge-based system with a knowledge base that can expand incrementally. The overall architecture of the KBS is shown in Figure 1. The KBS architecture comprises four modules: 1) Causal-statement-finding module (white) The function of the causal-statement-finding module is to extract causal statements from the abstracts and conclusions of scientific articles.
2) Production-rule-generation module (gray) This module is responsible for generating rules from causal statements. Predefined templates are constructed based on the characteristics of causal statements. The antecedence and consequence portions of production rules can be obtained using template matching algorithms. Production rules are then generated by part-of-speech tagging and a syntax parsing tree based on natural language comprehension. After being filtered and refined manually, these production rules are added to the KBS.

Copyright © 2011 SciRes. IJIS

3) Field knowledge management module (purple) Production rules generated from scientific articles or acquired from field experts are stored and managed in the field KBS. Knowledge management includes knowledge update, knowledge addition, knowledge deletion, and other processes. In order to eliminate redundancy and conflict, these operations must be verified before they are performed.
The presentation of field knowledge in the natural language of scientific papers is quite different from the requirements of knowledge representation in a KBS. Translating natural language into formal language requires an essential transformation that converts labeled text into production rules. This transformation benefits the process of forming clear knowledge hierarchies and structures in knowledge systems. Furthermore, it improves the extensibility of knowledge and provides an effective means of increasing the efficiency of reasoning.
4) Inference device (green) Knowledge reasoning must be considered when designing a KBS. In this paper, the Drools Java rules engine is the main tool employed for knowledge reasoning. Drools is an open-source business rule engine and an enhanced Java language implementation based on the RETE algorithm [23].
In addition to traditional functions such as knowledge storage, management, and reasoning, the KBS presented in this paper can also make use of production rules extracted directly from great numbers of scientific papers and alter reasoning results accordingly. Because the production rules can be obtained automatically, the degree of KBS automation is improved and knowledge acquisition time is reduced relative to traditional methods of knowledge acquisition. In this way, computer-assisted knowledge acquisition processes can be realized.

NLCRE Production Rule Generation Algorithm

The NLCRE algorithm can be divided into four stages, as shown in Figure 2. These stages will now be described in detail.
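Before the detailed description, the four stages can be sketched end to end. The function names, the toy causal-marker list, and the regex patterns below are illustrative assumptions, not the paper's implementation:

```python
import re

# Illustrative sketch of the four NLCRE stages (names and patterns are assumptions).

CAUSAL_MARKERS = ["because", "so", "through", "if", "result of"]  # toy marker list

def label_causal_statements(sentences):
    """Stage 1: keep conclusive sentences that contain an explicit causal marker."""
    return [s for s in sentences if any(m in s.lower() for m in CAUSAL_MARKERS)]

def eliminate_noise(statement, noise_phrases=("as experiments have proved",)):
    """Stage 2: strip known noisy phrases from a causal statement."""
    for p in noise_phrases:
        statement = re.sub(re.escape(p), "", statement, flags=re.IGNORECASE)
    return statement.strip(" ,.")

def split_antecedent_consequent(statement):
    """Stage 3: match a 'Through A, B' style causal template -> (A, B)."""
    m = re.match(r"through (.+?), (.+)", statement, flags=re.IGNORECASE)
    return (m.group(1), m.group(2)) if m else None

def generate_rule(antecedent, consequent):
    """Stage 4 (simplified): format the pair as an IF-THEN production rule."""
    return f"IF ({antecedent}) THEN ({consequent})"
```

A real pipeline would chain these stages over all labeled sentences; each stage is refined in the sections that follow.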

Stage 1: Labeling of Causal Statements
The aim of this stage is to mine causal statements from great numbers of field-specific articles. This paper mainly processes the abstracts and conclusions of scientific papers because they contain the most useful field knowledge.
Text extraction of causal statements takes advantage of NLU and data mining technologies and uses existing field knowledge for guidance. This stage comprises the following steps: 1) Extract abstracts and conclusions from the scientific articles in question, then label and extract their conclusive sentences.
2) Label field glossaries, such as "NEPE propellant," "metal powder," "oxidants," "energy," and so on. Existing effective labeling techniques can be adopted in this step. These include Tregex, which is used to search and operate on tree data structures.
3) Label field operation terms, such as "increase," "decrease," and "add." These operation terms take the glossaries defined in the previous step as operation objects.
4) Identify causal words and use predefined causal templates with the terms and glossaries to extract causal statements from the conclusive sentences.
Causal statements can then be extracted from scientific articles and these sentences can be translated into production rules.
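The glossary and operation-term labeling of steps 2) and 3) can be sketched with simple pattern matching. The XML-like tag format and the tiny term lists are illustrative assumptions; in practice a tool such as Tregex would operate on parse trees rather than raw strings:

```python
import re

# Minimal sketch of Stage 1 labeling; the <gloss>/<op> tag format is an assumption.
GLOSSARY = ["NEPE propellant", "metal powder", "oxidants", "energy"]
OPERATION_TERMS = ["increase", "decrease", "add"]

def label_sentence(sentence):
    """Mark field glossary terms, then operation terms, in a conclusive sentence."""
    for term in GLOSSARY:
        sentence = re.sub(re.escape(term), f"<gloss>{term}</gloss>",
                          sentence, flags=re.IGNORECASE)
    for op in OPERATION_TERMS:
        sentence = re.sub(rf"\b{op}\b", f"<op>{op}</op>",
                          sentence, flags=re.IGNORECASE)
    return sentence
```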

Stage 2: Elimination of Noise
After stage 1, some noisy words that do not contain knowledge will inevitably be included in the causal statements. These noisy words affect the accuracy of the production rules and should be eliminated before converting the field causal statements into production rules. To this end, noise elimination templates must be constructed according to field and language characteristics. Because the labeled contents are often conclusive sentences that have fixed patterns and locations, it is practical to locate most of the similar high-frequency noisy words and separate them from the labeled causal statements using fuzzy matching algorithms. Table 1 shows key words often labeled as noisy content.
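Because noisy phrases recur with small variations, a fuzzy match against a noise template list can locate them. The sketch below uses the standard-library `difflib` as a stand-in for the paper's (unspecified) fuzzy matching algorithm, and the template list is illustrative:

```python
from difflib import SequenceMatcher

# Sketch of locating noisy spans by fuzzy matching against noise templates;
# the template list is illustrative and difflib stands in for the paper's
# fuzzy matching algorithm.
NOISE_TEMPLATES = ["it can be concluded that", "as experiments have proved"]

def find_noise_spans(statement, threshold=0.8):
    """Slide a word window over the statement; flag spans similar to a template."""
    words = statement.lower().split()
    spans = []
    for template in NOISE_TEMPLATES:
        n = len(template.split())
        for i in range(len(words) - n + 1):
            window = " ".join(words[i:i + n])
            if SequenceMatcher(None, window, template).ratio() >= threshold:
                spans.append((i, i + n))
    return spans
```

The flagged word spans can then be cut from the statement before template matching.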

Stage 3: Extraction of Antecedence and Consequence Phrases
Once the noisy words have been eliminated, it is crucial to convert the remaining labeled content into rough production rules describable as "IF (antecedence) THEN (consequence)" according to causal templates. This step determines the accuracy of the whole conversion, so it is important to construct effective causal templates with high accuracy. In field-specific scientific papers, causal relationships are usually presented in fixed patterns. Conversion templates meant to translate causal statements into production rules can be constructed by summarizing common presentation patterns. Detailed information is provided in the Appendix.
In the Chinese language, some words are used to indicate causal relations explicitly, such as "for," "because," "so," and so on. For sentences that have obvious causal relationships, it is quite easy to identify antecedents and consequents and fill in "IF (antecedent) THEN (consequent)" rules. For example, the causal statement "B is the result of A" can be converted to "IF (A) THEN (B)." However, sometimes causal statements are more complicated. Intricate "and" and "or" relationships may exist among many conditions and results. Conditions and results may even intersect. Conversion templates must be designed to deal with this complicated reality. They must also be tested and refined according to experimental results. This improves the effectiveness of the conversion from causal statements to production rules. After this stage, causal statements described in natural language are formatted into "IF (antecedent) THEN (consequent)" structures. The antecedents and consequents are refined and formatted into predicate logic to generate production rules correctly. The semantics of the production rules must be consistent with the causal statements. Integrity must be preserved during conversion.

Table 1. Key noise content words.
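The conversion templates described in this stage can be expressed as regular expressions. The three patterns below cover only the simple examples discussed above ("B is the result of A," "Through A, B," "If A, B") and are a sketch, not the paper's full template set:

```python
import re

# Causal conversion templates as regular expressions (a sketch covering only
# the simple patterns discussed above; the real template set is larger).
CAUSAL_TEMPLATES = [
    re.compile(r"^(?P<B>.+) is the result of (?P<A>.+)$", re.IGNORECASE),
    re.compile(r"^through (?P<A>.+?), (?P<B>.+)$", re.IGNORECASE),
    re.compile(r"^if (?P<A>.+?), (?P<B>.+)$", re.IGNORECASE),
]

def to_if_then(statement):
    """Match a causal statement against the templates -> 'IF (A) THEN (B)'."""
    statement = statement.strip().rstrip(".")
    for pattern in CAUSAL_TEMPLATES:
        m = pattern.match(statement)
        if m:
            return f"IF ({m.group('A')}) THEN ({m.group('B')})"
    return None
```

Intersecting or "and"/"or"-linked conditions would need additional, more elaborate patterns.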

Stage 4: Generation of Production Rules
The antecedent and consequent of each production rule must be separated from the causal statement. However, they are still represented in natural language and cannot be used to form production rules directly. This kind of knowledge, then, must be converted into first-order predicate expressions that have clear semantics and are suitable for use in production rules. In this stage, the parts of speech (POS) of the antecedent and consequent are tagged. Next, the POS tagging results are analyzed according to a parsing tree (Figure 3) and a bottom-up merge is carried out. At the same time, the order of the words and phrases in each antecedent and consequent is changed to eventually form predicate elements with structures such as "predicate (object, value)." The POS sequences of the antecedents and consequents are produced through this POS tagging. Then the verb-object structures contained in the antecedents and consequents can be identified. Finally, production rules are generated by translating the verb-object structures into first-order predicate expressions. Generally, the predicate (object, value) structure is the basic element of production rules, so the process of transforming antecedents and consequents into production rules can focus on generating predicate (object, value) structures.
In this paper, a parsing tree, which is summarized from scientific articles in one specific field, is used as a guide to analyze POS sequences and transform them into predicate structure elements. The conversion process is made up of the following steps: 1) Mark the POSes of the antecedents and consequents and represent them as speech pairs (word, speech tag). For example, an antecedent that literally says "increase content of metal fuel AL" may be represented with a speech pair list including (increase, V), (metal fuel, DV), (AL, DC), (of, "of"), and (content, DV). There are eight POS types in this system, as shown in Table 2.
2) Merge the appropriate speech pairs into noun phrases (NPs) according to the parsing tree (Figure 3) and mark the speech tag as NP. For example, according to the "DC.DV->NP" branch of the parsing tree, "metal fuel" and "AL" should be combined and tagged "NP" to form a new speech pair: "(metal fuel AL, NP)."
3) Combine verbs with NPs and other words according to the parsing tree to form predicate elements with the "predicate (object, value)" structure. For example, the speech pairs (increase, V), (metal fuel AL, NP), and (content, DV) must be merged to form a predicate element such as increase (AL, content). In this step, some fixed phrases must also be converted into formal formats according to the corresponding part of the parsing tree (Figure 3).
4) Merge the predicate elements bottom-up according to the parsing tree and change the relative positions of the predicate elements. When the merging process reaches the highest level, the rules are generated.

Implementation of Production Rule Generation via Algorithm

In the process of obtaining rules from scientific literature, the first three stages can be implemented using existing mature template matching algorithms. At the end of the third stage, the antecedence and consequence of each rule have been extracted from the causal statements and structured as "IF (antecedence) THEN (consequence)." However, the antecedence and consequence are still represented in the form of natural language. This section focuses on describing the algorithm used to convert them into predicate expressions according to the parsing tree shown in Figure 3.
1) This step can be implemented by querying the field variable table, the field constant table, etc. (Table 2).

Table 2. POS in the KBS.

POS Type: Corresponding Causal Statement Element
Field Variable (DV): variables, such as "metal powder," etc.
Field Constant (DC): constants, such as "AL," etc.
Degree Word (Adj, adjective): degree words, such as "mass of," etc.
And/Or Relationship Word (And/Or): logical words, such as "at the same time," "and," etc.
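The bottom-up merge of steps 2) through 4) can be sketched over (word, tag) speech pairs. The NP rule follows the "DC.DV->NP" branch quoted above; the V.NP.DV-to-predicate merge and the tuple representation are assumptions modeled on the increase (AL, content) example:

```python
# Bottom-up merge of (word, tag) speech pairs guided by parsing-tree branches.
# The NP rule mirrors the "DC.DV->NP" branch; the predicate rule's exact
# argument selection is an assumption based on the examples above.

def merge_np(pairs):
    """Fuse an adjacent field variable (DV) and field constant (DC) into an NP."""
    out, i = [], 0
    while i < len(pairs):
        if i + 1 < len(pairs) and {pairs[i][1], pairs[i + 1][1]} == {"DV", "DC"}:
            first, second = pairs[i], pairs[i + 1]
            dv, dc = (first, second) if first[1] == "DV" else (second, first)
            out.append((f"{dv[0]} {dc[0]}", "NP"))
            i += 2
        else:
            out.append(pairs[i])
            i += 1
    return out

def merge_predicate(pairs):
    """Fold a V.NP.DV triple into a predicate(object, value) element."""
    out, i = [], 0
    while i < len(pairs):
        if (i + 2 < len(pairs) and pairs[i][1] == "V"
                and pairs[i + 1][1] == "NP" and pairs[i + 2][1] == "DV"):
            out.append((f"{pairs[i][0]}({pairs[i + 1][0]}, {pairs[i + 2][0]})", "PRED"))
            i += 3
        else:
            out.append(pairs[i])
            i += 1
    return out
```

A full implementation would apply every branch of the parsing tree repeatedly until only top-level predicate elements remain.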

Experiments
This section describes two instances of the algorithm used for deriving rules from scientific articles under simulated use.

Example 1
Stage 1: A causal statement is extracted from the paper "Effect of RDX Particle Size on Properties of CMDB Propellant" [24]: "As experiments have proved, adding moderate amounts of AP and decreasing the granularity and granularity gradation of high-energy composite propellants can improve flammability."
Stage 2: After eliminating the noise in the statement, the result becomes "through adding moderate AP, decreasing its granularity and granularity gradation in high-energy composite propellant, the flammability will be improved."
Stage 3: The antecedent and consequent are derived by matching causal relation templates (Through A, B: IF (A) THEN (B)) and represented as IF (adding moderate AP, decreasing its granularity and granularity gradation in high-energy composite propellant) THEN (the flammability will be improved).
Stage 4: In this stage, the predicate expressions are derived from the antecedent and consequent according to the parsing tree. This can be implemented following the instructions in Section 4.4.
1) By querying the field variable table (e.g., high-energy composite propellant), the field constant table (e.g., AP), etc. (Table 2), the structured speechPairs of the antecedent and consequent are produced.
2) The process of merging the speechPairs based on the parsing tree is shown in Figure 4.
3) The production rule is finally derived from the causal statement as follows: IF (high-energy composite propellant && add (AP, moderate) && decrease (granularity) && make (granularity gradation)) THEN (improve (flammability)).
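The final assembly in step 3) amounts to joining the predicate elements with "&&" inside the IF-THEN frame. A minimal sketch (the function name is an illustrative assumption):

```python
def assemble_rule(antecedent_preds, consequent_preds):
    """Join predicate elements with '&&' and wrap them in the IF-THEN frame
    shown in step 3 above (the function name is an illustrative assumption)."""
    return (f"IF ({' && '.join(antecedent_preds)}) "
            f"THEN ({' && '.join(consequent_preds)})")
```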

Example 2
1) After querying the field variable table (e.g., average size), the field constant table (e.g., RDX), the field verb table (e.g., decrease, increase), etc. (shown in Table 2), the structured speechPairs are produced.
2) According to the parsing tree, the merging process of the speechPairs takes place as shown in Figure 5.

Conclusions and Possibilities for Future Research
This paper is based on a well-noted fact: in specific professional fields, experts often use relatively fixed patterns to describe their findings when they publish scientific articles. Moreover, the abstracts and conclusions, the essence of these articles, contain abundant field knowledge. Labeling and summarizing those statements in abstracts and conclusions that have apparent causal relationships, transforming these relationships into formats that can be used in KBSes, and then using an automatic process to represent this knowledge as production rules may greatly improve the effectiveness and accuracy of knowledge acquisition.
In this paper, a pattern-text-mining-based method of acquiring production rules from field-specific papers is proposed. The method takes texts that contain field knowledge as input. First, the summarizing statements in the abstracts and conclusions of the articles are processed using mature technologies, such as text mining, and causal statements are extracted from them. Then production rules are derived by POS analysis and parsing tree processing.
Two experiments showed that the proposed method can be used effectively to obtain production rules from summarizing statements. In future studies, the algorithms used for POS analysis can be investigated to enhance the algorithm's self-learning ability. At the same time, the process of converting causal statements into production rules can be refined to further improve the efficiency of field knowledge acquisition.

Figure 1 .
Figure 1. Architecture of the Knowledge-Based System Based on NLCRE.

Stage 1: Another example of a statement with a causal relationship was extracted from the paper "Influence of Ammonium Perchlorate and Aluminum Powder on the Combustion Characteristics of AP-CMDB Propellant" [25]: "If average size of RDX decreases from 92.02 μm to 17.35 μm, pressure index will increase 7.3%."
Stage 2: The above statement does not contain any noise, so it is not changed during this stage.
Stage 3: By matching the causal relation template (If A, B: IF (A) THEN (B)), the antecedent and consequent can be extracted from the statement and represented as follows: IF (average size of RDX decreases from 92.02 μm to 17.35 μm) THEN (pressure index will increase 7.3%).