Text Rank is a popular tool for obtaining words or phrases that are important for many Natural Language Processing (NLP) tasks. This paper presents a practical approach to domain-specific Text Rank using Field Association (FA) words. We present a keyphrase extraction technique not for a single document, but for a particular domain. First, we build a field tree for the specific domain. Second, we collect a list of perfect FA terms and compound FA terms from the specific domain, which are treated as candidate keyphrases. We then combine word node weights and field tree relationships into a new approach for generating keyphrases from a particular domain. Experiments with the modified approach show that the new technique using FA terms outperforms approaches that use normal words, and that its precision reaches 90%.
The knowledge available on the web today is practically unlimited. It frequently includes high-quality data in the form of web pages. However, automatically identifying relevant information and choosing the best set of data for a specific information need is not an easy task. Text Rank is a ranking algorithm for natural language based on the general concept of a graph [
The remainder of the paper is organized as follows: Section 2 describes FA words and the methodology for extracting them. Text Rank for extracting FA words is defined in Section 3. Section 4 presents the corpus construction and the experimental results.
Traditional methods of text classification and document similarity are based on word information from whole documents. The key idea in our study is to extract terms called Field Association (FA) words, which can identify fields from specific words without reading the whole document. For example, the word “election” can indicate the document field “Politics”.
Document fields can be determined efficiently if there are many FA words and their frequency rate is high. FA words can therefore be described at five levels. The traditional method built the FA word dictionary by adding new FA word candidates manually, but many FA words were never appended to the dictionary, and revising it took much time. A new method for selecting English compound FA terms and appending them to the dictionary automatically is presented in [
Definition 1: A standard FA word indicates a minimal unit (word) with intended meaning defining a given area.
Definition 2: A compound FA term is composed of multiple single FA words. For example, “information system” is a compound FA term in the machine learning field.
Definition 3: A field tree is a graphical description of the associations between document domains. Leaf nodes correspond to terminal fields in the field tree and serve as the baseline of information. The knowledge base constitutes a dictionary of FA terms.
A field tree which contains 11 superfields, 70 median fields and 321 terminal fields (subfields) is used in our analysis. In
Every FA word is connected to a particular field within a dynamic field tree, such as the one shown in
Many FA words identify a single field on their own, although some FA words may refer to several domains. Each FA word therefore has a different scope for associating with a field. Taking this into consideration, FA terms are graded into five distinct levels [
By traditional words we mean either index terms or terminology. An index term is a term that captures the essence of a document’s subject matter and is usually used in document retrieval [
Terminology studies terms through analysis of their use in context, for the purpose of recording and promoting correct usage [
FA word | Field association path | Level |
---|---|---|
Microelectronics | < technology\electronics > | 1 |
Biocontrol | < technology\biological science >, < technology\agriculture > | 2 |
Information | < technology > | 3 |
Seed coating | < technology\agriculture >, < industry\agrofood industry > | 4 |
The, he, she, it, in, on, ... | | 5 |
FA words provide a more structured analysis of words that can distinguish between different fields. FA words are linked to fields in a classification scheme called a field tree, and the aim of FA words is to build comprehensive, in-depth knowledge. At the level of individual terms, both index words and terminology are similar to FA terms, but they also differ in several respects. FA words resemble index terms in that both involve selecting strings that are statistically and contextually significant, together with the names of persons, places, and so on used in a particular article. However, index words are less clearly defined than FA words, and the choice of index terms can depend heavily on the user and the purpose for which they are chosen. Unlike FA words, index words may sometimes also consist of word stems.
Terminology and concept studies may exclude the names of persons and places that qualify as FA words. Likewise, topic terminology that does not appear in the documents does not count as FA terms. Also, for each domain, traditional words are typically handled in isolation [
Therefore, the units to rank are sequences of one or more lexical terms extracted from the text, and these form the vertices of the text graph [
If FA words are used to index each document in a collection, a document $D_i$ may be represented as a vector of terms with document-term weights [
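The FA word weight has two key components:
1) The frequency of occurrence $\mathrm{FAtf}_{ik}$ of the FA term $k$ in the document in question $D_i$.
2) The inverse document frequency $\mathrm{FAidf}_{k}$ of the FA term $k$. The weight $\mathrm{FAw}_{ik}$ is computed as
$$\mathrm{FAw}_{ik} = \frac{\mathrm{FAtf}_{ik}}{\sqrt{\sum_{k} (\mathrm{FAtf}_{ik})^{2}}} = \frac{\mathrm{FAtf}_{ik} \cdot \mathrm{FAidf}_{k}}{\sqrt{\sum_{k} (\mathrm{FAtf}_{ik} \cdot \mathrm{FAidf}_{k})^{2}}}$$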
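As an illustration of the weighting above, the following is a minimal Python sketch rather than the authors' implementation. It assumes the denominator is the usual Euclidean (cosine) normalization over the FA terms of a document. The paper does not state the exact form of the inverse document frequency, so `log(N / df) + 1` is used here purely as an assumption. The function name `fa_weights` and its arguments are hypothetical.

```python
import math
from collections import Counter


def fa_weights(docs, fa_terms):
    """Return one {FA term: weight} dict per document.

    docs: list of token lists; fa_terms: set of known FA terms.
    """
    n_docs = len(docs)
    # Term frequency of each FA term in each document (non-FA tokens are ignored).
    tfs = [Counter(tok for tok in doc if tok in fa_terms) for doc in docs]
    # Document frequency: number of documents that contain the term.
    df = Counter(term for tf in tfs for term in tf)
    # Assumed idf form (not specified in the paper).
    idf = {term: math.log(n_docs / df[term]) + 1.0 for term in df}

    weights = []
    for tf in tfs:
        raw = {term: tf[term] * idf[term] for term in tf}
        # Euclidean norm of the tf * idf vector of this document.
        norm = math.sqrt(sum(v * v for v in raw.values()))
        weights.append({t: v / norm for t, v in raw.items()} if norm else {})
    return weights


docs = [["election", "vote", "the", "election"], ["seed", "the", "vote"]]
print(fa_weights(docs, {"election", "vote", "seed"}))
```

With this normalization the squared weights of each document sum to 1, so documents of different lengths remain comparable.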
The formula of Text Rank is proposed in [
$$WS(V_i) = (1 - d) + d \cdot \sum_{V_j \in In(V_i)} \frac{\mathrm{FAw}_{jk}}{\sum_{V_k \in Out(V_j)} \mathrm{FAw}_{jk}} \, WS(V_j) \qquad (1)$$
The damping factor $d$ is normally set to 0.85. $\mathrm{FAw}_{jk}$ is the weight of the edge from the prior node $V_j$ to the node $V_k$. $In(V_i)$ is the set of nodes that point to $V_i$ (predecessors). $Out(V_j)$ is the set of nodes that node $V_j$ points to (successors). $\sum_{V_k \in Out(V_j)} \mathrm{FAw}_{jk}$ is the sum of all edge weights leaving the previous node $V_j$.
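The iteration behind Equation (1) can be sketched in a few lines of Python. This is a hedged illustration, not the authors' code. It assumes a directed graph given as a dictionary of weighted edges, takes the numerator to be the weight of the edge from $V_j$ into $V_i$ (as in the standard weighted form of Text Rank), and runs a fixed number of iterations instead of testing for convergence. The name `text_rank` and its parameters are hypothetical.

```python
def text_rank(edges, d=0.85, iterations=50):
    """Score nodes of a weighted directed graph.

    edges: {(source, target): weight}. Returns {node: score}.
    """
    nodes = {n for edge in edges for n in edge}
    # Total outgoing edge weight per node (the denominator in Equation (1)).
    out_sum = {n: 0.0 for n in nodes}
    for (src, _dst), w in edges.items():
        out_sum[src] += w

    scores = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        new_scores = {}
        for node in nodes:
            # Sum the contributions of every predecessor of `node`.
            incoming = sum(
                w / out_sum[src] * scores[src]
                for (src, dst), w in edges.items()
                if dst == node and out_sum[src] > 0
            )
            new_scores[node] = (1 - d) + d * incoming
        scores = new_scores
    return scores


# Hypothetical graph: "a" receives edges from both "b" and "c".
graph = {("b", "a"): 1.0, ("c", "a"): 1.0, ("a", "b"): 1.0, ("a", "c"): 1.0}
print(text_rank(graph))
```

In this toy graph the node `a` ends up with the highest score because two nodes point to it, while `b` and `c` each receive only half of the outgoing weight of `a`.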
Algorithm 1 describes our method for ranking text for specific fields. The inputs to this algorithm are the selected set of candidate FA words and a threshold µ. In this algorithm we calculate the concentration ratio as follows. For the parent field < S > and the child field < C >, the concentration ratio Concentration(w, < C >) of the FA word w in the field < C > is defined as in step 4. For the root < S > and the child field < S/C > of the field tree, the condition in step 5 is used to judge whether or not the word w is a perfect FA (PFA) word. The weight and rank of these words are then calculated in steps 6 and 7. The same steps are repeated for semi-perfect FA (SPFA) words in steps 8, 9 and 10.
Algorithm 1:
Input: 1- w, a set of candidate FA words
2- Normalization(w, < S >) for w and < S >
3- µ, a threshold to judge FA word ranks
Output: weighted PFA and SPFA words
Method
1) Set PFA = {}, SPFA = {}
2) Set root = < S >, set child = < S/C >.
3) For the root < S > and any child < S/C >
4) Calculate conc(w, < C >) = Normalization(w, < C >) / Normalization(w, < S >),
where Normalization(w, < T >) = Frequency(w, < T >) / Total-Frequency(< T >)
5) If conc(w, < S >) ≥ µ ∧ conc(w, < S/C >) ≥ µ
Then set w in class PFA
6) $\mathrm{PFAw}_{ik} = \frac{\mathrm{PFAtf}_{ik}}{\sqrt{\sum_{k} (\mathrm{PFAtf}_{ik})^{2}}} = \frac{\mathrm{PFAtf}_{ik} \cdot \mathrm{PFAidf}_{k}}{\sqrt{\sum_{k} (\mathrm{PFAtf}_{ik} \cdot \mathrm{PFAidf}_{k})^{2}}}$
7) $WS(V_i) = (1 - d) + d \cdot \sum_{V_j \in In(V_i)} \frac{\mathrm{PFAw}_{jk}}{\sum_{V_k \in Out(V_j)} \mathrm{PFAw}_{jk}} \, WS(V_j)$
Else
8) If conc(w, < S >) ≥ µ ∧ conc(w, < S/C >) < µ
Then set w in class SPFA
9) $\mathrm{SPFAw}_{ik} = \frac{\mathrm{SPFAtf}_{ik}}{\sqrt{\sum_{k} (\mathrm{SPFAtf}_{ik})^{2}}} = \frac{\mathrm{SPFAtf}_{ik} \cdot \mathrm{SPFAidf}_{k}}{\sqrt{\sum_{k} (\mathrm{SPFAtf}_{ik} \cdot \mathrm{SPFAidf}_{k})^{2}}}$
10) $WS(V_i) = (1 - d) + d \cdot \sum_{V_j \in In(V_i)} \frac{\mathrm{SPFAw}_{jk}}{\sum_{V_k \in Out(V_j)} \mathrm{SPFAw}_{jk}} \, WS(V_j)$
End
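The classification part of Algorithm 1 (steps 4, 5 and 8) could look roughly like the Python sketch below. It is only an illustration under stated assumptions: the helper names (`normalization`, `concentration`, `classify_fa_word`) are hypothetical, the frequency counts in the usage lines are invented for demonstration, and the step 5 test is read as requiring both concentration ratios to reach the threshold µ.

```python
def normalization(frequency, total_frequency):
    """Normalization(w, <T>): frequency of w in field T over the total frequency of T."""
    return frequency / total_frequency if total_frequency else 0.0


def concentration(norm_child, norm_parent):
    """Step 4: ratio of the child-field normalization to the parent-field one."""
    return norm_child / norm_parent if norm_parent else 0.0


def classify_fa_word(conc_root, conc_child, mu):
    """Steps 5 and 8: return 'PFA', 'SPFA', or None for a candidate word.

    conc_root stands for conc(w, <S>), conc_child for conc(w, <S/C>).
    """
    if conc_root >= mu and conc_child >= mu:
        return "PFA"   # perfect FA word
    if conc_root >= mu and conc_child < mu:
        return "SPFA"  # semi-perfect FA word
    return None        # not selected by Algorithm 1


# Hypothetical counts: w occurs 30 times among 100 words of the child field
# and 40 times among 1000 words of the parent field.
ratio = concentration(normalization(30, 100), normalization(40, 1000))
print(ratio, classify_fa_word(0.9, 0.8, mu=0.5), classify_fa_word(0.9, 0.2, mu=0.5))
```

The words placed in each class would then be weighted and ranked with the formulas of steps 6 and 7 or steps 9 and 10.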
Applying Algorithm 1 to some documents first extracts the PFA and SPFA words, as in
Field association terms serve as a highly simplified description of a domain, which can be used as domain labels and in text classification, information retrieval, and so on. The presented approach is tested by applying it to 38 MB of text databases (newspapers, technical reports, etc.) comprising 11 superfields, 70 median fields and 321 terminal fields.
Document | TELECOMMS | MICROELECTRONICS | INFORMATION PROCESSING | Graphics | Middleware | MULTIMEDIA | NETWORKING | MODELLING |
---|---|---|---|---|---|---|---|---|
d1 | 0.24 | 0.25 | 0 | 0.59 | 1.17 | 0.2 | 0 | 0 |
d2 | 0 | 0 | 0 | 0 | 0 | 0.44 | 0.54 | 0.6 |
d3 | 0.25 | 0.24 | 0.21 | 0.44 | 0 | 0 | 0 | 0 |
d4 | 0.22 | 0.24 | 0.22 | 0 | 0 | 0 | 0.57 | 0 |
The first step is data preparation of the field documents, which involves loading the dataset of the domain files, word segmentation, stop word elimination, and candidate term selection. The second step is to extract the perfect and semi-perfect FA terms. Third, the weights of the FA words are calculated. Next, the ranking formulas are applied to the prepared dataset to compute node weights, edge weights, and FA word rankings. Finally, the top-ranked PFA and SPFA words are taken as domain keyphrases.
Our experiment compares Text Rank using normal keywords with Text Rank using FA words.
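The first and last steps of this pipeline can be sketched as follows. This is a simplified illustration, not the preprocessing actually used in the experiments: the stop word list is a tiny hypothetical sample, the tokenizer is a plain regular expression over lowercase ASCII letters, and the helper names `candidate_terms` and `top_keyphrases` are assumptions.

```python
import re

# Small illustrative stop word list; a real system would use a fuller one.
STOP_WORDS = {"the", "he", "she", "it", "in", "on", "a", "an", "of", "and", "is"}


def candidate_terms(text, stop_words=STOP_WORDS):
    """Split text into lowercase word tokens and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [tok for tok in tokens if tok not in stop_words]


def top_keyphrases(scores, n=10):
    """Return the n highest-scoring terms; ties are broken alphabetically."""
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _score in ranked[:n]]


print(candidate_terms("The election is on Monday."))
print(top_keyphrases({"seed": 0.5, "election": 0.9, "vote": 0.5}, n=2))
```

The scores passed to `top_keyphrases` would come from the ranking step, so the two helpers only bracket the FA extraction and weighting stages described above.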
Precision, Recall and F-measure are used to evaluate the relevance of the given methods and are defined as follows:
$$\text{Recall}\ (R) = \frac{\text{Correctly classified documents}}{\text{Total correct documents}}$$
$$\text{Precision}\ (P) = \frac{\text{Correctly classified documents}}{\text{Total retrieved documents}}$$
$$\text{F-measure} = \frac{2 \times P \times R}{P + R}$$
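As a small worked example of these three measures, the Python sketch below computes them from two sets of document identifiers. The function name `precision_recall_f` and the sample identifiers are hypothetical, and empty inputs are mapped to 0.0 by assumption to avoid division by zero.

```python
def precision_recall_f(retrieved, relevant):
    """Return (precision, recall, f_measure) for two collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    correct = len(retrieved & relevant)  # correctly classified documents
    precision = correct / len(retrieved) if retrieved else 0.0
    recall = correct / len(relevant) if relevant else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Four documents retrieved, three are truly in the field, two overlap.
print(precision_recall_f({1, 2, 3, 4}, {2, 3, 5}))
```

Here two of the four retrieved documents are correct, so precision is 0.5, recall is 2/3, and the F-measure is 4/7.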
Precision, Recall and F-measure for the 11 super-fields are measured using normal keywords (NW) and FA words (FAW), as shown in
The evaluation results show that the best performance is achieved by Text Rank with FA words, as is evident in
Name of field | Precision (NW) | Precision (FAW) | Recall (NW) | Recall (FAW) | F-measure (NW) | F-measure (FAW) |
---|---|---|---|---|---|---|
Electronics | 0.59 | 0.78 | 0.79 | 0.98 | 0.69 | 0.88 |
Transport technology | 0.22 | 0.84 | 0.62 | 0.90 | 0.74 | 0.86 |
Industrial technologies | 0.48 | 0.89 | 0.74 | 0.95 | 0.64 | 0.92 |
Energy | 0.71 | 0.89 | 0.79 | 0.98 | 0.69 | 0.89 |
Physical and exact science | 0.48 | 0.90 | 0.63 | 0.99 | 0.71 | 0.94 |
Biological science | 0.55 | 0.80 | 0.64 | 0.93 | 0.63 | 0.86 |
Agriculture | 0.77 | 0.85 | 0.81 | 0.9 | 0.65 | 0.95 |
Measurements and standards | 0.64 | 0.88 | 0.75 | 0.95 | 0.69 | 0.91 |
Protecting and environment | 0.59 | 0.81 | 0.64 | 0.95 | 0.73 | 0.87 |
Agrofood industry | 0.61 | 0.84 | 0.72 | 0.98 | 0.82 | 0.90 |
Social and economic concerns | 0.43 | 0.86 | 0.69 | 0.97 | 0.70 | 0.89 |
Extraction of domain keyphrases is important for many natural language processing tasks. Our method extracts keyphrases from all domains. We analyzed an extension of Text Rank that uses field association words and node weights. Experiments indicate that Text Rank accuracy improves markedly with Algorithm 1, and that the PFA and SPFA weights give the highest precision when the top-ranked words are extracted from a domain corpus.
Experiments also show that the extraction precision of FA Text Rank reaches 90 percent when keyphrases are extracted from our corpus.
The authors declare no conflicts of interest regarding the publication of this paper.
El Barbary, O.G. and Atlam, El-Sayed (2020) Text Rank for Domain Specific Using Field Association Words. Journal of Computer and Communications, 8, 69-79. https://doi.org/10.4236/jcc.2020.811005