A New Approach of Time Series Variation Based on Power Links and Field Association Words

This paper proposes a new methodology for extracting the stability classes of field association words, based on automatic power link analysis, to enhance the precision of a decision tree. We study the effects of time variation on the frequencies of specific words, called field association words, that are connected to documents through power links in a specific period. The stability classes refer to the popularity of field association words as it changes over a given period. The new approach was evaluated experimentally on 1575 files (about 5.16 MB). In these experiments, the F-measure for the ascending, stable and descending classes reached 93.6%, 99.8% and 75.7%, respectively. These results mean that the F-measure increased by 12%, 4% and 34% over traditional methods because of the power link analysis.


Introduction
Recently, a huge amount of text has come to be processed automatically by computers. Documents can be managed by computers to retrieve important information that can be used for searching, clustering, classifying, etc. Many words occur frequently in documents and are generally strongly related to the field of the document. Extracting important words from documents is highly useful for the Information Retrieval (IR) field. IR refers to the process of finding relevant information on topics that concern the user and then retrieving that information. Generally, the frequencies of words in texts change over time and are often linked with a particular period. For example, the word "flu" is more widespread in winter, and "stormy" is more widespread when there are strong winds and, usually, rain. Moreover, the groups of words linked with search terms on a search engine such as Yahoo or Google change with time (Ohkubo et al. [1]). Fields such as sport and political figures always change over time.
Traditional approaches [2]-[8] for searching, classifying, clustering and analyzing texts are unable to determine which words are specific to a particular period of time.
The approach of Atlam et al. [9] [10] points out that the popularity of words changes over time, using the ordinary keywords that represent documents. However, ordinary keywords do not represent documents correctly, because this approach ignores the importance of the connection between words and their fields.
By finding specific words in a document, the reader can easily decide the field of the document without needing to read the whole text. These specific words or units are called Field Association (FA) words. FA words are the smallest set of words by which the reader of a text can decide its field. Another method by Atlam et al. [11] represents FA words across time based on the traditional FA algorithm of Atlam et al. [12]. However, this algorithm has some drawbacks: some irrelevant words that are not restricted to a particular field are selected as FA word candidates, while some words that are restricted to that particular field are not selected as FA words [13] [14].
Power Links Analysis (PLA) was developed by Rokaya and Atlam [15] to analyze the co-occurrences of terms in the publications of a given topic, which solves the drawback of the Atlam approach described above. This approach is based on an advanced form of frequency as well as on the distance between different instances of the given words. This additional information captures the distribution of terms among different parts of a given document. PLA is based on the assumption that the words of a document give an adequate representation of its content, and it depends on the quantified frequencies of two terms and the distances between their different instances. Moreover, a time series of such maps can trace the dynamic changes in this conceptual space. Further, PLA can be used to extract more relevant FA words from text documents [16].
The novelty of this paper is the use of Power Links Analysis (PLA), whereas traditional methods have discussed only the effects of time change on the frequencies of FA words. Using PLA changes the levels of some words, which affects the decision tree (DT) results and improves precision, recall and F-measure, as shown in the results of this paper. Our paper suggests a methodology for automatic evaluation of the Stability (St) classes using the decision tree C4.5 algorithm of Quinlan [17] on the FA terms based on PLA. This methodology is intended to indicate the popularity of FA terms as it changes over time and to improve the DT precision.
This paper is organized as follows. Section 2 introduces related work and FA words with their levels. Section 3 presents the concept of PLA. Section 4 introduces our new methodology for evaluating the St classes that indicate the popularity of FA words using PLA in a certain time period. Section 5 presents the experimental observations. Finally, the conclusion and possible future work are given in Section 6.

Related Work
Usually, a document collection in IR changes as time passes. The collection at time $t$ contains, for instance, $M_t$ documents and $K_t$ tokens, and a term $q$ has frequency $q_t$. At time $t + 1$, these quantities will have changed, because documents are added to or deleted from the collection over time and the collection frequencies of words change accordingly. Atlam et al. [11] proposed a method to describe the popularity of words over time based on their frequency in text data from past years. This approach defines a number of attributes and three classes of stability, as an index of the spread of words, to capture the frequency change of words quantitatively. Furthermore, a decision tree is used to estimate these classes. However, this approach used common keywords to represent the documents, which are not the best representatives. The method entirely neglected the importance of the relationship between words and their fields.
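The per-period quantities described above (number of documents, number of tokens and the frequency of a term) can be illustrated with a minimal sketch. The function name `collection_stats`, the whitespace tokenizer and the toy documents are assumptions made for illustration only; they are not taken from the paper.

```python
from collections import Counter

def collection_stats(docs_by_period):
    """Return {period: (num_docs, num_tokens, term_frequencies)}.

    Tokenization is a naive lowercase whitespace split (an assumption).
    """
    stats = {}
    for period, docs in docs_by_period.items():
        freq = Counter()
        for text in docs:
            freq.update(text.lower().split())
        stats[period] = (len(docs), sum(freq.values()), freq)
    return stats

# Toy collection: documents are added between the two periods.
docs_by_period = {
    2014: ["flu season starts", "flu vaccine news"],
    2015: ["insulin study", "flu cases fall", "insulin prices rise"],
}

stats = collection_stats(docs_by_period)
for period in sorted(stats):
    num_docs, num_tokens, freq = stats[period]
    print(period, num_docs, num_tokens, freq["flu"], freq["insulin"])
```

Comparing the tuples for consecutive periods shows how the collection size and the frequency of a given term move together over time.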
Atlam et al. [9] studied the effects of time change on the frequency of specific words, called FA words, using a decision tree. They presented a number of features to study how FA word frequency changes over time, and three stability classes that refer to the popularity of an FA term across time. However, this method depended on the traditional FA algorithm of Atlam et al. [10], which produced some irrelevant FA terms that are not restricted to the specific field.
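To make the three stability classes concrete, the following sketch labels a frequency series as ascending, stable or descending from its least-squares slope. This is only an illustration of what the classes mean: the cited methods estimate the classes with a decision tree over several features, and the slope rule, the function names and the tolerance `tol` are assumptions introduced here.

```python
def slope(values):
    """Least-squares slope of values against their index 0..n-1."""
    n = len(values)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den

def stability_class(freqs, tol=0.5):
    """Map a per-period frequency series to a stability label.

    tol is a hypothetical threshold on the slope (frequency units per period).
    """
    s = slope(freqs)
    if s > tol:
        return "ascending"
    if s < -tol:
        return "descending"
    return "stable"

print(stability_class([2, 5, 9, 14]))     # rising frequencies
print(stability_class([10, 10, 11, 10]))  # nearly flat frequencies
print(stability_class([12, 8, 5, 1]))     # falling frequencies
```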
Co-word analysis considers the dynamics of science as a result of actor strategies. This technique should, in principle, allow the reader to identify the actors and explain the global dynamics [18]-[26]. Rokaya and Atlam [15] proposed a method for building a dynamic FA word dictionary using PLA. Furthermore, this algorithm presented new rules to enhance the quality of the FA term dictionary in English. Moreover, the PLA algorithm used a technique to extract and refine confusion sets in order to provide context-sensitive spell checking based on FA words [27]. This technique combined the advantages of the statistical and machine learning methods of Rokaya et al. [27], which were used to build a real-word spell checker in English and Arabic. The PLA approach was also used to classify important and advertising messages and to reduce spam [28] [29].

1) FA WORDS
In this paper, a document field is a fundamental and popular body of knowledge that can be used in human communication. For example, <MEDICINE/Diseases/Pollen Allergy> describes a path on the tree with super-field <MEDICINE>, sub-field <Diseases> and terminal field <Pollen Allergy> [5]. The tree structure was organized to illustrate the associations among document fields through the field tree of Dozawa [30].
2) FA WORDS LEVELS
Some FA words can be decided by only one specific field, whilst others may be decided by two or more fields. Depending on how well an FA word points to specific fields, there are five different levels, as follows:
a) Ideal FA words: words associated with one sub-field (e.g., influenza, chemotherapy and insulin).
b) Semi-ideal FA words: words associated with several sub-fields in one super-field (e.g., sneeze and cough).
c) Medial FA words: words associated with a single super-field (e.g., blood and hospital).
d) Various FA words: words associated with several sub-fields of different super-fields (e.g., program and win).
e) Non-FA words: words that are not related to any field and do not decide one (e.g., rule and size).
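The five levels above can be read as a rule over the set of fields a word is associated with. The sketch below is one such reading and is not the traditional algorithm, which uses term frequency and concentration ratio. The representation of an association as a `(super_field, sub_field)` pair, with `None` as the sub-field when a word is tied to the super-field as a whole, is an assumption, as are the example associations.

```python
def fa_level(fields):
    """Classify a word from its set of (super_field, sub_field) associations."""
    if not fields:
        return "non-FA"                      # e) no field association
    super_fields = {sup for sup, _ in fields}
    if len(super_fields) > 1:
        return "various"                     # d) spans different super-fields
    if any(sub is None for _, sub in fields):
        return "medial"                      # c) tied to the super-field itself
    if len(fields) == 1:
        return "ideal"                       # a) exactly one sub-field
    return "semi-ideal"                      # b) several sub-fields, one super-field

# Hypothetical associations for words used as examples in the list above.
print(fa_level({("MEDICINE", "Diseases")}))
print(fa_level({("MEDICINE", "Diseases"), ("MEDICINE", "Symptoms")}))
print(fa_level({("MEDICINE", None)}))
print(fa_level({("SPORTS", "Football"), ("POLITICS", "Elections")}))
print(fa_level(set()))
```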
The traditional algorithm of Atlam et al. [10] was used to judge these levels and to determine the FA words automatically, based on term frequency and concentration ratio. The resulting FA terms are used as input to the algorithm that computes the ranks of FA words depending on PLA [31]. To judge the level of FA words, the traditional algorithm [10] takes as input the list of words selected from a corpus that comprises groups of documents in different fields.

Power Link Analysis (PLA)
The concept of PLA reflects the value of a word in terms of its relation to the other words in the document. Moreover, each document is represented by the average of the PLA between the current term and the terms in the same document.

PLA Steps
In the following sub-sections, we introduce three main concepts for power link analysis.
1) TERM TO TERM PLA
Suppose we have two terms $t_1$ and $t_2$ that belong to a document D. Sometimes there is a link between $t_1$ and $t_2$; such a link is measured by the function $LT(t_1, t_2)$. $LT$ is symmetric, which means $LT(t_1, t_2) = LT(t_2, t_1)$.
2) TERM TO DOCUMENT PLA
By using the PLA between two terms as mentioned above, the PLA of a term to a document related to a given field can be represented as in Figure 1.
The PLA between a term t and a document D related to a given field S is denoted by $LTD(t, D, S)$ and is computed over n FA words, where n is the number of FA words that are related to the field S and exist in document D.
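As an illustration of the term-to-document link, the sketch below averages a term-to-term link over the FA words of a field that occur in the document, following the earlier statement that a document is represented by the average PLA between the current term and the terms in the same document. The link function `lt` is a stand-in that rewards frequent and nearby co-occurrences; it is an assumption and not the formula of the PLA papers. Excluding the term itself from the average is also an assumption.

```python
def positions(tokens, term):
    """Indices at which term occurs in the token list."""
    return [i for i, tok in enumerate(tokens) if tok == term]

def lt(tokens, t1, t2):
    """Stand-in term-to-term link (assumption, not the paper's formula).

    Sums 1 / (1 + distance) over all pairs of occurrences, so it grows with
    frequency and shrinks with distance, and is symmetric by construction.
    """
    return sum(1.0 / (1 + abs(i - j))
               for i in positions(tokens, t1)
               for j in positions(tokens, t2))

def ltd(tokens, term, field_fa_words):
    """Average link between term and the field's FA words present in the document."""
    present = sorted(w for w in set(field_fa_words) if w != term and w in tokens)
    if not present:
        return 0.0
    return sum(lt(tokens, term, w) for w in present) / len(present)

tokens = "insulin lowers blood sugar in diabetes patients".split()
fa_words = {"diabetes", "blood", "plague"}  # hypothetical FA words of one field
print(lt(tokens, "insulin", "blood"), lt(tokens, "blood", "insulin"))
print(ltd(tokens, "insulin", fa_words))
```

Here n is 2, because only "blood" and "diabetes" occur in the document, and "plague" does not contribute.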
3) TERM TO FIELD PLA
The power link of a term t to a field S is represented by LTS. Its definition relies on two quantities: the number of FA terms that identify the field S and occur in a document whenever the term t occurs, and nd, the number of documents that contain both FA words that identify the field S and the term t.
Our new approach uses the algorithm for evaluating the levels of FA terms based on PLs [25] and then checks the effect of the time change using the new methodology. The selection of FA words depends on the PLAs.

System Outcome
Here, $f_{qj}$ is the frequency of word $w_q$ in field $S_j$. To get the collection of FA words based on power links, with the above five levels, in each period $p_i$, we select the FA word levels for the topic of the medical field. The output is a collection of lists of FA terms based on power links for the medical field in each given time period.

FA Term Based on PLA Features
In the following sub-sections, five features of FA terms based on PLAs are introduced. More details about these attributes were discussed in [5]. Moreover, in this paper, we use the FD to describe terms of the medical field, for example, as follows: a) "Disease-name: e.g., Cancer", b) "Treatment-name: e.g., Avastin", c) "Doctor-name: e.g., Magdi Yacoub", d) "Organization-name: e.g., WHO", e) "Medical-name: e.g., Patient". These features are useful for studying the influence of time change on the frequencies and for determining the St classes more accurately and easily.

Experimental Data
We trained our new approach using a collection of data (corpus) obtained from Internet web sites such as Medical News Today and Independent News (2014-2017). The total number of files is 1575, with a size of about 5.16 MB. From the collected corpus, the FA words based on PLAs are extracted together with their levels. After that, the first three specific levels of FA words based on PLAs that are related to the sub-field <diseases> under the medical field are selected, considering the change over time of the frequencies of these selected FA words. The sub-field <diseases> of the medical field is covered by frequent and regular articles every year; therefore, its terms are distinctive and tend to change over time. The data is divided into two groups: one is the training data, which is given to the DT C4.5 as input, and the other is the test data, which is completely different from the training data. The features of both groups are determined from the way the frequencies of the FA words obtained with PLAs change over time. Table 1 shows a sample of the DT data for the FA words produced with PLA, which are Diabetes, FDA, Insulin and Plague.
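C4.5 chooses the attribute to split on by its gain ratio, that is, the information gain divided by the split information. The sketch below computes this criterion for categorical attributes as a small illustration of how the training data drives the tree. The attribute names `trend` and `level`, the class column `st` and the toy rows are hypothetical and do not reproduce Table 1.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def gain_ratio(rows, attr, target):
    """Gain ratio of a categorical attribute, the split criterion of C4.5."""
    labels = [r[target] for r in rows]
    groups = {}
    for r in rows:
        groups.setdefault(r[attr], []).append(r[target])
    total = len(rows)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    split_info = entropy([r[attr] for r in rows])
    return gain / split_info if split_info > 0 else 0.0

# Hypothetical training rows: one informative attribute, one uninformative.
rows = [
    {"trend": "up",   "level": "1", "st": "ascending"},
    {"trend": "up",   "level": "2", "st": "ascending"},
    {"trend": "flat", "level": "1", "st": "stable"},
    {"trend": "flat", "level": "2", "st": "stable"},
    {"trend": "down", "level": "1", "st": "descending"},
    {"trend": "down", "level": "2", "st": "descending"},
]

for attr in ("trend", "level"):
    print(attr, round(gain_ratio(rows, attr, "st"), 3))
```

In this toy data the `trend` attribute separates the classes perfectly, so it would be chosen for the first split, while `level` carries no information about the class.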

Experimental Results
Firstly, the resulting FA words are improved by using the PLA. Table 2 shows a comparison between samples of the FA words before and after using PLA. For example, the term "viral" was assigned level 1 for the <Hiv/Aids> field, but after using the PLA its level changed to level 4. Therefore, the PLA improved the results of our new approach, as shown in Table 2.
From Table 2, it is clear that the levels of some words change after using their PLAs, which affects the DT results and improves precision and recall.
Secondly, after training the DT on the training data, the automatic results of the DT were compared with manual results produced by humans for the classification of the St classes of the test data, as in Table 3. Table 4 reports the accuracy of the new approach using three evaluation measures, recall (R), precision (P) and F-measure, which indicate how correctly the DT C4.5 classifies the FA words based on PLA according to the change of their frequency over time.
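The three evaluation measures can be computed per class by comparing the automatic labels with the manual ones. The following sketch uses the standard definitions, with the F-measure taken as the harmonic mean of precision and recall. The function name `prf` and the toy label lists are assumptions and do not reproduce the data of Table 3 or Table 4.

```python
def prf(gold, predicted, cls):
    """Precision, recall and F-measure for one class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == cls and p == cls)
    fp = sum(1 for g, p in pairs if g != cls and p == cls)
    fn = sum(1 for g, p in pairs if g == cls and p != cls)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Hypothetical manual (gold) and automatic (predicted) stability labels.
gold = ["ascending", "ascending", "stable", "stable", "descending", "descending"]
predicted = ["ascending", "stable", "stable", "stable", "descending", "ascending"]

for cls in ("ascending", "stable", "descending"):
    p, r, f = prf(gold, predicted, cls)
    print(cls, round(p, 3), round(r, 3), round(f, 3))
```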

Traditional and New Method Results Comparison
In this paper, the F-measure is used to estimate the accuracy of the new methodology and of the traditional method by Atlam et al. [12]. The traditional method uses the traditional algorithm [2] for FA terms, which neglects the links among terms, documents and fields. Table 5 shows the P, R and F-measure rates for both the traditional and the new method, and Figure 3 plots the F-measure rates for the two methods. Based on the evaluation results, the performance is better when using the new method, which depends on FA words with PLA. Moreover, the F-measure of the new PLA-based method in sub-section 5.2 is higher than that of the traditional method. The stable class improved only slightly. This is reasonable, because the FA words in this class, selected with PLA, have stable frequencies over time.
Finally, the effectiveness of the new methodology based on PLA is confirmed by the F-measure, which improved by 12% for the ascending class, 5% for the stable class and 34% for the descending class.

Conclusions
This paper presented a new methodology for automatically producing the St classes of classified FA words based on PLA. We have provided a detailed overview of the suggested method and its algorithm and presented our evaluation. The results of our evaluation indicate that the new method performs better than the traditional method. In conclusion, the effectiveness of the new methodology based on PLA is confirmed by F-measure values of 93.6% for the ascending class, 99.8% for the stable class and 75.7% for the descending class.
Future work could focus on using compound FA words based on PLA and on building an Arabic dictionary of FA words produced with PLA. Moreover, a multi-language approach could be applied to this system to support cross-language information retrieval.