A New Approach of Time Series Variation Based on Power Links and Field Association Words
1. Introduction
Huge amounts of text are now processed automatically by computers. Documents can be managed by computers to retrieve important information for searching, clustering, classification and similar tasks. Many words occur frequently in a document and are generally strongly related to the field of that document. Extracting important words from documents is therefore highly useful for Information Retrieval (IR). IR refers to the process of finding relevant information on topics that concern the user and then retrieving that information. In general, the frequencies of words in texts change over time and are often linked to a particular period. For example, a word such as “flu” is more widespread in winter, and “stormy” is more widespread when there are strong winds and, usually, rain. Moreover, the groups of words associated with search terms on search engines such as Yahoo and Google change over time (Ohkubo et al. [1]). Fields such as sport and political figures also change continually over time.
Traditional approaches [2] - [8] for searching, classifying, clustering and analyzing texts cannot determine which words are specific to a particular period of time. The approach of Atlam et al. [9] [10] shows that the popularity of words changes over time, using the ordinary keywords that represent documents. However, ordinary keywords do not represent documents correctly, because that approach ignores the connection between words and their fields.
By finding specific words in a document, the reader can easily decide the field of the document without reading the whole text. These specific words or units are called Field Association (FA) words. FA words are the smallest set of words by which the reader can decide the field of a text. Another method by Atlam et al. [11] represents FA words across time based on the traditional FA algorithm of Atlam et al. [12]. However, this algorithm has drawbacks: some irrelevant words that are not restricted to a particular field are selected as FA word candidates, while some words that are restricted to that field are not selected [13] [14].
Power Links Analysis (PLA) was developed by Rokaya and Atlam [15] to analyze the co-occurrences of terms in the publications of a given topic, and it addresses the drawback described above. The approach is based on an advanced form of frequency together with the distance between different instances of the given words. This additional information captures how terms are distributed among the different parts of a document. PLA assumes that the words of a document give an adequate representation of its content, and it depends on the quantified frequencies of, and the distances between, different instances of two terms. Moreover, a time series of the resulting co-occurrence maps can trace the dynamic changes in this conceptual space. Further, PLA can be used to extract more relevant FA words from text documents [16].
The novelty of this paper is the use of Power Links Analysis (PLA), whereas traditional methods have considered only the effect of time on the frequencies of FA words. With PLA, the levels of some words change, which affects the decision tree (DT) results and improves precision, recall and F-measure, as shown in the results of this paper. We propose a methodology for automatically evaluating the Stability (St) classes by applying the C4.5 decision tree algorithm of Quinlan [17] to FA terms based on PLA. The methodology is intended to indicate how the popularity of FA terms changes over time and to improve DT precision.
This paper is organized as follows. Section 2 introduces related work and FA words with their levels. Section 3 presents the concept of PLA. Section 4 introduces our new methodology for evaluating the St classes that indicate the popularity of FA words using PLA in a certain time period. Section 5 presents the experimental observations. Finally, Section 6 gives the conclusions and possible future work.
2. Related Work and FA Words
2.1. Related Work
In IR, a collection of documents usually changes as time passes. At time t the collection contains a certain number of documents and tokens, and a term q has a certain frequency; at a later time, these values will have changed. Documents are added to or deleted from the collection over time, and the collection frequencies of words change as well. Atlam et al. [11] proposed a method that estimates the popularity of words over time from their frequency in text data from past years. This approach defines a number of attributes and three stability classes as an index of the spread of words, in order to capture the change in word frequency quantitatively. A decision tree is then used to estimate these classes. However, this approach represents documents by common keywords, which are not the best representatives, and it entirely neglects the relationship between words and their fields.
Atlam et al. [9] studied the effect of time on the frequency of FA words using a decision tree. They presented a number of features that describe how the frequency of FA words changes over time, and three stability classes that indicate the popularity of an FA term across time. However, this method depends on the traditional FA algorithm of Atlam et al. [10], which produces some irrelevant FA terms that are not restricted to a specific field.
Co-word analysis considers the dynamics of science as a result of actor strategies. In principle, this technique should allow the reader to identify the actors and explain the global dynamics [18] - [26]. Rokaya and Atlam [15] proposed a method for building a dynamic FA word dictionary using PLA, and presented new rules to enhance the quality of the English FA term dictionary. Moreover, the PLA algorithm has been used to extract and refine confusion sets in order to provide context-sensitive spell checking based on FA words [27]. This technique combines the advantages of the statistical and machine learning methods of Rokaya et al. [27], which were used to build a real-word spell checker for English and Arabic. The PLA approach has also been applied to classifying important and advertising messages and to reducing spam [28] [29].
2.2. Field Association (FA)
1) FA WORDS
In this paper, a document field is a basic and common unit of knowledge that is used in human communication. Document fields are arranged in a field tree, in which a path runs from a super-field through a subfield to a terminal field [5]. The tree structure illustrates the associations among document fields (Dozawa [30]).
2) FA WORDS LEVELS
Some FA words identify a single specific field, whilst others may refer to two or more fields. Depending on how precisely an FA word refers to specific fields, there are five levels, as follows:
a) Ideal FA words: words associated with one sub-field (e.g., influenza, chemotherapy and insulin).
b) Semi-ideal FA words: words associated with several sub-fields in one super-field (e.g., sneeze and cough).
c) Medial FA words: words associated with a single super-field (e.g., blood and hospital).
d) Various FA words: words associated with several sub-fields of different super-fields (e.g., program and win).
e) Non-FA words: words that are not related to any field and do not identify one (e.g., rule and size).
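The five levels above can be illustrated with a short sketch. It only encodes the definitions in the list; it is not the frequency-based algorithm of Atlam et al. [10]. The representation of a word's associations as (super-field, sub-field) pairs, the function name `fa_level` and the field names in the usage note are all assumptions made for illustration.

```python
from typing import Optional, Set, Tuple

# Assumed encoding: one (super_field, sub_field) pair per association.
# sub_field is None when the word is tied to the super-field as a whole.
Association = Tuple[str, Optional[str]]


def fa_level(associations: Set[Association]) -> str:
    """Map a word's field associations to one of the five FA levels."""
    if not associations:
        return "non-FA"                # not related to any field
    super_fields = {sup for sup, _ in associations}
    sub_fields = {(sup, sub) for sup, sub in associations if sub is not None}
    if len(super_fields) > 1:
        return "various"               # spread over different super-fields
    if not sub_fields:
        return "medial"                # a single super-field as a whole
    if len(sub_fields) == 1:
        return "ideal"                 # exactly one sub-field
    return "semi-ideal"                # several sub-fields of one super-field
```

For instance, with hypothetical field names, `fa_level({("medicine", "infection")})` yields the ideal level, while a word linked to sub-fields of both "sports" and "computing" yields the various level.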
The traditional algorithm of Atlam et al. [10] is used to judge these levels and to determine the FA words automatically, based on term frequency and concentration ratio. It takes as input a list of words selected from a corpus that comprises groups of documents in different fields. The resulting FA terms are then used as input to the algorithm that computes the ranks of FA words based on PLA [31].
3. Power Link Analysis (PLA)
The concept of PLA reflects the value of a word in terms of its relation to the other words in the document. Each document is represented by the average of the PLA between the current term and the terms in the same document.
PLA Steps
In the following sub-sections, we will introduce three main concepts for power link analysis as follows:
1) TERM to TERM PLA
Suppose that two terms belong to a document D. A link may exist between them, and its strength is measured by a term-to-term power link function. This function depends on the number of different terms in D, on the co-occurrence frequency of the two terms in D, and on the distances between successive instances of the two terms, which enter through the average distance between instances of the two terms in D. The function is symmetric: its value for the first term paired with the second equals its value for the second term paired with the first.
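As a rough illustration of the quantities that the term-to-term power link depends on, the following sketch computes them for a tokenized document. It does not compute the power link itself. The function name `pl_ingredients` and, in particular, the rule that a co-occurrence is a pair of successive occurrences of the two different terms are assumptions, not definitions taken from [15].

```python
from typing import List, Tuple


def pl_ingredients(tokens: List[str], t1: str, t2: str) -> Tuple[int, int, float]:
    """Return (distinct_terms, cooccurrence_count, average_distance) for t1, t2."""
    distinct_terms = len(set(tokens))
    # Positions of both terms, kept in document order.
    hits = [(i, tok) for i, tok in enumerate(tokens) if tok in (t1, t2)]
    # Assumption: a co-occurrence is a pair of successive hits of different terms;
    # its distance is the number of token positions between the two hits.
    distances = [
        j - i
        for (i, a), (j, b) in zip(hits, hits[1:])
        if a != b
    ]
    cooccurrence_count = len(distances)
    if cooccurrence_count == 0:
        return distinct_terms, 0, float("inf")
    return distinct_terms, cooccurrence_count, sum(distances) / cooccurrence_count
```

Because the hits are collected for both terms at once, swapping `t1` and `t2` gives the same result, which matches the symmetry property stated above.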
2) TERM TO DOCUMENT PLA
Using the term-to-term PLA described above, the PLA of a term to a document related to a given field can be represented as in Figure 1. For a term t and a document D related to a field S, the term-to-document PLA is computed from the term-to-term PLAs between t and the n FA words that are related to S and occur in D.
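Section 3 opens by stating that a document is represented by the average of the PLA between the current term and the terms in the same document. A minimal sketch of that averaging step is given below. The names `term_to_document_pl` and `term_pl` are hypothetical, and the term-to-term function is passed in as a parameter because its exact form is not assumed here.

```python
from typing import Callable, Iterable


def term_to_document_pl(
    term: str,
    fa_words_in_doc: Iterable[str],
    term_pl: Callable[[str, str], float],
) -> float:
    """Average the term-to-term power link between `term` and each FA word."""
    links = [term_pl(term, word) for word in fa_words_in_doc]
    # With no FA words of the field in the document, report no link at all.
    return sum(links) / len(links) if links else 0.0
```

Passing the term-to-term measure as a callable keeps the sketch independent of how that measure is defined.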
3) TERM TO FIELD PLA
The power link of a term t to a field S is denoted by LTS. It is defined when at least one FA word of the field S exists in a document D related to S, and it is based on the co-occurrence of the term t and the field S. This co-occurrence is the number of FA terms that identify the field S and occur in a document whenever the term t occurs, and nd represents the number of documents that contain both FA words that identify the field S and the term t.
Our new approach uses the algorithm for evaluating the levels of FA terms based on PLs [25], and then examines the effect of time using the new methodology. The selection of FA words depends on the PLAs.
Figure 1. Terms to documents and field PL candidate.
4. Suggested Methodology
4.1. System Outcome
Figure 2 outlines the suggested approach. The approach uses two quantities for each period pi: the frequency in pi of an FA word k selected on the basis of PLA, and the total frequency of all FA words based on PLAs that occur in pi. To account correctly for the differing numbers of FA words in each period as time changes, the frequency of FA word k in period pi is normalized with respect to this total.
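The normalization step can be sketched as follows, assuming that the normalized frequency is the plain ratio of a word's frequency in a period to the total frequency of all FA words in that period. This ratio is an assumption about the form of the formula, and the function name `normalize_frequencies` and the nested-dictionary layout are hypothetical.

```python
from typing import Dict


def normalize_frequencies(
    freq_by_period: Dict[str, Dict[str, int]]
) -> Dict[str, Dict[str, float]]:
    """For each period, divide each FA word's frequency by the period total."""
    normalized: Dict[str, Dict[str, float]] = {}
    for period, freqs in freq_by_period.items():
        total = sum(freqs.values())  # total frequency of all FA words in the period
        normalized[period] = {
            word: (count / total if total else 0.0)
            for word, count in freqs.items()
        }
    return normalized
```

Under this assumption, the normalized frequencies within each period sum to one, so periods with different amounts of text become comparable.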
Figure 2. Suggested approach outline for judging the publicity of FA words with PLA according to time changing.
Definition 1: FA words based on PLAs whose frequency increases over time are called ascending FA words, and their class is called the ascending class. FA words based on PLAs whose frequency decreases over time are called descending FA words, and their class is called the descending class. FA words based on PLAs whose frequency remains stable over time are called stable FA words, and their class is called the stable class. These three classes are called the Stability (St) classes of FA words based on PLAs. They indicate how the popularity of a specific FA word changes over time, based on the change in its frequency in the given data.
The updated algorithm based on PLA:
Input: A corpus (collection of documents) organized by field, where each field Sj comprises a set of documents grouped according to a set of time periods, together with the words that have a PLA with those fields.
Output: Evaluation of the St classes for the list of FA terms based on PLA.
Steps:
1. Apply the PLA-based FA algorithm of Rokaya et al. [25] in each period pi to the collection of extracted words (W) in field S, together with their attributes.
2. Evaluate the formula of the FA algorithm, which uses the frequency of each word in its field.
3. Use the result to obtain the collection of FA words based on power links, with the five levels above, in each period pi. We select the FA word levels for the medical field. The output is a collection FAT that consists of the lists of FA terms based on power links for the medical field, one list for each period pi.
4. For each list of FA terms in FAT (one per period pi), do:
5. For each FA word k in that list, do:
6. Calculate the normalized frequency of the FA word k based on power links in period pi.
7. End.
8. End.
9. Collect the results in FA_NormFrequency.
10. For each K in FA_NormFrequency, do:
11. Calculate a, the slope of the trend line, where the x values are the periods and the y values are the time series of the normalized frequency of the FA term based on power links in each period pi.
12. Calculate b, the intercept of the trend line.
13. Calculate the correlation coefficient r from the observed values and the values estimated by the trend line.
14. Calculate θ, the angle between two lines, from the gradients of the first line, obtained from the oldest data (the first two years), and of the second line, obtained from all the data.
15. End.
16. Form the list of FA words based on power links with their features and St classes by appending FD and St to the preceding calculations. FD is the Feature Description of the FA terms based on power links, which is decided manually as described in sub-section 4.2.1.
17. Apply the DT C4.5 algorithm to the five features of the FA terms based on power links, with their stability classes, as training data.
18. To evaluate the accuracy of the model, use test data to obtain the St classes for new data automatically.
In summary, the suggested algorithm automatically evaluates the stability (St) classes of FA words based on PLA. The St classes indicate the popularity of FA words based on power links across time, as determined by their frequencies.
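Steps 11 to 14 of the algorithm can be sketched as below. The sketch assumes an ordinary least-squares trend line, uses the Pearson correlation coefficient as a stand-in for r, and takes θ as the absolute difference between the inclination angles of the two lines, in degrees. These choices and all function names are assumptions for illustration, not formulas taken from the paper.

```python
import math
from typing import Dict, Sequence, Tuple


def trend_line(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Least-squares slope a and intercept b of y = a*x + b."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("need at least two distinct period values")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    a = sxy / sxx
    return a, mean_y - a * mean_x


def correlation(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient; 0.0 when either series is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / math.sqrt(sxx * syy)


def angle_between(a1: float, a2: float) -> float:
    """Angle in degrees between two lines with gradients a1 and a2."""
    return abs(math.degrees(math.atan(a2) - math.atan(a1)))


def stability_features(periods: Sequence[float],
                       norm_freq: Sequence[float]) -> Dict[str, float]:
    """Numeric features a, b, r and theta for one FA word's time series."""
    a, b = trend_line(periods, norm_freq)
    # First line: the oldest data only (the first two periods).
    a_early, _ = trend_line(periods[:2], norm_freq[:2])
    return {
        "a": a,
        "b": b,
        "r": correlation(periods, norm_freq),
        "theta": angle_between(a_early, a),
    }
```

A series that rises by the same amount in every period gives a positive slope, a correlation of one and an angle of zero, since the early line and the overall line coincide.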
4.2. FA Term Based on PLA Features
In this sub-section, five features of FA terms based on PLAs are introduced: Feature Description (FD), intercept of the trend line (b), slope of the trend line (a), angle between two lines (θ) and correlation coefficient (r). The values of these features are used as training data for DT C4.5. The features measure how the frequency of FA words based on PLAs changes over time, in order to judge the stability (St) classes of the FA words. More details about these attributes are given in [5]. In this paper, taking the medical field as an example, the FD describes terms as follows: a) “Disease-name: e.g., Cancer”, b) “Treatment-name: e.g., Avastin”, c) “Doctor-name: e.g., Magdi Yacoub”, d) “Organization-name: e.g., WHO”, e) “Medical-name: e.g., Patient”. These features make it easier to study the influence of time on the frequencies and to determine the St classes.
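C4.5 chooses the attribute to split on by gain ratio, that is, information gain divided by split information. The sketch below computes it for a categorical feature such as FD. It covers only this criterion, not the full algorithm (no thresholds for the numeric features a, b, θ and r, and no pruning), and the function names and the toy values in the usage note are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Hashable, List, Sequence


def entropy(labels: Sequence[Hashable]) -> float:
    """Shannon entropy (base 2) of a sequence of labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def gain_ratio(feature_values: Sequence[Hashable],
               labels: Sequence[Hashable]) -> float:
    """Gain ratio of splitting `labels` by a categorical feature."""
    total = len(labels)
    groups: Dict[Hashable, List[Hashable]] = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Expected entropy of the labels after the split.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - remainder
    # Split information is the entropy of the feature values themselves.
    split_info = entropy(list(feature_values))
    return info_gain / split_info if split_info > 0 else 0.0
```

With toy data in which every "disease" term is ascending and every "treatment" term is stable, the FD feature separates the classes perfectly and its gain ratio is 1; a feature unrelated to the class gives 0.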
5. Experimental Observation
5.1. Experimental Data
We trained our new approach on a corpus collected from Internet web sites such as Medical News Today and Independent News (2014-2017). The corpus contains 1575 files with a total size of about 5.16 MB. From this corpus, the FA words based on PLAs are extracted with their levels. We then select the FA words at the first three levels that are related to the chosen sub-field of the medical field, and consider how the frequencies of these selected FA words change over time. This sub-field is covered by frequent and regular articles every year, so its terms are distinctive and tend to change over time. The data is divided into two groups: the training data, which is given to DT C4.5 as input, and the test data, which is entirely different from the training data. The features of both groups are determined from the way the frequency of the FA words obtained with PLAs changes over time.
Table 1 shows a sample of the DT data for FA words produced with PLA, namely Diabetes, FDA, Insulin and Plague.
5.2. Experimental Results
Firstly, the resulting FA words are improved by using PLA. Table 2 compares samples of the FA words before and after using PLA. For example, the term “viral” was at level 1 for its field, but after using PLA its level changed to level 4. Therefore, PLA improved the results of our new approach, as shown in Table 2.
Table 1. A sample of DT data based on PLA.
Where a means slope of the trend line, b means intercept of the trend line, r means correlation coefficient, θ means angle between two lines, FD means Feature Description, and class means stability class (increasing, decreasing or constant).
Table 2. Comparison between the new and traditional approaches based on PLs.
From Table 2, it is clear that the levels of some words change after using their PLAs, which affects the DT results and improves precision and recall.
Secondly, after training the DT on the training data, the automatic results of the DT were compared with the manual results of a human evaluator for the classification of the test data into St classes, as shown in Table 3.
Table 3 presents the final result of the DT. Shaded cells give the number of FA words based on PLA that are classified identically by the manual evaluation and by the DT. The columns give the number of FA words based on PLA in each St class as evaluated by the DT, and the rows give the number as evaluated manually.
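To evaluate the DT system, three standard IR measures, Recall (R), Precision (P) and F-measure, are applied to each St class. For a given class, recall is the fraction of the FA words manually assigned to the class that the DT also assigns to it, precision is the fraction of the FA words assigned to the class by the DT that are correct, and the F-measure is the harmonic mean of precision and recall.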
Table 4 reports the accuracy of the new approach in terms of R, P and F-measure, that is, how correctly the DT C4.5 classifies the FA words based on PLA according to the change of their frequency over time.
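The per-class measures can be computed from a confusion matrix laid out like Table 3, with rows for the manual evaluation and columns for the DT. The sketch below uses the standard definitions of recall, precision and F-measure. The function name `per_class_scores` and the nested-dictionary layout are assumptions, and any counts passed to it in examples are illustrative rather than values from the paper.

```python
from typing import Dict


def per_class_scores(
    confusion: Dict[str, Dict[str, int]]
) -> Dict[str, Dict[str, float]]:
    """Recall, precision and F-measure per class.

    confusion[manual_class][dt_class] is the number of FA words that were
    labelled manual_class by hand and dt_class by the decision tree.
    """
    classes = list(confusion)
    scores: Dict[str, Dict[str, float]] = {}
    for c in classes:
        correct = confusion[c].get(c, 0)                         # diagonal cell
        manual_total = sum(confusion[c].values())                # row sum
        dt_total = sum(confusion[m].get(c, 0) for m in classes)  # column sum
        recall = correct / manual_total if manual_total else 0.0
        precision = correct / dt_total if dt_total else 0.0
        denom = precision + recall
        f_measure = 2 * precision * recall / denom if denom else 0.0
        scores[c] = {"R": recall, "P": precision, "F": f_measure}
    return scores
```

The diagonal cells correspond to the shaded cells described for Table 3, the row sums to the manual totals and the column sums to the DT totals.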
Table 3. DT results compared with manual evaluation.
Table 4. Evaluation of St classes based on PLA using R, P and F-measure.
5.3. Traditional and New Method Results Comparison
In this paper, the F-measure is used to estimate the accuracy of both the new methodology and the traditional method of Atlam et al. [12]. The traditional method uses the traditional algorithm [2] of FA terms, which neglects the links among terms, documents and fields. Table 5 shows the P, R and F-measure rates for both the traditional and the new method.
Table 5 and Figure 3 show the F-measure rates for the traditional method and the new PLA-based method. The evaluation results show that performance is better with the new method, which depends on FA words with PLA. Moreover, the F-measure obtained with the new PLA-based method in sub-section 5.2 is higher than that of the traditional method. The stable class improved only slightly. This is reasonable, because the frequencies of its FA words based on PLA remain stable over time.
Finally, the F-measure confirms the effectiveness of the new methodology based on PLA: accuracy improves by 12% for the ascending class, by 5% for the stable class and by 34% for the descending class.
6. Conclusions
This paper presented a new methodology for automatically producing the St classes of classified FA words based on PLA. We provided a detailed overview of the suggested method and its algorithm and presented our evaluation. The results indicate that the new method performs better than the traditional method. In conclusion, the effectiveness of the new methodology based on PLA is confirmed by F-measure values of 93.6% for the ascending class, 99.8% for the stable class and 75.7% for the descending class.
Table 5. Comparison of new and traditional methods using F-measure.
Figure 3. F-measure evaluation for the traditional and new power-link-based methods.
Future work could focus on using compound FA words based on PLA and on building an Arabic dictionary based on FA words produced using PLA. Moreover, a multi-language approach could be applied to this system to support cross-language information retrieval.