Extractive Summarization Using Structural Syntax, Term Expansion and Refinement

This paper describes a procedure, and reports on experiments performed, to study the utility of combining a structural property of a text's sentences with term expansion using WordNet [1] and a local thesaurus [2] in selecting the most appropriate extractive summary for a particular document. Sentences were tagged and normalized, then subjected to the Longest Common Subsequence (LCS) algorithm [3] [4] to select the most similar subset of sentences. Similarity was calculated from the LCS of pairs of sentences that make up the document. A normalized score was calculated and used to rank sentences. A selected top subset of the most similar sentences was then tokenized to produce a set of important keywords or terms. The produced terms were further expanded into two subsets using 1) WordNet; and 2) a local electronic dictionary/thesaurus. The three sets obtained (the original and the two expanded sets) were then recycled to further refine and expand the list of selected sentences from the original document. The process was repeated a number of times in order to find the best representative set of sentences. A final set of the top (best) sentences was selected as candidate sentences for summarization. To verify the utility of the procedure, a number of experiments were conducted using an email corpus. The results were compared to those produced by human annotators as well as to results produced using a basic sentence similarity calculation method. The results were very encouraging and compared well to those of human annotators and to Jaccard sentence similarity.


Introduction
The growth of the web and the emergence of digital libraries make text analysis and similarity calculations important techniques for many applications. The multilingual nature of this data explosion further calls for more robust, efficient and generalized tools and techniques to facilitate the utilization of available content. Text representation and processing have become an important backbone of many tools and applications including text mining [5] [6], summarization [7]-[17], clustering [18], categorization [19] [20], copy-detection [21] [22] [23], plagiarism detection [24] [25], web-search [26] [27], information retrieval [28] and computational biology [29] [30] [31].
This paper reports on work conducted to investigate the use of syntactical structures, namely POS tagging of English sentences, in the selection of parts of a document to be used as candidates for extractive summarization. The procedure uses POS tagging [32] [33] and LCS [3] [4] combined with term expansion using WordNet [1] and a local thesaurus [2] to select the most appropriate extractive summary for a document. The produced sentences (the extractive summary) were the result of calculations based on a set of selected common subsequences obtained from the document's POS-tagged sentences. The results were further refined using term sets that were expanded into two subsets using WordNet and a local electronic dictionary/thesaurus.
First, syntactical features of the sentences within the text were represented as POS tags using TreeTagger [32] [33]. Each document's tagged strings were then compared using LCS to find common syntactical structures. A normalized score between 0 and 1 was calculated for each pair of sentences from the longest common subsequence to produce a final measure of similarity. An initial set of sentences was then produced by selecting the most similar sentences, based either on a predefined cutoff value or on a simple selection of the top-n sentences.
The initial candidate set of sentences was then processed further: a set of terms or keywords (restricted to verbs, nouns, adjectives and/or adverbs) was produced from the candidate sentences. This set of terms was then used to expand the set of candidate sentences with any sentence in the original document that shares the same terms. The expanded set of candidate sentences was subjected to the same process again. The initial set of terms can either be used as is, or further improved using a global source such as WordNet, a local resource such as a thesaurus, or both. These experiments used the initial set of terms as is, as well as extended with WordNet and with the local thesaurus. This cycle can be performed any number of times to produce more refined sets of candidate sentences.
As an experimental validation of the adopted procedure, a dataset made up of real emails along with a human annotation of important sentences was used [5].
The results obtained were also compared to those produced using the sentence-based Jaccard similarity coefficient [3]. The results showed the utility of the approach in generating a set of candidate sentences that can be used for extractive summarization and other similarity-based work.
International Journal of Intelligence Science
All in all, a number of important processing tasks were performed in these experiments, including text tagging and basic text preprocessing. The aim was to reduce a document's sentences into a set of POS tags without excluding any stop words and without stemming or removing numbers, punctuation or special characters. Each tag of each produced string was replaced by a single character.
Groups of similar tags (for example, all verb tags or all noun, adjective or adverb tags) can be mapped to a single character or symbol to further reduce the size of the produced string and improve the efficiency of LCS processing.
Each tagged string of the original sentence was fed into an LCS module to find the length of the longest common subsequence. Pairs of strings were then compared and scored based on a normalized value of the length of the longest common subsequence.
The most similar sentences were further analyzed to find a set of words (terms) to be used for fetching related sentences from the original document. Before the terms were used to collect new sentences, the produced set of terms was expanded using WordNet or a local dictionary. The top-k sentences were selected as the candidate subset of sentences that can be used for extractive summarization.
The rest of the paper is made up of Section 2 on related work; Section 3 on the proposed procedure; Section 4 on the experiments conducted, and the document collections used; Section 5 on results analysis and Section 6 on conclusions and future work.

Related Work
Multi-target text summarization by humans involves full understanding, interpretation and generation of an abstract of documents. Such a task is not easy for the average person, let alone a computer program. It is a critical human cognitive activity whose objective is to sum up the main points of a long text.
Automatic text summarization, the automation of this critical human activity, is normally considered part of the machine learning and data mining fields. It is typically utilized in areas such as search engines and document summarization, and in less typical areas such as image collections and videos. Automatic summarization involves methods and techniques from a variety of related fields that share text analysis and processing.
Tasks of text processing and data analysis have become a necessity in this ever-expanding field of text analysis and processing. Work on automatic text summarization [15] [16] [34]-[43] aims to make it easier and more efficient to create applications related to Natural Language Processing, such as Information Retrieval, Question Answering or Text Comprehension.
Automatic text summarization can be defined as the process of reducing a given text using a computer program to create a set of important points extracted from the original document. Many tools and technologies with related algorithms have been developed and deployed to make a coherent summary of documents. Such methods take into account length, writing style and syntax using machine learning and other techniques [44]. All such tools share the major objective of creating a set (or subset) from the original document that works as a representative summary or abstract of the entire document. Summarization techniques and algorithms try to find subsets of objects that cover the informational content of a single document or a group of documents.
Automatic summarization techniques can also be categorized by the number of documents involved (single-document versus multi-document) and by genre: generic summarization creates a generic summary of the documents, whereas query-relevant summarization selects objects from the original document that are relevant to a specific query.
Query-focused summaries enable users to find more relevant documents more accurately, with less need to consult the full text of the document [17].
Most commonly, however, automatic summarization is categorized by the type of summary produced, which can be extractive or abstractive. In extractive [10] [42] summarization the summary is created by reusing portions (words, sentences, etc.) of the input text. As for abstractive [11]
Luhn's work [11], considered one of the earliest attempts at automatic summarization, suggested the basic idea that sentences conveying important content are those that contain certain content-descriptive words. Most of his work was based on finding extracts from a given text using manually generated rules based on sentence position, word formatting, word frequency and other clues [34] [45] [46]. The problem with this view is its dependency on format and position in the text rather than on the semantics of the text. Other early summarization systems such as FRUMP, SUMMONS, CIRCUS and SUMMARIST [47] [48] were based on pre-defined patterns that are labor intensive. Patterns would trigger certain templates to be filled as the text is read [49] [59].
These nodes in the graph that are connected were thus a representation of relatedness characterized by the value of the cosine similarity of their corresponding sentences. Sentences that were more similar to other sentences in the document were considered important and were included in the extractive summary.
Semantic graph based techniques [16] extract Subject-Object-Predicate triplets from the sentences that were then used to generate a graph of the document.
Machine learning techniques are used to select a subpart of the graph, where the sentences in the subpart make up the summary. Naïve Bayes, Neural Networks and Hidden Markov Models (HMM) [12] are some of the machine learning methods used in summarization.
Testing and evaluation of summarization systems is a critical aspect that has been performed using all types of data sets and corpora [8] [60] [61] [62]. Emails present one challenge against which such systems can be tested and verified. One of the first attempts to use extraction of important phrases from emails as a way of email summarization is in [63] [64]. In [63], researchers focused on thread summarization using content and structural features to group sentences as "relevant" and "not relevant". Other researchers used scoring-based summarization to generate "thread overviews" on mailing lists. In particular, [65] assumed that topical consistency can be maintained by selecting sentences with higher POS overlap with the root message. They based the sentence score on POS overlap with the subject line and the root message. In contrast, [64] treated thread summary creation more like an online group decision-making process, using structure and Singular Value Decomposition (SVD) [66] on bags of words to calculate a unique sentence score.
SmartMail [67] was created to identify "action items" in a message, using a supervised classifier and a linguistically driven post-process to mark sentences as task descriptions, and to provide a task-focused summary consisting of a list of action items. A large email corpus was constructed in which each sentence was represented by a large set of features, and SVM classifiers were trained to identify "task" sentences which, in turn, were used to obtain logical forms and task descriptions.
The idea of summarizing email threads using multi-candidate reduction as a framework [67] for abstractive multi-document summarization was used in [68]. The authors filtered sentences and compressed them in two ways, which they refer to as a "parse-and-trim" approach and a Hidden Markov Model approach.
Ranking sentences using clue words, through the construction of a Fragment Quotation Graph that captures the flow of a conversation in a thread, was developed and used in [69]. A score is assigned to each sentence using the graph, based on a test corpus built from 20 different Enron threads. The authors' approach outperformed MEAD and RIPPER on this test set [69].
In [70] the authors presented a transformation for summarizing emails using an ontology that was populated by entities and relationships present in the email.
The ontology could be learned very accurately with classifiers trained on a large set of features. It was then used to generate a summary maximizing an objective function relating sentence and entity weights. Work on extending the problem of keyword extraction in a supervised setting using a decision tree and a genetic-algorithm-based classifier to classify phrases in a document as key phrase or not was presented in [71].
One of the main tasks in summarization, as in other text processing work, is the evaluation of relatedness or similarity of parts of text, be they words, sentences or larger portions including whole documents. Different methods and approaches have been used to tackle this issue of similarity between documents using syntactical or semantic features. Semantic similarity has received less attention because of the inherent difficulties of representing semantics and the limited assessment coverage of user studies [54] [72]. Commonly used methods for determining similarity include fingerprinting [21], Information Retrieval [28] and other hybrid techniques [24] [44]. In Information Retrieval models, more emphasis was put on representing documents by their words and word frequencies. Indexing with an appropriate model to evaluate similarities between documents was also used.
The combined use of syntactical POS tagging and text processing methods for text similarity calculations and their applications was explored in recent work [72]-[77]. It was based on the intuition that similar (exact) documents would have similar (exact) syntactical structures. Documents that contain reused portions of other documents, or that are written by the same author or on the same topic, would contain similar structures.
Looking at a piece of text as a string made of meaningful, well-defined and enumerable units (an alphabet) means that a modified (and similar) text can be thought of as the result of applying the edit operations commonly used in bio-sequence analysis: insertions, deletions and substitutions.

Proposed Procedure
A brief description of the proposed procedure is shown in Figure 1.
Steps of the used procedure are briefly described next.

Text Tagging and Pre-Processing
This step reduces a document's sentences into a set of POS tags without excluding any stop words and without stemming or removing numbers, punctuation or special characters. Since the LCS algorithm handles characters, each tag of each produced string was replaced by a single character. Further simplification and reduction can be obtained by mapping groups of similar tags (for example, all verb tags or all noun, adjective or adverb tags) to a single character or symbol. This reduction produces shorter strings, which improves the efficiency of the LCS calculation.
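The tag-to-character reduction described in this step can be sketched as follows. This is a minimal illustration, not the symbol table used in the experiments: the coarse classes in `COARSE_CLASSES`, the `TagEncoder` class and the assumption that tags arrive as Penn-Treebank-style strings (as TreeTagger's English tagset roughly provides) are all hypothetical choices made here for illustration.

```python
import string

# Hypothetical coarse classes: tags sharing a two-letter prefix collapse to
# one symbol (all noun tags -> "N", all verb tags -> "V", and so on).
COARSE_CLASSES = {"NN": "N", "NP": "N", "VB": "V", "VH": "V", "VV": "V",
                  "JJ": "J", "RB": "R"}


class TagEncoder:
    """Maps each POS tag to one character so a sentence becomes a short string."""

    def __init__(self):
        # Lowercase letters and digits are handed out to tags outside the
        # coarse classes; this sketch supports at most 36 such tags.
        self._pool = string.ascii_lowercase + string.digits
        self._codes = {}

    def encode_tag(self, tag):
        coarse = COARSE_CLASSES.get(tag[:2])
        if coarse is not None:
            return coarse
        if tag not in self._codes:
            self._codes[tag] = self._pool[len(self._codes)]
        return self._codes[tag]

    def encode_sentence(self, tags):
        return "".join(self.encode_tag(tag) for tag in tags)


encoder = TagEncoder()
# Tags for a sentence such as "The meeting is short ." (tagging assumed done).
print(encoder.encode_sentence(["DT", "NN", "VBZ", "JJ", "."]))  # aNVJb
```

Nothing is dropped from the sentence: punctuation and function-word tags still receive their own symbols, in line with the description above.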

LCS-Processing
Each sentence's string of tags was then fed into an LCS module to produce the length of the longest common subsequence. Pairs of strings were then compared and scored based on a normalized value of that length.
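A minimal sketch of the LCS scoring step is given below. The dynamic-programming computation of the LCS length is standard; the normalization, however, is an assumption, since the exact formula is not stated here. This sketch divides by the length of the longer string so that the score falls between 0 and 1. The function names `lcs_length` and `normalized_lcs` are illustrative only.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, using two rolling rows."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def normalized_lcs(a: str, b: str) -> float:
    """Assumed normalization: LCS length over the longer string's length."""
    longest = max(len(a), len(b))
    return lcs_length(a, b) / longest if longest else 0.0


# Two encoded sentences that differ in a single tag symbol.
print(normalized_lcs("aNVJb", "aNVNb"))  # 0.8
```

Because the strings contain one character per token, the table has as many cells as the product of the two sentence lengths, which is why shorter, coarser tag strings make this step cheaper.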

Tokenization, Term Selection and Expansion
The most similar sentences (based on the normalized LCS score) were then further analyzed to produce a set of terms (bag of words) to be used as keywords for collecting related sentences from the original document. All sentences that shared any of the keywords were collected to be used in the next stage.
Before the terms were used to collect new sentences, the produced set of terms was passed to a module that further expanded the set of keywords using either 1) WordNet; or 2) an electronic thesaurus-dictionary.
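The term selection, expansion and collection cycle can be sketched as below. Several things here are assumptions made for illustration: `TOY_THESAURUS` is a tiny in-memory stand-in for the WordNet and thesaurus lookups, a stopword set stands in for restricting terms to content words, and `refine` simply repeats the cycle a fixed number of times.

```python
import re

# Hypothetical stand-in for a WordNet or thesaurus synonym lookup.
TOY_THESAURUS = {"meeting": {"conference", "session"},
                 "send": {"mail", "dispatch"}}
STOPWORDS = frozenset({"the", "is", "on", "was"})


def tokenize(sentence):
    """Lowercased set of alphabetic tokens in a sentence."""
    return set(re.findall(r"[a-z]+", sentence.lower()))


def expand_terms(terms, thesaurus):
    """Original terms plus any synonyms the thesaurus offers for them."""
    expanded = set(terms)
    for term in terms:
        expanded |= thesaurus.get(term, set())
    return expanded


def refine(sentences, seed_indices, thesaurus, cycles=2, stopwords=STOPWORDS):
    """Grow the candidate set with sentences sharing any (expanded) term."""
    selected = set(seed_indices)
    for _ in range(cycles):
        terms = set()
        for i in selected:
            terms |= tokenize(sentences[i]) - stopwords
        terms = expand_terms(terms, thesaurus)
        selected |= {i for i, s in enumerate(sentences) if tokenize(s) & terms}
    return sorted(selected)


emails = ["The meeting is on Monday.", "Please send the report.",
          "The conference room is booked.", "Lunch was great."]
print(refine(emails, [0], TOY_THESAURUS))  # [0, 2]
```

In this toy run, the third sentence is pulled in only because "conference" is listed as a synonym of "meeting", which is the effect term expansion is meant to have.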

Subset of Candidate Sentence Selection
Once the procedure had been applied a sufficient number of times, the top sentences were selected as the candidate subset of sentences to be used for extractive summarization. The set can be taken as the top-k sentences or as all sentences that lie above a certain threshold value.
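The two selection rules mentioned here can be illustrated with a short sketch. It assumes each sentence already carries a single score (how pairwise scores are aggregated into one score per sentence is not specified, so the input list is hypothetical), and the helper names `top_k` and `above_threshold` are illustrative.

```python
def top_k(scores, k):
    """Indices of the k highest-scoring sentences, in document order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])


def above_threshold(scores, threshold):
    """Indices of sentences whose score lies strictly above the threshold."""
    return [i for i, score in enumerate(scores) if score > threshold]


sentence_scores = [0.2, 0.9, 0.5, 0.7]  # hypothetical per-sentence scores
print(top_k(sentence_scores, 2))              # [1, 3]
print(above_threshold(sentence_scores, 0.4))  # [1, 2, 3]
```

A fixed k gives summaries of equal length, which suits comparison against five-sentence human summaries, while a threshold lets the summary length vary with the document.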

Dataset and Experiments
To evaluate the proposed procedure, it was applied to a subset of emails taken from the collection in [5]. The original email dataset consisted of a set of emails that were manually annotated with summaries and keywords and contained both single and thread emails. It totaled 349 annotated emails and threads. The dataset was developed for use by automatic summarization methods and other extraction experiments.
According to the developers, 319 emails of the 349 that were annotated came from the Enron corpus [12]. Thirty other emails were provided by volunteers. The set consists of a total of more than 100,000 words and close to 7000 sentences.
The emails were classified as either corporate, which refers to any communication within a work environment, or private, which refers to two different sets of private emails: the first was taken from the Enron collection and the second was mainly provided by volunteers from their own private mailboxes.
As per the developers of the email corpus, emails were manually annotated by two independent annotators generating 1) an abstractive summary; 2) a set of important sentences (extractive summary); 3) a set of key-phrases; and 4) a classification of the emails as either corporate or private.
For the purpose of this work, it was enough to use a subset of the corpus. A private single email collection (referred to as PSS) was used. The subset was made of 103 private emails along with two sets of sentences that were provided by human annotators. That gave a total of 206 extractive summaries.
As can be seen from the samples provided in Table 1, the original email corpus was formatted using XML. The text of each private single email in the test corpus was extracted along with its respective extractive summaries. Each summary consists of 5 sentences suggested by a human annotator. The annotators were identified as 1 and 3 in the original corpus; thus, the two sets PSS-A1 and PSS-A3 were created to correspond to the two annotators respectively. As expected, the two human annotators produced sets that were not identical. Table 2 is a sample that shows the same email (088) along with the produced POS tags (for one sentence only) as well as the final, much reduced string and terms.
All in all, the following comparisons were performed on the results:
1) Comparison of the subset of sentences produced by the procedure against the human-annotated sentences of the PSS-A1 set.
2) Comparison of the subset of sentences produced by the procedure against the human-annotated sentences of the PSS-A3 set.
3) Correlation of the results produced by the procedure with how the two annotators compare to each other. That is, the annotators' summaries were compared to each other and the outcome was then correlated with our results.
For comparisons 1) and 2) above, the top five sentences produced were compared to the 5 sentences produced by the human annotators. A total match, with a value of 1, meant that both sets contained the same sentences. Lower values (0.8, 0.6, 0.4, 0.2, 0) represented progressively less agreement, down to no agreement at all. The order of sentences was not considered in these experiments. All of the comparisons were performed using the sentences produced with the following combinations.
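The match values listed above are consistent with an order-independent overlap between two five-sentence sets, divided by the set size. The sketch below assumes that reading; the function name `agreement` and the sentence indices are illustrative.

```python
def agreement(system_sentences, annotator_sentences):
    """Fraction of the annotator's sentences also chosen by the system."""
    reference = set(annotator_sentences)
    return len(set(system_sentences) & reference) / len(reference)


# Hypothetical sentence indices: three of five sentences coincide.
print(agreement([1, 4, 7, 9, 12], [4, 1, 8, 12, 20]))  # 0.6
```

With five sentences per summary, the only possible outcomes are 0, 0.2, 0.4, 0.6, 0.8 and 1, matching the values reported.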

Based on Original Terms (TT Set)
In this set the terms were selected from the candidate sentences as is without any expansion of the list of terms.

Based on the Expanded Terms Using WordNet Synonyms (ST Set)
In this set, the original terms set was expanded using synonyms from WordNet. In particular, nouns and verbs were used as seeds to expand the list using WordNet. WordNet [1] is a well-known lexical database for English and other languages. It groups words into sets of synonyms called synsets. WordNet also provides short definitions and usage examples, and records a number of relations among the synonym sets or their members.

Based on the Expanded Terms Using a Local Dictionary (DT Set)
In this set the original terms set was expanded using synonyms from the Mobysaurus thesaurus-dictionary [2]. The terms (nouns and verbs) were used to expand the original set using Mobysaurus. Mobysaurus is a free, feature-rich English thesaurus and dictionary. It integrates Moby Thesaurus II, Roget's Thesaurus, GCIDE Dictionary and WordNet.
In addition to the above methods, another important evaluation was the correlation of our results with those obtained by a simple comparison of the words contained in the sentences using a standard Jaccard similarity coefficient [3]. Results are further discussed in the following section.
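For reference, the word-based Jaccard coefficient used as the baseline is the size of the intersection of two word sets divided by the size of their union. The sketch below is a minimal version; the lowercasing and the `\w+` tokenization are assumptions made here, since the preprocessing applied for the baseline is not described.

```python
import re


def jaccard(sentence_a, sentence_b):
    """Jaccard coefficient over the sets of word tokens in two sentences."""
    words_a = set(re.findall(r"\w+", sentence_a.lower()))
    words_b = set(re.findall(r"\w+", sentence_b.lower()))
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


# Two shared words ("the", "meeting") out of seven distinct words overall.
print(round(jaccard("The meeting is on Monday", "The meeting was moved"), 3))  # 0.286
```

Unlike the LCS score over tag strings, this measure ignores word order and sentence structure entirely, which is the contrast the comparison is meant to expose.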

Results, Analysis and Discussions
In order to validate our procedure, a number of experiments were performed as described above. The results of each of the performed steps are explained next and are reported in Table 3 and Table 4.

Comparison of Results against Human Annotated Sentences (Set PSS-A1)
As shown in Table 3, the resulting abstracts obtained by our procedure were compared to those of human annotator 1 over three cycles using thesaurus and WordNet synset expansion.
One noticeable thing is that, for expansion using either the thesaurus or WordNet, higher averages were obtained in the expanded cycles of 1 and 2 than in the base cycle of 1.
As shown in Table 3, the best average obtained was 42.1 for WordNet cycle 3, slightly better than that of the thesaurus. This 42.1 is still lower than the value obtained when the two annotators were compared to each other. It is worth noting, as seen from the last row (JSC Annotator-1) versus (Maximum), that the obtained results outperform JSC in all cases except for the 100% case.

Comparison of Results against Human Annotated Sentences (Set PSS-A3)
As shown in Table 4, the resulting abstracts obtained by the procedure were compared to those of human annotator 3, and the results were slightly better than in the case of human annotator 1. For expansion using either a thesaurus or WordNet, the results showed higher averages in the expanded cycles of 1 and 2 than in the base cycle of 1.
The best average obtained, as shown in Table 4, was 46.2 for WordNet cycle 3, slightly better than that of the thesaurus. This 46.2 was still lower than the value obtained when the two annotators were compared to each other.
Interestingly, all the results were better than JSC across all columns, and they compared better against the annotator 1 vs. 3 case than the results for annotator 1 did.

Correlation of Results vs. Two Annotators as They Compare to Each Other
As shown in both tables (Table 3 and Table 4), when compared to each other, the annotators' results showed variations. That is an indication of the difficulty and inconsistency of abstracting even when humans are involved.
The results obtained underperformed, but remained reasonable, when compared with how the annotators compared to each other. Results showed that the procedure agrees better with annotator 3 than with annotator 1.

Comparison of Results to Jaccard Similarity Coefficient (Mere Sentence Terms)
As shown in Table 3 and Table 4, the results obtained outperformed the mere use of JSC on sentences in both cases. In fact, results were better in almost every case beyond cycle 1.
It can be seen that a combined approach to extractive summarization can perform reasonably well when compared to results obtained from human annotators. These experiments highlight the utility of combining structural (syntactical) features extracted as POS tags with a semantically driven approach in accelerating the processing that can be done using traditional string processing techniques such as LCS. They also highlight the value of combining such structural features with expanded keywords in improving the ranking and selection of important or representative sentences. The utility of such an approach is further enhanced through the use of refinement cycles. Results are quite comparable to the human annotators' work and better than those obtained using only common sentence comparison techniques such as the Jaccard similarity coefficient.

Conclusions
A procedure for extractive summarization was developed and experiments were performed to investigate and validate the results. The procedure used an approach that combines POS tagging of the sentences of a text document with term expansion using WordNet and a local thesaurus to select the most appropriate extractive summary for that document. Sentences were POS-tagged and the produced strings were reduced to single-character tags. These strings were then subjected to Longest Common Subsequence (LCS) processing to calculate the similarity of the pairs of sentences that make up the document, producing a normalized score. A selected top subset of the most similar sentences was tokenized to produce a set of important keywords, which were further expanded into two subsets using WordNet and a local thesaurus. The two expanded sets, along with the original set of terms, were recycled to further refine and expand the list of selected sentences from the original document. The process was repeated a number of times in order to find the best representative set of sentences. A final set of the top (best) sentences was selected as candidate sentences for summarization.
Experiments using an email corpus were conducted to verify the utility of the procedure. The obtained results were compared to those produced by human annotators on the one hand and to those produced using the Jaccard similarity coefficient on the other. The results obtained using the developed procedure were very encouraging and compared reasonably to the human annotators and other methods. Since the approach does not require language-specific linguistic processing beyond identifying sentence and word boundaries, it can also be applied to other languages. At the same time, incorporating syntactic and semantic information has led to superior results compared to plain similarity methods.