Intelligent Information Management, 2009, 1, 128-138
doi:10.4236/iim.2009.12019 Published Online November 2009 (http://www.scirp.org/journal/iim)
Evolutionary Algorithm for Extractive Text
Summarization
Rasim ALGULIEV, Ramiz ALIGULIYEV
Institute of Information Technology, Azerbaijan National Academy of Sciences, Baku, Azerbaijan
Email: rasim@science.az, a.ramiz@science.az
Abstract: Text summarization is the process of automatically creating a compressed version of a given
document preserving its information content. There are two types of summarization: extractive and abstrac-
tive. Extractive summarization methods simplify the problem of summarization into the problem of selecting
a representative subset of the sentences in the original documents. Abstractive summarization may compose
novel sentences, unseen in the original sources. In our study we focus on sentence based extractive document
summarization. The extractive summarization systems are typically based on techniques for sentence extrac-
tion and aim to cover the set of sentences that are most important for the overall understanding of a given
document. In this paper, we propose an unsupervised document summarization method that creates the summary
by clustering and extracting sentences from the original document. For this purpose new criterion functions
for sentence clustering have been proposed. Similarity measures play an increasingly important role in
document clustering. We have also developed a discrete differential evolution algorithm to optimize the
criterion functions. The experimental results show that our suggested approach can improve the performance
compared to state-of-the-art summarization approaches.
Keywords: sentence clustering, document summarization, discrete differential evolution algorithm
1. Introduction
Text summarization is the process of automatically cre-
ating a compressed version of a given document pre-
serving its information content. Automatic document
summarization is an important research area in natural
language processing (NLP). The technology of automatic
document summarization is developing and may provide
a solution to the information overload problem [1–3].
The process of text summarization can be decomposed
into three phases: analysis, transformation, and synthesis.
The analysis phase analyzes the input text and selects a
few salient features. The transformation phase transforms
the results of the analysis into a summary representation.
Finally, the synthesis phase takes the summary represen-
tation, and produces an appropriate summary corre-
sponding to users’ needs. In the overall process, com-
pression rate, which is defined as the ratio between the
length of the summary and that of the original, is an im-
portant factor that influences the quality of the summary.
As the compression rate decreases, the summary will be
more concise; however, more information is lost. As the compression rate increases, the summary will be larger and, relatively, more insignificant information is contained. In fact, when the compression rate is 5–30%, the
quality of the summary is acceptable [1–4].
Text summarization can be categorized into two ap-
proaches: extractive and abstractive. Extractive summa-
rization methods simplify the problem of summarization
into the problem of selecting a representative subset of
the sentences in the original documents. Abstractive
summarization may compose novel sentences, unseen in
the original sources [1]. However, abstractive approaches
require deep NLP such as semantic representation, in-
ference and natural language generation, which have yet to reach a mature stage [5].
Extractive summarization systems are commonly used
in automatic summarization to produce extractive sum-
maries. Systems for extractive summarization are typi-
cally based on techniques for sentence extraction, and
attempt to identify the set of sentences that are most im-
portant for the overall understanding of a given docu-
ment. Most commonly, such systems use some kind of
similarity or centrality metric to identify the set of sen-
tences to include in the summary [1,4,6–15]. For exam-
ple, in [14] a paragraph extraction from a document
based on intra-document links between paragraphs is
proposed. It yields a TRM (Text Relationship Map) from
intra-links, which indicate that the linked texts are se-
mantically related. It proposes four strategies from the
TRM: bushy path, depth-first path, segmented bushy path, and augmented segmented bushy path. An improved
version of this approach is proposed in [1,4,6].
In this paper we demonstrate an extractive text summarization method which is based on sentence clustering.
For this purpose new criterion functions for sentence
clustering have been offered. In our study we developed
a discrete differential evolution algorithm to optimize the
criterion functions. The experimental results on the open
benchmark datasets from DUC2001 and DUC2002 show
that our suggested approach can improve the perform-
ance compared to state-of-the-art summarization ap-
proaches.
The rest of this paper is organized as follows: Section
2 introduces related works. The proposed sentence clus-
tering based approach for generic single-document sum-
marization is presented in Section 3. The discrete differ-
ential evolution algorithm for optimization procedure is
given in Section 4. The experiments and results are given
in Section 5. Finally, we conclude our paper in Section 6.
2. Related Work
Automatic document summarization has been actively
investigated in recent years, and most researchers have
concentrated on the extractive summarization method,
but not the abstractive summarization method – see for
example, the 6th issue of Information Processing and Management: An International Journal, 2007. The cur-
rent paper contains four references to the articles pub-
lished in this edition [5,16–18]. A comprehensive survey
of document summarization can be found in [17].
The centroid-based method [19,13] is one of the most
popular extractive summarization methods. MEAD
(http://www.summarization.com/mead/) is an implemen-
tation of the centroid-based method for either single- or
multi-document summarizing. It is based on sentence
extraction. For each sentence in a cluster of related
documents, MEAD computes three features and uses a
linear combination of the three to determine what sen-
tences are most salient. The three features used are cen-
troid score, position, and overlap with first sentence
(which may happen to be the title of a document). For
single documents or (given) clusters it computes centroid
topic characterizations using tf-idf-type data. It ranks
candidate summary sentences by combining sentence
scores against centroid, text position value, and tf-idf
title/lead overlap. Sentence selection is constrained by a
summary length threshold, and redundant new sentences are avoided by checking cosine similarity against prior ones
[20]. In [21] each document is considered as a sequence
of sentences and the objective of extractive summariza-
tion is to label the sentences in the sequence with 1 and 0,
where a label of 1 indicates that a sentence is a summary
sentence while 0 denotes a non-summary sentence. To
accomplish this task, a conditional random field is ap-
plied [22]. A novel extractive approach based on manifold-ranking of sentences for query-based multi-document summarization is proposed in [23]. This approach first uses the manifold-ranking process to compute a manifold-ranking score for each sentence that denotes the biased information richness of the sentence, and then uses a greedy algorithm to penalize redundancy and select the sentences with the highest overall scores, which are considered both informative and novel, and highly biased to the given query.
The summarization techniques can be classified into
two groups: supervised techniques, which rely on machine
learning algorithms trained on pre-existing document-
summary pairs, and unsupervised techniques, based on
properties and heuristics derived from the text. Super-
vised extractive summarization techniques [1,4–6,21,
24–26] treat the summarization task as a two-class clas-
sification problem at the sentence level, where the sum-
mary sentences are positive samples while the non-
summary sentences are negative samples. After repre-
senting each sentence by a vector of features, the classi-
fication function can be trained in two different manners
[21]. The first is in a discriminative way with well-
known algorithms such as SVM (Support Vector Ma-
chine) [4]. In [1], the use of genetic algorithm (GA),
mathematical regression (MR), feed forward neural net-
work (FFNN), probabilistic neural network (PNN) and
Gaussian mixture model (GMM) for the automatic text summarization task has been investigated. This ap-
proach is a trainable summarizer, which takes into ac-
count several features, including sentence position, posi-
tive keyword, negative keyword, sentence centrality,
sentence resemblance to the title, sentence inclusion of named entities, sentence inclusion of numerical data, sen-
tence relative length, bushy path of the sentence and ag-
gregated similarity for each sentence to generate summa-
ries. The article [25] presents a multi-document, multi-
lingual, theme-based summarization system based on
modeling text cohesion (story flow). Many unsupervised
methods have been developed for document summariza-
tion by exploiting different features and relationships of
the sentences, such as clustering of sentences [7–11], the
hidden topics in the documents [12], and graphs based on the similarity of sentences [13,19,27].
Recently, graph-based methods have been offered to
rank sentences. Lexrank [19] and [27] are two such systems, using the PageRank and HITS algorithms to com-
pute sentence importance. Lexrank is used to compute
sentence importance based on the concept of eigenvector
centrality in a graph representation of sentences for
multi-document summarization task. The graph-based
extractive summarization algorithms succeed in identi-
fying the most important sentences in a text based on
information exclusively drawn from the text itself.
Unlike other systems, which attempt to find out what
makes a good summary by training on collections of
summaries built for other articles, the graph-based meth-
ods are fully unsupervised, and rely on the given texts to
derive an extractive summary [23].
On the other hand, summarization task can also be
categorized as either generic or query-based. A query-
based summary presents the information that is most
relevant to the given queries [16,23,24,28,29] while a
generic summary gives an overall sense of the docu-
ment’s content [7–14,19,27,30]. The QCS system (Query,
Cluster, and Summarize) [16] performs the following
tasks in response to a query: retrieves relevant docu-
ments; separates the retrieved documents into clusters by
topic, and creates a summary for each cluster. QCS is a
tool for document retrieval that presents results in a for-
mat so that a user can quickly identify a set of documents
of interest. In [29] are developed a generic, a query-
based, and a hybrid summarizer, each with differing
amounts of document context. The generic summarizer
used a blend of discourse information and information
obtained through traditional surface-level analysis. The
query-based summarizer used only query-term informa-
tion, and the hybrid summarizer used some discourse
information along with query-term information.
Automatic document summarization is a highly interdisciplinary research area related to computer science, multimedia, statistics, as well as cognitive psychology. In [31], an intelligent system, the event indexing and summarization (EIS) system, is introduced for automatic document summarization; it is based on a cognitive
psychology model (the event-indexing model) and the
roles and importance of sentences and their syntax in
document understanding. The EIS system involves syn-
tactic analysis of sentences, clustering and indexing sen-
tences with five indices from the event-indexing model,
and extracting the most prominent content by lexical
analysis at phrase and clause levels.
3. Sentence Clustering
Clustering is the process of discovering natural group-
ings or clusters and identifying interesting distributions
and patterns within multidimensional data based on some
similarity measure. The topic of clustering has been ex-
tensively studied in many scientific disciplines such as
text mining, pattern recognition, IR etc. Document clus-
tering is a central problem in text mining which can be
defined as grouping documents into clusters according to
their topics or main contents. Document clustering has
many purposes including expanding a search space, gen-
erating a summary, automatic topic extraction, browsing
document collections, organizing information in digital
libraries and detecting topics. In the literature a wide
variety of clustering algorithms have been proposed for
different applications and sizes of data sets. The surveys
on the topics [32–35] offer a comprehensive summary of
the different applications and algorithms.
Generally clustering problems are determined by four
basic components [36,37]: 1) the (physical) representa-
tion of the given data set; 2) the distance/dissimilarity
measures between data points; 3) the criterion/objective
function which the clustering solutions should aim to
optimize; and, 4) the optimization procedure. For a given
data clustering problem, the four components are tightly
coupled. Various methods/criteria have been proposed
over the years from various perspectives and with vari-
ous focuses.
3.1. Sentence Similarity Measure Based on
Terms Co-Occurrence
Let a document $D$ be decomposed into a set of sentences $D = \{S_1, S_2, \ldots, S_n\}$, where $n$ is the number of sentences. Let $T = \{t_1, t_2, \ldots, t_m\}$ represent all the distinct words (terms) occurring in the document $D$, where $m$ is the number of words. In most existing document clustering algorithms, documents are represented using the vector space model (VSM) [33]. Each document is represented using these words as a vector in $m$-dimensional space. A major characteristic of this representation is the high dimensionality of the feature space, which imposes a big challenge to the performance of clustering algorithms. They could not work efficiently in high-dimensional feature spaces due to the inherent sparseness of the data [17]. The vector dimension $m$ is very large compared to the number of words in a sentence, thus the resulting vectors would have many null components [38]. In our method, a sentence $S_i$ is represented as the set of distinct terms appearing in it, $S_i = \{t_1, t_2, \ldots, t_{m_i}\}$, where $m_i$ is the number of distinct terms in the sentence $S_i$.
Similarity measures play an increasingly important
role in NLP and IR. Similarity measures have been used
in text-related research and application such as text min-
ing, information retrieving, text summarization, and text
clustering. These applications show that the computation
of sentence similarity has become a generic component
for the research community involved in knowledge rep-
resentation and discovery. There are more papers on similarity between documents than between sentences or short texts, but the latter are not few [38,39]. The paper [39] presents a
method for measuring the similarity between sentences
or very short texts, based on semantic and word order
information. First, semantic similarity is derived from a
lexical knowledge base and a corpus. Second, the pro-
posed method considers the impact of word order on the
sentence meaning. The overall sentence similarity is de-
fined as a combination of semantic similarity and word
order similarity. Liu et al. [40] present a novel method to
measure similarity between sentences by analyzing parts
of speech and using Dynamic Time Warping technique.
In [41] a novel measure based on the earth mover's distance (EMD) is proposed to evaluate document similarity
by allowing many-to-many matching between subtopics.
First, each document is decomposed into a set of subtop-
ics, and then the EMD is employed to evaluate the simi-
larity between two sets of subtopics for two documents
by solving the transportation problem. The proposed
measure is an improvement of the previous optimal
matching (OM)-based measure, which allows only
one-to-one matching between subtopics. The study of
semantic similarity between words has long been an in-
tegral part of IR and NLP [42]. The method proposed in [42] integrates both page counts and snippets to measure semantic similarity between a given pair of words. That paper modifies four popular co-occurrence measures, Jaccard, Overlap (Simpson), Dice and PMI (Point-wise Mutual Information), to compute semantic similarity using page counts.
In this section we present a method to measure similarity between sentences using the Normalized Google Distance (NGD) [43]. First we calculate a similarity measure between the terms before defining the similarity measure between the sentences. Using the NGD [43], the similarity measure between terms $t_k$ and $t_l$ is defined as:

$$\mathrm{sim}_{\mathrm{NGD}}(t_k, t_l) = \exp\bigl(-\mathrm{NGD}(t_k, t_l)\bigr) \qquad (1)$$

where

$$\mathrm{NGD}(t_k, t_l) = \frac{\max\{\log(f_k), \log(f_l)\} - \log(f_{kl})}{\log(n) - \min\{\log(f_k), \log(f_l)\}}, \qquad (2)$$

$f_k$ is the number of sentences containing the term $t_k$, $f_{kl}$ denotes the number of sentences containing both terms $t_k$ and $t_l$, and $n$ is the number of sentences in the document.
From the properties of the NGD it follows that [43]:

1) The range of $\mathrm{sim}_{\mathrm{NGD}}(t_k, t_l)$ is between 0 and 1. If $t_k = t_l$, or if $t_k \neq t_l$ but $f_k = f_l = f_{kl} > 0$, then $\mathrm{sim}_{\mathrm{NGD}}(t_k, t_l) = 1$; that is, the semantics of $t_k$ and $t_l$, in the Google sense, are the same. If $f_k \neq 0$, $f_l \neq 0$ but $f_{kl} = 0$, we take $\mathrm{NGD}(t_k, t_l) = \infty$, and then $\mathrm{sim}_{\mathrm{NGD}}(t_k, t_l) = 0$. If $0 < f_k, f_l < n$ and $0 < f_{kl} \leq \min\{f_k, f_l\}$, then $0 < \mathrm{sim}_{\mathrm{NGD}}(t_k, t_l) < 1$.

2) $\mathrm{sim}_{\mathrm{NGD}}(t_k, t_k) = 1$ for any $t_k$. For every pair $t_k$ and $t_l$, we have $\mathrm{sim}_{\mathrm{NGD}}(t_k, t_l) = \mathrm{sim}_{\mathrm{NGD}}(t_l, t_k)$: it is symmetric.
Using Formula (1) we define a similarity measure between the sentences $S_i$ and $S_j$ as follows:

$$\mathrm{sim}_{\mathrm{NGD}}(S_i, S_j) = \frac{\sum_{t_k \in S_i} \sum_{t_l \in S_j} \mathrm{sim}_{\mathrm{NGD}}(t_k, t_l)}{m_i \, m_j} \qquad (3)$$

From the properties of $\mathrm{sim}_{\mathrm{NGD}}(t_k, t_l)$ it follows that: 1) the range of $\mathrm{sim}_{\mathrm{NGD}}(S_i, S_j)$ is in $[0, 1]$; 2) $\mathrm{sim}_{\mathrm{NGD}}(S_i, S_i) \neq 0$ for every $S_i$; 3) for every pair $S_i$ and $S_j$, $\mathrm{sim}_{\mathrm{NGD}}(S_i, S_j) = \mathrm{sim}_{\mathrm{NGD}}(S_j, S_i)$: it is exchangeable (symmetric).
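For illustration, the following Python sketch implements Formulas (1)-(3) from sentence-level term counts. The helper names and the convention used when two terms never co-occur are our own assumptions, not part of the original method.

```python
import math

def ngd(f_k, f_l, f_kl, n):
    """Normalized Google Distance between two terms, Eq. (2); f_k, f_l are the
    numbers of sentences containing each term, f_kl the number containing both,
    and n the total number of sentences (assumed larger than f_k and f_l)."""
    if f_kl == 0:
        return float("inf")  # assumed convention: terms that never co-occur are maximally distant
    log_fk, log_fl = math.log(f_k), math.log(f_l)
    return (max(log_fk, log_fl) - math.log(f_kl)) / (math.log(n) - min(log_fk, log_fl))

def term_sim(f_k, f_l, f_kl, n):
    """Similarity between two terms, Eq. (1): sim_NGD = exp(-NGD)."""
    d = ngd(f_k, f_l, f_kl, n)
    return 0.0 if math.isinf(d) else math.exp(-d)

def sentence_sim(S_i, S_j, counts, cocounts, n):
    """Similarity between two sentences, Eq. (3): averaged pairwise term similarity.
    S_i, S_j are sets of distinct terms; counts[t] = f_t and
    cocounts[frozenset((t_k, t_l))] = f_kl are precomputed over the document."""
    total = 0.0
    for tk in S_i:
        for tl in S_j:
            f_kl = counts[tk] if tk == tl else cocounts.get(frozenset((tk, tl)), 0)
            total += term_sim(counts[tk], counts[tl], f_kl, n)
    return total / (len(S_i) * len(S_j))
```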
3.2. Objective Functions
Typically clustering algorithms can be categorized as
agglomerative or partitional based on the underlying
methodology of the algorithm, or as hierarchical or flat
(non-hierarchical) based on the structure of the final so-
lution [32–35]. A key characteristic of many partitional
clustering algorithms is that they use a global criterion
function whose optimization drives the entire clustering
process. In recent years, it has been recognized that the
partitional clustering technique is well suited for cluster-
ing a large document database due to their relatively low
computational requirements [44].
Automatic clustering is a process of dividing a set of objects into $k$ unknown groups, where the best number $k$ of groups (or clusters) is determined by the clustering algorithm. That is, objects within each group should be more similar to each other than to objects in any other group. The automatic clustering problem can be defined as follows [32–35,45]:
The set of sentences $D = \{S_1, S_2, \ldots, S_n\}$ is clustered into non-overlapping groups $C_1, C_2, \ldots, C_k$, where $C_p$ is called a cluster and $k$ is the unknown number of clusters. The partition should possess three properties:

1) Two different clusters should have no sentences in common, i.e. $C_p \cap C_q = \emptyset$ for $p \neq q$, $p, q = 1, 2, \ldots, k$;

2) Each sentence should definitely be attached to a cluster, i.e. $\bigcup_{p=1}^{k} C_p = D$;

3) Each cluster should have at least one sentence assigned, i.e. $C_p \neq \emptyset$, $p = 1, 2, \ldots, k$.
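As a small illustration, the sketch below (our own helper, not part of the paper) checks the three partition properties for a candidate clustering of sentence indices.

```python
def is_valid_partition(clusters, n):
    """clusters: list of sets of sentence indices 0..n-1; n: number of sentences."""
    all_assigned = set()
    for C_p in clusters:
        if not C_p:                       # property 3: no empty cluster
            return False
        if all_assigned & C_p:            # property 1: clusters are pairwise disjoint
            return False
        all_assigned |= C_p
    return all_assigned == set(range(n))  # property 2: every sentence is assigned
```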
Partitional clustering can be viewed as an optimization
procedure that tries to create high-quality clusters ac-
cording to a particular criterion function. Criterion func-
tions used in partitional clustering reflect the underlying
definition of the “goodness” of clusters. Many criterion
functions have been proposed in the literature [32–35,44]
to produce more balanced partitions.
We introduce a criterion function that is defined as follows:

$$F = \mathrm{sigm}(F_1)\,\bigl(1 - \mathrm{sigm}(F_2)\bigr) \rightarrow \max \qquad (4)$$

where $\mathrm{sigm}(z)$ is a sigmoid function that maps the real numbers into $[0, 1]$: $\mathrm{sigm}(z) = \dfrac{1}{1 + \exp(-z)}$.

The criterion Function (4) balances both intra-cluster similarity and inter-cluster dissimilarity. This function is obtained by combining two criteria:
$$F_1 = \sum_{p=1}^{k} \frac{1}{|C_p|} \sum_{S_i, S_l \in C_p} \mathrm{sim}_{\mathrm{NGD}}(S_i, S_l) \rightarrow \max \qquad (5)$$

and

$$F_2 = \sum_{p=1}^{k-1} \sum_{q=p+1}^{k} \frac{1}{|C_p|\,|C_q|} \sum_{S_i \in C_p} \sum_{S_l \in C_q} \mathrm{sim}_{\mathrm{NGD}}(S_i, S_l) \rightarrow \min \qquad (6)$$
The criterion Function (5) maximizes the average sum of the pairwise similarity between the sentences assigned to each cluster. The $F_2$ criterion Function (6) computes the clustering by finding a solution that separates each cluster from the other clusters. It minimizes the similarity between the sentences $S_i$ and $S_l$ assigned to different clusters $C_p$ and $C_q$, respectively.
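A minimal sketch of the criterion Functions (4)-(6), under our reading of the formulas above, is given below; `sim` is assumed to be the matrix of sentence similarities from Formula (3), `clusters` a list of lists of sentence indices, and the function names are ours.

```python
import math

def sigm(z):
    # Sigmoid used in Eq. (4) to map F1 and F2 into [0, 1].
    return 1.0 / (1.0 + math.exp(-z))

def F1(sim, clusters):
    # Eq. (5): per-cluster averaged pairwise intra-cluster similarity, summed over clusters.
    return sum(sum(sim[i][l] for i in C for l in C) / len(C) for C in clusters)

def F2(sim, clusters):
    # Eq. (6): averaged similarity between sentences lying in different clusters.
    total = 0.0
    for p in range(len(clusters) - 1):
        for q in range(p + 1, len(clusters)):
            Cp, Cq = clusters[p], clusters[q]
            total += sum(sim[i][l] for i in Cp for l in Cq) / (len(Cp) * len(Cq))
    return total

def F(sim, clusters):
    # Eq. (4): larger when intra-cluster similarity is high and inter-cluster similarity is low.
    return sigm(F1(sim, clusters)) * (1.0 - sigm(F2(sim, clusters)))
```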
4. Modified Discrete Differential Evolution Algorithm (MDDE)

There are many techniques that can be used to optimize the criterion Functions (4)-(6) described in Section 3. The paper [17] presents an up-to-date survey
on evolutionary algorithms for clustering. It tries to re-
flect the profile of this area by focusing more on those
subjects that have been given more importance in the
literature. Particularly, the paper has focused mainly on
hard partitional algorithms, though overlapping (soft/
fuzzy) approaches have also been covered. An original
contribution of the present paper is that it discusses key
issues on the design of evolutionary algorithms for data
partitioning problems, such as usually adopted represen-
tations, evolutionary operators, and fitness functions, just
to mention a few. In particular, mutation and crossover
operators commonly described in the literature are con-
ceptually analyzed, giving special emphasis to those
genetic operators specifically designed for clustering
problems (i.e., cluster-oriented and context-sensitive
operators). In our study these criterion functions were optimized using a differential evolution (DE) algorithm [45,46].
The execution of the differential evolution is similar to
other evolutionary algorithms like genetic algorithms or
evolution strategies. The evolutionary algorithms differ
mainly in the representation of parameters (usually bi-
nary strings are used for genetic algorithms while pa-
rameters are real-valued for evolution strategies and dif-
ferential evolution) and in the evolutionary operators.
Like other evolutionary algorithms, DE starts with a population of $N$ $n$-dimensional search variable vectors. The classical DE [45,46] is a population-based global optimization technique that uses a real-coded representation.
In our study we use a genetic encoding that deals with
discrete variables (clusters), such that each component of
the chromosome takes a value between 1 and k and
represents the cluster to which the sentence is assigned.
Potential solutions to the optimization problem are represented by a population of chromosomes. The initial population of chromosomes is generated by producing series of random integers, uniformly generated between 1 and $k$ inclusively. Potential solutions (chromosomes) to the target problem are encoded as fixed-length discrete strings, i.e., $X_r(t) = [x_{r,1}(t), x_{r,2}(t), \ldots, x_{r,n}(t)]$, where $x_{r,s}(t) \in \{1, 2, \ldots, k\}$, $r = 1, 2, \ldots, N$, $s = 1, 2, \ldots, n$, and $N$ is the size of the population. For example, given the number of clusters $k = 4$, the number of sentences $n = 8$ and the population size $N = 3$, a population can contain the chromosome $X_1(t) = [2, 1, 3, 4, 2, 1, 4, 3]$ together with two further chromosomes $X_2(t)$ and $X_3(t)$ generated in the same way. In this example, chromosome $X_1(t)$ represents a candidate solution where sentences $S_2$ and $S_6$ are assigned to cluster $C_1$; sentences $S_1$ and $S_5$ are located in cluster $C_2$; sentences $S_3$ and $S_8$ are assigned to cluster $C_3$; and $S_4$ and $S_7$ are located in cluster $C_4$.
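The integer encoding can be illustrated with a short sketch (the helper name `decode` is ours); decoding the chromosome $X_1(t)$ above recovers exactly the cluster memberships listed in the text.

```python
def decode(chromosome, k):
    """chromosome[s] = label of the cluster (1..k) to which sentence S_{s+1} is assigned."""
    clusters = {c: [] for c in range(1, k + 1)}
    for s, label in enumerate(chromosome, start=1):
        clusters[label].append(s)       # sentence S_s goes to cluster C_label
    return clusters

X1 = [2, 1, 3, 4, 2, 1, 4, 3]
print(decode(X1, 4))
# {1: [2, 6], 2: [1, 5], 3: [3, 8], 4: [4, 7]}
```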
For each individual vector $X_r(t)$ that belongs to the current population, DE randomly samples three other individuals $X_{r_1}(t)$, $X_{r_2}(t)$, and $X_{r_3}(t)$ from the same generation (for mutually different $r \neq r_1 \neq r_2 \neq r_3$). It then calculates the difference of $X_{r_2}(t)$ and $X_{r_3}(t)$, scales it by a scalar $\lambda$, and creates a trial offspring $Y_r(t+1) = [y_{r,1}(t+1), y_{r,2}(t+1), \ldots, y_{r,n}(t+1)]$ by adding the result to $X_{r_1}(t)$. Thus, for the $s$-th component of each vector,

$$y_{r,s}(t+1) = \begin{cases} x_{r_1,s}(t) + \lambda\,\bigl(x_{r_2,s}(t) - x_{r_3,s}(t)\bigr), & \text{if } rnd_s < CR \\ x_{r,s}(t), & \text{otherwise} \end{cases} \qquad (7)$$
The scaling factor $\lambda$ and the crossover rate $CR$ are control parameters of DE, set by the user. Both values remain constant during the search process. $\lambda$ is a real-valued factor (usually in the range $[0, 1]$) that controls the amplification of differential variations, and $CR$ is a real-valued crossover factor in the range $[0, 1]$ controlling the probability of choosing the mutated value for $x_{r,s}$ instead of its current value. $rnd_s$ is a uniformly distributed random number within the range $[0, 1]$, chosen once for each $s = 1, 2, \ldots, n$.
If the new offspring yields a better value of the objective function, it replaces its parent in the next generation; otherwise, the parent is retained in the population, i.e.,

$$X_r(t+1) = \begin{cases} Y_r(t+1), & \text{if } f(Y_r(t+1)) \geq f(X_r(t)) \\ X_r(t), & \text{if } f(Y_r(t+1)) < f(X_r(t)) \end{cases} \qquad (8)$$

where $f(\cdot)$ is the objective function to be maximized.

After initialization of the population, the calculation of the fitness is performed. To judge the quality of a partition provided by a chromosome, it is necessary to have a fitness function. The fitness functions are defined as

$$\mathrm{fitness}_1(X_a(t)) = F_1(X_a(t)), \qquad (9)$$

$$\mathrm{fitness}_2(X_a(t)) = \frac{1}{F_2(X_a(t))}, \qquad (10)$$

$$\mathrm{fitness}(X_a(t)) = F(X_a(t)). \qquad (11)$$

Maximization of the fitness Functions (9)-(11) leads to maximization of the objective Function (5), minimization of Function (6), and maximization of Function (4), respectively.

The main difference between the traditional discrete DE (DDE) algorithms and the algorithm proposed here is that it uses a mutation operation adopted from genetic algorithms. In our modification, at iteration $t+1$, for each vector $X_r(t)$ the algorithm creates a vector $m_r(t+1) = [m_{r,1}(t+1), m_{r,2}(t+1), \ldots, m_{r,n}(t+1)]$, which is defined as:

$$m_{r,s}(t+1) = \begin{cases} 1, & \text{if } \mathrm{rand} < \mathrm{sigm}(y_{r,s}(t+1)) \\ 0, & \text{otherwise} \end{cases} \qquad (12)$$

The vector $m_r(t+1)$ represents the changes that will be needed to move the particle from $X_r(t)$ to $X_r(t+1)$. If a component of the vector $m_r(t+1)$ is one, it means that this component will be copied from $X_r(t)$ to $X_r(t+1)$. If a component of $m_r(t+1)$ is null, it means that this component will be mutated. The inversion operator has been used as the mutation operator. The inversion operator takes the components of $X_r(t)$ corresponding to the null components of the vector $m_r(t+1)$ and inverts them to form a new particle. The pseudo-code of the inversion operator is described in Figure 1 ($|S|$ is the cardinality of the set $S$) [11].



procedure inversion operator
    S ← ∅
    for s = 1 : n
        if m_{r,s}(t+1) = 1 then x_{r,s}(t+1) ← x_{r,s}(t)
        else S ← S ∪ {s}
        end_if
    end_for
    while |S| > 0
        s_max ← max(S), s_min ← min(S)
        x_{r,s_min}(t+1) ← x_{r,s_max}(t)
        x_{r,s_max}(t+1) ← x_{r,s_min}(t)
        S ← S \ {s_min, s_max}
    end_while

Figure 1. Pseudo-code of an inversion operator
Let us consider the following example (Figure 2):

X_r(t):      3 2 4 1 2 3 4 1
m_r(t+1):    0 1 1 0 0 1 0 1
X_r(t+1):    4 2 4 2 1 3 3 1

Figure 2. Inversion operator
The components of the vector $X_r(t+1)$, using the pseudo-code described in Figure 1, are calculated as follows:

Step 1. If $m_{r,s}(t+1) = 1$ then $x_{r,s}(t+1) = x_{r,s}(t)$, i.e. $x_{r,2}(t+1) = x_{r,2}(t) = 2$, $x_{r,3}(t+1) = x_{r,3}(t) = 4$, $x_{r,6}(t+1) = x_{r,6}(t) = 3$, and $x_{r,8}(t+1) = x_{r,8}(t) = 1$.

Step 2. $S = \{s : m_{r,s}(t+1) = 0\} = \{1, 4, 5, 7\}$.

Step 3. $s_{\max} = \max S = \max\{1, 4, 5, 7\} = 7$, $s_{\min} = \min S = \min\{1, 4, 5, 7\} = 1$.

Step 4. $x_{r,s_{\min}}(t+1) = x_{r,s_{\max}}(t)$ and $x_{r,s_{\max}}(t+1) = x_{r,s_{\min}}(t)$, i.e. $x_{r,1}(t+1) = x_{r,7}(t) = 4$ and $x_{r,7}(t+1) = x_{r,1}(t) = 3$.

Step 5. $S = S \setminus \{s_{\min}, s_{\max}\} = \{1, 4, 5, 7\} \setminus \{1, 7\} = \{4, 5\}$.

Step 6. $s_{\max} = \max\{4, 5\} = 5$, $s_{\min} = \min\{4, 5\} = 4$.

Step 7. $x_{r,s_{\min}}(t+1) = x_{r,s_{\max}}(t)$ and $x_{r,s_{\max}}(t+1) = x_{r,s_{\min}}(t)$, i.e. $x_{r,4}(t+1) = x_{r,5}(t) = 2$ and $x_{r,5}(t+1) = x_{r,4}(t) = 1$.

Step 8. $S = S \setminus \{s_{\min}, s_{\max}\} = \{4, 5\} \setminus \{4, 5\} = \emptyset$, and hence stop.
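A compact Python sketch of one MDDE offspring construction, combining Formulas (7) and (12) with the inversion operator of Figure 1, is given below; the random-number handling and helper names are our own assumptions. Applied to the data of Figure 2, the inversion step reproduces $X_r(t+1)$.

```python
import math
import random

def sigm(z):
    return 1.0 / (1.0 + math.exp(-z))

def invert(x_r, m):
    """Inversion operator (Figure 1): components with m=1 are copied from X_r(t);
    the components whose m is 0 have their values reversed in order."""
    child = list(x_r)
    S = [s for s in range(len(x_r)) if m[s] == 0]
    for s, s_rev in zip(S, reversed(S)):
        child[s] = x_r[s_rev]
    return child

def mdde_offspring(x_r, x_r1, x_r2, x_r3, lam, CR):
    n = len(x_r)
    # Eq. (7): real-valued trial vector built from three randomly sampled chromosomes.
    y = [x_r1[s] + lam * (x_r2[s] - x_r3[s]) if random.random() < CR else x_r[s]
         for s in range(n)]
    # Eq. (12): per component, decide whether to copy (m=1) or mutate (m=0).
    m = [1 if random.random() < sigm(y[s]) else 0 for s in range(n)]
    return invert(x_r, m)

# Figure 2 example: X_r(t) and m_r(t+1) give X_r(t+1)
print(invert([3, 2, 4, 1, 2, 3, 4, 1], [0, 1, 1, 0, 0, 1, 0, 1]))
# [4, 2, 4, 2, 1, 3, 3, 1]
```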
The stopping criterion of DE could be a given number of consecutive iterations within which no improvement of the solutions is observed, a specified CPU time limit, or a maximum number of iterations (fitness calculations), $t_{\max}$. Unless otherwise specified, in this paper we use the last one as the termination criterion, i.e. the algorithm terminates when the maximum number of fitness calculations is reached.

Extractive summarization works by choosing a subset of the sentences in the original document. This process can be viewed as identifying the most salient sentences in a cluster that give the necessary and sufficient amount of information related to the main content of the cluster (topic). In a cluster of related sentences, many of the sentences are expected to be somewhat similar to each other, since they are all about the same topic. The approach proposed in the papers [13,19] is to assess the centrality of each sentence in a cluster and extract the most important ones to include in the summary. In centroid-based summarization, the sentences that contain more words from the centroid of the cluster are considered as central. Centrality of a sentence is often defined in terms of the centrality of the words that it contains. Here we use another criterion to assess sentence salience, developed in [11], which is based on the technique proposed in [47].

5. Experiments and Results

To evaluate the performance of our methods we used two document datasets, DUC2001 and DUC2002, and the corresponding 100-word summaries generated for each document. DUC2001 and DUC2002 are open benchmark datasets which contain 309 and 567 document-summary pairs, respectively, from the Document Understanding Conference (http://duc.nist.gov). The datasets DUC2001 and DUC2002 are clustered into 30 and 59 topics, respectively.
At a preprocessing step, the stopwords in each sen-
tence were removed using the stoplist provided in
ftp://ftp.cs.cornell.edu/pub/smart/english.stop and the
remaining words were stemmed using Porter's scheme [48].
We use the ROUGE (Recall-Oriented Understudy for Gisting Evaluation) toolkit [49,50], which was adopted by DUC for automatic summarization evaluation. It has been shown that ROUGE is very effective for measuring document summarization. It measures summary quality by counting overlapping units such as N-grams, word sequences and word pairs between the candidate summary and the reference summary. The ROUGE-N measure compares the N-grams of two summaries and counts the number of matches. The measure is defined by Formula (14) [49–51]:

$$\mathrm{ROUGE\text{-}N} = \frac{\displaystyle\sum_{S \in Summ_{ref}} \; \sum_{\mathrm{N\text{-}gram} \in S} Count_{match}(\mathrm{N\text{-}gram})}{\displaystyle\sum_{S \in Summ_{ref}} \; \sum_{\mathrm{N\text{-}gram} \in S} Count(\mathrm{N\text{-}gram})} \qquad (14)$$
where N stands for the length of the N-gram, $Count_{match}(\mathrm{N\text{-}gram})$ is the maximum number of N-grams co-occurring in the candidate summary and the set of reference summaries, and $Count(\mathrm{N\text{-}gram})$ is the number of N-grams in the reference summaries. We show three of the ROUGE metrics in the experimental results: ROUGE-1, ROUGE-2 and ROUGE-SU4. The ROUGE-1 and ROUGE-2 scores are based on the overlap of unigrams and bigrams, respectively, between the candidate summary and the reference summary. The ROUGE-SU4 score is also based on the overlap of bigrams between summaries, but allows for gaps to occur between words (skip-bigram), with a maximum gap length of 4 words, and includes unigram co-occurrence statistics as well.
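For concreteness, a minimal sketch of the ROUGE-N overlap in Formula (14) for one candidate summary against a set of reference summaries is shown below; clipped n-gram counting is assumed, as in the ROUGE toolkit, and the helper names are ours.

```python
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, references, n):
    """candidate: list of tokens; references: list of token lists."""
    cand = ngrams(candidate, n)
    match, total = 0, 0
    for ref in references:
        ref_counts = ngrams(ref, n)
        total += sum(ref_counts.values())                                 # Count(N-gram) over references
        match += sum(min(cnt, cand[g]) for g, cnt in ref_counts.items())  # Count_match(N-gram)
    return match / total if total else 0.0

# usage: rouge_n("the cat sat".split(), ["the cat sat on the mat".split()], 1)
```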
The optimization procedure used here is stochastic in nature. Hence, for each criterion function ($F_1$, $F_2$ and $F$) it has been run several times for different values of the parameters $CR \in [0.3; 0.8]$ and $MR \in [0.1; 0.5]$. In the experiments the size of the population and the number of iterations were kept unchanged, changing only the parameters CR and MR with step 0.1. For both datasets we take the same number of iterations, which is 1000. The population size is 40% of the total number of documents in the datasets. The parameters of the DE are reported in Table 1.
The first experiment compares our methods with other methods. We compare our proposed methods with both supervised and unsupervised methods. Among the supervised methods we choose SVM [4] and CRF [21]. SVM is one of the state-of-the-art classifiers. CRF combines the merits of HMM (Hidden Markov Model) and LR (Logistic Regression). HMM extends Naive Bayes (NB) by considering the sequential information, while LR is a discriminative version of NB. The unsupervised methods we compare include QCS [16] and the graph-based algorithm HITS [27] (among the several options of graph-based algorithms [27], the method based on the authority score of HITS on the directed backward graph is the best; therefore it is taken by us for comparison). Tables 2 and 3 show the results of all the methods in terms of the ROUGE-1, ROUGE-2, and ROUGE-SU4 metrics on the DUC2001 and DUC2002 datasets, respectively. As shown in Tables 2 and 3, on the DUC2001 dataset the values of the ROUGE-1, ROUGE-2 and ROUGE-SU4 metrics of all the methods are better than on the DUC2002 dataset. In Tables 2 and 3 the highlighted (bold italic) entries represent the best performing methods.

The numerical comparison of our methods with the methods CRF, QCS, HITS and SVM is shown in Tables 4 and 5. Here we use the relative improvement $\frac{(\text{our method} - \text{other method})}{\text{other method}} \times 100$ for comparison. In spite of the fact that among our criterion functions the worst result is obtained by the criterion function $F_1$, it still shows a better result than the other methods.
Compared with the best method QCS, on the DUC2001 (DUC2002) dataset the criterion function $F_1$ improves the performance by 1.05% (0.57%), 0.37% (0.43%) and 0.59% (0.31%) in terms of the ROUGE-1, ROUGE-2 and ROUGE-SU4 metrics, respectively.

Table 1. Parameters of the DE

Dataset  | Population size, N | Number of iterations, t_max | Crossover rate, CR | Mutation rate, MR
DUC2001  | 155                | 1000                        | 0.6                | 0.2
DUC2002  | 285                | 1000                        | 0.6                | 0.2
Table 2. ROUGE scores for summarization methods on the DUC2001 dataset

Methods | ROUGE-1 | ROUGE-2 | ROUGE-SU4
F       | 0.45836 | 0.19164 | 0.21574
F_1     | 0.45603 | 0.19046 | 0.21427
F_2     | 0.45952 | 0.19338 | 0.21763
CRF     | 0.44598 | 0.18564 | 0.20934
QCS     | 0.45129 | 0.18976 | 0.21302
HITS    | 0.43528 | 0.18317 | 0.20627
SVM     | 0.43132 | 0.18136 | 0.20372
Table 3. ROUGE scores for summarization methods on the DUC2002 dataset

Methods | ROUGE-1 | ROUGE-2 | ROUGE-SU4
F       | 0.45119 | 0.18847 | 0.21184
F_1     | 0.44985 | 0.18896 | 0.21234
F_2     | 0.45412 | 0.18982 | 0.21268
CRF     | 0.44155 | 0.17974 | 0.20129
SVM     | 0.43405 | 0.17084 | 0.19036
QCS     | 0.44865 | 0.18766 | 0.21119
HITS    | 0.42806 | 0.16792 | 0.18924
Table 4. Comparison of our methods with other methods on the DUC2001 dataset

Methods | Metrics   | % of improvement
        |           | F_2  | F_1  | F
CRF     | ROUGE-1   | 3.04 | 2.25 | 2.78
        | ROUGE-2   | 4.17 | 2.60 | 3.23
        | ROUGE-SU4 | 3.96 | 2.36 | 3.06
QCS     | ROUGE-1   | 1.82 | 1.05 | 1.57
        | ROUGE-2   | 1.91 | 0.37 | 0.99
        | ROUGE-SU4 | 2.16 | 0.59 | 1.28
HITS    | ROUGE-1   | 5.57 | 4.77 | 5.30
        | ROUGE-2   | 5.57 | 3.98 | 4.62
        | ROUGE-SU4 | 5.51 | 3.88 | 4.59
SVM     | ROUGE-1   | 6.54 | 5.73 | 6.27
        | ROUGE-2   | 6.63 | 5.02 | 5.67
        | ROUGE-SU4 | 6.83 | 5.18 | 5.90
Table 5. Comparison of our methods with other methods on the DUC2002 dataset

Methods | Metrics   | % of improvement
        |           | F_2   | F_1   | F
CRF     | ROUGE-1   | 2.85  | 2.18  | 1.88
        | ROUGE-2   | 5.61  | 4.86  | 5.13
        | ROUGE-SU4 | 5.66  | 5.24  | 5.49
QCS     | ROUGE-1   | 1.22  | 0.57  | 0.27
        | ROUGE-2   | 1.15  | 0.43  | 0.69
        | ROUGE-SU4 | 0.71  | 0.31  | 0.54
HITS    | ROUGE-1   | 6.09  | 5.40  | 5.09
        | ROUGE-2   | 13.04 | 12.53 | 12.24
        | ROUGE-SU4 | 12.39 | 11.94 | 12.21
SVM     | ROUGE-1   | 4.62  | 3.95  | 3.64
        | ROUGE-2   | 11.11 | 10.32 | 10.61
        | ROUGE-SU4 | 11.73 | 11.28 | 11.55
6. Conclusions
We have presented an unsupervised approach to automatic document summarization. Our approach consists of two steps. First, sentences are clustered, and then representative sentences are defined and extracted from each cluster. In our study we developed a modified discrete differential evolution algorithm to optimize the objective functions. When comparing our methods to several existing summarization methods on the open DUC2001 and DUC2002 datasets, we found that our methods can improve the summarization results significantly. The methods were evaluated using the ROUGE-1, ROUGE-2 and ROUGE-SU4 metrics.

REFERENCES

[1] M. A. Fattah and F. Ren, “GA, MR, FFNN, PNN and GMM based models for automatic text summarization,” Computer Speech and Language, Vol. 23, No. 1, pp. 126–144, 2009.
[2] U. Hahn and I. Mani, “The challenges of automatic summarization,” IEEE Computer, Vol. 33, No. 11, pp. 29–36, 2000.

[3] I. Mani and M. T. Maybury, “Advances in automated text summarization,” MIT Press, Cambridge, 442p, 1999.

[4] J-Y. Yeh, H-R. Ke, W-P. Yang, and I-H. Meng, “Text summarization using a trainable summarizer and latent semantic analysis,” Information Processing and Management, Vol. 41, No. 1, pp. 75–95, 2005.

[5] S. Ye, T.-S. Chua, M.-Y. Kan, and L. Qiu, “Document concept lattice for text understanding and summarization,” Information Processing and Management, Vol. 43, No. 6, pp. 1643–1662, 2007.

[6] R. M. Alguliev and R. M. Aliguliyev, “Effective summarization method of text documents,” Proceedings of the 2005 IEEE/WIC/ACM International Conference on Web Intelligence (WI’05), France, pp. 264–271, 19–22 September 2005.

[7] R. M. Alguliev and R. M. Alyguliev, “Automatic text documents summarization through sentences clustering,” Journal of Automation and Information Sciences, Vol. 40, No. 9, pp. 53–63, 2008.

[8] R. M. Alguliev, R. M. Aliguliyev, and A. M. Bagirov, “Global optimization in the summarization of text documents,” Automatic Control and Computer Sciences, Vol. 39, No. 6, pp. 42–47, 2005.

[9] R. M. Aliguliyev, “A novel partitioning-based clustering method and generic document summarization,” Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT’06 Workshops) (WI-IATW’06), Hong Kong, China, pp. 626–629, 18–22 December 2006.

[10] R. M. Aliguliyev, “A new sentence similarity measure and sentence based extractive technique for automatic text summarization,” Expert Systems with Applications, Vol. 36, No. 4, pp. 7764–7772, 2009.

[11] R. M. Aliguliyev, “Clustering techniques and discrete particle swarm optimization algorithm for multi-document summarization,” Computational Intelligence (accepted).

[12] Y. Gong and X. Liu, “Generic text summarization using relevance measure and latent semantic analysis,” Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, New Orleans, USA, pp. 19–25, 9–12 September 2001.

[13] D. R. Radev, H. Jing, M. Stys, and D. Tam, “Centroid-based summarization of multiple documents,” Information Processing and Management, Vol. 40, No. 6, pp. 919–938, 2004.

[14] G. Salton, A. Singhal, M. Mitra, and C. Buckley, “Automatic text structuring and summarization,” Information Processing and Management, Vol. 33, No. 2, pp. 193–207, 1997.

[15] K. M. Svore, L. Vanderwende, and C. J. C. Burges, “Enhancing single-document summarization by combining RankNet and third-party sources,” Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL’07), Prague, Czech Republic, pp. 448–457, 28–30 June 2007.

[16] D. M. Dunlavy, D. P. O’Leary, J. M. Conroy, and J. D. Schlesinger, “QCS: A system for querying, clustering and summarizing documents,” Information Processing and Management, Vol. 43, No. 6, pp. 1588–1605, 2007.

[17] K. S. Jones, “Automatic summarizing: the state of the art,” Information Processing and Management, Vol. 43, No. 6, pp. 1449–1481, 2007.

[18] D. Zajic, B. J. Dorr, J. Lin, and R. Schwartz, “Multi-candidate reduction: sentence compression as a tool for document summarization tasks,” Information Processing and Management, Vol. 43, No. 6, pp. 1549–1570, 2007.

[19] G. Erkan and D. R. Radev, “Lexrank: Graph-based lexical centrality as salience in text summarization,” Journal of Artificial Intelligence Research, Vol. 22, pp. 457–479, 2004.

[20] D. Radev, E. Hovy, and K. McKeown, “Introduction to the special issue on summarization,” Computational Linguistics, Vol. 28, No. 4, pp. 399–408, 2002.

[21] D. Shen, J.-T. Sun, H. Li, Q. Yang, and Z. Chen, “Document summarization using conditional random fields,” Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI’07), Hyderabad, India, pp. 2862–2867, January 6–12, 2007.

[22] J. D. Lafferty, A. McCallum, and F. C. N. Pereira, “Conditional random fields: probabilistic models for segmenting and labeling sequence data,” Proceedings of the 18th International Conference on Machine Learning, pp. 282–289, 28 June–01 July 2001.

[23] X. Wan, “A novel document similarity measure based on earth mover’s distance,” Information Sciences, Vol. 177, No. 18, pp. 3718–3730, 2007.

[24] S. Fisher and B. Roark, “Query-focused summarization by supervised sentence ranking and skewed word distributions,” Proceedings of the Document Understanding Workshop (DUC’06), New York, USA, 8p, 8–9 June 2006.

[25] P. Fung and G. Ngai, “One story, one flow: Hidden Markov story models for multilingual multidocument summarization,” ACM Transactions on Speech and Language Processing, Vol. 3, No. 2, pp. 1–16, 2006.

[26] D. M. McDonald and H. Chen, “Summary in context: Searching versus browsing,” ACM Transactions on Information Systems, Vol. 24, No. 1, pp. 111–141, 2006.

[27] R. Mihalcea and P. Tarau, “A language independent algorithm for single and multiple document summarizations,” Proceedings of the Second International Joint Conference on Natural Language Processing (IJCNLP’05), Korea, pp. 602–607, 11–13 October 2005.

[28] J. Li, L. Sun, C. Kit, and J. Webster, “A query-focused multi-document summarizer based on lexical chains,” Proceedings of the Document Understanding Conference (DUC’07), New York, USA, 4p, 26–27 April 2007.

[29] X. Wan, “Using only cross-document relationships for both generic and topic-focused multi-document summarizations,” Information Retrieval, Vol. 11, No. 1, pp. 25–49, 2008.

[30] R. Mihalcea and H. Ceylan, “Explorations in automatic book summarization,” Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL’07), Prague, Czech Republic, pp. 380–389, 28–30 June 2007.

[31] Y. Guo and G. Stylios, “An intelligent summarization system based on cognitive psychology,” Information Sciences, Vol. 174, No. 1–2, pp. 1–36, 2005.

[32] J. Grabmeier and A. Rudolph, “Techniques of cluster algorithms in data mining,” Data Mining and Knowledge Discovery, Vol. 6, No. 4, pp. 303–360, 2002.

[33] J. Han and M. Kamber, “Data mining: concepts and techniques,” 2nd Edition, Morgan Kaufman, San Francisco, 800p, 2006.

[34] A. K. Jain, M. N. Murty, and P. J. Flynn, “Data clustering: A review,” ACM Computing Surveys, Vol. 31, No. 3, pp. 264–323, 1999.

[35] M. G. H. Omran, A. P. Engelbrecht, and A. Salman, “An overview of clustering methods,” Intelligent Data Analysis, Vol. 11, No. 6, pp. 583–605, 2007.

[36] K. M. Hammouda and M. S. Kamel, “Efficient phrase-based document indexing for web document clustering,” IEEE Transactions on Knowledge and Data Engineering, Vol. 16, No. 10, pp. 1279–1296, 2004.

[37] T. Li, “A unified view on clustering binary data,” Machine Learning, Vol. 62, No. 3, pp. 199–215, 2006.

[38] Y. Li, C. Luo, and S. M. Chung, “Text clustering with feature selection by using statistical data,” IEEE Transactions on Knowledge and Data Engineering, Vol. 20, No. 20, pp. 641–652, 2008.

[39] Y. Li, D. McLean, Z. A. Bandar, J. D. O’Shea, and K. Crockett, “Sentence similarity based on semantic nets and corpus statistics,” IEEE Transactions on Knowledge and Data Engineering, Vol. 18, No. 8, pp. 1138–1150, 2006.

[40] X. Liu, Y. Zhou, and R. Zheng, “Sentence similarity based on dynamic time warping,” Proceedings of the First International Conference on Semantic Computing (ICSC’07), Irvine, USA, pp. 250–256, 17–19 September 2007.

[41] X. Wan, J. Yang, and J. Xiao, “Manifold-ranking based topic-focused multi-document summarization,” Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI’07), Hyderabad, India, pp. 2903–2908, January 6–12, 2007.

[42] D. Bollegala, Y. Matsuo, and M. Ishizuka, “Measuring semantic similarity between words using web search engines,” Proceedings of the 16th World Wide Web Conference (WWW16), Alberta, Canada, pp. 757–766, May 8–12, 2007.

[43] R. L. Cilibrasi and P. M. B. Vitanyi, “The google similarity distance,” IEEE Transactions on Knowledge and Data Engineering, Vol. 19, No. 3, pp. 370–383, 2007.

[44] Y. Zhao and G. Karypis, “Empirical and theoretical comparisons of selected criterion functions for document clustering,” Machine Learning, Vol. 55, No. 3, pp. 311–331, 2004.

[45] S. Das, A. Abraham, and A. Konar, “Automatic clustering using an improved differential evolution algorithm,” IEEE Transactions on Systems, Man, and Cybernetics – Part A: Systems and Humans, Vol. 38, No. 1, pp. 218–237, 2008.

[46] R. Storn and K. Price, “Differential evolution – a simple and efficient heuristic for global optimization over continuous spaces,” Journal of Global Optimization, Vol. 11, No. 4, pp. 341–359, 1997.

[47] M. Pavan and M. Pelillo, “Dominant sets and pairwise clustering,” IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 29, No. 1, pp. 167–172, 2007.

[48] M. Porter, “An algorithm for suffix stripping,” Program, Vol. 14, No. 3, pp. 130–137, 1980.

[49] C.-Y. Lin, “ROUGE: A package for automatic evaluation of summaries,” Proceedings of the Workshop on Text Summarization Branches Out, Barcelona, Spain, pp. 74–81, 25–26 July 2004.

[50] C.-Y. Lin and E. H. Hovy, “Automatic evaluation of summaries using n-gram co-occurrence statistics,” Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology (HLT-NAACL’03), Edmonton, Canada, Vol. 1, pp. 71–78, 27 May–1 June 2003.

[51] H. Nanba and M. Okumura, “An automatic method for summary evaluation using multiple evaluation results by a manual method,” Proceedings of the COLING/ACL Main Conference Poster Sessions, Sydney, Australia, pp. 603–610, 17–18 July 2006.