Intelligent Information Management, 2009, 1, 128-138
doi:10.4236/iim.2009.12019 Published Online November 2009 (http://www.scirp.org/journal/iim)
Evolutionary Algorithm for Extractive Text
Summarization
Rasim ALGULIEV, Ramiz ALIGULIYEV
Institute of Information Technology, Azerbaijan National Academy of Sciences, Baku, Azerbaijan
Email: rasim@science.az, a.ramiz@science.az
Abstract: Text summarization is the process of automatically creating a compressed version of a given
document preserving its information content. There are two types of summarization: extractive and abstrac-
tive. Extractive summarization methods simplify the problem of summarization into the problem of selecting
a representative subset of the sentences in the original documents. Abstractive summarization may compose
novel sentences, unseen in the original sources. In our study we focus on sentence based extractive document
summarization. The extractive summarization systems are typically based on techniques for sentence extrac-
tion and aim to cover the set of sentences that are most important for the overall understanding of a given
document. In this paper, we propose an unsupervised document summarization method that creates the summary
by clustering and extracting sentences from the original document. For this purpose new criterion functions
for sentence clustering have been proposed. Similarity measures play an increasingly important role in
document clustering. We have also developed a discrete differential evolution algorithm to optimize the
criterion functions. The experimental results show that our suggested approach can improve the performance
compared to state-of-the-art summarization approaches.
Keywords: sentence clustering, document summarization, discrete differential evolution algorithm
1. Introduction
Text summarization is the process of automatically cre-
ating a compressed version of a given document pre-
serving its information content. Automatic document
summarization is an important research area in natural
language processing (NLP). The technology of automatic
document summarization is developing and may provide
a solution to the information overload problem [1–3].
The process of text summarization can be decomposed
into three phases: analysis, transformation, and synthesis.
The analysis phase analyzes the input text and selects a
few salient features. The transformation phase transforms
the results of the analysis into a summary representation.
Finally, the synthesis phase takes the summary represen-
tation, and produces an appropriate summary corre-
sponding to users’ needs. In the overall process, com-
pression rate, which is defined as the ratio between the
length of the summary and that of the original, is an im-
portant factor that influences the quality of the summary.
As the compression rate decreases, the summary will be
more concise; however, more information is lost. As the compression rate increases, the summary will be larger and, relatively, more insignificant information is contained. In fact, when the compression rate is 5–30%, the
quality of the summary is acceptable [1–4].
Text summarization can be categorized into two ap-
proaches: extractive and abstractive. Extractive summa-
rization methods simplify the problem of summarization
into the problem of selecting a representative subset of
the sentences in the original documents. Abstractive
summarization may compose novel sentences, unseen in
the original sources [1]. However, abstractive approaches
require deep NLP such as semantic representation, in-
ference and natural language generation, which have yet to reach a mature stage [5].
Extractive summarization systems are commonly used
in automatic summarization to produce extractive sum-
maries. Systems for extractive summarization are typi-
cally based on techniques for sentence extraction, and
attempt to identify the set of sentences that are most im-
portant for the overall understanding of a given docu-
ment. Most commonly, such systems use some kind of
similarity or centrality metric to identify the set of sen-
tences to include in the summary [1,4,6–15]. For exam-
ple, in [14] a paragraph extraction from a document
based on intra-document links between paragraphs is
proposed. It yields a TRM (Text Relationship Map) from
intra-links, which indicate that the linked texts are se-
mantically related. It proposes four strategies from the
TRM: bushy path, depth-first path, segmented bushy path, and augmented segmented bushy path. An improved
version of this approach is proposed in [1,4,6].
In this paper we demonstrate an extractive text summarization method which is based on sentence clustering.
For this purpose new criterion functions for sentence
clustering have been offered. In our study we developed
a discrete differential evolution algorithm to optimize the
criterion functions. The experimental results on the open
benchmark datasets from DUC2001 and DUC2002 show
that our suggested approach can improve the perform-
ance compared to state-of-the-art summarization ap-
proaches.
The rest of this paper is organized as follows: Section
2 introduces related works. The proposed sentence clus-
tering based approach for generic single-document sum-
marization is presented in Section 3. The discrete differ-
ential evolution algorithm for optimization procedure is
given in Section 4. The experiments and results are given
in Section 5. Finally, we conclude our paper in Section 6.
2. Related Work
Automatic document summarization has been actively
investigated in recent years, and most researchers have
concentrated on the extractive summarization method,
but not the abstractive summarization method – see for
example, the 6th issue of Information Processing and Management: An International Journal, 2007. The cur-
rent paper contains four references to the articles pub-
lished in this edition [5,16–18]. A comprehensive survey
of document summarization can be found in [17].
The centroid-based method [19,13] is one of the most
popular extractive summarization methods. MEAD
(http://www.summarization.com/mead/) is an implemen-
tation of the centroid-based method for either single- or
multi-document summarizing. It is based on sentence
extraction. For each sentence in a cluster of related
documents, MEAD computes three features and uses a
linear combination of the three to determine what sen-
tences are most salient. The three features used are cen-
troid score, position, and overlap with first sentence
(which may happen to be the title of a document). For
single documents or (given) clusters it computes centroid
topic characterizations using tf-idf-type data. It ranks
candidate summary sentences by combining sentence
scores against centroid, text position value, and tf-idf
title/lead overlap. Sentence selection is constrained by a
summary length threshold, and redundant new sentences are avoided by checking cosine similarity against prior ones
[20]. In [21] each document is considered as a sequence
of sentences and the objective of extractive summariza-
tion is to label the sentences in the sequence with 1 and 0,
where a label of 1 indicates that a sentence is a summary
sentence while 0 denotes a non-summary sentence. To
accomplish this task, a conditional random field is ap-
plied [22]. A novel extractive approach based on manifold-ranking of sentences for query-based multi-document summarization is proposed in [23]. This approach first uses the manifold-ranking process to compute a manifold-ranking score for each sentence that denotes the biased information richness of the sentence, and then uses a greedy algorithm to penalize redundancy and select the sentences with the highest overall scores, which are considered both informative and novel, and highly biased to the given query.
The summarization techniques can be classified into
two groups: supervised techniques, which rely on machine
learning algorithms trained on pre-existing document-
summary pairs, and unsupervised techniques, based on
properties and heuristics derived from the text. Super-
vised extractive summarization techniques [1,4–6,21,
24–26] treat the summarization task as a two-class clas-
sification problem at the sentence level, where the sum-
mary sentences are positive samples while the non-
summary sentences are negative samples. After repre-
senting each sentence by a vector of features, the classi-
fication function can be trained in two different manners
[21]. The first is in a discriminative way with well-
known algorithms such as SVM (Support Vector Ma-
chine) [4]. In [1], the use of genetic algorithm (GA),
mathematical regression (MR), feed forward neural net-
work (FFNN), probabilistic neural network (PNN) and
Gaussian mixture model (GMM) for the automatic text summarization task has been investigated. This ap-
proach is a trainable summarizer, which takes into ac-
count several features, including sentence position, posi-
tive keyword, negative keyword, sentence centrality,
sentence resemblance to the title, sentence inclusion of named entities, sentence inclusion of numerical data, sen-
tence relative length, bushy path of the sentence and ag-
gregated similarity for each sentence to generate summa-
ries. The article [25] presents a multi-document, multi-
lingual, theme-based summarization system based on
modeling text cohesion (story flow). Many unsupervised
methods have been developed for document summariza-
tion by exploiting different features and relationships of
the sentences, such as clustering of sentences [7–11], the
hidden topics in the documents [12], and graphs based on the similarity of sentences [13,19,27].
Recently, graph-based methods have been offered to
rank sentences. Lexrank [19] and [27] are two such systems, using the PageRank and HITS algorithms to com-
pute sentence importance. Lexrank is used to compute
sentence importance based on the concept of eigenvector
centrality in a graph representation of sentences for
multi-document summarization task. The graph-based
extractive summarization algorithms succeed in identi-
fying the most important sentences in a text based on
information exclusively drawn from the text itself.
Unlike other systems, which attempt to find out what
makes a good summary by training on collections of
summaries built for other articles, the graph-based meth-
ods are fully unsupervised, and rely on the given texts to
derive an extractive summary [23].
On the other hand, summarization task can also be
categorized as either generic or query-based. A query-
based summary presents the information that is most
relevant to the given queries [16,23,24,28,29] while a
generic summary gives an overall sense of the docu-
ment’s content [7–14,19,27,30]. The QCS system (Query,
Cluster, and Summarize) [16] performs the following
tasks in response to a query: retrieves relevant docu-
ments; separates the retrieved documents into clusters by
topic, and creates a summary for each cluster. QCS is a
tool for document retrieval that presents results in a for-
mat so that a user can quickly identify a set of documents
of interest. In [29] are developed a generic, a query-
based, and a hybrid summarizer, each with differing
amounts of document context. The generic summarizer
used a blend of discourse information and information
obtained through traditional surface-level analysis. The
query-based summarizer used only query-term informa-
tion, and the hybrid summarizer used some discourse
information along with query-term information.
Automatic document summarization is a highly interdisciplinary research area related to computer science, multimedia, statistics, as well as cognitive psychology. In [31], an intelligent system, the event indexing and summarization (EIS) system, is introduced for automatic document summarization; it is based on a cognitive
psychology model (the event-indexing model) and the
roles and importance of sentences and their syntax in
document understanding. The EIS system involves syn-
tactic analysis of sentences, clustering and indexing sen-
tences with five indices from the event-indexing model,
and extracting the most prominent content by lexical
analysis at phrase and clause levels.
3. Sentence Clustering
Clustering is the process of discovering natural group-
ings or clusters and identifying interesting distributions
and patterns within multidimensional data based on some
similarity measure. The topic of clustering has been ex-
tensively studied in many scientific disciplines such as
text mining, pattern recognition, IR etc. Document clus-
tering is a central problem in text mining which can be
defined as grouping documents into clusters according to
their topics or main contents. Document clustering has
many purposes including expanding a search space, gen-
erating a summary, automatic topic extraction, browsing
document collections, organizing information in digital
libraries and detecting topics. In the literature a wide
variety of clustering algorithms have been proposed for
different applications and sizes of data sets. The surveys
on the topics [32–35] offer a comprehensive summary of
the different applications and algorithms.
Generally clustering problems are determined by four
basic components [36,37]: 1) the (physical) representa-
tion of the given data set; 2) the distance/dissimilarity
measures between data points; 3) the criterion/objective
function which the clustering solutions should aim to
optimize; and, 4) the optimization procedure. For a given
data clustering problem, the four components are tightly
coupled. Various methods/criteria have been proposed
over the years from various perspectives and with vari-
ous focuses.
3.1. Sentence Similarity Measure Based on
Terms Co-Occurrence
Let a document $D$ be decomposed into a set of sentences $D = \{S_1, S_2, \ldots, S_n\}$, where $n$ is the number of sentences. Let $T = \{t_1, t_2, \ldots, t_m\}$ represent all the distinct words (terms) occurring in the document $D$, where $m$ is the number of words. In most existing document clustering algorithms, documents are represented using the vector space model (VSM) [33]. Each document is represented using these words as a vector in $m$-dimensional space. A major characteristic of this representation is the high dimensionality of the feature space, which imposes a big challenge to the performance of clustering algorithms. They could not work efficiently in high-dimensional feature spaces due to the inherent sparseness of the data [17]. The vector dimension $m$ is very large compared to the number of words in a sentence, thus the resulting vectors would have many null components [38]. In our method, a sentence $S_i$ is represented as the set of distinct terms appearing in it, $S_i = \{t_1, t_2, \ldots, t_{m_i}\}$, where $m_i$ is the number of distinct terms in the sentence $S_i$.
Similarity measures play an increasingly important
role in NLP and IR. Similarity measures have been used
in text-related research and application such as text min-
ing, information retrieving, text summarization, and text
clustering. These applications show that the computation
of sentence similarity has become a generic component
for the research community involved in knowledge rep-
resentation and discovery. There are more papers on similarity between documents than between sentences or short texts, but the latter are not few [38,39]. The paper [39] presents a
method for measuring the similarity between sentences
or very short texts, based on semantic and word order
information. First, semantic similarity is derived from a
lexical knowledge base and a corpus. Second, the pro-
posed method considers the impact of word order on the
sentence meaning. The overall sentence similarity is de-
fined as a combination of semantic similarity and word
order similarity. Liu et al. [40] present a novel method to
measure similarity between sentences by analyzing parts
of speech and using Dynamic Time Warping technique.
In [41] a novel measure based on the earth mover's distance (EMD) is proposed to evaluate document similarity
by allowing many-to-many matching between subtopics.
First, each document is decomposed into a set of subtop-
ics, and then the EMD is employed to evaluate the simi-
larity between two sets of subtopics for two documents
by solving the transportation problem. The proposed
measure is an improvement of the previous optimal
matching (OM)-based measure, which allows only
one-to-one matching between subtopics. The study of
semantic similarity between words has long been an in-
tegral part of IR and NLP [42]. The method proposed in [42] integrates both page counts and snippets to measure semantic similarity between a given pair of words. That paper modifies four popular co-occurrence measures, Jaccard, Overlap (Simpson), Dice and PMI (Point-wise Mutual Information), to compute semantic similarity using page counts.
In this section we present a method to measure similarity between sentences using the Normalized Google Distance (NGD) [43]. First we calculate a similarity measure between the terms before defining the similarity measure between the sentences. Using the NGD [43], the similarity measure between terms $t_k$ and $t_l$ is defined as:

$$\mathrm{sim}_{\mathrm{NGD}}(t_k, t_l) = \exp\bigl(-\mathrm{NGD}(t_k, t_l)\bigr) \qquad (1)$$

where

$$\mathrm{NGD}(t_k, t_l) = \frac{\max\{\log(f_k), \log(f_l)\} - \log(f_{kl})}{\log(n) - \min\{\log(f_k), \log(f_l)\}}, \qquad (2)$$

$f_k$ is the number of sentences containing the term $t_k$, $f_{kl}$ denotes the number of sentences containing both terms $t_k$ and $t_l$, and $n$ is the number of sentences in the document.
From the properties of the NGD it follows that [43]:

1) The range of $\mathrm{sim}_{\mathrm{NGD}}(t_k, t_l)$ is between 0 and 1. If $t_k = t_l$, or if $t_k \neq t_l$ but $f_k = f_l = f_{kl} > 0$, then $\mathrm{sim}_{\mathrm{NGD}}(t_k, t_l) = 1$; that is, the semantics of $t_k$ and $t_l$, in the Google sense, are the same. If $f_k \neq 0$, $f_l \neq 0$ but $f_{kl} = 0$, we take $\mathrm{NGD}(t_k, t_l) = \infty$, and then $\mathrm{sim}_{\mathrm{NGD}}(t_k, t_l) = 0$. If $0 < f_k, f_l < n$ and $0 < f_{kl} \leq \min\{f_k, f_l\}$, then $0 < \mathrm{sim}_{\mathrm{NGD}}(t_k, t_l) < 1$.

2) $\mathrm{sim}_{\mathrm{NGD}}(t_k, t_k) = 1$ for any $t_k$. For every pair $t_k$ and $t_l$, we have $\mathrm{sim}_{\mathrm{NGD}}(t_k, t_l) = \mathrm{sim}_{\mathrm{NGD}}(t_l, t_k)$: it is symmetric.
Using Formula (1) we define a similarity measure between the sentences $S_i$ and $S_j$ as follows:

$$\mathrm{sim}_{\mathrm{NGD}}(S_i, S_j) = \frac{\sum_{t_k \in S_i} \sum_{t_l \in S_j} \mathrm{sim}_{\mathrm{NGD}}(t_k, t_l)}{m_i \, m_j} \qquad (3)$$

From the properties of $\mathrm{sim}_{\mathrm{NGD}}(t_k, t_l)$ it follows that: 1) the range of $\mathrm{sim}_{\mathrm{NGD}}(S_i, S_j)$ is in $[0, 1]$; 2) $\mathrm{sim}_{\mathrm{NGD}}(S_i, S_i) \neq 0$ for every $S_i$; 3) for every pair $S_i$ and $S_j$, $\mathrm{sim}_{\mathrm{NGD}}(S_i, S_j) = \mathrm{sim}_{\mathrm{NGD}}(S_j, S_i)$: it is exchangeable (symmetric).
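For illustration, the following Python sketch implements Formulas (1)-(3) from sentence-level term counts. The helper names and the convention used when two terms never co-occur are our own assumptions, not part of the original method.

```python
import math

def ngd(f_k, f_l, f_kl, n):
    """Normalized Google Distance between two terms, Eq. (2); f_k, f_l are the
    numbers of sentences containing each term, f_kl the number containing both,
    and n the total number of sentences (assumed larger than f_k and f_l)."""
    if f_kl == 0:
        return float("inf")  # assumed convention: terms that never co-occur are maximally distant
    log_fk, log_fl = math.log(f_k), math.log(f_l)
    return (max(log_fk, log_fl) - math.log(f_kl)) / (math.log(n) - min(log_fk, log_fl))

def term_sim(f_k, f_l, f_kl, n):
    """Similarity between two terms, Eq. (1): sim_NGD = exp(-NGD)."""
    d = ngd(f_k, f_l, f_kl, n)
    return 0.0 if math.isinf(d) else math.exp(-d)

def sentence_sim(S_i, S_j, counts, cocounts, n):
    """Similarity between two sentences, Eq. (3): averaged pairwise term similarity.
    S_i, S_j are sets of distinct terms; counts[t] = f_t and
    cocounts[frozenset((t_k, t_l))] = f_kl are precomputed over the document."""
    total = 0.0
    for tk in S_i:
        for tl in S_j:
            f_kl = counts[tk] if tk == tl else cocounts.get(frozenset((tk, tl)), 0)
            total += term_sim(counts[tk], counts[tl], f_kl, n)
    return total / (len(S_i) * len(S_j))
```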
3.2. Objective Functions
Typically clustering algorithms can be categorized as
agglomerative or partitional based on the underlying
methodology of the algorithm, or as hierarchical or flat
(non-hierarchical) based on the structure of the final so-
lution [32–35]. A key characteristic of many partitional
clustering algorithms is that they use a global criterion
function whose optimization drives the entire clustering
process. In recent years, it has been recognized that the
partitional clustering technique is well suited for cluster-
ing a large document database due to their relatively low
computational requirements [44].
Automatic clustering is a process of dividing a set of objects into $k$ unknown groups, where the best number $k$ of groups (or clusters) is determined by the clustering algorithm. That is, objects within each group should be more similar to each other than to objects in any other group. The automatic clustering problem can be defined as follows [32–35,45]:
The set of sentences $D = \{S_1, S_2, \ldots, S_n\}$ is clustered into non-overlapping groups $C_1, C_2, \ldots, C_k$, where $C_p$ is called a cluster and $k$ is the unknown number of clusters. The partition should possess three properties:

1) Two different clusters should have no sentences in common, i.e. $C_p \cap C_q = \emptyset$ for $p \neq q$, $p, q = 1, 2, \ldots, k$;

2) Each sentence should definitely be attached to a cluster, i.e. $\bigcup_{p=1}^{k} C_p = D$;

3) Each cluster should have at least one sentence assigned, i.e. $C_p \neq \emptyset$, $p = 1, 2, \ldots, k$.
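As a small illustration, the sketch below (our own helper, not part of the paper) checks the three partition properties for a candidate clustering of sentence indices.

```python
def is_valid_partition(clusters, n):
    """clusters: list of sets of sentence indices 0..n-1; n: number of sentences."""
    all_assigned = set()
    for C_p in clusters:
        if not C_p:                       # property 3: no empty cluster
            return False
        if all_assigned & C_p:            # property 1: clusters are pairwise disjoint
            return False
        all_assigned |= C_p
    return all_assigned == set(range(n))  # property 2: every sentence is assigned
```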
Partitional clustering can be viewed as an optimization
procedure that tries to create high-quality clusters ac-
cording to a particular criterion function. Criterion func-
tions used in partitional clustering reflect the underlying
definition of the “goodness” of clusters. Many criterion
functions have been proposed in the literature [32–35,44]
to produce more balanced partitions.
We introduce a criterion function that is defined as follows:

$$F = \mathrm{sigm}(F_1)\,\bigl(1 - \mathrm{sigm}(F_2)\bigr) \rightarrow \max \qquad (4)$$

where $\mathrm{sigm}(z)$ is a sigmoid function that maps the real numbers into $[0, 1]$: $\mathrm{sigm}(z) = \dfrac{1}{1 + \exp(-z)}$.

The criterion Function (4) balances both intra-cluster similarity and inter-cluster dissimilarity. This function is obtained by combining two criteria:
$$F_1 = \sum_{p=1}^{k} \frac{1}{|C_p|} \sum_{S_i, S_l \in C_p} \mathrm{sim}_{\mathrm{NGD}}(S_i, S_l) \rightarrow \max \qquad (5)$$

and

$$F_2 = \sum_{p=1}^{k-1} \sum_{q=p+1}^{k} \frac{1}{|C_p|\,|C_q|} \sum_{S_i \in C_p} \sum_{S_l \in C_q} \mathrm{sim}_{\mathrm{NGD}}(S_i, S_l) \rightarrow \min \qquad (6)$$
The criterion Function (5) maximizes the average sum of the pairwise similarity between the sentences assigned to each cluster. The $F_2$ criterion Function (6) computes the clustering by finding a solution that separates each cluster from the other clusters. It minimizes the similarity between the sentences $S_i$ and $S_l$ assigned to different clusters $C_p$ and $C_q$, respectively.
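A minimal sketch of the criterion Functions (4)-(6), under our reading of the formulas above, is given below; `sim` is assumed to be the matrix of sentence similarities from Formula (3), `clusters` a list of lists of sentence indices, and the function names are ours.

```python
import math

def sigm(z):
    # Sigmoid used in Eq. (4) to map F1 and F2 into [0, 1].
    return 1.0 / (1.0 + math.exp(-z))

def F1(sim, clusters):
    # Eq. (5): per-cluster averaged pairwise intra-cluster similarity, summed over clusters.
    return sum(sum(sim[i][l] for i in C for l in C) / len(C) for C in clusters)

def F2(sim, clusters):
    # Eq. (6): averaged similarity between sentences lying in different clusters.
    total = 0.0
    for p in range(len(clusters) - 1):
        for q in range(p + 1, len(clusters)):
            Cp, Cq = clusters[p], clusters[q]
            total += sum(sim[i][l] for i in Cp for l in Cq) / (len(Cp) * len(Cq))
    return total

def F(sim, clusters):
    # Eq. (4): larger when intra-cluster similarity is high and inter-cluster similarity is low.
    return sigm(F1(sim, clusters)) * (1.0 - sigm(F2(sim, clusters)))
```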
4. Modified Discrete Differential Evolution Algorithm (MDDE)

There are many techniques that can be used to optimize the criterion Functions (4)-(6) described in Section 3. The paper [17] presents an up-to-date survey
on evolutionary algorithms for clustering. It tries to re-
flect the profile of this area by focusing more on those
subjects that have been given more importance in the
literature. Particularly, the paper has focused mainly on
hard partitional algorithms, though overlapping (soft/
fuzzy) approaches have also been covered. An original
contribution of the present paper is that it discusses key
issues on the design of evolutionary algorithms for data
partitioning problems, such as usually adopted represen-
tations, evolutionary operators, and fitness functions, just
to mention a few. In particular, mutation and crossover
operators commonly described in the literature are con-
ceptually analyzed, giving special emphasis to those
genetic operators specifically designed for clustering
problems (i.e., cluster-oriented and context-sensitive
operators). In our study these criterion functions were optimized using a differential evolution (DE) algorithm [45,46].
The execution of the differential evolution is similar to
other evolutionary algorithms like genetic algorithms or
evolution strategies. The evolutionary algorithms differ
mainly in the representation of parameters (usually bi-
nary strings are used for genetic algorithms while pa-
rameters are real-valued for evolution strategies and dif-
ferential evolution) and in the evolutionary operators.
Like other evolutionary algorithms, DE starts with a population of $N$ $n$-dimensional search variable vectors. The classical DE [45,46] is a population-based global optimization technique that uses a real-coded representation.
In our study we use a genetic encoding that deals with
discrete variables (clusters), such that each component of
the chromosome takes a value between 1 and k and
represents the cluster to which the sentence is assigned.
Potential solutions to the optimization problem are represented by a population of chromosomes. The initial population of chromosomes is generated by producing series of random integers, uniformly generated between 1 and $k$ inclusively. Potential solutions (chromosomes) to the target problem are encoded as fixed-length discrete strings, i.e., $X_r(t) = [x_{r,1}(t), x_{r,2}(t), \ldots, x_{r,n}(t)]$, where $x_{r,s}(t) \in \{1, 2, \ldots, k\}$, $r = 1, 2, \ldots, N$, $s = 1, 2, \ldots, n$, and $N$ is the size of the population. For example, given the number of clusters $k = 4$, the number of sentences $n = 8$ and the population size $N = 3$, a population can contain the chromosome $X_1(t) = [2, 1, 3, 4, 2, 1, 4, 3]$ together with two further chromosomes $X_2(t)$ and $X_3(t)$ generated in the same way. In this example, chromosome $X_1(t)$ represents a candidate solution where sentences $S_2$ and $S_6$ are assigned to cluster $C_1$; sentences $S_1$ and $S_5$ are located in cluster $C_2$; sentences $S_3$ and $S_8$ are assigned to cluster $C_3$; and $S_4$ and $S_7$ are located in cluster $C_4$.
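The integer encoding can be illustrated with a short sketch (the helper name `decode` is ours); decoding the chromosome $X_1(t)$ above recovers exactly the cluster memberships listed in the text.

```python
def decode(chromosome, k):
    """chromosome[s] = label of the cluster (1..k) to which sentence S_{s+1} is assigned."""
    clusters = {c: [] for c in range(1, k + 1)}
    for s, label in enumerate(chromosome, start=1):
        clusters[label].append(s)       # sentence S_s goes to cluster C_label
    return clusters

X1 = [2, 1, 3, 4, 2, 1, 4, 3]
print(decode(X1, 4))
# {1: [2, 6], 2: [1, 5], 3: [3, 8], 4: [4, 7]}
```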
For each individual vector $X_r(t)$ that belongs to the current population, DE randomly samples three other individuals $X_{r_1}(t)$, $X_{r_2}(t)$, and $X_{r_3}(t)$ from the same generation (for mutually different $r \neq r_1 \neq r_2 \neq r_3$). It then calculates the difference of $X_{r_2}(t)$ and $X_{r_3}(t)$, scales it by a scalar $\lambda$, and creates a trial offspring $Y_r(t+1) = [y_{r,1}(t+1), y_{r,2}(t+1), \ldots, y_{r,n}(t+1)]$ by adding the result to $X_{r_1}(t)$. Thus, for the $s$-th component of each vector,

$$y_{r,s}(t+1) = \begin{cases} x_{r_1,s}(t) + \lambda\,\bigl(x_{r_2,s}(t) - x_{r_3,s}(t)\bigr), & \text{if } rnd_s < CR \\ x_{r,s}(t), & \text{otherwise} \end{cases} \qquad (7)$$
The scaling factor $\lambda$ and the crossover rate $CR$ are control parameters of DE, set by the user. Both values remain constant during the search process. $\lambda$ is a real-valued factor (usually in the range $[0, 1]$) that controls the amplification of differential variations, and $CR$ is a real-valued crossover factor in the range $[0, 1]$ controlling the probability of choosing the mutated value for $x_{r,s}$ instead of its current value. $rnd_s$ is a uniformly distributed random number within the range $[0, 1]$, chosen once for each $s = 1, 2, \ldots, n$.
If the new offspring yields a better value of the objective function, it replaces its parent in the next generation; otherwise, the parent is retained in the population, i.e.,

$$X_r(t+1) = \begin{cases} Y_r(t+1), & \text{if } f(Y_r(t+1)) \geq f(X_r(t)) \\ X_r(t), & \text{if } f(Y_r(t+1)) < f(X_r(t)) \end{cases} \qquad (8)$$

where $f(\cdot)$ is the objective function to be maximized.

After initialization of the population, the calculation of the fitness is performed. To judge the quality of a partition provided by a chromosome, it is necessary to have a fitness function. The fitness functions are defined as

$$\mathrm{fitness}_1(X_a(t)) = F_1(X_a(t)), \qquad (9)$$

$$\mathrm{fitness}_2(X_a(t)) = \frac{1}{F_2(X_a(t))}, \qquad (10)$$

$$\mathrm{fitness}(X_a(t)) = F(X_a(t)). \qquad (11)$$

Maximization of the fitness Functions (9)-(11) leads to maximization of the objective Function (5), minimization of Function (6), and maximization of Function (4), respectively.

The main difference between the traditional discrete DE (DDE) algorithms and the algorithm proposed here is that it uses a mutation operation adopted from genetic algorithms. In our modification, at iteration $t+1$, for each vector $X_r(t)$ the algorithm creates a vector $m_r(t+1) = [m_{r,1}(t+1), m_{r,2}(t+1), \ldots, m_{r,n}(t+1)]$, which is defined as:

$$m_{r,s}(t+1) = \begin{cases} 1, & \text{if } \mathrm{rand} < \mathrm{sigm}(y_{r,s}(t+1)) \\ 0, & \text{otherwise} \end{cases} \qquad (12)$$

The vector $m_r(t+1)$ represents the changes that will be needed to move the particle from $X_r(t)$ to $X_r(t+1)$. If a component of the vector $m_r(t+1)$ is one, it means that this component will be copied from $X_r(t)$ to $X_r(t+1)$. If a component of $m_r(t+1)$ is null, it means that this component will be mutated. The inversion operator has been used as the mutation operator. The inversion operator takes the components of $X_r(t)$ corresponding to the null components of the vector $m_r(t+1)$ and inverts them to form a new particle. The pseudo-code of the inversion operator is described in Figure 1 ($|S|$ is the cardinality of the set $S$) [11].



procedure inversion operator
    S ← ∅
    for s = 1 : n
        if m_{r,s}(t+1) = 1 then x_{r,s}(t+1) ← x_{r,s}(t)
        else S ← S ∪ {s}
        end_if
    end_for
    while |S| > 0
        s_max ← max(S), s_min ← min(S)
        x_{r,s_min}(t+1) ← x_{r,s_max}(t)
        x_{r,s_max}(t+1) ← x_{r,s_min}(t)
        S ← S \ {s_min, s_max}
    end_while

Figure 1. Pseudo-code of an inversion operator
Let us consider the following example (Figure 2):

X_r(t):      3 2 4 1 2 3 4 1
m_r(t+1):    0 1 1 0 0 1 0 1
X_r(t+1):    4 2 4 2 1 3 3 1

Figure 2. Inversion operator
The components of the vector $X_r(t+1)$, using the pseudo-code described in Figure 1, are calculated as follows:

Step 1. If $m_{r,s}(t+1) = 1$ then $x_{r,s}(t+1) = x_{r,s}(t)$, i.e. $x_{r,2}(t+1) = x_{r,2}(t) = 2$, $x_{r,3}(t+1) = x_{r,3}(t) = 4$, $x_{r,6}(t+1) = x_{r,6}(t) = 3$, and $x_{r,8}(t+1) = x_{r,8}(t) = 1$.

Step 2. $S = \{s : m_{r,s}(t+1) = 0\} = \{1, 4, 5, 7\}$.

Step 3. $s_{\max} = \max S = \max\{1, 4, 5, 7\} = 7$, $s_{\min} = \min S = \min\{1, 4, 5, 7\} = 1$.

Step 4. $x_{r,s_{\min}}(t+1) = x_{r,s_{\max}}(t)$ and $x_{r,s_{\max}}(t+1) = x_{r,s_{\min}}(t)$, i.e. $x_{r,1}(t+1) = x_{r,7}(t) = 4$ and $x_{r,7}(t+1) = x_{r,1}(t) = 3$.

Step 5. $S = S \setminus \{s_{\min}, s_{\max}\} = \{1, 4, 5, 7\} \setminus \{1, 7\} = \{4, 5\}$.

Step 6. $s_{\max} = \max\{4, 5\} = 5$, $s_{\min} = \min\{4, 5\} = 4$.

Step 7. $x_{r,s_{\min}}(t+1) = x_{r,s_{\max}}(t)$ and $x_{r,s_{\max}}(t+1) = x_{r,s_{\min}}(t)$, i.e. $x_{r,4}(t+1) = x_{r,5}(t) = 2$ and $x_{r,5}(t+1) = x_{r,4}(t) = 1$.

Step 8. $S = S \setminus \{s_{\min}, s_{\max}\} = \{4, 5\} \setminus \{4, 5\} = \emptyset$, and hence stop.
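A compact Python sketch of one MDDE offspring construction, combining Formulas (7) and (12) with the inversion operator of Figure 1, is given below; the random-number handling and helper names are our own assumptions. Applied to the data of Figure 2, the inversion step reproduces $X_r(t+1)$.

```python
import math
import random

def sigm(z):
    return 1.0 / (1.0 + math.exp(-z))

def invert(x_r, m):
    """Inversion operator (Figure 1): components with m=1 are copied from X_r(t);
    the components whose m is 0 have their values reversed in order."""
    child = list(x_r)
    S = [s for s in range(len(x_r)) if m[s] == 0]
    for s, s_rev in zip(S, reversed(S)):
        child[s] = x_r[s_rev]
    return child

def mdde_offspring(x_r, x_r1, x_r2, x_r3, lam, CR):
    n = len(x_r)
    # Eq. (7): real-valued trial vector built from three randomly sampled chromosomes.
    y = [x_r1[s] + lam * (x_r2[s] - x_r3[s]) if random.random() < CR else x_r[s]
         for s in range(n)]
    # Eq. (12): per component, decide whether to copy (m=1) or mutate (m=0).
    m = [1 if random.random() < sigm(y[s]) else 0 for s in range(n)]
    return invert(x_r, m)

# Figure 2 example: X_r(t) and m_r(t+1) give X_r(t+1)
print(invert([3, 2, 4, 1, 2, 3, 4, 1], [0, 1, 1, 0, 0, 1, 0, 1]))
# [4, 2, 4, 2, 1, 3, 3, 1]
```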
The stopping criterion of DE could be a given number of consecutive iterations within which no improvement of the solutions is observed, a specified CPU time limit, or a maximum number of iterations (fitness calculations), $t_{\max}$. Unless otherwise specified, in this paper we use the last one as the termination criterion, i.e. the algorithm terminates when the maximum number of fitness calculations is reached.

Extractive summarization works by choosing a subset of the sentences in the original document. This process can be viewed as identifying the most salient sentences in a cluster that give the necessary and sufficient amount of information related to the main content of the cluster (topic). In a cluster of related sentences, many of the sentences are expected to be somewhat similar to each other, since they are all about the same topic. The approach proposed in the papers [13,19] is to assess the centrality of each sentence in a cluster and extract the most important ones to include in the summary. In centroid-based summarization, the sentences that contain more words from the centroid of the cluster are considered as central. Centrality of a sentence is often defined in terms of the centrality of the words that it contains. Here we use another criterion to assess sentence salience, developed in [11], which is based on the technique proposed in [47].

5. Experiments and Results

To evaluate the performance of our methods we used two document datasets, DUC2001 and DUC2002, and the corresponding 100-word summaries generated for each document. DUC2001 and DUC2002 are open benchmark datasets which contain 309 and 567 document-summary pairs, respectively, from the Document Understanding Conference (http://duc.nist.gov). The datasets DUC2001 and DUC2002 are clustered into 30 and 59 topics, respectively.
At a preprocessing step, the stopwords in each sen-
tence were removed using the stoplist provided in
ftp://ftp.cs.cornell.edu/pub/smart/english.stop and the
remaining words were stemmed using Porter's scheme [48].
We use the ROUGE (Recall-Oriented Understudy for Gisting Evaluation) toolkit [49,50], which was adopted by DUC for automatic summarization evaluation. It has been shown that ROUGE is very effective for measuring document summarization. It measures summary quality by counting overlapping units such as N-grams, word sequences and word pairs between the candidate summary and the reference summary. The ROUGE-N measure compares the N-grams of two summaries and counts the number of matches. The measure is defined by Formula (14) [49–51]:

$$\mathrm{ROUGE\text{-}N} = \frac{\displaystyle\sum_{S \in Summ_{ref}} \; \sum_{\mathrm{N\text{-}gram} \in S} Count_{match}(\mathrm{N\text{-}gram})}{\displaystyle\sum_{S \in Summ_{ref}} \; \sum_{\mathrm{N\text{-}gram} \in S} Count(\mathrm{N\text{-}gram})} \qquad (14)$$
where N stands for the length of the N-gram, $Count_{match}(\mathrm{N\text{-}gram})$ is the maximum number of N-grams co-occurring in the candidate summary and the set of reference summaries, and $Count(\mathrm{N\text{-}gram})$ is the number of N-grams in the reference summaries. We show three of the ROUGE metrics in the experimental results: ROUGE-1, ROUGE-2 and ROUGE-SU4. The ROUGE-1 and ROUGE-2 scores are based on the overlap of unigrams and bigrams, respectively, between the candidate summary and the reference summary. The ROUGE-SU4 score is also based on the overlap of bigrams between summaries, but allows for gaps to occur between words (skip-bigram), with a maximum gap length of 4 words, and includes unigram co-occurrence statistics as well.
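For concreteness, a minimal sketch of the ROUGE-N overlap in Formula (14) for one candidate summary against a set of reference summaries is shown below; clipped n-gram counting is assumed, as in the ROUGE toolkit, and the helper names are ours.

```python
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, references, n):
    """candidate: list of tokens; references: list of token lists."""
    cand = ngrams(candidate, n)
    match, total = 0, 0
    for ref in references:
        ref_counts = ngrams(ref, n)
        total += sum(ref_counts.values())                                 # Count(N-gram) over references
        match += sum(min(cnt, cand[g]) for g, cnt in ref_counts.items())  # Count_match(N-gram)
    return match / total if total else 0.0

# usage: rouge_n("the cat sat".split(), ["the cat sat on the mat".split()], 1)
```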
The optimization procedure used here is stochastic in nature. Hence, for each criterion function ($F_1$, $F_2$ and $F$) it has been run several times for different values of the parameters $CR \in [0.3; 0.8]$ and $MR \in [0.1; 0.5]$. In the experiments the size of the population and the number of iterations were kept unchanged, changing only the parameters CR and MR with step 0.1. For both datasets we take the same number of iterations, which is 1000. The population size is 40% of the total number of documents in the datasets. The parameters of the DE are reported in Table 1.
The first experiment compares our methods with other methods. We compare our proposed methods with both supervised and unsupervised methods. Among the supervised methods we choose SVM [4] and CRF [21]. SVM is one of the state-of-the-art classifiers. CRF combines the merits of HMM (Hidden Markov Model) and LR (Logistic Regression). HMM extends Naive Bayes (NB) by considering the sequential information, while LR is a discriminative version of NB. The unsupervised methods we compare include QCS [16] and the graph-based algorithm HITS [27] (among the several options of graph-based algorithms [27], the method based on the authority score of HITS on the directed backward graph is the best; therefore it is taken by us for comparison). Tables 2 and 3 show the results of all the methods in terms of the ROUGE-1, ROUGE-2, and ROUGE-SU4 metrics on the DUC2001 and DUC2002 datasets, respectively. As shown in Tables 2 and 3, on the DUC2001 dataset the values of the ROUGE-1, ROUGE-2 and ROUGE-SU4 metrics of all the methods are better than on the DUC2002 dataset. In Tables 2 and 3 the highlighted (bold italic) entries represent the best performing methods.

The numerical comparison of our methods with the methods CRF, QCS, HITS and SVM is shown in Tables 4 and 5. Here we use the relative improvement $\frac{(\text{our method} - \text{other method})}{\text{other method}} \times 100$ for comparison. In spite of the fact that among our criterion functions the worst result is obtained by the criterion function $F_1$, it still shows a better result than the other methods.
Compared with the best method QCS, on the DUC2001 (DUC2002) dataset the criterion function $F_1$ improves the performance by 1.05% (0.57%), 0.37% (0.43%) and 0.59% (0.31%) in terms of the ROUGE-1, ROUGE-2 and ROUGE-SU4 metrics, respectively.

Table 1. Parameters of the DE

Dataset  | Population size, N | Number of iterations, t_max | Crossover rate, CR | Mutation rate, MR
DUC2001  | 155                | 1000                        | 0.6                | 0.2
DUC2002  | 285                | 1000                        | 0.6                | 0.2
Table 2. ROUGE scores for summarization methods on the DUC2001 dataset

Methods | ROUGE-1 | ROUGE-2 | ROUGE-SU4
F       | 0.45836 | 0.19164 | 0.21574
F_1     | 0.45603 | 0.19046 | 0.21427
F_2     | 0.45952 | 0.19338 | 0.21763
CRF     | 0.44598 | 0.18564 | 0.20934
QCS     | 0.45129 | 0.18976 | 0.21302
HITS    | 0.43528 | 0.18317 | 0.20627
SVM     | 0.43132 | 0.18136 | 0.20372
Table 3. ROUGE scores for summarization methods on the DUC2002 dataset

Methods | ROUGE-1 | ROUGE-2 | ROUGE-SU4
F       | 0.45119 | 0.18847 | 0.21184
F_1     | 0.44985 | 0.18896 | 0.21234
F_2     | 0.45412 | 0.18982 | 0.21268
CRF     | 0.44155 | 0.17974 | 0.20129
SVM     | 0.43405 | 0.17084 | 0.19036
QCS     | 0.44865 | 0.18766 | 0.21119
HITS    | 0.42806 | 0.16792 | 0.18924
Table 4. Comparison of our methods with other methods on the DUC2001 dataset

Methods | Metrics   | % of improvement
        |           | F_2  | F_1  | F
CRF     | ROUGE-1   | 3.04 | 2.25 | 2.78
        | ROUGE-2   | 4.17 | 2.60 | 3.23
        | ROUGE-SU4 | 3.96 | 2.36 | 3.06
QCS     | ROUGE-1   | 1.82 | 1.05 | 1.57
        | ROUGE-2   | 1.91 | 0.37 | 0.99
        | ROUGE-SU4 | 2.16 | 0.59 | 1.28
HITS    | ROUGE-1   | 5.57 | 4.77 | 5.30
        | ROUGE-2   | 5.57 | 3.98 | 4.62
        | ROUGE-SU4 | 5.51 | 3.88 | 4.59
SVM     | ROUGE-1   | 6.54 | 5.73 | 6.27
        | ROUGE-2   | 6.63 | 5.02 | 5.67
        | ROUGE-SU4 | 6.83 | 5.18 | 5.90
Table 5. Comparison of our methods with other methods on the DUC2002 dataset

Methods | Metrics   | % of improvement
        |           | F_2   | F_1   | F
CRF     | ROUGE-1   | 2.85  | 2.18  | 1.88
        | ROUGE-2   | 5.61  | 4.86  | 5.13
        | ROUGE-SU4 | 5.66  | 5.24  | 5.49
QCS     | ROUGE-1   | 1.22  | 0.57  | 0.27
        | ROUGE-2   | 1.15  | 0.43  | 0.69
        | ROUGE-SU4 | 0.71  | 0.31  | 0.54
HITS    | ROUGE-1   | 6.09  | 5.40  | 5.09
        | ROUGE-2   | 13.04 | 12.53 | 12.24
        | ROUGE-SU4 | 12.39 | 11.94 | 12.21
SVM     | ROUGE-1   | 4.62  | 3.95  | 3.64
        | ROUGE-2   | 11.11 | 10.32 | 10.61
        | ROUGE-SU4 | 11.73 | 11.28 | 11.55
6. Conclusions
We have presented an unsupervised approach to automatic document summarization. Our approach consists of two steps. First, sentences are clustered, and then representative sentences are defined and extracted from each cluster. In our study we developed a modified discrete differential evolution algorithm to optimize the objective functions. When comparing our methods to several existing summarization methods on the open DUC2001 and DUC2002 datasets, we found that our methods can improve the summarization results significantly. The methods were evaluated using the ROUGE-1, ROUGE-2 and ROUGE-SU4 metrics.

REFERENCES

[1] M. A. Fattah and F. Ren, “GA, MR, FFNN, PNN and GMM based models for automatic text summarization,” Computer Speech and Language, Vol. 23, No. 1, pp. 126–144, 2009.
[2] U. Hahn and I. Mani, “The challenges of automatic summarization,” IEEE Computer, Vol. 33, No. 11, pp. 29–36, 2000.

[3] I. Mani and M. T. Maybury, “Advances in automated text summarization,” MIT Press, Cambridge, 442p, 1999.

[4] J-Y. Yeh, H-R. Ke, W-P. Yang, and I-H. Meng, “Text summarization using a trainable summarizer and latent semantic analysis,” Information Processing and Management, Vol. 41, No. 1, pp. 75–95, 2005.

[5] S. Ye, T.-S. Chua, M.-Y. Kan, and L. Qiu, “Document concept lattice for text understanding and summarization,” Information Processing and Management, Vol. 43, No. 6, pp. 1643–1662, 2007.

[6] R. M. Alguliev and R. M. Aliguliyev, “Effective summarization method of text documents,” Proceedings of the 2005 IEEE/WIC/ACM International Conference on Web Intelligence (WI’05), France, pp. 264–271, 19–22 September 2005.

[7] R. M. Alguliev and R. M. Alyguliev, “Automatic text documents summarization through sentences clustering,” Journal of Automation and Information Sciences, Vol. 40, No. 9, pp. 53–63, 2008.

[8] R. M. Alguliev, R. M. Aliguliyev, and A. M. Bagirov, “Global optimization in the summarization of text documents,” Automatic Control and Computer Sciences, Vol. 39, No. 6, pp. 42–47, 2005.

[9] R. M. Aliguliyev, “A novel partitioning-based clustering method and generic document summarization,” Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT’06 Workshops) (WI-IATW’06), Hong Kong, China, pp. 626–629, 18–22 December 2006.

[10] R. M. Aliguliyev, “A new sentence similarity measure and sentence based extractive technique for automatic text summarization,” Expert Systems with Applications, Vol. 36, No. 4, pp. 7764–7772, 2009.

[11] R. M. Aliguliyev, “Clustering techniques and discrete particle swarm optimization algorithm for multi-document summarization,” Computational Intelligence (accepted).

[12] Y. Gong and X. Liu, “Generic text summarization using relevance measure and latent semantic analysis,” Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, New Orleans, USA, pp. 19–25, 9–12 September 2001.

[13] D. R. Radev, H. Jing, M. Stys, and D. Tam, “Centroid-based summarization of multiple documents,” Information Processing and Management, Vol. 40, No. 6, pp. 919–938, 2004.

[14] G. Salton, A. Singhal, M. Mitra, and C. Buckley, “Automatic text structuring and summarization,” Information Processing and Management, Vol. 33, No. 2, pp. 193–207, 1997.

[15] K. M. Svore, L. Vanderwende, and C. J. C. Burges, “Enhancing single-document summarization by combining RankNet and third-party sources,” Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL’07), Prague, Czech Republic, pp. 448–457, 28–30 June 2007.

[16] D. M. Dunlavy, D. P. O’Leary, J. M. Conroy, and J. D. Schlesinger, “QCS: A system for querying, clustering and summarizing documents,” Information Processing and Management, Vol. 43, No. 6, pp. 1588–1605, 2007.

[17] K. S. Jones, “Automatic summarizing: the state of the art,” Information Processing and Management, Vol. 43, No. 6, pp. 1449–1481, 2007.

[18] D. Zajic, B. J. Dorr, J. Lin, and R. Schwartz, “Multi-candidate reduction: sentence compression as a tool for document summarization tasks,” Information Processing and Management, Vol. 43, No. 6, pp. 1549–1570, 2007.

[19] G. Erkan and D. R. Radev, “Lexrank: Graph-based lexical centrality as salience in text summarization,” Journal of Artificial Intelligence Research, Vol. 22, pp. 457–479, 2004.

[20] D. Radev, E. Hovy, and K. McKeown, “Introduction to the special issue on summarization,” Computational Linguistics, Vol. 28, No. 4, pp. 399–408, 2002.

[21] D. Shen, J.-T. Sun, H. Li, Q. Yang, and Z. Chen, “Document summarization using conditional random fields,” Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI’07), Hyderabad, India, pp. 2862–2867, January 6–12, 2007.

[22] J. D. Lafferty, A. McCallum, and F. C. N. Pereira, “Conditional random fields: probabilistic models for segmenting and labeling sequence data,” Proceedings of the 18th International Conference on Machine Learning, pp. 282–289, 28 June–01 July 2001.

[23] X. Wan, “A novel document similarity measure based on earth mover’s distance,” Information Sciences, Vol. 177, No. 18, pp. 3718–3730, 2007.

[24] S. Fisher and B. Roark, “Query-focused summarization by supervised sentence ranking and skewed word distributions,” Proceedings of the Document Understanding Workshop (DUC’06), New York, USA, 8p, 8–9 June 2006.

[25] P. Fung and G. Ngai, “One story, one flow: Hidden Markov story models for multilingual multidocument summarization,” ACM Transactions on Speech and Language Processing, Vol. 3, No. 2, pp. 1–16, 2006.

[26] D. M. McDonald and H. Chen, “Summary in context: Searching versus browsing,” ACM Transactions on Information Systems, Vol. 24, No. 1, pp. 111–141, 2006.

[27] R. Mihalcea and P. Tarau, “A language independent algorithm for single and multiple document summarizations,” Proceedings of the Second International Joint Conference on Natural Language Processing (IJCNLP’05), Korea, pp. 602–607, 11–13 October 2005.

[28] J. Li, L. Sun, C. Kit, and J. Webster, “A query-focused multi-document summarizer based on lexical chains,” Proceedings of the Document Understanding Conference (DUC’07), New York, USA, 4p, 26–27 April 2007.

[29] X. Wan, “Using only cross-document relationships for both generic and topic-focused multi-document summarizations,” Information Retrieval, Vol. 11, No. 1, pp. 25–49, 2008.

[30] R. Mihalcea and H. Ceylan, “Explorations in automatic book summarization,” Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL’07), Prague, Czech Republic, pp. 380–389, 28–30 June 2007.

[31] Y. Guo and G. Stylios, “An intelligent summarization system based on cognitive psychology,” Information Sciences, Vol. 174, No. 1–2, pp. 1–36, 2005.

[32] J. Grabmeier and A. Rudolph, “Techniques of cluster algorithms in data mining,” Data Mining and Knowledge Discovery, Vol. 6, No. 4, pp. 303–360, 2002.

[33] J. Han and M. Kamber, “Data mining: concepts and techniques,” 2nd Edition, Morgan Kaufman, San Francisco, 800p, 2006.

[34] A. K. Jain, M. N. Murty, and P. J. Flynn, “Data clustering: A review,” ACM Computing Surveys, Vol. 31, No. 3, pp. 264–323, 1999.

[35] M. G. H. Omran, A. P. Engelbrecht, and A. Salman, “An overview of clustering methods,” Intelligent Data Analysis, Vol. 11, No. 6, pp. 583–605, 2007.

[36] K. M. Hammouda and M. S. Kamel, “Efficient phrase-based document indexing for web document clustering,” IEEE Transactions on Knowledge and Data Engineering, Vol. 16, No. 10, pp. 1279–1296, 2004.

[37] T. Li, “A unified view on clustering binary data,” Machine Learning, Vol. 62, No. 3, pp. 199–215, 2006.

[38] Y. Li, C. Luo, and S. M. Chung, “Text clustering with feature selection by using statistical data,” IEEE Transactions on Knowledge and Data Engineering, Vol. 20, No. 20, pp. 641–652, 2008.

[39] Y. Li, D. McLean, Z. A. Bandar, J. D. O’Shea, and K. Crockett, “Sentence similarity based on semantic nets and corpus statistics,” IEEE Transactions on Knowledge and Data Engineering, Vol. 18, No. 8, pp. 1138–1150, 2006.

[40] X. Liu, Y. Zhou, and R. Zheng, “Sentence similarity based on dynamic time warping,” Proceedings of the First International Conference on Semantic Computing (ICSC’07), Irvine, USA, pp. 250–256, 17–19 September 2007.

[41] X. Wan, J. Yang, and J. Xiao, “Manifold-ranking based topic-focused multi-document summarization,” Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI’07), Hyderabad, India, pp. 2903–2908, January 6–12, 2007.

[42] D. Bollegala, Y. Matsuo, and M. Ishizuka, “Measuring semantic similarity between words using web search engines,” Proceedings of the 16th World Wide Web Conference (WWW16), Alberta, Canada, pp. 757–766, May 8–12, 2007.

[43] R. L. Cilibrasi and P. M. B. Vitanyi, “The google similarity distance,” IEEE Transactions on Knowledge and Data Engineering, Vol. 19, No. 3, pp. 370–383, 2007.

[44] Y. Zhao and G. Karypis, “Empirical and theoretical comparisons of selected criterion functions for document clustering,” Machine Learning, Vol. 55, No. 3, pp. 311–331, 2004.

[45] S. Das, A. Abraham, and A. Konar, “Automatic clustering using an improved differential evolution algorithm,” IEEE Transactions on Systems, Man, and Cybernetics – Part A: Systems and Humans, Vol. 38, No. 1, pp. 218–237, 2008.

[46] R. Storn and K. Price, “Differential evolution – a simple and efficient heuristic for global optimization over continuous spaces,” Journal of Global Optimization, Vol. 11, No. 4, pp. 341–359, 1997.

[47] M. Pavan and M. Pelillo, “Dominant sets and pairwise clustering,” IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 29, No. 1, pp. 167–172, 2007.

[48] M. Porter, “An algorithm for suffix stripping,” Program, Vol. 14, No. 3, pp. 130–137, 1980.

[49] C.-Y. Lin, “ROUGE: A package for automatic evaluation of summaries,” Proceedings of the Workshop on Text Summarization Branches Out, Barcelona, Spain, pp. 74–81, 25–26 July 2004.

[50] C.-Y. Lin and E. H. Hovy, “Automatic evaluation of summaries using n-gram co-occurrence statistics,” Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology (HLT-NAACL’03), Edmonton, Canada, Vol. 1, pp. 71–78, 27 May–1 June 2003.

[51] H. Nanba and M. Okumura, “An automatic method for summary evaluation using multiple evaluation results by a manual method,” Proceedings of the COLING/ACL Main Conference Poster Sessions, Sydney, Australia, pp. 603–610, 17–18 July 2006.