Using Wikipedia as an External Knowledge Source for Supporting Contextual Disambiguation

doi:10.4236/jsea.2012.512B034

Journal of Software Engineering and Applications, 2012, 5, 175-180

doi:10.4236/jsea.2012.512B034 Published Online December 2012 (http://www.SciRP.org/journal/jsea)

Shahida Jabeen, Xiaoying Gao, Peter Andreae

School of Engineering and Computer Science, Victoria University of Wellington, New Zealand.

Email: shahidarao@ecs.vuw.ac.nz, Xiaoying.Gao@ecs.vuw.ac.nz, Peter.Andreae@ecs.vuw.ac.nz

Received 2012

ABSTRACT

Every term has a meaning, but some terms have multiple meanings. Identifying the correct meaning of a term

in a specific context is the goal of Word Sense Disambiguation (WSD) applications. Identifying the correct sense of a

term given a limited context is even harder. This research aims at solving the problem of identifying the correct sense of

a term given only one term as its context. The main focus of this research is on using Wikipedia as the external knowledge

source to decipher the true meaning of each term using a single term as the context. We experimented with the

semantically rich Wikipedia senses and hyperlinks for context disambiguation. We also analyzed the effect of sense

filtering on context extraction and found it quite effective for contextual disambiguation. Results have shown that

disambiguation with filtering works quite well on a manually disambiguated dataset, with an accuracy of 86%.

Keywords: Contextual Disambiguation; Wikipedia Hyperlinks; Semantic Relatedness

1. Introduction

Ambiguity is inherent in natural languages, with a large

number of terms having multiple contexts. For instance,

the English noun “bank” can be a financial institution or

it could mean a riverside. Polysemy is the capability of a

term to have multiple meanings, and a term with multiple

senses is termed a “polysemous term”. Resolving ambiguity

and identifying the most appropriate sense is a

critical issue in natural language interpretation and

processing. Identification of the correct sense for an am-

biguous term requires understanding of the context in

which it occurs in natural language text. Word Sense

Disambiguation is defined as the task of automatically

selecting the most appropriate sense of a term within a

given context from a set of available senses [1-4].

What is the common context of Bank and flood? What

sense of Tree comes to mind when it occurs with computer?

In both cases, the relationship of each pair of

terms is strong but cannot be measured without considering

the correct senses of either or both terms. The first pair of

terms has a cause and effect relation, provided the correct

context of bank, i.e., Levee, is taken into account, whereas

to relate the other term pair, the term Tree should be

mapped to Tree (data structure). Making judgments

about deciphering the relevant context of a term is an

apparently ordinary task but actually requires a vast

amount of background knowledge about the concepts.

The task of contextual information extraction for disambiguation

of natural language text relies on knowledge

from a broad range of real world concepts and implicit

relations [5]. To effectively perform such a task, computers

may require an external knowledge source to infer

implicit relations. External knowledge sources can vary

from domain specific thesauri to lexical dictionaries and

from hand crafted taxonomies to knowledge bases. Any

external knowledge source should have the following

three characteristics to sufficiently support contextual

disambiguation:

● Coverage: The coverage of the external knowledge

source is the degree to which it represents collected

knowledge. It should be broad enough to provide in-

formation about all relevant concepts.

● Quality: Information provided by the external know-

ledge source should be accurate, authentic and up-

to-date.

● Lexical semantic resource: The knowledge base

should encode rich lexical and structural semantics.

Fortunately, Wikipedia, a freely available web based

encyclopedia, offers all the necessary characteristics

to support contextual information extraction and

disambiguation. As a knowledge source, Wikipedia not

only focuses on general vocabulary but also covers a

large number of named entities and domain specific

terms. The specialty of Wikipedia is that each of its ar-

ticles is dedicated to a single topic with an additional

benefit of heavy linking between articles. Wikipedia also

covers specific senses of a term, surface forms, spelling

variations, abbreviations and derivations. Importantly, it

has a semantically rich hyperlink network relating ar-

ticles that cover different types of lexical semantic rela-

tions such as hyponymy and hypernymy, synonymy, an-

tonymy and polysemy. Hence, Wikipedia sufficiently

demonstrates all the necessary aspects of a good know-

ledge source. It provides a knowledge base for extracting

information in a more structured way than a search en-

gine and with a coverage better than other knowledge

sources [6].

The main goal of this research is to use Wikipedia as

the external knowledge source to decipher true meaning

of a term using the other term as the context. In particular,

we have taken into account Wikipedia hyperlinks struc-

ture and senses for disambiguating the context of a term

pair. The rest of the paper is organized as follows. Sec-

tion 2 discusses two broad categories of disambiguation

work proposed in literature and the corresponding re-

search done in each category. Section 3 discusses our

proposed approach for context extraction and disambiguation.

Section 4 presents the performance analysis of

our proposed methodology using a manually designed

dataset consisting of English term pairs. Finally, Section

5 concludes our research and discusses some future re-

search directions.

2. Related Work

Word Sense Disambiguation (WSD) is a well explored

area with a lot of research work still going on in context

identification and disambiguation. Research in this area

can be broadly categorized into two main streams [7]:

Lexical approaches and taxonomic approaches.

Lexical approaches are based either on the analysis of

a disambiguated corpus or on extracting strings from the

definition of each sense. Such approaches rely on

identifying lexical features such as occurrence statistics

or co-occurrence computations. By contrast, taxonomic

approaches identify the correct nodes in the hierarchy

of senses and explore the relations between nodes

to compute the semantic closeness.

Early efforts in Lexical approaches were based on

machine readable dictionaries and thesauri, associating

word senses with short definitions, examples or lexical

relations. A simple approach of this type is based on

comparing dictionary definitions of words, also called

glosses, to the context words (the words appearing in the

surrounding text) of an ambiguous word. Clearly, the

higher the overlap between context words and the dictionary

definitions of a particular sense of the ambiguous

word, the higher the chances of getting the correct sense

for the word. Cowie et al. [8] and Lesk [9] based their

approaches on machine readable dictionaries and thesauri.

Banerjee and Pedersen [10] adapted the Lesk algorithm for word

sense disambiguation by using the lexical database WordNet

instead of a standard dictionary as the source of glosses.

They exploited the hierarchy of WordNet semantic relations

for the disambiguation task. Patwardhan et al. [1]

generalized the adapted Lesk algorithm to a method of term

sense disambiguation. They used the gloss overlap as an

effective measure of semantic relatedness. Pedersen et al. [11]

further explored the use of similarity measures based on

path finding in concept networks, information content

derived from large corpora and term sense glosses. They

concluded that the gloss based measures were quite effective

for term sense disambiguation. Yarowsky [12]

used statistical models of Roget’s thesaurus categories

to build context discriminators for the word senses that

are members of conceptual classes. A conceptual class

such as “machine” or “animal” tends to appear in recognizably

different contexts. He also used the context

indicators of Roget’s thesaurus as the context indicators

for the members of conceptual categories.
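
The gloss overlap idea behind the Lesk family of methods can be illustrated with a minimal sketch. It is not the algorithm of any of the cited works: the function names and the two toy glosses are assumptions made for this example, and the overlap is a plain count of shared lowercase tokens.

```python
from typing import Dict, Set


def gloss_overlap(gloss: str, context_words: Set[str]) -> int:
    """Count the distinct context words that also occur in a sense gloss."""
    return len(set(gloss.lower().split()) & context_words)


def simplified_lesk(glosses: Dict[str, str], context: str) -> str:
    """Return the sense label whose gloss overlaps most with the context."""
    context_words = set(context.lower().split())
    return max(glosses, key=lambda sense: gloss_overlap(glosses[sense], context_words))


# Toy glosses written for this example; they are not dictionary entries.
BANK_GLOSSES = {
    "bank (finance)": "an institution that accepts deposits and lends money",
    "bank (river)": "the sloping land beside a river or stream",
}

best_sense = simplified_lesk(BANK_GLOSSES, "the river overflowed during the flood")
```

A practical implementation would at least remove stop words and normalize word forms before counting, since function words such as "the" inflate the overlap.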

Following taxonomic approaches, Agirre and Rigau [13] proposed

a method of lexical disambiguation over the Brown

Corpus using the noun taxonomy of WordNet. They selected

the combination of senses from a set of contiguous

nouns that maximized the conceptual density

among taxonomic concepts. Veronis and Ide [14]

automatically built a very large neural network

from definition texts in machine readable dictionaries and

used this network for word sense disambiguation. They

used a node activation scheme for moving closer to the

most related senses following the “winner-take-all”

strategy, in which every active node sent an activation to

another increasingly related node in the network. Mihalcea

et al. [15] used the PageRank algorithm on semantic

networks for sense disambiguation. Iterative

graph-based ranking algorithms are essentially a way of

deciding the importance of a node within a graph. When

one node in a graph is connected to another one, it is

casting a vote for that other node. The higher the number

of votes for a node, the higher the importance of that

node. They find the synset with the highest PageRank score

for each ambiguous word in the text, assuming that it will

uniquely identify the sense of the word.
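
The voting intuition behind iterative graph-based ranking can be sketched with a small power-iteration PageRank. This is a generic illustration on a toy graph, not the semantic network construction of [15]; the function name, the damping factor of 0.85 and the fixed iteration count are assumptions of the sketch.

```python
from typing import Dict, List


def pagerank(graph: Dict[str, List[str]], damping: float = 0.85,
             iterations: int = 50) -> Dict[str, float]:
    """Power-iteration PageRank over a directed graph given as adjacency lists.

    Every link target must also appear as a key of `graph`.
    """
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Each node starts with the baseline share, then collects votes.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node, targets in graph.items():
            # A node without outgoing links spreads its vote over all nodes.
            receivers = targets if targets else nodes
            share = damping * rank[node] / len(receivers)
            for target in receivers:
                new_rank[target] += share
        rank = new_rank
    return rank


# Toy graph: "a" and "b" both vote for "c", and "c" votes for "a".
toy_ranks = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
top_node = max(toy_ranks, key=toy_ranks.get)
```

In this toy graph the node that receives two votes ends up with the highest score, and the node that receives none keeps only the baseline share.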

Wikipedia and Sense Disambiguation

The availability of free online thesauri, dictionaries,

encyclopedias and other knowledge sources has supported

the improvement in both lexical and taxonomic

based word sense disambiguation. Different methodolo-

gies are found in literature for computing term sense

disambiguation based on Wikipedia as the external

knowledge source. Ponzetto and Navigli [4] addressed

the problem of knowledge acquisition in term sense

disambiguation. They proposed a methodology for extending

WordNet with a large number of semantic relations

derived from Wikipedia. They associated Wikipedia

pages with the WordNet senses to produce a richer lexical

resource. Mihalcea [2] used Wikipedia as a source of

sense annotation for generating sense-tagged data for

building accurate and robust sense classifiers. Turdakov

and Velikhov [16] proposed a semantic relatedness

measure based on Wikipedia links and used it to disam-

biguate terms. They proposed four link-based heuristics

for reducing the search space of potentially related topics.

Fogarolli [17] used Wikipedia as a reference to obtain

lexicographic relations and combined them with the sta-

tistical information to deduce concepts related to terms

extracted from a corpus. Cucerzan [18] presented an ap-

proach for the recognition and semantic disambiguation

of named entities based on agreement between informa-

tion extracted from Wikipedia and the context of Web

search results. Bunescu and Pasca [19] addressed the same problem

of detecting and disambiguating named entities in open

domain text using Wikipedia as the external knowledge

source.

The kind of problem that we address in this research is

a variant of the mainstream of term sense disambiguation

research, where the aim is to identify the context of a

single term. We instead seek to find the context of two terms

with respect to each other. This task is critical in many

approaches involving relatedness computation [6, 20, 21].

3. Disambiguation Methods

There are two approaches that we have adopted for con-

textual extraction and disambiguation along with their

variants based on two factors: relatedness measure and

sense filtering.

3.1. WikiSim Based Disambiguation

Our main approach for contextual disambiguation using

Wikipedia consists of three phases: Context Extraction,

in which we extract the candidate contexts of both input

terms; Context Filtering, where we filter out certain contexts

based on their type, thus avoiding unnecessary context;

and Contextual Disambiguation, where semantic similarity

between candidate contexts is computed using a Wikipedia

hyperlink based relatedness measure.

1) Context Extraction: We start with identification of

all possible contexts corresponding to each input term

based on Wikipedia senses. Each Wikipedia article is

associated with a number of senses. For instance, there

are various senses for the terms Present and Tense. The best

sense of Present would be Present tense and the best

sense of Tense would be Grammatical tense in the correct

context of each other. The aim of this phase is to extract

all possible contexts corresponding to the input term pair.

For this purpose, we used Wikipedia disambiguation

pages to extract all listed senses as candidate senses and

populate them in the context set of each input term.

2) Context Filtering: There are three broad categories

of manually annotated Wikipedia senses:

• Senses with parenthesis

• Single term senses

• Phrasal senses

The first type of senses are those Wikipedia senses

which are generic and include a broader context within

parentheses following the title. For example, Crane (bird)

and Crane (machine) are two different contexts of the

term Crane. Both of these senses are quite distinctive and

cover the broader contexts of machine and bird.

The second category comprises single term senses.

These contexts cover certain very important lexical

relations such as synonymy, hypernymy, hyponymy

and derivation. For instance, Gift is a synonym of the

term present; similarly, carnivore is a hypernym of the

term tiger. We found such senses to be very useful for

contextual information extraction.

The third type of Wikipedia senses are phrasal senses,

which usually have very specific and limited context.

This context can have characteristics of the more general

sense but it would be focused more on some other spe-

cific features which cannot be considered true for the

generalized sense. For instance, corresponding to a term

forest, the phrasal sense might be Forest Township, Missaukee

County, Michigan, which discusses a geographically

specific context rather than the more general context

of forest. Such senses might not be very useful

in extracting the contextual information. For this reason

we excluded this third type of senses from the context

set of each input term. Experiments showed that bigram

senses still contain the general context of a term. So,

we gathered all unigram and bigram senses, senses with

parenthesis and senses shared by both input terms and

put them in the context set corresponding to each input

term.
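
The filtering rule described above can be sketched as a simple predicate over sense titles. This is a minimal illustration under stated assumptions, not our actual implementation: the function names are hypothetical, a sense with parenthesis is detected by a regular expression on the title, and unigram and bigram senses are detected by counting whitespace separated tokens.

```python
import re
from typing import Iterable, List, Set

# A title such as "Crane (machine)": a label followed by a parenthesised context.
PARENTHESISED = re.compile(r"^.+\s\(.+\)$")


def keep_sense(title: str, other_term_senses: Set[str]) -> bool:
    """Apply the filtering rule to one candidate sense title."""
    if title in other_term_senses:      # sense shared by both input terms
        return True
    if PARENTHESISED.match(title):      # sense with parenthesis
        return True
    return len(title.split()) <= 2      # unigram or bigram sense


def filter_context_set(senses: Iterable[str], other_term_senses: Set[str]) -> List[str]:
    """Drop phrasal senses, keeping the order of the remaining candidates."""
    return [s for s in senses if keep_sense(s, other_term_senses)]


# Sense titles mentioned in the text, mixed here only to exercise each rule.
candidates = [
    "Crane (bird)",
    "Crane (machine)",
    "Gift",
    "Present tense",
    "Forest Township, Missaukee County, Michigan",
]
kept = filter_context_set(candidates, other_term_senses=set())
```

With an empty set of shared senses, only the long phrasal title is removed; a phrasal title would survive if it also appeared among the senses of the other input term.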

3) Contextual Disambiguation: In contextual disambiguation,

the first step is to extract all unique inlinks (all

articles referring to the input term article) and outlinks

(all articles referred to by the input term article) corresponding

to each candidate sense of the context set. The

link vector of each candidate sense of the first input term is

compared with that of each sense of the second input

term. The aim of this comparison is to find

those senses which share the maximum number of links,

thus indicating a strong relatedness. Each sense pair is

assigned a weight based on a relatedness measure called

WikiSim [20], as shown below:

$$\mathrm{WikiSim}(s_i, s_j) = \begin{cases} \dfrac{|S|}{T} & \text{if } S \neq \emptyset \\ 0 & \text{otherwise} \end{cases}$$

In the above formula, $s_i \in S_{w_a}$, where $S_{w_a}$ is the set

of all senses of term $w_a$, whereas sense $s_j$ corresponds to

input term $w_b$ and $s_j \in S_{w_b}$. $S$ is the set of all the links

shared between a sense pair and $T$ represents the total number

of links of both senses. In other words, the weight of a

sense pair is the link probability of shared links, or 0 if

shared links do not exist. Once we get scores for all sense

pairs, the next step is to find the sense pair with the

highest score, thus getting the most closely related senses of

both input terms. For this purpose, all sense pairs are

sorted based on their WikiSim score and the sense pair

having the highest weight is taken as the disambiguated

context corresponding to the input term pair.
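
The scoring and selection steps described above can be sketched as follows. This is a minimal illustration rather than our implementation: the function names and the toy link sets are invented for the example, and the total number of links of both senses is read as the sum of the sizes of the two link sets, which is an assumption.

```python
from itertools import product
from typing import Dict, Set, Tuple


def link_overlap_score(links_a: Set[str], links_b: Set[str]) -> float:
    """Shared links divided by the total link count of both senses, or 0.0."""
    shared = links_a & links_b
    if not shared:
        return 0.0
    # Assumption: the total is the sum of the two link-set sizes.
    return len(shared) / (len(links_a) + len(links_b))


def best_sense_pair(senses_a: Dict[str, Set[str]],
                    senses_b: Dict[str, Set[str]]) -> Tuple[str, str, float]:
    """Score every sense pair and return the highest-scoring one."""
    scored = [
        (sense_a, sense_b, link_overlap_score(links_a, links_b))
        for (sense_a, links_a), (sense_b, links_b)
        in product(senses_a.items(), senses_b.items())
    ]
    return max(scored, key=lambda item: item[2])


# Invented link sets, for illustration only.
tree_senses = {
    "Tree": {"Plant", "Forest", "Leaf"},
    "Tree (data structure)": {"Node", "Graph", "Computer science"},
}
computer_senses = {
    "Computer": {"Computer science", "Graph", "Hardware"},
    "Computer (occupation)": {"Mathematics", "Calculation"},
}
best = best_sense_pair(tree_senses, computer_senses)
```

Taking the maximum directly is equivalent to sorting all sense pairs by score and reading off the first one; ties are resolved here in favor of the pair encountered first.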

3.2. WLM based Disambiguation

In order to evaluate the performance of our main approach

and to analyze the effect of using a different relatedness

measure on the disambiguation task, we used the WLM relatedness

measure [21], which is also based on Wikipedia

hyperlinks and is proven to be quite effective in compu-

ting term relatedness. We followed the same methodolo-

gy as the WikiSim based Disambiguation for extracting

candidate senses and populating the context vector but

replaced the WikiSim relatedness measure with WLM

measure while computing the sense pair scores.
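
For reference, the inlink based measure of Milne and Witten [21] can be sketched as below. The formula is not reproduced in this paper, so the sketch follows the published definition, which is modeled on the Normalized Google Distance; the function name, the conversion of the distance into a relatedness score as one minus the distance, and the clipping at zero are assumptions of this example.

```python
import math
from typing import Set


def wlm_relatedness(inlinks_a: Set[str], inlinks_b: Set[str],
                    total_articles: int) -> float:
    """Link-based relatedness of two articles from their sets of inlinks."""
    shared = inlinks_a & inlinks_b
    if not shared:
        return 0.0
    larger = max(len(inlinks_a), len(inlinks_b))
    smaller = min(len(inlinks_a), len(inlinks_b))
    if total_articles <= smaller:
        raise ValueError("total_articles must exceed the size of each link set")
    distance = (math.log(larger) - math.log(len(shared))) / (
        math.log(total_articles) - math.log(smaller)
    )
    # Assumption: turn the distance into a relatedness score in [0, 1].
    return max(0.0, 1.0 - distance)


# Toy inlink sets over a hypothetical collection of 100 articles.
score = wlm_relatedness({"p1", "p2", "p3", "p4"}, {"p3", "p4", "p5", "p6"}, 100)
```

Replacing the WikiSim weight with this function while keeping the rest of the pipeline unchanged corresponds to the variant described in this subsection.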

4. Evaluation

4.1. Experimental Setup

We used the version of Wikipedia released in July

2011. At that point, it contained 31 GB of uncompressed

XML markup, corresponding to more than five million

articles, which sufficiently covered all the concepts for

which manual judgments were available. To explore and

easily draw upon the contents of Wikipedia, we used the

latest version (wikipedia-miner-1.2.0) of the Wikiminer

toolkit [22], which is open source Java code. Since

the problem addressed in this research is a variant of the

standard disambiguation task, where rather than resolving

the context of a single term we do that for a pair of

terms, considering each word as the context for the other

word, we needed a different dataset of disambiguated

term pairs. So, we used a manually designed dataset

named WikiContext, as shown in Table 3. It consists of

30 English term pairs which are manually disambiguated

to corresponding Wikipedia articles in the context of the

other input term.

Table 1. Accuracy Based Performance Comparison Of Disambiguation Methods.

Table 2. Statistics Of Types Of Disambiguation Performed By Each Method.

4.2. Experimental Results and Discussion

To compare the performance of our proposed methods, we

automatically disambiguated the term pairs in the dataset and

compared them with the manually disambiguated Wikipedia

articles. To measure the performance of each method,

we used accuracy, defined as follows:

$$\mathrm{Accuracy} = \frac{|P_c|}{|P_t|}$$

where $P_c$ is the set of correctly disambiguated term

pairs and $P_t$ is the set of all disambiguated term

pairs. In other words, it is the ratio of correct disambiguations

to the size of the dataset.
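
The accuracy computation can be sketched as follows. The data below are toy values, not the WikiContext dataset, and the function name is hypothetical. The sketch counts only exact matches as correct, which is an assumption: how specialized matches enter the reported accuracy is not fixed by the definition above.

```python
from typing import Dict, Tuple

Pair = Tuple[str, str]


def accuracy(predicted: Dict[Pair, Pair], gold: Dict[Pair, Pair]) -> float:
    """Fraction of gold term pairs whose predicted sense pair matches exactly."""
    if not gold:
        raise ValueError("gold standard must not be empty")
    correct = sum(1 for pair, senses in gold.items() if predicted.get(pair) == senses)
    return correct / len(gold)


# Toy gold standard and predictions; the sense labels are illustrative.
gold = {
    ("mole", "chemistry"): ("Mole (unit)", "Chemistry"),
    ("tree", "computer"): ("Tree (data structure)", "Computer"),
    ("bank", "flood"): ("Levee", "Flood"),
}
predicted = {
    ("mole", "chemistry"): ("Mole (unit)", "Analytical chemistry"),
    ("tree", "computer"): ("Tree (data structure)", "Computer"),
    ("bank", "flood"): ("Levee", "Flood"),
}
score = accuracy(predicted, gold)
```

Here two of the three toy pairs match the gold standard exactly, so the score is two thirds.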

Table 3. WikiSim (With Filtering): Word Pairs And Corresponding Automatic Disambiguation.

When we compared the accuracy of WikiSim

based disambiguation and WLM based disambiguation,

WikiSim based disambiguation was found to be comparable

to WLM based disambiguation, both having

an accuracy of 86% as shown in Table 1. To analyze

the effect of applying sense filtering in both disambiguation

approaches, we skipped the context filtering step in

each approach and performed disambiguation with all

possible contexts. A detailed analysis of each approach is

summarized in Table 2. Three types of disambiguation

are taken into account in this research: first, exact matches, those word

pairs which are matched exactly to the manual disambiguation;

second, specialized matches, those word pairs which are matched to

a specialized area or subtopic of the correct context; third,

partial matches, those word pairs in which one of the terms is correctly

disambiguated in the context of the other word but the other

term is disambiguated to a wrong context. Results of

WikiSim based disambiguation revealed that there was

no partial match either with or without context

filtering, whereas some of the specialized

matches were found to be more relevant than the exact

matches. For example, in the case of the term pair <mole,

chemistry>, mole was disambiguated exactly to Mole

(unit), which is the measurement of the amount of a chemical

substance, but chemistry was disambiguated to the even more

related context Analytical Chemistry, which deals with

quantification of chemical components. Similarly, the

term pair <bar, drinking> was disambiguated to <Bar

(Establishment), Alcoholic beverages>. In the case of WLM

based disambiguation with filtering, three term pairs

were found to be partially matched to the correct context.

Table 3 shows the results of WikiSim based disambiguation

(with filtering) on the dataset WikiContext. It

shows the disambiguated Wikipedia articles corresponding

to the input term pairs.

Overall, disambiguation based on filtering is found to

be better than the one without filtering. The accuracy of

the WikiSim based method increases by 3% when filtering is

applied. The effect of filtering is more evident for

WLM, where the accuracy of filtering based disambiguation

increases by 10%. The results of our main approach

are quite encouraging and comparable to the WLM based

disambiguation. In order to avoid any bias in the results

due to smaller size of dataset and to test the effectiveness

of our approach more critically, we plan to use some

bigger dataset in future. To the best of our knowledge,

there is no dataset available that addresses this particular

kind of problem. One limitation of our approach is that it

relies heavily on Wikipedia senses, which are manually

encoded, may not sufficiently cover all possible contexts

of some terms, and suffer from inconsistent formatting

due to manual encoding. We believe that using other

Wikipedia features such as anchor texts, categories,

hyperlinks and redirects for semantic extraction would

help in this regard.

5. Conclusion

In this paper we proposed and evaluated a novel approach

for extracting contextual information from Wikipedia

and using it to disambiguate a term using a single term as

a given context. Based on Wikipedia hyperlink structure

and senses, our approach used WikiSim, a Wikipedia

based relatedness measure to compute scores of sense

pairs and compare them based on their relatedness. For

the sense disambiguation, we extracted various senses of

first input term and disambiguated each sense in various

contexts of the other input term. We evaluated the per-

formance of our approach by comparing it with WLM

based disambiguation approach and analyzed the effect

of context filtering on disambiguation. Results have shown

that with an accuracy of 86%, our approach performs

quite well when compared with manually disambiguated

dataset of term pairs. In future, we plan to apply this

disambiguation approach along with the semantic rela-

tedness in key phrase clustering task for an indirect

evaluation of our approach on a bigger dataset to avoid

any bias in the current results due to smaller dataset size.

REFERENCES

[1] S. Patwardhan, S. Banerjee, and T. Pedersen, “Using

measures of semantic relatedness for word sense disambiguation,”

in Proceedings of the 4th International Conference

on Intelligent Text Processing and Computational

Linguistics, February 2003, pp. 241–257.

[2] R. Mihalcea, “Using wikipedia for automatic word sense

disambiguation,” in North American Chapter of the As-

sociation for Computational Linguistics (NAACL 2007),

2007.

[3] D. McCarthy, “Word sense disambiguation: The case for

combinations of knowledge sources,” Natural Language

Engineering, vol. 10, pp. 196–200, June 2004.

[4] S. P. Ponzetto and R. Navigli, “Knowledge-rich word

sense disambiguation rivaling supervised systems,” in

Proceedings of the 48th Annual Meeting of the Associa-

tion for Computational Linguistics, 2010, pp. 1522–1531.

[5] E. Yeh, D. Ramage, C. D. Manning, E. Agirre, and A.

Soroa, “Wikiwalk: random walks on wikipedia for se-

mantic relatedness,” in 2009 Workshop on Graph-based

Methods for Natural Language Processing, 2009, pp.

41–49.

[6] M. Strube and S. P. Ponzetto, “Wikirelate! computing

semantic relatedness using wikipedia,” in proceedings of

the 21st national conference on Artificial intelligence, vol.

2, 2006, pp. 1419–1424.

[7] J. Curtis, J. Cabral, and D. Baxter, “On the application of

the cyc ontology to word sense disambiguation,” in Proceedings

of the 19th International Florida Artificial Intelligence

Research Society Conference, 2006, pp. 652–657.

[8] J. Cowie, J. Guthrie, and L. Guthrie, “Lexical disambiguation

using simulated annealing,” in Proceedings of the

workshop on Speech and Natural Language, 1992, pp.

238–242.

[9] M. Lesk, “Automatic sense disambiguation using ma-

chine readable dictionaries: how to tell a pine cone from

an ice cream cone,” in Proceedings of the 5th annual international

conference on Systems documentation, 1986,

pp. 24–26.

[10] S. Banerjee and T. Pedersen, “An adapted lesk algorithm

for word sense disambiguation using wordnet,” in Proceedings

of the Third International Conference on Intelligent

Text Processing and Computational Linguistics,

2002, pp. 136–145.

[11] T. Pedersen, S. Banerjee, and S. Patwardhan, “Maximiz-

ing Semantic Relatedness to Perform Word Sense Dis-

ambiguation,” University of Minnesota Supercomputing

Institute, Research Report UMSI 2005/25, March 2005.

[12] D. Yarowsky, “Word-sense disambiguation using statis-

tical models of roget’s categories trained on large corpo-

ra,” in Proceedings of the 14th conference on Computa-

tional linguistics - Volume 2, 1992, pp. 454–460.

[13] E. Agirre and G. Rigau, “Word sense disambiguation

using conceptual density,” in Proceedings of the 16th

conference on Computational linguistics - Volume 1,

1996, pp. 16–22.

[14] J. Veronis and N. M. Ide, “Word sense disambiguation

with very large neural networks extracted from machine

readable dictionaries,” in Proceedings of the 13th conference

on Computational linguistics - Volume 2, 1990, pp.

389–394.

[15] R. Mihalcea, P. Tarau, and E. Figa, “Pagerank on semantic

networks, with application to word sense disambiguation,”

in Proceedings of the 20th international conference

on Computational Linguistics, 2004.

[16] D. Turdakov and P. Velikhov, “Semantic relatedness

metric for wikipedia concepts based on link analysis and

its application to word sense disambiguation,” SYRCoDIS,

vol. 355, pp. 1–6, 2008.

[17] A. Fogarolli, “Word sense disambiguation based on wi-

kipedia link structure,” in Proceedings of the 2009 IEEE

International Conference on Semantic Computing, 2009,

pp. 77–82.

[18] S. Cucerzan, “Large-scale named entity disambiguation

based on wikipedia data,” in Proceedings of EMNLP-

CoNLL 2007, 2007, pp. 708–716.

[19] R. Bunescu and M. Pasca, “Using encyclopedic knowledge

for named entity disambiguation,” in Proceedings of

the 11th Conference of the European Chapter of the Asso-

ciation for Computational Linguistics (EACL-06), 2006,

pp. 9–16.

[20] S. Jabeen, X. Gao, and P. Andreae, “Improving contextual

relatedness computation by leveraging wikipedia semantics,”

in 12th Pacific Rim International Conference on

Artificial Intelligence (To appear), 2012.

[21] D. Milne and I. H. Witten, “An effective, low-cost measure

of semantic relatedness obtained from wikipedia

links,” in Proceedings of AAAI Workshop on Wikipedia

and Artificial Intelligence: an Evolving Synergy, 2008, pp.

25–30.

[22] D. Milne, “An open-source toolkit for mining Wikipe-

dia,” in Proceeding of New Zealand Computer Science

Research Student Conference, vol. 9, 2009.