Named Entity Recognition for Nepali Text Using Support Vector Machines

Named Entity Recognition (NER) aims to identify rigid designators in text, such as proper names, biological species, and temporal expressions, and to classify them into predefined categories. Interest in this field of research has been growing since the early 1990s. NER plays a vital role in several areas of natural language processing, such as machine translation, information extraction, and question answering. This paper presents NER for Nepali text based on the Support Vector Machine (SVM), a machine learning approach to classification. A set of features is extracted from the training data set. The accuracy and efficiency of the SVM classifier are analyzed for three different sizes of training data set, and the recognition system is tested with ten data sets of Nepali text. The strength of this work is its efficient feature extraction and comprehensive recognition techniques. The SVM-based recognizer is limited to a fixed set of features and uses a small dictionary, which affects its performance. The learning performance of the recognition system is also observed: the system learns well from a small set of training data, and its performance improves as the training size increases.


Introduction
The term Named Entity (NE) was coined during the sixth Message Understanding Conference (MUC-6, 1995) [1]. An NE is structured information referring to predefined proper names such as persons, locations, and organizations. The NE task is to identify all named locations, named persons, named organizations, dates, times, monetary amounts, percentages, etc. in text.
Named Entity Recognition (NER) aims to classify each word of a document into predefined target named entity classes. It is now considered a fundamental activity for many Natural Language Processing (NLP) tasks such as information retrieval, machine translation, information extraction, and question answering [1] [2]. Although the Support Vector Machine (SVM) [3] technique has been widely applied to NER in several well-studied languages, its application to the Nepali language is very new. The basic principle of the proposed Named Entity Recognition is illustrated in Figure 1.

NE recognition in the Nepali language is difficult and challenging for the following reasons.

No Capitalization
English and many other European languages use capitalization to mark proper names. The orthography of Nepali has no capitalization.

Agglutinative Nature
Agglutinative means that additional elements can be added to a word to express a more complex meaning. Agglutinative languages build their forms by adding suffixes to the root form of a word. Nepali is a highly inflectional language with one of the richest and most challenging sets of linguistic and statistical features, which results in long and complex word forms. For example, combining the root word राजा (king) with the suffix ईश्वर (God) yields the new word राजेश्वर (a person name).

Proper Name Ambiguity
Proper names are ambiguous in Nepali, as they are in English. A name like White is ambiguous in English: it may be a name or a color. Nepali person names are more diverse than those of other languages, and many of them can be found in the dictionary with other specific meanings. There is a surprising amount of ambiguity even among proper names.

Word Order
Languages like Nepali have a different word order than English, and some have a free word order. Nepali mostly has a default word order, but how strictly it is respected depends on the domain. For example, कमलले पानीको पूरा गिलास पियो and पानीको गिलास कमलले पुरा पियो both translate to "Kamal drank a whole glass of water".

Loan Words in Nepali
Nepali has a number of loan words, that is, words that are not indigenous to Nepali. A named entity recognizer based on simple morphological cues will fail to recognize a large number of such proper nouns. For example, in Osama Bin Laden, बिन (Bin), an Arabic cue, needs to be used in the middle of the name for the per-

Nested Entities
Named entities classified as nested contain two proper names that are nested together to form a new named entity. An example in Nepali is Kathmandu University, where Kathmandu is a location name and University marks the whole entity as an organization.

Resource Challenges
NER approaches are based on either rule engines or inference engines. Each approach requires some type of corpus, and the lack of an NE-tagged corpus for deriving rules is an issue for the Nepali language. Nepali is a resource-poor language: annotated corpora, name dictionaries, good morphological analyzers, POS taggers, etc. are not yet available in the required measure. Although the Nepali language has a very old and rich literary history, its technological development is of recent origin. Web sources for name lists are available in English, but such lists are not available in Nepali, which forces the use of transliteration to create them.

Support Vector Machine
In their basic form, shown in Figure 2, SVMs construct a hyperplane in the input space that correctly separates the example data into two classes. Hence the SVM is a binary classifier. This hyperplane can be used to predict the class of unseen data. Such a hyperplane always exists for linearly separable data [4].
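As a minimal illustration of the prediction step, the sketch below classifies a point by the side of a hyperplane w·x + b = 0 on which it falls. The weight vector, bias, and sample points are hypothetical values chosen for illustration; they are not taken from the model trained in this work, and the function names are our own.

```python
from typing import Sequence


def decision_value(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Signed score of x relative to the hyperplane w.x + b = 0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def predict(w: Sequence[float], b: float, x: Sequence[float]) -> int:
    """Return +1 or -1 depending on the side of the hyperplane."""
    return 1 if decision_value(w, b, x) >= 0 else -1


# Hypothetical hyperplane x1 + x2 - 3 = 0 in a two-dimensional input space.
W = [1.0, 1.0]
B = -3.0

print(predict(W, B, [2.5, 2.0]))  # point on the positive side
print(predict(W, B, [0.5, 1.0]))  # point on the negative side
```

Training an SVM amounts to choosing the weights and bias so that the separating margin is as wide as possible; the sketch covers only the use of an already known hyperplane.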

Related Work
A considerable amount of work has already been done in the field of NER for English and for other languages such as German, Spanish, Chinese, and Bengali. However, no work has yet been done for the Nepali language. Different approaches, such as the rule-based approach, the stochastic approach, and the transformation-based learning approach, along with modifications, have been tried and implemented for English and European languages. For South Asian languages such as Bangla, Hindi, and Nepali, however, not much work has been done in the area of NER.
The authors of [3] [5] describe Conditional Random Fields (CRFs) as undirected graphical models used to calculate the conditional probability of values on designated output nodes given values assigned to other designated input nodes. A CRF is a type of discriminative probabilistic model used for labeling sequential data such as natural language text. The author of [6] showed that the maximum entropy (ME) framework [3] estimates probabilities based on the principle of making as few assumptions as possible, other than the constraints imposed. Such constraints are derived from training data and express some relationship between features and outcome. The author of [7] showed that name recognition may be viewed as a classification problem, where every word is either part of some name or not part of any name. In recent years, hidden Markov models (HMMs) have enjoyed great success in other textual classification problems, most notably part-of-speech tagging [8]. The decision tree approach [9] uses part of speech, character type, and special dictionary information to determine the probability that a particular type of name opens or closes at a given position in the text. A Support Vector Machine (SVM) based NER system was proposed in [10] for Japanese. That system is an extension of Kudo's chunking system [11], which gave the best performance in the CoNLL-2000 shared task. Other SVM-based NER systems can be found in [2] [12].

Feature Extraction
The features used in this work are taken from [13]. The following set of features is applied to solve the NER task for Nepali text:
• First word: This feature checks whether the current token is the first word of the sentence. Although Nepali is a relatively free word order language, the first word of a sentence is most likely an NE, as it appears in the subject position most of the time.
• Word length: This binary valued feature checks whether the length of the current word is less than two. It is based on the observation that very short words are rarely NEs.
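The two features above can be sketched as follows. This is an illustration under stated assumptions rather than the implementation used in this work: the feature names FirstWord and ShortWord and the sample sentence are hypothetical, and word length is measured in Unicode code points, which for Devanagari counts vowel signs separately from consonants.

```python
from typing import Dict, List


def basic_features(tokens: List[str], i: int) -> Dict[str, int]:
    """Binary features for the token at position i of one tokenized sentence."""
    word = tokens[i]
    return {
        # 1 when the token opens the sentence, 0 otherwise.
        "FirstWord": 1 if i == 0 else 0,
        # The text describes the check as "length less than two".
        # Assumption: length is counted in Unicode code points.
        "ShortWord": 1 if len(word) < 2 else 0,
    }


# Hypothetical tokenized sentence ("Ram and Sita").
sentence = ["राम", "र", "सीता"]

for position, token in enumerate(sentence):
    print(token, basic_features(sentence, position))
```

A different length measure, such as grapheme clusters, would change which words count as short, so the threshold and the unit should be chosen together.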

One vs Rest Classification
The SVM described in Section 1.2 performs binary classification, that is, it separates data into two classes. Named entity recognition, however, is a multiclass classification problem, since natural language has more than two NE tags. In this work, the five tags listed in Table 1 are used to cover all grammatical categories: four are NE tags, and the fifth represents words that do not belong to any named entity. The number of tags gives the number of classes. The problem must therefore be binarized before SVMs can be applied to NE tagging. One-vs-rest binarization is suggested in [14]: an SVM is trained for each NE tag to distinguish that class from the rest. When a word is tagged, the most confident prediction among all the binary SVMs is selected.
Since NER in this work has five classes, a multiclass SVM is used. Five SVMs are trained, one for each of the five tags. For each new word, all five SVMs are evaluated and the most confident tag is assigned to the word. This is illustrated with an example in Figure 4.
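The selection rule can be sketched as an argmax over per-tag confidence scores. The tag names (PER, LOC, ORG, MISC, O), the linear scorers, and their weights below are hypothetical placeholders, since the actual tag set of Table 1 and the trained models are not reproduced here.

```python
from typing import Callable, Dict, Sequence

Scorer = Callable[[Sequence[float]], float]


def linear(w: Sequence[float], b: float) -> Scorer:
    """Build a binary scorer x -> w.x + b (stand-in for one trained SVM)."""
    return lambda x: sum(wi * xi for wi, xi in zip(w, x)) + b


def one_vs_rest_predict(scorers: Dict[str, Scorer], x: Sequence[float]) -> str:
    """Return the tag whose binary classifier is most confident about x."""
    return max(scorers, key=lambda tag: scorers[tag](x))


# Hypothetical tags and weights: one binary scorer per class.
SCORERS: Dict[str, Scorer] = {
    "PER": linear([1.0, 0.0, 0.0], -0.5),
    "LOC": linear([0.0, 1.0, 0.0], -0.5),
    "ORG": linear([0.0, 0.0, 1.0], -0.5),
    "MISC": linear([0.2, 0.2, 0.2], -0.6),
    "O": linear([-1.0, -1.0, -1.0], 0.4),
}

print(one_vs_rest_predict(SCORERS, [1.0, 0.0, 0.0]))
print(one_vs_rest_predict(SCORERS, [0.0, 0.0, 0.0]))
```

Because the scores of independently trained binary classifiers are compared directly, this rule implicitly assumes that their confidence values are on comparable scales.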

Tool Used
For this work, SVM multiclass [15] is used. SVM multiclass is an implementation of Support Vector Machines (SVMs) in the C programming language. A main feature of this system is that a custom kernel can be integrated very easily. Because of steepest feasible descent and caching of kernel evaluations, SVM multiclass is very fast. It can easily handle thousands of support vectors and several hundred thousand training examples. First, the system learns from a training file using the customized kernel function and creates a model file, which stores the learned support vectors. This model file is then used to classify new examples. After testing is complete, the system produces a prediction file that contains the confidence value of each example for that classification.
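The training file mentioned above has to encode each token as a class label plus a feature vector. The sketch below assumes the sparse text format commonly used by SVM-light family tools, in which each line holds an integer class identifier followed by index:value pairs in ascending index order. The helper name and the sample feature indices are hypothetical, and the exact format should be checked against the documentation of the tool.

```python
from typing import Dict


def to_svmlight_line(class_id: int, features: Dict[int, float]) -> str:
    """Format one example as '<class> <index>:<value> ...' with ascending indices."""
    if class_id < 1:
        raise ValueError("class ids are assumed to start at 1")
    parts = [str(class_id)]
    for index in sorted(features):
        if index < 1:
            raise ValueError("feature indices are assumed to start at 1")
        value = features[index]
        if value != 0:  # sparse format: zero-valued features are omitted
            parts.append(f"{index}:{value:g}")
    return " ".join(parts)


# Hypothetical example: class 2 with binary features 1 and 3 switched on.
print(to_svmlight_line(2, {3: 1, 1: 1, 7: 0}))
```

Writing one such line per token yields a training file; a test file uses the same layout.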

Result and Discussion
This study empirically analyzes the performance of the NE recognizer. During development of the model, the impact of the sizes of the training and test data on performance was observed. The experiment was run for three different sizes of training data; the results show that the performance of the method depends on the size of the training data.
In Experiment No. 1, shown in Table 2, the training data set consists of 5000 tokens; the SVM is trained with these tokens and tested with 10 different test data sets ranging in size from 1000 to 5500 tokens.
In Experiment No. 2, shown in Table 3, the training data set consists of 15,000 tokens; the SVM is trained with these tokens and tested with 10 different test data sets ranging in size from 1000 to 5500 tokens.
In Experiment No. 3, shown in Table 4, the training data set consists of 29,298 tokens; the SVM is trained with these tokens and tested with 10 different test data sets ranging in size from 1000 to 5500 tokens.
Table 5 shows the average result of each of the three experiments. The result for Experiment No. 1 is 65.93% precision, 80.42% recall, and 72.44% F-score; for Experiment No. 2 it is 82.66% precision, 97.27% recall, and 89.36% F-score; and for Experiment No. 3 it is 86.85% precision, 98.53% recall, and 92.31% F-score (each taken as the average of ten runs). From these experiments, it is observed that the learning ability of the SVM for NE recognition increases as the size of the training data increases. The learning curve corresponding to the results in Table 5 is presented in Figure 5.
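For reference, the three measures are related as in the sketch below, which assumes the standard definitions (precision and recall from true positive, false positive, and false negative counts, and the F-score as their harmonic mean). The function names and the sample counts are our own. Applying the harmonic mean to the averaged precision and recall of Experiment No. 1 gives a value close to, but not identical with, the reported 72.44%, which is expected if the reported F-score was averaged per run.

```python
from typing import Tuple


def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (balanced F-score)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def precision_recall_f(true_pos: int, false_pos: int, false_neg: int) -> Tuple[float, float, float]:
    """Compute precision, recall, and F-score from raw counts."""
    predicted = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall, f_score(precision, recall)


# Averaged precision and recall reported for Experiment No. 1 (in percent).
print(round(f_score(65.93, 80.42), 2))  # approximately 72.46

# Hypothetical counts: 8 correct NEs, 2 spurious, 2 missed.
print(precision_recall_f(8, 2, 2))
```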

Conclusions and Future Work
In this work, a method for extracting named entities from data of various domains has been presented; the resulting system is useful for the identification and classification of names. Nepali NER is very complex because of the nature of the Nepali language, which has free word order, and because of the lack of research on Nepali text. No corpus exists for named entities in Nepali, and creating one is difficult and tedious. For this work, the NE corpus was created manually. The corpus used is small compared with those of other languages, and its size can be increased in future. Other classification methods may also be tested for NER in future.

Figure 5. Learning curve for SVM based NE tagger.
• Digit features: Several binary valued digit features are defined, depending on the presence and/or the number of digits in a token (e.g., ContainsDigit [token contains digits], FourDigit [token consists of four digits], TwoDigit [token consists of two digits]), on combinations of digits and punctuation symbols (e.g., ContainsDigitAndComma [token consists of digits and comma]), and on combinations of digits and symbols (e.g., ContainsDigitAndSlash [token consists of digits and slash], ContainsDigitAndHyphen [token consists of digits and hyphen], ContainsDigitAndPercentage [token consists of digits and percentage sign]). These binary valued features are helpful in recognizing miscellaneous NEs, such as time expressions (Age, Date, Year), measurement expressions (Weight, Height, etc.), and numerical expressions.
• Gazetteer lists: Various gazetteer lists are used.
○ Person name: This list contains the names of persons. The feature PersonName is set to +1 for the current word if it appears in the list.
○ Location name: This list contains location names. The feature LocationName is set to +1 for the current word if it appears in the list.
○ Organization name: This list contains organization names. The feature OrganizationName is set to +1 for the current word if it appears in the list.
○ Month name: This list contains the names of all twelve months of both the English and the Nepali calendars. The feature MonthName is set to +1 for the current word if it appears in the list.
○ Day name: This list contains the names of all seven days of the Nepali calendar. The feature DayName is set to +1 for the current word if it appears in the list.
○ PersonPrefix: This list contains person prefixes such as श्री, श्रीमान, श्रीमति etc.
○ MiddleName: This list contains Nepali middle names such as बहादुर, कुमार, कुमारी, देबी, राज, प्रसाद etc.
○ SurName: This list contains Nepali surnames such as बम, पन्त, जोशि, भट्ट, दाहाल etc.
○ CommonLocationWord: This list contains common location words such as रोड, बाटो, राजमार्ग, नगर etc.
○ Action Verb: A set of action verbs such as सुन, भन, गर, खाउ, जाउ etc. often determines the presence of person names. Person names generally appear before action verbs.
○ Designation Word: This list contains designation words such as प्रोफेशर, डा., मन्त्रि, रास्ट्रपति, सचिब, अध्यक्ष, महासचिव etc.
○ Organization Suffix Word: This list contains organization suffix words such as मिल, प्रालि, कम्पनि, समिति, संघ, कार्यालय etc.
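The digit and gazetteer features listed above can be sketched as follows. This is one possible reading, not the implementation used in this work: the combined features are interpreted loosely as "contains a digit and the symbol", a word absent from a list is assumed to receive the value 0, and the two small gazetteers are hypothetical samples built from words quoted in the list.

```python
from typing import Dict, Set


def digit_features(token: str) -> Dict[str, int]:
    """Binary digit features; str.isdigit also accepts Devanagari digits."""
    has_digit = any(ch.isdigit() for ch in token)
    digits_only = token.isdigit()
    return {
        "ContainsDigit": int(has_digit),
        "FourDigit": int(digits_only and len(token) == 4),
        "TwoDigit": int(digits_only and len(token) == 2),
        # Assumption: "digits and symbol" means a digit plus the symbol are present.
        "ContainsDigitAndComma": int(has_digit and "," in token),
        "ContainsDigitAndSlash": int(has_digit and "/" in token),
        "ContainsDigitAndHyphen": int(has_digit and "-" in token),
        "ContainsDigitAndPercentage": int(has_digit and "%" in token),
    }


def gazetteer_features(token: str, gazetteers: Dict[str, Set[str]]) -> Dict[str, int]:
    """Set a list's feature to 1 when the token is in that list, else 0 (assumed)."""
    return {name: int(token in words) for name, words in gazetteers.items()}


# Hypothetical miniature gazetteers using words quoted in the feature list.
SAMPLE_GAZETTEERS: Dict[str, Set[str]] = {
    "PersonPrefix": {"श्री", "श्रीमान"},
    "CommonLocationWord": {"रोड", "बाटो", "नगर"},
}

print(digit_features("२०७०"))
print(gazetteer_features("रोड", SAMPLE_GAZETTEERS))
```

In a full system the gazetteers would be loaded from the dictionary files, and multi-word names would need matching over token spans rather than single tokens.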

Table 1. Named Entity Tag-set for Nepali NER.

Table 5. The Precision, Recall and F-Score for different training data sizes.