Mobile SMS Spam Filtering for Nepali Text Using Naïve Bayesian and Support Vector Machine

Spam is a universal problem with which everyone is familiar. A number of approaches are used for Spam filtering. The most common technique is content-based filtering, which uses the actual text of a message to determine whether it is Spam or not. The content is very dynamic, and it is challenging to represent all of its information in a mathematical model of classification. For instance, in content-based Spam filtering, the characteristics used by the filter to identify Spam messages change constantly over time. The Naïve Bayes method represents the changing nature of messages using probability theory, and the support vector machine (SVM) represents it using different features. These two methods of classification are efficient in different domains, but they have not yet been applied to Nepali SMS or text classification, so it is interesting to find out how both perform on that problem. In this paper, Naïve Bayes and SVM-based classification techniques are implemented to classify Nepali SMS as Spam and non-Spam. An empirical analysis for various text cases has been done to evaluate the accuracy of the classification methodologies used in this study. The accuracy is found to be 87.15% for SVM and 92.74% for Naïve Bayes.


Introduction
Spam can be defined as unsolicited (unwanted, junk) email for a recipient, or any email that users do not want to have in their inboxes. Spam filtering is a special problem in the field of document classification and machine learning. In recent years, mobile devices have increased in computational power, and other powerful systems have become able to connect to mobile phone networks. This has also increased communication through SMS. Nobody wants unwanted SMS in the inbox of their cell phone; users want their inboxes to be free from such annoying messages. SMS has certain characteristics that differ from those of mail. A mail contains structured information such as a subject, mail header, salutation and sender's address, but SMS lacks such structured information. This makes the SMS classification task much more difficult and creates the need for an efficient SMS filtering method. The basic principle of Spam filtering is shown in Figure 1.

Related Work
Before 1990, some Spam prevention tools began to emerge in response to Spammers who had started to automate the process of sending Spam email. The first Spam prevention tools used a simple approach based on language analysis: they scanned emails for suspicious senders or phrases such as "click here to buy" and "free of charge". In the late 1990s, blacklisting and whitelisting methods were implemented at the Internet Service Provider (ISP) level. However, these methods suffered from maintenance problems.
Many efforts are underway to stop the increase of Spam that plagues almost every user on the mobile network, and various techniques have been used to filter Spam messages. The Naïve Bayes [1] classifier is a simple probabilistic classifier. Its main advantage is that it can be trained very efficiently in a supervised learning setting. In many practical applications, the parameters of Naïve Bayes models are estimated by the Maximum Likelihood Estimation (MLE) method. The Decision Tree [2] is one of the best-known tools of decision-making theory; it is a classifier in the form of a tree structure that shows the reasoning process. The Support Vector Machine [3] is a linear maximal-margin binary classifier. It can be interpreted as finding a hyper-plane in a linearly separable feature space that separates the two classes with maximum margin; the instances closest to the hyper-plane are known as the "support vectors", as they support the hyper-plane on both sides of the margin. Using these techniques, different software has been developed to filter Spam emails. The basic concept of these techniques is the classification of SMS or email using a trained classifier that can automatically predict whether an incoming SMS or email is Spam or legitimate. This automatic process increases filtering performance and provides better usability than manual classification.
Some more complex approaches have also been proposed against the Spam problem, most of them implemented using machine learning methods. The Naïve Bayes algorithm is used frequently and has shown considerable success in filtering Spam emails in English [4]. Knowledge-based and rule-based systems have also been used by researchers for English Spam filters [5,6]. SVM is used for text classification [7] and can also be applied to Spam filtering.
No work has yet been done on Spam filtering for Nepali text SMS, and it is necessary to start such work. Resources such as a training SMS corpus are not available for the Nepali language, so the corpus used in this work was created manually. The training corpus developed during this study can be made available for research purposes.

Methodology: A Proposed Framework for Spam SMS Filtering
The flowchart of the Spam filtering engine is given in Figure 2. It describes the top-level data flow of the Spam classification problem addressed in this research work. The proposed framework contains three steps: preprocessing, feature extraction and classification.

Preprocessing
The purpose of preprocessing is to transform SMS messages into a uniform format that can be understood by the learning algorithm. Text preprocessing is the first step of the text mining process, in which the collection of documents is analysed syntactically or semantically. The text message is treated as a bag of words, because the words and their occurrences are used to represent the document. The operations applied in this stage are stemming, stop word removal, number removal and whitespace stripping.
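The preprocessing operations listed above can be sketched in Java, the implementation language reported for this work. This is an illustration under stated assumptions, not the code used in the study: the class name `SmsPreprocessor`, the four-word stop list (the Nepali words transliterated as ra, ko, ma and chha) and the whitespace tokenisation are all hypothetical, and stemming is omitted because the text does not specify a Nepali stemmer. Devanagari text is written with Unicode escapes so that the source stays ASCII-only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class SmsPreprocessor {
    // Hypothetical stop-word list (ra, ko, ma, chha), for illustration only;
    // it is not the list used in the study.
    private static final Set<String> STOP_WORDS = new HashSet<>(
            Arrays.asList("\u0930", "\u0915\u094B", "\u092E\u093E", "\u091B"));

    /** Returns the cleaned tokens of one SMS (stemming is not performed). */
    static List<String> preprocess(String sms) {
        // Number removal: drop every Unicode decimal digit, which covers
        // both ASCII digits and Devanagari digits.
        String noDigits = sms.replaceAll("\\p{Nd}+", " ");
        // Whitespace stripping: trim the ends and split on whitespace runs.
        String[] raw = noDigits.trim().split("\\s+");
        List<String> tokens = new ArrayList<>();
        for (String token : raw) {
            // Stop word removal: keep only non-empty, non-stop-word tokens.
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }
}
```

The returned token list is the bag-of-words representation that the later feature extraction step would consume.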

TF-IDF Calculation and Feature Vector Construction
In this work, the most widely adopted feature weighting scheme in Information Retrieval (IR), known as TF-IDF, is used to represent each message as a vector in a vector space model. The weight is calculated as in Equation (1):

$$w_{ij} = tf_{ij} \times \log\left(\frac{N}{DF_i}\right) \quad (1)$$

where $w_{ij}$ is the weight of term $i$ in SMS $j$, $tf_{ij}$ is the frequency of term $i$ in SMS $j$, $N$ is the total number of SMS in the training set and $DF_i$ is the number of SMS containing the term $i$. The importance of a term in an SMS is thus measured by its frequency and its inverse document frequency.
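As a worked illustration of this weighting, the following Java sketch computes TF-IDF weights for a tokenised message. It assumes the standard form w_ij = tf_ij × ln(N / DF_i); the natural logarithm, the class name `TfIdf` and the choice to skip terms that never occur in the training corpus are assumptions of this sketch, not details given in the paper.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

final class TfIdf {
    /** Document frequency DF_i: the number of messages that contain term i. */
    static Map<String, Integer> documentFrequency(List<List<String>> corpus) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> sms : corpus) {
            // A set ensures each term is counted once per message.
            for (String term : new HashSet<>(sms)) {
                df.merge(term, 1, Integer::sum);
            }
        }
        return df;
    }

    /** Weights w_ij = tf_ij * ln(n / DF_i) for one message j. */
    static Map<String, Double> weights(List<String> sms,
                                       Map<String, Integer> df, int n) {
        Map<String, Integer> tf = new HashMap<>();
        for (String term : sms) {
            tf.merge(term, 1, Integer::sum);
        }
        Map<String, Double> w = new HashMap<>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            Integer dfi = df.get(e.getKey());
            if (dfi == null) {
                continue; // assumption: terms unseen in training are skipped
            }
            w.put(e.getKey(), e.getValue() * Math.log((double) n / dfi));
        }
        return w;
    }
}
```

A term that occurs in every training message receives weight zero under this formula, since ln(N / N) = 0, which is the intended effect of the inverse document frequency factor.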

Classification
Consider the problem of classifying documents or messages (SMS) by their content, for example, into Spam and non-Spam messages. A document is drawn from a set of documents (Spam and non-Spam), each of which can be modeled as a set of words. The (independent) probability that the $i$-th word of a given document occurs in a document from class $C$ can be written as $p(w_i \mid C)$. Then the probability that a given document $D$ contains all of the words $w_i$, given a class $C$, is

$$p(D \mid C) = \prod_i p(w_i \mid C)$$

Now, by definition,

$$p(D \mid C) = \frac{p(D \cap C)}{p(C)} \quad \text{and} \quad p(C \mid D) = \frac{p(D \cap C)}{p(D)}$$

Bayes' theorem manipulates these into a statement of probability in terms of likelihood:

$$p(C \mid D) = \frac{p(C)}{p(D)}\, p(D \mid C)$$

Assume for the moment that there are only two mutually exclusive classes, $S$ and $\neg S$ (i.e. Spam and not Spam), such that every element (message) is in either one or the other:

$$p(D \mid S) = \prod_i p(w_i \mid S) \quad \text{and} \quad p(D \mid \neg S) = \prod_i p(w_i \mid \neg S)$$

Using the Bayesian result above, the posterior probabilities of the two classes are

$$p(S \mid D) = \frac{p(S)}{p(D)} \prod_i p(w_i \mid S) \quad \text{and} \quad p(\neg S \mid D) = \frac{p(\neg S)}{p(D)} \prod_i p(w_i \mid \neg S)$$

and the message is assigned to the class with the larger posterior.

An SVM constructs the hyper-plane in input space that correctly separates the example data into two classes. Hence, SVM is a binary classifier. This hyper-plane can be used to predict the class of unseen data. Such a hyper-plane always exists for linearly separable data [8]. Each SMS is converted into a feature vector on a bag-of-words basis, and the length of the feature vector is equal to the number of words in the Dictionary. The Dictionary consists of feature words from the training corpus. Some frequent Spam words are also included in the Dictionary.
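The Naïve Bayes decision described above can be sketched in Java as follows. This is a minimal sketch under stated assumptions rather than the implementation used in the study: the class name `NaiveBayesSpamFilter` is hypothetical, word probabilities are estimated from raw counts with add-one (Laplace) smoothing, and sums of logarithms replace the products to avoid numerical underflow. Because p(D) is common to both posteriors, it cancels in the comparison and is never computed.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class NaiveBayesSpamFilter {
    private final Map<String, Integer> spamCounts = new HashMap<>();
    private final Map<String, Integer> hamCounts = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int spamWords, hamWords, spamDocs, hamDocs;

    /** Adds one labelled, already tokenised message to the model. */
    void train(List<String> tokens, boolean isSpam) {
        if (isSpam) spamDocs++; else hamDocs++;
        Map<String, Integer> counts = isSpam ? spamCounts : hamCounts;
        for (String w : tokens) {
            counts.merge(w, 1, Integer::sum);
            vocabulary.add(w);
            if (isSpam) spamWords++; else hamWords++;
        }
    }

    /** log p(C) + sum_i log p(w_i | C), using add-one smoothing (assumed). */
    private double logScore(List<String> tokens, Map<String, Integer> counts,
                            int classWords, int classDocs) {
        double score = Math.log((double) classDocs / (spamDocs + hamDocs));
        for (String w : tokens) {
            int c = counts.getOrDefault(w, 0);
            // Smoothing keeps unseen words from forcing the product to zero.
            score += Math.log((c + 1.0) / (classWords + vocabulary.size()));
        }
        return score;
    }

    /** Spam iff p(S|D) > p(not S|D); p(D) cancels on both sides. */
    boolean isSpam(List<String> tokens) {
        return logScore(tokens, spamCounts, spamWords, spamDocs)
                > logScore(tokens, hamCounts, hamWords, hamDocs);
    }
}
```

Smoothing is one common way to handle words that appear in a message but were never seen for a class during training; other estimators could be substituted without changing the decision rule.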

Experimental Setup and Results
The Java programming language is used for the implementation of the proposed framework. SVM light [9] is used as the classification tool for SVM, and Naïve Bayes is implemented in Java.
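The paper does not describe how feature vectors were handed to SVM light. One plausible route, sketched here as an assumption, is to write each message as a line in the sparse text format that SVM light reads for training examples: a target of +1 or -1 followed by `index:value` pairs with feature indices in increasing order. The class name `SvmLightWriter`, the mapping of Spam to +1 and the six-decimal formatting are hypothetical choices of this sketch.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

final class SvmLightWriter {
    /**
     * Formats one message as a sparse example line.
     * The map goes from a 1-based Dictionary index to the feature value.
     */
    static String toLine(boolean isSpam, Map<Integer, Double> weights) {
        // Assumption: Spam is the positive class.
        StringBuilder sb = new StringBuilder(isSpam ? "+1" : "-1");
        // TreeMap iteration yields feature indices in ascending order.
        for (Map.Entry<Integer, Double> e : new TreeMap<>(weights).entrySet()) {
            if (e.getValue() == 0.0) {
                continue; // sparse format: zero-valued features are omitted
            }
            sb.append(' ').append(e.getKey()).append(':')
              .append(String.format(Locale.ROOT, "%.6f", e.getValue()));
        }
        return sb.toString();
    }
}
```

Writing one such line per training SMS would give a training file; `Locale.ROOT` is used so that the decimal separator is always a period regardless of the system locale.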
Naïve Bayes and Support Vector Machine algorithms have been implemented for the Spam filtering task. The study carried out an empirical analysis of the performance of both Spam filters (SVM and Naïve Bayes) for Nepali SMS. It is observed from the experiment that the Spam filter based on Naïve Bayes outperforms the Spam filter based on SVM. Extensive tests have been performed with varying data set sizes. The success rates reach their maximum when all the messages and all the words in the training corpus are used.
Tables 1-3 show the results of the experiments; the learning methods perform well when they are trained using more examples.

Conclusions and Future Work
The main concern of this study was to examine the efficiency of Naïve Bayesian and SVM Spam filters. The comparison between these Spam filters was made on the basis of accuracy, precision and recall. This comparison helps to find the better algorithm for Spam filtering. A classification accuracy of 92.74% was obtained for the Naïve Bayes classifier and 87.15% for the SVM classifier on the Nepali Spam dataset. On the basis of accuracy, Naïve Bayes is a better classification technique than the SVM-based classifier.
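For reference, the three measures used in this comparison can be computed from the counts of a confusion matrix, as in the Java sketch below. It assumes the usual definitions with Spam as the positive class; the class name `Metrics` and the parameter names (tp, tn, fp, fn for true positives, true negatives, false positives and false negatives) are illustrative and do not come from the paper.

```java
final class Metrics {
    /** Fraction of all messages that were classified correctly. */
    static double accuracy(int tp, int tn, int fp, int fn) {
        return (double) (tp + tn) / (tp + tn + fp + fn);
    }

    /** Fraction of messages flagged as Spam that really are Spam. */
    static double precision(int tp, int fp) {
        return (double) tp / (tp + fp);
    }

    /** Fraction of actual Spam messages that the filter caught. */
    static double recall(int tp, int fn) {
        return (double) tp / (tp + fn);
    }
}
```

With a zero denominator, for example when the filter flags no message as Spam, these methods return NaN, so a caller would need to handle that case explicitly.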
No Spam filtering system with one hundred percent accuracy has been invented so far. The classification accuracy of the Naïve Bayesian and SVM classifiers used in this research work can, however, be further improved. Here, the TF-IDF scheme was used to build the feature vector; it considers only weighted words and not the individual words of the SMS in their context. Techniques that use context-based features can be used instead.
The features used to convert a given message into a vector can be enriched so that higher accuracy can be achieved. Due to the small SMS corpus size, there is an unknown word problem in the Naïve Bayes classifier. Hence, some other techniques to handle unknown words can be used. The size of the SMS corpus can be increased by collecting more real SMS in the future.

Figure 1. The basic idea of Spam filtering.

Finally, the document can be classified as follows: it is Spam if $p(S \mid D) > p(\neg S \mid D)$. In their basic form, shown in Figure 3, SVMs construct the separating hyper-plane.