Support Vector Machine for Sentiment Analysis of Nigerian Banks Financial Tweets

The rise of social media paves the way for unprecedented benefits or risks to organisations, depending on how they adapt to its changes. This rise comes with the challenge of gaining insights from such big data for effective and efficient decision making that can improve quality, profitability, productivity, competitiveness and customer satisfaction. Sentiment analysis is the field concerned with the classification and analysis of user-generated text under defined polarities. Despite the upsurge of research in sentiment analysis in recent years, there is a dearth of literature on sentiment analysis applied to banks' social media data, especially on African datasets. Against this background, this study applied a machine learning technique (support vector machine) for sentiment analysis of Nigerian banks' Twitter data over a 2-year period, from 1st January 2017 to 31st December 2018. After crawling and preprocessing the data, the LibSVM algorithm in WEKA was used to build the sentiment classification model based on the training data. The performance of this model was evaluated on a pre-labelled test dataset generated from the five banks. The results show that the accuracy of the classifier was 71.8367%. The precision for both the positive and negative classes was above 0.7; the recall for the negative class was 0.696 and that of the positive class was 0.741, which, together with the other measures, shows the prediction did better than chance. Applying the model to the five Nigerian banks' Twitter data reveals that the number of positive tweets within this period was slightly greater than the number of negative tweets. The scatter plots of the sentiment series indicate that the majority of the data falls between 0 and 100 sentiments per day, with few outliers above this range.


Introduction
In the dynamic world that we now live in, the ideas, thoughts, beliefs, opinions and decisions of people are shared in real time on different platforms such as Twitter, Facebook and LinkedIn, to name a few. However, gaining insights from these textual data for effective decision making can be a herculean task, partly due to the huge volume of information being shared by a large population and partly due to the complexity associated with quantifying text data for modelling purposes. For instance, Twitter had over 310 million active users as of the first quarter of 2017, posting over 500 million tweets per day [1]. Achieving this herculean task of gaining insights from textual data falls within the domain of sentiment analysis. Sentiment analysis, also called opinion mining, is the field of study that analyzes people's opinions, sentiments, evaluations, appraisals, attitudes, and emotions towards entities such as products, services, organizations, individuals, issues, events, topics, and their attributes [2]. Simply put, sentiment analysis is a field concerned with the classification and analysis of user-generated text under defined polarities. There exist three major approaches to sentiment analysis, namely, the machine learning approach [3] [4], the lexicon-based approach [5] [6] [7] and the hybrid approach [8] [9]. The machine learning approach can be further classified into supervised, unsupervised and semi-supervised machine learning. Supervised machine learning algorithms such as Support Vector Machine (SVM), Naïve Bayes Classifier and Maximum Entropy, to name a few, perform classification on the target corpus using already labelled training data, while unsupervised machine learning algorithms use unlabelled input data to find structure in the data, which is then used to determine text polarity.
The lexicon-based approach utilizes a dictionary-based annotated corpus to classify the polarity of a text, while the hybrid approach combines both the lexicon-based and machine learning approaches in determining the sentiment of a text.
The importance of sentiment analysis cannot be overemphasised, as its application and impact span diverse fields and domains. Organisations, companies, agencies and governments can leverage sentiment analysis to gain insights that enhance efficient and effective decision making. Also, by implementing sentiment analysis, organisations can take appropriate measures to remain competitive in the marketplace, by identifying products and services that customers are not satisfied with and improving them through price reallocation, quality improvement or the addition of new features. Similarly, in the academic domain, according to [2] there has been widespread and growing interest among researchers since the early 2000s in sentiment analysis and its applications in several fields, beginning with works like [10] [11]. In fact, the review of sentiment analysis by [2] has been cited over 8000 times, which shows the growing interest in and awareness of sentiment analysis. Sentiment analysis has been shown to be beneficial in the finance sector, with works such as [12] [13]. In a similar vein, [14] demonstrated that the stock market itself can be considered as a measure of social mood. However, despite the upsurge of research in sentiment analysis, there is a dearth of literature on sentiment analysis applied to banks' social media data, especially on African datasets. To the best of our knowledge, this paper is the first work to apply a machine learning technique for sentiment analysis of Nigerian banks' social media data. It is based on these findings that this research is carried out. This work will be of great benefit not only in expanding the domain of sentiment analysis, but also of profound help to Nigerian banks for customer intelligence and education so as to improve customer satisfaction.
It will also emphasise the utilization of social media analytics in promoting their products and services, risk management, business forecasting, competitive analysis, and product and service design.
The rest of the paper is structured as follows: Section 2 presents our proposed framework for sentiment analysis and the data and software used in the study. In Section 3, we present the Twitter data preprocessing techniques employed in cleaning the data, while in Section 4 we give a general overview of the support vector machine, the mathematical theory behind its solution, the LibSVM algorithm in WEKA for its implementation and the model evaluation parameters. Section 5 presents the results and discussion of the Twitter classification model and its results when applied to the five Nigerian banks considered in the study. Finally, Section 6 presents the conclusion and recommendations of the study.

Framework
We present our proposed framework for sentiment analysis, illustrated in Figure 1. It shows the process of data gathering using Twitter's Application Programming Interface (API), preprocessing of the gathered Twitter data, training and evaluation of the classification model, and deployment of the model on the banks' datasets.

1) Nigerian banks twitter data
The data used for this study was retrieved from Twitter's API, covering a 2-year period from 1st January 2017 to 31st December 2018. Since the data crawled were publicly available and the usage was within the stipulated Twitter data usage terms, there was no need for ethical review. The data was retrieved for the following Nigerian banks using their specific Twitter filter operators, as shown in Table 1 below.
2) Training data
The training dataset used in this study is the Sentiment140 dataset, which consists of 1.6 million tweets automatically labelled as positive (800,000 tweets) and negative (800,000 tweets) using the assumption that tweets with positive emoticons are positive and tweets with negative emoticons are negative. Since training a classifier with 1.6 million tweets on a normal computer can take days to complete, after preprocessing this 1.6 million tweets dataset we randomly selected 75,000 positive and 75,000 negative tweets, which form our training dataset of 150,000 tweets.

3) Test data
The test dataset used in this study was created from the crawled Nigerian banks' Twitter datasets. We randomly selected 100 tweets from each bank. Since we have five banks, this brings our test data to 500 tweets. We then manually classified these 500 tweets as positive or negative. This test data was used to evaluate the performance of the SVM classifier, which was trained using the 150,000 tweets obtained from the Sentiment140 dataset.

4) Software used in the study
This study utilized the Waikato Environment for Knowledge Analysis (WEKA) software for training the SVM classification model and deploying its results on the unlabelled datasets. WEKA is a data mining and machine learning software package developed by the University of Waikato, New Zealand. It is widely used by researchers to solve data mining and machine learning problems because of its general public license and its graphical user interface for functionalities such as data analysis, classification, predictive modelling and visualization. In addition, the data preprocessing described in Section 3 was implemented in Python. Python is a very powerful open-source programming language that is effective in solving diverse problems in different learning domains.

Twitter Data Preprocessing
In order to obtain accurate and reliable results, it is paramount that the Twitter data be preprocessed so as to enable the classifier to run smoothly. Data preprocessing is a data mining technique that seeks to clean raw datasets by removing noise and uninformative parts, thereby making the data suitable for analysis with the aim of achieving accurate and reliable results. Furthermore, [24] performed an elaborate study to evaluate the effectiveness of different Twitter data preprocessing techniques. The research found that adopting effective preprocessing of Twitter data can improve system accuracy, and opined that preprocessing is the first step in sentiment analysis when given a raw dataset. In this work, we apply the following preprocessing techniques to the Twitter datasets.

Basic Data Cleaning
This phase involves removing unimportant parts from the tweets. They are tagged unimportant since they do not contain any sentiment and removing them does not change the meaning of the tweet. Basic Twitter data cleaning involves: 1) removing Twitter handles such as @user; 2) removing Uniform Resource Locators (URLs) such as http://www.twitter.com; 3) removing the hashtag symbol # but not the word after it, since most words after hashtags contain sentiment; 4) removing stock market tickers like $GE; 5) removing the old-style retweet marker "RT"; 6) replacing runs of multiple whitespace characters with a single space; 7) removing punctuation, numbers and special characters; and 8) converting all text to lower case.
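The basic cleaning steps above can be sketched in Python with the standard re module (a minimal illustration; the exact patterns used in our pipeline may differ):

```python
import re

def clean_tweet(tweet):
    """Apply the basic cleaning steps listed above to a single tweet."""
    tweet = re.sub(r"@\w+", "", tweet)                   # 1) remove handles
    tweet = re.sub(r"https?://\S+|www\.\S+", "", tweet)  # 2) remove URLs
    tweet = tweet.replace("#", "")                       # 3) drop only the # symbol
    tweet = re.sub(r"\$[A-Za-z]+", "", tweet)            # 4) remove stock tickers
    tweet = re.sub(r"\bRT\b", "", tweet)                 # 5) remove old-style "RT"
    tweet = re.sub(r"[^A-Za-z\s]", "", tweet)            # 7) strip punctuation, numbers, specials
    tweet = re.sub(r"\s+", " ", tweet).strip()           # 6) collapse whitespace
    return tweet.lower()                                 # 8) lower-case
```

For example, `clean_tweet("RT @user: Great service! #GTBank $GE http://t.co/abc")` returns "great service gtbank".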

Removal of Stopwords
Stopwords are words which are used very frequently in a language but carry very little meaning. They are mostly pronouns and articles, for example, words like "is", "and", "was", "the", etc. We also created a stop list of frequent Nigerian stopwords such as "na", "ooo", "kuku", etc. These words are filtered out from the tweets in order to save both space and time.
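A stopword filter along these lines can be sketched as follows (the stop lists shown are small illustrative subsets, not the full lists used in the study):

```python
# Illustrative subsets of the English and Nigerian stop lists
ENGLISH_STOPWORDS = {"is", "and", "was", "the", "a", "an", "of", "to", "in"}
NIGERIAN_STOPWORDS = {"na", "ooo", "kuku"}
STOPWORDS = ENGLISH_STOPWORDS | NIGERIAN_STOPWORDS

def remove_stopwords(tweet):
    """Drop every token that appears in the combined stop list."""
    return " ".join(word for word in tweet.split() if word not in STOPWORDS)
```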

Stemming
Stemming is a technique that seeks to reduce a word to its word stem. For example, words like "quick", "quickly", "quickest", and "quicker" are considered to be from the same stem "quick". This helps in decreasing entropy and thereby increasing the relevance of the word. Before stemming is applied, the text has to be tokenized. Tokenization is the process of splitting a string of text into individual words.
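Tokenization plus stemming can be illustrated with the deliberately naive one-suffix stripper below; a production pipeline would use a proper algorithm such as the Porter stemmer rather than this simplification:

```python
SUFFIXES = ("est", "ly", "er", "ing", "ed", "s")  # checked in this order

def naive_stem(word):
    """Strip one common suffix -- a crude stand-in for a Porter-style stemmer."""
    for suffix in SUFFIXES:
        # Only strip when at least a 3-letter stem remains
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def tokenize_and_stem(tweet):
    """Tokenize on whitespace, then stem each token."""
    return [naive_stem(word) for word in tweet.split()]
```

On the example above, `tokenize_and_stem("quick quickly quickest quicker")` maps every token to the stem "quick".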

Removal of Empty Fields
After performing the required preprocessing, some tweets may become empty.
For example, if a tweet contains only a URL and a mention (@user), it becomes empty once URLs and handles are removed. It is therefore important to remove all empty fields, since an empty field contains no sentiment. Table 2 shows the number of tweets in the different datasets before and after preprocessing. The reduction in the number of tweets is due to the removal of empty fields after applying all the stated preprocessing techniques.

Support Vector Machine
The Support Vector Machine (SVM) algorithm is a supervised machine learning approach that has proven to be very successful in tackling regression and classification problems. It was developed by [25].
Consider a training set of vectors $\vec{x}_i \in \mathbb{R}^n$, $i = 1, \ldots, m$, where $y_i$ is the associated class label, which can take the values $+1$ or $-1$. The goal of the SVM is to find the optimal hyperplane that best separates the data between the two classes. This goal is achieved by maximizing the margin between the two classes (e.g. the blue class and red class in Figure 2 below). The support vectors are the points lying on the boundaries, and the middle of the margin is called the optimal separating hyperplane.
Definition 4.1 (Hyperplane). A hyperplane is a subspace of one dimension less than its ambient space. It is the set of points $\vec{x}$ satisfying the equation
$$\vec{w} \cdot \vec{x} + b = 0.$$

Formulation of the SVM Optimization Problem
From the foregoing, it has been established that the goal of SVM is to find the optimal separating hyperplane that best segregates the data. Given a vector $\vec{w}$ of any length that is constrained to be perpendicular to the median line, and an unknown vector $\vec{u}$, we are interested in knowing whether $\vec{u}$ belongs to class A or class B, as illustrated in Figure 2. Thus, we project $\vec{u}$ onto $\vec{w}$ and use the decision rule $\vec{w} \cdot \vec{u} + b \geq 0$ to assign $\vec{u}$ to the positive class. For the training samples we require
$$y_i(\vec{w} \cdot \vec{x}_i + b) \geq 1, \quad (1)$$
in which the variable $y_i$ is the associated class label, such that $y_i = +1$ for positive samples and $y_i = -1$ for negative samples. This implies that, for points on the margin boundaries (the support vectors), Equation (1) becomes
$$y_i(\vec{w} \cdot \vec{x}_i + b) = 1. \quad (2)$$
Therefore, we want the distance (margin) between the positive and negative samples to be as wide as possible.
The distance (margin) between the two support hyperplanes (illustrated by the two green lines in Figure 3 below) is the dot product of the difference vector of two support vectors $\vec{x}_+$ and $\vec{x}_-$ and the unit normal vector:
$$\text{margin} = (\vec{x}_+ - \vec{x}_-) \cdot \frac{\vec{w}}{\|\vec{w}\|}.$$
Utilizing Equation (2), $\vec{w} \cdot \vec{x}_+ = 1 - b$ and $\vec{w} \cdot \vec{x}_- = -1 - b$, so
$$\text{margin} = \frac{2}{\|\vec{w}\|}.$$
Maximizing this margin is therefore equivalent to minimizing $\frac{1}{2}\|\vec{w}\|^2$.
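A quick numeric check of the margin formula, using a hypothetical weight vector rather than one fitted to the paper's data:

```python
import math

# Hypothetical separating hyperplane with w = (3, 4): ||w|| = 5,
# so the width between the two support hyperplanes is 2 / ||w||.
w = (3.0, 4.0)
margin = 2.0 / math.hypot(*w)
```

With this $\vec{w}$ the margin evaluates to 0.4; a shorter $\vec{w}$ yields a wider margin, which is why the optimization minimizes $\|\vec{w}\|$.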

Solution of the SVM Optimization Problem
The strategy that helps in finding the local maxima and minima of a function subject to equality constraints was developed by the Italian-French mathematician Giuseppe Lodovico Lagrangia, also known as Joseph-Louis Lagrange.

The SVM Lagrangian Problem
From Equation (7), our objective function is
$$f(\vec{w}) = \frac{1}{2}\|\vec{w}\|^2,$$
with $m$ constraint functions
$$g_i(\vec{w}, b) = y_i(\vec{w} \cdot \vec{x}_i + b) - 1 \geq 0, \quad i = 1, \ldots, m.$$
Introducing the Lagrangian function, we have
$$\mathcal{L}(\vec{w}, b, \alpha) = \frac{1}{2}\|\vec{w}\|^2 - \sum_{i=1}^{m} \alpha_i \left[ y_i(\vec{w} \cdot \vec{x}_i + b) - 1 \right], \quad (11)$$
where $\alpha_i$ is a Lagrange multiplier for each constraint function.
The SVM Lagrangian problem in Equation (11) can be rewritten using the duality principle to aid its solvability. To obtain the solution of the primal problem, we need to solve the Lagrangian problem. The duality principle tells us that an optimization problem can be viewed from two perspectives. The first one is the primal problem, a minimization problem in our case, and the other one is a dual problem, which will be a maximization problem. Since we are solving a convex optimization problem, then Slater's condition holds for affine constraints, and Slater's theorem tells us that strong duality holds. This implies that the maximum of the dual problem is equal to the minimum of the primal problem.
The minimization problem is solved by taking the partial derivatives of $\mathcal{L}$ in Equation (11) with respect to $\vec{w}$ and $b$. Differentiating with respect to $\vec{w}$ and setting the result to zero, we have
$$\nabla_{\vec{w}} \mathcal{L} = \vec{w} - \sum_{i=1}^{m} \alpha_i y_i \vec{x}_i = 0 \implies \vec{w} = \sum_{i=1}^{m} \alpha_i y_i \vec{x}_i. \quad (12)$$
Equation (12) shows that the vector $\vec{w}$ is a linear sum of the samples. Similarly, differentiating with respect to $b$ and setting the result to zero gives
$$\sum_{i=1}^{m} \alpha_i y_i = 0. \quad (13)$$
Substituting Equations (12) and (13) into Equation (11) gives the Wolfe dual problem:
$$W(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j \, \vec{x}_i \cdot \vec{x}_j, \quad (14)$$
to be maximized subject to $\alpha_i \geq 0$ and $\sum_{i=1}^{m} \alpha_i y_i = 0$. The main advantage of the Wolfe dual problem over the Lagrangian problem is that the objective function now depends only on the Lagrange multipliers.
Also, the optimization problem depends only on the dot product of pairs of samples (i.e. $\vec{x}_i \cdot \vec{x}_j$). This aids computation using software.

KKT Optimality Condition for the SVM Optimization Solution
The Karush-Kuhn-Tucker (KKT) conditions are necessary and sufficient conditions for an optimal point of a positive definite Quadratic Programming problem. Thus, for a solution to be optimal, it has to satisfy the KKT conditions. According to [26], "solving the SVM problem is equivalent to finding a solution to the KKT conditions". Therefore, since we've solved the SVM optimization problem, by [26] we've also obtained the solution to the KKT conditions.

Soft Margin SVM
The hard margin SVM demands that the data be linearly separable. However, real-life data is often noisy, due to issues such as mistyped values and the presence of outliers. To solve this problem, [27] developed a modified version of the original SVM which permits the classifier to make some mistakes: the aim is not to make zero mistakes, but to make as few mistakes as possible. This modification is made possible by introducing slack variables $\zeta_i \geq 0$. The constraint becomes
$$y_i(\vec{w} \cdot \vec{x}_i + b) \geq 1 - \zeta_i,$$
and the objective becomes
$$\min_{\vec{w}, b, \zeta} \; \frac{1}{2}\|\vec{w}\|^2 + C \sum_{i=1}^{m} \zeta_i,$$
where the parameter $C$ controls the trade-off between the width of the margin and the number of mistakes. Therefore, the Wolfe dual problem is the same as in the hard margin case, except that each multiplier is now bounded: $0 \leq \alpha_i \leq C$.
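The soft-margin idea can be illustrated with a toy linear SVM trained by stochastic subgradient descent on the hinge loss. This is only a sketch on made-up 2-D data; the study itself solves the dual problem with LibSVM, and the learning-rate and regularization values here are arbitrary:

```python
def train_linear_svm(data, labels, lam=0.01, epochs=200, lr=0.1):
    """Minimise (lam/2)||w||^2 + average hinge loss by stochastic subgradient steps."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (x1, x2), y in zip(data, labels):
            if y * (w[0] * x1 + w[1] * x2 + b) < 1:   # inside margin: hinge term active
                w[0] += lr * (y * x1 - lam * w[0])
                w[1] += lr * (y * x2 - lam * w[1])
                b += lr * y
            else:                                     # outside margin: shrink w only
                w[0] -= lr * lam * w[0]
                w[1] -= lr * lam * w[1]
    return w, b

def predict(w, b, x):
    """Sign of the decision function w.x + b."""
    return 1 if w[0] * x[0] + w[1] * x[1] + b >= 0 else -1
```

On a separable toy set such as {(2, 2), (3, 3)} labelled +1 and {(-2, -2), (-3, -3)} labelled -1, the learned hyperplane classifies all four points correctly.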

Non-Linearly Separable Data
When the data is not linearly separable in two dimensions and SVM is to be applied, it therefore becomes pertinent to transform the data to higher dimensions so that it can be separated. This can be done with the aid of kernel function. A kernel is a function that returns the result of a dot product performed in another space.

Types of Kernel
There are different types of kernel that can be used to achieve the goal of the SVM optimization.
1) Linear kernel: the linear kernel is known as the simplest kernel. Given two vectors $\vec{x}$ and $\vec{x}'$, the linear kernel is defined as
$$K(\vec{x}, \vec{x}') = \vec{x} \cdot \vec{x}'.$$
2) Polynomial kernel: given two vectors $\vec{x}$ and $\vec{x}'$, the polynomial kernel is defined as
$$K(\vec{x}, \vec{x}') = (\vec{x} \cdot \vec{x}' + c)^d,$$
where $c \geq 0$ is a constant term and $d \geq 2$ represents the degree of the kernel.
3) Radial basis function (RBF) or Gaussian kernel: a radial basis function is a function whose value depends only on the distance from the origin or from some point. Given two vectors $\vec{x}$ and $\vec{x}'$, the RBF or Gaussian kernel is defined as
$$K(\vec{x}, \vec{x}') = \exp\left(-\gamma \|\vec{x} - \vec{x}'\|^2\right).$$
The RBF kernel returns the result of a dot product performed in an infinite-dimensional feature space.
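The three kernels can be written down directly (a minimal sketch; the parameter defaults shown are arbitrary choices, not the values used in the study):

```python
import math

def linear_kernel(x, xp):
    """Plain dot product of two vectors."""
    return sum(a * b for a, b in zip(x, xp))

def polynomial_kernel(x, xp, c=1.0, d=2):
    """(x . x' + c)^d, with constant term c >= 0 and degree d."""
    return (linear_kernel(x, xp) + c) ** d

def rbf_kernel(x, xp, gamma=0.5):
    """exp(-gamma * ||x - x'||^2); depends only on the distance between x and x'."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, xp))
    return math.exp(-gamma * sq_dist)
```

Note that the RBF kernel of any vector with itself is 1, its maximum value, and it decays towards 0 as the two vectors move apart.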

LibSVM Algorithm
We utilized the LibSVM algorithm running in the WEKA environment to solve the SVM optimization problem. LibSVM is a library for SVMs developed by [28] at the National Taiwan University. LibSVM implements a Sequential Minimal Optimization (SMO)-type algorithm for kernelized SVMs, supporting classification and regression. It is more flexible and faster than the original SMO algorithm invented by [29] at Microsoft Research.

Model Evaluation Parameters
After classification using SVM, it is needful to evaluate the performance of the model. From Table 3, the following measures are computed for each class:
1) Precision: measures the exactness of the classifier with respect to each class. It is given as
$$\text{Precision} = \frac{TP}{TP + FP}.$$
2) Recall: measures the completeness of the classifier with respect to each class. It is given as
$$\text{Recall} = \frac{TP}{TP + FN}.$$
3) F-Measure: the harmonic mean of precision and recall. It is given as
$$F = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}},$$
where $TP$, $FP$ and $FN$ denote true positives, false positives and false negatives respectively.
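These measures follow directly from confusion-matrix counts; the counts used below are hypothetical, not the paper's:

```python
def precision(tp, fp):
    """Exactness: fraction of predicted positives that are truly positive."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Completeness: fraction of actual positives that were found."""
    return tp / (tp + fn)

def f_measure(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

# Hypothetical counts for one class: 70 true positives,
# 30 false positives, 30 false negatives
p = precision(70, 30)
r = recall(70, 30)
f = f_measure(p, r)
```

With these counts, precision, recall and F-measure all equal 0.7.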

Results and Discussions
From Table 4, we had a total of 490 instances (tweets) in the test data. The correctly classified instances, i.e. the accuracy, amounted to 352 tweets (71.8367%), while 138 tweets (28.1633%) were incorrectly classified. The Kappa statistic is a chance-corrected measure of agreement between the classifications and the true classes: it is calculated by subtracting the agreement expected by chance from the observed agreement and dividing by the maximum possible agreement. Since the Kappa statistic is greater than 0, the classifier is doing better than chance. The mean absolute error measures how close predictions are to the eventual outcomes. Table 5 shows that the precision of our model for both the positive and negative classes is above 0.7, which is very good. Also, the recall for the negative class is 0.696 and that of the positive class is 0.741, both greater than 0.5, which shows the prediction did better than chance. Similarly, the F-measure scores for both classes are greater than 0.7, which shows our model performed better than chance. Figure 4 shows the Twitter sentiment analysis results for the five Nigerian banks between 1st January 2017 and 31st December 2018.
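The Kappa computation described above can be sketched for a binary confusion matrix (the counts in the test are hypothetical, not the paper's):

```python
def kappa_statistic(tp, fn, fp, tn):
    """(observed agreement - chance agreement) / (1 - chance agreement)."""
    n = tp + fn + fp + tn
    p_observed = (tp + tn) / n
    # Agreement expected by chance, from the marginal totals of truth and prediction
    p_expected = ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / (n * n)
    return (p_observed - p_expected) / (1 - p_expected)
```

For a balanced hypothetical matrix with 80% observed agreement, chance agreement is 50%, giving a Kappa of 0.6; a Kappa of 0 would mean the classifier does no better than chance.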
From the results, it can be seen that the number of positive tweets within this period is slightly greater than the number of negative tweets for each of the five banks. Based on the number of tweets crawled, GUARANTY had the greatest number of tweets, followed by FBNH, then ACCESS, then ZENITHBANK, with UBA having the least number of crawled tweets within this period. For each day from 1st January 2017 to 31st December 2018, we obtained the number of positive and negative sentiments; this data shows intuitively the distribution of sentiments for each bank within the period considered. From the plots, it can be seen that the majority of the data falls between 0 and 100 sentiments per day, with few outliers above this range.

Conclusions
This study proposed a framework for sentiment analysis of five Nigerian banks based on their Twitter data. Based on the number of tweets crawled, GUARANTY had the greatest number of tweets, followed by FBNH, then ACCESS, then ZENITHBANK, with UBA having the least number of crawled tweets within this period. In order to obtain accurate and reliable results, and also to ensure that the classifier runs smoothly, these datasets were preprocessed in Python: Twitter handles, URLs, hashtag symbols, stock market tickers, "RT", multiple whitespaces, punctuation, numbers and special characters were all removed. Similarly, stopwords were removed and the tweets were stemmed so as to decrease the entropy of the words. Thereafter, empty fields were removed.
After preprocessing of the training, test and five banks' datasets, the LibSVM algorithm in WEKA was used to build the sentiment classification model based on the training data. The performance of this model was evaluated on a pre-labelled test dataset generated from the five banks. Our results show that the accuracy of the classifier was 71.8367%, and the Kappa statistic was greater than 0, which implies that the classifier performed better than chance. The precision for both the positive and negative classes was above 0.7, which is very good. Also, the recall for the negative class was 0.696 and that of the positive class was 0.741, both greater than 0.5, which shows the prediction did better than chance. Similarly, the F-measure scores for both classes were greater than 0.7, which shows our model performed better than chance. Since our result for the Matthews Correlation Coefficient (MCC) is 0.437, which is positive, it indicates that our prediction is good.
Since the model performed well on the test data, it was deployed to predict the sentiments of the five Nigerian banks' Twitter data. Our results show that the number of positive tweets within this period was slightly greater than the number of negative tweets for each of the five banks. Plots of the sentiment series indicated that the majority of the data falls between 0 and 100 sentiments per day, with few outliers above this range.
This research will assist the Nigerian banks in better understanding their customers and foster risk management, business forecasting, competitive analysis, and product and service improvements. Future studies can apply several machine learning techniques and compare their performance on these datasets to ascertain the best classifier. Another research direction would be classifying the tweets into polarities other than positive and negative.