Churn Prediction Using Machine Learning and Recommendation Plans for Telecoms

Keeping customers satisfied is essential for business success, especially in the telecom industry. Since the cost of acquiring a new customer is much higher than the cost of retaining an existing one, many companies experiment with techniques that can predict churn rates and help in designing effective plans for customer retention. In this paper, three machine learning algorithms have been used to predict churn, namely Naïve Bayes, SVM, and decision trees, using two benchmark datasets: the IBM Watson dataset, which consists of 7033 observations and 21 attributes, and the cell2cell dataset, which contains 71,047 observations and 57 attributes. The models' performance has been measured by the area under the curve (AUC); they scored 0.82, 0.87, and 0.77 respectively on the IBM dataset, and 0.98, 0.99, and 0.98 respectively on the cell2cell dataset. The proposed models also obtained better accuracy than previous studies using the same datasets.


Introduction
Customer retention is the most important asset for any business, as it is stated that "the cost of acquiring a new customer can be higher than that of retaining a customer by as much as 700%; increasing customer retention rates by a mere 5% could increase profits by 25% to 95%" [1]. So one of the best solutions for retaining customers is to reduce the churn rate, where "churn" means a customer moving from one service provider to another, or stopping the use of specific services over a specific period, for reasons that can often be detected in advance if the customer's historical records are analyzed.

Related Work
Many studies of the churn problem are available, from different viewpoints, with different datasets and algorithms, and for different industries; churn analysis is widely used to analyze customer behavior and predict which customers are about to leave their service agreement with a company. Studies revealed that gaining new customers is 5 to 10 times costlier than keeping existing customers happy and loyal in today's competitive conditions, and that an average company loses 10 to 30 percent of its customers annually [6] [7]. Most of the literature focused on data mining algorithms, but only a few studies focused on distinguishing the important input variables for churn prediction and on enhancing the data samples through efficient pre-processing before applying data mining algorithms [8] [9].

Amin, A., et al. [10] presented a novel churn prediction approach based on the classifier's certainty estimation using a distance factor. They grouped the dataset into different zones based on distance, which were then divided into two categories with high and low certainty. They used 4 datasets with different samples, which were discretized by the size of the values that exist in each attribute, then assigned labels and finally produced a list of values in a number of groups per attribute. They used Naïve Bayes as the classifier, and it obtained higher accuracy in the zone with the greater distance factor value (i.e., customer churn and non-churn with high certainty) than in the zone with the smaller distance factor value (i.e., customer churn and non-churn with low certainty). Accuracy in the last (tenth) iteration was reported for the 4 datasets used. Andrews, R., et al. [3] used a dataset of 10,000 client records, each with 21 attributes, of which 2900 are churners, from customers of a telecom company in Belgium.
They applied deep learning models and used 10-fold cross-validation to check the prediction accuracy; the area under the curve score was 0.89. Ahmad, A. K., Jafar, A. and Aljoumaa, K. [2] developed machine learning techniques on a big data platform for analyzing data from the SyriaTel telecom operator, containing all customers' information over 9 months. They experimented with four algorithms: Decision Tree, Random Forest, Gradient Boosted Machine Tree "GBM" and Extreme Gradient Boosting "XGBOOST". The AUC values for the four models were 83%, 87.76%, 90.89% and 93.3%. The best result, 93.3%, was obtained by XGBOOST using social network analysis (SNA) features, which enhanced the performance of the model from 84% to 93.3%. The model was prepared and tested in a Spark environment. Saraswat, S. and Tiwari, A. [11] described a framework for a churn prediction model that uses the Naïve Bayes algorithm for the classification task and then applies the Elephant Herding Optimization algorithm for the optimization task, using a dataset obtained from https://www.kaggle.com that contains 21 attributes and 3333 instances. The data contains 483 churned customers, of which 244 were predicted correctly as churners using the naïve equation and 199 after applying the Elephant Herding Optimization algorithm; the model accuracy is 87%. Different algorithms were used by Ahmed, A.A. and D. Maheswari [12], namely the Firefly algorithm and the Hybrid Firefly algorithm, on the Orange dataset, which contains 50,000 samples and 230 attributes. The dataset was segregated with 90% of the data for training and 10% for testing. The search space was populated with 20 fireflies and classification was carried out with a maximum of 1000 generations. The accuracies obtained are 86.36% and 86.38%. Some researchers compared different models, such as Kumar, N. and C. Naik [13], who used three models, logistic regression, random forest, and balanced random forest, on a dataset containing 25,000 samples and 110 attributes; they used PCA for feature selection and partitioned the data 70%/30% for training and testing. The results showed that the logistic regression model had the highest area under the curve, the AUCs of the three models being 0.861, 0.83, and 0.83.

The Research Strategy
The method used in this paper is summarized in Figure 1 and explained in detail in the following paragraphs.

Datasets Visualization
There are two datasets used in this study. The first dataset consists of 7034 samples and 20 attributes, while the second dataset contains 71,047 samples and 57 attributes. Dataset details are shown in Table 1. Both datasets have been visualized using Orange.
Figure 2 and Figure 3 illustrate the churn class histogram for each dataset. The value 0 refers to non-churned customers (shown in blue) and the value 1 refers to churned customers (shown in orange).
Table 2 shows samples from the IBM dataset with the features that have been used in the prediction models, and Table 3 shows samples with the features of the cell2cell dataset.

IBM Dataset Visualization and Preprocessing
The dataset describes customers who left within the last month; the target column is called Churn. The dataset contains the following attributes [14]:
- Services that each customer has signed up for: internet, online security, online backup, device protection, tech support, and streaming TV and movies;
- Customer account information: how long they've been a customer, contract, payment method, paperless billing, monthly charges, and total charges;
- Demographic info about customers: gender, age range, and whether they have partners and dependents.
The analysis showed that the churned customers share the following characteristics:
- Their payment method was electronic check;
- They don't use the "device protection" or "online backup" services; rather, they use the phone service;
- Their tenure was less than 14 months.
Therefore, the predictor attributes have been selected according to this analysis. Figure 10 and Figure 11 show the correlation between Total Charges, Monthly Charges, and Tenure.

Cell2cell Dataset Visualization and Preprocessing
Cell2cell is the 6th largest wireless company in the US. The cell2cell dataset consists of 71,047 samples, each labeled according to whether the customer had left the company two months after the observation, and 57 attributes [17]. The analysis showed that the churned customers share the following characteristics:
- The number of handset models issued is less than 2;
- Their PRIZM code refers to town;
- Their handsets have web capability.
The data excluded according to the churn class is illustrated in Figure 18 and Figure 19.

Naïve Bayes Algorithm
The Naive Bayes algorithm is a classification algorithm based on Bayes' rule and a set of conditional independence assumptions [18]. To predict the class label of a tuple X, P(X | C_i) P(C_i) is evaluated for each class C_i. The classifier predicts that the class label of X is C_i if and only if

P(X | C_i) P(C_i) > P(X | C_j) P(C_j) for 1 ≤ j ≤ m, j ≠ i.

In other words, the predicted class label is the class C_i for which P(X | C_i) P(C_i) is the maximum [19]. The model estimates posterior probabilities according to Bayes' rule; that is, for all k = 1, ..., K,

P(Y = k | X_1, ..., X_P) = π(Y = k) ∏_{j=1}^{P} P(X_j | Y = k) / Σ_{k=1}^{K} π(Y = k) ∏_{j=1}^{P} P(X_j | Y = k),

where:
Y is the random variable corresponding to the churn class index of an observation;
X_1, ..., X_P are the predictors of an observation;
π(Y = k) is the prior probability that a class index is k.
The model uses the mean and the standard deviation to model the distribution of the predictors within each class. Training step: using the training data, the method estimates the parameters of a probability distribution, assuming predictors are conditionally independent given the class. Prediction step: for any unseen test data, the method computes the posterior probability of that sample belonging to each class; it then classifies the test data according to the largest posterior probability.
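As a minimal sketch of these two steps (written in Python rather than the paper's Matlab implementation; function names and numerical details are illustrative), a Gaussian Naïve Bayes classifier estimates a prior, a mean, and a standard deviation per class, then picks the class maximizing prior times likelihood:

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Training step: estimate per-class prior, mean, and std for each predictor."""
    groups = defaultdict(list)
    for row, label in zip(X, y):
        groups[label].append(row)
    model, n = {}, len(y)
    for label, rows in groups.items():
        prior = len(rows) / n
        cols = list(zip(*rows))
        means = [sum(c) / len(c) for c in cols]
        # Population std, floored to avoid division by zero on constant columns
        stds = [max((sum((v - m) ** 2 for v in c) / len(c)) ** 0.5, 1e-9)
                for c, m in zip(cols, means)]
        model[label] = (prior, means, stds)
    return model

def gaussian_pdf(x, mean, std):
    return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (std * math.sqrt(2 * math.pi))

def predict_nb(model, row):
    """Prediction step: class with the largest prior * product of likelihoods."""
    best, best_score = None, -1.0
    for label, (prior, means, stds) in model.items():
        score = prior
        for x, m, s in zip(row, means, stds):
            score *= gaussian_pdf(x, m, s)
        if score > best_score:
            best, best_score = label, score
    return best
```

A production implementation would work with log-probabilities to avoid numerical underflow when the number of predictors is large; the product form above mirrors the formula in the text.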

Support Vector Machine Algorithm
The SVM algorithm works for the classification of both linear and nonlinear data. It transforms the original data into a higher dimension, from which it can find a hyperplane for data separation using essential training tuples called support vectors [19]. The SVM binary classification algorithm searches for an optimal hyperplane that separates the data into two classes. For separable classes, the optimal hyperplane maximizes a margin (a space that does not contain any observations) surrounding itself, which creates boundaries for the positive and negative classes.
The training data is a set of points (vectors) x_j along with their categories y_j. For some dimension d, x_j ∈ R^d, and y_j = ±1. The equation of a hyperplane is [20]

f(x) = x'β + b = 0,

where β ∈ R^d and b is a real number.
As the data used does not allow for a perfectly separating hyperplane, the SVM uses a soft margin, meaning a hyperplane that separates many, but not all, data points.
The L2-norm problem is:

min_{β,b,ξ} (1/2) β'β + C Σ_j ξ_j²  subject to  y_j (x_j'β + b) ≥ 1 − ξ_j for all j.

In these formulations, increasing C places more weight on the slack variables ξ_j, meaning the optimization attempts a stricter separation between classes; equivalently, reducing C towards 0 makes misclassification less important. For standardized predictors, the weighted mean of predictor j is

μ_j* = (Σ_k w_k* x_jk) / (Σ_k w_k*)  (8)

where x_jk is observation k (row) of predictor j (column).
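The soft-margin idea can be sketched as follows (a Python illustration, not the paper's Matlab solver): the code minimizes the related L1-hinge objective (1/2)||β||² + C Σ_j max(0, 1 − y_j(x_j·β + b)) by stochastic subgradient descent. Function names, learning rate, and epoch count are illustrative assumptions:

```python
import random

def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=200, seed=0):
    """Soft-margin linear SVM via subgradient descent on the hinge loss.
    X: list of feature vectors, y: labels in {-1, +1}."""
    rng = random.Random(seed)
    d = len(X[0])
    beta, b = [0.0] * d, 0.0
    idx = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(idx)
        for j in idx:
            margin = y[j] * (sum(beta[k] * X[j][k] for k in range(d)) + b)
            # Subgradient of the (1/2)||beta||^2 term shrinks beta every step
            for k in range(d):
                beta[k] -= lr * beta[k]
            if margin < 1:  # point inside the margin: hinge loss is active
                for k in range(d):
                    beta[k] += lr * C * y[j] * X[j][k]
                b += lr * C * y[j]
    return beta, b

def predict_svm(beta, b, x):
    """Side of the hyperplane f(x) = x'beta + b determines the class."""
    return 1 if sum(bk * xk for bk, xk in zip(beta, x)) + b >= 0 else -1
```

Larger C penalizes margin violations more heavily (stricter separation), matching the role of C in the formulation above.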

Decision Tree Algorithm
Decision tree induction is the learning of decision trees from class-labeled training tuples. A decision tree is a flowchart-like tree structure, where each internal node (nonleaf node) denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (or terminal node) holds a class label.
The topmost node in a tree is the root node [19]. The classification tree splits nodes based on either impurity or node error. Impurity means one of several things, depending on the SplitCriterion name-value pair argument:
- Gini's Diversity Index ("gdi"): the Gini index of a node is 1 − Σ_i p(i)², where the sum is over the classes i at the node, and p(i) is the observed fraction of observations with class i that reach the node. A node with just one class (a pure node) has Gini index 0; otherwise the Gini index is positive, so the Gini index is a measure of node impurity.
- Deviance ("deviance"): with p(i) defined the same as for the Gini index, the deviance of a node is −Σ_i p(i) log₂ p(i).
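Both impurity measures depend only on the class fractions p(i) at a node; a small Python sketch (helper names are illustrative):

```python
import math

def class_fractions(labels):
    """Observed fraction p(i) of each class among the labels reaching a node."""
    n = len(labels)
    return [labels.count(c) / n for c in set(labels)]

def gini_index(labels):
    """Gini diversity index: 1 - sum_i p(i)^2; 0 for a pure node."""
    return 1.0 - sum(p * p for p in class_fractions(labels))

def deviance(labels):
    """Deviance (cross-entropy): -sum_i p(i) * log2 p(i); 0 for a pure node."""
    return -sum(p * math.log2(p) for p in class_fractions(labels) if p > 0)
```

For example, a node holding an even 50/50 class mix has Gini index 0.5 and deviance 1.0, while a pure node scores 0 under both criteria.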

Models Evaluation's Methods
The models have been evaluated using the holdout method and k-fold cross-validation. In the holdout method, the given data are randomly partitioned into two independent sets, a training set and a test set [19]. In this partition type, a scalar parameter p randomly selects approximately p·n observations for the test set. The p value used here is 0.3, which divides each dataset into 70% for training and 30% for testing. In k-fold cross-validation, the initial data are randomly partitioned into k mutually exclusive subsets or "folds", D_1, D_2, ..., D_k, each of approximately equal size, and training and testing are performed k times [19]. The datasets here have been divided into 10 folds.
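The two partition schemes can be sketched in Python (illustrative helpers, not the paper's Matlab code); hold-out with p = 0.3 and 10 mutually exclusive folds:

```python
import random

def holdout_split(n, p=0.3, seed=0):
    """Randomly select approximately p*n indices for the test set."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = round(p * n)
    return idx[cut:], idx[:cut]  # (train indices, test indices)

def kfold_splits(n, k=10, seed=0):
    """Partition indices into k mutually exclusive folds of near-equal size;
    each fold serves once as the test set while the rest train."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        splits.append((train, test))
    return splits
```

With n = 7034 and p = 0.3, the hold-out split reserves roughly 2110 observations for testing; with k = 10, each fold holds roughly a tenth of the data.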

Experiments and Results
The three models were trained on the IBM and cell2cell datasets, which were divided into training and test sets using cross-validation with the partition types "hold-out" (30%) and "k-fold" (k = 10). The training and testing errors are shown in Table 4, which reports the best result obtained from training and testing.
Each model was trained four to five times per dataset, and the additional runs did not give better accuracy.
The ROC curves for the IBM dataset are shown in Figures 20-22 for each model output, whereas Figures 23-25 show the ROC curves for the cell2cell dataset, both according to Table 4. The ROC curve for each of the three models shows the trade-off between the true positive rate (TPR) and the false positive rate (FPR).
Given a test set and a model, TPR is the proportion of positive (churn) tuples that are correctly labeled by the model; FPR is the proportion of negative (no-churn) tuples that are mislabeled as positive [19].
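These quantities can be computed by sweeping a decision threshold over the model scores; the following Python sketch (illustrative helper names; tied scores are not handled specially) builds the ROC points and integrates the AUC with the trapezoidal rule:

```python
def roc_points(scores, labels):
    """(FPR, TPR) pairs as the threshold passes each scored observation.
    labels are 1 (churn) / 0 (no churn); higher score = more likely churn."""
    pos = sum(labels)
    neg = len(labels) - pos
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp = fp = 0
    points = [(0.0, 0.0)]
    for i in order:
        if labels[i] == 1:
            tp += 1  # a positive correctly labeled at this threshold
        else:
            fp += 1  # a negative mislabeled as positive
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area
```

A model that ranks every churner above every non-churner reaches an AUC of 1.0, while random scoring hovers around 0.5; the AUCs reported in Table 4 and Table 5 sit between these extremes.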
In the following experiment the models were evaluated with a k-fold value of 10, as shown in Table 5 for both datasets respectively. There are small variances between the error rates within the k-fold cross-validation experiment. The best result was obtained from the SVM model in fold number 8 on the IBM dataset and in fold number 1 on the cell2cell dataset. In order to check the models, they have been compared with previous papers which used similar datasets. The results confirm that the proposed models are more accurate, as shown in Table 4.
ApurvaSree, G., et al. [4] and Induja, S. and D. V. P. Eswaramurthy [21] used the IBM Watson dataset with different algorithms, including SVM in the first paper and Naïve Bayes in the second; both results are similar to the results obtained in this paper. However, our proposed method obtained higher accuracy using the SVM model on the IBM dataset with k-fold partitioning (k = 10), producing an area under the curve of 0.86548. As for the cell2cell dataset, the papers [21] [22] [23] [24] also used different algorithms including SVM, where the best AUC among previous studies was 94.13%, whereas the AUC of the proposed model using SVM is 0.99, as shown in Table 6.
The AUC values have been plotted for the three models as shown in Figure 26 and Figure 27, which show that the best results were obtained using the SVM algorithm for both datasets.

Conclusion
This paper analyzed two datasets: the IBM Watson dataset, which consists of 7033 observations and 21 attributes, and the cell2cell dataset, which consists of 71,047 observations and 57 attributes; both have been visualized using the Orange software. Three predictive models, Naïve Bayes, SVM, and decision tree, have been implemented in Matlab. The paper aims to find the most accurate model for churn prediction in telecom and to select the most important reasons that make customers churn. The models' performance has been measured by the area under the curve, where the best AUCs are 0.82, 0.87, and 0.78 for the IBM dataset, and 0.98, 0.99, and 0.98 for the cell2cell dataset. The AUC obtained using the SVM algorithm is better than those reported in previous papers. It was noticed that the churned customers share some similar services, which means that any telecom company can detect the predictors and retain its customers. The paper concluded that telecom operators can obtain the best predictive models if they analyze their whole records and track customer behavior, so that they can build different marketing approaches to retain potential churners based on the predictors detected when analyzing historical customer records. All churn prediction models in this paper can be used in other customer response models as well, such as cross-selling, up-selling, or customer acquisition.