
Keeping customers satisfied is essential for a successful business, especially in the telecom industry. Many companies apply different techniques to predict churn rates and to design effective customer-retention plans, since the cost of acquiring a new customer is much higher than the cost of retaining an existing one. In this paper, three machine learning algorithms have been used to predict churn, namely Naïve Bayes, SVM, and decision trees, using two benchmark datasets: the IBM Watson dataset, which consists of 7033 observations and 21 attributes, and the cell2cell dataset, which contains 71,047 observations and 57 attributes. The models’ performance has been measured by the area under the curve (AUC); they scored 0.82, 0.87, and 0.77 respectively for the IBM dataset, and 0.98, 0.99, and 0.98 respectively for the cell2cell dataset. The proposed models also obtained better accuracy than previous studies using the same datasets.

Customer retention is the most important asset for any business, as it is stated that “the cost of acquiring a new customer can be higher than that of retaining a customer by as much as 700%; increasing customer retention rates by a mere 5% could increase profits by 25% to 95%” [

Many studies address the churn problem from different viewpoints, with different datasets and algorithms, and for different industries; churn analysis is used worldwide to analyze customer behavior and to predict which customers are about to terminate their service agreement with a company. Studies revealed that, under today’s competitive conditions, gaining new customers is 5 to 10 times costlier than keeping existing customers happy and loyal, and that an average company loses 10 to 30 percent of its customers annually [

The method used in this paper has been summarized in

There are two datasets used in this study. The first dataset consists of 7034 samples and 20 attributes, while the second contains 71,047 samples and 57 attributes. Dataset details are shown in

In

The samples from the IBM dataset are shown in

And

| Dataset | Dataset 1 | Dataset 2 |
|---|---|---|
| Samples | 7034 | 71,047 |
| Features | 20 | 57 |
| Classes | 2 | 2 |
| Missing values % | 0.0% | 0.7% |
| Negative samples | 1869 (73.46%) | 20,609 (29.01%) |
| Data sources | IBM Watson [ | Cell2cell [ |

| Churn | Dependents | Tenure | Phone Service | Multiple Lines | Internet Service | Online Backup | Device Protection | Contract | Paperless Billing | Payment Method | Monthly Charges | Total Charges |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 2 | 1 | 1 | 0 | 1 | 1 | 1 | 29.9 | 29.85 |
| 0 | 0 | 34 | 1 | 0 | 1 | 0 | 1 | 2 | 0 | 2 | 57 | 1890 |
| 1 | 0 | 2 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 2 | 53.9 | 108.2 |
| 0 | 0 | 45 | 0 | 2 | 1 | 0 | 1 | 2 | 0 | 3 | 42.3 | 1841 |
| 1 | 0 | 2 | 1 | 0 | 2 | 0 | 0 | 1 | 1 | 1 | 70.7 | 151.7 |
| 1 | 0 | 8 | 1 | 1 | 2 | 0 | 1 | 1 | 1 | 1 | 99.7 | 820.5 |
| 0 | 1 | 22 | 1 | 1 | 2 | 1 | 0 | 1 | 1 | 4 | 89.1 | 1949 |

| churn | revenue | mou | recchrge | changem | custcare | mourec | months | phones | models | eqpdays | creditaa | prizmtwn | webcap |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | −6.2 | 0 | −6 | 0 | 0 | 0 | 7 | 1 | 1 | 203 | 0 | 0 | 1 |
| 0 | −5.9 | 0 | −5 | 0 | 0 | 0 | 15 | 1 | 1 | 452 | 0 | 0 | 1 |
| 0 | −2.5 | 211 | 0.5 | NA | 1 | 1.69 | 18 | 2 | 2 | 281 | 0 | 0 | 1 |
| 1 | 0 | 2 | 0 | NA | 0 | 0 | 27 | 2 | 2 | 597 | 1 | 0 | 1 |
| 1 | 0 | 55 | 0 | NA | 5 | 7.06 | 26 | 3 | 3 | 371 | 0 | 0 | 1 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 7 | 1 | 1 | 199 | 0 | 0 | 1 |
| 0 | 0 | 76 | 30 | 0 | 1 | 11.2 | 30 | 1 | 1 | 883 | 0 | 0 | 1 |
| 0 | 0.2 | 12 | 0 | 0 | 0 | 7.24 | 31 | 3 | 3 | 263 | 0 | 0 | 1 |

The dataset covers customers who left within the last month; the target column is called Churn. The dataset contains the following attributes [

• Services that each customer has signed up for: internet, online security, online backup, device protection, tech support, and streaming TV and movies;

• Customer account information: how long they’ve been a customer, contract, payment method, paperless billing, monthly charges, and total charges;

• Demographic info about customers: gender, age range, and whether they have partners and dependents.

Figures 4-12 show the attributes and their distributions according to the churn class, where orange indicates churned customers and blue indicates non-churned customers. It can be noticed that:

• Most churned customers have the Fiber optic internet service type;

• They use paperless billing;

• Most of them were dependents;

• Their payment method was electronic check;

• They don’t use the “device protection” or “online backup” services, but they do use phone service;

• Their tenure was less than 14 months.

Therefore, the predictor attributes have been selected according to this analysis.
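To illustrate the kind of per-attribute analysis behind this selection, the sketch below computes churn rates grouped by an attribute on a hypothetical miniature of the data. The column names `Churn`, `InternetService`, and `Tenure` are illustrative stand-ins, not the exact dataset headers; the paper itself performed this analysis visually with the Orange tool.

```python
import pandas as pd

# Hypothetical miniature of the IBM Watson churn data.
df = pd.DataFrame({
    "Churn":           [1, 0, 1, 0, 1, 0],
    "InternetService": ["Fiber optic", "DSL", "Fiber optic",
                        "DSL", "Fiber optic", "No"],
    "Tenure":          [2, 34, 5, 45, 8, 22],
})

# Churn rate per internet-service type: the per-attribute breakdown
# used to decide which attributes are informative predictors.
rate = df.groupby("InternetService")["Churn"].mean()
print(rate)

# Median tenure of churners vs. non-churners.
print(df.groupby("Churn")["Tenure"].median())
```

On this toy sample, all Fiber optic customers churn and churners have a much shorter median tenure, mirroring the patterns listed above.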

Cell2cell is the sixth-largest wireless company in the US. The Cell2cell dataset consists of 71,047 samples, each labeled to signify whether the customer had left the company two months after observation, and 57 attributes [

• Churned customers have an average (mean) monthly minutes of use of less than 530 minutes;

• They have had service for only 11 - 15 months;

• Their equipment age was between 300 and 361 days;

• The number of models issued was less than 2;

• Their prizm code refers to town;

• Their handsets have web capability.

The excluded data according to the churn class are illustrated in

The Naive Bayes algorithm is a classification algorithm based on Bayes’ rule and a set of conditional independence assumptions [. Suppose there are m classes C_1, C_2, ..., C_m. The classifier predicts that the class label of tuple X is the class C_i if and only if

$$P(X \mid C_i)\,P(C_i) > P(X \mid C_j)\,P(C_j) \quad \text{for } 1 \le j \le m,\ j \ne i \tag{1}$$

In other words, the predicted class label is the class C_{i} for which P ( X | C i ) P ( C i ) is the maximum [

$$\hat{P}(Y = k \mid X_1, \dots, X_p) = \frac{\pi(Y = k) \prod_{j=1}^{p} P(X_j \mid Y = k)}{\sum_{k'=1}^{K} \pi(Y = k') \prod_{j=1}^{p} P(X_j \mid Y = k')} \tag{2}$$

where:

Y is the random variable corresponding to the churn class index of an observation.

X 1 , ⋯ , X p are the predictors of an observation.

π ( Y = k ) is the prior probability that a class index is k.

The model uses the mean and standard deviation to model the distribution of the predictors within each class.

Naive Bayes classification proceeds in two steps. Training step: using the training data, the method estimates the parameters of a probability distribution, assuming the predictors are conditionally independent given the class. Prediction step: for any unseen test sample, the method computes the posterior probability of that sample belonging to each class, and then classifies it according to the largest posterior probability.
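As a minimal sketch of these two steps (not the paper’s Matlab implementation), the snippet below fits a Gaussian naive Bayes classifier by hand on toy data: the training step estimates per-class priors, means, and standard deviations; the prediction step evaluates the numerator of Equation (2), whose shared denominator cancels in the argmax.

```python
import numpy as np

# Toy data: two numeric predictors (tenure, monthly charge) and a
# binary churn label. Values are illustrative only.
X = np.array([[2, 70.0], [34, 56.0], [3, 75.0],
              [45, 42.0], [5, 90.0], [22, 30.0]])
y = np.array([1, 0, 1, 0, 1, 0])

# Training step: priors pi(Y = k) and per-class Gaussian parameters.
classes = np.unique(y)
priors = {k: np.mean(y == k) for k in classes}
stats = {k: (X[y == k].mean(axis=0), X[y == k].std(axis=0))
         for k in classes}

def gauss(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

def predict(x):
    # Posterior numerator pi(Y=k) * prod_j P(x_j | Y=k); the shared
    # denominator of Equation (2) cancels in the argmax.
    scores = {k: priors[k] * np.prod(gauss(x, *stats[k])) for k in classes}
    return max(scores, key=scores.get)

print(predict(np.array([4, 80.0])))   # short tenure, high charge
```

A query point resembling the short-tenure, high-charge training samples is assigned to the churn class.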

The SVM algorithm classifies both linear and nonlinear data. It transforms the original data into a higher dimension, from where it can find a hyperplane for data separation using essential training tuples called support vectors [. The training data consist of points x_j along with their categories y_j. For some dimension d, x_j ∈ R^d and y_j = ±1. The equation of a hyperplane is [

$$f(x) = x'\beta + b = 0 \tag{3}$$

where β ∈ R d and b is a real number.

As the data used do not allow for a perfectly separating hyperplane, the SVM uses a soft margin, meaning a hyperplane that separates many, but not all, data points. There are two standard formulations of soft margins. Both involve adding slack variables ξ = (ξ_1, ξ_2, ..., ξ_N) and a penalty parameter C.

• The L^{1}-norm problem is:

$$\min_{\beta, b, \xi} \left( \frac{1}{2}\beta'\beta + C \sum_j \xi_j \right) \tag{4}$$

such that

$$y_j f(x_j) \ge 1 - \xi_j, \quad \xi_j \ge 0 \tag{5}$$

• The L^{2}-norm problem is:

$$\min_{\beta, b, \xi} \left( \frac{1}{2}\beta'\beta + C \sum_j \xi_j^2 \right) \tag{6}$$

In these formulations, increasing C places more weight on the slack variables ξ_j, meaning the optimization attempts a stricter separation between classes. Equivalently, reducing C towards 0 makes misclassification less important.
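A short illustration of the role of C, using scikit-learn’s `SVC` on synthetic overlapping data rather than the paper’s model: with a small C the margin is soft and many points become (bounded) support vectors, while a large C enforces a stricter separation.

```python
import numpy as np
from sklearn.svm import SVC

# Two overlapping Gaussian clouds: not linearly separable, so a
# soft margin is required.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1.2, (50, 2)), rng.normal(+1, 1.2, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

for C in (0.01, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Larger C penalizes slack harder, so fewer points sit inside
    # the margin and the support-vector count drops.
    print(C, clf.n_support_.sum())
```

The count of support vectors shrinks as C grows, reflecting the stricter separation described above.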

The proposed SVM model standardizes the predictors using their corresponding weighted means and weighted standard deviations; that is, it standardizes predictor j (x_j) using

$$x_j^* = \frac{x_j - \mu_j^*}{\sigma_j^*} \tag{7}$$

$$\mu_j^* = \frac{1}{\sum_k w_k^*} \sum_k w_k^* x_{jk} \tag{8}$$

where $x_{jk}$ is observation k (row) of predictor j (column), and

$$\left(\sigma_j^*\right)^2 = \frac{v_1}{v_1^2 - v_2} \sum_k w_k^* \left(x_{jk} - \mu_j^*\right)^2 \tag{9}$$

$$v_1 = \sum_j w_j^* \tag{10}$$

$$v_2 = \sum_j \left(w_j^*\right)^2 \tag{11}$$
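A sanity check of this standardization, sketched in NumPy with the weights taken over observations: with uniform normalized weights w_k = 1/n, Equation (9) reduces to the ordinary unbiased sample variance, so Equation (7) reduces to the familiar z-score.

```python
import numpy as np

# One predictor column and uniform normalized weights.
x = np.array([2.0, 34.0, 5.0, 45.0, 8.0, 22.0])
w = np.full_like(x, 1.0 / x.size)

mu = np.sum(w * x) / np.sum(w)                           # Equation (8)
v1, v2 = np.sum(w), np.sum(w ** 2)                       # Equations (10), (11)
var = v1 / (v1 ** 2 - v2) * np.sum(w * (x - mu) ** 2)    # Equation (9)
z = (x - mu) / np.sqrt(var)                              # Equation (7)

# For uniform weights this matches the unbiased (n-1) z-score.
print(np.allclose(z, (x - mu) / x.std(ddof=1)))          # True
```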

Decision tree induction is the learning of decision trees from class-labeled training tuples. A decision tree is a flowchart-like tree structure, where each internal node (nonleaf node) denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (or terminal node) holds a class label. The topmost node in a tree is the root node [

• Gini’s Diversity Index (gdi): the Gini index of a node is

$$\mathrm{Gini}(D) = 1 - \sum_{i=1}^{m} p(i)^2 \tag{12}$$

where the sum is over the classes i at the node, and p(i) is the observed fraction of samples with class i that reach the node. A node with just one class (a pure node) has Gini index 0; otherwise, the Gini index is positive. The Gini index is therefore a measure of node impurity.

• Deviance (“deviance”): with p(i) defined as for the Gini index, the deviance of a node is

$$-\sum_i p(i) \log_2 p(i) \tag{13}$$

A pure node has deviance 0; otherwise, the deviance is positive.
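Both impurity measures are easy to compute directly from the class fractions p(i) at a node; the short sketch below verifies the pure-node and maximally-impure cases (note that the deviance is the negative of the sum of p(i) log₂ p(i), which keeps it non-negative).

```python
import numpy as np

def gini(p):
    # Equation (12): 1 - sum of squared class fractions.
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

def deviance(p):
    # Equation (13): -sum p(i) log2 p(i), with 0*log(0) taken as 0.
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

print(gini([1.0, 0.0]), deviance([1.0, 0.0]))   # pure node: both 0
print(gini([0.5, 0.5]), deviance([0.5, 0.5]))   # maximal two-class impurity
```

For a balanced two-class node, the Gini index is 0.5 and the deviance is 1 bit, the maximum for two classes.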

The models have been evaluated using the holdout method and k-fold cross-validation. In the holdout method, the given data are randomly partitioned into two independent sets, a training set and a test set [. In k-fold cross-validation, the initial data are randomly partitioned into k mutually exclusive subsets D_1, D_2, ..., D_k, each of approximately equal size. Training and testing are performed k times [
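Using scikit-learn as a stand-in for the paper’s Matlab partitioning, the two schemes look like this on toy data: a 30% hold-out split, and a 10-fold partition in which every sample lands in exactly one test fold.

```python
import numpy as np
from sklearn.model_selection import train_test_split, KFold

# Toy data: 50 samples, 2 features.
X = np.arange(100).reshape(50, 2)
y = np.arange(50) % 2

# Hold-out: reserve 30% of the data for testing, as in the paper.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
print(len(X_tr), len(X_te))            # 35 15

# 10-fold cross-validation: ten disjoint test folds of equal size.
kf = KFold(n_splits=10, shuffle=True, random_state=0)
fold_sizes = [len(test_idx) for _, test_idx in kf.split(X)]
print(fold_sizes)                       # ten folds of 5 samples each
```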

The three models were trained on the IBM and cell2cell datasets, which were divided into training and test sets using cross-validation with partition types “hold-out” (30%) and “k-fold” (k = 10). The training and testing errors are shown in

The ROC curves for the IBM dataset are shown in Figures 20-22 according to
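For reference, this is how an ROC curve and its AUC are computed from a classifier’s scores (toy labels and scores, not the paper’s results):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Toy ground-truth labels and classifier scores.
y_true = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])

# The ROC curve sweeps the decision threshold over the scores;
# the AUC is the area under the resulting (FPR, TPR) curve.
fpr, tpr, thresholds = roc_curve(y_true, scores)
auc = roc_auc_score(y_true, scores)
print(auc)   # 8 of 9 positive/negative pairs ranked correctly -> 8/9
```

Equivalently, the AUC is the probability that a randomly chosen positive sample is scored above a randomly chosen negative one.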

In the following experiment the models were evaluated with a k-fold value of 10, as shown in

| Dataset | Model | Training Error | Testing Error | AUC | ACC |
|---|---|---|---|---|---|
| IBM Watson | Naïve Bayes | 0.23829 | 0.25334 | 0.81721 | 76% |
| | SVM | 0.20564 | 0.20335 | 0.83683 | 80% |
| | Decision Tree | 0.086392 | 0.23684 | 0.76198 | 76.3% |
| Cell2cell | Naïve Bayes | 0.021036 | 0.019907 | 0.98149 | 90% |
| | SVM | 0.0082883 | 0.0089536 | 0.99212 | 98.2% |
| | Decision Tree | 0.12620 | 0.012142 | 0.9855 | 98.8% |

| Model | Fold | IBM Training Error | IBM Testing Error | IBM AUC | IBM ACC | Cell2cell Training Error | Cell2cell Testing Error | Cell2cell AUC | Cell2cell ACC (%) |
|---|---|---|---|---|---|---|---|---|---|
| Naïve Bayes | 1 | 0.2436 | 0.2270 | 0.8135 | 77.30% | 0.0309 | 0.0305 | 0.9665 | 97 |
| | 2 | 0.2435 | 0.2482 | 0.8149 | 75.20% | 0.0295 | 0.0302 | 0.9671 | 97 |
| | 3 | 0.2394 | 0.2639 | 0.8156 | 73.60% | 0.0314 | 0.0254 | 0.9667 | 97 |
| | 4 | 0.2403 | 0.2596 | 0.8181 | 74.00% | 0.0295 | 0.0270 | 0.9681 | 97.3 |
| | 5 | 0.2423 | 0.2386 | 0.8117 | 76.10% | 0.0312 | 0.0328 | 0.9663 | 96.7 |
| | 6 | 0.2390 | 0.2613 | 0.8166 | 73.90% | 0.0300 | 0.0321 | 0.9691 | 96.8 |
| | 7 | 0.2440 | 0.2400 | 0.8125 | 76.00% | 0.0318 | 0.0343 | 0.9647 | 96.6 |
| | 8 | 0.2425 | 0.2457 | 0.8123 | 75.40% | 0.0304 | 0.0316 | 0.9679 | 96.8 |
| | 9 | 0.2447 | 0.2272 | 0.8133 | 77.30% | 0.0312 | 0.0330 | 0.9652 | 96.7 |
| | 10 | 0.2448 | 0.2187 | 0.8106 | 79.10% | 0.0309 | 0.0326 | 0.9677 | 96.7 |
| SVM | 1 | 0.2048 | 0.2033 | 0.8300 | 79.90% | 0.0087 | 0.0072 | 0.9927 | 99 |
| | 2 | 0.2046 | 0.1836 | 0.8262 | 81.70% | 0.0086 | 0.0084 | 0.9908 | 98.9 |
| | 3 | 0.2048 | 0.1838 | 0.8498 | 81.70% | 0.0087 | 0.0075 | 0.9935 | 99 |
| | 4 | 0.1993 | 0.2330 | 0.8064 | 76.70% | 0.0085 | 0.0085 | 0.9920 | 98.9 |
| | 5 | 0.2037 | 0.1925 | 0.8447 | 80.80% | 0.0086 | 0.0082 | 0.9924 | 98.8 |
| | 6 | 0.2018 | 0.2150 | 0.8138 | 78.60% | 0.0083 | 0.0109 | 0.9882 | 98.7 |
| | 7 | 0.2042 | 0.1945 | 0.8239 | 80.50% | 0.0086 | 0.0079 | 0.9917 | 99 |
| | 8 | 0.2062 | 0.1831 | 0.8655 | 81.70% | 0.0085 | 0.0092 | 0.9926 | 98.8 |
| | 9 | 0.2002 | 0.2272 | 0.8089 | 77.30% | 0.0086 | 0.0081 | 0.9926 | 99 |
| | 10 | 0.2033 | 0.2118 | 0.8224 | 78.80% | 0.0085 | 0.0091 | 0.9913 | 98.8 |
| Decision Tree | 1 | 0.0858 | 0.2499 | 0.7240 | 75% | 0.0001 | 0.0120 | 0.9853 | 98.8 |
| | 2 | 0.0841 | 0.2469 | 0.7667 | 75.30% | 0.0001 | 0.0130 | 0.9859 | 98.7 |
| | 3 | 0.0896 | 0.2298 | 0.7447 | 77% | 0.0002 | 0.0123 | 0.9826 | 98.8 |
| | 4 | 0.0882 | 0.2795 | 0.7168 | 72.10% | 0.0001 | 0.0124 | 0.9846 | 98.9 |
| | 5 | 0.0880 | 0.2618 | 0.7264 | 73.90% | 0.0002 | 0.0109 | 0.9887 | 98.9 |
| | 6 | 0.0882 | 0.2684 | 0.6994 | 73.20% | 0.0001 | 0.0110 | 0.9853 | 98.8 |
| | 7 | 0.0868 | 0.2542 | 0.7125 | 74.60% | 0.0000 | 0.0136 | 0.9831 | 98.6 |
| | 8 | 0.0841 | 0.2641 | 0.7321 | 73.60% | 0.0001 | 0.0117 | 0.9854 | 98.8 |
| | 9 | 0.0844 | 0.2400 | 0.7079 | 76% | 0.0002 | 0.0127 | 0.9863 | 98.7 |
| | 10 | 0.0899 | 0.2854 | 0.7020 | 71.40% | 0.0001 | 0.0120 | 0.9847 | 98.8 |

In order to validate the models, they have been compared with previous papers that used similar datasets. The results confirm that the proposed models are more accurate, as shown in

ApurvaSree, G., et al. [

The AUC values have been plotted for the three models as shown in

| Paper | Algorithms | Result | Dataset | Proposed Model Result |
|---|---|---|---|---|
| [ | Random forest, Logistic regression, SVM | Accuracy (80.75%, 80.88%, 82%) | IBM Watson | AUC for SVM (87%) |
| [ | Naive Bayes | AUC (83%) | IBM Watson | AUC for Naive Bayes the same (83%) |
| [ | Decision trees, Logistic regression, Neural networks and SVM | Accuracy (62.98, 61.65, 61.40 and 61.78) | Cell2cell | ACC for decision trees (98%) and SVM (99%) |
| [ | GP-AdaBoost | AUC (0.91) | Cell2cell | Best AUC for SVM (99%); AdaBoost not used |
| [ | C4.5 decision tree | AUC (63.04) | Cell2cell | AUC for decision trees (98%) |
| [ | SVM | AUC (94.13) | Cell2cell | AUC for SVM (99%) |

This paper analyzed two datasets: the IBM Watson dataset, consisting of 7033 observations and 21 attributes, and the cell2cell dataset, consisting of 71,047 observations and 57 attributes; both were visualized using the Orange software. Three predictive models (Naïve Bayes, SVM, and decision tree) were implemented in Matlab. The paper aims to find the most accurate model for churn prediction in telecom and to identify the most important reasons why customers churn. The models’ performance was measured by the area under the curve (AUC), where the best AUCs are 0.82, 0.87, and 0.78 for the IBM dataset and 0.98, 0.99, and 0.98 for the cell2cell dataset. The AUC obtained using the SVM algorithm is better than those reported in previous papers. It was noticed that churned customers share some similar services, which means that any telecom company can detect the predictors and retain its customers. The paper concludes that telecom operators can obtain the best predictive models if they analyze their whole records and track customer behavior, so that they can build different marketing approaches to retain likely churners based on the predictors detected when analyzing historical customer records. All churn prediction models in this paper can be used in other customer response models as well, such as cross-selling, up-selling, or customer acquisition.

This work is supported by the International University of Africa; the authors would like to thank the International University of Africa for its support in research and development. In addition, the authors would like to thank IBM Watson and Cell2cell for making the datasets freely available for research. The authors are also immensely grateful to Prof. Saad Subair for his support in publishing in this journal.

The authors declare no conflicts of interest regarding the publication of this paper.

Ebrah, K. and Elnasir, S. (2019) Churn Prediction Using Machine Learning and Recommendations Plans for Telecoms. Journal of Computer and Communications, 7, 33-53. https://doi.org/10.4236/jcc.2019.711003