Machine Learning-Based Approach for Identification of SIM Box Bypass Fraud in a Telecom Network Based on CDR Analysis: Case of a Fixed and Mobile Operator in Cameroon
1. Introduction
Cameroon’s economy pays a high price for international telephone calls routed through the SIM Box fraud system. In 2015, the loss of revenue reached 22.2 billion FCFA: 18 billion for the four local telephone operators (CAMTEL, MTN Cameroon, Orange Cameroon and Nexttel), corresponding to roughly 100 million minutes of calls made from abroad, and 4.2 billion in uncollected taxes for the state. In 2014, the overall losses were 9.3 billion FCFA, of which operators bore 7.5 billion and the state 1.8 billion [1] . SIM Box fraud consists of terminating an international call as a local call via the internet: the receiver sees a local number displayed while the call actually comes from abroad. Common especially in Africa and Asia, this fraud causes financial losses of between 2.3 and 7 billion dollars worldwide [2] . Fraud is a major problem for mobile network operators worldwide, costing them more than 38 billion U.S. dollars per year [3] . In many countries, the rate for routing international calls (ITR) is considerably higher than the rate for routing local calls. Fraudsters make considerable profit by bypassing the routing of the licensed international operator to terminate calls in the country: they pay the local rate, which is lower than the ITR. This practice is illegal in most countries and is an important issue for many operators because of the associated loss of revenue.
In the context of this research, we worked on the case of a fixed and mobile operator in Cameroon. The operator has implemented several solutions to reduce SIM Box fraud, but so far these measures have not been effective and the operator continues to suffer financial losses. In particular, the existing solution does not allow real-time analysis of CDRs for the detection of SIM Box fraud. Therefore, we propose a machine learning based approach for real-time CDR analysis and efficient SIM Box fraud detection. The proposed method can be used in any telecommunications network; we apply it to this operator’s network as a case study because the real data was collected there.
In SIM Box fraud, local SIM cards are used to reroute and bypass international calls from mobile network operators: the calls are carried over the Internet and re-injected into the operator’s cellular network as local calls by means of a VoIP gateway device called a SIM Box [4] . Figure 1 and Figure 2 respectively present the case of a normal international call and the case of a fraud using a SIM Box.
A number of studies have addressed the problem of SIM Box detection using different tools, techniques and machine learning methods.
Figure 1. Legitimate route of an international call, adapted from [4] [5] .
Figure 2. SIM Box fraud route of an international call, adapted from [4] [6] .
D. I. Ighneiwa and H. S. Mohamed in [7] used unsupervised learning algorithms to cluster SIMs and gain insights into how the designed algorithm could be improved; different models were trained to detect SIMs used in SIM Boxes.
A. Krenker, M. Volk, U. Sedlar, J. Bešter, and A. Kos in [8] showed that a bidirectional artificial neural network (bi-ANN) can predict generic cell phone fraud in real time with high accuracy. The bi-ANN is used to predict the time series of subscriber call durations in order to identify any unusual behaviour. The results show that the bi-ANN is able to predict these time series with a rate of 90% in an optimal network configuration.
A. H. Elmi, R. Sallehuddin, S. Ibrahim, and A. M. Zain in [9] analyzed a set of 234,324 calls made by 6415 subscribers of a single cell ID over a two-month period. The dataset included 2126 fraudulent subscribers and 4289 normal subscribers, that is, roughly two-thirds legitimate subscribers and one-third fraudulent SIM Boxes. The researchers extracted 9 features, such as the total number of calls, the total number of minutes and the average number of minutes, and used them to train an artificial neural network (ANN) classifier. They found that the best architecture had two hidden layers of five hidden neurons each, with a learning rate of 0.6. Accuracy reached 98.7%, with only 20 cases wrongly classified as false positives.
DEUSSOM Eric et al. in [10] detect fraud by analyzing CDRs and internet traffic. The differential privacy model was used to protect users’ personal information, and the k-means and DBSCAN algorithms were used to group users into different clusters. Using a plane representation, they were able to visualize the users suspected of fraud, namely those located very far away from the different cluster centres.
S. Subudhi and S. Panigrahi in [11] presented a new approach to detect fraudulent activities in mobile telecommunications networks using possibilistic fuzzy c-means clustering. First, the optimal values of the clustering parameters were estimated experimentally. The subscriber behaviour profile is then modelled by applying the clustering algorithm to two relevant call features selected from the subscriber’s historical call records. Symptoms of intrusive activity are detected by comparing the most recent call activity with this normal profile. Through the works presented above, we can see that machine learning can be used in many use cases, such as fraud detection, network maintenance [12] and so on. The rest of this paper is organized as follows: Section 2 presents the materials and methods, Section 3 presents the results and discussion, and Section 4 concludes.
2. Materials and Methods
We used machine learning to analyze CDRs and develop a collaborative model capable of identifying SIM Box fraud using three machine learning algorithms: Random Forest, SVM and XGBoost. Since the CDR data is labelled (each record belongs either to a fraudster or to a non-fraudster), classification is the natural way to distinguish fraudulent from non-fraudulent numbers. These three algorithms have many advantages: they are simple, fast and easy to understand, and above all they give results with good accuracy.
In this work, we implemented these three learning algorithms, which have been shown to work well with unbalanced datasets.
2.1. The Random Forest Algorithm
The Random Forest algorithm is a classification algorithm that reduces the variance of the predictions of a single decision tree, and thus improves performance, by combining multiple decision trees in a bagging approach. In its most classical form, it performs parallel learning on multiple randomly constructed decision trees trained on different subsets of the data. The random forest is known to be one of the most efficient “out-of-the-box” classifiers (i.e., requiring little data pre-processing) [13] . The four steps of the algorithm are not repeated here. Figure 3 presents an illustration of the random forest.
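To make the bagging idea concrete, the short sketch below (our own illustrative code on synthetic data, not the study’s experiment) builds an ensemble of decision trees on bootstrap samples with scikit-learn and compares it with a random forest; all parameter values are arbitrary.
# Illustrative sketch on synthetic data (not the study's experiment): bagging decision trees,
# which is the core idea behind a random forest.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Bagging: each tree is trained on a bootstrap sample and the trees vote on the class.
bagged_trees = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=0)
bagged_trees.fit(X_tr, y_tr)

# A random forest adds random feature selection at each split on top of bagging.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_tr, y_tr)

print("bagging accuracy:", bagged_trees.score(X_te, y_te))
print("random forest accuracy:", forest.score(X_te, y_te))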
2.2. The SVM Algorithm
Support vector machines (SVM) are a set of supervised learning techniques designed to solve classification and regression problems. They were developed in the 1990s from the theoretical work of Vladimir Vapnik on statistical learning theory: the Vapnik-Chervonenkis theory. They were quickly adopted for their ability to work with high-dimensional data, their small number of hyperparameters, their theoretical guarantees and their good results in practice [15] .
2.3. The XGBoost Algorithm
XGBoost originally started as a research project by Tianqi Chen within the Distributed (Deep) Machine Learning Community (DMLC) group. It is a popular and efficient open-source implementation of the gradient boosted trees algorithm. Gradient boosting is a supervised learning method that attempts to accurately predict a target variable by combining the estimates of a set of simpler, weaker models [16] .
For the present work, Python 3.8 was used as the programming language for running the machine learning algorithms. Anaconda is the Python distribution used; it ships with the tools and libraries needed for machine learning, such as NumPy, Matplotlib, scikit-learn, Jupyter and Spyder.
2.4. Data Collection and Preparation
· Data collection
Recall that the purpose of this study is to contribute to the creation of an effective fraud detection model for a telecommunication network in order to reduce or eliminate losses caused by fraud. We therefore need a model that can identify each fraudster and stop his activities. We started by collecting and processing data and obtained CDR files with 60,000 call records, which we sorted before selecting the fields needed to build our model. We were granted special permission to use this data while preserving the confidentiality of the users’ information.
· Data preparation
It is our responsibility to understand, analyze and determine what data can be used to build our model.
· Description of the data
The CDR data we collected from the MSOFTX3000 is in .csv format.
The CDRs from the MSOFTX3000 are dated April 2021. Table 1 lists the different fields present:
Table 1. Overview of MSOFTX3000 CDR fields.
· Exploring the Data
It is important to visualize the data as it was collected and to show how the different fields relate to each other. The choice of the datasets to be manipulated is crucial. An extract of some of the data fields is shown below.
· Investigations of fraudulent numbers
The fraudulent SIM Box accounts were investigated by the operator’s fraud department and cancelled due to their malicious activity. As a result, we obtained labelled data for the month of April 2021; this is presented in Figure 4, and Figure 5 presents a sample of data collected from the HUAWEI MSOFTX3000, the core network switching equipment.
Figure 4. HUAWEI MSOFTX3000 CDR Observations.
Figure 5. Sample of data from HUAWEI MSOFTX3000 CDR.
Note: In Figure 4 and Figure 5 subscribers’ phone numbers were blurred intentionally to protect their privacy.
2.5. Evaluation Method
Confusion matrix
The confusion matrix is the method commonly used to describe and characterize the performance of a classification model in a fraud detection system. It is a summary of the prediction results for a particular classification problem: it compares the actual values of the target variable with those predicted by the model. Correct and incorrect predictions are revealed and broken down by class, allowing them to be compared with the actual values. The results of a confusion matrix fall into four broad categories: true positives, true negatives, false positives, and false negatives [17] .
Different metrics can be calculated from the contingency table (Table 2) to facilitate interpretation; this is the case, for example, of the error rate, accuracy, precision, recall and F1 score. These indicators allow a better appreciation of the quality of the model’s predictions.
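As an illustration of how these indicators are derived from the four categories of the confusion matrix, the following sketch uses scikit-learn on small made-up label vectors; the variable names and values are purely illustrative, not taken from the study.
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score

# Made-up labels for illustration (0 = non-fraudulent, 1 = fraudulent).
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0, 1, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)            # error rate = 1 - accuracy
precision = tp / (tp + fp)                            # share of predicted frauds that are real frauds
recall = tp / (tp + fn)                               # share of real frauds that are detected
f1 = 2 * precision * recall / (precision + recall)

# scikit-learn provides the same indicators directly:
print(accuracy_score(y_true, y_pred), precision_score(y_true, y_pred),
      recall_score(y_true, y_pred), f1_score(y_true, y_pred))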
2.6. Construction and Training of the Model
In this part, we built the columns materializing the volume of incoming and outgoing calls of each number and built the fraud target variable (a binary variable equal to 1 if the call is fraudulent and 0 otherwise). Figure 6 presents the construction of the new columns.
· Labelling:
For the column “is_fraudulent”, we labelled the SIM Box numbers. On each line we apply a lambda function that searches the line: if it contains a fraudulent number, it returns 1; otherwise it returns 0. This is represented in Table 3.
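A minimal sketch of this labelling step is shown below; the column names and the list of fraudulent numbers are assumptions made for illustration and do not reflect the operator’s actual data.
import pandas as pd

# Illustrative CDR extract; the real MSOFTX3000 field names may differ.
cdr = pd.DataFrame({
    "calling_number": ["690000001", "690000002", "690000003"],
    "called_number":  ["695000009", "695000008", "695000007"],
})

# Numbers flagged by the fraud department (made-up values).
fraudulent_numbers = {"690000002"}

# Label each CDR line: 1 if the line contains a known SIM Box number, 0 otherwise.
cdr["is_fraudulent"] = cdr.apply(
    lambda row: 1 if (row["calling_number"] in fraudulent_numbers
                      or row["called_number"] in fraudulent_numbers) else 0,
    axis=1,
)
print(cdr)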
· Data transformation
We determined the outgoing call volume for each number:
Outcoming_call_volume: in our dataset, we select the numbers that appear several times, group them by calling number, and for each group count the number of times it occurs in the calling number column.
We determined the incoming call volume for each number:
Incoming_call_volume: we take each calling number from the list of aggregated numbers and count the number of times it appears in the called number column of the initial dataset.
We determined the average call duration for each number:
Mean_call_duration: the ratio of the total call duration to the total number of calls. A pandas sketch of these three aggregations is given below.
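The following pandas sketch illustrates how these three features could be computed; the column names and sample values are assumptions for illustration only, not the operator’s data.
import pandas as pd

# Illustrative CDRs; column names and values are assumptions for the sketch.
cdr = pd.DataFrame({
    "calling_number": ["690000001", "690000001", "690000002", "690000003"],
    "called_number":  ["690000002", "690000003", "690000001", "690000001"],
    "call_duration":  [120, 30, 300, 45],   # seconds
})

# Outcoming_call_volume: number of times a number appears as the calling number.
outgoing = cdr.groupby("calling_number").size().rename("Outcoming_call_volume")

# Incoming_call_volume: number of times the same number appears as the called number.
incoming = cdr.groupby("called_number").size().rename("Incoming_call_volume")

# Mean_call_duration: total call duration divided by the total number of calls, per number.
mean_duration = cdr.groupby("calling_number")["call_duration"].mean().rename("Mean_call_duration")

features = pd.concat([outgoing, incoming, mean_duration], axis=1).fillna(0)
print(features)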
· Data normalization
Since the machine learning algorithms do not handle strings directly, we encoded the classes of the cause-for-term variable using the one-hot encoding method, which is a very common approach. One-hot encoding creates new binary columns indicating the presence of each possible value of the original variable: “normalRelease” for a normal hang-up, “partialRecord” for a partial record and “nsuccesfulCallAttempt” for an unsuccessful call.
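A minimal sketch of this encoding step with pandas is shown below, assuming the hang-up cause is stored in a column named cause_for_term (an illustrative assumption); the three values follow the description above.
import pandas as pd

# Illustrative data; the column name "cause_for_term" is an assumption,
# the three values follow the description in the text.
cdr = pd.DataFrame({
    "cause_for_term": ["normalRelease", "partialRecord", "nsuccesfulCallAttempt", "normalRelease"]
})

# One-hot encoding: one new binary column per possible value of the original variable.
encoded = pd.get_dummies(cdr, columns=["cause_for_term"])
print(encoded)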
· Search for dependency between variables
The closer the value is to 1 (solid red), the stronger the positive correlation between two variables; the closer it is to -1 (dark blue), the stronger the negative correlation; values close to 0 indicate little or no linear relationship. This is presented in Figure 7.
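A correlation matrix such as the one in Figure 7 can be produced with a few lines of pandas and seaborn, as in the generic sketch below (not the exact code used in the study; the feature table is a made-up example).
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# "features" stands for the numeric feature table built above; values here are made up.
features = pd.DataFrame({
    "Outcoming_call_volume": [10, 200, 15, 180],
    "Incoming_call_volume":  [12, 3, 14, 2],
    "Mean_call_duration":    [65.0, 20.0, 70.0, 18.0],
    "is_fraudulent":         [0, 1, 0, 1],
})

corr = features.corr()   # Pearson correlation between variables
sns.heatmap(corr, cmap="coolwarm", vmin=-1, vmax=1, annot=True)
plt.title("Correlation between features")
plt.show()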
· Training the model
Therefore, we split the dataset using the standard train/test split function from the scikit-learn library, with 80% of the data used for training and 20% for testing. The function used to split the dataset into training and test data is shown below.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_train_res, y_train_res, random_state = 40, test_size = 0.2)
3. Results and Discussion
3.1. Learning and Creating Prediction Models
· Prediction with the Random Forest
To do this, we imported the algorithm from the sklearn library via the following code:
from sklearn.ensemble import RandomForestClassifier
Then we created a Random forest classifier of 100 trees via the following code:
rf = RandomForestClassifier(n_estimators = 100, random_state = 40)
And we launched the training on our training dataset with the following python code:
rf.fit(X=X_train, y=y_train)
After training our Random Forest model, we obtained the following in Figure 8.
In training, we obtained an accuracy of 0.86 in determining fraudsters with an f1-score of 0.94, an accuracy of 0.95 for non-fraudsters with an f1-score of 0.88, and an overall accuracy of 0.91, so our model predicted well during training.
After testing the Random Forest model, we obtained the result presented in Figure 9.
For the test, we obtained an accuracy of 0.88 in determining fraudsters with an f1-score of 0.96, an accuracy of 0.96 for non-fraudsters with an f1-score of 0.89, and an overall accuracy of 0.92, so our model responded well to the test data.
· Prediction with the SVM
To do this, we imported the algorithm from the sklearn library via the following code:
from sklearn.svm import SVC
Then we created an SVM classifier, whose C parameter determines the penalty applied to misclassifications, via the following code:
svc = SVC(random_state = 40, C = 20)
And we launched the training on our training dataset with the following python code:
svc.fit(X = X_train, y = y_train)
After training our SVM model, we obtained the result presented in Figure 10.
In training, we obtained an accuracy of 0.74 in determining fraudsters with an f1-score of 0.83, an accuracy of 0.95 for non-fraudsters with an f1-score of 0.83, and an overall accuracy of 0.83. Here, our model was less accurate in detecting fraudsters and non-fraudsters during training.
Testing our SVM model, we obtained the result in Figure 11:
For the test, we obtained an accuracy of 0.89 in determining fraudsters with an f1-score of 0.53, an accuracy of 0.64 for non-fraudsters with an f1-score of 0.53, and an overall accuracy of 0.69, so our model did not respond well to the test data.
· Prediction with the XGBoost
To do this, we imported the algorithm from the sklearn library via the following code:
from xgboost import XGBClassifier
Then we created an XGBoost classifier via the following code:
xgb = XGBClassifier()
And we launched the training on our training dataset with the following python code:
xgb.fit(X = X_train, y = y_train)
Training our XGBoost model, we got the following result in Figure 12:
In training, we obtained an accuracy of 0.71 in determining fraudsters with an f1-score of 0.82, an accuracy of 0.96 for non-fraudsters with an f1-score of 0.80, and an overall accuracy of 0.81, so our model predicted well during training.
After testing the XGBoost model, we obtained the results presented in Figure 13.
For the test, we obtained an accuracy of 0.72 in determining fraudsters with an f1-score of 0.83, an accuracy of 0.72 for non-fraudsters with an f1-score of 0.79, and an overall accuracy of 0.81, so our model reacted acceptably to the test data.
3.2. Evaluation of the Model by the Confusion Matrix
· Random Forest algorithm
Figure 14 presents the Random Forest confusion matrix. The number of false negatives is 23 (we predicted “no” but they are fraudsters), the number of false positives is 66 (we predicted “yes” but they are not fraudsters), the number of true positives is 519 (we predicted that they are not fraudsters and indeed they are not), and the number of true negatives is 492 (we predicted that they are fraudsters and indeed they are).
· SVM algorithm
Figure 15 presents the SVM confusion matrix. The number of false negatives is 322 (we predicted that they are not fraudsters but they are), the number of false positives is 24 (we predicted “yes” but they are not fraudsters), the number of true positives is 561 (we predicted that they are not fraudsters and indeed they are not), and the number of true negatives is 193 (we predicted that they are fraudsters and indeed they are).
· XGBoost algorithm
Figure 16 presents the XGBoost confusion matrix. The number of false negatives is 16 (we predicted that they are not fraudsters but they are), the number of false positives is 192 (we predicted “yes” but they are not fraudsters), the number of true positives is 393 (we predicted that they are not fraudsters and indeed they are not), and the number of true negatives is 499 (we predicted that they are fraudsters and indeed they are).
3.3. Discussion
Following the experimental work presented in the previous paragraphs, the machine learning model we propose is the Random Forest model. This model is retained because it provides the best SIM Box fraud detection, with an accuracy of 0.92 and a score of 0.92, as presented in Table 4:
We made the prediction with our best performing model to determine whether the Random Forest model is able to correctly identify a case of SIM Box fraud. Figure 17 presents the command that can be used.
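A generic sketch of such a prediction step is given below; it reuses the trained classifier rf and the test features X_test from the training step, while numbers_test is an assumed Series of the corresponding phone numbers introduced only for illustration (Figure 17 shows the actual command used).
# Generic sketch (not the exact command of Figure 17): predicting with the trained model.
# "rf" and "X_test" come from the training step above; "numbers_test" is an assumed
# Series holding the phone numbers corresponding to the rows of X_test.
import pandas as pd

predicted_labels = rf.predict(X_test)   # 1 = fraudulent, 0 = non-fraudulent

dataframe_predictions_with_numbers = pd.DataFrame({
    "calling_number": numbers_test,
    "is_fraudulent": ["fraudulent" if p == 1 else "non-fraudulent" for p in predicted_labels],
})
print(dataframe_predictions_with_numbers.head())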
Then we tested each line of the dataset to separate the fraudulent and non-fraudulent numbers, and obtained the dataset with the list of fraudulent and non-fraudulent numbers.
Finally, we filtered the dataset to keep only the fraudulent numbers and obtained the list of fraudulent numbers. For that we used the following code, and the corresponding results are presented in Figure 18 and Figure 19:
Df_fraudulents = dataframe_predictions_with_numbers[dataframe_predictions_with_numbers["is_fraudulent"] == "fraudulent"]
Due to the rapid evolution of SIM Box fraud, we think it is necessary to refresh the detection model periodically, for example every quarter, and to always use the most accurate model for fraud detection.
4. Conclusions
The objective of this paper is to research and implement a SIM Box fraud detection system for a telecommunications network operator, with a case study based on data collected from a fixed and mobile network operator in Cameroon. The aim is to quickly identify SIM Box fraud and reduce or eliminate the financial loss this scam causes to the company’s turnover.
We used machine learning techniques to effectively identify SIM Box fraud based on CDR analysis and prevent it from harming telecom companies in terms of revenue, quality of service and security. Since the dataset is labelled but unbalanced, we used classification algorithms to detect the SIM Box scam. After this step, we compared the incoming and outgoing call rates and then determined the total duration of calls per day, so that an individual not detected during the first hours may be detected in the following hours. We ran the data through different supervised machine learning models in order to compare their performance based on accuracy and select the best one for fraud detection. From the experiment, we found that Random Forest, SVM and XGBoost are all able to detect bypass SIM Box fraud. The experimental results showed that Random Forest has the best accuracy: Random Forest gave 92% accuracy, while the SVM model gave 76% and XGBoost 84%. Therefore, the Random Forest approach, with 92% accuracy, is the most suitable classification model for SIM Box fraud detection. This model was then successfully used to identify the fraudulent numbers in the mobile operator’s network.