Credit Card Fraud Detection Using Weighted Support Vector Machine

Credit card transaction data is highly imbalanced, with an overwhelmingly large portion of nonfraudulent transactions and a small portion of fraudulent transactions. The measures used to judge the veracity of detection algorithms therefore become critical to deploying a model that accurately scores fraudulent transactions, taking into account both the class imbalance and the cost of identifying a case as genuine when, in fact, it is fraudulent. In this paper, a new criterion to judge classification algorithms, which considers the cost of misclassification, is proposed, and several undersampling techniques are compared by this new criterion. At the same time, a weighted support vector machine (SVM) algorithm considering the financial cost of misclassification is introduced, proving to be more practical for credit card fraud detection than traditional methodologies. This weighted SVM uses transaction balances as weights for fraudulent transactions, and a uniform weight for nonfraudulent transactions. The results show that this strategy greatly improves the performance of credit card fraud detection.


Introduction
Credit card use is popular in the United States and shows continued, stable growth, so the opportunity for fraudulent credit card transactions will also increase [1]. Because of the large dollar amounts of reported credit card fraud, credit card fraud detection has attracted increased interest in industry and academia.
Researchers have developed machine learning algorithms to predict credit card fraud. Although this research has progressed well, it remains challenging in the following areas: how to preprocess imbalanced data, how to choose a criterion to judge the performance of different algorithms, and how to find an efficient and effective algorithm.
Real-world credit card transaction data sets are highly imbalanced, with many non-fraudulent records versus only a few fraudulent records. In the data set used in this study, illicit transactions account for only 0.17% of all transactions.
Note, however, that misclassification of a fraudulent transaction is far more serious than misclassification of a non-fraudulent transaction. Misclassifying a fraudulent transaction as non-fraudulent results in a financial loss for the bank, whereas misclassifying a non-fraudulent transaction as fraudulent only requires the bank to send a verification message to the customer. Misclassification of different fraudulent transactions also costs the bank differently: in the data set used in this paper, the minimum fraudulent transaction is $0.10 and the maximum is $25,691.00, so misclassifying the largest fraudulent transaction costs far more than misclassifying the smallest.
Most industry practitioners and academics use accuracy, precision-recall, or area under the curve (AUC) to measure the performance of a classifier. None of these measures reflects the financial cost of individual misclassifications, so this paper introduces a new measure to overcome this flaw.
Because credit card transaction data sets are enormous, training sophisticated machine learning models on an unmodified data set consumes a large amount of computational resources and time, and is therefore impractical and inefficient. The imbalance of the data set also causes problems for training, which skews toward the prevailing class: classifying all cases as the prevailing class achieves a very high but useless accuracy rate, so accommodating the imbalance is critical.
Usually, undersampling is employed by taking all the positive cases and a sample of negative cases (possibly of equal size, or some percentage greater), and preprocessing the data as needed before applying machine learning models to classify the data set. To reduce the computational burden, an online SVM algorithm can also be used to shorten training time when an SVM model is investigated. This online SVM (LASVM) automatically chooses only the most informative data in every iteration, instead of using all the available data [2].
Researchers have found, and logic indicates, that if a data set is highly imbalanced, classification using standard SVM or logistic regression will be biased toward the prevailing class. This bias can be corrected by introducing a weight in the regularization process associated with the loss function, which helps to alleviate the classification bias produced by data set imbalance. Assigning different weights to different classes in the regularization process helps ameliorate the imbalance bias, and it has been found that for SVM, if the weights for the two classes are inversely proportional to their data sizes, unbiased accuracy is achieved for both classes [3].
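As a minimal sketch of the class-weighting idea attributed to [3], the snippet below assigns each class a weight equal to the inverse of its size. The helper name `inverse_class_weights` and the toy label counts are assumptions made for illustration only; they are not taken from [3] or from the experiments in this paper.

```python
from collections import Counter


def inverse_class_weights(labels):
    """Return {label: 1 / class_size} (hypothetical helper).

    With these weights, every class contributes the same total weight,
    regardless of how many records it contains.
    """
    counts = Counter(labels)
    return {label: 1.0 / n for label, n in counts.items()}


# Toy labels: 1 = fraudulent, 0 = nonfraudulent (illustrative counts only).
labels = [1] * 5 + [0] * 995
weights = inverse_class_weights(labels)

# Total weight carried by each class after weighting.
total_fraud_weight = weights[1] * 5
total_nonfraud_weight = weights[0] * 995
```

In this toy case a single fraudulent record is weighted 199 times more heavily than a single nonfraudulent record, which is exactly the ratio of the two class sizes.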
Although this modification can help improve the prediction accuracy for the minority class (in our case, fraudulent transactions), further improvement can be achieved by assigning different weights to individual data points. Assigning large weights to the fraudulent transactions with higher losses gives those points a higher chance of being correctly detected by the model, and achieving this outcome is one of the main goals of this study.
The rest of the paper is arranged as follows. Section 2 introduces related work in this field, including evaluation measurements for imbalanced data classification, preprocessing techniques, and model development for credit card fraud detection. Section 3 establishes the theoretical framework, including a weighted SVM that considers the financial cost of individual instances and a new evaluation measurement specifically designed for credit card fraud detection. Section 4 presents the experimental computations, and the last section gives the final conclusion of this study.

Classification Measurement
Broadly speaking, credit card fraud detection belongs to the category of imbalanced data classification, and research achievements within the area of imbalanced data classification can be applied directly to credit card fraud detection. In any data classification problem, the most fundamentally important issue is choosing a valid measure to precisely and accurately assess the classification, and in the imbalanced data case, its significance is doubled.
Besides the well-known and widely used measures of accuracy, AUC score, and precision-recall, researchers have also developed and investigated other measurements, such as G-mean, discriminant power, and likelihood ratio [4] [5] [6]. However, the authors did not recommend a clear winner among these measurements, although all agree that accuracy is misleading.
The idea of adjusting methodologies using a weighting scheme to account for imbalance and cost, as well as adjusting the evaluation metrics, is not well researched. [7] created a metric, wtdAcc, but its weighting schema is extremely ad hoc and not linked to the formal modeling framework. [8] uses a schema that 1) weights the sample, and 2) updates a model weight taking cost into view.
For weighting the sample, [8] indicates: converting sample-dependent costs into sample weights is also known as cost-sensitive learning by example weighting. The weighted training samples are then applied to standard learning algorithms. This approach is at the data level without changing the underlying learning algorithms.
For updates to the model weight taking cost into view, [8] indicates: the tree-building strategies are adapted to minimize the misclassification costs.
Using these methodologies, [8] employs traditional unadjusted evaluation measures to determine a model's performance for each of the above two procedures, potentially biasing the decision process.
The approach in this paper not only updates weights with respect to cost or sample, but also presents a new evaluation metric to account for the "cost", in the form of balance, enhancing model selection, minimizing costs, and ameliorating, in part, data imbalance.

Resampling Techniques
Most standard machine learning techniques cannot handle data sets with a highly skewed class distribution: accuracy will be heavily biased toward the prevailing class. In the case of credit card fraud detection, only about 0.2% of all transactions are illicit, so the minority class will be predicted poorly. To overcome the problems introduced by data imbalance, resampling techniques are often applied to the original data set to adjust the imbalance and to create unbiased predictions.
Random undersampling and oversampling are simple and can help solve the problem of data skewness, but they often introduce non-informative or ill-informative substructures into the data set. [9] investigated different resampling techniques and concluded that undersampling based on the K-Medoids technique achieves the best overall result using AUC score as the evaluation measurement. With the popularity of deep learning, [10] used a generative adversarial network to oversample the minority class and achieved good results using recall as the evaluation measurement. [11] investigated different resampling techniques and concluded that the oversampling technique SMOTE + ENN achieves the best performance using recall as the evaluation measurement and logistic regression as the model. The shortcomings of resampling include possible information loss and extra computing cost. Besides resampling, a method of active learning was introduced by [2], which chooses only a small portion of data from the data set at every training iteration.

Supervised Machine Learning Models
There is a variety of choices for classifying credit card fraud. Logistic regression (LR), SVM, and random forest (RF) are the three most frequently chosen classification techniques used in many different applications. [7] compared these three techniques using a real-world data set of credit card transactions. They found that logistic regression can achieve performance comparable to the other two, more sophisticated models, for which no parameter was tuned or optimized.
AdaBoost has also been a good choice because the algorithm itself can assign weights to different classes, which helps to predict the minority class more accurately. [12] used AdaBoost as a black-box model for credit card fraud, considering the financial loss of misclassification, and achieved results comparable with a state-of-the-art commercial system. [8] developed new weight-updating strategies for AdaBoost, which assign higher weights to the misclassified instances of the minority class.
Deep learning techniques, such as convolutional neural networks and recurrent neural networks, have had great success in computer vision and language processing, which need very sophisticated algorithms to distinguish different features. In the area of credit card fraud detection, however, these methods have not had such success to date. [13] used long short-term memory (LSTM) for credit card fraud detection, considering the time series characteristics of the data set, and found that the LSTM model did not dramatically improve detection performance compared with RF. Finally, the authors suggested that an ensemble model combining LSTM and RF could achieve better results than using only one of them as the classifier.

Unsupervised Machine Learning Models
Besides supervised classification algorithms, unsupervised learning algorithms can also be employed for fraud detection. [14] specifically mention in their data mining work that K-means clustering could be used to implement a fraud detection algorithm, and [15] implemented a combination of PCA and Simple K-means clustering in the WEKA machine learning environment [16] to obtain an optimized combination of dimension reduction and clustering, achieving 100% precision on a generated credit card transaction data set. [17] [18] also compared supervised and unsupervised learning algorithms for fraud detection and found that unsupervised learning algorithms are more difficult to use, and perform worse, than supervised learning algorithms.

Support Vector Machine
The principal ideas surrounding the support vector machine started with [19], where the authors express neural activity as an all-or-nothing (binary) event that can be mathematically modeled using propositional logic, and which, as ([20], p. 244) succinctly describe, is a model of a neuron as a binary threshold device in discrete time.
Thus, for binary classification, when two classes can be completely separated, the classification problem is characterized by a training data set $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, in which $x_i$ is a vector of $d$ dimensions and $y_i$ is a scalar with $y_i \in \{+1, -1\}$. Therefore, $y_i$ is a label indicating that the data point belongs to one class or the other, and, assuming linear separability, a straightforward algorithm finds a hyperplane, which is a linear combination of the $x_i$, that separates the two classes. The linear separator has the form $f(x) = w^{\top}\Phi(x) + b$, in which $\Phi$ is called the feature function and is specified by hand, and $w$ and $b$ are parameters determined by the learning algorithm on the training data. The criterion for deciding that a data point belongs to a specific class is the sign of $f(x)$: the point is assigned to class $+1$ if $f(x) > 0$ and to class $-1$ otherwise. Rosenblatt in 1962 described this algorithm with the perceptron, as mentioned in ([21], p. 192), and produced in [22], with a mechanism to discover a hyperplane that separates the two classes with maximum margin between the two categories. The margin is defined as the distance from the nearest points of both classes to the separating hyperplane; these nearest points are called support vectors and are only a small fraction of all data. The perceptron methodology assumes that the two classes are completely separable. Equation (1) can be used to solve for $w$ and $b$, assuming the hyperplane achieves the maximum margin between the two categories.
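For reference, the maximum-margin problem for completely separable classes is usually written as below. This is the standard textbook form, stated in the symbols $w$, $b$, and $\Phi$ used above; it is assumed, not confirmed, to be the form the text refers to as Equation (1).

```latex
\min_{w,\,b}\ \frac{1}{2}\lVert w\rVert^{2}
\quad \text{subject to} \quad
y_i\bigl(w^{\top}\Phi(x_i) + b\bigr) \ge 1,
\qquad i = 1, \ldots, n
```

Under this scaling the support vectors satisfy the constraint with equality and lie at distance $1/\lVert w\rVert$ from the hyperplane, so minimizing $\lVert w\rVert^{2}$ is equivalent to maximizing the margin.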
In the real world, the classes cannot always be separated clearly. Sometimes some points lie on the wrong side of the hyperplane; that is, a separating hyperplane may not exist ([23], p. 343). To deal with classification error, a soft margin is introduced, which allows some data to be classified on the wrong side of the hyperplane.
As shown in Equation (2), a second term involving the slack variables $\xi_i$ is introduced, which is used to handle the cases of misclassification. The user-specified parameter $C$ is the weight for the cost of misclassification. Setting a large $C$ gives a high penalty for misclassification, and a small $C$ gives a low penalty. As shown in Figure 1, only data points on the correct side of the margin incur no penalty.
Equation (2) is called the primal problem, which is a constrained optimization problem.
The primal problem, Equation (2), can also be transformed into a dual problem, Equation (3). The computation is at least quadratic in the data size, which makes SVM hard to train on large data sets [24]. In this paper, different undersampling techniques will be used to trim the training data set.
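For reference, the soft-margin primal problem and its dual are commonly written as follows. These are the standard textbook forms, with $\xi_i$ denoting the slack variables and $\alpha_i$ the Lagrange multipliers; the notation is an assumption chosen to match the symbols $w$, $b$, $C$, and $\Phi$ used above, and may differ in detail from Equations (2) and (3).

```latex
% Soft-margin primal problem
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\xi_i
\quad \text{subject to} \quad
y_i\bigl(w^{\top}\Phi(x_i) + b\bigr) \ge 1 - \xi_i,
\quad \xi_i \ge 0

% Corresponding dual problem
\max_{\alpha}\ \sum_{i=1}^{n}\alpha_i
 - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}
   \alpha_i\,\alpha_j\,y_i\,y_j\,\Phi(x_i)^{\top}\Phi(x_j)
\quad \text{subject to} \quad
0 \le \alpha_i \le C,
\quad \sum_{i=1}^{n}\alpha_i\,y_i = 0
```

The double sum over all pairs of training points in the dual is one way to see why the cost grows at least quadratically with the number of records.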

Weighted Support Vector Machine
In the standard SVM methodology, the weight for the penalty of misclassification is the same for every data point; a weighted SVM, by contrast, allows this penalty to differ between data points. In this paper, weights are introduced into the loss function to reflect the financial importance of each transaction. The weights for legitimate transactions, $S_{nf}$, are all assigned the same value. The weights for fraudulent transactions are assigned proportional to the amount of money transferred. To solve the weighted SVM optimization problem, Lagrange multipliers can again be introduced, and the resulting dual problem is similar to that of the standard SVM, Equation (3). The only difference is in the box constraint on the Lagrange multipliers, whose upper bound is scaled by the weight of each data point.
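One common way to write a per-sample weighted SVM consistent with the description above is sketched below. The symbols $s_i$ (weight of record $i$) and $a_i$ (transaction amount of record $i$) are assumptions introduced here for illustration; only $S_{nf}$, $C$, $w$, $b$, and $\Phi$ appear in the text.

```latex
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n} s_i\,\xi_i
\quad \text{subject to} \quad
y_i\bigl(w^{\top}\Phi(x_i) + b\bigr) \ge 1 - \xi_i,
\quad \xi_i \ge 0,
\qquad
s_i =
\begin{cases}
S_{nf}, & \text{record } i \text{ is nonfraudulent} \\
\propto a_i, & \text{record } i \text{ is fraudulent}
\end{cases}
```

In the dual of this weighted problem, the objective is unchanged and the box constraint $0 \le \alpha_i \le C$ of the standard SVM becomes $0 \le \alpha_i \le C\,s_i$.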

Result
To align this paper with other papers in the literature, fraudulent transactions are labeled positive cases and non-fraudulent transactions are labeled negative cases. Again following the literature, false negative cases (type II errors) are fraudulent cases classified as non-fraud, and false positive cases (type I errors) are non-fraudulent cases classified as fraud. Accuracy is defined in Equation (5). The definitions of precision and recall follow the accepted conventions; see Equation (6) and Equation (7). The contribution of this paper is the introduction of a new measure, which we call financial recovery, defined in Equation (8) as the total monetary amount of detected fraudulent transactions divided by the total monetary amount of all fraudulent transactions. The financial recovery measure is much more practical than other measures for this application, since the objective of detecting fraudulent transactions is to minimize financial loss for a firm or financial institution.
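The four measures described above can be sketched in a few lines. This is an illustrative implementation under the stated conventions (1 = fraud = positive, 0 = non-fraud = negative); the function name `evaluate` and the toy data are assumptions, not the authors' code.

```python
def evaluate(y_true, y_pred, amounts):
    """Compute accuracy, precision, recall, and financial recovery.

    y_true, y_pred: sequences of 1 (fraud) or 0 (non-fraud).
    amounts: transaction amount of each record.
    """
    tp = fp = tn = fn = 0
    detected_fraud_amount = 0.0  # money in correctly detected fraud
    total_fraud_amount = 0.0     # money in all fraudulent transactions
    for truth, pred, amount in zip(y_true, y_pred, amounts):
        if truth == 1:
            total_fraud_amount += amount
            if pred == 1:
                tp += 1
                detected_fraud_amount += amount
            else:
                fn += 1
        elif pred == 1:
            fp += 1
        else:
            tn += 1
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "financial_recovery": (
            detected_fraud_amount / total_fraud_amount
            if total_fraud_amount else 0.0
        ),
    }


# Toy example: two frauds ($900 and $100); only the $900 one is detected.
y_true = [1, 1, 0, 0, 0]
y_pred = [1, 0, 1, 0, 0]
amounts = [900.0, 100.0, 50.0, 20.0, 30.0]
scores = evaluate(y_true, y_pred, amounts)
```

In this toy example recall is 0.5 because one of the two frauds is missed, while financial recovery is 0.9 because the detected fraud carries 90% of the fraudulent dollar amount. This gap is what the new measure is designed to expose.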
The data set used in this paper was downloaded from Kaggle.com, an online data science platform with publicly available data. The records include credit card transactions from a European bank over a two-day period in September 2013. Each record in the data set includes 30 features, all of which are principal components derived from a set of original features, except for the first feature, which is time, and the last feature, which is the monetary transaction amount; both of these features are native to the original data set.
The total number of credit card transactions in the database is 284,807, of which only 492 are fraudulent, a mere 0.17% of all records. As shown in Figure 2, most transaction amounts were lower than $100, but some exceeded $10,000.
The classification problem is carried out as follows:
• Three commonly used algorithms, LR, SVM, and RF, will be applied to the data. The results will be used as the benchmark for the more advanced algorithms.
• The weighted SVM will be applied to the data. Three undersampling techniques are compared and the sampling size will be optimized.
• The weight of the nonfraudulent class will be optimized, while the individual weights of fraudulent data points are assigned as the transaction balance.
• The impact of kernel functions, such as sigmoid, polynomial, and radial basis function (RBF), on the performance of the SVM algorithm will be investigated.
Python is the programming language, with the scikit-learn package used for all the aforementioned algorithms. Data preprocessing was carried out with the preprocessing function StandardScaler to put all the data on the same scale, which is very important for SVM. Finally, to keep the results consistent and repeatable across algorithms, the random-state seed was set when calling functions from the scikit-learn package.
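To make the scaling step concrete, the sketch below reproduces in plain Python what standardization does to each feature column: subtract the column mean and divide by the population standard deviation, which is the convention StandardScaler follows. The helper `standardize_columns` and the toy rows are assumptions for illustration; in a real workflow the mean and standard deviation would be estimated on the training split only.

```python
from statistics import fmean, pstdev


def standardize_columns(rows):
    """Scale every column to zero mean and unit variance (hypothetical helper)."""
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    # Constant columns have zero spread; leave them unscaled to avoid 0-division.
    stds = [pstdev(col) or 1.0 for col in columns]
    return [
        [(value - mean) / std for value, mean, std in zip(row, means, stds)]
        for row in rows
    ]


# Two toy features on very different scales.
rows = [[0.0, 10.0], [1.0, 20.0], [2.0, 30.0]]
scaled = standardize_columns(rows)
```

After scaling, both columns have the same spread, so no single feature dominates the distance and inner-product computations that an SVM relies on.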

Benchmark Results
The benchmark classification process used the three most commonly used algorithms, Logistic Regression (LR), Support Vector Machine (SVM), and Random Forest (RF), with no parameters tuned and all parameters left at their default settings.
The results in Table 1 show that the accuracy of all three algorithms is 99.9%, which is not unexpected as only 0.17% of the transactions in the data are fraudulent, but this metric provides no capacity to differentiate which methodology best classifies fraud. For the precision metric, the higher the precision, the fewer non-fraudulent transactions are classified as fraudulent, which gives customers a better shopping experience with fewer verification messages or calls being made. The precision of SVM is 89.5%, a little better than LR and RF.
The recall measure provides the percentage of fraudulent transactions found in the data. The recall of SVM is 61.6%, which is 2.9% less than LR and 0.7% more than RF, giving mixed results with no clear methodological winner. The very important, newly introduced performance metric is financial recovery, which represents the amount of fraudulent transactions that have been correctly detected as a percentage of the sum of all fraudulent transaction amounts. The financial recovery score for SVM is 47.5%, the highest of the three algorithms, and this is of primary importance since financial recovery is paramount to the company.
Overall, the above results indicate the SVM model maintains the most consistency among the three algorithms presented, and therefore will be used as a benchmark model for further evaluation and investigation in the remainder of the paper.

Comparison of Undersampling Techniques
The computational burden of SVM may extend running time beyond practical limits as [25] indicate below: Training an SVM requires solving a constrained quadratic programming problem, which usually takes ( ) is the number of support vectors and is usually proportional to m. Consequently, SVMs' training time and prediction time to a lesser extent on a very large data set can be quite long, thus making it impractical for some real-world applications.
As Kramer indicates above, the runtime for the credit card data set considered here, with 284,807 records, could be extensive.
To accelerate the computational efficiency of SVM, random, NearMiss, and k-nearest neighbors (KNN) undersampling techniques were employed, in which the fraudulent transactions were kept untouched and the nonfraudulent transactions were undersampled. Random undersampling, as [11] indicates on page 2, is a reasonable undersampling methodology: A simple undersampling technique is uniformly random undersampling of the majority class. This can potentially lead to loss of information about the majority class. However, in cases where each example of the majority class is near other examples of the same class, this method might yield good results.
[11] describes the NearMiss technique on page 2 as follows:
In NearMiss-1, those points from L (majority class) are retained whose mean distance to the k nearest points in S (minority class) is lowest, where k is a tunable hyperparameter.
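The NearMiss-1 rule quoted above can be sketched directly from its definition. The function below is an illustrative, brute-force implementation using Euclidean distance; the name `nearmiss1`, the `n_keep` parameter, and the toy points are assumptions, and a library implementation may differ in tie-breaking and defaults.

```python
import math


def nearmiss1(majority, minority, n_keep, k=3):
    """Keep the n_keep majority points with the lowest mean distance
    to their k nearest minority points (hypothetical helper)."""

    def mean_distance_to_k_nearest(point):
        distances = sorted(math.dist(point, other) for other in minority)
        nearest = distances[:k]
        return sum(nearest) / len(nearest)

    return sorted(majority, key=mean_distance_to_k_nearest)[:n_keep]


# Toy 2-D data: S (minority, fraud) and L (majority, non-fraud).
minority = [(0.0, 0.0), (1.0, 0.0)]
majority = [(0.5, 0.5), (5.0, 5.0), (0.0, 1.0), (10.0, 0.0)]
kept = nearmiss1(majority, minority, n_keep=2, k=2)
```

Here the two majority points closest to the fraud cluster are retained, which keeps the non-fraud records that lie near the class boundary and discards those far from it.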
[11] also discusses KNN, which is characterized in More's work as CNN and described on page 4 as follows: In CNN undersampling, the goal is to choose a subset U of the training set T such that for every point in T its nearest neighbor in U is of the same class.
Below, these algorithms are first discussed in relation to the measures of accuracy, precision, recall, and the new measure introduced here, financial recovery; second, the most important parameter for random undersampling, sample size, is considered. Note that random undersampling sets the number of non-fraudulent samples at 10 times the number of fraudulent samples, while the NearMiss and KNN undersampling techniques used default values.
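Random undersampling at a fixed ratio, as described above, can be sketched as follows. The function name `random_undersample`, the `seed` argument, and the toy records are assumptions for illustration; the fixed seed mirrors the use of a random-state seed for repeatability.

```python
import random


def random_undersample(fraud, nonfraud, ratio=10, seed=0):
    """Keep every fraud record and a uniform random sample of
    ratio * len(fraud) non-fraud records (hypothetical helper)."""
    rng = random.Random(seed)  # local generator keeps the result repeatable
    n_keep = min(len(nonfraud), ratio * len(fraud))
    return list(fraud), rng.sample(nonfraud, n_keep)


# Toy records: 5 fraudulent and 1000 non-fraudulent transactions.
fraud = [("fraud", i) for i in range(5)]
nonfraud = [("ok", i) for i in range(1000)]
kept_fraud, kept_nonfraud = random_undersample(fraud, nonfraud, ratio=10, seed=42)
```

With a ratio of 10, the 5 fraud records are paired with 50 sampled non-fraud records, so the training set shrinks from 1005 to 55 records while every fraud case is preserved.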
The results in Table 2 show that random undersampling is the best technique for the data considered here. The three techniques achieve broadly similar results for accuracy, recall, and financial recovery; however, random undersampling achieves 35.7% precision, far superior to the other two undersampling techniques.
At the same time, using random undersampling, the total number of samples was reduced from 199,364 to 3,894, and compared to standard SVM, the training time was reduced from more than two hours to less than one minute on a PC with an Intel Core i7 CPU and 32 GB of memory. Compared with the SVM benchmark model, random undersampling not only dramatically improves computational efficiency but also increases financial recovery from 47.5% to 84.6%, although precision is reduced from 89.5% to 35.7%, as shown in Table 1 and Table 2.
The ratio of nonfraudulent to fraudulent samples for SVM with random undersampling was then varied to find the optimal ratio, and Figure 3 shows that the model performs well when the ratio is 15. When the ratio is higher than 15, financial recovery does not improve and training with SVM becomes slower; when the ratio is lower than 15, precision deteriorates. Consequently, random undersampling with a ratio of 15 is used in the weighted SVM algorithm.

Optimization of Weighted Linear SVM
Usually, identical weights are assigned to every member of a class [3]. In practice, however, different fraudulent transactions have different costs, since a fraudulent transaction worth $10,000 is much more important than one worth $100. In this paper, the weights assigned to the fraudulent transactions in the training step are based on the dollar amount of each transaction scaled by the total dollar amount of all transactions, whereas the weights assigned to all nonfraudulent transactions are the same. The logic behind this decision is that the cost of misclassifying a nonfraudulent transaction as fraudulent is the same for each nonfraudulent record: it is the cost of investigating the transaction, which is similar for each misclassified record. After assigning the transaction amount as the weight of each fraudulent transaction, the optimal weight for all nonfraudulent transactions was investigated. In Table 3, the weight of the nonfraudulent transactions is in the first column, labeled $S_{nf}$. Accuracy increases as $S_{nf}$ increases. The best performance occurs at a weight of 10, with financial recovery at 99.6%, accuracy at 97.6%, and precision at 5.8%; financial recovery decreases significantly when the weight is more than 10. Precision increases as the nonfraudulent weight increases, and because both higher precision and higher financial recovery are desirable, balancing these two performance measures merits setting the weight at 10 for nonfraudulent transactions.
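The weighting scheme described above can be sketched as a per-record weight vector. The function name `assign_sample_weights` and the `fraud_scale` parameter are assumptions: the text describes fraud weights both as the transaction amount and as the amount scaled by a total, so the scale factor is left as an explicit, hypothetical knob rather than fixed.

```python
def assign_sample_weights(labels, amounts, s_nf=10.0, fraud_scale=1.0):
    """Return one weight per record (hypothetical helper).

    Fraudulent records (label 1) are weighted by their transaction amount
    times fraud_scale; nonfraudulent records (label 0) all share s_nf.
    """
    weights = []
    for label, amount in zip(labels, amounts):
        if label == 1:
            weights.append(fraud_scale * amount)
        else:
            weights.append(s_nf)
    return weights


# Toy records: a $10,000 fraud, a $100 fraud, and two legitimate purchases.
labels = [1, 0, 1, 0]
amounts = [10000.0, 35.0, 100.0, 80.0]
weights = assign_sample_weights(labels, amounts, s_nf=10.0)
```

A list like this can be handed to any learner that accepts per-sample weights. In scikit-learn, for example, the `fit` method of `SVC` takes a `sample_weight` argument that rescales the penalty for each record.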

Optimization of Weighted Nonlinear SVM
In all the above results, SVM employed a linear kernel. To further improve the performance of the weighted SVM, three nonlinear kernel functions, which can enhance performance in the face of diverse data structures [26], were investigated: radial basis function (RBF), polynomial, and sigmoid. Table 4 presents the kernel functions investigated in this study. The most important parameter for these kernels is gamma, and the polynomial kernel also has a degree parameter. Besides these kernel parameters, the optimal weight for nonfraudulent transactions, $S_{nf}$, also needs to be found. A grid search approach is used to find the best parameters for these three kernel functions. The three kernels provide extensions of the linear kernel examined above, and the results are presented in Table 5. The best weight for the nonfraudulent class with the RBF kernel is 0.1, with the best gamma at 0.05, and the best weight for the nonfraudulent class with the polynomial kernel is 0.5, with a polynomial of degree 2. The optimal weight for the nonfraudulent class using the sigmoid kernel is 0.8, with the best gamma at 0.01. Using kernel functions did not significantly improve classification performance. This demonstrates that the most important factor in improving SVM algorithms is to use the right weights for individual data points.
• Recall improves materially from the linear case, which outweighs the drop in accuracy, since financial recovery increases with the kernel functions.
• With the linear kernel, about 2% of nonfraudulent transactions are classified as fraudulent. This increases to 6% with the RBF or polynomial kernel, and to more than 20% with the sigmoid kernel.
• The drop in precision, which results in more nonfraudulent transactions being classified as fraudulent, is not justified by the minor increase in financial recovery.
The linear kernel performance is superior to that of more complex kernels given optimal weighting of the nonfraudulent cases, thus satisfying the desirable statistical property of parsimonious modeling.
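For readers unfamiliar with the three nonlinear kernels compared above, the sketch below writes them out in their common textbook forms, which are also the forms scikit-learn documents for its SVM kernels. The function names and the `coef0` parameter are assumptions for illustration, and the parameterization in Table 4 may differ.

```python
import math


def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def rbf_kernel(x, y, gamma):
    """exp(-gamma * ||x - y||^2)"""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def polynomial_kernel(x, y, gamma, degree=2, coef0=0.0):
    """(gamma * <x, y> + coef0) ** degree"""
    return (gamma * dot(x, y) + coef0) ** degree


def sigmoid_kernel(x, y, gamma, coef0=0.0):
    """tanh(gamma * <x, y> + coef0)"""
    return math.tanh(gamma * dot(x, y) + coef0)


# Identical points have RBF similarity 1; it decays with squared distance.
same = rbf_kernel((1.0, 2.0), (1.0, 2.0), gamma=0.05)
far = rbf_kernel((0.0, 0.0), (10.0, 0.0), gamma=0.05)
```

In all three forms gamma controls how quickly similarity changes with the inputs, which is why it is the main parameter searched over in the grid search.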

Confusion Matrix
To evaluate the performance of our model, the confusion matrices of the weighted SVM model and the standard SVM are compared in Figure 4. In this confusion matrix with a cost function [28], we assume that TN and TP have no cost since both are classified correctly. The cost of an FN is taken to be the balance of the transaction, since the balance is lost when the transaction is classified incorrectly. The cost of an FP is also taken to be 0, since only a verification message or email is sent from the bank to the client. From Figure 4, we can see that the financial cost of using the standard SVM is $10,396, which is reduced to $90 when using the weighted SVM with undersampling. Since this covers only two days of transaction data from a European bank, the annual savings would be a great benefit for the bank.
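The cost assumptions above translate into a short calculation: sum the amounts of the false negatives and charge nothing for every other outcome. The function name `misclassification_cost`, the optional `fp_cost` parameter, and the toy figures are assumptions for illustration and do not reproduce the numbers in Figure 4.

```python
def misclassification_cost(y_true, y_pred, amounts, fp_cost=0.0):
    """Total financial cost under the stated assumptions (hypothetical helper).

    TN and TP cost nothing, an FN costs the transaction amount, and an FP
    costs fp_cost (0 by default, as assumed in the text).
    """
    cost = 0.0
    for truth, pred, amount in zip(y_true, y_pred, amounts):
        if truth == 1 and pred == 0:    # missed fraud: the amount is lost
            cost += amount
        elif truth == 0 and pred == 1:  # false alarm: verification only
            cost += fp_cost
    return cost


# Toy example: one missed $250 fraud, one detected fraud, one false alarm.
y_true = [1, 1, 0, 0]
y_pred = [0, 1, 1, 0]
amounts = [250.0, 900.0, 40.0, 15.0]
cost = misclassification_cost(y_true, y_pred, amounts)
```

Keeping `fp_cost` as a parameter makes it easy to check how sensitive a comparison between two models is to the assumption that a verification message is free.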

Conclusion
A new criterion, financial recovery, is created to judge the performance of classification algorithms based on financial loss. A weighted SVM model with random undersampling, using the transaction amount as the weight for fraudulent data points, is developed and applied to records of credit card transactions at a bank in Europe that occurred over a two-day period. The results show that using the new criterion and the novel weighting scheme can greatly improve the performance of credit card fraud detection. Most importantly, this strategy will minimize the bank's financial loss from credit card fraud.