
Credit card fraud data are highly imbalanced: an overwhelmingly large portion of transactions are nonfraudulent and only a small portion are fraudulent. The measures used to judge the veracity of detection algorithms therefore become critical to deploying a model that accurately scores fraudulent transactions, taking into account both the class imbalance and the cost of identifying a case as genuine when, in fact, it is fraudulent. In this paper, a new criterion for judging classification algorithms, which considers the cost of misclassification, is proposed, and several undersampling techniques are compared by this new criterion. At the same time, a weighted support vector machine (SVM) algorithm that considers the financial cost of misclassification is introduced, proving to be more practical for credit card fraud detection than traditional methodologies. This weighted SVM uses transaction balances as weights for fraudulent transactions and a uniform weight for nonfraudulent transactions. The results show that this strategy greatly improves the performance of credit card fraud detection.

Credit card use is popular in the United States and continues to grow steadily. The opportunity for fraudulent credit card transactions will also increase, as documented in 2019 by the Federal Trade Commission's Consumer Sentinel Network Data Book, January 2020 ( [

Researchers have developed machine learning algorithms to predict credit card fraud. Although this research has progressed well, it remains challenging in the following areas: how to preprocess imbalanced data, how to choose a criterion to judge the performance of different algorithms, and how to find an efficient and effective algorithm.

Real-world credit card transaction data sets are highly imbalanced, with many nonfraudulent records versus only a few fraudulent records; in the data set used in this study, illicit transactions account for only 0.17% of all transactions.

Note, however, that misclassification of a fraudulent transaction is far more serious than misclassification of a nonfraudulent transaction: misclassifying a fraudulent transaction as nonfraudulent results in a financial loss for the bank, whereas for a nonfraudulent transaction misclassified as fraud, the bank only needs to send a verification message to the customer. Misclassifying different fraudulent transactions also costs the bank differently. In the data set used in this paper, the minimum fraudulent transaction is $0.10 and the maximum is $25,691.00, so misclassifying the maximum transaction costs far more than misclassifying the minimum.

Most industry practitioners and academics use accuracy, precision-recall, or area under the curve (AUC) to measure the performance of a classifier, none of which reflects the financial cost of individual misclassifications. In this paper, a new measure is introduced to overcome this flaw.

Because of the enormous size of credit card transaction data sets, training a sophisticated machine learning model on a data set without any modification consumes a large amount of computational resources and time, and is therefore impractical and inefficient. The imbalance in the data set also causes problems for training, which skews toward the prevailing class; since classifying all cases as the prevailing class can achieve a very high but useless accuracy rate, accommodating the imbalance is critical.

Usually, undersampling is employed by taking all the positive cases and a sample (possibly of equal size, or some percentage greater) of negative cases, and preprocessing the data as needed before applying machine learning models to classify the data set. To reduce the computational burden, an online SVM algorithm can also be used to shorten training time when an SVM model is investigated. This online SVM (LASVM) automatically chooses only the most informative data, instead of using all the available data, in every iteration [

Researchers have found, and logic indicates, that if a data set is highly imbalanced, classification accuracy using a standard SVM or logistic regression will be biased toward the prevailing class. This bias can be corrected by introducing a weight into the regularization term associated with the loss function; assigning different weights to different classes in the regularization process helps ameliorate the imbalance bias. For SVM, it has been found that if the weights for the two classes are the inverses of their data sizes, unbiased accuracy can be achieved for both classes [

Although this modification can help improve the prediction accuracy of the minority class (in our case, fraudulent transactions), further improvement can be achieved by assigning different weights to different individual data points. Assigning large weights to the fraudulent transactions with higher potential loss guarantees that those points have a higher chance of being correctly detected by the model, and achieving this outcome is one of the main goals of this study.
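Class-level weighting of this kind is available off the shelf; the following is a minimal sketch using scikit-learn's `class_weight` option, with a toy data set standing in for real transactions (all names here are illustrative, not the paper's actual setup):

```python
# Sketch: class-level weighting in an SVM to counter class imbalance.
# The data set and parameters below are illustrative, not the paper's.
from sklearn.svm import SVC
from sklearn.datasets import make_classification

# Build a toy imbalanced data set (~99% negative, ~1% positive).
X, y = make_classification(
    n_samples=2000, weights=[0.99, 0.01], random_state=0
)

# class_weight="balanced" sets each class weight inversely proportional
# to its class size, the scheme described in the paragraph above.
clf = SVC(kernel="linear", class_weight="balanced")
clf.fit(X, y)

print(clf.score(X, y))
```

With `class_weight="balanced"`, misclassifying a minority-class point is penalized roughly 100 times more heavily here than a majority-class point.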

The rest of the paper is arranged as follows: Section 2 introduces related work in this field, including evaluation measures for imbalanced data classification, preprocessing techniques, and model development for credit card fraud detection. Section 3 establishes the theoretical framework, developing a weighted SVM that considers the financial cost of individual instances and introducing a new evaluation measure specifically designed for credit card fraud detection. Section 4 presents the experimental computations, and the last section gives the conclusions of this study.

Broadly speaking, credit card fraud detection belongs to the category of imbalanced data classification, and research achievements in imbalanced data classification can be directly applied to credit card fraud detection. In any data classification problem, the most fundamentally important issue is choosing a valid measure to precisely and accurately evaluate classification, and in the imbalanced data case its significance is doubled.

Besides the famous, and widely used, measures of accuracy, AUC score, and precision-recall, researchers also developed and investigated other measurements, such as G-mean, discriminant power, and likelihood ratio [

The idea of adjusting methodologies using a weighting scheme to account for imbalance and cost, as well as adjusting the evaluation metrics, is not well researched, although [

[

converting sample-dependent costs into sample weights, which is also known as cost-sensitive learning by example weighting. The weighted training samples are then fed to standard learning algorithms. This approach operates at the data level without changing the underlying learning algorithms.

For updates to the model weights that take cost into account [

the tree-building strategies are adapted to minimize the misclassification costs.

Using these methodologies, [

The approach in this paper not only updates weights with respect to cost or sample, but also presents a new evaluation metric to account for the "cost," in the form of transaction balance, enhancing model selection, minimizing costs, and ameliorating, in part, the data imbalance.

Most standard machine learning techniques cannot handle data sets with highly skewed distributions; accuracy will be highly biased toward the prevailing class. In the case of credit card fraud detection, only 0.17% of all transactions are illicit, so prediction accuracy for the minority class will be poor. To overcome the problems introduced by data imbalance, resampling techniques are often applied to the original data set to adjust the imbalance and create unbiased predictions.

Random undersampling and oversampling are simple techniques that can help solve the problem of data skewness, but they often introduce non-informative or ill-informative substructures into the data set. [

There is a variety of choices for classifying credit card fraud. Logistic regression (LR), SVM, and random forest (RF) are the three most frequently chosen classification techniques across many different applications. [

AdaBoost has also been a good choice because assigning weights to different classes can be achieved within the algorithm, which helps predict the minority class more accurately. [

Deep learning techniques, such as convolutional neural networks and recurrent neural networks, have had great success in computer vision and language processing, which require very sophisticated algorithms to distinguish different features; however, in the area of credit card fraud detection, these methods have not had such success to date. [

Besides supervised classification algorithms, unsupervised learning algorithms can also be employed for the purpose of fraud detection, as [

The principal ideas surrounding the support vector machine started with [

Thus, for binary classification, when two classes can be completely separated, the problem is characterized by a training data set $\{x_1, y_1\}, \{x_2, y_2\}, \cdots, \{x_n, y_n\}$, in which $x_i$ is a vector of $d$ dimensions and $y$ is a scalar $\in \{+1, -1\}$. Therefore, $y$ labels the data as belonging to one class or the other, and assuming linear separability, a straightforward algorithm finds a hyperplane, a linear combination of the $x_i$, that separates the two classes. The linear separator is $y = w \cdot \Phi(x_i) + b$, in which $\Phi$ is a feature function specified by hand, and $w$ and $b$ are parameters determined by the learning algorithm on the training data. The criterion for deciding that a data point belongs to a specific class is:

$$y_i \cdot (w \cdot \Phi(x_i) + b) \geq 1$$

Rosenblatt in 1962 described this algorithm with the perceptron, as mentioned in ( [

$$\min_{w,b} \ \frac{1}{2}\|w\|^2$$

$$\text{s.t.} \quad y_i \cdot (w \cdot \Phi(x_i) + b) \geq 1, \quad i = 1, 2, \cdots, l \qquad (1)$$

In the real world, the classes cannot always be separated cleanly. Sometimes there are points on the wrong side of the hyperplane; that is, a separating hyperplane may not exist ( [

$$\min_{w,b} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{l} \xi_i$$

$$\text{s.t.} \quad y_i \cdot (w \cdot \Phi(x_i) + b) \geq 1 - \xi_i, \quad i = 1, 2, \cdots, l$$

$$\xi_i \geq 0, \quad i = 1, 2, \cdots, l \qquad (2)$$

As shown in Equation (2), a second term, $\xi_i$, is introduced to handle cases of misclassification. The user-specified parameter $C$ is the weight for the cost of misclassification: a large $C$ gives a high penalty for misclassification, and a small $C$ gives a low penalty. As shown in

The primal problem above, Equation (2), can also be transformed into a dual problem, Equation (3), according to Lagrange duality, where $\alpha_i$ is the Lagrange multiplier.

$$\min_{\alpha} \ \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j K\langle x_i, x_j \rangle - \sum_{i=1}^{l} \alpha_i$$

$$\text{s.t.} \quad \sum_{i=1}^{l} \alpha_i y_i = 0$$

$$0 \leq \alpha_i \leq C, \quad i = 1, 2, \cdots, l \qquad (3)$$

Equation (3) naturally introduces the kernel function to SVM, which is the most powerful characteristic of SVM. The kernel function, $K\langle x_i, x_j \rangle$, can be applied to transform the linear hyperplane into a nonlinear hypersurface. It also maps low-dimensional features to high-dimensional features (in some cases, infinite dimensions) without explicitly building the high-dimensional features, which circumvents the curse of dimensionality. The only drawback of SVM is that the computation is at least quadratic in the data size, which makes SVM hard to train on large data sets [

In the standard SVM methodology, the weight for the penalty of misclassification is the same for every data point. A weighted SVM departs from this uniform-weight paradigm by making the penalty for an individual data point differ across transactions according to the potential financial loss. Introducing another model parameter, $S_i$, representing the weighted financial loss of each misclassified data point, the weighted SVM can be augmented and improved, and written as in Equation (4) below:

$$\min_{w,b} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{l} S_i \xi_i$$

$$\text{s.t.} \quad y_i \cdot (w \cdot \Phi(x_i) + b) \geq 1 - \xi_i, \quad i = 1, 2, \cdots, l$$

$$\xi_i \geq 0, \quad i = 1, 2, \cdots, l \qquad (4)$$

In this paper, weights are introduced into the loss function to reflect the financial importance of each transaction. The weights for legitimate transactions, $S_{nf}$, are all assigned the same value; the weights for fraudulent transactions are assigned proportional to the amount of money transferred. To solve the weighted SVM optimization problem, Lagrange multipliers can again be introduced; the resulting dual equations are similar to those of the standard SVM, Equation (3). The only difference is that the constraint $0 \leq \alpha_i \leq C$ is changed to $0 \leq \alpha_i \leq C S_i$.
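In practice, per-instance weights $S_i$ can be passed to a standard SVM solver, which scales each slack penalty to $C S_i \xi_i$ as in Equation (4). A minimal sketch assuming scikit-learn's `sample_weight` interface (the toy data, `amount` array, and `S_nf` value are illustrative):

```python
# Sketch: per-transaction weights S_i for a weighted SVM, assuming
# scikit-learn's sample_weight API. Data and weights are illustrative.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 2))
y = np.zeros(n, dtype=int)
y[:25] = 1                            # ~5% "fraud" in this toy example
X[y == 1] += 2.0                      # shift fraud cases so they are learnable
amount = rng.uniform(1, 1000, size=n) # toy transaction balances

S_nf = 10.0                           # uniform nonfraud weight (tunable)
weights = np.where(y == 1, amount, S_nf)

clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y, sample_weight=weights)  # slack for point i is penalized by C * S_i
```

With this weighting, a misclassified high-balance fraud contributes a much larger penalty than a misclassified low-balance fraud, which is exactly the effect the paper seeks.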

To align this paper with others in the literature, fraudulent transactions will be labeled positive cases and nonfraudulent transactions negative cases. Again following the literature, false negatives (type II errors) are fraud cases classified as nonfraud, and false positives (type I errors) are nonfraud cases classified as fraud. Accuracy is defined in Equation (5); the definitions of precision and recall follow the accepted conventions, see Equations (6) and (7).

The contribution of this paper is the introduction of a new measure, which we call financial recovery, defined in Equation (8) as the monetary amount of correctly detected fraudulent transactions divided by the monetary amount of all fraudulent transactions. This new financial recovery measure is much more practical than other measures for this application, since the objective of detecting fraudulent transactions is to minimize the financial loss for a firm or financial institution.

$$\mathrm{accuracy} = \frac{\mathrm{true\ positive} + \mathrm{true\ negative}}{\mathrm{total\ samples}} \qquad (5)$$

$$\mathrm{precision} = \frac{\mathrm{true\ positive}}{\mathrm{true\ positive} + \mathrm{false\ positive}} \qquad (6)$$

$$\mathrm{recall} = \frac{\mathrm{true\ positive}}{\mathrm{true\ positive} + \mathrm{false\ negative}} \qquad (7)$$

$$\mathrm{financial\ recovery} = \frac{\mathrm{amount\ of\ transactions\ in\ true\ positive}}{\mathrm{amount\ of\ transactions\ of\ all\ positive}} \qquad (8)$$
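The financial recovery measure of Equation (8) is straightforward to compute from labels, predictions, and transaction amounts; a minimal sketch (the function name and toy values are illustrative):

```python
# Sketch of the financial recovery measure in Equation (8): the monetary
# amount of correctly detected frauds over the amount of all frauds.
import numpy as np

def financial_recovery(y_true, y_pred, amounts):
    """Fraction of total fraud money captured by the classifier."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    amounts = np.asarray(amounts, dtype=float)
    total_fraud = amounts[y_true == 1].sum()
    if total_fraud == 0:
        return 0.0
    recovered = amounts[(y_true == 1) & (y_pred == 1)].sum()
    return recovered / total_fraud

# Toy check: two frauds ($100 and $900); only the $900 one is caught.
print(financial_recovery([1, 1, 0, 0], [0, 1, 0, 1], [100, 900, 50, 50]))  # 0.9
```

Note that recall would score this toy example at 0.5 (one of two frauds caught), while financial recovery scores it at 0.9, reflecting that the expensive fraud was detected.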

The data set used in this paper was downloaded from Kaggle.com, an online data science platform with publicly available data; the records include credit card transactions from a European bank over a two-day period in September 2013. Each record includes 30 features, all derived principal components of a set of original features, except the first feature, time, and the last feature, the monetized card transaction amount, both of which are native to the original data set.

The total number of credit card transactions in the database is 284,807, with a scarcity of fraudulent transactions, only 492, a mere 0.17% of all records, as shown in

This classification problem will be carried out as follows:

· Three commonly used algorithms, LR, SVM, and RF, will be applied to the data. The results will be used as the benchmark for the more advanced algorithms.

· The weighted SVM will be applied to the data. Three undersampling techniques are compared and sampling size will be optimized.

· The weights of nonfraudulent classes will be optimized while individual weights of fraudulent data points are assigned as the transaction balance.

· The impact of kernel functions, such as sigmoid, polynomial, and radial basis function (RBF), on the performance of SVM algorithm will be investigated.

Python is the programming language, with the scikit-learn package used for all the aforementioned algorithms. Data preprocessing was carried out with the preprocessing function StandardScaler to put all the data on the same scale, which is very important for SVM. Finally, to keep the results consistent and repeatable across algorithms, the random-state seed was set when calling functions from the scikit-learn package.
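The preprocessing described above can be sketched as follows; the toy data stands in for the Kaggle set, and the seed value 42 is illustrative:

```python
# Sketch of the preprocessing pipeline: StandardScaler plus a fixed
# random_state for repeatability. The toy data below is illustrative.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(42)
X = rng.normal(loc=5.0, scale=3.0, size=(300, 4))  # features on an arbitrary scale
y = (X[:, 0] > 5).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42           # fixed seed for repeatability
)

# Fit the scaler on the training split only, then apply to both splits,
# so every feature has zero mean and unit variance (important for SVM margins).
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

clf = SVC(kernel="linear", random_state=42).fit(X_train_s, y_train)
```

Fitting the scaler on the training split alone avoids leaking test-set statistics into training, a standard precaution with scale-sensitive models such as SVM.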

The benchmark classification process used the three most common algorithms, logistic regression (LR), support vector machine (SVM), and random forest (RF), under the benchmark rubric of no parameter tuning, with all parameters at their default settings.

The results in

The recall measure gives the percentage of fraudulent transactions found in the data. The recall of SVM is 61.6%, which is 2.9% less than LR and 0.7% more than RF, a mixed result with no clear methodological winner.

| Model | Accuracy | Precision | Recall | Financial recovery |
|---|---|---|---|---|
| LR | 0.999 | 0.856 | 0.645 | 0.428 |
| SVM | 0.999 | 0.895 | 0.616 | 0.475 |
| RF | 0.999 | 0.875 | 0.609 | 0.379 |

The newly introduced and very important performance metric is financial recovery, which represents the monetary amount of correctly detected fraudulent transactions as a percentage of the amount of all fraudulent transactions. The financial recovery score for SVM is 47.5%, the highest of the three algorithms presented, and is of primary importance since financial recovery is paramount to the company.

Overall, the above results indicate the SVM model maintains the most consistency among the three algorithms presented, and therefore will be used as a benchmark model for further evaluation and investigation in the remainder of the paper.

The computational burden of SVM may extend running time beyond practical limits as [

Training an SVM requires solving a constrained quadratic programming problem, which usually takes O(m³) computations, where m is the number of examples. Predicting a new example involves O(sv) computations, where sv is the number of support vectors and is usually proportional to m. Consequently, SVMs' training time, and prediction time to a lesser extent, on a very large data set can be quite long, making them impractical for some real-world applications.

And as Kramer indicates above, runtime for the credit card data set considered here with 284,807 records could be extensive.

To accelerate the computational efficiency of SVM, random, NearMiss, and k-nearest neighbors (KNN) undersampling techniques were employed, where the fraudulent transactions were kept untouched and the nonfraudulent transactions were undersampled. The random technique, as [

A simple undersampling technique is uniformly random undersampling of the majority class. This can potentially lead to loss of information about the majority class. However, in cases where each example of the majority class is near other examples of the same class, this method might yield good results.
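Random undersampling as described above is simple to implement directly; a minimal sketch with the 10:1 nonfraud-to-fraud ratio used later in this paper (the function name, toy data, and seed are illustrative):

```python
# A minimal sketch of random undersampling: keep every fraud row and
# draw a random 10:1 sample of nonfraud rows. Plain NumPy, toy data.
import numpy as np

def random_undersample(X, y, ratio=10, seed=0):
    """Keep all fraud rows (y == 1) and ratio * n_fraud random nonfraud rows."""
    rng = np.random.default_rng(seed)
    fraud_idx = np.flatnonzero(y == 1)
    nonfraud_idx = np.flatnonzero(y == 0)
    keep = rng.choice(nonfraud_idx, size=ratio * len(fraud_idx), replace=False)
    idx = np.concatenate([fraud_idx, keep])
    rng.shuffle(idx)                  # avoid class-ordered training data
    return X[idx], y[idx]

# Toy data: 1000 nonfraud and 20 fraud rows -> 220 rows after undersampling.
X = np.zeros((1020, 3))
y = np.array([0] * 1000 + [1] * 20)
X_s, y_s = random_undersample(X, y)
print(len(y_s))  # 220
```

Sampling without replacement keeps each retained nonfraud row unique, and the fixed seed makes the reduced training set reproducible across runs.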

[

In NearMiss-1, those points from L (majority class) are retained whose mean distance to the k nearest points in S (minority class) is lowest, where k is a tunable hyperparameter.

[

In CNN undersampling, the goal is to choose a subset U of the training set T such that for every point in T its nearest neighbor in U is of the same class.

First, these algorithms are discussed below in relation to accuracy, precision, recall, and the new measure introduced here, financial recovery; second, the most important parameter for random undersampling, the sample size, is considered. Note that for random undersampling, the number of nonfraudulent samples was set at 10 times the number of fraudulent samples, while the NearMiss and KNN undersampling techniques used default values.

The results in

At the same time, using random undersampling, the total number of samples was reduced from 199,364 to 3894, and compared to the standard SVM, the training time was reduced from more than two hours to less than one minute on a PC with an Intel Core i7 CPU and 32 GB of memory. Compared with the SVM benchmark model, random undersampling not only dramatically improves computational efficiency but also increases financial recovery from 47.5% to 84.6%, although precision is reduced from 89.5% to 35.7%, as shown in

Then the ratio of nonfraudulent to fraudulent samples for SVM with the random undersampling technique was varied to find the optimal ratio, and

| Undersampling | Accuracy | Precision | Recall | Financial recovery |
|---|---|---|---|---|
| Random | 0.997 | 0.357 | 0.862 | 0.846 |
| Nearmiss | 0.975 | 0.054 | 0.899 | 0.847 |
| KNN | 0.973 | 0.053 | 0.935 | 0.890 |

Usually, weights are assigned identically for every member of a class [

In all the above results, SVM employed a linear kernel. To further improve the performance of the weighted SVM, three nonlinear kernel functions were investigated: radial basis function (RBF), polynomial, and sigmoid, which can enhance performance for diverse data structures [

| $S_{nf}$ | Accuracy | Precision | Recall | Financial recovery |
|---|---|---|---|---|
| 0.1 | 0.723 | 0.006 | 0.964 | 0.998 |
| 1 | 0.893 | 0.013 | 0.949 | 0.997 |
| 4 | 0.954 | 0.032 | 0.928 | 0.996 |
| 10 | 0.976 | 0.058 | 0.913 | 0.996 |
| 20 | 0.987 | 0.098 | 0.884 | 0.953 |
| 30 | 0.992 | 0.144 | 0.862 | 0.846 |
| 40 | 0.994 | 0.199 | 0.855 | 0.846 |
| 80 | 0.998 | 0.417 | 0.841 | 0.846 |

| Kernel name | Kernel functions [ |
|---|---|
| RBF | $\exp(-\gamma \|x - x'\|^2)$ |
| Polynomial | $(\gamma \langle x, x' \rangle + r)^d$ |
| Sigmoid | $\tanh(\gamma \langle x, x' \rangle + r)$ |

| Kernel | $S_{nonfraud}$ | $\gamma$ | Degree | $r$ | Accuracy | Precision | Financial recovery |
|---|---|---|---|---|---|---|---|
| Linear | 10 | n/a | n/a | n/a | 0.976 | 0.058 | 0.996 |
| RBF | 0.1 | 0.05 | n/a | n/a | 0.937 | 0.024 | 0.997 |
| Polynomial | 0.5 | 0.01 | 2 | 0.1 | 0.938 | 0.024 | 0.997 |
| Sigmoid | 0.8 | 0.01 | n/a | 1e−5 | 0.738 | 0.006 | 0.997 |

The best weight for the nonfraudulent class with the RBF kernel is 0.1, with the best gamma at 0.05; the best weight for the nonfraudulent class with the polynomial kernel is 0.5, with a kernel of degree 2. The optimized weight for the nonfraudulent class using the sigmoid function is 0.8, with the best gamma at 0.01. It can also be seen that using kernel functions did not significantly improve classification performance, demonstrating that the most important factor in improving the SVM algorithm is using the right weights for individual data points.
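A search over kernels, gammas, and nonfraud class weights like the one summarized above can be sketched with a standard grid search; the grid values and toy data below are illustrative, not the paper's actual configuration:

```python
# Sketch: searching over kernels and the nonfraud class weight with
# scikit-learn's GridSearchCV. Grid values and data are illustrative.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 2))
y = (X[:, 0] + X[:, 1] > 1.5).astype(int)  # imbalanced toy labels (~14% positive)

# One sub-grid per kernel, each with its own relevant hyperparameters.
param_grid = [
    {"kernel": ["linear"],
     "class_weight": [{0: w, 1: 1.0} for w in (0.1, 1, 10)]},
    {"kernel": ["rbf"], "gamma": [0.01, 0.05],
     "class_weight": [{0: 0.1, 1: 1.0}]},
]

# Recall is used here as the selection score; a financial-recovery scorer
# could be swapped in via sklearn.metrics.make_scorer.
search = GridSearchCV(SVC(), param_grid, cv=3, scoring="recall")
search.fit(X, y)
print(search.best_params_["kernel"])
```

Passing a list of sub-grids keeps kernel-specific parameters (such as `gamma`) out of the linear kernel's search space, which shortens the search.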

· Recall improves materially from the linear case, which outweighs the drop in accuracy, since financial recovery increases with the kernel functions.

· With the linear kernel, about 2% of nonfraudulent transactions are classified as fraudulent. This rises to 6% with the RBF or polynomial kernel, and to more than 20% with the sigmoid kernel.

· The drop in precision, which flags more nonfraudulent transactions as fraudulent, cannot be justified by the minor increase in financial recovery.

The linear kernel's performance is superior to that of more complex kernels given optimal weighting of the nonfraudulent cases, satisfying the desirable statistical property of parsimonious modeling.

To evaluate the performance of our model, the confusion matrix results of a weighted SVM model and standard SVM are compared in

A new criterion, financial recovery, is created to judge the performance of classification algorithms based on financial loss. A weighted SVM model with a random undersampling methodology, using the transaction amount as the weight for fraudulent data points, is developed and applied to records of credit card transactions at a European bank over a two-day period. The results show that using the new criterion and novel weighting scheme can greatly improve the performance of credit card fraud detection. Most importantly, this strategy minimizes the bank's financial loss from credit card fraud.

The authors declare no conflicts of interest regarding the publication of this paper.

Zhang, D.F., Bhandari, B. and Black, D. (2020) Credit Card Fraud Detection Using Weighted Support Vector Machine. Applied Mathematics, 11, 1275-1291. https://doi.org/10.4236/am.2020.1112087