Comparison of Several Data Mining Methods in Credit Card Default Prediction

LightGBM is an open-source, distributed, high-performance gradient boosting (GB) framework developed by Microsoft. Its advantages include fast training speed, efficient parallelism, and the ability to handle high-volume data. Based on an open credit card data set from Taiwan, this paper compares five data mining methods: Logistic regression, SVM, neural network, Xgboost and LightGBM. The results show that LightGBM achieves the best AUC, F1-Score and prediction accuracy, with Xgboost second. This indicates that LightGBM and Xgboost perform well in predicting categorical response variables and have good application value in the big data era.

used discriminant analysis to score the credit and behavior of borrowers; Yeh and Lien [2] used Logistic regression, decision trees, artificial neural networks and other algorithms to predict customer default payments in Taiwan and compared the prediction accuracy of these algorithms, finding that the accuracy of the artificial neural network was slightly higher than that of the other five methods. Mei Ruiting, Xu Yang and Wang Guochang [3] explored the key factors affecting customer credit by establishing Lasso-Logistic and random forest models. Their results show that the prediction accuracy of the random forest is higher than that of Lasso-Logistic.

Description of the Data

Feature Description
This article is based on credit card customer data from April to September 2005 in Taiwan

Logistic Regression
Logistic regression is a special kind of linear regression model. However, a binary response variable violates the normality assumption of the general regression model.
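As a minimal sketch of how logistic regression handles a binary response, the linear predictor can be passed through the logistic (sigmoid) function so that the output is a probability between 0 and 1, which is equivalent to modelling the log-odds as a linear function. The function names, coefficients and feature values below are hypothetical illustrations, not values from this paper.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_default_probability(features, coefficients, intercept):
    """Probability of class 1 under a logistic model (hypothetical helper)."""
    z = intercept + sum(b * x for b, x in zip(coefficients, features))
    return sigmoid(z)

def log_odds(p):
    """Inverse of the sigmoid: the log-odds (logit) of a probability."""
    return math.log(p / (1.0 - p))

# Illustrative (made-up) coefficients and features.
p = predict_default_probability([1.0, 2.0], [0.5, -0.25], 0.0)
```

The log-odds of the predicted probability recover the linear predictor, which is why the model stays linear in the features even though the response is binary.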

Neural Network
Artificial neural networks use nonlinear mathematical equations to iteratively establish meaningful relationships between input and output variables through a learning process. We apply backpropagation networks to classify the data. Backpropagation neural networks use a feedforward topology and supervised learning.
The structure of a backpropagation network typically consists of an input layer, one or more hidden layers, and an output layer, each consisting of several neurons. Artificial neural networks can easily handle nonlinearities and interactions among explanatory variables. Their main disadvantage is that they do not yield a simple formula for the classification probability.
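The layered structure and the backpropagation update described above can be sketched as follows. This is a toy illustration only: it assumes one hidden layer with sigmoid activations, a squared-error loss, and a single made-up training example, none of which are stated in this paper.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w_hidden, w_out):
    """Feedforward pass. Each weight row is [bias, w_1, w_2, ...]."""
    hidden = [sigmoid(row[0] + sum(w * xi for w, xi in zip(row[1:], x)))
              for row in w_hidden]
    output = sigmoid(w_out[0] + sum(w * h for w, h in zip(w_out[1:], hidden)))
    return hidden, output

def train_step(x, y, w_hidden, w_out, lr):
    """One backpropagation step; returns the loss before the update."""
    hidden, output = forward(x, w_hidden, w_out)
    # Loss = 0.5 * (output - y)^2; error signal at the output neuron.
    delta_out = (output - y) * output * (1.0 - output)
    # Propagate the error back to each hidden neuron (using pre-update weights).
    delta_hidden = [delta_out * w_out[j + 1] * h * (1.0 - h)
                    for j, h in enumerate(hidden)]
    # Gradient descent updates for the output layer.
    w_out[0] -= lr * delta_out
    for j, h in enumerate(hidden):
        w_out[j + 1] -= lr * delta_out * h
    # Gradient descent updates for the hidden layer.
    for j, row in enumerate(w_hidden):
        row[0] -= lr * delta_hidden[j]
        for i, xi in enumerate(x):
            row[i + 1] -= lr * delta_hidden[j] * xi
    return 0.5 * (output - y) ** 2

# Hypothetical initial weights and a single made-up example (x, y).
w_hidden = [[0.0, 0.5, -0.5], [0.0, -0.3, 0.8]]
w_out = [0.0, 0.4, 0.6]
x, y = [1.0, 0.0], 1.0
losses = [train_step(x, y, w_hidden, w_out, lr=0.5) for _ in range(200)]
final_output = forward(x, w_hidden, w_out)[1]
```

Repeating the step drives the output towards the target label, so the recorded loss shrinks over the iterations.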

Support Vector Machine (SVM)
SVM is a pattern recognition method based on statistical learning theory. It uses a kernel function to map the data X from the input space into a high-dimensional feature space, where the generalized optimal classification surface is obtained, so that data that are linearly inseparable in the original space can be linearly classified in the high-dimensional space. The main difficulty of SVM is that, once the kernel function is chosen, solving the classification problem requires solving a quadratic programming problem, which requires a large amount of storage space.
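The kernel mapping described above can be illustrated with a small sketch. It assumes a degree-2 polynomial kernel and XOR-style toy points, both chosen here for illustration and not taken from this paper. The kernel value equals an inner product in an explicit 3-dimensional feature space, and the toy points become linearly separable there.

```python
import math

def poly_kernel(x, z):
    """Degree-2 polynomial kernel k(x, z) = (x . z)^2 on 2-D inputs."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2

def feature_map(x):
    """Explicit map to 3-D space whose inner product equals poly_kernel."""
    return (x[0] ** 2, math.sqrt(2.0) * x[0] * x[1], x[1] ** 2)

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

# XOR-style toy points: no straight line separates the classes in 2-D.
points = [(1.0, 1.0), (-1.0, -1.0), (1.0, -1.0), (-1.0, 1.0)]
labels = [1, 1, -1, -1]

# In feature space, the sign of the second coordinate separates the classes.
separated = [1 if feature_map(p)[1] > 0 else -1 for p in points]
```

Because the kernel can be evaluated directly in the input space, the high-dimensional coordinates never need to be computed in practice.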

Xgboost
Boosting is a very effective ensemble learning algorithm [4]. The boosting method can combine weak classifiers into a strong classifier to achieve accurate classification. The steps are as follows: 1) all training samples are given the same weight; 2) in the mth iteration, the samples are classified by a classification algorithm, and the error rate

LightGBM
LightGBM is a gradient boosting framework based on tree learning. The main difference between it and the Xgboost algorithm is that it uses a histogram-based algorithm to speed up training and reduce memory consumption, and adopts a leaf-wise growth strategy with a depth limit [5]. The following describes the histogram algorithm and the depth-limited leaf-wise growth strategy.

Histogram Algorithm
The basic idea of the histogram algorithm is to discretize continuous floating-point feature values into k integers and construct a histogram with k bins.
When traversing the data, the statistics are accumulated in the histogram using the discretized value as an index. After traversing the data once, the histogram holds the accumulated statistics (Figure 1).
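A minimal sketch of this idea, assuming equal-width bins (LightGBM's actual binning scheme may differ) and a hypothetical helper name, is shown below. Each feature value is mapped to an integer bin index, and a per-sample statistic is accumulated into that bin during a single pass over the data.

```python
def build_histogram(values, stats, k):
    """Discretize `values` into k equal-width bins and accumulate `stats` per bin.

    Equal-width binning is an assumption made for illustration.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1.0  # fall back to 1.0 if all values are equal
    counts = [0] * k
    sums = [0.0] * k
    for v, s in zip(values, stats):
        # Integer bin index; the maximum value is clamped into the last bin.
        b = min(int((v - lo) / width), k - 1)
        counts[b] += 1
        sums[b] += s
    return counts, sums

# Made-up feature values and per-sample statistics.
counts, sums = build_histogram(
    [0.0, 1.0, 2.5, 4.0, 7.5, 10.0], [1.0, 1.0, 2.0, 2.0, 3.0, 3.0], k=4
)
```

After this pass, candidate split points only need to be searched over the k bins rather than over every distinct feature value.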

Leaf-Wise Leaf Growth Strategy with Depth Limitation
Decision trees are generally grown Level-wise, which is inefficient because it treats all leaves of the same layer indiscriminately, resulting in a lot of unnecessary memory consumption. Leaf-wise is a more efficient strategy: at each step, it finds the leaf with the highest split gain among all current leaves, splits it, and repeats. Therefore, compared with Level-wise, Leaf-wise can reduce more error and achieve better precision for the same number of splits. The disadvantage of Leaf-wise is that it may grow a deeper decision tree and over-fit. Therefore, LightGBM adds a maximum depth limit on top of Leaf-wise to prevent over-fitting while maintaining high efficiency. The leaf-wise growth process is shown in Figure 2.
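The leaf-wise selection rule with a depth limit can be sketched with a priority queue. The toy tree, its gain values and the function name below are hypothetical and serve only to show the order in which leaves are split; this is not LightGBM's implementation.

```python
import heapq

# Hypothetical toy tree: node -> (split gain, children). Values are illustrative.
TOY_TREE = {
    "root": (10.0, ("A", "B")),
    "A": (1.0, ("A1", "A2")),
    "B": (8.0, ("B1", "B2")),
    "B1": (5.0, ("B1a", "B1b")),
    "B2": (0.5, ()),
    "A1": (0.2, ()), "A2": (0.1, ()),
    "B1a": (3.0, ()), "B1b": (0.0, ()),
}

def leaf_wise_splits(tree, root, num_splits, max_depth):
    """Return the order in which leaves are split under a leaf-wise strategy."""
    heap = [(-tree[root][0], 0, root)]  # (negated gain, depth, node)
    order = []
    while heap and len(order) < num_splits:
        _, depth, node = heapq.heappop(heap)  # leaf with the highest gain
        children = tree[node][1]
        if not children or depth >= max_depth:
            continue  # this leaf cannot be split (no split or depth limit hit)
        order.append(node)
        for child in children:
            heapq.heappush(heap, (-tree[child][0], depth + 1, child))
    return order

deep = leaf_wise_splits(TOY_TREE, "root", num_splits=3, max_depth=3)
limited = leaf_wise_splits(TOY_TREE, "root", num_splits=3, max_depth=2)
```

In this toy tree, three leaf-wise splits collect a total gain of 10 + 8 + 5 = 23, whereas splitting level by level (root, A, B) would collect 10 + 1 + 8 = 19. Lowering the depth limit forces the third split back to the shallower leaf A.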

K-fold cross-validation is a commonly used method for assessing accuracy in machine learning. Its purpose is to obtain a reliable and stable model. In general, when the response variable is quantitative, cross-validation uses the mean squared error as the measure of test error. In classification problems, where the response variable is qualitative, cross-validation uses the CV error rate as the measure. The K-fold CV error rate takes the following form:
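As a hedged sketch, the K-fold CV error rate is assumed here to take its standard form, $CV_{(K)} = \frac{1}{K}\sum_{k=1}^{K} Err_k$, where $Err_k$ is the misclassification rate on the k-th held-out fold. The helper names, the threshold classifier and the data below are hypothetical and only illustrate the procedure.

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k contiguous folds of near-equal size."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cv_error_rate(xs, ys, k, fit, predict):
    """Average misclassification rate over k held-out folds."""
    fold_errors = []
    for held_out in k_fold_indices(len(xs), k):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        wrong = sum(predict(model, xs[i]) != ys[i] for i in held_out)
        fold_errors.append(wrong / len(held_out))
    return sum(fold_errors) / k

# Toy 1-D classifier: threshold halfway between the two class means.
def fit(xs, ys):
    mean0 = sum(x for x, y in zip(xs, ys) if y == 0) / ys.count(0)
    mean1 = sum(x for x, y in zip(xs, ys) if y == 1) / ys.count(1)
    return (mean0 + mean1) / 2.0

def predict(threshold, x):
    return 1 if x > threshold else 0

# Made-up data; the point x = 4.0 is labelled 1 and is hard to classify.
xs = [1.0, 9.0, 2.0, 8.0, 3.0, 7.0, 4.0, 6.0]
ys = [0, 1, 0, 1, 0, 1, 1, 1]
error = cv_error_rate(xs, ys, k=4, fit=fit, predict=predict)
```

Only the fold containing the hard point contributes an error (one of its two samples is misclassified), so the four fold error rates 0, 0, 0 and 0.5 average to 0.125.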

Classification Evaluation
Under normal circumstances, for the two class labels 0 and 1, each prediction falls into one of four outcomes: true positive (TP), true negative (TN), false positive (FP) and false negative (FN). The accuracy (Acc) is defined as follows:

$$\mathrm{Acc} = \frac{TP + TN}{TP + TN + FP + FN}$$

Precision (Prec) and recall (Rec) are used to represent the general characteristics of the classifier. Precision is the percentage of cases marked as positive that are indeed positive. Recall, also known as the true positive rate, is the percentage of actual positive cases that are correctly identified as positive. According to Table 2, precision and recall are expressed as follows:

$$\mathrm{Prec} = \frac{TP}{TP + FP}, \qquad \mathrm{Rec} = \frac{TP}{TP + FN}$$

In general, when Prec is high, Rec is low, and when Rec is high, Prec is low. We need to balance the two, and use the $F_\beta$-score [6] to reconcile them. It is expressed as follows:

$$F_\beta = \frac{(1 + \beta^2)\,\mathrm{Prec} \cdot \mathrm{Rec}}{\beta^2\,\mathrm{Prec} + \mathrm{Rec}}$$
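The evaluation measures above can be computed directly from the four confusion-matrix counts. The following is a small sketch using the standard definitions; the function name and the counts in the example are made up for illustration.

```python
def classification_metrics(tp, tn, fp, fn, beta=1.0):
    """Return (Acc, Prec, Rec, F_beta) from confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    prec = tp / (tp + fp)   # share of predicted positives that are correct
    rec = tp / (tp + fn)    # share of actual positives that are found
    f_beta = (1 + beta ** 2) * prec * rec / (beta ** 2 * prec + rec)
    return acc, prec, rec, f_beta

# Made-up counts: 40 TP, 30 TN, 10 FP, 20 FN.
acc, prec, rec, f1 = classification_metrics(tp=40, tn=30, fp=10, fn=20)
```

With beta = 1 the score weights precision and recall equally; a larger beta gives more weight to recall.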

Ten Times Ten-Fold Cross-Validation Results
In this paper, ten repetitions of ten-fold cross-validation are used to verify the models built by the different data mining methods. The 10-fold CV error rate results are as follows. The 10-fold CV error rates of the 7th repetition are extracted, and the result is shown in Figure 3. The 10-fold CV error rates of the five data mining methods all fluctuate to some extent, but LightGBM fluctuates less widely and less often than the others.

Figure 3. The 10-fold CV error rate of the seventh repetition.

Table 3 shows the average 10-fold CV error rate of the five methods. It can be seen from Table 3 that the 10-fold CV error rates of the five methods are low, indicating that the five data mining methods are reasonably reliable. The differences in average 10-fold CV error rate between the five methods are small, but LightGBM's average is slightly lower than that of the other four methods.

Classification Results
The classification results obtained from 10-fold cross-validation are shown in Table 4.
Table 4 shows that the accuracy of all five data mining methods is above 79%, with only small differences between them; the accuracy of LightGBM is 82.29%. This indicates that all five methods classify well. The AUC values differ more: LightGBM has an AUC of 0.7904, and the other four methods have lower AUCs. In addition, LightGBM's F1-Score of 89.34% is higher than that of the other methods, indicating that LightGBM has the best classification performance on the problem studied in this paper.
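For readers unfamiliar with AUC, it can be interpreted as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below computes it by direct pairwise comparison; the function name, scores and labels are made up and are not results from this paper.

```python
def auc_by_pairs(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Made-up predicted scores and true labels.
auc = auc_by_pairs([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```

In this example three of the four positive-negative pairs are ordered correctly. The pairwise loop is quadratic in the sample size, so it is suited to illustration rather than large data sets.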

Conclusion
This paper discusses the classification performance of five data mining methods on classification problems. Taking a typical credit card default data set as an example, classifier models are established. 10-fold cross-validation shows that the five classifier models are reliable and stable. Ten repetitions of 10-fold cross-validation were performed to obtain the average AUC and accuracy of each model. LightGBM scored highest on all three evaluation indicators, indicating that it classifies well and outperforms the other four data mining methods.