P2P online lending, a new lending model in Internet finance, is popular because it is fast and low-cost. However, borrower default has always been one of the core concerns of lending platforms. Because borrower characteristic data are high-dimensional and multicollinear, how to select the key features for judging default behavior has been a hot topic. To address this problem, this paper uses data from the Lending Club platform, introduces the recursive feature elimination (RFE) method to select key variables, and combines it with classification models to predict borrowers’ default behavior. The results show that recursive feature elimination can screen out the key variables affecting borrower default. After recursive feature elimination, the accuracy of each classification model exceeds 95%.
With the development of Internet technology, private lending has moved from offline to online. P2P (peer-to-peer) online lending is a new model of Internet finance in which lenders and borrowers transact directly through an Internet platform and no longer need banks or other financial institutions as intermediaries. Its main characteristics are a low threshold and convenience: both lenders and borrowers can complete transactions online. It is therefore popular, but it also constantly faces risks and challenges. In particular, “thunderstorms” of P2P platforms kept occurring in 2018, which greatly affected the healthy development of the P2P online lending industry. According to the “2018 Online Loan Industry Data Summary” released by the Rong 360 Big Data Research Institute, as of the end of December 2018 there were 1082 online loan platforms operating normally in the country and 848 problem platforms, 254 more problem platforms than in 2017. In the current period, the numbers of active lenders and borrowers were 3.169 million and 6.0205 million, respectively. Compared with December 2017, the number of lenders was 6.865 million and the number of borrowers was 6.933 million. One of the reasons why there are so many problem platforms is that a large number of borrowers have defaulted and lost contact, which has seriously damaged the interests of platforms and investors and hindered the sustainable development of P2P online lending. In the P2P online lending industry, a loan customer “losing contact” refers to a customer who cannot repay a loan when it is due and cannot be reached by the P2P platform through any means. In recent years, there have been frequent incidents of P2P platforms absconding and losing contact.
In fact, malicious default and loss of contact by loan customers have broken the capital chain of P2P platforms. This is one of the important reasons for the closure and losses of P2P platforms, and it poses a huge challenge to their sustainable and healthy development. In February 2019, the Beijing Internet finance industry association published “a notice on the list of borrowers and institutions that evaded and abandoned their debts by online lending institutions in Beijing”, which named 300 borrowers who evaded and abandoned their debts. For more than 100 of these 300 borrowers, the loans became overdue during the “storm” of P2P platforms in 2018. In addition, only 66 of the 300 borrowers have not lost contact, giving a loss-of-contact ratio of 78% and an overdue amount of 164,100 yuan, which has brought huge losses to investors and platforms. It is therefore urgent for the P2P industry to identify borrowers’ default behavior. This paper takes Lending Club, a well-established foreign P2P lending platform, as the research object. The recursive feature elimination method is used to select the key information about the borrower and eliminate the multicollinearity of the data, and the default behavior of P2P borrowers is then predicted by combining it with the Logistic regression model, the CART decision tree and the BP neural network model.
At present, research on P2P online lending mainly focuses on the factors influencing borrowers’ default behavior and on how to choose a credit evaluation model to identify and predict borrower default. Lin et al. [
Problems at P2P lending platforms have gradually increased in recent years. The main cause is that borrowers fall overdue and fail to repay on time while the platforms cannot absorb the default risk, which damages investors’ interests. Strengthening the credit risk assessment of borrowers is therefore the key to the healthy development of P2P lending platforms. Early scholars mainly adopted traditional linear models, among which the Logistic regression model was the most widely used. Wigintio [
Based on the above, we find that many factors influence borrower default, but there is little research on how to choose, scientifically and rationally, the variables that significantly affect default behavior. Therefore, this paper starts from variable selection and combines several classification models to study the identification and prediction of default by P2P borrowers. The marginal contributions of this paper are as follows: 1) at present, the recursive feature elimination method is widely used in biology and medicine, but there is little research applying it to P2P online lending. This paper uses the recursive feature elimination method to screen and rank high-dimensional variables. 2) The fitting performance and prediction accuracy of the classification models are compared, which provides a reference for selecting P2P credit risk assessment models in the future.
In this paper, logistic regression model, decision tree and BP neural network are used to identify and predict the default behavior of P2P borrowers. Logistic regression model [
1) Logistic regression model
Logistic regression model is a classical classification model, which uses the basic information of the borrower and the data of the loan information for analysis. The model is as follows:
$$g(z) = \frac{1}{1 + e^{-z}} \tag{1}$$
where $z$ can be estimated by the following multiple regression equation:
$$z = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p \tag{2}$$
If the probability of credit default of the borrower is $P(Y = 1 \mid X) = \pi(Y)$, the following formula can be obtained:
$$P(Y = 1 \mid X) = \frac{\exp(\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p)}{1 + \exp(\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p)} \tag{3}$$
where $Y$ is a binary variable: 1 represents default and 0 represents non-default. The parameters are estimated by the maximum likelihood method, and the likelihood function is:
$$l(\beta) = \prod_{i=1}^{p} \pi(Y_i)^{Z_i} \left[ 1 - \pi(Y_i) \right]^{1 - Z_i} \tag{4}$$
where $i = 1, 2, \cdots, p$. The log-likelihood function is as follows:
$$\log(l(\beta)) = \sum_{i=1}^{p} \left[ Z_i \log(\pi(Y_i)) + (1 - Z_i) \log(1 - \pi(Y_i)) \right] \tag{5}$$
The goal of maximum likelihood estimation is to find the value of $\beta$ at which $l(\beta)$ reaches its maximum. In this paper, the gradient ascent method is used to compute the parameter $\beta$.
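The estimation procedure described above can be illustrated with a minimal sketch of gradient ascent on the log-likelihood. It is not the paper's implementation: the learning rate, the iteration count, the function names and the toy data are all assumptions chosen for illustration, and the code uses only the Python standard library.

```python
import math


def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z)), as in Equation (1)."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.1, n_iter=2000):
    """Estimate beta by gradient ascent on the average log-likelihood.

    X is a list of feature rows and y a list of 0/1 labels.
    Returns [beta_0, beta_1, ..., beta_p], where beta_0 is the intercept.
    lr and n_iter are illustrative assumptions, not values from the paper.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * (p + 1)
    for _ in range(n_iter):
        grad = [0.0] * (p + 1)
        for row, label in zip(X, y):
            z = beta[0] + sum(b * x for b, x in zip(beta[1:], row))
            err = label - sigmoid(z)  # derivative of the log-likelihood w.r.t. z
            grad[0] += err
            for j, x in enumerate(row):
                grad[j + 1] += err * x
        # Ascent step: move along the gradient to increase the likelihood.
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta


def predict_proba(beta, row):
    """P(Y = 1 | X) as in Equation (3)."""
    return sigmoid(beta[0] + sum(b * x for b, x in zip(beta[1:], row)))


# Toy data (hypothetical): one feature, larger values correspond to default (1).
X = [[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]]
y = [0, 0, 0, 1, 1, 1]
beta = fit_logistic(X, y)
```

Calling `predict_proba(beta, [4.0])` then gives the estimated default probability for a new observation; a threshold such as 0.5 (again an assumption) turns it into a class label.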
2) CART decision tree
At present, the CART decision tree is one of the most widely used decision tree learning methods. The CART decision tree uses the Gini index to partition attributes. The Gini index represents the uncertainty of a set $D$: the larger the Gini index, the greater the uncertainty of the set. The formula of the Gini index is as follows:
$$\mathrm{Gini}(D) = 1 - \sum_{k=1}^{K} \left( \frac{|C_k|}{|D|} \right)^2 \tag{6}$$
where $C_k$ is the subset of samples in $D$ that belong to class $k$, and $K$ is the number of classes. If $D$ is divided into $D_1$ and $D_2$ under feature $A$, then the Gini index of set $D$ under this split is:
$$\mathrm{Gini}(D, A) = \frac{|D_1|}{|D|} \mathrm{Gini}(D_1) + \frac{|D_2|}{|D|} \mathrm{Gini}(D_2) \tag{7}$$
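As a small worked illustration of Equations (6) and (7), the sketch below computes the Gini index of a label set and of a binary split, and scans the thresholds of one numeric feature for the split with the lowest weighted Gini index. The function names and the threshold scan are illustrative assumptions; a full CART implementation also handles categorical features, recursion and pruning.

```python
from collections import Counter


def gini(labels):
    """Gini(D) = 1 - sum_k (|C_k| / |D|)^2, Equation (6)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_split(left, right):
    """Weighted Gini index of a binary split into D1 and D2, Equation (7)."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


def best_threshold(values, labels):
    """Try every split 'value <= t' on one feature; return (t, weighted Gini)."""
    best = (None, float("inf"))
    for t in sorted(set(values))[:-1]:  # the largest value would leave D2 empty
        left = [lab for v, lab in zip(values, labels) if v <= t]
        right = [lab for v, lab in zip(values, labels) if v > t]
        score = gini_split(left, right)
        if score < best[1]:
            best = (t, score)
    return best
```

For example, `gini([0, 0, 1, 1])` is 0.5 (maximum uncertainty for two classes), while a split that separates the two classes perfectly has a weighted Gini index of 0.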
3) BP neural network
The learning process of a BP neural network consists of forward propagation of information and backward propagation of error. The BP neural network has the structure shown in
$$m = \sqrt{n + l} + \alpha \tag{8}$$
where $m$ is the number of nodes in the hidden layer, $n$ is the number of nodes in the input layer, $l$ is the number of nodes in the output layer, and $\alpha$ is a constant between 1 and 10.
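To make the hidden-layer rule concrete, the following sketch lists candidate hidden-layer sizes. It assumes that Equation (8) is the common empirical rule $m = \sqrt{n + l} + \alpha$ with integer $\alpha$ from 1 to 10, and it rounds to the nearest integer; the input and output sizes in the example are hypothetical, not the network used in the paper.

```python
import math


def hidden_node_candidates(n_inputs, n_outputs, alphas=range(1, 11)):
    """Candidate hidden-layer sizes m = sqrt(n + l) + alpha, rounded to integers.

    Assumes the empirical rule of Equation (8); alpha runs over 1..10.
    """
    base = math.sqrt(n_inputs + n_outputs)
    return [round(base + alpha) for alpha in alphas]


# Hypothetical example: 8 input nodes and 1 output node.
candidates = hidden_node_candidates(8, 1)
```

In practice each candidate size would be tried and the one with the best validation performance kept; the rule only narrows the search range.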
1) Recursive feature elimination
The main idea of the recursive feature elimination (RFE) method is to build a model repeatedly (for example, an SVM or a regression model), choose the best (or worst) feature, set the chosen feature aside, and then repeat the process on the remaining features until all features have been traversed, thereby selecting the key features. RFE uses a feature ranking technique to select the feature subset. In this paper, the classification performance of SVM is taken as the evaluation function for selecting features. The flow chart of RFE is shown in
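The elimination loop described above can be sketched in a few lines. Note the substitutions: the paper ranks features with an SVM, whereas this standard-library sketch uses a small logistic model as the ranking model and drops, at each round, the feature with the smallest absolute weight. That ranking criterion assumes the features are on comparable scales. All names, hyperparameters and the toy data are hypothetical.

```python
import math


def fit_linear_weights(X, y, lr=0.1, n_iter=300):
    """Fit a small logistic model by gradient ascent; return the feature weights.

    Stand-in for the SVM used in the paper (an assumption of this sketch).
    """
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(n_iter):
        gw, gb = [0.0] * p, 0.0
        for row, label in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, row))
            err = label - 1.0 / (1.0 + math.exp(-z))
            gb += err
            for j, xj in enumerate(row):
                gw[j] += err * xj
        w = [wj + lr * g / n for wj, g in zip(w, gw)]
        b += lr * gb / n
    return w


def rfe(X, y, n_keep):
    """Recursive feature elimination: refit, drop the weakest feature, repeat.

    Returns (indices of kept features, indices in the order they were dropped).
    """
    remaining = list(range(len(X[0])))
    eliminated = []
    while len(remaining) > n_keep:
        X_sub = [[row[j] for j in remaining] for row in X]
        w = fit_linear_weights(X_sub, y)
        # The feature with the smallest absolute weight contributes least.
        worst = min(range(len(remaining)), key=lambda k: abs(w[k]))
        eliminated.append(remaining.pop(worst))
    return remaining, eliminated


# Toy data (hypothetical): feature 0 separates the classes, features 1 and 2 do not.
X = [[-1.0, 0.3, -0.2], [-0.8, -0.4, 0.1], [-1.2, 0.1, 0.3],
     [1.1, 0.2, -0.1], [0.9, -0.3, 0.2], [1.0, 0.4, -0.3]]
y = [0, 0, 0, 1, 1, 1]
kept, dropped = rfe(X, y, n_keep=1)
```

Reversing `dropped` and appending it to `kept` gives a full ranking of the features, which is the "feature ranking" output the text refers to.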
2) Pearson correlation coefficient
The Pearson correlation coefficient is a statistic that reflects the degree of linear correlation between two variables. The correlation coefficient is denoted by $r$, and $n$ is the sample size. The greater the absolute value of $r$, the stronger the linear correlation between the two variables. The formula is as follows:
$$r_{kj} = \frac{\sum_{i=1}^{n} (x_{ik} - \bar{x}_k)(x_{ij} - \bar{x}_j)}{\sqrt{\sum_{i=1}^{n} (x_{ik} - \bar{x}_k)^2 \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2}} \tag{9}$$
where $r_{kj}$ is the correlation coefficient between the $k$-th index and the $j$-th index, $x_{ik}$ is the value of the $k$-th index for the $i$-th sample, $\bar{x}_k$ is the average value of the $k$-th index, $x_{ij}$ is the value of the $j$-th index for the $i$-th sample, and $\bar{x}_j$ is the average value of the $j$-th index.
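A short sketch shows how the coefficient can be computed and used to flag redundant feature pairs. The formula follows the standard definition of Pearson's $r$ (with the square root in the denominator); the redundancy threshold of 0.8 and the column names are assumptions for illustration, since the paper does not state a cut-off.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)


def redundant_pairs(columns, threshold=0.8):
    """Return (name_a, name_b, r) for every pair of columns with |r| > threshold.

    columns maps a feature name to its list of values; threshold is an assumption.
    """
    names = list(columns)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson_r(columns[a], columns[b])
            if abs(r) > threshold:
                pairs.append((a, b, r))
    return pairs


# Hypothetical columns: f1 and f2 are almost proportional, f3 is unrelated.
cols = {"f1": [1, 2, 3, 4], "f2": [2, 4, 6, 8.5], "f3": [1, -1, 1, -1]}
pairs = redundant_pairs(cols)
```

For each flagged pair, one of the two features would typically be dropped to reduce multicollinearity.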
1) Confusion matrix
In a binary classification problem, there are four types of prediction results. Based on the confusion matrix, we can make a preliminary judgment of the correct rate and error rate of the model, as shown in the following table.

Actual class | Predicted negative | Predicted positive |
---|---|---|
negative | TN (True Negative) | FP (False Positive) |
positive | FN (False Negative) | TP (True Positive) |

Here, TN is the number of samples whose true class is negative and that are predicted as negative; FP is the number of samples whose true class is negative but that are predicted as positive; FN is the number of samples whose true class is positive but that are predicted as negative; and TP is the number of samples whose true class is positive and that are predicted as positive. From the confusion matrix, the accuracy can be calculated, which can be expressed as:
$$\mathrm{Accuracy} = \frac{TN + TP}{TN + FN + FP + TP} \tag{10}$$
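The four counts and the accuracy of Equation (10) can be computed directly from paired actual and predicted labels. The sketch below assumes labels coded as 0 (negative) and 1 (positive); the example labels are hypothetical.

```python
def confusion_counts(actual, predicted):
    """Return (TN, FP, FN, TP) for binary labels, where 1 is the positive class."""
    tn = fp = fn = tp = 0
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            tp += 1
        elif a == 1 and p == 0:
            fn += 1
        elif a == 0 and p == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp


def accuracy(tn, fp, fn, tp):
    """Accuracy = (TN + TP) / (TN + FN + FP + TP), Equation (10)."""
    return (tn + tp) / (tn + fn + fp + tp)


# Hypothetical labels for six borrowers.
actual = [0, 0, 1, 1, 1, 0]
predicted = [0, 1, 1, 0, 1, 0]
counts = confusion_counts(actual, predicted)
```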
2) ROC curve and AUC values
In binary classification problems, the ROC curve and the AUC value are often used to evaluate a binary classifier. The ROC curve is a comprehensive indicator of the sensitivity and specificity of a continuous variable. The horizontal axis of the ROC curve is the false positive rate (FPR), that is, the proportion of all negative instances that are classified as positive, and the vertical axis is the true positive rate (TPR), that is, the proportion of all positive instances that are classified as positive. In a binary classification problem, each instance is assigned to either the positive class or the negative class, so in practice four outcomes are possible: a) if an instance is positive and is predicted to be positive, it is a True Positive (TP); b) if an instance is positive but is predicted to be negative, it is a False Negative (FN); c) if an instance is negative but is predicted to be positive, it is a False Positive (FP); d) if an instance is negative and is predicted to be negative, it is a True Negative (TN).
Among them, the True Positive Rate (TPR) is the proportion of all positive instances that the classifier correctly predicts as positive. The formula can be expressed as:
$$TPR = \frac{TP}{TP + FN} \tag{11}$$
The False Positive Rate (FPR) is the proportion of all negative instances that the classifier incorrectly predicts as positive:
$$FPR = \frac{FP}{FP + TN} \tag{12}$$
AUC (Area Under Curve) is defined as the area under the ROC curve. The closer the AUC value is to 1, the better the fitting performance of the model.
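The construction of the ROC curve and the AUC value can be sketched as follows: the decision threshold is swept over the predicted scores, each threshold yields one (FPR, TPR) point from Equations (11) and (12), and the area is accumulated with the trapezoidal rule. The function names and the four-sample example are hypothetical, and the sketch assumes both classes are present in the labels.

```python
def roc_points(labels, scores):
    """Return ROC points (FPR, TPR), sweeping the threshold from high to low.

    labels are 0/1 and scores are predicted probabilities of class 1.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for lab, s in zip(labels, scores) if s >= t and lab == 1)
        fp = sum(1 for lab, s in zip(labels, scores) if s >= t and lab == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical scores for two non-default (0) and two default (1) borrowers.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
area = auc(roc_points(labels, scores))
```

A classifier that ranks every positive instance above every negative one reaches an AUC of 1, while random ranking gives about 0.5.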
In this study, loan data provided by Lending Club for the third quarter of 2017 are used for the experiments (data source URL: https://www.lendingclub.com/info/download-data.action). The Lending Club platform asks customers to fill in a loan application form online or offline to collect their basic information, including the applicant’s age, gender, marital status, educational background, loan amount and property status. Generally speaking, it also uses information from third-party platforms such as credit investigation agencies or FICO. The original data in this paper contain 122,703 samples and 112 attributes. The classification label attribute “loan_status” has seven states in the original data: “Current”, “Fully Paid”, “Charged Off”, “Default”, “In Grace Period”, “Late (16 - 30 days)” and “Late (31 - 120 days)”. Since “Current” loans are still under repayment and their repayment outcome is not yet clear, loans in “Current” status are deleted in this study. The final data include 24,330 samples, of which 16,770 are non-default samples and 7560 are default samples. The training set contains 70% of the data, and the test set contains 30%. “Fully Paid” is regarded as non-default, with the value 0; “Charged Off”, “Default”, “In Grace Period”, “Late (16 - 30 days)” and “Late (31 - 120 days)” are regarded as default, with the value 1.
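The labelling rule and the 70/30 split described above can be sketched as follows. The status strings are copied as they are written in this paper, and their spelling in the raw file may differ. The paper does not say how the split was drawn, so the shuffled random split, the seed and the function names here are assumptions.

```python
import random

# Status strings as written in this paper (the raw file's spelling may differ).
NON_DEFAULT = {"Fully Paid"}
DEFAULT = {"Charged Off", "Default", "In Grace Period",
           "Late (16 - 30 days)", "Late (31 - 120 days)"}


def encode_status(status):
    """Map loan_status to 0 (non-default) or 1 (default); None means drop the row."""
    if status in NON_DEFAULT:
        return 0
    if status in DEFAULT:
        return 1
    return None  # "Current" (and any unrecognised status) is removed


def train_test_split(rows, train_frac=0.7, seed=0):
    """Shuffle rows and split them into training and test parts.

    A plain random split is an assumption; the paper only gives the 70/30 ratio.
    """
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = round(len(rows) * train_frac)
    return rows[:cut], rows[cut:]


statuses = ["Current", "Fully Paid", "Charged Off", "In Grace Period"]
labels = [encode_status(s) for s in statuses if encode_status(s) is not None]
train, test = train_test_split(range(10))
```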
First, attributes that have many missing values, that take the same value for most observations, or that have no bearing on the borrower’s default status were deleted. After this preliminary screening, 87 attributes remained. This paper then adopts the recursive feature elimination method to select key features. RFE uses a feature ranking technique to select the feature subset and takes the classification performance of SVM as the evaluation function. The recursive feature elimination algorithm is used to screen out the 15 features most strongly related to the target variable. Then the Pearson correlation map is used to find redundant features. From
Finally, the important features are sorted by the random forest algorithm, and the results are shown in
Through the above steps, 9 attributes were finally selected from the 112 attributes. The variable descriptions for the nine attributes are shown in
1) Analysis of ROC curve and AUC value
In this study, the ROC curve described above is used to evaluate the strengths and weaknesses of the models. The detailed results of the ROC curves for the training samples and test samples of the logistic regression model, CART decision tree and BP neural network model are shown in
Variable | Variable description | Variable type |
---|---|---|
loan_status | Current status of the loan | nominal variable |
loan_amnt | The listed amount of the loan applied for by the borrower. If at some point in time, the credit department reduces the loan amount, then it will be reflected in this value. | continuous variable |
int_rate | Interest Rate on the loan | continuous variable |
out_prncp | Remaining outstanding principal for total amount funded | continuous variable |
total_rec_int | Interest received to date | continuous variable |
last_pymnt_amnt | Last total payment amount received | continuous variable |
home_ownership | The home ownership status provided by the borrower during registration or obtained from the credit report. Our values are: RENT, OWN, MORTGAGE, OTHER | nominal variable |
verification_status | Indicates if income was verified by LC, not verified, or if the income source was verified | nominal variable |
term | The number of payments on the loan. Values are in months and can be either 36 or 60. | nominal variable |
Data source URL: https://www.lendingclub.com/info/download-data.action.
Models | Samples | AUC value |
---|---|---|
Logistic | training samples | 0.991 |
Logistic | test samples | 0.991 |
CART | training samples | 0.957 |
CART | test samples | 0.965 |
BPNN | training samples | 0.991 |
BPNN | test samples | 0.992 |
indicating that the fitting performance was also good. The AUC values of the training samples and test samples of the BP neural network, 0.991 and 0.992 respectively, differ little from those of the Logistic regression model. From the above analysis, it can be seen that, after the important variables are filtered out by the recursive feature elimination method, the Logistic regression model and the BP neural network model have the best fitting performance among the three classification models, followed by the CART decision tree. In general, however, the AUC values of the training samples and test samples of all three classification models are greater than 0.9, so by the definition of AUC their fitting performance is very good.
2) Analysis of model results
In this paper, the above mentioned logistic regression model, CART decision tree and BP neural network model are respectively used to predict the credit default situation of borrowers. The accuracy of these three classification models in predicting borrowers’ credit default in training samples and test samples is shown in
As can be seen from the results in
Model | Actual class (0 = non-default, 1 = default) | Training: predicted 0 | Training: predicted 1 | Training: correct rate | Test: predicted 0 | Test: predicted 1 | Test: correct rate |
---|---|---|---|---|---|---|---|
Logistic | 0 | 11,241 | 529 | 95.5% | 4794 | 206 | 95.9% |
Logistic | 1 | 256 | 5081 | 95.2% | 108 | 2115 | 95.1% |
Logistic | overall correct rate | | | 95.4% | | | 95.7% |
CART | 0 | 11,390 | 380 | 96.8% | 4782 | 218 | 95.6% |
CART | 1 | 201 | 5136 | 96.2% | 57 | 2166 | 97.4% |
CART | overall correct rate | | | 96.6% | | | 96.2% |
BPNN | 0 | 11,198 | 572 | 95.1% | 4785 | 215 | 95.7% |
BPNN | 1 | 183 | 5154 | 96.6% | 79 | 2144 | 96.4% |
BPNN | overall correct rate | | | 95.6% | | | 95.9% |
Compared with the Logistic regression model, the classification effect of the CART decision tree was better. For the BP neural network model, in the training sample the accuracy of classifying non-defaulting borrowers as non-defaulting is 95.1% and the accuracy of classifying defaulting borrowers as defaulting is 96.6%; in the test sample the corresponding correct rates are 95.7% and 96.4%. In the training and test samples, the accuracy of the CART decision tree in classifying non-defaulting borrowers as non-defaulting is about the same as that of the Logistic regression model. However, the accuracy of the BP neural network model in classifying defaulting borrowers as defaulting is higher than that of the Logistic regression model, by 1.4 percentage points in the training sample and 1.3 percentage points in the test sample. In P2P online lending, misclassifying a defaulting borrower as non-defaulting brings greater losses than misclassifying a non-defaulting borrower as defaulting. Therefore, the classification effect of the BP neural network is better than that of the Logistic regression model, but worse than that of the CART decision tree. Compared with Logistic regression model, CART decision tree and BP neural network model.
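The per-class and overall correct rates discussed above follow directly from the confusion-matrix counts. As a quick consistency check, the sketch below recomputes them for the test samples; the counts are copied from the results table, with rows read as actual classes and columns as predicted classes, and the function name is an illustrative choice.

```python
def rates(tn, fp, fn, tp):
    """Return (non-default correct rate, default correct rate, overall correct rate)."""
    non_default = tn / (tn + fp)
    default = tp / (tp + fn)
    overall = (tn + tp) / (tn + fp + fn + tp)
    return non_default, default, overall


# Test-sample counts from the results table, as (TN, FP, FN, TP).
test_counts = {
    "Logistic": (4794, 206, 108, 2115),
    "CART": (4782, 218, 57, 2166),
    "BPNN": (4785, 215, 79, 2144),
}

for name, counts in test_counts.items():
    non_default, default, overall = rates(*counts)
    print(f"{name}: non-default {non_default:.1%}, "
          f"default {default:.1%}, overall {overall:.1%}")
```

The default correct rate (the share of actual defaulters that are caught) is the figure that matters most under the cost argument above, since a missed defaulter is the more expensive error.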
In this section, through the recursive feature elimination method, the Pearson correlation coefficient method and random forest feature selection, the paper finally screened out “loan_amnt”, “int_rate”, “out_prncp”, “total_rec_int”, “last_pymnt_amnt”, “home_ownership”, “verification_status” and “term”, which have a great impact on “loan_status”. As can be seen from
Taking the user data published by Lending Club as the research object, this paper uses recursive feature elimination combined with classification algorithms to identify and predict the credit default of borrowers. The findings are as follows. 1) The recursive feature elimination method can screen out the key variables affecting the borrower’s default status. Ranking the key variables by importance shows that the borrower’s latest repayment amount, loan amount and loan interest rate have a great impact on the borrower’s default status. The Pearson coefficient indicates that the borrower’s credit rating and income are strongly correlated with the borrower’s loan amount. 2) The experimental results show that classification models built on the key variables selected by recursive feature elimination have high accuracy. In this paper, the CART decision tree has the highest accuracy, indicating that it has the best classification effect. The classification accuracy of the Logistic regression model and the BP neural network model is slightly lower than that of the CART decision tree, but their AUC values are higher, indicating that their fitting performance is better than that of the CART decision tree.
Although the combination of the recursive feature elimination method and the classification algorithms used in this paper is effective in identifying and predicting credit default by P2P online borrowers, the work can be extended in two directions: 1) apply oversampling or undersampling methods to deal with data imbalance and observe the resulting effects; 2) include other classification algorithms, such as random forest, support vector machine and other machine learning algorithms, in the experimental comparison, and compare the accuracy of the various algorithms in identifying and predicting the default status of P2P online borrowers.
The author declares no conflicts of interest regarding the publication of this paper.
Hou, X.Y. (2020) P2P Borrower Default Identification and Prediction Based on RFE-Multiple Classification Models. Open Journal of Business and Management, 8, 866-880. https://doi.org/10.4236/ojbm.2020.82053