
Personal credit risk assessment is an important part of the development of financial enterprises. Big data credit investigation is an inevitable trend in personal credit risk assessment, but in practice some datasets are small and have many missing values, which makes models difficult to train. In addition, different financial platforms need different models trained on the characteristics of their own samples, which is time-consuming. To address these two problems, this paper uses the idea of transfer learning to build a transferable personal credit risk model based on Instance-based Transfer Learning (Instance-based TL). The model rebalances the weights of the source domain samples, transfers samples from an existing large dataset to a small-sample target domain, and finds the commonality between the two. We ran extensive experiments on the choice of base learner, covering traditional machine learning algorithms and ensemble learning algorithms such as decision tree, logistic regression and XGBoost. The datasets come from a P2P platform and a bank. The results show that the AUC of Instance-based TL is 24 percentage points higher than that of the traditional machine learning model (0.80 versus 0.56). The model is evaluated with AUC, precision, recall and F1, and these criteria indicate from several angles that it has good application value. At present, we are trying to apply the model to more fields to improve its robustness and applicability; we are also doing more in-depth research on domain adaptation to enrich the model.

Personal credit risk is something that both governments and enterprises attach great importance to. A good personal credit risk assessment not only helps governments improve the credit system but also helps enterprises avoid risk effectively. Personal credit risk assessment models have developed from traditional credit assessment models to data mining, and more recently big data, credit risk assessment models. Traditional credit assessment models often use discriminant analysis, linear regression and logistic regression, while data mining credit risk assessment models often use decision trees, neural networks, support vector machines and other methods to evaluate credit [

At present, existing data mining credit risk assessment models achieve relatively high accuracy, but only when data are sufficient and few values are missing. When the data volume is small or many values are missing, the predictions of these models are often poor. For this reason, this paper introduces Instance-based Transfer Learning, which transfers samples from an existing large dataset to a small-sample target domain, finds the commonality between the two, and thereby makes it possible to train on the target domain dataset.

The rest of this paper is organized as follows. The second section introduces related work. The third section constructs the personal credit risk assessment model based on the idea of instance-based transfer. The fourth section describes the experimental process and gives a comparative analysis of the results. The fifth section summarizes the paper.

The concept of transfer learning was first proposed by a psychologist. Its essence is the transfer and reuse of knowledge: useful knowledge is extracted from one or more source domain tasks and applied to a new target task, so that old data are “renovated and reused” with high reliability and accuracy. The emergence of transfer learning eases the contradictions between “big data and few labels” and between “big data and weak computing” in machine learning.

In terms of the classification of transfer learning, Pan, S. J. and Yang, Q. [

At present, transfer learning has a large number of applications, but they are mainly concentrated in text classification, text aggregation, emotion classification, collaborative filtering, artificial intelligence planning, image processing, time series, and the medical and health fields. [

Traditional machine learning assumes that training samples are sufficient and that the training and test sets are independent and identically distributed. However, in most areas, and especially in financial credit investigation, these two conditions are difficult to meet. Datasets in some domains are not only small but also have a large number of missing values, so traditional machine learning methods cannot be trained to give good results. If other datasets are simply introduced to assist training, the model still trains poorly because the two datasets follow different distributions. To solve this problem, this paper introduces transfer learning. In transfer learning, the existing knowledge is called the source domain, and the new knowledge to be learned is called the target domain. Instance-based Transfer Learning makes maximum use of the useful information in the source domain data to address the poor training results caused by the small sample size of the target domain dataset.

To ensure the maturity of the transfer learning framework, we introduce the classic Instance-based Transfer Learning algorithm, TrAdaBoost, and apply it to data in the field of financial credit reference [

Here the source domain data are denoted $T_a$ and the target domain data are denoted $T_b$. Take 50% of all the source domain data and the target domain data as the training set $T$, and take 50% of the target domain data as the test set, denoted $S$. It follows that $T_b$ and $S$ have the same distribution.

Step 1. Normalize the weight of each sample in the training set ($T_a \cup T_b$) and the test set ($S$) so that the weights form a distribution.

Step 2. For $t = 1, \cdots, N$:

1) Set and call the Base Learner.

2) Calculate the error rate of the Base Learner on the target domain training data $T_b$.

3) Calculate the weight adjustment rate.

4) Update the weights. If a target domain sample is classified incorrectly, increase its weight; if a source domain sample is classified incorrectly, reduce its weight (

Step 3. Output the final classifier.
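The steps above can be sketched in a few lines of Python. This is a minimal illustration that follows the commonly cited formulation of TrAdaBoost, not the authors' implementation: the one-feature decision stump used as Base Learner, the clipping of the error rate, the function names and the toy data are all assumptions made for the example.

```python
import math

def stump_predict(stump, row):
    """Predict 0/1 from a one-feature threshold rule."""
    feature, threshold, polarity = stump
    pred = 1 if row[feature] >= threshold else 0
    return pred if polarity == 1 else 1 - pred

def fit_stump(X, y, p):
    """Hypothetical Base Learner: the stump with the smallest weighted error under p."""
    best_err, best_stump = float("inf"), None
    for feature in range(len(X[0])):
        for threshold in sorted({row[feature] for row in X}):
            for polarity in (1, -1):
                stump = (feature, threshold, polarity)
                err = sum(pi for row, yi, pi in zip(X, y, p)
                          if stump_predict(stump, row) != yi)
                if err < best_err:
                    best_err, best_stump = err, stump
    return best_stump

def tradaboost(Xa, ya, Xb, yb, n_rounds=10):
    """Xa, ya: source domain data; Xb, yb: labelled target domain training data."""
    n, m = len(Xa), len(Xb)
    X, y = Xa + Xb, ya + yb
    w = [1.0] * (n + m)
    # Fixed shrink factor for misclassified source samples
    beta_src = 1.0 / (1.0 + math.sqrt(2.0 * math.log(n) / n_rounds))
    learners, betas = [], []
    for _ in range(n_rounds):
        total = sum(w)
        p = [wi / total for wi in w]               # Step 1: weights as a distribution
        h = fit_stump(X, y, p)                     # Step 2.1: call the Base Learner
        miss = [int(stump_predict(h, xi) != yi) for xi, yi in zip(X, y)]
        # Step 2.2: error rate measured on the target domain part only
        eps = sum(w[i] * miss[i] for i in range(n, n + m)) / sum(w[n:])
        eps = min(max(eps, 1e-6), 0.499)           # assumption: clip to keep beta_t valid
        beta_t = eps / (1.0 - eps)                 # Step 2.3: weight adjustment rate
        for i in range(n):                         # Step 2.4: source errors lose weight
            w[i] *= beta_src ** miss[i]
        for i in range(n, n + m):                  # Step 2.4: target errors gain weight
            w[i] *= beta_t ** (-miss[i])
        learners.append(h)
        betas.append(beta_t)
    return learners, betas

def tradaboost_predict(model, row):
    """Step 3: weighted vote over the later half of the rounds."""
    learners, betas = model
    start = math.ceil(len(learners) / 2) - 1
    score = sum(math.log(1.0 / b) * (stump_predict(h, row) - 0.5)
                for h, b in zip(learners[start:], betas[start:]))
    return int(score >= 0)

# Toy data (assumption): the source boundary is x >= 5, the target boundary is x >= 3.
Xa = [[0], [1], [2], [3], [3], [4], [4], [5], [6], [7], [8], [9]]
ya = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
Xb = [[0], [1], [2], [3], [4], [6], [7]]
yb = [0, 0, 0, 1, 1, 1, 1]
model = tradaboost(Xa, ya, Xb, yb)
print([tradaboost_predict(model, [x]) for x in (2, 3, 4, 5)])
```

On this toy data the first round follows the source boundary; once the misclassified target samples gain weight and the conflicting source samples lose weight, later rounds move to the target boundary.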

In general, the algorithms commonly used for personal credit risk assessment include logistic regression, decision tree and other machine learning algorithms, as well as XGBoost and other ensemble learning algorithms. When the dataset is sufficient, these machine learning algorithms achieve good results. Therefore, this paper draws on these mature algorithms when selecting the Base Learner and transfers the algorithm from the source domain to the target domain, so as to achieve better results in the target domain.

Learners are generally divided into weak learners and strong learners. At present, most studies choose weak learners and then iterate many times to achieve better results. However, in the field of credit risk, some scholars have applied the XGBoost algorithm and achieved good results [

XGBoost (extreme gradient boosting) is an ensemble learning algorithm based on decision trees that can be used for classification and regression problems. Its core is to generate a weak classifier in each of multiple iterations, with each classifier trained on the residual of the previous round. XGBoost also differs from other machine learning algorithms in how it forms its prediction: it sums the results of its trees to obtain the final predicted value.

$$\hat{y}_i = \phi(x_i) = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in F \tag{1}$$

Suppose that a given sample set has n samples and m features, which is defined as

$$D = \{(x_i, y_i)\} \quad (|D| = n,\ x_i \in \mathbb{R}^m,\ y_i \in \mathbb{R}) \tag{2}$$

For the samples $(x_i, y_i)$, the space of CART trees is $F$, defined as follows:

$$F = \{ f(x) = w_{q(x)} \} \quad (q: \mathbb{R}^m \to T,\ w \in \mathbb{R}^T) \tag{3}$$

where $q$ is the structure of the tree, which maps a sample to a leaf node; $w$ is the set of scores of all leaf nodes of tree $q$, so that $w_{q(x)}$ is the score of the leaf to which $x$ is assigned; and $T$ is the number of leaf nodes of tree $q$. The goal of XGBoost is to learn such a model of $K$ trees $f_k(x)$. Therefore, the objective function of XGBoost can be expressed as [

$$obj^{(t)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(t)}\right) + \sum_{k=1}^{t} \Omega(f_k), \quad \text{where } \Omega(f) = \gamma T + \frac{1}{2} \lambda \|w\|^2 \tag{4}$$
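To make Equations (1) and (4) concrete, the sketch below evaluates the additive prediction and the regularized objective for two hand-written trees. It is only an illustration under stated assumptions: the trees, the data, the values of $\gamma$ and $\lambda$, and the choice of squared error as the loss $l$ are all hypothetical, and the code does not call the XGBoost library.

```python
def ensemble_predict(trees, x):
    """Equation (1): sum the leaf scores w[q(x)] of all trees."""
    return sum(tree["w"][tree["q"](x)] for tree in trees)

def omega(tree, gamma, lam):
    """Regularization term: gamma * T + 0.5 * lambda * ||w||^2."""
    n_leaves = len(tree["w"])
    return gamma * n_leaves + 0.5 * lam * sum(wj ** 2 for wj in tree["w"])

def objective(trees, X, y, gamma=1.0, lam=1.0):
    """Equation (4) with squared error assumed as the loss l."""
    loss = sum((yi - ensemble_predict(trees, xi)) ** 2 for xi, yi in zip(X, y))
    return loss + sum(omega(tree, gamma, lam) for tree in trees)

# Two hypothetical trees on a single feature.
# "q" maps a sample to a leaf index, "w" holds the leaf scores.
tree1 = {"q": lambda x: 0 if x[0] < 2.0 else 1, "w": [0.5, 1.5]}
tree2 = {"q": lambda x: 0 if x[0] < 3.0 else 1, "w": [-0.5, 0.5]}
trees = [tree1, tree2]

X = [[1.0], [2.5], [4.0]]
y = [0.0, 1.0, 2.0]

print([ensemble_predict(trees, xi) for xi in X])
print(objective(trees, X, y))
```

Because the two trees fit the toy targets exactly, the objective here consists of the regularization terms alone, which shows how $\gamma$ penalizes the number of leaves and $\lambda$ penalizes large leaf scores.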

The source domain dataset comes from the Prosper online P2P lending website, and the target domain dataset comes from a bank's data for April to September 2005. Both datasets have missing values and high correlation among features. The target domain has only 9000 records, and the source domain dataset contains many redundant fields. Therefore, it is necessary to fill in the missing values and to select features by information divergence.

There are several common methods for handling missing values:

1) Fill with a fixed value chosen according to the characteristics of the data;

2) Fill with the mean, median or mode;

3) Fill with values estimated by KNN;

4) Fill with values predicted by a model.
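The first two options in the list above can be illustrated with the Python standard library. This is a minimal sketch, not the preprocessing code used in the paper: the helper name `fill_missing`, the column names and the values are hypothetical, and reading the second list item as mean, median or mode is an assumption.

```python
from statistics import mean, median, mode

def fill_missing(column, strategy="median", fixed_value=None):
    """Replace None entries using a fixed value or a summary statistic."""
    observed = [v for v in column if v is not None]
    if strategy == "fixed":
        fill = fixed_value
    elif strategy == "mean":
        fill = mean(observed)
    elif strategy == "median":
        fill = median(observed)
    elif strategy == "mode":
        fill = mode(observed)      # most frequent value, suits categorical fields
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in column]

# Hypothetical columns with missing entries marked as None
income = [3200, None, 4100, 2800, None, 5000]
employment = ["employed", None, "employed", "self-employed"]

print(fill_missing(income, "median"))
print(fill_missing(employment, "mode"))
print(fill_missing(income, "fixed", fixed_value=0))
```

The median is often preferred over the mean for skewed fields such as income, while the mode is the natural choice for categorical fields.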

The features of the data have a positive or negative impact on the experimental results. In particular, the source domain in this paper has a very large number of features, including many redundant and highly correlated ones. First, redundant features are deleted according to the meaning of the features.

Common ground | Features | Meaning |
---|---|---|
Redundant features | Listing Key | Unique key for each listing, same value as the “key” used in the listing object in the API. |
Redundant features | Listing Number | The number that uniquely identifies the listing to the public as displayed on the website. |
Redundant features | Loan Number | Unique numeric value associated with the loan. |
Redundant features | Lender Yield | The Lender yield on the loan. Lender yield is equal to the interest rate on the loan less the servicing fee. |
Redundant features | Loan Key | Unique key for each loan. This is the same key that is used in the API. |
Characteristics related only to investors | LP_Interest and Fees | Cumulative collection fees paid by the investors who have invested in the loan. |
Characteristics related only to investors | LP_Collection Fees | Cumulative collection fees paid by the investors who have invested in the loan. |
Characteristics related only to investors | LP_Gross Principal Loss | The gross charged off amount of the loan. |
Characteristics related only to investors | LP_Net Principal Loss | The principal that remains uncollected after any recoveries. |
Characteristics related only to investors | Percent Funded | Percent the listing was funded. |
Characteristics related only to investors | Investment From Friends Count | Number of friends that made an investment in the loan. |
Characteristics related only to investors | Investment From Friends Amount | Dollar amount of investments that were made by friends. |

This paper uses Information divergence (information gain, IG) to select the remaining features. Information divergence is often used to measure the contribution of a feature to the whole, so it can also be used for feature selection. It is based on entropy, a measure of the uncertainty of a random variable. Entropy can be subdivided into information entropy and conditional entropy. The computational formula is shown in

The calculation of Information divergence is based on information entropy and conditional entropy. The computational formula is as follows.

$$IG(T) = H(C) - H(C \mid T) \tag{5}$$
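Equation (5) can be computed directly from label counts. The following sketch is an illustration rather than the program used in the paper: the function names and the toy columns are hypothetical, and the conditional entropy is written for any discrete feature, which for a binary feature reduces to the two-term form $P(t)H(C \mid t) + P(\bar{t})H(C \mid \bar{t})$.

```python
import math
from collections import Counter

def entropy(labels):
    """Information entropy H(C) of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def conditional_entropy(feature, labels):
    """H(C | T): entropy of the labels within each feature value, weighted by frequency."""
    n = len(labels)
    total = 0.0
    for value in set(feature):
        subset = [lab for f, lab in zip(feature, labels) if f == value]
        total += len(subset) / n * entropy(subset)
    return total

def information_gain(feature, labels):
    """Equation (5): IG(T) = H(C) - H(C | T)."""
    return entropy(labels) - conditional_entropy(feature, labels)

# Hypothetical toy columns: 1 = default, 0 = no default
default = [1, 1, 0, 0]
has_delinquency = [1, 1, 0, 0]   # separates the classes perfectly
listing_parity = [0, 1, 0, 1]    # carries no information about the class

print(information_gain(has_delinquency, default))
print(information_gain(listing_parity, default))
```

A feature that separates the classes perfectly recovers the full entropy of the labels, while an uninformative feature has a gain of zero, which is why ranking features by this value can be used to drop the least useful ones.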

Using a Python program, the entropy of the overall dataset and the Information divergence of each feature can be obtained. The greater the Information divergence, the greater the contribution of the feature to the overall dataset. Since the source domain data in this paper contain many useless features, the Information divergence of each feature is calculated and shown in

First, the XGBoost algorithm is trained on $T_a$ and on $T_b$ separately. The training results are as follows.

It can be observed that training on $T_b$ alone does not give good performance (AUC 0.56). However, training XGBoost on $T_a$ gives a much higher AUC (0.97), which shows that it is feasible to use XGBoost as the Base Learner.

Quantity | Formula |
---|---|
Information entropy | $H(S) = -\sum_{i=1}^{C} p_i \log_2 (p_i)$ |
Conditional entropy | $H(C \mid T) = P(t) H(C \mid t) + P(\bar{t}) H(C \mid \bar{t})$ |

For the selection of the base learner, we ran extensive experiments covering traditional machine learning algorithms and ensemble learning algorithms. In this paper, we choose XGBoost as the base learner to construct TrAdaBoost (XGBoost).

It can be seen that the accuracy on $T_b$ after transfer is significantly higher than that obtained by training with the XGBoost algorithm alone.

The experiments compare the results in two ways: 1) within transfer learning, by choosing different Base Learners; 2) between transfer learning and traditional machine learning algorithms.

In the dimension of transfer learning, this paper adds the decision tree as a Base Learner of TrAdaBoost. The algorithm that uses a decision tree as the base learner is denoted TrAdaBoost (DT), and the algorithm that uses XGBoost as the base learner is denoted TrAdaBoost (XGBoost). TrAdaBoost is constructed with each of these two Base Learners in turn and used to predict the target data. The AUC value is the main criterion for evaluating the results; the models are also evaluated with precision, recall and F1.
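For reference, the four evaluation criteria can be computed as in the sketch below. It is an illustration only: the function names, labels and scores are made up, the 0.5 decision threshold is an assumption, and the second metric is taken to be precision, which is consistent with the reported F1 values (for example, precision 0.79 and recall 0.65 give an F1 of about 0.71).

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_from(precision, recall)

def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels and predicted default probabilities
y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
y_pred = [int(s >= 0.5) for s in scores]   # assumed decision threshold of 0.5

print(precision_recall_f1(y_true, y_pred))
print(auc(y_true, scores))
print(round(f1_from(0.79, 0.65), 2))
```

AUC depends only on how the scores rank positives against negatives, so it does not change with the decision threshold, whereas precision, recall and F1 do.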

Dataset | $T_a$ | $T_b$ |
---|---|---|
AUC | 0.97 | 0.56 |

Model | AUC | precision | recall | F1 |
---|---|---|---|---|
TrAdaBoost (XGBoost) | 0.80 | 0.79 | 0.65 | 0.71 |

It can be seen from the experimental results that using the ensemble learning algorithm XGBoost as the Base Learner raises the AUC from 0.62 to 0.80, an increase of 0.18 (18 percentage points), compared with using the simple decision tree algorithm as the Base Learner. This shows that the choice of Base Learner has an important influence on the final result.

To demonstrate the advantage of the transfer learning algorithm, this paper also uses decision tree, XGBoost and logistic regression to predict the target domain directly, and compares the results of training on the target domain data alone with those of the model in this paper. The results are shown in

From this, it is clear that training on the target domain with the transfer learning algorithm gives a higher AUC, precision and F1 than the three traditional machine learning models; its recall (0.65) is higher than that of XGBoost and decision tree and close to that of logistic regression (0.67). This further verifies that transfer learning algorithms are better suited to prediction problems with small samples.

TL vs. TL | TrAdaBoost (DT) | TrAdaBoost (XGBoost) |
---|---|---|
AUC | 0.62 | 0.80 |
precision | 0.64 | 0.79 |
recall | 0.61 | 0.65 |
F1 | 0.63 | 0.71 |

TL vs. ML | TrAdaBoost (XGBoost) | XGBoost | Decision Tree | Logistic |
---|---|---|---|---|
AUC | 0.80 | 0.56 | 0.61 | 0.64 |
precision | 0.79 | 0.59 | 0.61 | 0.62 |
recall | 0.65 | 0.59 | 0.64 | 0.67 |
F1 | 0.71 | 0.59 | 0.61 | 0.64 |

This paper constructs a personal credit evaluation model based on Instance-based Transfer Learning and focuses on the choice of Base Learner in its design. The model shows better classification and forecasting capabilities and can help banks and P2P financial institutions avoid risk to a certain extent. In addition, the model uses Information divergence to select the features with the greatest contribution, which reduces computational complexity. Extensive experiments were carried out to select the Base Learner and improve the accuracy of the model. The TrAdaBoost (XGBoost) model makes full use of the source domain information to complete training on the target domain, and it resolves the problem that a dataset cannot be trained well because of a lack of samples and many missing values. This work achieves the transfer of samples in the field of personal credit risk, which has some reference value for the financial field. The TrAdaBoost sample transfer model proposed in this paper uses the XGBoost ensemble learning algorithm as its Base Learner, which improves the accuracy and performance of the model and gives it good generalization capability.

The authors declare no conflicts of interest regarding the publication of this paper.

Wang, M.G. and Yang, H. (2021) Research on Personal Credit Risk Assessment Model Based on Instance-Based Transfer Learning. International Journal of Intelligence Science, 11, 44-55. https://doi.org/10.4236/ijis.2021.111004