Research on P2P Credit Risk Assessment Model Based on RBM Feature Extraction—Take SME Customers as an Example

This paper applies a nonlinear dimensionality reduction method, the restricted Boltzmann machine (RBM), to assess the credit risk of P2P borrowers. After screening and processing a large set of big-data indicators, the most representative indicators are selected to build the P2P customer credit risk assessment model. After comparing the advantages and disadvantages of linear and nonlinear dimensionality reduction algorithms, the paper establishes a P2P enterprise customer credit risk assessment model based on RBM feature extraction combined with contrastive divergence. The results show that RBM performs better than PCA when the same classification model is selected, and that the Logistic model performs best of the three models when the same feature extraction method is selected.


Introduction
Credit risk is a difficult problem in the current P2P industry. From a macro perspective, the low barriers to entry of P2P have made the uncontrollable macro-risk situation worse. From a micro perspective, most P2P platform businesses are still in their infancy: the operating experience and risk management capabilities of platform operators are generally insufficient, and their development is extremely unstable. Credit issues are therefore likely to remain a cause of large-scale risk in the P2P industry in the future. Although China's P2P industry has developed rapidly in recent years, theoretical research on P2P network lending by domestic and foreign scholars still centres on the development of Internet platform operations. There is little academic discussion of risk management, security prevention, and industry regulation, and the quantitative assessment of the credit risk of P2P enterprise customers is almost untouched. In view of this, this paper draws on existing research on credit risk assessment (such as the credit models of traditional commercial banks) and, after analyzing the credit risk characteristics of the P2P industry, evaluates the credit risk of P2P borrowers with artificial intelligence methods. Concretely, a machine learning method learns from the borrower's historical data to assess its future repayment ability and default risk, yielding a P2P enterprise credit risk assessment model suited to China's current national conditions.

Literature Review
Many scholars have studied credit risk measurement and evaluation models and adopted a variety of methods. Based on the traditional credit risk measurement model and the Ronalce model, Rosenberg & Gleity [1] constructed a new P2P credit risk measurement model and showed through simulation that a neural network model can obtain better results. Building on the traditional credit risk measurement model, Huang [2] combined it with the support vector machine in an empirical study of loan default, which showed that the new metric model combined with the support vector machine obtains better results than the metric model combined with a neural network. Puro et al. [3] took multiple factors as independent variables, including the borrower's loan amount, credit rating, current overdue loan amount, debt yield, and loan interest rate, constructed a logistic regression model for testing, and obtained good results. Jiang Wei [4] replaced the training algorithm of the BP neural network with an improved particle swarm optimization (PSO) algorithm and, combined with a credit evaluation index system, built a personal credit evaluation model based on the improved PSO-BP neural network; the model quantitatively evaluates the credit of the lender and improves the automation of personal credit evaluation. Liu Chang and Xu Zhuoting Open Journal of Business and Management tion model with the loan data of Lending Club, the world's largest P2P company, and gave the prediction accuracy, in order to provide a credit risk management method for domestic P2P companies.
The nonlinear dimensionality reduction method used in this paper, the restricted Boltzmann machine (RBM), comes from the field of unsupervised learning, a multi-layer restricted Boltzmann proposed by Professor Hinton and Log-Sum-RBM has better characterization ability than Sp-RBM [9]. The author also analyzes the application of different RBM models in the field of credit risk assessment.

Research Method
In classical neural network theory, Professor Hinton treats the restricted Boltzmann machine (RBM) as a typical undirected graph, as shown in Figure 1. Here $v$ denotes the visible layer, which represents the input data set in the P2P customer credit risk assessment study, and $h$ denotes the hidden layer, which acts as a feature extractor in our credit evaluation research.
In other words, the hidden layer performs the dimension reduction. Between the visible and hidden layers, $W$ denotes the connection weights. In the most classic RBM models, all visible and hidden neurons are binary variables, taking values in $\{0, 1\}$ [10]. In practical applications, the quantity of most interest is the distribution of the visible neurons $v$ defined by the RBM parameters, $P(v \mid \theta)$; the same form applies to the distribution of the hidden neurons, $P(h \mid \theta)$. To evaluate $P(v \mid \theta)$ we need the normalization factor $Z(\theta)$, whose computation takes roughly $2^{n+m}$ operations. In view of this, even if we obtain the parameters $\omega_{i,j}$, $a_j$ and $b_j$ by training the model, we still cannot accurately calculate the distribution determined by these parameters.
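For reference, the standard binary RBM that this passage describes can be written out as follows. This is a sketch in the usual textbook notation, not a recovery of the paper's own numbered formulas. The index convention is an assumption: $a_i$ is the bias of visible unit $i$, $b_j$ is the bias of hidden unit $j$, $\omega_{i,j}$ is the weight between them, and there are $n$ visible and $m$ hidden units.

```latex
E(v, h \mid \theta) = -\sum_{i=1}^{n} a_i v_i - \sum_{j=1}^{m} b_j h_j
                      - \sum_{i=1}^{n} \sum_{j=1}^{m} v_i \, \omega_{i,j} \, h_j ,
\qquad v_i, h_j \in \{0, 1\}

P(v, h \mid \theta) = \frac{e^{-E(v, h \mid \theta)}}{Z(\theta)},
\qquad
Z(\theta) = \sum_{v} \sum_{h} e^{-E(v, h \mid \theta)}

P(v \mid \theta) = \frac{1}{Z(\theta)} \sum_{h} e^{-E(v, h \mid \theta)}
```

The double sum in $Z(\theta)$ runs over all $2^{n+m}$ joint binary states, which is why the normalization factor is intractable for anything but very small models.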
It is worth noting that, because of the special structure of the RBM (connections exist only between the two layers, not within a layer), once the states of the visible neurons are given, the activation states of the hidden neurons are conditionally independent [11].
Let $h_{-k}$ denote the vector obtained from $h$ by removing the component $h_k$, and use formulas (3) and (4). Through this derivation we arrive at formula (7). Because the RBM structure is symmetric, when the states of the hidden neurons are fixed, the activation states of the visible neurons are likewise conditionally independent [12]. Similarly, we derive the activation probability of the $i$-th visible neuron, as shown in (8). Finally, these formulas give the activation probabilities of the individual neurons in each layer.
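The conditional independence described above can be sketched in code. In a standard binary RBM, each hidden unit switches on with probability sigmoid(b_j + sum_i v_i w_ij), and each visible unit with probability sigmoid(a_i + sum_j w_ij h_j). The snippet below assumes that textbook form. The function names, the index convention `W[i][j]`, and the toy parameter values are all hypothetical and not taken from the paper.

```python
import math
import random


def sigmoid(x):
    """Logistic function, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def hidden_probs(v, W, b):
    """P(h_j = 1 | v) for every hidden unit j; W[i][j] links visible i to hidden j."""
    return [sigmoid(b[j] + sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(b))]


def visible_probs(h, W, a):
    """P(v_i = 1 | h) for every visible unit i."""
    return [sigmoid(a[i] + sum(W[i][j] * h[j] for j in range(len(h))))
            for i in range(len(a))]


def sample(probs, rng):
    """Draw each binary unit independently, which conditional independence allows."""
    return [1 if rng.random() < p else 0 for p in probs]


def gibbs_step(v, W, a, b, rng):
    """One alternating Gibbs step v -> h -> v'; returns (v', h)."""
    h = sample(hidden_probs(v, W, b), rng)
    return sample(visible_probs(h, W, a), rng), h


# Toy parameters: 3 visible units, 2 hidden units (hypothetical values).
W = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]]
a = [0.0, 0.1, -0.1]
b = [0.2, -0.4]
v0 = [1, 0, 1]

p_h = hidden_probs(v0, W, b)   # sigmoid(0.4) and sigmoid(0.2)
v1, h0 = gibbs_step(v0, W, a, b, random.Random(0))
```

Because every unit in a layer can be sampled independently given the other layer, one Gibbs step costs only a pass over the weight matrix, which is what makes contrastive divergence training practical.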

Data Description
Whether a credit risk assessment model is effective depends largely on whether it can accurately identify the potential financial problems of SMEs that borrow through P2P. The ideal sample in this section is therefore SMEs that have borrowed through a P2P platform. However, P2P platforms do not disclose borrowers' specific information, and most companies that raise funds through P2P platforms are not listed, so they have no obligation to publish financial statements. It is therefore difficult to collect enough sample data to support this empirical study. To make the modeling feasible, the main approach of this research is to find potential lending companies and representa- Data is collected from the WIND database and the WDZJ-OFFICIAL website.

Indicator Selection
This article builds individual user portraits along six dimensions: identity information verification labels, stability information labels, financial application information labels, important asset information labels, commodity consumption information labels, and media viewing information labels. Enterprise executive information is considered in addition. On this basis, and combined with the empirical data, the indicators are made concrete and the P2P enterprise customer credit index system is established.
After fully considering the difficulty of obtaining each indicator, the pre-selected corporate customer credit indicators are summarized along three dimensions: the company's own labels, the labels of the company's main executives, and external evaluation labels.
Among the nearly 200 pre-selected indicator variables, those with significant effects must be screened out. For the credit risk assessment of P2P enterprise customers, the first task is to discretize the continuous variables, which enables the subsequent data grouping and WOE coding and the computation of the IV value.
In traditional machine learning models, improper discretization of the data set greatly reduces the classification accuracy of the trained model. After weighing the alternatives, we chose the entropy-based discretization method for the continuous variables.
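As an illustration of entropy-based discretization, the sketch below picks cut points that maximize information gain with respect to the overdue label. The paper does not say which variant it uses, so the recursive binary splitting, the fixed depth limit, and all names and data here are assumptions made for the example.

```python
import math


def entropy(labels):
    """Shannon entropy (in bits) of a list of 0/1 class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return -sum(q * math.log2(q) for q in (p, 1.0 - p) if q > 0)


def best_cut(values, labels):
    """Return (cut, gain) for the binary split with the largest information gain."""
    pairs = sorted(zip(values, labels))
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    base = entropy(ys)
    best = (None, 0.0)
    for k in range(1, len(xs)):
        if xs[k] == xs[k - 1]:
            continue  # a cut can only fall between two distinct values
        left, right = ys[:k], ys[k:]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(ys)
        gain = base - weighted
        if gain > best[1]:
            best = ((xs[k - 1] + xs[k]) / 2.0, gain)
    return best


def entropy_cuts(values, labels, max_depth=2, min_gain=1e-9):
    """Recursively split on the best cut; returns the sorted list of cut points."""
    if max_depth == 0:
        return []
    cut, gain = best_cut(values, labels)
    if cut is None or gain <= min_gain:
        return []
    cuts = [cut]
    for keep_left in (True, False):
        part = [(x, y) for x, y in zip(values, labels) if (x <= cut) == keep_left]
        xs, ys = zip(*part)
        cuts += entropy_cuts(list(xs), list(ys), max_depth - 1, min_gain)
    return sorted(cuts)


income = [1, 2, 3, 4, 10, 11, 12, 13]   # hypothetical continuous indicator
overdue = [1, 1, 1, 1, 0, 0, 0, 0]      # 1 = overdue, 0 = repaid normally
cuts = entropy_cuts(income, overdue)
```

With `max_depth=2` the sketch yields at most three cut points (four bins). A production version would normally replace the fixed depth with a principled stopping rule.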
To obtain the IV indicator, we first need to calculate the WOE (Weight of Evidence) value [13]. In the P2P enterprise customer credit risk assessment model built in this paper, the dependent variable indicates whether an enterprise loan is overdue or repaid normally. In effect, WOE measures the proportion of defaults when the independent variable takes a value in a particular dimension (bin). The larger the WOE value, the more important that dimension is.
The IV is then obtained from the WOE values, as in formula (12). Take the operating income indicator as an example of how to calculate the WOE of each bin (each attribute of the variable) and the IV; see Table 2.
The IV of the operating income indicator is 0.36, which is greater than 0.3, so it is an indicator with strong predictive ability. In actual data analysis, a variable with an IV between 0.01 and 0.02 is sometimes still significant in Logistic regression. Therefore, this paper adopts a conservative approach and excludes only variables with an IV below 0.01 (Table 3).
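The WOE and IV screening can be sketched as follows, using the commonly cited definitions: for each bin, WOE = ln(share of overdue samples / share of normal samples), and IV sums (share of overdue - share of normal) x WOE over the bins. The direction of the ratio, the absence of any scaling factor, the variable names, and the counts are assumptions for illustration. They do not reproduce Table 2 or the paper's value of 0.36.

```python
import math


def woe_iv(bins):
    """bins: list of (n_overdue, n_normal) counts per bin. Returns (woe_list, iv)."""
    total_bad = sum(bad for bad, _ in bins)
    total_good = sum(good for _, good in bins)
    woes, iv = [], 0.0
    for bad, good in bins:
        if bad == 0 or good == 0:
            raise ValueError("each bin needs at least one overdue and one normal sample")
        p_bad, p_good = bad / total_bad, good / total_good
        woe = math.log(p_bad / p_good)
        woes.append(woe)
        iv += (p_bad - p_good) * woe
    return woes, iv


# Hypothetical binned counts for two candidate indicators.
binned = {
    "operating_income": [(30, 70), (20, 80), (10, 90)],
    "uninformative_flag": [(30, 120), (30, 120)],
}
iv_table = {name: woe_iv(bins)[1] for name, bins in binned.items()}

# Conservative screen from the text: drop only variables with IV below 0.01.
selected = [name for name, iv in iv_table.items() if iv >= 0.01]
```

A bin with no overdue or no normal samples makes the logarithm undefined, so the sketch raises an error there. Merging sparse bins or adding a small smoothing count are the usual workarounds.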

RBM Feature Extraction
Analysis of the sample data found that there were only 33 samples of the GEM <  Table 4. Finally, a hidden layer of 40 neuron nodes, which gave the smallest single-layer reconstruction error, was selected to train the RBM model.
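The selection of the hidden layer size by reconstruction error can be sketched as below. This is a minimal pure-Python RBM trained with one-step contrastive divergence (CD-1) under assumed settings: the learning rate, epoch count, candidate sizes, and toy data are hypothetical, and the error is measured as mean squared error between each sample and its deterministic reconstruction. The paper's own training configuration is not given in the text.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x)) if x >= 0 else math.exp(x) / (1.0 + math.exp(x))


def train_rbm(data, n_hidden, epochs=200, lr=0.1, seed=0):
    """Train a binary RBM with CD-1; returns the two conditional-probability functions."""
    rng = random.Random(seed)
    n_visible = len(data[0])
    W = [[rng.gauss(0.0, 0.01) for _ in range(n_hidden)] for _ in range(n_visible)]
    a = [0.0] * n_visible  # visible biases
    b = [0.0] * n_hidden   # hidden biases

    def p_hidden(v):
        return [sigmoid(b[j] + sum(v[i] * W[i][j] for i in range(n_visible)))
                for j in range(n_hidden)]

    def p_visible(h):
        return [sigmoid(a[i] + sum(W[i][j] * h[j] for j in range(n_hidden)))
                for i in range(n_visible)]

    for _ in range(epochs):
        for v0 in data:
            ph0 = p_hidden(v0)                                # positive phase
            h0 = [1 if rng.random() < p else 0 for p in ph0]  # sampled hidden states
            v1 = p_visible(h0)                                # reconstruction (probabilities)
            ph1 = p_hidden(v1)                                # negative phase
            for i in range(n_visible):
                for j in range(n_hidden):
                    W[i][j] += lr * (v0[i] * ph0[j] - v1[i] * ph1[j])
                a[i] += lr * (v0[i] - v1[i])
            for j in range(n_hidden):
                b[j] += lr * (ph0[j] - ph1[j])
    return p_hidden, p_visible


def reconstruction_error(data, p_hidden, p_visible):
    """Mean squared error between each sample and its deterministic reconstruction."""
    total = 0.0
    for v in data:
        recon = p_visible(p_hidden(v))
        total += sum((x - r) ** 2 for x, r in zip(v, recon))
    return total / (len(data) * len(data[0]))


data = [[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]] * 4  # toy binary indicators
candidates = [1, 2, 4]                               # hypothetical hidden layer sizes
errors = {}
for n_hidden in candidates:
    p_hidden, p_visible = train_rbm(data, n_hidden)
    errors[n_hidden] = reconstruction_error(data, p_hidden, p_visible)
best_n_hidden = min(errors, key=errors.get)
```

After training, `p_hidden` applied to a sample gives the extracted features that would be passed to the downstream classifier.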

Model Comparison
After the above data preprocessing, we feed the P2P enterprise data into machine learning models for classification and prediction, such as SVM, Logistic, and KNN. As can be seen: 1) RBM performs better than PCA when the same model is selected. 2) The Logistic model performs best of the three models when the same feature extraction method is selected. Overall, the RBM-Logistic model gives the best classification, with an accuracy of 74.87% (Figure 2).
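The comparison protocol (same features, different classifiers, ranked by hold-out accuracy) can be sketched as below. This is only an illustration on toy, perfectly separable data: the 2-D points stand in for extracted features, SVM is left out to keep the sketch free of dependencies, and every name and hyperparameter is an assumption. Its scores say nothing about the paper's reported results.

```python
import math


def sigmoid(z):
    """Numerically safe logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic(X, y, epochs=300, lr=0.5):
    """Fit logistic regression by batch gradient descent; returns (weights, bias)."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_w = [g + err * xj for g, xj in zip(grad_w, xi)]
            grad_b += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def predict_logistic(model, x):
    w, b = model
    return 1 if sigmoid(b + sum(wj * xj for wj, xj in zip(w, x))) >= 0.5 else 0


def predict_knn(train_X, train_y, x, k=3):
    """Majority vote among the k nearest training points (squared Euclidean distance)."""
    nearest = sorted(zip(train_X, train_y),
                     key=lambda p: sum((u - c) ** 2 for u, c in zip(p[0], x)))[:k]
    return 1 if sum(label for _, label in nearest) * 2 > k else 0


def accuracy(predict, X, y):
    """Share of samples whose predicted label matches the true label."""
    return sum(predict(x) == t for x, t in zip(X, y)) / len(y)


# Hypothetical 2-D features standing in for extracted RBM or PCA features.
points = [((1 + 0.1 * k, 1 + 0.05 * k), 1) for k in range(10)]
points += [((-1 - 0.1 * k, -1 - 0.05 * k), 0) for k in range(10)]
train, test = points[0::2], points[1::2]  # alternate samples into train and hold-out

train_X, train_y = [p for p, _ in train], [t for _, t in train]
test_X, test_y = [p for p, _ in test], [t for _, t in test]

model = train_logistic(train_X, train_y)
scores = {
    "Logistic": accuracy(lambda x: predict_logistic(model, x), test_X, test_y),
    "KNN": accuracy(lambda x: predict_knn(train_X, train_y, x), test_X, test_y),
}
best_model = max(scores, key=scores.get)
```

To compare feature extractors as well, the same loop would be run once per feature set, keeping the train and hold-out split fixed so that accuracies are comparable.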

Result
The P2P SME customer credit index system screened by IV value shows that corporate financial information (such as operating income, net assets growth, and current liabilities total ratio) is a very important component of P2P SME credit risk assessment. This indicates that fully understanding the financial situation of SMEs, with a focus on their debt situation, is of great benefit to the construction of the credit evaluation index system. A good credit risk model not only reduces the P2P credit platform's burden of reviewing SME qualifications and lowers the risk of lending, but also speeds up the financing process for SMEs, which benefits both parties. It is therefore very important for a P2P online lending platform to construct a scientific and reasonable credit evaluation model.
After comparing the advantages and disadvantages of linear and nonlinear dimensionality reduction algorithms, a P2P enterprise customer credit risk assessment model based on RBM feature extraction, combined with contrastive divergence, is established. It is concluded that RBM performs better than PCA when the same model is selected, and that the Logistic model performs best of the three models when the same feature extraction method is selected. Therefore, a P2P network lending platform can consider using the RBM-Logistic model, which has the highest accuracy, when conducting credit risk assessment for SMEs.