Machine Learning Methods of Bankruptcy Prediction Using Accounting Ratios

The aim of bankruptcy prediction is to give an enterprise's stakeholders comprehensive information about the enterprise. Most bankruptcy prediction has relied on statistical models, which achieve low prediction accuracy. With the advent of artificial intelligence (AI), however, machine learning methods have come into extensive use in many fields (e.g., medicine and archaeology). In this paper we compare a statistical method with machine learning methods for predicting bankruptcy, using data on Chinese listed companies. First, we use statistical tests to choose the most appropriate indicators. Different indicators have different characteristics, and not all of them are suitable for analysis; after data filtering, the remaining indicators are more persuasive. Second, unlike previous studies, we use the same sample set for all methods, so the final results can demonstrate the effectiveness of the machine learning methods. Third, the accuracy achieved in our experiment, 95.9%, is higher than that of existing studies.


Introduction
For a long time, corporate bankruptcy prediction has been one of the most significant parts of evaluating corporate prospects. Lenders, investors, governments and other stakeholders are eager to find an efficient way to understand a company's capabilities so that they can make suitable decisions. Companies, whether small or large, need models to assess their financial risks. For example, Altman (1968) used multivariate discriminant analysis to predict corporate bankruptcy [1].
Research on bankruptcy prediction dates back to the early 20th century, when Fitzpatrick (1932) used economic indices to describe the predictive capacity of defaulting businesses [2]. After that, more and more researchers focused on bankruptcy prediction (e.g., Winakor and Smith (1935) [3]; Merwin (1942) [4]). The turning point in the study of business failure symptoms came in 1966, when Beaver pioneered the use of statistical models to make financial forecasts. Following this line of thinking, scholars proposed many representative statistical models [5]. Ohlson (1980) used logistic regression to forecast financial status [6]. In 1985, West made financial forecasts using factor analysis [7]. Similarly, a great number of generalized linear models for predicting financial condition continued to emerge (e.g., Aziz, Emanuel and Lawson (1988) [8]; Koh (2010) [9]; Platt, Platt and Pederson (1994) [10]; Upneja and Dalbor (2000) [11]; Beaver, McNichols and Rhie (2005) [12], among others).
Since the beginning of the 21st century, AI and machine learning methods have become increasingly popular in many different fields: for example, Subasai and Ismail Gursoy (2010) [13] and de Menezes, Liska, Cirillo, and Vivanco (2016) [14] in medicine; Maione et al. (2016) [15] and Cano et al. (2016) [16] in chemistry; and Heo and Yang (2014) [17] and Kim, Kang, and Kim (2015) [18] in finance. Beyond those fields, machine learning is widely used in a variety of disciplines, and bankruptcy prediction is one of them. With the advent of the big data era, statistical models have shown weaknesses in bankruptcy prediction.
Consequently, researchers have had to find new methods to overcome the shortcomings of statistical methods. Since bankruptcy prediction is essentially a classification problem, academics have explored machine learning tools that can separate bankrupt from non-bankrupt companies (Wilson and Sharda (1994) [19]; Tsai (2008) [20]; Chen et al. (2011) [21]). In addition, many researchers combine statistical and machine learning methods to further improve the reliability of bankruptcy prediction. Cho et al. (2010) introduced a hybrid model that selects variables with a decision tree and applies case-based reasoning using a weighted Mahalanobis distance [22]. Chen et al. (2009) introduced a hybrid model combining fuzzy logic and a neural network [23]. Their results show that the hybrid model has higher accuracy than the logit model. All in all, the development of information science has had a great influence on all fields of scientific research.
As a hot research topic in computer science, machine learning comprises many different methods, including decision trees, support vector machines (SVM), the K-nearest neighbor method (KNN), random forests, logistic regression, artificial neural networks (ANN) and so on. Support vector machines are among the most successful models; see, for example, Cortes and Vapnik (1995) [24] [29]. Artificial neural networks, in turn, can improve accuracy by tuning their parameters. On this basis, this paper compares statistical methods with computer science methods to find the most effective bankruptcy prediction model.
The rest of the paper proceeds as follows. In Section 2, we briefly introduce the data filtering methods and the machine learning methods. In Section 3, we present the data filtering process. In Section 4, we conduct the experiments and report their results. Section 5 concludes the article and gives suggestions for future research.

Normal Distribution Test
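The normal distribution underlies many hypothesis tests. The density of the one-dimensional normal distribution is

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^{2}}{2\sigma^{2}}\right)$$

where $\mu$ is the mean or expectation of the distribution (and also its median and mode), $\sigma$ is the standard deviation, and $\sigma^{2}$ is the variance.
The Shapiro-Wilk statistic used to test normality is

$$W = \frac{\left(\sum_{i=1}^{n} a_i x_{(i)}\right)^{2}}{\sum_{i=1}^{n} (x_i - \bar{x})^{2}}$$

where $x_{(i)}$ is the i-th order statistic, $\bar{x}$ is the sample mean, and the constants $a_i$ are given by

$$(a_1, \dots, a_n) = \frac{m^{T} V^{-1}}{\left(m^{T} V^{-1} V^{-1} m\right)^{1/2}}$$

with $m$ the vector of expected values of the order statistics of a standard normal sample of size $n$ and $V$ the covariance matrix of those order statistics. The observations are then divided into two separate samples.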

Wilcoxon Rank-Sum Test
Let the size of the first sample, x, be $n_1$ and the size of the second sample, y, be $n_2$. In the mixed sample of size $n = n_1 + n_2$ (first and second samples combined), the rank sum of the x sample is $W_x$ and the rank sum of the y sample is $W_y$. The value of Z is

$$Z = \frac{W_x - n_1 (n + 1)/2}{\sqrt{n_1 n_2 (n + 1)/12}}$$

According to the significance level, we determine whether to accept the null hypothesis.
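The rank-sum computation can be illustrated with a minimal Python sketch. It assumes the usual normal approximation for Z, assigns average ranks to tied values, and applies no tie correction to the variance; the function name `rank_sum_z` and the sample values are hypothetical and are not taken from the paper's data.

```python
import math

def rank_sum_z(x, y):
    """Normal-approximation Z for the Wilcoxon rank-sum test.

    Tied values receive the average of the ranks they span; the variance
    is not corrected for ties (an assumption of this sketch).
    """
    n1, n2 = len(x), len(y)
    n = n1 + n2
    # Pool both samples, remembering which sample each value came from.
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1
    # W_x: sum of the ranks that belong to the x sample.
    w_x = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    expected = n1 * (n + 1) / 2
    std_dev = math.sqrt(n1 * n2 * (n + 1) / 12)
    return (w_x - expected) / std_dev

# Hypothetical values of one financial ratio in two groups of companies.
st_companies = [0.12, 0.05, 0.20, 0.08]
healthy_companies = [0.35, 0.40, 0.28, 0.51, 0.33]
z = rank_sum_z(st_companies, healthy_companies)
```

A large absolute value of `z` relative to the critical value for the chosen significance level suggests that the indicator differs between the two groups.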

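Principal Component Analysis (PCA)
Suppose that we have a random vector X:

$$X = (X_1, X_2, \dots, X_p)^{T}$$

with population variance-covariance matrix

$$\operatorname{var}(X) = \Sigma = (\sigma_{kl})_{p \times p}$$

Consider the linear combinations

$$Y_i = e_{i1} X_1 + e_{i2} X_2 + \dots + e_{ip} X_p, \qquad i = 1, \dots, p$$

Each of these can be thought of as a linear regression predicting $Y_i$ from $X_1, X_2, \dots, X_p$. There is no intercept, but $e_{i1}, e_{i2}, \dots, e_{ip}$ can be viewed as regression coefficients.
Note that $Y_i$ is a function of our random data, and so is also random. Therefore it has a population variance

$$\operatorname{var}(Y_i) = \sum_{k=1}^{p} \sum_{l=1}^{p} e_{ik} e_{il} \sigma_{kl} = e_i^{T} \Sigma e_i$$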
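As an illustration of how such a linear combination can be obtained, the following sketch estimates the first principal component, i.e. the unit-length coefficient vector whose combination has the largest variance, by power iteration on the sample covariance matrix. This is only one possible way to compute it, not necessarily the procedure used in the paper; the function names and the data are hypothetical, and the sketch assumes the covariance matrix is not all zeros.

```python
import math

def covariance_matrix(rows):
    """Sample variance-covariance matrix of observations given as rows."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    return [[sum(c[k] * c[l] for c in centered) / (n - 1) for l in range(p)]
            for k in range(p)]

def first_component(cov, iterations=500):
    """Power iteration: returns (coefficients, variance) of the first component."""
    p = len(cov)
    e = [1.0 / math.sqrt(p)] * p  # arbitrary unit-length starting vector
    for _ in range(iterations):
        v = [sum(cov[k][l] * e[l] for l in range(p)) for k in range(p)]
        norm = math.sqrt(sum(c * c for c in v))
        e = [c / norm for c in v]
    # Variance of the combination: the quadratic form e^T * cov * e.
    variance = sum(e[k] * cov[k][l] * e[l] for k in range(p) for l in range(p))
    return e, variance

# Hypothetical observations of two correlated financial ratios.
observations = [(0.10, 0.22), (0.25, 0.48), (0.31, 0.65), (0.40, 0.79), (0.52, 1.01)]
coefficients, variance = first_component(covariance_matrix(observations))
```

The returned `variance` is the value of the quadratic form for the fitted coefficients, which matches the population-variance expression once the covariance matrix is replaced by its sample estimate.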

K Nearest Neighbors (KNN)
The principal idea of KNN is that if the majority of the k samples nearest to a given sample in the feature space belong to a certain category, then the sample also belongs to that category. For example, consider distinguishing between cats and another category, as in Figure 1(a) and Figure 1(b).
When k = 3, the three lines connect the star to its three closest points. Circles form the majority among these points, so the star is assigned to the cat category.
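The voting rule can be sketched in a few lines of Python. This is a minimal illustration that assumes Euclidean distance and an unweighted majority vote; the function name `knn_predict`, the coordinates and the label "other" are hypothetical stand-ins for the points in the figure.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Return the majority label among the k training points nearest to query.

    train is a list of (features, label) pairs; distance is Euclidean.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical two-feature points standing in for the circles and the other shape.
train = [((1.0, 1.0), "cat"), ((1.2, 0.8), "cat"), ((0.9, 1.3), "cat"),
         ((3.0, 3.2), "other"), ((3.1, 2.9), "other")]
star = (1.1, 1.0)
label = knn_predict(train, star, k=3)
```

Choosing an odd k avoids tied votes in a two-class problem such as bankrupt versus non-bankrupt.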

Logistic Regression
The logistic regression model is a two-class model. It assigns weights to the selected features and uses the logistic function to calculate the probability that a sample belongs to each class. That is, a sample belongs to one class with a certain probability and to the other class with the complementary probability; the sample is assigned to the class with the larger probability.
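In standard notation, the model described above can be written as follows. The symbols are introduced here for illustration and do not appear in the original text: $x$ is the feature vector, $w$ the weight vector, $b$ the intercept and $y \in \{0, 1\}$ the class label.

```latex
P(y = 1 \mid x) = \frac{1}{1 + e^{-(w^{T} x + b)}},
\qquad
P(y = 0 \mid x) = 1 - P(y = 1 \mid x)
```

Under this formulation a sample is assigned to class 1 when $P(y = 1 \mid x) \ge 0.5$, which is equivalent to $w^{T} x + b \ge 0$.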

Decision Tree
A decision tree is a predictive model that represents a mapping between object attributes and object values. It classifies samples according to their features: each node poses a question, the data are divided into two groups according to the answer, and further questions are then asked of each group. These questions are learned from existing data; when new data arrive, they are routed to the appropriate leaf by answering the tree's questions.

Support Vector Machines (SVM)
The support vector machine (SVM) is a learning method built on the VC dimension theory and the structural risk minimization principle of statistical learning theory. Given limited sample information, it seeks the best trade-off between model complexity and learning ability in order to obtain the best generalization ability. SVM is a binary classification algorithm that finds an (N-1)-dimensional hyperplane in N-dimensional space, and this hyperplane separates the points into two categories. That is to say, if there are two classes of linearly separable points in the plane, SVM can find an optimal straight line separating them.
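For the linearly separable case, the optimal hyperplane $w^{T} x + b = 0$ is commonly obtained from the hard-margin problem below. The notation is an assumption of this sketch rather than the paper's: $x_i$ are the training points, $y_i \in \{-1, +1\}$ their class labels, and $n$ the number of training points.

```latex
\min_{w,\, b} \; \frac{1}{2} \lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i \left( w^{T} x_i + b \right) \ge 1, \qquad i = 1, \dots, n
```

Minimizing $\lVert w \rVert$ maximizes the margin $2 / \lVert w \rVert$ between the two classes, and a new point $x$ is classified by the sign of $w^{T} x + b$.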

Random Forest
Random forest is based on the decision tree. It is an ensemble classifier: it combines existing classifiers in a certain way so that weak classifiers are assembled into a classifier with stronger performance. The algorithm proceeds as follows: 1) Extract training sets from the original sample set. Each round draws N training samples from the original sample set by bootstrapping (so some samples may be drawn several times, while others may not be drawn at all). K rounds are performed, yielding K independent training sets; 2) K decision tree models are trained, one on each of the K training sets; 3) The K decision tree models vote to produce the classification result, with all models carrying equal weight.
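The three steps above can be sketched as follows. To keep the example short, each "tree" is a depth-1 decision stump, which is a simplification and not the full decision tree a real random forest would use; the per-split random feature selection found in common implementations is also omitted because the text does not describe it. All function names and data are hypothetical.

```python
import random
from collections import Counter

def fit_stump(samples):
    """Depth-1 tree: choose the (feature, threshold) split with the fewest errors."""
    best = None
    for f in range(len(samples[0][0])):
        for threshold in sorted({x[f] for x, _ in samples}):
            left = [y for x, y in samples if x[f] <= threshold]
            right = [y for x, y in samples if x[f] > threshold]
            if not left or not right:
                continue
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0]
            errors = (sum(y != left_label for y in left)
                      + sum(y != right_label for y in right))
            if best is None or errors < best[0]:
                best = (errors, f, threshold, left_label, right_label)
    if best is None:  # no valid split: always predict the majority label
        majority = Counter(y for _, y in samples).most_common(1)[0][0]
        return (0, float("inf"), majority, majority)
    return best[1:]

def predict_stump(stump, x):
    f, threshold, left_label, right_label = stump
    return left_label if x[f] <= threshold else right_label

def fit_forest(samples, k=25, seed=0):
    """Steps 1 and 2: draw K bootstrap training sets and fit one model on each."""
    rng = random.Random(seed)
    n = len(samples)
    return [fit_stump([rng.choice(samples) for _ in range(n)]) for _ in range(k)]

def predict_forest(forest, x):
    """Step 3: unweighted majority vote over the K models."""
    votes = Counter(predict_stump(stump, x) for stump in forest)
    return votes.most_common(1)[0][0]

# Hypothetical samples: (features, label), with label 1 = bankrupt, 0 = non-bankrupt.
samples = [((0.10, 5.0), 0), ((0.20, 3.0), 0), ((0.30, 4.0), 0), ((0.25, 6.0), 0),
           ((0.70, 5.5), 1), ((0.80, 3.5), 1), ((0.90, 4.5), 1), ((0.75, 6.5), 1)]
forest = fit_forest(samples, k=25)
```

Calling `predict_forest(forest, features)` on a new feature tuple returns the label that receives the most votes.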

Data-Set
In this paper, the data set is collected from the Wind Financial Terminal database and the CCER Economic and Financial database. It contains the financial data that capital-market enterprises disclose in their financial statements. In China, a listed company that reports losses for two consecutive years is marked ST, and a company that has been losing money for three years is marked *ST. Such companies are in danger of exiting the capital market. To assess the reliability of the method, we draw a random sample of financial data on Chinese capital-market companies covering 2000 to 2016. We take ST companies as the bankruptcy sample and companies whose net profit has been positive for four years as the non-bankruptcy sample. The data selection process is as follows. First of all, in China the financial and non-financial industries follow different accounting standards, so their accounting statements differ and the two kinds of industry must be distinguished. Because the financial industry is subject to considerable uncertainty arising mainly from national policies and regulations, and therefore carries high risk, we decided to analyze only the non-financial industry.
Next, it is necessary to choose the indicators used to evaluate an enterprise. From the financial management perspective, enterprise evaluation mainly covers four abilities: debt-paying ability, operation ability, profitability and development ability. Each ability is measured by a different set of indices. The indicators considered in this paper are described in detail in Table 1. After the above steps, 518 companies are selected as samples. The number of observations in each year is shown in Table 2.

Data Filtering Results
To reduce the computational complexity and improve the significance of the model, it is necessary to run a significance test and filter the candidate indicator variables. The filtering procedure is as follows. Before the significance test, each indicator must be tested for normality. The result of the normality test determines which method is used for the significance test; for indicators that do not follow the normal distribution, we choose the Wilcoxon rank-sum test, a non-parametric test. Finally, the model variables are determined according to a chosen significance level. In this paper, Stata 10.0 is used as the statistical software. Stata is widely used in data analysis and provides what researchers need for statistics, graphics, and data management.
There are many methods for checking normality, and different methods suit different sample characteristics. Based on the sample used in this paper, we choose the Shapiro-Wilk test, which is applicable to normality testing when the sample size is less than 2000. It is also widely used to explore the distribution of continuous random variables. The selected financial indicators are tested and the results are shown in Table 3.

Experiments
After data filtering, we conduct the experiments. In this section, we use both statistical and machine learning methods to predict bankruptcy. Figure 2 illustrates our methodology.
In this paper 19 variables are selected, which can be grouped into profitability, debt-paying ability, operation ability and development ability. We use the 518 companies to carry out bankruptcy prediction in two ways. The first is the statistical method, for which we use logistic regression. The second is the machine learning method. Machine learning methods need training data, so we choose a suitable training set; after the learning process, we use the test set to perform bankruptcy prediction. After the experiments, we compare the accuracy of the two approaches and determine which is more accurate.

Statistical Method
In this section, we use the statistical method to conduct bankruptcy prediction.
Bankruptcy prediction with the statistical method consists mainly of the following steps. First, principal component analysis is applied to the four groups of financial indicators. Second, binary logistic regression is used to predict bankruptcy. The results of the principal component analysis can be seen in Table 5 and Figure 3. Figure 3 shows the scree plot, score plot and loading plot of each variable. The loading plot in Figure 3(a) shows that the two principal components of profitability emphasize different financial indicators. As for the first ... F3 is the comprehensive evaluation of operation ability; F4 is the comprehensive evaluation of development ability. After principal component analysis, we have reduced our 19 financial measures to 4 financial indicators. On that basis, we perform binary logistic regression on F1, F2, F3 and F4. The results of bankruptcy prediction can be seen in Table 6.
From Table 6, we can draw the following conclusions. The prediction accuracy is 86% for non-bankrupt enterprises and 55.8% for bankrupt enterprises, so the overall prediction accuracy is 70.8%. This accuracy is consistent with that of most financial prediction models. However, it is not high and needs to be improved.

Machine Learning Methods
After the statistical method, we use machine learning methods for bankruptcy prediction. In this section, we choose KNN, SVM, logistic regression, random forest and decision tree to forecast bankruptcy. Figure 4 shows the ROC (Receiver Operating Characteristic) curve of the random forest model.
From Figure 4, we can conclude that the random forest model shows outstanding accuracy. The area of the shaded region under the curve reaches 0.99, which is close to the maximum area of 1. The shaded area reflects the diagnostic performance: the larger the area, the better the performance. In addition, to verify the effectiveness of the other machine learning methods, we computed the confusion matrices of the five machine learning methods, which can be seen in Figure 5. Figure 5 shows the bankruptcy prediction outcomes of the machine learning methods. Random forest and decision tree show high accuracy in non-bankruptcy prediction, with 95% and 94% respectively, while KNN, SVM and logistic regression have lower accuracy, with 88%, 88% and 84% respectively. Compared with the statistical method's 86%, four of the machine learning methods have higher predictive accuracy, the exception being logistic regression. As for bankruptcy prediction, the machine learning methods show significant superiority over the statistical method. Random forest shows the highest accuracy with 97%, and even KNN, which shows the lowest accuracy with 74%, is far above the statistical method's 55.8%. The remaining machine learning methods also have higher prediction accuracy than the statistical method.
In conclusion, the machine learning methods have higher bankruptcy prediction accuracy than the statistical method. The overall results of the different methods can be seen in Figure 6. The prediction accuracy of random forest is 95.9%. The results demonstrate that the machine learning methods are more accurate than the statistical method in predicting bankruptcy.
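To make explicit how per-class and overall accuracy relate to a confusion matrix, the following sketch computes both from a two-class matrix. The counts are hypothetical and are not the paper's results; the function name `class_accuracies` is likewise an assumption.

```python
def class_accuracies(confusion):
    """confusion[i][j] = number of samples of true class i predicted as class j.

    Returns (per-class accuracies, overall accuracy).
    """
    per_class = [row[i] / sum(row) for i, row in enumerate(confusion)]
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return per_class, correct / total

# Hypothetical counts; rows and columns are ordered (non-bankrupt, bankrupt).
confusion = [[40, 10],
             [6, 44]]
per_class, overall = class_accuracies(confusion)
```

Because overall accuracy weights each class by its number of samples, it equals the simple average of the two per-class accuracies only when both classes are the same size.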

Conclusions
This paper compares the accuracy of statistical and machine learning methods in predicting bankruptcy among Chinese listed companies. First, we use the Shapiro-Wilk test to check normality. Second, according to the Shapiro-Wilk result, we decide between a parametric and a non-parametric test. Finally, we use both the statistical method and the machine learning methods to predict bankruptcy. The empirical results show that the machine learning methods are superior to the statistical method. For the statistical method, we use principal component analysis to reduce the 19 financial measures to 4 financial indicators. Having obtained these comprehensive measurements of the financial indicators, we carry out binary logistic regression on the four comprehensive indices. The final accuracy is 70.8%. By contrast, each machine learning method (KNN, SVM, logistic regression, decision tree, and random forest) has greater accuracy than the statistical method.
In the future, we plan to extend our work to more company indicators. With the development of the Chinese capital market, more company characteristics will be disclosed. Based on this trend, we will select more indicators to predict bankruptcy. Furthermore, we will apply the method to more small and medium-sized enterprises in China. Subsequently, we will try more machine learning methods and improve the accuracy of bankruptcy prediction.
Open Journal of Business and Management