Classification and Prediction on Rural Property Mortgage Data with Three Data Mining Methods

The Farmers Property Mortgage Policy is a strategic financial policy in western China, a relatively underdeveloped region. Many contradictions and conflicts arise between farmers' strong demand for loans and the strict risk control of financial institutions. Rural finance corporations should evaluate potential customers through scientific analysis and investigation of each household, covering historical credit rating, present family situation, and other related information. In this paper, three different data mining methods were applied to specifically collected household data. The objective was to study which factors are most important in determining household loan demand and, at the same time, to classify and predict the likelihood of loan demand for potential customers. The three methods gave similar results: income level, land area, the way of loan, and the understanding of the policy were the four main factors determining the probability that a specific farmer applies for a credit loan. The results also revealed differences among the three methods in classifying and predicting the anticipated loan demand of the testing households. The artificial neural network model had the highest accuracy, 91.4%, better than the other two methods.


Introduction
China is a developing country whose development is unbalanced. The development of western China lags far behind that of the east, especially in rural areas. The central government encourages rural financial cooperatives to provide loans to farmers so that they can expand the scope of production. However, potential credit risks constrain credit operations between commercial banks and households. Financial institutions need to examine the basic situation and business of each household carefully before deciding whether to lend. Rural and agricultural development is facing a bottleneck caused by insufficient funds and a lack of financial credit. The national banks and rural financial cooperatives restrict lending tightly out of concern that the risk cannot be controlled, and the waiting time for a loan is invariably very long.
From experience and common sense, we know that some factors restrict the enthusiasm of households to apply for credit loans from financial institutions. Investigating the causal relationships among these factors, and using them to predict the probability that a specific farmer will apply for a loan, will be of practical value. Predicting the possibility of loan demand is one of the most interesting and challenging tasks for data mining applications. With the increased use of computing methods and data mining techniques, large volumes of financial data are being collected and made available to the research community. Prediction models are being developed from these historical data using knowledge discovery methods such as statistics and other optimization techniques. These models identify and exploit relationships among large numbers of variables describing households and financial institutions, and can predict loan demand from the historical cases stored in a database.
Previous studies used statistical models to analyse the correlation of these factors with loan demand [1] [2] [3]. Data- and statistics-driven research has been applied successfully in rural finance, and its results can produce useful suggestions for financial institutions evaluating a specific household. For instance, in [1], a multivariate regression model, namely a logistic regression model, is applied to study the relationship between the dependent variable, loan demand, and the other variables and influencing factors. However, all statistical models rest on assumptions. For example, all variables considered must follow a normal distribution, in addition to the assumption that the variables are mutually independent. The limitation of such assumptions is that the statistical results can be accepted only with a certain confidence level.
K. X. Zhang et al. Journal of Software Engineering and Applications
Causal relationships among variables can provide an intuitive view of a particular household and can support financial institutions in making a scientific assessment. There are many popular data mining methods that can be used to study these specific problems [4] [5]. In this paper, we utilise two machine learning methods (Bayesian network and artificial neural network) and one statistical technique (logistic regression) to build models for investigating the variables of interest and to develop prediction models for household loan demand. We designed a questionnaire and gathered a general set of data in which the observations are discrete. The objective of this paper is to study the causal relationships among variables in rural financial data with three data mining methods, and to build and evaluate classification and prediction models under the three methods. We expect the empirical analysis and the theoretical results to provide a valuable reference for the relevant financial institutions when they assess credit loans. The innovation of this paper lies in applying two typical data mining methods (Bayesian network and artificial neural network) to predict and analyse farmers' data, overcoming the insufficiency of the traditional statistical model (logistic regression). As for practical significance, this work is chiefly an application of data mining methods to the prediction of economic data in western China, and its results have reference value for the analysis of rural financial mortgage loan policies in western China.
The paper is organized as follows: we introduce the data and their properties in Section 2. In Section 3, we present the three methods respectively. The comparative analysis of classifications and predictions is described in Section 4. The conclusion is summarized in Section 5.

Datasets
The data used in this paper were collected between June 2011 and July 2012 by researchers from the College of Economics and Management, Northwest A&F University, China. The data collection was funded by the Chinese government under the project "Changjiang Scholars and Innovative Research Team in University, Jan 2012-Dec 2014, No.IRT1176". All data were taken from the western region of China, including Shaanxi province and Ningxia province. To ensure that the data are scientific and reasonable, we randomly surveyed a total of 4000 households in these regions using a questionnaire. The data consist of three main parts. The first part is basic information about the investigated farmer, including age, educational level, family size, land management, household income and expenditure structure, etc. The second part covers the loan status of the farmer, loan history over the past 5 years, and credit rating. The third part covers the understanding of, demand for, and satisfaction with the property rights mortgage. We selected a total of 11 factors in this research. Each factor has 2 to 5 attributes describing the different levels of a specific household. For example, the variable "Income (CNY)" has 5 levels (Table 1).
The meaning of each variable in Table 1 is as follows: Income represents the income level of the household; Wayofloan is the way of loan the household has used; Expenditure is the spending level of the household; FamilySize is the number of people living in the household; LoanDemand describes whether the household needs a loan or not; Policy is the level to which the household understands the loan policy; LandArea is the size of the land the household owns; Age is the age of the householder; Edu is the educational background of the householder; and Conven is how easy it is for the household to apply for a loan.

Methods and Prediction Models
In keeping with recently published literature as well as our previous studies, we use three different types of classification models in this paper: the Bayesian Network model (BN), Artificial Neural Networks (ANN), and Logistic Regression (LR). A brief introduction to these models follows.

Bayesian Network Model
Bayesian Networks (BNs) are probabilistic graphical models which represent the dependencies among a set of random variables in a chosen domain [6] [7]. A BN structure consists of two main components: a visible Directed Acyclic Graph (DAG) and a set of parameters. The DAG is defined as

Artificial Neural Network
Artificial neural networks (ANNs) are commonly known as biologically inspired analytical techniques, capable of predicting new observations from other observations after learning from existing data. ANNs are essentially data-driven black-box models that explore the relationships between input and output variables from historical data. They are virtual input-output devices that accept any number of numeric inputs and produce any number of numeric outputs. ANNs can solve highly non-linear, complex problems [10] [11] [12]. The multi-layer perceptron (MLP) with back-propagation is a popular and well-studied ANN model, known as a powerful function approximator for prediction and classification problems. In this paper, a three-layer feed-forward ANN is applied. The model has 10 input neurons in the input layer, multiple hidden neurons in the hidden layer, and two neurons in the output layer.
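As an illustration of the three-layer feed-forward structure described above, the following is a minimal sketch of a single forward pass in plain Python. It is not the model used in this paper: the hidden-layer size of 10, the tanh and softmax activations, the random weights, and the example input vector are all assumptions chosen for illustration, and no training (back-propagation) is shown.

```python
import math
import random


def init_layer(n_in, n_out, rng):
    """Return (weights, biases) for a dense layer with small random weights."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(inputs, weights, biases):
    """Weighted sum of the inputs plus bias, one value per output neuron."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def softmax(z):
    """Turn raw output scores into probabilities that sum to 1."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def forward(x, hidden, output):
    """Input layer -> tanh hidden layer -> softmax output layer."""
    h = [math.tanh(v) for v in dense(x, *hidden)]
    return softmax(dense(h, *output))


rng = random.Random(0)
hidden = init_layer(10, 10, rng)  # 10 inputs -> 10 hidden neurons (assumed size)
output = init_layer(10, 2, rng)   # hidden layer -> 2 output neurons
x = [rng.choice([1, 2, 3]) for _ in range(10)]  # hypothetical encoded household
probs = forward(x, hidden, output)
print(probs)
```

The two output values can be read as the estimated probabilities of the two loan demand classes for one household. With untrained random weights they carry no meaning beyond showing the shape of the computation.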

Logistic Regression
Logistic regression (LR) is a generalization of linear regression [

Classification, Prediction Results and Discussion
In this section, we first analyse the relationships among factors with BN, ANN and LR respectively. Comparing these outputs provides a classification of the factors from different perspectives. We then study the accuracy of each model on the testing data. The accuracy of a model reflects its authenticity and reliability when it is used on real problems. In the first part, we randomly select half of the data (2000 cases) as the training data set for building the classification model. The remaining data (2000 cases) are used as the testing data set to test each model and assess its accuracy.
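The random half-and-half division described above can be sketched as follows. This is only an illustration under stated assumptions: households are represented by hypothetical integer case identifiers 0 to 3999, and the seed value and function name are ours, not part of the original study.

```python
import random


def split_half(case_ids, seed=0):
    """Shuffle the case identifiers and cut them into two equal halves."""
    shuffled = list(case_ids)
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Hypothetical identifiers for the 4000 surveyed households.
train_ids, test_ids = split_half(range(4000))
print(len(train_ids), len(test_ids))
```

Fixing the seed makes the split reproducible, so that all three models can be trained and tested on exactly the same cases.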

BN Classification Results
We used a novel algorithm, ChainACO, for learning the BN topological graph. ChainACO was developed by Wu et al. [15] [16]. It has been tested as an efficient and cheap technique for BN structure learning, especially for large data sets. In this problem, we ran ChainACO 10 times and kept the structure with the best fitness score. The structure obtained is shown in Figure 1.
In addition, the BN model provides quantitative relationships between any variable of interest and its relevant variables. For example, Table 2 is the conditional probability distribution among the variables LoanDemand, Policy, Income, LandArea and WayofLoan. This table shows to what degree the basic information of each farmer restricts the possibility of applying for a loan from the financial institutions, and which combination of this basic information makes a farmer the most likely customer to apply for a loan. For instance, a farmer who knows the loan policy very well, has an income in level 1 and a land area in level 2, and takes the loan in way 1 has the highest probability of applying for a loan in the future (0.6190 in this case). In the BN model, we can obtain the conditional probability distribution of any factor given its corresponding factors. Table 3 is another example, for the factor Wayofloan and its relevant variables, Expend and Age. Figure 1 shows that the variables Expend and Age are parent factors of Wayofloan, so farmers with different combinations of Expend and Age have different attitudes to the credit loan.
BN provides a visual topological graph that indicates the underlying relationships among all factors of interest. The quantitative conditional probability distribution reveals the inherent probabilistic relationships among these factors.
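To make the idea of a conditional probability distribution for one factor given its parent factors concrete, the sketch below estimates such a table from discrete records by relative frequencies. This is an assumed, simplified estimator for illustration only, not the procedure used with ChainACO in this paper. The four records are invented, and only the variable names Wayofloan, Expend and Age come from the text.

```python
from collections import Counter, defaultdict


def conditional_probability_table(records, child, parents):
    """Estimate P(child | parents) by counting co-occurrences in the records."""
    counts = defaultdict(Counter)
    for record in records:
        parent_levels = tuple(record[p] for p in parents)
        counts[parent_levels][record[child]] += 1
    table = {}
    for parent_levels, child_counts in counts.items():
        total = sum(child_counts.values())
        table[parent_levels] = {level: n / total for level, n in child_counts.items()}
    return table


# Invented example records; levels are coded as small integers.
records = [
    {"Expend": 1, "Age": 1, "Wayofloan": 1},
    {"Expend": 1, "Age": 1, "Wayofloan": 1},
    {"Expend": 1, "Age": 1, "Wayofloan": 2},
    {"Expend": 2, "Age": 1, "Wayofloan": 2},
]
cpt = conditional_probability_table(records, "Wayofloan", ["Expend", "Age"])
print(cpt)
```

Each key of the resulting dictionary is one combination of parent levels, and each value is the estimated distribution of the child variable for that combination, which is the same layout as a row of a conditional probability table.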

Logistic Regression Results
We use a popular statistics tool, SPSS 21, to carry out the logistic regression analysis [17]. In this process, the factor LoanDemand is regarded as the dependent variable. The results are given in Table 4 (note to Table 4: the value is significant at the 0.01 level, 2-tailed).

Artificial Neural Network Classification on Rural Data
The ANN analysis was performed using SPSS 21. To build the structure of the ANN, the training data were randomly assigned to training (1398 cases; 69.5%) and testing (602 cases; 30.5%) datasets. The input layer consisted of ten input nodes, and the output layer had one node with two states (Loan Demand = 1 and Loan Demand = 2). After five rounds of debugging and testing, the hidden layer was set to 10 hidden nodes.
The first main result produced by this model is the importance of each input factor to the dependent variable, which is shown in Figure 2. This figure highlights three main variables: Land Area, WayofLoan, and Policy. These three variables are markedly more important than the other seven. The fourth most important variable is Income. Its impact is lower, only about 30% that of the leading factors. Variables whose normalized importance is less than 20% appear negligible when considering the impact on the loan demand of a specific farmer.
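The normalized importance mentioned above can be illustrated with a short sketch. It assumes the usual convention for such charts, in which each importance value is divided by the largest one and expressed as a percentage. The raw importance values below are hypothetical and are not taken from Figure 2.

```python
def normalized_importance(importance):
    """Scale each importance value by the largest one, as a percentage."""
    top = max(importance.values())
    return {name: 100.0 * (value / top) for name, value in importance.items()}


# Hypothetical raw importance values, for illustration only.
importance = {
    "LandArea": 0.30,
    "WayofLoan": 0.25,
    "Policy": 0.20,
    "Income": 0.09,
    "Edu": 0.04,
}
norm = normalized_importance(importance)
negligible = [name for name, pct in norm.items() if pct < 20.0]
print(norm)
print(negligible)
```

Under this convention the most important factor always scores 100%, and a threshold such as 20% separates the factors that can reasonably be ignored from the rest.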
The second output of interest is the correlation coefficient between the actual data and the estimated values, which the ANN model also produced.
Multi-factor analysis is a popular problem in rural finance. Comparing these results comprehensively can help us understand the substantive problems in rural finance. For instance, the data in this paper were collected from the less developed regions of western China. In these regions, the farmer's main income comes from the land they own: the more land they own, the higher their income from agriculture. The investment in a household's land therefore had a significantly positive effect on the credit loan. From the LR result, we can see that Land Area has the highest positive coefficient (4.074) for the factor Loan Demand. In the ANN model, Land Area has the most critical influence on Loan Demand. Educational background might be expected to be an important factor when applying for a loan; however, all the classification results show that it is a weaker factor in this problem. For instance, in BN, Edu is independent of all other factors; in LR, it is not included in the equation (the Sig. value is 0.859); and in ANN, its normalized importance is less than 20%. The explanation is that in rural areas with a lower level of educational development, farmers who intend to apply for a credit loan are driven by actual need rather than by their level of education. Synthesizing the outputs of the above methods thus yields useful conclusions for analysing and investigating the data collected in rural areas.
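The size of the Land Area coefficient reported above can be read more easily as an odds ratio. The sketch below assumes that 4.074 is the coefficient on the log-odds scale of a standard logistic model, in which case its exponential gives the multiplicative change in the odds of loan demand for a one-level increase in Land Area, with the other factors held fixed. The function names are ours.

```python
import math


def odds_ratio(coefficient):
    """Odds ratio implied by a log-odds coefficient in a logistic model."""
    return math.exp(coefficient)


def logistic(z):
    """Map a log-odds value z to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


# Coefficient for Land Area quoted in the text (assumed to be on the log-odds scale).
land_area_or = odds_ratio(4.074)
print(land_area_or)
```

Under this assumption the odds ratio is close to 59, which is consistent with Land Area being the dominant positive factor in the LR results.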

Measures for Classification and Prediction Results
Sensitivity, specificity, and accuracy are widely used statistics for describing a prediction and classification model. They quantify how good and reliable a classification is. In this rural financial problem, sensitivity evaluates how good the classification is at detecting a positive result. Specificity estimates how likely it is that a farmer who does not need a loan is correctly ruled out. Accuracy measures how correctly a classification identifies and excludes a given condition [18] [19] [20]. With TP, TN, FP and FN denoting the numbers of true positives, true negatives, false positives and false negatives, sensitivity, specificity, and accuracy are described as
$$\text{Sensitivity} = \frac{TP}{TP + FN}, \qquad \text{Specificity} = \frac{TN}{TN + FP}, \qquad \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}.$$
We apply the models constructed in the previous section to the testing data. The basic information about all these data is known; for example, we know the actual loan demand of each household. By comparing the predicted Loan Demand with the actual Loan Demand, we obtained TP, TN, FN and FP and calculated the sensitivity, specificity and accuracy. Table 5 gives the complete comparison of accuracy, sensitivity and specificity for the three methods. For each model, the detailed prediction results on the validation datasets are presented in the form of confusion matrices. The problem in this paper is a two-class prediction problem. In evaluating the performance of the three methods, we found that the ANN model had the highest accuracy (91.4%), better than the other two methods.
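The three measures can be computed directly from the counts TP, TN, FP and FN. The following sketch shows one way to do so, assuming that Loan Demand = 1 is treated as the positive class. That coding choice and the short label lists are hypothetical and are not the data of this paper.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count TP, TN, FP and FN for a two-class problem."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1
        elif a != positive and p != positive:
            tn += 1
        elif a != positive and p == positive:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def classification_metrics(tp, tn, fp, fn):
    """Sensitivity, specificity and accuracy from the four counts."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }


# Hypothetical actual and predicted Loan Demand labels (1 or 2).
actual = [1, 1, 1, 2, 2, 2, 2, 1]
predicted = [1, 1, 2, 2, 2, 1, 2, 1]
counts = confusion_counts(actual, predicted)
metrics = classification_metrics(*counts)
print(counts, metrics)
```

Applying the same two functions to the predictions of each model on the same testing households gives directly comparable values for the three methods.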

Conclusions
In this paper, we report a research effort in which we developed three prediction models for farmers' loan demand. Two of the models are from machine learning (BN, ANN) and one is from statistics (LR). The complexity, validity, and accuracy of the data directly determine the classification efficiency and prediction accuracy. BN, ANN and LR all require a large amount of training data. The accuracy obtained in this paper also still needs to be improved. A larger data set and improved Bayesian network or neural network models will be used to improve the accuracy in the future. Our ongoing research efforts are geared toward investigating larger data sets from western China and studying the properties of these methods.