The mortgage sector plays a pivotal role in the financial services industry and in the U.S. economy in general, with the Federal Reserve Bank of St. Louis reporting the liability level of one-to-four-family residential mortgages for households and nonprofit organizations at $10.8T in Q3 2020. Banks have a long-standing interest in knowing which factors most strongly predict mortgage default. Survival models are well suited to this task because they can use data from defaulted obligors as well as from non-defaulted obligors who are still making payments as of the sampling period cutoff date. Besides the Cox proportional hazard model and the accelerated failure time (AFT) model, this paper investigates two machine learning-based models: a random survival forest model and DeepSurv, a Cox proportional hazard neural network model. We compare the accuracy of covariate selection for the Cox, AFT, random survival forest, and DeepSurv models; this investigation is the first to use machine learning-based survival models for mortgage default prediction. The results show that the random survival forest achieves the most accurate and stable covariate selection, while DeepSurv achieves the highest accuracy of default prediction. Finally, the covariates selected by the models can be meaningful for mortgage programs throughout the banking industry.
Home building and sales are one of the economic engines driving the United States’ $21T economy, and the Federal Reserve Bank of St. Louis reports the liability level of one-to-four-family residential mortgages for households and nonprofit organizations at $10.8T in Q3 2020. Housing foreclosure and mortgage default were major drivers of the 2008 Great Recession. In 2020, by contrast, existing home sales are up, generating demand for mortgages even during the COVID-19 crisis, and are a bright spot in the US economy as shown below.
Two aspects matter when evaluating and interpreting predictions: 1) prediction accuracy, and 2) the rank of covariate importance.
Mortgage default data appears to present a binary classification problem, with an obligor either defaulting or not defaulting. A logistic regression model seems sufficient for this type of problem, with 1 indicating default and 0 indicating nondefault. However, the classification of 0 as nondefault is incomplete, since the status of the loan is unknown after the end of the sampling period. For example, consider a performing mortgage loan with a 30-year term that neither defaults nor pays off during the sampling period: its status from the end of the sampling period to the end of the 30-year term is unknown.
This incomplete data still carries information on default probability and should not be discarded. Survival analysis is a modelling methodology that can incorporate this type of incomplete data.
Applications of survival analysis abound in economics and other social science disciplines, ranging from unemployment analysis to the tendency of a convicted criminal to reoffend (recidivism). [
In this paper, several survival analysis methodologies will be compared in relation to their accuracy of default prediction and accuracy of covariate importance ranking.
Mortgage prediction has been examined by other researchers: [
Although these papers discussed covariate importance based on different models, several considerations deserve further investigation. First, these papers did not discuss model accuracy, which is a deficiency when discussing covariate importance. Second, most classification models did not utilize the incomplete data, which could be a large fraction of the total dataset.
Survival models incorporate and utilize incomplete data, and several papers have used the famous Cox Proportional Hazard model to examine mortgage default. First, [
With the growing popularity of machine learning, other survival models have attracted academic attention, such as the Random Survival Forest and deep learning-based survival models. Consequently, this paper investigates four survival models: the Cox Proportional Hazard model, the Accelerated Failure Time (AFT) model, the Random Survival Forest (RSF) model, and the deep learning-based DeepSurv model.
To compare the accuracy of the models, training and test data will be constructed and the C-index will serve as the accuracy metric. The C-index is used because conventional classification accuracy cannot be defined when some outcomes are incomplete. This paper will also generate covariate importance ranks for all the models, and construct a model-covariates matrix to identify the best model in terms of covariate importance.
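The C-index can be illustrated with a minimal sketch of Harrell's concordance index in plain Python. The function name and the toy loans are hypothetical, pairs with tied observed times are simply skipped, and this is not necessarily the implementation used in this paper.

```python
from itertools import combinations


def concordance_index(durations, events, risk_scores):
    """Share of comparable pairs that the risk scores order correctly.

    A pair is comparable when the loan with the shorter observed time
    actually defaulted (event == 1). A higher risk score is expected to
    go with an earlier default.
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(durations)), 2):
        # Order the pair so that `a` has the shorter observed time.
        a, b = (i, j) if durations[i] < durations[j] else (j, i)
        if durations[a] == durations[b] or not events[a]:
            continue  # tied times, or the shorter time is censored
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5  # tied scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data (hypothetical): months observed, default flag, model risk score.
durations = [5, 10, 12, 20]
events = [1, 0, 1, 0]
risk = [0.9, 0.3, 0.5, 0.1]
c_index = concordance_index(durations, events, risk)
```

Censored loans still contribute: a loan censored at month 20 is known to have outlived a loan that defaulted at month 5, so that pair can be scored even though the censored loan's final outcome is unknown.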
Finally, this paper is arranged as follows. Section 2 introduces survival theory and the models, including Cox, AFT, RSF and DeepSurv. Section 3 builds the covariate rankings with the Cox, AFT and RSF models. Section 4 uses the DeepSurv model to evaluate the effectiveness and accuracy of the covariate rankings generated by the three models. Section 5 presents the results, discussion and conclusion.
Censoring arises naturally in time-to-event data when the start or the end of an event is not precisely observed ( [
Survival analysis is a statistical data analytic technique for analyzing time-to-event data, and one fundamental relationship in survival analysis is the survival function. Assume T is a continuous random variable, then the probability of an individual surviving beyond time t can be defined as Equation (1).
$$S(t) = P(T \ge t) = \int_t^{\infty} f(x)\,dx = 1 - F(t) \tag{1}$$
where $f(t)$ is the probability density function of the event of interest happening at time $t$, and $F(t)$ is the cumulative probability that the event of interest has happened by time $t$. For example, in the mortgage default case above, the event of interest is mortgage default, and $S(t)$ is the probability that default has not happened by time $t$. Time $t$ is not an absolute time stamp, but a time period relative to the start of mortgage borrowing.
The other basic quantity is the hazard function. It is also known as the hazard rate, the instantaneous death rate, or the force of mortality. The hazard function can be expressed as in Equation (2).
$$\lambda(t) = \lim_{dt \to 0} \frac{P(t \le T \le t + dt \mid T \ge t)}{dt} = \frac{f(t)}{S(t)} \tag{2}$$
where $P(t \le T \le t + dt \mid T \ge t)$ expresses the conditional probability that the event of interest will happen in the time interval $dt$, given that it did not occur before.
Combining Equation (1) and Equation (2), Equation (3) can also be derived, which shows that survival and hazard function provide equivalent information.
$$S(t) = \exp\left(-\int_0^t \lambda(x)\,dx\right) \tag{3}$$
where $\int_0^t \lambda(x)\,dx$ is called the cumulative hazard, which every model uses to calculate the survival function, $S(t)$.
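The step from Equations (1) and (2) to Equation (3) can be spelled out as follows, assuming that the survival function is differentiable and that $S(0) = 1$ (both are assumptions made here for the sketch):

$$
\begin{aligned}
f(t) &= -\frac{dS(t)}{dt}
&&\Rightarrow\quad \lambda(t) = \frac{f(t)}{S(t)} = -\frac{d}{dt}\log S(t), \\
\int_0^t \lambda(x)\,dx &= -\log S(t) + \log S(0) = -\log S(t)
&&\Rightarrow\quad S(t) = \exp\left(-\int_0^t \lambda(x)\,dx\right).
\end{aligned}
$$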
In time-to-event data, because some outcomes are unknown, accuracy and area under the curve (AUC) cannot be used to evaluate the performance of a model. However, [
The dataset used in this paper has 50,000 U.S. mortgage borrowers (obligors), and is the dataset used by [
The four macroeconomic covariates are grouped, and their paired correlations calculated, as shown in
Covariate name | Description |
---|---|
Balance | Outstanding balance at observation time |
LTV | Loan-to-value ratio at observation time |
interest_rate | Interest rate at observation time |
hpi | House price index at observation time |
gdp | Gross domestic product (GDP) growth at observation time, in % |
uer | Unemployment rate at observation time, in % |
REtype_CO_orig | Real estate type condominium = 1, otherwise = 0 |
REtype_PU_orig | Real estate type planned urban development = 1, otherwise = 0 |
REtype_SF_orig | Single family home = 1, otherwise = 0 |
investor_orig | Investor borrower = 1, otherwise = 0 |
balance_orig | Outstanding balance at origination time |
FICO_orig | FICO score at origination time, in %* |
LTV_orig | Loan-to-value ratio at origination time, in % |
Interest_Rate_orig | Interest rate at origination time, in % |
hpi_orig | House price index at origination time, base year = 100 |
*FICO score is a credit score created by Fair Isaac Corporation.
sustained increase in economic activity increases employment, which in turn decreases the unemployment rate, uer, and presents a fairly strong negative correlation between gdp and uer in
Before fitting survival models, the data set is preprocessed in two steps. First, the origination date of each mortgage is moved to time 0. Second, only the last record of each mortgage is kept, and the time from the origination date to that last observation is computed. The default indicator takes the value 1 if the mortgage defaulted during the sampling window and 0 if it did not, that is, it survived and is censored. Finally, left censoring is avoided by assuming all loans start from the first observation.
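The two preprocessing steps can be sketched as below. This is a hedged illustration only: the field names `id`, `time`, `orig_time` and `default` are hypothetical stand-ins for whatever the panel data actually uses, and plain Python is used so the example runs without extra packages.

```python
def to_survival_records(panel_rows):
    """Collapse loan-period panel rows to one (duration, event) row per loan.

    Each input row is a dict with hypothetical keys:
      id        - loan identifier
      time      - observation period
      orig_time - origination period of the loan
      default   - 1 if the loan is in default at this observation, else 0
    """
    last_row = {}
    for row in panel_rows:
        loan = row["id"]
        # Keep only the most recent observation of each loan.
        if loan not in last_row or row["time"] > last_row[loan]["time"]:
            last_row[loan] = row

    records = []
    for loan in sorted(last_row):
        row = last_row[loan]
        records.append({
            "id": loan,
            # Shift origination to time 0: duration is time since origination.
            "duration": row["time"] - row["orig_time"],
            # 1 = default observed, 0 = censored at the last observation.
            "event": 1 if row["default"] == 1 else 0,
        })
    return records


panel = [
    {"id": 1, "time": 25, "orig_time": 20, "default": 0},
    {"id": 1, "time": 26, "orig_time": 20, "default": 0},
    {"id": 1, "time": 27, "orig_time": 20, "default": 1},
    {"id": 2, "time": 30, "orig_time": 28, "default": 0},
    {"id": 2, "time": 31, "orig_time": 28, "default": 0},
]
survival_rows = to_survival_records(panel)
```

In this toy panel, loan 1 ends with a duration of 7 periods and an observed default, while loan 2 is censored after 3 periods.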
Before fitting any survival model, it is standard practice to generate Kaplan-Meier survival curves to explore the univariate impact of each covariate on survival. Since most of the covariates in the mortgage data set are continuous, dummy variables are generated as follows: for any continuous covariate, a value larger than the mean of the covariate is labeled 1, otherwise it is labeled 0.
In each survival plot, if the two curves overlap, the value of that covariate does not matter to the survival time of the mortgage; if the two curves separate, the covariate impacts survival time.
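The dichotomize-then-compare procedure can be sketched with a small product-limit (Kaplan-Meier) estimator. The helper names and the toy values are hypothetical; a survival library would normally be used instead of this hand-rolled version.

```python
def split_at_mean(values):
    """Dummy-code a continuous covariate: 1 if above its mean, else 0."""
    mean = sum(values) / len(values)
    return [1 if v > mean else 0 for v in values]


def kaplan_meier(durations, events):
    """Return [(t, S(t))] at each distinct event time (product-limit estimate)."""
    curve = []
    survival = 1.0
    event_times = sorted({d for d, e in zip(durations, events) if e})
    for t in event_times:
        at_risk = sum(1 for d in durations if d >= t)
        failures = sum(1 for d, e in zip(durations, events) if d == t and e)
        survival *= 1.0 - failures / at_risk
        curve.append((t, survival))
    return curve


def curves_by_group(covariate, durations, events):
    """One Kaplan-Meier curve per dummy level (0 = at or below mean, 1 = above)."""
    group = split_at_mean(covariate)
    curves = {}
    for level in (0, 1):
        d = [x for x, g in zip(durations, group) if g == level]
        e = [x for x, g in zip(events, group) if g == level]
        curves[level] = kaplan_meier(d, e)
    return curves


# Toy example (hypothetical values): one covariate, durations, default flags.
ltv = [60, 65, 70, 90, 95, 98]
durations = [40, 35, 38, 10, 12, 30]
events = [0, 1, 0, 1, 1, 0]
curves = curves_by_group(ltv, durations, events)
```

Plotting the two step functions in `curves` against time gives the kind of paired curves described above: the further apart they sit, the more the dummy-coded covariate appears to matter for survival time.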
lower red survival curve indicating shorter survival time, and the other univariates that show separation follow a similar logic. Although the Kaplan-Meier curve is visually straightforward, it has a drawback: it does not detect correlations.
In D.R. Cox’s famous 1972 paper [
$$\lambda(t \mid X_i) = \lambda_0(t)\exp(X_i \cdot \beta) \tag{4}$$
where $X_i = \{X_{i1}, X_{i2}, \cdots, X_{in}\}$ are the covariate values of object $i$. The Cox model captures the effect of covariates on the hazard rate, $\lambda(t)$, by multiplying the baseline hazard rate $\lambda_0(t)$, which changes with time, by an exponentiated linear combination of the covariates. The model implies that the effect of the covariates on the hazard rate does not change over time. The Cox model is called a proportional hazards model because the ratio of the hazard rate of one object, $X_i$, to that of another object, $X_j$, is a constant.
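The proportionality property follows directly from Equation (4): the baseline hazard cancels in the ratio, leaving a quantity that does not depend on $t$.

$$
\frac{\lambda(t \mid X_i)}{\lambda(t \mid X_j)}
= \frac{\lambda_0(t)\exp(X_i \cdot \beta)}{\lambda_0(t)\exp(X_j \cdot \beta)}
= \exp\big((X_i - X_j) \cdot \beta\big)
$$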
L1-regularized generalized linear model (LASSO regression) was introduced by Tibshirani ( [
$$\hat{\beta}(\lambda) = \arg\min_{\beta}\left[-\log L(X; \beta) + \lambda\,|\beta|\right] \tag{5}$$
Lasso regression has the quality of shrinking and selecting covariates, and Tibshirani ( [
The five covariates are listed in
Variable | interest_rate | LTV | gdp | hpi_orig | FICO_orig |
---|---|---|---|---|---|
Coefficient | 0.35 | 0.26 | −0.25 | 0.17 | −0.01 |
$$S(t \mid X_i) = \left[S_0(t)\right]^{\exp(\beta \cdot X_i)} \tag{6}$$
From Equation (6) it can be concluded that when a coefficient is 0, the covariate has no impact on the survival function; when the coefficient is larger than 0, it reduces survival time; and when the coefficient is negative, it increases survival time. This explains why the coefficients of gdp and FICO_orig are smaller than 0: a higher gdp growth rate and larger FICO scores have a positive impact on survival time. The coefficients for interest_rate, LTV and hpi_orig are positive, since higher values of those risk drivers indicate the possibility of a shorter survival time. The interest rate is a measure of default risk: the higher the interest rate, the higher the risk of default. For LTV, the larger the loan in relation to the value of the property, the higher the risk. For hpi_orig, the higher the house price at origination, the higher the mortgage payment, and the more difficult it is for the obligor to make larger payments over the business cycle. Given that all the covariates are standardized to the same magnitude, the absolute value of a coefficient reflects the extent to which the covariate alters survival time and the survival function.
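The sign logic of Equation (6) can be checked numerically. The sketch below uses the five coefficients reported in the table above; the baseline survival value of 0.9 and the one-standard-deviation shifts are hypothetical choices made only for illustration.

```python
import math

# Lasso Cox coefficients as reported in the table above (standardized covariates).
COX_COEFFICIENTS = {
    "interest_rate": 0.35,
    "LTV": 0.26,
    "gdp": -0.25,
    "hpi_orig": 0.17,
    "FICO_orig": -0.01,
}


def survival_given_covariates(baseline_survival, standardized_x,
                              coefficients=COX_COEFFICIENTS):
    """Equation (6): S(t | x) = S0(t) ** exp(beta . x).

    `standardized_x` maps covariate names to standardized values; covariates
    that are not listed are treated as sitting at their mean (value 0).
    """
    linear_predictor = sum(coefficients[name] * value
                           for name, value in standardized_x.items())
    return baseline_survival ** math.exp(linear_predictor)


s0 = 0.9  # hypothetical baseline survival probability at some time t
s_high_rate = survival_given_covariates(s0, {"interest_rate": 1.0})
s_high_gdp = survival_given_covariates(s0, {"gdp": 1.0})
```

A positive coefficient pushes the survival probability below the baseline (`s_high_rate < s0`), and a negative coefficient lifts it above (`s_high_gdp > s0`), which matches the interpretation given in the text.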
Like the Cox model, the accelerated failure time (AFT) model is a linear model. An L1-regularized (Lasso penalty) AFT model will be employed with this data to choose the five most significant covariates that drive failure time. There are several parametric AFT models; the Weibull AFT model is the most popular, since it has characteristics of both a proportional hazard model and an accelerated failure time model. Equation (7) shows the Weibull AFT model.
$$\log(T_i) = X_i^{\top}\beta + \sigma\varepsilon_i \tag{7}$$
In Equation (7), $\varepsilon_i$ is an i.i.d. random variable that follows the log-Weibull distribution and $\sigma$ is a scale parameter. Since the Weibull AFT is a parametric AFT model, the expected survival time can be derived as Equation (8), which gives a clear indication of how covariates impact survival time ( [
$$E(T) = \exp(X^{\top}\beta)\,\Gamma(\sigma + 1) \tag{8}$$
The lifelines Python package is used in this section and, as with the Cox model, all the covariates are standardized before applying the AFT model. From Equation (8), when a covariate’s coefficient is 0, the covariate has no impact on survival time. When the coefficient is larger than 0, the covariate has a positive impact on survival time. The coefficients from the AFT model are therefore usually opposite in sign to those from the Cox model, as shown in
The covariates selected with the AFT model are consistent with the Cox model, and the signs of the coefficients are opposite of the Cox model, which confirms the theoretical analysis.
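The sign reversal between the two models can be illustrated with a short sketch of Equation (8). It also uses a standard property of the Weibull model that is not stated in the text above: a Weibull AFT coefficient corresponds to a proportional hazards coefficient of the AFT coefficient times $-1/\sigma$. The coefficient values and function names below are hypothetical.

```python
import math


def weibull_aft_expected_time(x, beta, sigma):
    """Equation (8): E(T) = exp(x . beta) * Gamma(sigma + 1)."""
    linear_predictor = sum(b * v for b, v in zip(beta, x))
    return math.exp(linear_predictor) * math.gamma(sigma + 1.0)


def weibull_aft_to_ph(beta, sigma):
    """Weibull only: equivalent proportional hazards coefficients, -beta / sigma."""
    return [-b / sigma for b in beta]


# Hypothetical AFT coefficients for two standardized covariates.
beta_aft = [-0.30, 0.20]
sigma = 0.8

baseline = weibull_aft_expected_time([0.0, 0.0], beta_aft, sigma)
shorter = weibull_aft_expected_time([1.0, 0.0], beta_aft, sigma)  # negative AFT coefficient
longer = weibull_aft_expected_time([0.0, 1.0], beta_aft, sigma)   # positive AFT coefficient
beta_ph = weibull_aft_to_ph(beta_aft, sigma)
```

A negative AFT coefficient shortens the expected survival time and maps to a positive hazard coefficient, and vice versa, which is the pattern the two coefficient tables are expected to show.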
The random survival forest (RSF) model derives from the Random forest model of Breiman ( [
Like Random forests, RSF models produce hundreds of decision trees based on a splitting rule, most commonly the log-rank statistic. For each tree, a subset of the covariates is selected randomly, with the subset size based on the square root of p, where p is the number of covariates. A covariate is then chosen recursively, and its splitting value determined, so that the left node and the right node of the tree have the maximum difference in the log-rank statistic ( [
Covariate ranking in RSF is similar to that of Random forest: it calculates the drop in prediction accuracy on the test data when the selected covariate is excluded. Since RSF is an ensemble algorithm, there are efficient ways to implement this process ( [
Note that gdp, LTV, uer, interest_rate and hpi_orig are the top five covariates as in the other rankings above.
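The log-rank splitting rule used to grow RSF trees can be sketched for a single covariate as below. This is an illustrative, unoptimized version with hypothetical function names and toy data; RSF libraries implement the same idea far more efficiently and over random covariate subsets.

```python
def log_rank_statistic(durations, events, in_left):
    """Two-sample log-rank chi-square statistic for a candidate split."""
    observed_minus_expected = 0.0
    variance = 0.0
    event_times = sorted({d for d, e in zip(durations, events) if e})
    for t in event_times:
        n = sum(1 for d in durations if d >= t)  # at risk overall
        n_left = sum(1 for d, g in zip(durations, in_left) if d >= t and g)
        d_all = sum(1 for d, e in zip(durations, events) if d == t and e)
        d_left = sum(1 for d, e, g in zip(durations, events, in_left)
                     if d == t and e and g)
        # Observed events in the left node minus those expected under no difference.
        observed_minus_expected += d_left - d_all * n_left / n
        if n > 1:
            share = n_left / n
            variance += d_all * share * (1.0 - share) * (n - d_all) / (n - 1)
    if variance == 0.0:
        return 0.0
    return observed_minus_expected ** 2 / variance


def best_split(values, durations, events):
    """Return (statistic, threshold) maximizing the log-rank statistic."""
    best = (0.0, None)
    for threshold in sorted(set(values))[:-1]:
        in_left = [v <= threshold for v in values]
        stat = log_rank_statistic(durations, events, in_left)
        if stat > best[0]:
            best = (stat, threshold)
    return best


# Toy node (hypothetical): low covariate values survive long, high values default early.
values = [1, 2, 3, 10, 11, 12]
durations = [20, 22, 25, 3, 4, 5]
events = [1, 1, 1, 1, 1, 1]
statistic, threshold = best_split(values, durations, events)
```

On this toy node the chosen threshold separates the three long-surviving loans from the three early defaults, which is the split with the largest difference in survival between the two child nodes.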
DeepSurv presents as a deep learning algorithm based on a Cox model ( [
Variable | interest_rate | gdp | hpi_orig | LTV | FICO_orig |
---|---|---|---|---|---|
Coefficient | −0.29 | 0.23 | −0.20 | −0.18 | 0.04 |
deep learning neural network algorithms, DeepSurv also uses alternating fully connected layers and dropout layers to avoid overfitting. DeepSurv also uses a scaled Exponential Linear Unit (SELU) as the activation function with a hazard function output, and the loss function is the average negative log partial likelihood with regularization ( [
DeepSurv is a multi-layer perceptron similar to the Faraggi-Simon network. However, we allow a deep architecture (i.e., more than one hidden layer) and apply modern techniques such as weight decay regularization, Rectified Linear Units (ReLU) … Batch Normalization … dropout … stochastic gradient descent with Nesterov momentum … gradient clipping … and learning rate scheduling … The output of the network is a single node, which estimates the risk function $\hat{h}_\theta(x)$ parameterized by the weights of the network [
As seen above, DeepSurv is a highly flexible model. This flexibility comes in part from modifying the basic gradient descent algorithm into a more adaptable method and, as noted, from allowing more neural network layers, which introduces more parameters within the hidden layer framework ( [
The default structure of the DeepSurv neural network is employed: two hidden layers, each with thirty-two nodes, a ReLU activation function, batch normalization, and 10% dropout. The data is split into 80% training data and 20% test data. Training data is used to fit the model, and test data is used for model evaluation, with the C-index as the metric for determining the best model fit. Finally, all the models were given a random state, so the results are repeatable.
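The loss that DeepSurv minimizes, the average negative log partial likelihood, can be sketched without a deep learning framework. In the plain-Python version below, `log_risks` stands in for the network's single output node evaluated on each loan; the regularization term is omitted, tied event times are handled in the simplest way, and the function name is hypothetical.

```python
import math


def neg_log_partial_likelihood(log_risks, durations, events):
    """Average negative log Cox partial likelihood over observed events.

    For each loan i with an observed default, the risk set contains every
    loan j still under observation at that time (durations[j] >= durations[i]).
    """
    total = 0.0
    n_events = 0
    for i, (t_i, e_i) in enumerate(zip(durations, events)):
        if not e_i:
            continue  # censored loans only appear inside risk sets
        risk_set_sum = sum(math.exp(h)
                           for h, t in zip(log_risks, durations) if t >= t_i)
        total += log_risks[i] - math.log(risk_set_sum)
        n_events += 1
    if n_events == 0:
        raise ValueError("at least one observed event is required")
    return -total / n_events


durations = [1, 2, 3]
events = [1, 1, 1]
good_order = neg_log_partial_likelihood([2.0, 1.0, 0.0], durations, events)
bad_order = neg_log_partial_likelihood([0.0, 1.0, 2.0], durations, events)
```

Scores that rank early defaulters as riskier give a lower loss (`good_order < bad_order`), which is the signal gradient descent uses to adjust the network weights.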
Unlike the Cox model, which can identify the coefficients of covariates, DeepSurv is a black box model and is consequently not the optimal choice for coefficient explanation or selection. However, as with neural networks in general, DeepSurv is an excellent prediction model. Kim ( [
Next, DeepSurv is used as a tool to compare and evaluate the covariate rankings obtained from the other models. Each ranking is evaluated at five levels: first the top covariate, then the top 2 covariates, and so on until the top 5 covariates are evaluated.
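The nested top-k evaluation can be sketched as a small loop. The `fit_and_score` callback is a hypothetical placeholder: in the setting described here it would train DeepSurv on the given covariates and return the test C-index, but any function from a covariate list to a score works. The ranking order used in the example is assumed from the order in which the RSF covariates are listed in the text.

```python
def evaluate_top_k(ranking, fit_and_score, max_k=5):
    """Score the top 1, top 2, ..., top max_k covariates of a ranking.

    `ranking` is a list of covariate names, most important first.
    `fit_and_score` is a caller-supplied function (hypothetical here) that
    fits a model on the given covariates and returns an evaluation score.
    """
    results = {}
    for k in range(1, min(max_k, len(ranking)) + 1):
        subset = list(ranking[:k])
        results[k] = fit_and_score(subset)
    return results


# Assumed RSF ordering; the dummy scorer just records which covariates it was given.
rsf_ranking = ["gdp", "LTV", "uer", "interest_rate", "hpi_orig"]
subsets_used = evaluate_top_k(rsf_ranking, fit_and_score=lambda cols: tuple(cols))
```

Running the same loop once per ranking (Cox, AFT, RSF) yields one row of scores per model, which is the shape of the model-by-top-k comparison table.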
Finally, in Section 3.2 and Section 3.3, the choice of a 5-covariate model with λ = 0.05 is supported by the conclusions gleaned from
Model | Cox | AFT | RSF | DeepSurv |
---|---|---|---|---|
C-index | 0.798 | 0.789 | 0.799 | 0.928 |
Model | Top 1 | Top 2 | Top 3 | Top 4 | Top 5 |
---|---|---|---|---|---|
Cox | 0.688 | 0.768 | 0.844 | 0.859 | 0.865 |
AFT | 0.687 | 0.776 | 0.837 | 0.867 | 0.865 |
RSF | 0.724 | 0.805 | 0.881 | 0.890 | 0.915 |
Determining the probability of mortgage default is a critical part of a bank’s risk assessment profile, affecting originations, relationship management, and loss reserves. Consequently, determining the best modeling algorithm is also critical to a bank’s overall financial strength. Public mortgage data with 15 covariates and a binary variable indicating default or nondefault were procured, organized, and analyzed to determine the covariate selection and ranking capability of several widely used and studied survival models. The aim was not to search all the variations of survival models, but to demonstrate the capability of survival models to enhance the understanding of mortgage default through the selection of a judicious set of covariates that explain default and enhance senior management’s understanding of an obligor’s potential for default. Results from a Kaplan-Meier analysis and Cox proportional hazard Lasso regression show that interest_rate, LTV, gdp, hpi_orig, and FICO_orig are highly effective explanatory variables for mortgage default.
Further analysis shows that DeepSurv achieves far better prediction accuracy than the other models in this study. Using the C-index as the measure of goodness-of-fit for the Cox, AFT, and RSF models, the RSF model achieves the best goodness-of-fit ranking. Among all 15 covariates, the RSF model picked 5 covariates which can successfully predict mortgage default. The top 2 covariates chosen are gdp growth rate and the loan-to-value ratio, and this result is consistent with findings from the literature ( [
The authors declare no conflicts of interest regarding the publication of this paper.
Zhang, D.F., Bhandari, B. and Black, D. (2021) Covariate Selection for Mortgage Default Analysis Using Survival Models. Journal of Mathematical Finance, 11, 218-233. https://doi.org/10.4236/jmf.2021.112012