Properties of the Maximum Likelihood Estimates and Bias Reduction for Logistic Regression Model

Abstract

A frequent problem in estimating logistic regression models is failure of the likelihood maximization algorithm to converge. Although bias correction for maximum likelihood estimates of the logistic regression parameters is popular and extremely well established, the behaviour and properties of the maximum likelihood method itself are less investigated. The main aim of this paper is to examine the behaviour and properties of the parameter estimates under a bias reduction technique. We focus on a method that uses a modified score function to reduce the bias of the maximum likelihood estimates. We also present new and interesting examples based on simulated data with different sample sizes and percentages of the probability of the outcome variable.


1. Introduction

Logistic regression methods are often used in the statistical analysis of dichotomous outcome variables. Logistic regression is a commonly applied procedure for describing the relationship between a binary outcome variable and a set of covariates. The general method of estimating the logistic regression parameters is maximum likelihood (ML). In a very general sense, the ML method yields values of the unknown parameters that maximize the probability of the observed set of data. A common problem with the ML method is failure to converge, which occurs when the maximum likelihood estimates (MLE) do not exist. Assessing the behaviour of the MLE for the logistic regression model is important, as the logistic model is widely used in medical statistics. Much of the work on the logistic regression model addresses the convergence problem, e.g. [1], or bias reduction, e.g. [2] [3]. Further assumptions and details concerning the distribution of the coefficients estimated by the MLE approach and by the bias reduction technique, as well as applications and the effects of sample size, can be found in [4] [5]. However, the behaviour and properties of bias correction methods are less investigated. A recent line of work uses the bias correction technique proposed by [2] to guarantee that the MLE exists. The present paper centres on evaluating the behaviour and properties of the bias reduction method using data simulated with different sample sizes and parameter values. The next section presents and fits the logistic regression model. Section 3 discusses the ML convergence problem. The application of the modified score function to the logistic regression model is discussed in Section 4, which also illustrates a special case of the modified function, giving two equations that are used to estimate the parameters. Section 5 investigates the asymptotic properties of the logistic regression model, comparing parameters estimated by the ML method and by the reduction technique on simulated data. The discussion, conclusion and some general remarks about the results are given in Section 6.

2. The Logistic Regression Model

The goal of a logistic regression analysis is to find the best fitting model to describe the relationship between an outcome and covariates where the outcome is dichotomous. [6] considered the logistic regression model as a member of the class of generalized linear models. For more details of the logistic model see [7] [8] [9] and also [10] [11] [12].

The Model

Suppose now $y_i \sim \mathrm{Binomial}(m_i, \pi_i)$, where $y_i$, $i = 1, \ldots, n$, is a response variable. Suppose that $y_i \in \{0, 1, \ldots, m_i\}$ and that $\pi_i$ is related to a collection of covariates $(x_{i1}, x_{i2}, \ldots, x_{ip})$ according to the equation

$$\log\left(\frac{\pi_i}{1-\pi_i}\right) = \sum_{j=1}^{p} \beta_j x_{ij} = \beta^{T} x_i \qquad (1)$$

We consider the special case $m_i = 1$, so $y_i \sim \mathrm{Binomial}(1, \pi_i)$, where $\pi_i$ is the probability of success for each $i = 1, \ldots, n$. We also define $\eta_i = \beta^{T} x_i$ so that

$$g(\pi_i) = \log\left(\frac{\pi_i}{1-\pi_i}\right) = \eta_i \qquad (2)$$

and

$$\pi_i = \frac{\exp(\eta_i)}{1 + \exp(\eta_i)} \qquad (3)$$

Here $g(\cdot)$ is called the logit link function and $\eta_i = \sum_{j=1}^{p} \beta_j x_{ij} = \beta^{T} x_i$ is the linear predictor.

Other link functions can be used instead of the logit link, such as the probit link function

$$\eta = \Phi^{-1}(\pi), \qquad \pi = \Phi(\eta)$$

and the complementary log-log link function

$$\eta = \log(-\log(1-\pi)), \qquad \pi = 1 - \exp(-e^{\eta})$$
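To make the three link functions concrete, the short sketch below evaluates each inverse link on a grid of linear predictor values. This is illustrative code of ours, not part of the original analysis; the helper names are assumptions, and only numpy and scipy are required.

```python
import numpy as np
from scipy.stats import norm

# Inverse link functions: each maps a linear predictor eta to a probability pi.
def logit_inv(eta):
    return np.exp(eta) / (1.0 + np.exp(eta))      # Equation (3)

def probit_inv(eta):
    return norm.cdf(eta)                          # pi = Phi(eta)

def cloglog_inv(eta):
    return 1.0 - np.exp(-np.exp(eta))             # pi = 1 - exp(-e^eta)

eta = np.linspace(-3.0, 3.0, 7)
for f in (logit_inv, probit_inv, cloglog_inv):
    print(f.__name__, np.round(f(eta), 4))
```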

Fitting the Model

The logistic model with $y_i \sim \mathrm{Binomial}(m_i, \pi_i)$ and $m_i = 1$ can be fitted using the method of maximum likelihood to estimate the parameters. The first step is to construct the likelihood function, which is a function of the unknown parameters; we then choose those values of the parameters that maximize this function. The probability function of the model is

$$f(y_i) = \pi_i^{y_i} (1 - \pi_i)^{1 - y_i} \qquad (4)$$

and the likelihood function is

$$L(\pi \mid y) = \prod_{i=1}^{n} f(y_i \mid \pi_i) \qquad (5)$$

Since the observations are independent, the likelihood function is as follows:

$$L(\pi \mid y) = \prod_{i=1}^{n} \pi_i^{y_i} (1 - \pi_i)^{1 - y_i} \qquad (6)$$

The maximum likelihood estimate of $\beta$ is the value which maximizes the likelihood function. In general the log-likelihood function is easier to work with mathematically; it is

$$l = \sum_{i=1}^{n} \left[ y_i \log(\pi_i) + (1 - y_i) \log(1 - \pi_i) \right] \qquad (7)$$
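As a small illustration of Equation (7), the function below evaluates the Bernoulli log-likelihood for given responses and fitted probabilities. It is a minimal sketch of ours, assuming numpy; it is not code from the paper.

```python
import numpy as np

def log_likelihood(y, pi):
    """Bernoulli log-likelihood, Equation (7)."""
    return np.sum(y * np.log(pi) + (1.0 - y) * np.log(1.0 - pi))

y = np.array([1.0, 0.0, 0.0, 1.0])
pi = np.array([0.7, 0.2, 0.4, 0.6])
print(log_likelihood(y, pi))   # larger (less negative) values indicate better fit
```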

2.1. Special Case of the Logistic Model with Two Covariates

In this case we consider the logistic regression model with two covariates, so $p = 2$, with one covariate being the general mean (intercept). We therefore have $\beta_0$ and $\beta_1$ such that

$$g(\pi_i) = \eta_i = \beta_0 + \beta_1 x_i \qquad (8)$$

where x i is now a scalar covariate and

$$\pi_i = \frac{\exp(\beta_0 + \beta_1 x_i)}{1 + \exp(\beta_0 + \beta_1 x_i)} \qquad (9)$$

Therefore we can write the log-likelihood function as:

$$l(\beta_0, \beta_1) = \sum_{i=1}^{n} \left\{ y_i (\beta_0 + \beta_1 x_i) - \log\left[ 1 + \exp(\beta_0 + \beta_1 x_i) \right] \right\} \qquad (10)$$

To estimate the values of $\beta_0$ and $\beta_1$ we differentiate $l(\beta_0, \beta_1)$ with respect to $\beta_0$ and $\beta_1$ respectively:

$$\frac{\partial l}{\partial \beta_0} = \sum_{i=1}^{n} \left[ y_i - \frac{\exp(\beta_0 + \beta_1 x_i)}{1 + \exp(\beta_0 + \beta_1 x_i)} \right] = \sum_{i=1}^{n} (y_i - \pi_i) \qquad (11)$$

$$\frac{\partial l}{\partial \beta_1} = \sum_{i=1}^{n} \left[ y_i x_i - \frac{x_i \exp(\beta_0 + \beta_1 x_i)}{1 + \exp(\beta_0 + \beta_1 x_i)} \right] = \sum_{i=1}^{n} (y_i - \pi_i) x_i \qquad (12)$$

Now we set $\partial l / \partial \beta_0 = 0$ and $\partial l / \partial \beta_1 = 0$, so the maximum likelihood estimates of $\beta_0$ and $\beta_1$ are the solutions of the following equations

$$\sum_{i=1}^{n} y_i = \sum_{i=1}^{n} \pi_i \qquad (13)$$

and

$$\sum_{i=1}^{n} y_i x_i = \sum_{i=1}^{n} \pi_i x_i \qquad (14)$$

and will be denoted $\hat{\beta}_0$ and $\hat{\beta}_1$. For logistic regression these two equations are nonlinear in $\beta_0$ and $\beta_1$, and a numerical method is needed for their solution, such as the Newton-Raphson method.
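The sketch below shows one way the Newton-Raphson iteration for Equations (13)-(14) might be written. It is our own illustrative implementation in Python with numpy, not code from the paper; the function name, tolerance and starting value are assumptions.

```python
import numpy as np

def logistic_mle(x, y, tol=1e-8, max_iter=50):
    """Newton-Raphson solution of the score equations (13)-(14)."""
    X = np.column_stack([np.ones_like(x), x])    # design matrix with intercept
    beta = np.zeros(2)                           # starting value (assumed)
    for _ in range(max_iter):
        pi = 1.0 / (1.0 + np.exp(-(X @ beta)))   # Equation (9)
        score = X.T @ (y - pi)                   # Equations (11)-(12)
        w = pi * (1.0 - pi)
        info = X.T @ (w[:, None] * X)            # Fisher information, Equation (16)
        step = np.linalg.solve(info, score)
        beta += step
        if np.max(np.abs(step)) < tol:
            return beta
    raise RuntimeError("Newton-Raphson did not converge")

rng = np.random.default_rng(1)
x = rng.normal(size=100)
pi_true = 1.0 / (1.0 + np.exp(-(-0.5 + 0.2 * x)))
y = (rng.uniform(size=100) < pi_true).astype(float)
print(logistic_mle(x, y))   # estimates of (beta0, beta1)
```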

2.2. The Asymptotic Distribution of the (MLE)

The estimated parameters $\hat{\beta} = (\hat{\beta}_0, \hat{\beta}_1)$ have an asymptotic distribution given by $\hat{\beta} \sim N(\beta, I(\beta)^{-1})$, where $I(\beta)$ is Fisher's information matrix, defined as

$$I(\beta) = -E \begin{pmatrix} \dfrac{\partial^2 l}{\partial \beta_0^2} & \dfrac{\partial^2 l}{\partial \beta_0 \partial \beta_1} \\ \dfrac{\partial^2 l}{\partial \beta_1 \partial \beta_0} & \dfrac{\partial^2 l}{\partial \beta_1^2} \end{pmatrix} \qquad (15)$$

where the matrix is evaluated at the MLE. For logistic regression the estimated Fisher information matrix can be written as

$$I(\hat{\beta}) = \begin{pmatrix} \sum_{i=1}^{n} \hat{\pi}_i (1 - \hat{\pi}_i) & \sum_{i=1}^{n} x_i \hat{\pi}_i (1 - \hat{\pi}_i) \\ \sum_{i=1}^{n} x_i \hat{\pi}_i (1 - \hat{\pi}_i) & \sum_{i=1}^{n} x_i^2 \hat{\pi}_i (1 - \hat{\pi}_i) \end{pmatrix} \qquad (16)$$

where $\hat{\pi}_i = \frac{\exp(\hat{\eta}_i)}{1 + \exp(\hat{\eta}_i)}$ and $\hat{\eta}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$. The variance of $\hat{\beta}$ is approximated by $\mathrm{Var}(\hat{\beta}) \approx I(\hat{\beta})^{-1}$.
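A short sketch of this variance approximation, again ours and assuming numpy, builds the matrix in Equation (16) at the fitted values and inverts it; the square roots of the diagonal entries are the usual standard errors.

```python
import numpy as np

def mle_variance(x, beta_hat):
    """Approximate Var(beta_hat) = I(beta_hat)^{-1}, Equation (16)."""
    X = np.column_stack([np.ones_like(x), x])
    pi_hat = 1.0 / (1.0 + np.exp(-(X @ beta_hat)))
    w = pi_hat * (1.0 - pi_hat)
    info = X.T @ (w[:, None] * X)          # matrix in Equation (16)
    return np.linalg.inv(info)             # asymptotic covariance matrix

# Standard errors from the diagonal, e.g. after fitting with logistic_mle:
# se = np.sqrt(np.diag(mle_variance(x, beta_hat)))
```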

3. Maximum Likelihood Convergence Problems

A problem occurs in estimating logistic regression models when the maximum likelihood estimates do not exist and one or more components of $\hat{\beta}$ are infinite. One case in which this problem occurs is when all of the observations have the same response. For example, suppose that $m_i = 1$ and that all of the response variables equal zero, i.e. $\sum_{i=1}^{n} y_i = 0$. In this case the log-likelihood function is

$$l(\beta_0, \beta_1) = -\sum_{i=1}^{n} \log\left[ 1 + \exp(\beta_0 + \beta_1 x_i) \right] \qquad (17)$$

Now differentiating $l(\beta_0, \beta_1)$ with respect to $\beta_0$ and $\beta_1$ respectively and setting the derivatives equal to zero gives

$$\sum_{i=1}^{n} \pi_i = \sum_{i=1}^{n} \frac{\exp(\hat{\beta}_0 + \hat{\beta}_1 x_i)}{1 + \exp(\hat{\beta}_0 + \hat{\beta}_1 x_i)} = 0 \qquad (18)$$

and

$$\sum_{i=1}^{n} \pi_i x_i = \sum_{i=1}^{n} \frac{x_i \exp(\hat{\beta}_0 + \hat{\beta}_1 x_i)}{1 + \exp(\hat{\beta}_0 + \hat{\beta}_1 x_i)} = 0 \qquad (19)$$

The first equation has no solution because it is a sum of positive quantities and so cannot equal zero. To drive this sum towards zero we would need to make $\beta_0$ ever larger and negative, i.e. tend to $-\infty$. However, if precisely one of the response variables equals 1, the resulting maximum likelihood equations become

$$\sum_{i=1}^{n} \pi_i = 1 \qquad (20)$$

$$\sum_{i=1}^{n} \pi_i x_i = x_1 \qquad (21)$$

where we have assumed the observations are numbered such that $y_1 = 1$. Here the maximum likelihood estimates exist and convergence of the MLE is achieved, because the two equations set sums of positive quantities equal to positive values. Consider the first equation: if $\beta_0$ is large and positive then the sum is larger than one, while if it is large and negative then the sum is smaller than one; hence finite estimates of the parameters satisfying the equation can be found.
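The divergence in the all-zero case can be seen numerically. The sketch below, our own illustration with numpy, runs the Newton-Raphson update on a data set with $\sum y_i = 0$: the intercept decreases by roughly one unit per iteration and never converges.

```python
import numpy as np

# All responses zero: the score equation sum(pi_i) = 0 of Equation (18)
# has no finite solution, so beta0 drifts towards -infinity.
x = np.linspace(-1.0, 1.0, 20)
y = np.zeros(20)
X = np.column_stack([np.ones_like(x), x])
beta = np.zeros(2)
for it in range(1, 21):
    pi = 1.0 / (1.0 + np.exp(-(X @ beta)))
    w = pi * (1.0 - pi)
    beta += np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (y - pi))
    if it % 5 == 0:
        print(it, np.round(beta, 3))   # beta[0] keeps decreasing
```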

4. Modified Score Function

Firth [2] proposed a method to reduce the bias of the MLE; the maximum likelihood convergence problem does not arise with the modified score function. The idea extends two standard approaches that have been studied extensively in the literature. The first is the computationally intensive jackknife method proposed by [13] [14]. The second approach simply substitutes $\hat{\beta}$ for the unknown $\beta$ in the bias term $b(\beta)$. The difficulty with the latter, in the case of small data sets, is that it is not uncommon for $\hat{\beta}$ to be infinite in some samples of logistic regression models [15] [16]. Recall that the maximum likelihood approach depends on the derivative of the log-likelihood function, solving the score equation

$$\frac{\partial l(\beta)}{\partial \beta} = U(\beta) = 0 \qquad (22)$$

[2] proposed that instead we solve $U^{*}(\beta) = 0$, where the appropriate modification to $U(\beta)$ is:

$$U^{*}(\beta) = U(\beta) - I(\beta) b(\beta) \qquad (23)$$

The expected value of $\hat{\beta}$, as given by [3], is:

$$E(\hat{\beta}) = \beta + b(\hat{\beta}) + O(n^{-1}) \qquad (24)$$

where

$$b(\beta) = \frac{2 E\left( \dfrac{\partial l}{\partial \beta} \dfrac{\partial^2 l}{\partial \beta^2} \right) + E\left( \dfrac{\partial^3 l}{\partial \beta^3} \right)}{2 \left\{ E\left( \dfrac{\partial^2 l}{\partial \beta^2} \right) \right\}^2}$$

The variance of $\hat{\beta}$ is again approximated by $\mathrm{Var}(\hat{\beta}) \approx I(\hat{\beta})^{-1}$.

4.1. Modified Function with Logistic Regression Model

In this part we apply the modified score function to the simple logistic regression model. The $O(n^{-1})$ bias vector has the form $b = (X^{T} W X)^{-1} X^{T} W \xi$, as proposed by [17]. Here $W\xi$ has $i$th element $h_i (\pi_i - \frac{1}{2})$, where $h_i$ is the $i$th diagonal element of the hat matrix

$$H = W^{1/2} X (X^{T} W X)^{-1} X^{T} W^{1/2},$$

$W = \mathrm{diag}(\pi_i (1 - \pi_i))$ and $X$ is the design matrix. Then the modified score function is written as

$$U^{*} = U - X^{T} W \xi \qquad (25)$$

In this case, the modified score function $U^{*} = (U_0^{*}, U_1^{*})$ gives two equations

$$U_0^{*} = \sum_{i=1}^{n} \left[ \left( y_i + \frac{h_i}{2} \right) - (1 + h_i) \pi_i \right] = 0 \qquad (26)$$

and

$$U_1^{*} = \sum_{i=1}^{n} \left[ \left( y_i + \frac{h_i}{2} \right) - (1 + h_i) \pi_i \right] x_i = 0 \qquad (27)$$

These are used to estimate the parameters.
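One way to solve Equations (26)-(27) numerically is a Newton-type iteration on the modified score, recomputing the hat values $h_i$ at each step. The sketch below is our own illustration with numpy, under the assumption that the unmodified information matrix is an adequate Jacobian for the update; it is not the paper's code.

```python
import numpy as np

def firth_logistic(x, y, tol=1e-8, max_iter=100):
    """Newton-type iteration on the modified score, Equations (26)-(27)."""
    X = np.column_stack([np.ones_like(x), x])
    beta = np.zeros(2)
    for _ in range(max_iter):
        pi = 1.0 / (1.0 + np.exp(-(X @ beta)))
        w = pi * (1.0 - pi)
        XW = X * w[:, None]
        info = X.T @ XW                               # X^T W X
        # h_i: diagonal of H = W^(1/2) X (X^T W X)^(-1) X^T W^(1/2)
        h = np.einsum('ij,ij->i', XW, X @ np.linalg.inv(info))
        U_star = X.T @ (y + h * (0.5 - pi) - pi)      # Equations (26)-(27)
        step = np.linalg.solve(info, U_star)
        beta += step
        if np.max(np.abs(step)) < tol:
            return beta
    raise RuntimeError("iteration did not converge")
```

Unlike the ordinary Newton-Raphson iteration, this update returns finite estimates even when every response is zero.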

4.2. Special Case of Modified Function

For further evaluation, we discuss the behaviour of the adjusted score function when all the observations have the same response, i.e. $\sum_{i=1}^{n} y_i = 0$. As a special case, suppose we have one explanatory variable $x_i$ taking values 0 or 1. Before we evaluate the adjusted score function, we first calculate the form of $h_i$, which we obtain from $H$. Here $h_i$ is the $i$th diagonal element of the matrix $H$ and is

$$h_i = \frac{\pi_i (1 - \pi_i) (X_2 - 2 x_i X_1 + x_i^2 X_0)}{\Delta} \qquad (28)$$

where $\Delta = X_0 X_2 - X_1^2$, $X_0 = n_0 \pi_0 (1 - \pi_0) + n_1 \pi_1 (1 - \pi_1)$, $X_1 = n_1 \pi_1 (1 - \pi_1)$ and $X_2 = n_1 \pi_1 (1 - \pi_1)$, with $n_0$ and $n_1$ the numbers of observations with $x$ equal to 0 and 1 respectively. Hence

$$h_0 = \frac{\pi_0 (1 - \pi_0) \left[ n_1 \pi_1 (1 - \pi_1) \right]}{\Delta} \qquad (29)$$

and

$$h_1 = \frac{\pi_1 (1 - \pi_1) \left[ n_0 \pi_0 (1 - \pi_0) \right]}{\Delta} \qquad (30)$$

Therefore, when we set the adjusted score function $(U_0^{*}, U_1^{*}) = 0$ with $\sum_{i=1}^{n} y_i = 0$, we have

$$U_1^{*} = \sum_{i=1}^{n} \left[ \frac{h_i}{2} - (1 + h_i) \pi_i \right] x_i = 0 \qquad (31)$$

This gives

$$\frac{h_1}{2} - (1 + h_1) \pi_1 = 0 \qquad (32)$$

and

$$\pi_1 = \frac{h_1}{2 (1 + h_1)} \qquad (33)$$

Now,

$$U_0^{*} = \sum_{i=1}^{n} \left[ \frac{h_i}{2} - (1 + h_i) \pi_i \right] = 0 \qquad (34)$$

and so

$$\left[ \frac{h_1 n_1}{2} - n_1 (1 + h_1) \pi_1 \right] + \left[ \frac{h_0 n_0}{2} - n_0 (1 + h_0) \pi_0 \right] = 0 \qquad (35)$$

Since the first bracket vanishes by Equation (32), we get

$$\pi_0 = \frac{h_0}{2 (1 + h_0)} \qquad (36)$$

Before calculating $\pi_0$ and $\pi_1$ we can simplify $h_0$ and $h_1$ as follows. Let $A = n_1 \pi_1 (1 - \pi_1)$ and $B = n_0 \pi_0 (1 - \pi_0)$. Then $X_0 = A + B$, $X_1 = A$ and $X_2 = A$, so we can write $\Delta$ as

$$\Delta = X_0 X_2 - X_1^2 = A^2 + AB - A^2 = AB = n_1 \pi_1 (1 - \pi_1) \, n_0 \pi_0 (1 - \pi_0) \qquad (37)$$

Therefore, h 0 and h 1 can be written as

$$h_0 = \frac{\pi_0 (1 - \pi_0) \left[ n_1 \pi_1 (1 - \pi_1) \right]}{\left[ n_1 \pi_1 (1 - \pi_1) \right] n_0 \pi_0 (1 - \pi_0)} = \frac{1}{n_0} \qquad (38)$$

and

$$h_1 = \frac{\pi_1 (1 - \pi_1) \left[ n_0 \pi_0 (1 - \pi_0) \right]}{\left[ n_1 \pi_1 (1 - \pi_1) \right] n_0 \pi_0 (1 - \pi_0)} = \frac{1}{n_1} \qquad (39)$$

Then, we obtain

$$\pi_0 = \frac{h_0}{2 (1 + h_0)} = \frac{1/n_0}{2 (1 + 1/n_0)} = \frac{1}{2 (n_0 + 1)} \qquad (40)$$

and

$$\pi_1 = \frac{h_1}{2 (1 + h_1)} = \frac{1/n_1}{2 (1 + 1/n_1)} = \frac{1}{2 (n_1 + 1)} \qquad (41)$$

As a result of this example with $x = 0, 1$ and $\sum_{i=1}^{n} y_i = 0$, we can say that the estimates of the parameters are finite. The modified function works well and the convergence problem does not arise.
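This closed-form result can be checked numerically. The sketch below assumes the firth_logistic function from our sketch in Section 4.1 is in scope; it fits the all-zero data set and compares the fitted probabilities with Equations (40)-(41).

```python
import numpy as np

# Check Equations (40)-(41): x in {0, 1}, all responses zero.
n0, n1 = 12, 8
x = np.concatenate([np.zeros(n0), np.ones(n1)])
y = np.zeros(n0 + n1)
b0, b1 = firth_logistic(x, y)            # sketch from Section 4.1
pi0 = 1.0 / (1.0 + np.exp(-b0))
pi1 = 1.0 / (1.0 + np.exp(-(b0 + b1)))
print(pi0, 1.0 / (2 * (n0 + 1)))         # both approximately 0.0385
print(pi1, 1.0 / (2 * (n1 + 1)))         # both approximately 0.0556
```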

5. Simulation Study

The following discussion covers the simulation plan and the designs used in generating the data, to identify the effect of sample size and of the proportion of events (the percentage of $y = 1$ or $y = 0$) on the estimation of the parameters. We examine the precision of the estimation by calculating the variance of the parameters obtained by simulation for the two approaches, MLE and Firth, and comparing these with $I(\beta)^{-1}$ evaluated at the known values of $\beta$. The simulation study is designed as follows (a code sketch of the design follows the list):

1) Three sample sizes have been used: $n = 40$, $n = 120$ and $n = 500$.

2) For each sample size we choose the $x_i$ as draws from $N(0, 1)$. The $x$ values are fixed throughout the simulation.

3) We choose $\beta_0$ and $\beta_1$ to give three cases: we fix $\beta_1 = 0.2$ and adjust $\beta_0$ so that, averaged over the covariates, $\mathrm{pr}(y = 1)$ is approximately (a) 0.5, (b) 0.1, (c) 0.05.

4) For each sample size and set of parameter values we perform 100,000 simulations.

5) Two approaches are used to estimate the parameters: MLE and the bias-reduced (Firth) estimator.
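The sketch below outlines this design in Python under stated assumptions: it calibrates $\beta_0$ to the target proportion with a root finder (our choice, not specified in the paper), uses far fewer replications than the 100,000 of the study, and flags only the simplest non-existence case, namely all responses equal; detecting general separation would need more work.

```python
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(2017)

def calibrate_beta0(x, beta1, target):
    """Choose beta0 so the mean of pr(y = 1) over the fixed x equals target."""
    f = lambda b0: np.mean(1.0 / (1.0 + np.exp(-(b0 + beta1 * x)))) - target
    return brentq(f, -20.0, 20.0)

n, beta1, target = 120, 0.2, 0.05
x = rng.normal(size=n)                     # fixed covariates (step 2)
beta0 = calibrate_beta0(x, beta1, target)  # step 3
pi = 1.0 / (1.0 + np.exp(-(beta0 + beta1 * x)))

n_sims, all_equal = 1000, 0                # 100,000 in the paper; fewer here
for _ in range(n_sims):
    y = (rng.uniform(size=n) < pi).astype(float)
    if y.sum() in (0, n):                  # MLE certainly does not exist
        all_equal += 1
print(f"data sets with all responses equal: {all_equal}/{n_sims}")
```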

5.1. Results and Discussion of Sample Size n = 500

The simulation assesses the accuracy of the estimation of $\mathrm{Var}(\hat{\beta})$ using the information matrix. We calculate $\mathrm{Var}(\hat{\beta}_0)$ and $\mathrm{Var}(\hat{\beta}_1)$ from the simulated values of $\hat{\beta}_0$ and $\hat{\beta}_1$, and also by evaluating $I(\beta)$ at the known values of $\beta$. The results, given in Table 1 for the three cases of the proportion of $y = 1$, all achieved convergence of the likelihood maximization algorithm.

In Table 1, $\mathrm{Var}\,\hat{\beta}_L^{\mathrm{Sim}}$ and $\mathrm{Var}\,\hat{\beta}_F^{\mathrm{Sim}}$ are the variances of the parameters estimated by the MLE and by Firth's method respectively, and Ratio L and Ratio F denote the ratios of these simulated variances to the information-based variance for the MLE and Firth's method respectively. The results show that the variance of the parameters calculated from the simulation and the variance calculated by evaluating the information matrix at the known values of $\beta$ are almost the same. We note that the ratio in the first case, when $\mathrm{pr}(y = 1)$ is 0.5, is very close to one, while in the second and third cases the ratio is slightly larger than in the first case.

The variances of the parameters calculated by Firth's method were smaller than those calculated by MLE, and the ratio was in general close to 1. Moreover, the bias $\hat{\beta}_F - \beta$ was smaller.

Table 1. Results of 100,000 simulations with sample size $n = 500$ and proportions (0.5, 0.1, 0.05) of $y = 1$: (a) the variance of the parameters estimated by MLE and Firth; (b) the bias values.

5.2. Results and Discussion of Sample Size n = 120

In this part we follow the same procedure as in the previous case of $n = 500$; the results of the simulation are shown in Table 2.

Maximum likelihood convergence problems occurred when $\mathrm{pr}(y = 1) = 0.05$. Note that there are many situations in which the likelihood function has no maximum, in which case we say that the maximum likelihood estimate does not exist. In our simulation, which generates the data set 100,000 times, the coefficients in some cases tend to infinity in the final iterations, so no estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ are obtained and the algorithm does not converge. In our summary we record the cases in which the algorithm failed to converge.

Here, for only 99,806 (99%) of the data sets was it possible to obtain finite estimates of $\beta_0$ and $\beta_1$. Moreover, the variance of the parameters $\hat{\beta}_0$ and $\hat{\beta}_1$ is large. This is because, even though convergence is achieved when $\sum_{i=1}^{n} y_i = 1$, there are some very large negative values of $\hat{\beta}$. In the other two cases, $\mathrm{pr}(y = 1) = 0.5$ and $0.1$, we achieved ML convergence in every simulation. We note that the ratio is close to one, but a little higher than in the case of $n = 500$. Firth's approach showed reasonable results: all cases achieved maximum likelihood convergence. Moreover, the ratio was better than with the MLE approach, as was the bias $\hat{\beta}_F - \beta$.

Table 2. Results of 100,000 simulations with sample size $n = 120$ and proportions (0.5, 0.1, 0.05) of $y = 1$: (a) the variance of the parameters estimated by MLE and Firth; (b) the bias values.

5.3. Results and Discussion of Sample Size n = 40

We used the same analysis as in the previous cases with $n = 40$; the results are shown in Table 3.

Table 3. Results of 100,000 simulations with sample size $n = 40$ and proportions (0.5, 0.1, 0.05) of $y = 1$: (a) the variance of the parameters estimated by MLE and Firth; (b) the bias values.

As can be seen in Table 3, the MLE approach had convergence problems.

Only 98,273 (98%) and 85,967 (86%) of the data sets achieved ML convergence when $\mathrm{pr}(y = 1)$ was 0.1 and 0.05 respectively. Convergence was achieved in every simulation only in the case $\mathrm{pr}(y = 1) = 0.5$, where the ratio was close to one, though higher than in the previous cases. Moreover, we found the same problem as discussed in the case of $n = 120$: the variance of the parameters $\hat{\beta}_0$ and $\hat{\beta}_1$ is large. However, when we use Firth's approach, all data sets achieved convergence. Moreover, the ratio was better than with the MLE approach, and the bias $\hat{\beta}_F - \beta$ was smaller.

6. Conclusion

Attention has been directed in this work to determining the asymptotic behaviour of the parameter estimates obtained by two methods, MLE and the bias reduction technique, compared with the results from the information matrix. Where regular convergence problems occur, the modified score function shows appropriate behaviour, indicating that the bias may be removed from the MLE by subtracting the bias term. The asymptotic variance of the MLE can behave strangely: the results show that the variance of the parameters was large in some cases even though convergence was achieved, reflecting some very large negative values of $\hat{\beta}$, as shown in the results section. We can report that small sample sizes and the value of $\mathrm{pr}(y = 1)$ affect the behaviour of the parameter estimates when using MLE; in particular, we found convergence problems for some combinations of sample size and $\mathrm{pr}(y = 1)$. Firth's approach gave reasonable results, with the data sets in all cases of sample size and $\mathrm{pr}(y = 1)$ achieving convergence. Overall, we can conclude that the bias reduction technique works well and behaves sensibly in almost all cases investigated. Moreover, the convergence problem is not the only issue affecting the behaviour of the MLE: even when convergence is achieved, the variance of the parameter estimates can be large.

Conflicts of Interest

The authors declare no conflicts of interest.

References

[1] Cox, D.R. and Hinkley, D.V. (1974) Theoretical Statistics. Chapman and Hall, London.
[2] Firth, D. (1993) Bias Reduction of Maximum Likelihood Estimates. Biometrika, 80, 27-38.
https://doi.org/10.1093/biomet/80.1.27
[3] Anderson, J.A. and Richardson, C. (1979) Logistic Discrimination and Bias Correction in Maximum Likelihood Estimation. Technometrics, 21, 71-78.
[4] McCullagh, P. (1986) The Conditional Distribution of Goodness-of-Fit Statistics for Discrete Data. Journal of the American Statistical Association, 81, 104-107.
https://doi.org/10.1080/01621459.1986.10478244
[5] Shenton, L.R. and Bowman, K.O. (1977) Maximum Likelihood Estimation in Small Samples. Griffin’s Statistical Monograph No. 38, London.
[6] Nelder, J.A. and Wedderburn, R.W.M. (1972) Generalized Linear Models. Journal of the Royal Statistical Society, Series A, 135, 370-384.
https://doi.org/10.2307/2344614
[7] Dobson, A. (1990) An Introduction to Generalized Linear Models. Chapman and Hall, London.
[8] Dobson, A.J. and Barnett, A.G. (2008) An Introduction to Generalized Linear Models. 3rd Edition, Chapman and Hall, New York.
[9] Kleinbaum, D.G. (1994) Logistic Regression: A Self-Learning Text. Springer-Verlag, New York.
https://doi.org/10.1007/978-1-4757-4108-7
[10] Hilbe, J.M. (2009) Logistic Regression Model. Chapman and Hall, New York.
[11] Hosmer, D.W. and Lemeshow, S. (2000) Applied Logistic Regression. Wiley, Chichester.
https://doi.org/10.1002/0471722146
[12] Hosmer, D., Lemeshow, S. and Sturdivant, R.X. (2013) Applied Logistic Regression. 3rd Edition, Wiley, Chichester.
https://doi.org/10.1002/9781118548387
[13] Quenouille, M.H. (1949) Approximate Tests of Correlation in Time-Series. Journal of the Royal Statistical Society: Series B, 11, 68-84.
[14] Quenouille, M.H. (1956) Notes on Bias in Estimation. Biometrika, 43, 353-360.
https://doi.org/10.1093/biomet/43.3-4.353
[15] Albert, A. and Anderson, J.A. (1984) On the Existence of Maximum Likelihood Estimates in Logistic Regression Models. Biometrika, 71, 1-10.
https://doi.org/10.1093/biomet/71.1.1
[16] Clogg, C.C., Rubin, D.B., Schenker, N., Schultz, B. and Weidman, L. (1991) Multiple Imputation of Industry and Occupation Codes in Census Public-Use Samples Using Bayesian Logistic Regression. Journal of the American Statistical Association, 86, 68-78.
https://doi.org/10.1080/01621459.1991.10475005
[17] McCullagh, P. and Nelder, J.A. (1989) Generalized Linear Models. 2nd Edition, Chapman and Hall, London.
