The logistic regression model has become commonly used to study the association between a binary response variable and a set of covariates; its widespread application rests on its ease of use and interpretation. The assessment of goodness-of-fit in the logistic regression model has attracted the attention of many researchers. Goodness-of-fit tests are methods for determining the suitability of a fitted model. Many methods have been proposed and discussed for assessing goodness-of-fit in the logistic regression model; however, the asymptotic distributions of the goodness-of-fit statistics have been examined less and need further investigation. This work focuses on assessing the behavior of the asymptotic distribution of goodness-of-fit tests, compares global goodness-of-fit tests, and evaluates them by simulation.
The goal of a logistic regression analysis is to find the best fitting model to describe the relationship between an outcome and covariates where the outcome is dichotomous, [
[
Hosmer and Lemeshow Test $\hat{C}$
The calculation of this test depends on grouping the estimated probabilities $\hat{\pi}(x_i)$ into $g$ groups. The first group contains the $n_1 = n/g$ observations with the smallest estimated probabilities, the second group contains the $n_2 = n/g$ observations with the next smallest estimated probabilities, and the last group contains the $n_g = n/g$ observations with the largest $\hat{\pi}(x_i)$; here $n$ is the sample size and $g$ the total number of groups. Before giving the formula for $\hat{C}$ we introduce some notation. The test statistic $\hat{C}$ is obtained by calculating the Pearson chi-square statistic from the $2 \times g$ table (two rows and $g$ columns) of observed and expected frequencies. In the row with $y = 1$, summing the estimated probabilities in a group gives the estimated expected value. In the row with $y = 0$, the estimated expected value is obtained by summing one minus the estimated probability over all subjects in the group. We denote the observed numbers of subjects with the event present ($y = 1$) and absent ($y = 0$) in group $s$ ($s = 1, 2, 3, \ldots, g$) by:
$$O_{1s} = \sum_{i=1}^{n_s} y_i, \qquad O_{0s} = \sum_{i=1}^{n_s} (1 - y_i)$$
where $n_s$ is the number of observations in group $s$. The expected numbers of subjects with the event present and absent, respectively, are denoted by:
$$E_{1s} = \sum_{i=1}^{n_s} \hat{\pi}_i, \qquad E_{0s} = \sum_{i=1}^{n_s} (1 - \hat{\pi}_i)$$
Then $\hat{C}$ is obtained by calculating the Pearson $\chi^2$ statistic for the observed and expected frequencies from the $2 \times g$ table:
$$\hat{C} = \sum_{s=1}^{g} \sum_{j=0}^{1} \frac{(O_{js} - E_{js})^2}{E_{js}}$$
from which it follows that
$$\hat{C} = \sum_{s=1}^{g} \left( \frac{(O_{0s} - E_{0s})^2}{E_{0s}} + \frac{(O_{1s} - E_{1s})^2}{E_{1s}} \right)$$
and finally we get
$$\hat{C} = \sum_{s=1}^{g} \frac{(O_s - n_s \bar{\pi}_s)^2}{n_s \bar{\pi}_s (1 - \bar{\pi}_s)},$$
where $n_s$ is the total number of observations in the $s$th group and $O_s$ is the number of responses ($y = 1$) in the $s$th group, defined as
$$O_s = \sum_{i=1}^{n_s} y_i$$
so that $O_s = O_{1s}$ and $n_s = O_{1s} + O_{0s}$, and $\bar{\pi}_s$ is the average of the estimated probabilities, defined as:
$$\bar{\pi}_s = \frac{\sum_{i=1}^{n_s} m_i \hat{\pi}_i}{n_s}.$$
Here, $m_i$ denotes the number of observations within covariate pattern $i$. An extensive set of simulations showed that when $m_i = 1$, where $m_i$ is the individual binomial denominator, and the fitted logistic model is the correct model, the distribution of $\hat{C}$ is well approximated by the $\chi^2$ distribution with $(g - 2)$ degrees of freedom [
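The grouping and the final formula for $\hat{C}$ can be sketched in a few lines of Python. This is a minimal illustration, assuming $m_i = 1$ and that no group has a mean fitted probability of exactly 0 or 1; the function name `hosmer_lemeshow_c` and the integer split rule used when $n$ is not a multiple of $g$ are our own choices, not taken from the paper.

```python
def hosmer_lemeshow_c(y, p_hat, g=10):
    """Return (C_hat, df) for binary y and fitted probabilities p_hat.

    Assumes m_i = 1 and that no group has mean fitted probability 0 or 1.
    """
    n = len(y)
    # Order the observations by estimated probability, smallest first.
    order = sorted(range(n), key=lambda i: p_hat[i])
    c_hat = 0.0
    for s in range(g):
        # Integer split points give groups whose sizes differ by at most one.
        members = order[(s * n) // g:((s + 1) * n) // g]
        n_s = len(members)
        if n_s == 0:
            continue  # empty groups can only occur when n < g
        o_s = sum(y[i] for i in members)               # observed events O_s
        pbar_s = sum(p_hat[i] for i in members) / n_s  # mean fitted probability
        c_hat += (o_s - n_s * pbar_s) ** 2 / (n_s * pbar_s * (1.0 - pbar_s))
    return c_hat, g - 2


# Tiny illustrative data set (hypothetical values), split into g = 2 groups.
y = [0, 0, 1, 1]
p_hat = [0.2, 0.2, 0.8, 0.8]
c_hat, df = hosmer_lemeshow_c(y, p_hat, g=2)
```

In this toy example each group contributes $(0.4)^2 / 0.32 = 0.5$, so `c_hat` is 1 up to rounding error.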
Hosmer and Lemeshow Test $\hat{H}$
The second grouping strategy proposed by Hosmer and Lemeshow, denoted by $\hat{H}$, groups the estimated probabilities using fixed cutpoints, so that each group contains all subjects whose fitted probability lies in a specific interval. For example, the first group has interval $0.0 \le \hat{\pi}(x_i) < 0.1$ and contains all subjects with estimated probabilities in this interval; the second group contains all subjects with estimated probabilities in $0.1 \le \hat{\pi}(x_i) < 0.2$, and the last group has interval $0.9 \le \hat{\pi}(x_i) < 1.0$.
The calculation of $\hat{H}$ uses exactly the same formula as $\hat{C}$: the only difference between the two approaches is in the construction of the groups. The distribution of $\hat{H}$ is approximated by the $\chi^2$ distribution with $(g - 2)$ degrees of freedom.
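The fixed-cutpoint grouping can be sketched as follows. This is a minimal illustration under the same assumptions as for $\hat{C}$ ($m_i = 1$, no group with mean fitted probability 0 or 1); the function name `hosmer_lemeshow_h` is hypothetical, and returning the number of non-empty groups rather than a degrees-of-freedom value is our choice, since the paper does not say how empty intervals should be handled.

```python
from bisect import bisect_right


def hosmer_lemeshow_h(y, p_hat, cutpoints=tuple(k / 10 for k in range(1, 10))):
    """Return (H_hat, number of non-empty groups) using fixed cutpoints."""
    groups = {}
    for y_i, p_i in zip(y, p_hat):
        # bisect_right places p_i == 0.1 in the second interval [0.1, 0.2),
        # matching the half-open intervals described in the text.
        s = bisect_right(cutpoints, p_i)
        groups.setdefault(s, []).append((y_i, p_i))
    h_hat = 0.0
    for members in groups.values():
        n_s = len(members)
        o_s = sum(y_i for y_i, _ in members)            # observed events
        pbar_s = sum(p_i for _, p_i in members) / n_s   # mean fitted probability
        h_hat += (o_s - n_s * pbar_s) ** 2 / (n_s * pbar_s * (1.0 - pbar_s))
    return h_hat, len(groups)


# Hypothetical data: two subjects in [0, 0.1) and two in [0.9, 1.0).
h_hat, n_groups = hosmer_lemeshow_h([0, 1, 0, 1], [0.05, 0.05, 0.95, 0.95])
```

With sparse data many intervals can be empty or contain very few subjects, which is the difficulty with $\hat{H}$ noted in the list that follows.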
Although the Hosmer and Lemeshow tests perform well, they require grouping, and the choice of $g$ raises several issues:
・ $g$ is arbitrary, but almost everywhere in the literature and in software a value of 10, or one very close to it, is chosen.
・ Smaller values of $g$ might be chosen for smaller $n$.
・ Sparse data cause problems for $\hat{H}$ and lead to uneven group widths for $\hat{C}$.
Deviance and Pearson Chi-Square Tests
Two of the most commonly used goodness-of-fit measures are Pearson's chi-squared statistic $\chi^2$ and the deviance $D$, but the behaviour of these tests is unstable with Bernoulli data; see [
$$D = -2 \log\left(\frac{\hat{L}_s}{\hat{L}_r}\right) = -2 (l_s - l_r),$$
where $\hat{L}_r$ and $\hat{L}_s$ are the likelihoods for the full and small models and $l_r$, $l_s$ denote the corresponding log-likelihoods. Asymptotically this is $\chi^2$ on $p - q$ degrees of freedom, where $p$ and $q$ are the numbers of parameters in the full and small models. The residual deviance is the case in which the large model is saturated and has $n$ parameters. In the case of the logistic regression model [
$$D = -2 \sum_{i=1}^{n} \left\{ \hat{\pi}_i \log \hat{\pi}_i + (1 - \hat{\pi}_i) \log (1 - \hat{\pi}_i) \right\},$$
In this case the deviance is invalid as a goodness-of-fit test, because it is a function of the $\hat{\pi}_i$ alone and so does not compare the observed values with the fitted values.
Also, [
$$X^2 = \sum_{i=1}^{n} \frac{(y_i - \hat{\pi})^2}{\hat{\pi} (1 - \hat{\pi})} = n$$
which is equal to the sample size, so this is not a useful goodness-of-fit test.
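Both degeneracies can be checked numerically. The sketch below is illustrative only: the Newton-Raphson fitter `fit_logistic`, the simulated data and the coefficient 0.7 are our own assumptions. It also reads the un-indexed $\hat{\pi}$ in the Pearson formula as a common fitted value $\hat{\pi} = \bar{y}$ (an intercept-only fit), which is an interpretation rather than something the text states.

```python
import numpy as np


def fit_logistic(X, y, n_iter=50, tol=1e-10):
    """Maximum likelihood fit of a logistic model by Newton-Raphson."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        pi = 1.0 / (1.0 + np.exp(-(X @ beta)))
        w = pi * (1.0 - pi)
        # Newton step: solve (X' W X) step = X' (y - pi).
        step = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (y - pi))
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta


rng = np.random.default_rng(0)
n = 200
x = rng.uniform(-2.0, 2.0, n)
y = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-0.7 * x))).astype(float)
X = np.column_stack([np.ones(n), x])

pi_hat = 1.0 / (1.0 + np.exp(-(X @ fit_logistic(X, y))))

# Deviance computed from the data and from the fitted values alone.
dev_from_y = -2.0 * np.sum(y * np.log(pi_hat) + (1 - y) * np.log(1 - pi_hat))
dev_from_pi = -2.0 * np.sum(pi_hat * np.log(pi_hat)
                            + (1 - pi_hat) * np.log(1 - pi_hat))

# Pearson statistic with a common fitted probability (intercept-only fit).
p_bar = y.mean()
x2_intercept_only = np.sum((y - p_bar) ** 2 / (p_bar * (1 - p_bar)))
```

At the maximum likelihood estimate the likelihood equations give $\sum_i y_i x_i^T \hat{\beta} = \sum_i \hat{\pi}_i x_i^T \hat{\beta}$, which is why the two deviance expressions agree up to numerical error.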
Residual Sum of Squares Test
[
$$RSS = \sum_{i=1}^{n} (y_i - \hat{\pi}_i)^2.$$
Of course, the relative weighting for varying $m_i$ is not relevant for our case, where $m_i = 1$. [
$W = \mathrm{diag}[\pi_i (1 - \pi_i)]$, $S(W) = \sum_{i=1}^{n} \pi_i (1 - \pi_i)$ is the sum of the diagonal elements of $W$, and $d$ is the vector with elements $d_i = (1 - 2\pi_i)$. Significance is assessed by referring the standardized statistic
$$\frac{RSS - S(W)}{\sqrt{\mathrm{var}[RSS - S(W)]}}$$
to the standard normal distribution.
$R^2$ Test
Several $R^2$-type statistics have been used for goodness-of-fit in logistic regression, such as that proposed by [
$$R_g^2 = 1 - \left( \frac{\hat{L}_0}{\hat{L}_c} \right)^{2/n}$$
where $\hat{L}_c$ represents the likelihood evaluated at the ML parameter estimates and $\hat{L}_0$ represents the likelihood of the model containing only an intercept. Another version, due to [
$$\bar{R}_g^2 = \frac{R_g^2}{\max(R_g^2)}$$
where $\max(R_g^2) = 1 - (\hat{L}_0)^{2/n}$.
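As a small worked illustration, the two $R^2$ measures can be computed from log-likelihoods. The sketch assumes the generalized form $R_g^2 = 1 - (\hat{L}_0 / \hat{L}_c)^{2/n}$ with $\hat{L}_0$ and $\hat{L}_c$ taken as likelihoods, so that it agrees with $\max(R_g^2) = 1 - (\hat{L}_0)^{2/n}$; the function names are hypothetical, and working on the log scale is our choice to avoid numerical underflow.

```python
import math


def r2_generalized(loglik_fitted, loglik_null, n):
    """R_g^2 = 1 - (L_0 / L_c)^(2/n), computed from log-likelihoods."""
    return 1.0 - math.exp(2.0 * (loglik_null - loglik_fitted) / n)


def r2_max(loglik_null, n):
    """Upper bound max(R_g^2) = 1 - L_0^(2/n)."""
    return 1.0 - math.exp(2.0 * loglik_null / n)


def r2_rescaled(loglik_fitted, loglik_null, n):
    """Rescaled version: R_g^2 divided by its maximum."""
    return r2_generalized(loglik_fitted, loglik_null, n) / r2_max(loglik_null, n)


# Hypothetical example: n = 100 binary outcomes, half of them equal to 1,
# so the intercept-only model has log-likelihood n * log(0.5).
n = 100
loglik_null = n * math.log(0.5)
upper_bound = r2_max(loglik_null, n)
```

Here the upper bound is $1 - 0.5^2 = 0.75$, which shows why the unscaled measure cannot reach 1 for binary data.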
Information Matrix tests: IMT and IMTDIAG
The Information Matrix test (IMT) is a test for general mis-specification, proposed by [
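$E\left( -\dfrac{\partial^2 l}{\partial \theta \, \partial \theta^T} \right)$ and $E\left( \dfrac{\partial l}{\partial \theta} \dfrac{\partial l}{\partial \theta^T} \right)$, as these differ when the model is mis-specified but not when the model is correct.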
Consider binary regression, where the outcome for individual $i$, $i = 1, \ldots, n$, is a random variable $Y_i \in \{0, 1\}$, with $\Pr(Y_i = 1 \mid x_i) = \pi_i = f(x_i^T \beta)$, where $x_i$ is a $p \times 1$ vector of covariates and $\beta$ is a $p$-dimensional vector of parameters. It will be convenient to write $a_i = x_i^T \beta$ and to let $l_i$ be the contribution to the log-likelihood $l$ from unit $i$.
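We have
$$l(\beta) = \sum_{i=1}^{n} l_i(\beta) = \sum_{i=1}^{n} \left\{ Y_i \log \pi_i + (1 - Y_i) \log (1 - \pi_i) \right\}$$
The $p$-dimensional likelihood equations $\partial l / \partial \beta = 0$ can be written:
$$\frac{\partial l}{\partial \beta} = \sum_{i=1}^{n} \left[ \frac{Y_i - \pi_i}{\pi_i (1 - \pi_i)} \right] \frac{\partial \pi_i}{\partial a_i} x_i = 0 \qquad (1)$$
We can also derive the $p \times p$ matrix $\partial^2 l / \partial \beta \, \partial \beta^T$ as:
$$\sum_{i=1}^{n} \left[ \frac{Y_i - \pi_i}{\pi_i (1 - \pi_i)} \frac{\partial^2 \pi_i}{\partial a_i^2} - \frac{(Y_i - \pi_i)^2}{\pi_i^2 (1 - \pi_i)^2} \left( \frac{\partial \pi_i}{\partial a_i} \right)^2 \right] x_i x_i^T \qquad (2)$$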
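The idea behind the information matrix test is that if the model is correctly specified then the quantity:
$$IM = \sum_{i=1}^{n} \left( \left. \frac{\partial l_i}{\partial \beta} \frac{\partial l_i}{\partial \beta^T} \right|_{\hat{\beta}} + \left. \frac{\partial^2 l_i}{\partial \beta \, \partial \beta^T} \right|_{\hat{\beta}} \right)$$
has zero mean. By comparing (1) and (2) we can compute this quantity, for a general value of $\beta$, as the sum over $i$ of:
$$\frac{\partial l_i}{\partial \beta} \frac{\partial l_i}{\partial \beta^T} + \frac{\partial^2 l_i}{\partial \beta \, \partial \beta^T} = \frac{Y_i - \pi_i}{\pi_i (1 - \pi_i)} \frac{\partial^2 \pi_i}{\partial a_i^2} x_i x_i^T \qquad (3)$$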
We can test the null hypothesis that IM has zero mean by computing the variance of IM and then constructing a standard $\chi^2$ statistic. The first step is to compute the variance of $n^{-1/2} \sum d_i$, where we write $d_i$ for essentially the right-hand side of (3):
$$\frac{Y_i - \pi_i}{\pi_i (1 - \pi_i)} \frac{\partial^2 \pi_i}{\partial a_i^2} z_i$$
where we have changed the $p \times p$ symmetric matrix into a vector $z_i$ in order to be able to use standard methods. As $x_i x_i^T$ is symmetric we do not wish to duplicate entries, so $z_i$ is the $\tfrac{1}{2} p (p + 1)$-dimensional vector:
$$z_i^T = \left( [x_{11}, x_{21}, \cdots, x_{p1}], [x_{22}, x_{32}, \cdots, x_{p2}], \cdots, [x_{(p-1),(p-1)}, x_{p,(p-1)}], [x_{pp}] \right)$$
where $x_{st}$ is the $(s, t)$th element of $x_i x_i^T$. If we write:
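$$A = n^{-1/2} \sum d_i = n^{-1/2} \sum_{i=1}^{n} \frac{Y_i - \pi_i}{\pi_i (1 - \pi_i)} \frac{\partial^2 \pi_i}{\partial a_i^2} z_i = n^{-1/2} \sum_{i=1}^{n} (Y_i - \pi_i)(1 - 2\pi_i) z_i$$
(the last equality uses the logistic form of $\pi_i$), then because the different terms are independent we obtain:
$$\Psi = \mathrm{var}(A) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{\pi_i (1 - \pi_i)} \left( \frac{\partial^2 \pi_i}{\partial a_i^2} \right)^2 z_i z_i^T,$$
which is a $q \times q$ matrix, where $q = \tfrac{1}{2} p (p + 1)$.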
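We should also note that if $B$ is defined as essentially the score, i.e. the derivative of the log-likelihood,
$$B = n^{-1/2} \sum_{i=1}^{n} \frac{Y_i - \pi_i}{\pi_i (1 - \pi_i)} \frac{\partial \pi_i}{\partial a_i} x_i = n^{-1/2} \sum_{i=1}^{n} (Y_i - \pi_i) x_i$$
then the variance of $B$ is the $p \times p$ matrix $\Omega$:
$$\Omega = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{\pi_i (1 - \pi_i)} \left( \frac{\partial \pi_i}{\partial a_i} \right)^2 x_i x_i^T$$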
Before computing the covariance of $A$ and $B$, we note the identity
$$(y_i - \pi_i)^2 - \pi_i (1 - \pi_i) = (y_i - \pi_i)(1 - 2\pi_i),$$
which holds because $y_i^2 = y_i$ for binary $y_i$. Now,
$$\mathrm{cov}(A, B) = E(A B^T) - E(A) E(B)^T$$
For independent and identically distributed random variables and under $H_0$, the second term of $\mathrm{cov}(A, B)$ is zero, so the covariance of $A$ and $B$ is the $q \times p$ matrix
$$\Delta = \mathrm{cov}(A, B) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{\pi_i (1 - \pi_i)} \left( \frac{\partial \pi_i}{\partial a_i} \right) \left( \frac{\partial^2 \pi_i}{\partial a_i^2} \right) z_i x_i^T$$
Central limit arguments suggest that asymptotically $(A^T, B^T)$ is a $(q + p)$-dimensional normal variable. However, the IM test requires $A$ to be evaluated at $\hat{\beta}$, giving $\hat{A}$, say, and at this value we know that $B = 0$. Consequently the variance of $\hat{A}$ is the variance of $A$ conditional on $B = 0$, which is $\Psi - \Delta \Omega^{-1} \Delta^T$.
Assuming a logistic regression we have $\partial \pi_i / \partial a_i = \pi_i (1 - \pi_i)$ and $\partial^2 \pi_i / \partial a_i^2 = \pi_i (1 - \pi_i)(1 - 2\pi_i)$, so we can evaluate the dispersion matrices at the MLEs as:
$$\hat{\Omega} = \frac{1}{n} \sum_{i=1}^{n} \hat{\pi}_i (1 - \hat{\pi}_i) x_i x_i^T$$
$$\hat{\Psi} = \frac{1}{n} \sum_{i=1}^{n} \hat{\pi}_i (1 - \hat{\pi}_i)(1 - 2\hat{\pi}_i)^2 z_i z_i^T$$
$$\hat{\Delta} = \frac{1}{n} \sum_{i=1}^{n} \hat{\pi}_i (1 - \hat{\pi}_i)(1 - 2\hat{\pi}_i) z_i x_i^T$$
If we write $\hat{V} = \hat{\Psi} - \hat{\Delta} \hat{\Omega}^{-1} \hat{\Delta}^T$ then one version of the IM test is found by referring $\hat{A}^T \hat{V}^{-1} \hat{A}$ to a $\chi^2$ variable with degrees of freedom equal to the rank of $\hat{V}$.
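The quantities $\hat{A}$, $\hat{\Omega}$, $\hat{\Psi}$, $\hat{\Delta}$ and $\hat{V}$ translate directly into matrix code. The sketch below is a hedged illustration for the logistic case, not the author's implementation: the function name `im_statistic` is hypothetical, the rows of `Z` are assumed to hold the vectors $z_i^T$, and using a pseudo-inverse together with a numerical rank is our way of handling a possibly singular $\hat{V}$.

```python
import numpy as np


def im_statistic(X, Z, y, pi_hat):
    """Return (A' V^+ A, rank(V)) for a fitted logistic model.

    X: (n, p) design matrix, Z: (n, q) matrix whose rows are z_i',
    y: binary responses, pi_hat: fitted probabilities at the MLE.
    """
    n = len(y)
    w = pi_hat * (1.0 - pi_hat)       # pi_i (1 - pi_i)
    d = 1.0 - 2.0 * pi_hat            # 1 - 2 pi_i
    A = Z.T @ ((y - pi_hat) * d) / np.sqrt(n)
    Omega = X.T @ (w[:, None] * X) / n
    Psi = Z.T @ ((w * d ** 2)[:, None] * Z) / n
    Delta = Z.T @ ((w * d)[:, None] * X) / n
    # V = Psi - Delta Omega^{-1} Delta'
    V = Psi - Delta @ np.linalg.solve(Omega, Delta.T)
    df = int(np.linalg.matrix_rank(V))
    stat = float(A @ np.linalg.pinv(V) @ A)
    return stat, df


# Hypothetical inputs, for illustration only (pi_hat is not a fitted MLE here).
x = np.linspace(-3.0, 3.0, 50)
X = np.column_stack([np.ones_like(x), x])
Z = np.column_stack([np.ones_like(x), x, x ** 2])
pi_hat = 1.0 / (1.0 + np.exp(-0.5 * x))
y = (np.arange(50) % 2).astype(float)
stat, df = im_statistic(X, Z, y, pi_hat)
```

In practice `pi_hat` would come from the fitted model, and `Z` would be built from `X` according to whichever version of the test is wanted.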
The idea of the IMDIAG test is the same as that of the IM test; the only difference is that for the former the elements of $z_i$ are just the diagonal elements of $x_i x_i^T$, so $z_i$ is the $p$-dimensional vector:
$$z_i^T = (x_{i1}^2, x_{i2}^2, \cdots, x_{ip}^2).$$
To explain the difference in the size of the vector $z_i$ between the IM test and the IMDIAG test, let us consider a simple example. Suppose $x_i x_i^T$ is the $3 \times 3$ symmetric matrix:
$$\begin{bmatrix} x_{11} & x_{12} & x_{13} \\ x_{21} & x_{22} & x_{23} \\ x_{31} & x_{32} & x_{33} \end{bmatrix},$$
where $x_{rs} = x_{ir} x_{is}$. Then in the case of the IM test, $z_i^T$ has dimension $1 \times 6$ and its elements are:
$$z_i^T = [x_{11}, x_{12}, x_{13}, x_{22}, x_{23}, x_{33}],$$
whereas in the case of the IMDIAG test, $z_i^T$ is the $1 \times 3$ vector:
$$z_i^T = [x_{11}, x_{22}, x_{33}].$$
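The two constructions of $z_i$ in this example can be written compactly. The helper names `z_full` and `z_diag` are hypothetical; the ordering of the non-duplicated entries follows the $3 \times 3$ example above (upper triangle, row by row).

```python
import numpy as np


def z_full(x):
    """Non-duplicated entries of x x' (upper triangle, row by row): IM test."""
    outer = np.outer(x, x)
    return outer[np.triu_indices(len(x))]


def z_diag(x):
    """Diagonal entries of x x': IMDIAG test."""
    return np.asarray(x, dtype=float) ** 2


x_i = np.array([1.0, 2.0, 3.0])   # hypothetical covariate vector with p = 3
z_im = z_full(x_i)                # length p (p + 1) / 2 = 6
z_imdiag = z_diag(x_i)            # length p = 3
```

For this $x_i$ the IM vector is $[1, 2, 3, 4, 6, 9]$ and the IMDIAG vector is $[1, 4, 9]$.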
Our work focuses on the behaviour of goodness-of-fit tests under alternative hypotheses, in the case of a missing covariate and in the case of the wrong model, because these are the cases in which we could not reproduce Kuss's work. We focus on four goodness-of-fit tests ($\hat{C}_g$, $RSS$, $IM$, $IMDIAG$). We therefore examine the behaviour of the tests in more depth and seek more information about the asymptotic distribution of the MLE in the case of the wrong model
$$\pi_i = \mathrm{expit}(0.405 x_i^2),$$
or in the case of the missing covariate,
$$\pi_i = \mathrm{expit}(0.405 x_i + 0.223 u_i),$$
where $X, U \sim U(-6, 6)$, with $X$ and $U$ independent.
The simulation study is designed as in Kuss's work:
・ The sample sizes are $n = 100$ and $n = 500$.
・ Only extreme sparseness, $m_i = 1$, is considered.
・ The number of simulations is 1000.
・ The predictor variables $X$ and $U$ are independent with distribution $U(-6, 6)$, chosen to conform with Kuss's work.
・ Four goodness-of-fit tests are studied under three different scenarios:
(a) True covariate.
(b) Missing covariate.
(c) Wrong functional form of the covariate.
・ The fitted model in all cases is a standard logistic model with an intercept and one covariate.
・ All tests of the null hypothesis are carried out at level $\alpha = 0.05$.
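The data-generating step of this design can be sketched as follows. This is a minimal illustration, not the code used for the study: the function name `simulate` and the scenario labels are hypothetical, and the coefficient 0.693 for the correct model is the one quoted with the results for that case.

```python
import numpy as np


def expit(a):
    """Inverse logit."""
    return 1.0 / (1.0 + np.exp(-a))


def simulate(scenario, n, rng):
    """Draw (x, y) for one replicate; only x is passed to the fitted model."""
    x = rng.uniform(-6.0, 6.0, n)
    u = rng.uniform(-6.0, 6.0, n)          # covariate omitted from the fit
    if scenario == "correct":
        pi = expit(0.693 * x)
    elif scenario == "missing_covariate":
        pi = expit(0.405 * x + 0.223 * u)
    elif scenario == "wrong_form":
        pi = expit(0.405 * x ** 2)
    else:
        raise ValueError(f"unknown scenario: {scenario}")
    y = rng.binomial(1, pi)                # Bernoulli responses, m_i = 1
    return x, y


rng = np.random.default_rng(2017)          # arbitrary seed
x, y = simulate("wrong_form", 500, rng)
```

Each replicate would then be fitted with an intercept and `x` only, and the four statistics computed from the fitted probabilities.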
Results and Discussion of Tests under Correct Model
In
$$\pi_i = \mathrm{expit}(0.693 x_i).$$
The statistics used in the simulation as goodness-of-fit tests are: Hosmer-Lemeshow ($\hat{C}_g$), information matrix ($IM$), information matrix diagonal ($IMDIAG$) and residual sum of squares ($RSS$). The asymptotic distribution of each statistic is a $\chi^2_{df}$ distribution, whose mean and variance equal $df$ and $2df$ respectively. For the $\hat{C}_g$ statistic we chose $g = 10$ groups, so the degrees of freedom are $df = g - 2 = 8$. The results shown in
In the second case, we report the mean, the variance and the power to detect a mis-specified model for the same goodness-of-fit tests under the missing covariate model, when the true model is:
$$\pi_i = \mathrm{expit}(0.405 x_i + 0.223 u_i),$$
and the fitted model is a standard logistic regression model with $x_i$ only.
In the final case we show the power to detect a mis-specified model for the four goodness-of-fit tests under the wrong functional form of the covariate, with true model
$$\pi_i = \mathrm{expit}(0.405 x_i^2)$$
and the model fitted as in the previous cases.
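The summaries reported for each test (mean, variance and percentage of rejections at $\alpha = 0.05$) can be computed from the simulated statistics as sketched below. The function name `summarize` and the lookup table `CHI2_CRIT_05` are hypothetical; the table holds the standard 95% quantiles of the $\chi^2$ distribution for the degrees of freedom used here, rounded to three decimals.

```python
import statistics

# 95% quantiles of the chi-square distribution (rounded to 3 decimals).
CHI2_CRIT_05 = {1: 3.841, 2: 5.991, 3: 7.815, 8: 15.507}


def summarize(stats, df):
    """Compare simulated statistics with the chi-square(df) reference."""
    crit = CHI2_CRIT_05[df]
    return {
        "mean": statistics.fmean(stats),
        "var": statistics.variance(stats),   # sample variance
        "pct_rej": 100.0 * sum(s > crit for s in stats) / len(stats),
        "expected_mean": df,                 # chi-square mean
        "expected_var": 2 * df,              # chi-square variance
    }


# Hypothetical statistics for a test with df = 1.
summary = summarize([1.0, 2.0, 4.0, 5.0], df=1)
```

Under a correctly specified model, `mean` and `var` should be close to `expected_mean` and `expected_var`, and `pct_rej` close to 5.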
Results under the correct model:

Test | df | Mean (n = 100) | Var (n = 100) | %Rej (n = 100) | Mean (n = 500) | Var (n = 500) | %Rej (n = 500)
---|---|---|---|---|---|---|---
$\hat{C}_g$ | 8 | 8.06 | 20.47 | 4.6 | 7.96 | 17.12 | 5.70
$IM$ | 3 | 3.06 | 7.23 | 5.10 | 3.00 | 6.33 | 4.70
$IMDIAG$ | 2 | 2.04 | 3.97 | 5.50 | 1.94 | 3.63 | 4.20
$RSS$ | 1 | 0.98 | 1.81 | 4.60 | 0.99 | 1.83 | 4.10

Results under the missing covariate model:

Test | df | Mean (n = 100) | Var (n = 100) | %Rej (n = 100) | Mean (n = 500) | Var (n = 500) | %Rej (n = 500)
---|---|---|---|---|---|---|---
$\hat{C}_g$ | 8 | 7.44 | 11.13 | 1.50 | 7.35 | 12.62 | 3.20
$IM$ | 3 | 3.01 | 6.05 | 5.50 | 2.38 | 4.15 | 1.90
$IMDIAG$ | 2 | 1.82 | 3.06 | 3.3 | 2.05 | 3.46 | 4.80
$RSS$ | 1 | 0.92 | 1.51 | 4.10 | 0.99 | 1.73 | 4.50
In
In
Results under the wrong functional form of the covariate:

Test | df | Mean (n = 100) | Var (n = 100) | %Rej (n = 100) | Mean (n = 500) | Var (n = 500) | %Rej (n = 500)
---|---|---|---|---|---|---|---
$\hat{C}_g$ | 8 | 31.50 | 75.73 | 98.8 | 133.73 | 382.62 | 100
$IM$ | 3 | 17.33 | 17.97 | 100 | 75.57 | 72.70 | 100
$IMDIAG$ | 2 | 16.85 | 16.64 | 100 | 76.28 | 71.82 | 100
$RSS$ | 1 | 17.07 | 17.16 | 100 | 76.17 | 163.84 | 99.5
The work in this paper centered on the asymptotic distribution of goodness-of-fit tests in the logistic regression model. We also compared some global goodness-of-fit tests and set the results against those of Kuss. The simulations covered two types of goodness-of-fit tests: those that group the observations and those that do not. Our results confirm Kuss's findings regarding the power of the RSS, Hosmer-Lemeshow, IM and IMDIAG tests under the correct model and the missing covariate model. However, our results on the asymptotic distribution of the goodness-of-fit statistics, which is nominally $\chi^2_{df}$, show varied behaviour in the mean and variance of the statistics. Under the correct model all methods show reasonable rejection rates, with a slightly larger variance for the Hosmer-Lemeshow test; under the missing covariate model the variances are smaller. Goodness-of-fit statistics are asymptotically distributed as central $\chi^2$ under $H_0$, when the model is correctly specified, and as non-central $\chi^2$ under $H_1$, when the model is mis-specified. Under the wrong model, however, the results show strange behaviour: none of the means and variances satisfies the asymptotic $\chi^2_{df}$ assumption of mean $df$ and variance $2df$, and the power is high. This means that in some circumstances the properties of the distribution of the test statistics (e.g. mean and variance) are far from those of the $\chi^2$ distribution. An interesting point is that some goodness-of-fit tests seem to be affected by the assumption on the covariance matrix. Many issues concerning the mean and variance of the asymptotic distribution of goodness-of-fit statistics should therefore be examined further.
Badi, N.H.S. (2017) Asymptomatic Distribution of Goodness-of-Fit Tests in Logistic Regression Model. Open Journal of Statistics, 7, 434-445. https://doi.org/10.4236/ojs.2017.73031