Asymptotic Distribution of Goodness-of-Fit Tests in the Logistic Regression Model
1. Introduction
The goal of a logistic regression analysis is to find the best-fitting model to describe the relationship between a dichotomous outcome and covariates; [1] considered the logistic regression model as a member of the class of generalized linear models. For the assumptions and further details of the behaviour of the logistic model see [2] [3], and for more applications see [4] [5] [6] [7]. Goodness-of-fit is very important for deciding whether a more succinct model is adequate. After fitting the logistic regression model, the next step is to examine how well the proposed model fits the observed data and how effective the model is; this is called its goodness-of-fit. Goodness-of-fit tests for logistic regression can be split into three types: 1) those based on an examination of residuals; 2) those based on a test which groups the observations; 3) those which do not group the observations. Methods of type 1) are more general, subjective assessments of a model and are not considered in this work. This is not to undervalue them: they are often the most valuable approach to model assessment. However, the observed values for Bernoulli regression are just 0s and 1s, and this makes graphical approaches less easy to handle. The focus of this work is the test statistics. In the next section, tests using grouping are considered, with those that do not need to group the data discussed in Section 3. The behaviour of the asymptotic distribution of goodness-of-fit tests is investigated in Section 4, with comparisons between some goodness-of-fit tests evaluated on simulated data with two different sample sizes. The simulation in this work was designed according to the simulation made by [8], which compared some goodness-of-fit tests in logistic regression models with sparse data. The results of his simulation showed that some goodness-of-fit tests have reasonable power compared with other tests.
However, Kuss did not give information about the asymptotic distribution of these statistics. This paper sets out to show the behaviour of the asymptotic distribution of goodness-of-fit tests for the logistic regression model. Finally, conclusions and further discussion are given in the last section.
2. Goodness-of-Fit Tests with Grouping
[9] proposed and developed approaches involving grouping based on the values of the estimated probabilities obtained from the fitted logistic model. Two grouping methods were proposed. The first approach is based on grouping the data according to percentiles of the estimated probabilities, and the second on grouping the data according to fixed cutoff values of the estimated probabilities. Tests with grouping based on estimated probabilities were proposed and developed by [9] [10] [11]. [12] developed a score test statistic which essentially compares two fitted models.
Hosmer and Lemeshow $\hat{C}$ Test
The calculation of this test depends upon a grouping of the estimated probabilities into $g$ groups, each containing approximately $n/g$ subjects. The first group contains the $n/g$ observations with the smallest estimated probabilities, the second group contains the $n/g$ values with the next smallest estimated probabilities, and the last group contains the $n/g$ observations with the largest estimated probabilities; here $n$ is the size of the sample and $g$ the total number of groups. Before defining a formula to calculate $\hat{C}$ we will consider some notation. The test statistic $\hat{C}$ is obtained by calculating the Pearson chi-square statistic from the table with two rows and $g$ columns of observed and expected frequencies. In the row with $y = 1$, summing all the estimated probabilities in a group gives the estimated expected value. In the row with $y = 0$, the estimated expected value is obtained by summing one minus the estimated probabilities over all subjects in the group. We denote the observed numbers of subjects with the event present and absent in group $s$ by

$$o_{1s} = \sum_{i \in s} y_i, \qquad o_{0s} = n_s - o_{1s},$$

where $n_s$ is the number of observations in group $s$. The expected numbers of subjects present and absent respectively are denoted by

$$e_{1s} = \sum_{i \in s} \hat{\pi}_i, \qquad e_{0s} = n_s - e_{1s}.$$

Then $\hat{C}$ is simply obtained by calculating the Pearson $\chi^2$ statistic for the observed and expected frequencies from the $2 \times g$ table as

$$\hat{C} = \sum_{s=1}^{g} \sum_{j=0}^{1} \frac{(o_{js} - e_{js})^2}{e_{js}},$$

from which it follows that

$$\hat{C} = \sum_{s=1}^{g} \frac{(o_{1s} - n_s \bar{\pi}_s)^2}{n_s \bar{\pi}_s (1 - \bar{\pi}_s)},$$

where $n_s$ is the total number of subjects in the $s$th group, $o_{1s}$ is the number of responses over the covariate patterns in the $s$th group, and $\bar{\pi}_s$ is the average of the estimated probabilities, defined as

$$\bar{\pi}_s = \frac{1}{n_s} \sum_{i \in s} m_i \hat{\pi}_i.$$

Here, the number of observations within covariate pattern $i$ is denoted by $m_i$. An extensive set of simulations showed that when $m_i = 1$, where $m_i$ is the individual binomial denominator, and the fitted logistic model is the correct model, the distribution of $\hat{C}$ is well approximated by the $\chi^2$ distribution with $g - 2$ degrees of freedom [9].
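For the $m_i = 1$ case, the computation of $\hat{C}$ can be sketched as follows. This is a minimal illustration: the function name is our own, and for brevity the true probabilities stand in for the fitted $\hat{\pi}_i$.

```python
import numpy as np

def hosmer_lemeshow_C(y, pi_hat, g=10):
    """Hosmer-Lemeshow C-hat: groups of ~n/g subjects by sorted probabilities."""
    order = np.argsort(pi_hat)
    y, pi_hat = y[order], pi_hat[order]
    C = 0.0
    for idx in np.array_split(np.arange(len(y)), g):
        n_s = len(idx)                 # subjects in group s
        o_1s = y[idx].sum()            # observed events in group s
        pi_bar = pi_hat[idx].mean()    # average estimated probability
        C += (o_1s - n_s * pi_bar) ** 2 / (n_s * pi_bar * (1 - pi_bar))
    return C                           # compare with chi-square on g - 2 df

# quick check on data simulated from a correctly specified model
rng = np.random.default_rng(0)
x = rng.normal(size=500)
pi = 1 / (1 + np.exp(-(0.5 + x)))
y = rng.binomial(1, pi).astype(float)
C_hat = hosmer_lemeshow_C(y, pi, g=10)
```

Under the correct model and with $g = 10$, the returned value should behave approximately like a draw from the $\chi^2_8$ distribution.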
Hosmer and Lemeshow $\hat{H}$ Test
The second grouping strategy proposed by Hosmer and Lemeshow, denoted by $\hat{H}$, depends upon grouping the estimated probabilities into groups based on fixed cutpoints, so each group contains all subjects with fitted probability located in a specific interval. For example, the first group has the interval $[0, 0.1]$, and contains all subjects with estimated probabilities located in this interval; the second group contains all subjects with estimated probabilities located between the cutpoints $0.1$ and $0.2$; and the last group has the interval $(0.9, 1.0]$.
The calculation of $\hat{H}$ uses exactly the same formula used to calculate $\hat{C}$: the only difference between the two approaches is in the construction of the groups. The distribution of $\hat{H}$ is approximated by the $\chi^2$ distribution with $g - 2$ degrees of freedom.
Although the Hosmer and Lemeshow tests perform well, they require grouping, and the choice of $g$ is problematic:
・ $g$ is arbitrary, but almost everywhere in the literature and in software a value of 10, or very similar, is chosen.
・ Smaller values of $g$ might be chosen for smaller $n$.
・ Sparse data cause a problem for $\hat{H}$ and lead to uneven group widths for $\hat{C}$.
3. Goodness-of-Fit Tests without Grouping
Deviance and Pearson Chi-Square Tests
Two of the most commonly used goodness-of-fit measures are the Pearson chi-squared $X^2$ and the deviance $D$ goodness-of-fit test statistics, but the behaviour of these tests is unstable with Bernoulli data; see [13]. The general idea of the deviance is to make a comparison between two models: the first is a full model with $p$ parameters and the second a model with $q$ parameters, where $q < p$. The deviance can be written as

$$D = -2 \log \frac{L_q}{L_p} = -2 (\ell_q - \ell_p),$$

where $L_p$, $L_q$ are the likelihoods for the full and small models and $\ell_p$, $\ell_q$ denote the log-likelihoods. Asymptotically this is $\chi^2$ on $p - q$ df. The residual deviance is the case where the large model is saturated and has $n$ parameters. In the case of the logistic regression model, [13] introduced a specific form when $m_i = 1$; the residual deviance can then be found as

$$D = -2 \sum_{i=1}^{n} \left[ \hat{\pi}_i \log \frac{\hat{\pi}_i}{1 - \hat{\pi}_i} + \log (1 - \hat{\pi}_i) \right].$$

In this case the deviance is invalid as a goodness-of-fit test, because it is a function of $\hat{\pi}_i$ alone, and so does not compare the observed values with the fitted values.
Also, [13] discussed the Pearson chi-square goodness-of-fit statistic when $m_i = 1$, which can be written

$$X^2 = \sum_{i=1}^{n} \frac{(y_i - \hat{\pi}_i)^2}{\hat{\pi}_i (1 - \hat{\pi}_i)},$$

which is (approximately) equal to the sample size whatever the data: this is not a useful goodness-of-fit test.
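Both facts are easy to verify numerically in the simplest setting, an intercept-only fit, where the MLE is $\hat{\pi}_i = \bar{y}$ for every subject; in this special case the identity $X^2 = n$ is exact. The sketch below assumes nothing beyond the formulas above.

```python
import numpy as np

# Intercept-only logistic fit: the MLE is pi_hat = mean(y) for every subject.
rng = np.random.default_rng(1)
y = rng.binomial(1, 0.3, size=200).astype(float)
pi_hat = np.full_like(y, y.mean())

# Usual residual deviance for Bernoulli data...
D_usual = -2 * np.sum(y * np.log(pi_hat) + (1 - y) * np.log(1 - pi_hat))
# ...equals the form that involves the fitted values only.
D_fitted = -2 * np.sum(pi_hat * np.log(pi_hat / (1 - pi_hat)) + np.log(1 - pi_hat))

# Pearson statistic: exactly n for the intercept-only fit.
X2 = np.sum((y - pi_hat) ** 2 / (pi_hat * (1 - pi_hat)))
```

Neither statistic tells us anything about fit here: $D$ depends only on the fitted values, and $X^2$ reproduces the sample size.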
Residual Sum of Squares Test
[14] proposed a method which uses the unweighted residual sum of squares as a goodness-of-fit test to assess model adequacy. The idea of this approach is to keep all the individual values of $m_i$ but to give less weight to cases where the $m_i$ are small. The unweighted residual sum of squares statistic considers only the numerator of the Pearson chi-square statistic, with the summation again over the individual observations; the statistic can be written

$$\mathrm{RSS} = \sum_{i=1}^{n} (y_i - \hat{\pi}_i)^2.$$

Of course, the relative weighting for varying $m_i$ is not relevant for our case where $m_i = 1$. [11] discussed how to compute the moments and asymptotic distribution of the RSS statistic. They give useful expressions for the mean and variance which are easier to compute than the expressions given by [14]. The proposed asymptotic mean and variance of RSS are, respectively,

$$E(\mathrm{RSS}) = \sum_{i=1}^{n} \hat{\pi}_i (1 - \hat{\pi}_i) \quad \text{and} \quad \mathrm{var}(\mathrm{RSS}) = d' \left( V - V X (X' V X)^{-1} X' V \right) d,$$

where $V = \mathrm{diag}\{\hat{\pi}_i (1 - \hat{\pi}_i)\}$, $X$ is the $n \times p$ design matrix, and $d$ is the vector with elements $d_i = 1 - 2 \hat{\pi}_i$. The standardized statistic is used to assess significance by referring the following to the standard normal:

$$z = \frac{\mathrm{RSS} - E(\mathrm{RSS})}{\sqrt{\mathrm{var}(\mathrm{RSS})}}.$$
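A sketch of the standardized RSS test built from these expressions follows; for brevity the true-model quantities stand in for the fitted ones, and the variable names are our own.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + one covariate
beta = np.array([0.5, 1.0])
pi = 1 / (1 + np.exp(-X @ beta))
y = rng.binomial(1, pi).astype(float)

rss = np.sum((y - pi) ** 2)        # unweighted residual sum of squares
v = pi * (1 - pi)
mean_rss = v.sum()                 # asymptotic mean of RSS
d = 1 - 2 * pi                     # elements of the vector d

# var(RSS) = d' (V - V X (X'VX)^{-1} X'V) d  with  V = diag(v)
V = np.diag(v)
M = V - V @ X @ np.linalg.solve(X.T @ V @ X, X.T @ V)
var_rss = d @ M @ d

z = (rss - mean_rss) / np.sqrt(var_rss)  # refer to the standard normal
```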
$R^2$ Test
Several $R^2$-type statistics have been used for goodness-of-fit in logistic regression, such as that proposed by [15]:

$$R^2 = 1 - \frac{\ell(\hat{\beta})}{\ell(\hat{\beta}_0)},$$

where $\ell(\hat{\beta})$ represents the log-likelihood evaluated at the ML parameter estimates and $\ell(\hat{\beta}_0)$ represents the log-likelihood of the model containing only an intercept. Another version due to [16] is

$$R^2_{SS} = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{\pi}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2},$$

where $\bar{y} = \sum_{i=1}^{n} y_i / n$.
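Both versions can be computed directly from log-likelihoods and residuals. In this minimal sketch the true probabilities again stand in for the ML fit, and the null model is the intercept-only fit $\hat{\pi}_i = \bar{y}$.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 300
x = rng.normal(size=n)
pi_hat = 1 / (1 + np.exp(-(0.2 + 1.5 * x)))  # stand-in for fitted probabilities
y = rng.binomial(1, pi_hat).astype(float)

# log-likelihood at the fit and at the intercept-only null model
ll = np.sum(y * np.log(pi_hat) + (1 - y) * np.log(1 - pi_hat))
p0 = y.mean()
ll0 = np.sum(y * np.log(p0) + (1 - y) * np.log(1 - p0))

r2_ll = 1 - ll / ll0                                           # likelihood-based R^2
r2_ss = 1 - np.sum((y - pi_hat) ** 2) / np.sum((y - p0) ** 2)  # sum-of-squares R^2
```

Both quantities are bounded above by 1 and approach it as the fit improves over the intercept-only model.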
Information Matrix Tests: IM and IMDIAG
The Information Matrix (IM) test is a test for general mis-specification, proposed by [17]. The two well-known expressions for the information matrix coincide only if the correct model has been specified, and the IM test takes advantage of this fact. The IM test avoids the grouping necessary for tests like the Hosmer-Lemeshow tests. Many researchers [18] [19] [20] [21] investigated the behaviour of the asymptotic distribution of the IM statistic and its dispersion matrix. [22] discussed the information matrix test and showed that it is useful with binary data models. [8] claimed that the IM test has reasonable power compared with other tests, but gave no information about the behaviour of its asymptotic distribution. The idea of the information matrix test is to compare

$$-E \left( \frac{\partial^2 \ell}{\partial \beta \, \partial \beta'} \right) \quad \text{and} \quad E \left( \frac{\partial \ell}{\partial \beta} \frac{\partial \ell}{\partial \beta'} \right),$$

as these differ when the model is mis-specified but not when the model is correct.
Consider binary regression, where the outcome for individual $i$, $i = 1, \cdots, n$, is a random variable $Y_i \in \{0, 1\}$. Also $P(Y_i = 1) = \pi_i = \pi(x_i' \beta)$, where $x_i$ is a $p$-dimensional vector of covariates and $\beta$ is a $p$-dimensional vector of parameters. It will be convenient to write $\eta_i = x_i' \beta$ and

$$\ell_i = y_i \log \pi_i + (1 - y_i) \log (1 - \pi_i)$$

to be the contribution to the log-likelihood $\ell = \sum_{i=1}^{n} \ell_i$ from unit $i$.
For the logistic model we have $\partial \ell_i / \partial \beta = (y_i - \pi_i) x_i$. The $p$-dimensional likelihood equations $\partial \ell / \partial \beta = 0$ can be written:

$$\sum_{i=1}^{n} (y_i - \pi_i) x_i = 0. \qquad (1)$$

We can also derive the $p \times p$ matrix $\partial^2 \ell / \partial \beta \, \partial \beta'$ as:

$$\frac{\partial^2 \ell}{\partial \beta \, \partial \beta'} = -\sum_{i=1}^{n} \pi_i (1 - \pi_i) x_i x_i'. \qquad (2)$$

The idea behind the information matrix test is that if the model is correctly specified then the quantity

$$IM = \sum_{i=1}^{n} \left[ \frac{\partial \ell_i}{\partial \beta} \frac{\partial \ell_i}{\partial \beta'} + \frac{\partial^2 \ell_i}{\partial \beta \, \partial \beta'} \right]$$

has zero mean. By comparing (1) and (2) we can compute this quantity, for a general value of $\beta$, as the sum of:

$$A_i = \left[ (y_i - \pi_i)^2 - \pi_i (1 - \pi_i) \right] x_i x_i'. \qquad (3)$$
We can test the null hypothesis that $IM$ has zero mean by computing the variance of $IM$ and then constructing a standard $\chi^2$ statistic. The first step is to compute the variance of $A = \sum_{i=1}^{n} A_i$, where we write $A_i$ for essentially the right hand side of (3). We change the $p \times p$ symmetric matrix $A_i$ into a vector $\mathrm{vech}(A_i)$ in order to be able to use standard methods. As $A_i$ is symmetric we do not wish to duplicate entries, so $\mathrm{vech}(A_i)$ is the $\frac{1}{2} p (p+1)$-dimensional vector:

$$\mathrm{vech}(A_i) = (a_{11}, a_{21}, a_{22}, a_{31}, \cdots, a_{pp})',$$

where $a_{jk}$ is the $(j, k)$ element of $A_i$. If we write

$$A = \sum_{i=1}^{n} \mathrm{vech}(A_i),$$

then because the different terms are independent we obtain:

$$\mathrm{var}(A) = \sum_{i=1}^{n} \mathrm{var} \left( \mathrm{vech}(A_i) \right) = V_A,$$

which is a $P \times P$ dimensional matrix where $P = \frac{1}{2} p (p+1)$.
We should also note that if $B$ is defined as essentially the derivative of the log-likelihood (the score), i.e.

$$B = \sum_{i=1}^{n} (y_i - \pi_i) x_i,$$

then the variance of $B$ is the $p \times p$ matrix $V_B$:

$$V_B = \sum_{i=1}^{n} \pi_i (1 - \pi_i) x_i x_i'.$$
Before computing the covariance of $A$ and $B$, we note that, for binary $y_i$,

$$(y_i - \pi_i)^2 - \pi_i (1 - \pi_i) = (y_i - \pi_i)(1 - 2 \pi_i).$$

Now,

$$\mathrm{cov}(A_i, B_i) = E \left[ (y_i - \pi_i)^3 \right] \mathrm{vech}(x_i x_i') \, x_i' - \pi_i (1 - \pi_i) \, E (y_i - \pi_i) \, \mathrm{vech}(x_i x_i') \, x_i'.$$

For independently and identically distributed random variables, and under $H_0$, the second term of the expression is zero, and since $E (y_i - \pi_i)^3 = \pi_i (1 - \pi_i)(1 - 2 \pi_i)$ the covariance of $A$ and $B$ in this case is the $P \times p$ matrix

$$V_{AB} = \sum_{i=1}^{n} \pi_i (1 - \pi_i)(1 - 2 \pi_i) \, \mathrm{vech}(x_i x_i') \, x_i'.$$

Central limit arguments suggest that asymptotically $(A', B')'$ is a $(P + p)$-dimensional normal variable. However, the IM test requires $A$ to be evaluated at the MLE, $\hat{\beta}$, say, and at this value we know that $B = 0$. Consequently the variance of $A(\hat{\beta})$ is the variance of $A$ conditional on $B = 0$, which is

$$V = V_A - V_{AB} V_B^{-1} V_{AB}'.$$
Assuming a logistic regression we have

$$E (y_i - \pi_i)^2 = \pi_i (1 - \pi_i) \quad \text{and} \quad E \left[ (y_i - \pi_i)^2 - \pi_i (1 - \pi_i) \right]^2 = \pi_i (1 - \pi_i)(1 - 2 \pi_i)^2,$$

so we can evaluate the dispersion matrices at the MLEs as:

$$\hat{V}_A = \sum_{i=1}^{n} \hat{\pi}_i (1 - \hat{\pi}_i)(1 - 2 \hat{\pi}_i)^2 \, \mathrm{vech}(x_i x_i') \, \mathrm{vech}(x_i x_i')', \qquad \hat{V}_B = \sum_{i=1}^{n} \hat{\pi}_i (1 - \hat{\pi}_i) x_i x_i',$$

$$\hat{V}_{AB} = \sum_{i=1}^{n} \hat{\pi}_i (1 - \hat{\pi}_i)(1 - 2 \hat{\pi}_i) \, \mathrm{vech}(x_i x_i') \, x_i'.$$

If we write $\hat{V} = \hat{V}_A - \hat{V}_{AB} \hat{V}_B^{-1} \hat{V}_{AB}'$, then one version of the IM test is found by referring

$$IM = A' \hat{V}^{-} A$$

to a $\chi^2$ variable with degrees of freedom equal to the rank of $\hat{V}$.
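The whole construction can be sketched in a few lines. This is our own illustration: for brevity the dispersion matrices are evaluated at the true $\beta$ rather than the MLE, and a pseudo-inverse is used for the possibly rank-deficient $V$.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta = np.array([0.3, 0.8])
pi = 1 / (1 + np.exp(-X @ beta))
y = rng.binomial(1, pi).astype(float)

p = X.shape[1]
iu = np.triu_indices(p)  # vech: keep each distinct entry of a symmetric matrix once

v = pi * (1 - pi)
c = 1 - 2 * pi           # for binary y, (y - pi)^2 - v = (y - pi) * c

# vech(x_i x_i') for every unit, stacked as rows of an (n, P) array
outer_v = np.array([np.outer(X[i], X[i])[iu] for i in range(n)])

A = ((y - pi) * c) @ outer_v   # sum_i [(y_i - pi_i)^2 - v_i] vech(x_i x_i')
B = (y - pi) @ X               # score, zero at the MLE

# dispersion matrices V_A, V_AB, V_B from the moment formulas above
VA = (outer_v * (v * c ** 2)[:, None]).T @ outer_v
VAB = (outer_v * (v * c)[:, None]).T @ X
VB = (X * v[:, None]).T @ X

# variance of A conditional on B = 0, then the chi-square statistic
V = VA - VAB @ np.linalg.solve(VB, VAB.T)
IM_stat = A @ np.linalg.pinv(V) @ A
df = np.linalg.matrix_rank(V)
```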
The ideas of the IMDIAG test and the IM test are the same; the only difference is that for the former the elements of the vector are just the diagonal elements of $A_i$, so $\mathrm{diag}(A_i)$ is the $p$-dimensional vector:

$$\mathrm{diag}(A_i) = (a_{11}, a_{22}, \cdots, a_{pp})'.$$

To explain the difference in the size of the vector in the two cases of the IM test and the IMDIAG test, let us consider a simple example. Suppose we have a symmetric matrix with elements $a_{jk}$ and $3 \times 3$ dimension:

$$A = \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{pmatrix},$$

where $a_{jk} = a_{kj}$. Then in the case of the IM test, the dimension of the vector $\mathrm{vech}(A)$ is $6$ and its elements are:

$$\mathrm{vech}(A) = (a_{11}, a_{21}, a_{22}, a_{31}, a_{32}, a_{33})',$$

whereas in the case of the IMDIAG test, $\mathrm{diag}(A)$ is the $3$-dimensional vector:

$$\mathrm{diag}(A) = (a_{11}, a_{22}, a_{33})'.$$
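The dimension difference is easy to check numerically (illustration only):

```python
import numpy as np

p = 3
A = np.arange(1.0, 10.0).reshape(p, p)
A = (A + A.T) / 2                   # make the matrix symmetric
vech_A = A[np.triu_indices(p)]      # IM test: one copy of each distinct entry
diag_A = np.diag(A)                 # IMDIAG test: diagonal entries only
```

Here `vech_A` has $p(p+1)/2 = 6$ elements, while `diag_A` has $p = 3$.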
4. Simulation Study
Our work focuses on the behaviour of goodness-of-fit tests under alternative hypotheses, in the case of a missing-covariate model and in the case of the wrong model, because in these cases we could not reproduce Kuss's work. We will focus on four goodness-of-fit tests: $\hat{C}$, IM, IMDIAG and RSS. Therefore, we examine in more depth the behaviour of the tests and determine more information about the asymptotic MLE distribution in the case of the wrong model (a wrong functional form of the covariate) and in the case of the missing covariate, where the true model contains a second covariate $U$, with $X$ and $U$ independent.
The simulation study is designed to follow Kuss's work:
・ The sample sizes are n = 100 and n = 500.
・ The tests are applied only under extreme sparseness, when $m_i = 1$.
・ The number of simulations is 1000.
・ The distribution of the predictor variables $X$, $U$, with $X$ and $U$ independent, is chosen to conform with Kuss's work.
・ Four goodness-of-fit tests are used in the simulation study under three different alternative hypotheses:
(a) True covariate.
(b) Missing covariate.
(c) Wrong functional form of the covariate.
・ The fitted model in all cases is a standard logistic model with an intercept and one covariate.
・ All tests are of the null hypothesis that the fitted model is correctly specified.
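One arm of this design can be sketched as follows, using Newton's method to fit the standard logistic model and (as one of the four statistics) the unweighted RSS. The covariate distributions, effect sizes and replicate count here are illustrative placeholders, not Kuss's exact settings.

```python
import numpy as np

rng = np.random.default_rng(5)

def fit_logit(X, y, iters=25):
    """ML fit of a logistic regression model by Newton-Raphson."""
    b = np.zeros(X.shape[1])
    for _ in range(iters):
        pi = 1 / (1 + np.exp(-X @ b))
        W = pi * (1 - pi)
        b += np.linalg.solve((X * W[:, None]).T @ X, X.T @ (y - pi))
    return b

def one_replicate(n, missing_covariate=False):
    x = rng.normal(size=n)
    u = rng.normal(size=n) if missing_covariate else 0.0  # independent of x
    pi = 1 / (1 + np.exp(-(0.5 + x + u)))                 # true model
    y = rng.binomial(1, pi).astype(float)
    X = np.column_stack([np.ones(n), x])                  # fitted model omits u
    pi_hat = 1 / (1 + np.exp(-X @ fit_logit(X, y)))
    return np.sum((y - pi_hat) ** 2)                      # RSS statistic

# a subset of the N = 1000 replicates, under the missing-covariate alternative
stats = np.array([one_replicate(200, missing_covariate=True) for _ in range(100)])
mean_stat, var_stat = stats.mean(), stats.var()
```

The empirical mean and variance of the recorded statistics are then compared with the moments of the reference distribution, and the empirical power is the rejection rate.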
Results and Discussion of Tests under Correct Model
Table 1 reports the mean, the variance and the empirical power of four goodness-of-fit tests from the simulation study under the correct model. The statistics used in the simulation as goodness-of-fit tests are: Hosmer-Lemeshow $\hat{C}$, Information Matrix (IM), Information Matrix Diagonal (IMDIAG) and residual sum of squares (RSS). The asymptotic distribution of the statistics is the $\chi^2$ distribution, for which the mean and variance equal df and 2df respectively. In the case of the $\hat{C}$ statistic we chose the number of groups to be g = 10, so the degrees of freedom are $g - 2 = 8$. In the results shown in Table 1, the mean and variance of all statistics appear close to df and 2df. Moreover, the simulation study gave reasonable results when the model was fitted with sample size n = 500. However, there is a slightly large variance of $\hat{C}$ in the case of sample size n = 100. Overall, the empirical power and type I error look good.
In the second case, the results report the mean, the variance and the power to detect a mis-specified model for the same goodness-of-fit tests under the missing-covariate model, where the true model contains both $X$ and $U$ and a standard logistic regression model with $X$ alone is fitted.
Table 2 shows the results from the simulation study under the missing-covariate alternative hypothesis. The mean and variance of all statistics are close to df and 2df, but we have a slightly smaller variance in the case of $\hat{C}$. However, we have low power when using the IM statistic in the case of sample size n = 500, the IMDIAG statistic and RSS in the case of sample size n = 100, and the $\hat{C}$ statistic for both sample sizes.
In the final case we show the results of the power to detect a mis-specified model for the four goodness-of-fit tests under the wrong functional form of the covariate, fitting the model as in the previous cases.
Table 1. Results of N = 1000 simulation with sample size n = 100 and n = 500 under correct model.
Table 2. Results of N = 1000 simulation with sample size n = 100 and n = 500 under missing covariate model.
Table 3 reports the results for the goodness-of-fit tests from the simulation study under the wrong model. The means and variances of all statistics appear very large for both sample sizes compared with the degrees of freedom of the statistics. However, high power was found for all goodness-of-fit tests at both sample sizes, meaning that these tests rejected the null hypothesis in essentially every case. On the other hand, Kuss's results showed low power in the case of sample size n = 100 compared with our results.
In Figure 1, we plot $\pi$ against $x$ and show the true model (continuous line). The putative approximations obtained from fitting the standard logistic model are shown as the broken (dot-dash and dash-dot) lines.
Table 3. Results of N = 1000 simulation with sample size n = 100 and n = 500 under wrong model.
Figure 1. Plots of the different logistic models for $\pi$ given $x$.
5. Conclusion and Further Work
The work considered in this paper centred on the asymptotic distribution of goodness-of-fit tests in the logistic regression model. We also considered a comparison between some global goodness-of-fit tests, whose results were compared with Kuss's. The simulation was applied to two types of goodness-of-fit tests: those based on a test which groups the observations and those which do not group the observations. Our results confirm Kuss's work regarding
the power of the goodness-of-fit tests, namely the RSS, Hosmer-Lemeshow, IM and IMDIAG tests, under the correct and missing-covariate models. However, our results on the asymptotic distribution of the goodness-of-fit tests show various combinations of behaviour in the mean and variance of the statistics, whose asymptotic distribution is the chi-square distribution. The results under the correct model show reasonable power for all methods, with a slightly larger variance found in the case of the Hosmer-Lemeshow test, and a smaller variance under the missing-covariate model. As we know, the goodness-of-fit statistics are distributed asymptotically as a central $\chi^2$ distribution under $H_0$ when the model is correctly specified, and as a non-central $\chi^2$ under $H_1$ when the model is mis-specified. However, under the wrong model the results show strange behaviour: the means and variances fail to satisfy the assumption of an asymptotic $\chi^2$ distribution with mean df and variance 2df, while high power also appears. This means that in some circumstances the properties of the distribution of the test statistics (e.g. mean and variance) are far from the properties of the $\chi^2$ distribution. In fact, the interesting point here is that some goodness-of-fit tests seem affected by the assumption on the covariance matrix. So, many issues about the mean and variance of the asymptotic distribution of goodness-of-fit statistics should be examined further.