A New Approach for Dispersion Parameters

This paper presents a new approach for identifying and estimating the dispersion parameters of bivariate, trivariate and multivariate correlated binary data, in both scalar and matrix form. To this end, we review some recent studies on the impact of over-dispersion in univariate data analysis and compare the new approach with them. Following the property of McCullagh and Nelder [1] for identifying the dispersion parameter in the univariate case, we extend this property to the analysis of correlated binary data in higher dimensions. Finally, we use these estimates to modify the correlated binary data so as to reduce their over-dispersion, using the Hunua Ranges data as an ecological example.


Introduction
The dispersion parameter should equal unity for univariate Bernoulli data, but it may deviate from unity when a study contains a sequence of Bernoulli outcomes that leads to a binomial variable. Over-dispersion occurs when the variance of the actual response exceeds the nominal variance, expressed as a function of the mean, $\mu$. In the univariate case, the dispersion parameter is easily estimated from Pearson's chi-square statistic or from the deviance function. Many studies have addressed over-dispersion in the univariate case, namely for binomial data. It is difficult to extend these methods to the bivariate case, because the correlated response variables may be associated, and this association must be taken into account when estimating the dispersion parameter. Under independence, however, the dispersion parameter is estimated as in the univariate case. For bivariate correlated binary data, the dispersion parameters can be estimated in two ways: as a scalar, or as a matrix of dispersion parameters. These estimates extend to trivariate and multivariate correlated binary data. We therefore present a new approach for identifying and estimating the dispersion parameters, in scalar and matrix form, for bivariate, trivariate and multivariate correlated binary data. Once these estimates are obtained, the correlated binary data can be modified so that the dispersion parameter is equal or close to unity.
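As a point of reference for the univariate case mentioned above, the Pearson-based estimate can be sketched in a few lines. This is a minimal illustration, assuming grouped binomial data with fitted probabilities already available; the function name `pearson_dispersion` and its arguments are hypothetical and not taken from the paper.

```python
import numpy as np

def pearson_dispersion(y, n, pi_hat, n_params):
    """Pearson chi-square divided by its residual degrees of freedom.

    y        : observed success counts
    n        : number of trials for each observation
    pi_hat   : fitted success probabilities (assumed to come from some model)
    n_params : number of fitted regression parameters, p
    """
    y = np.asarray(y, dtype=float)
    n = np.asarray(n, dtype=float)
    pi_hat = np.asarray(pi_hat, dtype=float)
    nominal_var = n * pi_hat * (1.0 - pi_hat)       # binomial variance
    x2 = np.sum((y - n * pi_hat) ** 2 / nominal_var)  # Pearson X^2
    return x2 / (len(y) - n_params)

# Toy data, invented for illustration only.
phi_hat = pearson_dispersion(y=[3, 7], n=[10, 10], pi_hat=[0.5, 0.5], n_params=1)
```

A value well above one suggests over-dispersion relative to the binomial variance; a value near one suggests the nominal variance is adequate.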
This paper is organized as follows. Section 2 presents some previous studies.
Sections 3, 4 and 5 present the proposed approach for identifying and estimating the dispersion parameters, in scalar and matrix form, and the impact of over-dispersion for bivariate, trivariate and multivariate binary outcomes associated with covariates, respectively.
Finally, Section 6 gives numerical examples for the vector generalized additive model (VGAM), or vector generalized linear model (VGLM), of Yee and Wild [2], and for the alternative quadratic exponential form (AQEF) measure of El-Sayed et al. [3].

Previous Studies
In this section, we present some studies on the over-dispersion problem.
(1) Smith and Heitjan [4] provided a statistical tool for detecting extra-binomial variation (over-dispersion). To test the nominal dispersion in the $i$-th margin, the nominal (binomial) variance is

$$\operatorname{Var}(Y_i) = m_i \pi_i (1 - \pi_i). \qquad (1)$$

The hypothesis-testing problem takes the nominal dispersion as the null hypothesis, $H_0$. An appropriate procedure for testing $H_0$ is the score statistic (2) suggested by Smith and Heitjan, in which $J_i$ is a random vector that registers the difference between the actual information and the nominal information in the $i$-th margin with respect to every $j$-th ($j = 1, 2, \ldots, p$) parameter, and $A_i$ is the covariance matrix of $J_i$ corrected for estimation of the linear predictors, $\theta_i$, where $\theta_i = \log\{\pi_i / (1 - \pi_i)\}$. Under the null hypothesis, $H_0$, the asymptotic distribution of statistic (2) is the $\chi^2$ distribution with $p$ degrees of freedom. Rejection of $H_0$ is clear evidence that the actual variance departs from the nominal variance.
(2) Cook and Ng [5] described a bivariate logistic-normal mixture model for over-dispersed two-state Markov processes. Using these mixed models increases the standard errors of the marginal probability estimates. They did not give an explicit form for the over-dispersion estimate, but displayed the log-likelihood function for the full sample of $m$ subjects, in which the expectation, $E_{\alpha_i}$, is taken with respect to the bivariate normal distribution and the coefficients of the model are regression parameters.
(3) Saefuddin et al. [6] showed the effect of over-dispersion on hypothesis tests in logistic regression. A simple method proposed by Williams [7] was used to correct for over-dispersion by taking an inflation factor into account; the method adjusts the estimated standard errors of the parameters for the over-dispersion. Over-dispersion is often modeled through the variance of the response variable, $Y_i$, which in the binomial case with $n_i$ trials is

$$\operatorname{Var}(Y_i) = n_i \pi_i (1 - \pi_i)\,[1 + (n_i - 1)\phi],$$

where $1 + (n_i - 1)\phi$ is the over-dispersion scale and $\phi$ denotes the inflation factor. When there is no over-dispersion, or very little, $\phi$ is approximately zero, so $Y_i$ follows the binomial distribution, $\mathrm{Bin}(n_i, \pi_i)$, and $\operatorname{Var}(Y_i) = n_i \pi_i (1 - \pi_i)$ [8]. When over-dispersion exists, however, $\phi$ exceeds zero and $\operatorname{Var}(Y_i)$ is greater than $n_i \pi_i (1 - \pi_i)$. The parameter $\phi$ is estimated by equating the Pearson $X^2$ statistic of the model to its approximate expected value, written as

$$E(X^2) \approx \sum_{i} w_i (1 - w_i v_i d_i)\,[1 + \phi (n_i - 1)],$$

where $v_i = n_i \pi_i (1 - \pi_i)$, $w_i$ is the weight and $d_i$ is the diagonal element of the variance-covariance matrix of the linear predictor. The value of the $X^2$ statistic depends on $\phi$, so an iterative process is needed to find the optimum value. This procedure was first introduced by Williams [7] and is known as the Williams method.
The algorithm of the Williams method is as follows:
1. Assume $\phi = 0$ and estimate the logistic regression parameter, $\beta$, by maximum likelihood. Calculate the $X^2$ statistic of the fitted model.
2. If the $X^2$ statistic is too large relative to its degrees of freedom, $n - p$, conclude that $\phi > 0$ and calculate the initial estimate of $\phi$ from

$$\hat\phi_0 = \frac{X^2 - (n - p)}{\sum_i (n_i - 1)(1 - v_i d_i)}.$$

3. Using the initial weights $w_{i0} = [1 + (n_i - 1)\hat\phi_0]^{-1}$, recalculate $\hat\beta$ and the $X^2$ statistic.
4. If the $X^2$ statistic is close to its degrees of freedom, $n - p$, the estimated value of $\phi$ is sufficient. If not, re-estimate $\phi$ using

$$\hat\phi = \frac{X^2 - \sum_i w_i (1 - w_i v_i d_i)}{\sum_i w_i (n_i - 1)(1 - w_i v_i d_i)}.$$

5. If the $X^2$ statistic remains large, return to step 3 until the optimum value of the estimate of $\phi$ is obtained.
Once $\phi$ has been estimated by $\hat\phi$, the quantities $w_i = [1 + (n_i - 1)\hat\phi]^{-1}$ can be used as weights in fitting the new model, Collett [8] and Williams [7]. We conclude that ignoring over-dispersion leads to standard errors of the parameter estimates that are too low.
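The iteration described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes a logit link fitted by iteratively reweighted least squares, uses the variance model $\operatorname{Var}(Y_i) = n_i \pi_i (1 - \pi_i)[1 + (n_i - 1)\phi]$, and the helper names `fit_weighted_logit` and `williams_phi` are hypothetical.

```python
import numpy as np

def fit_weighted_logit(X, y, n, w, n_iter=25):
    """Binomial logit fit by IRLS with prior weights w (a sketch)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta
        pi = 1.0 / (1.0 + np.exp(-eta))
        v = n * pi * (1.0 - pi)              # nominal binomial variance
        z = eta + (y - n * pi) / v           # working response
        xtw = X.T * (w * v)                  # X^T diag(w * v)
        beta = np.linalg.solve(xtw @ X, xtw @ z)
    pi = 1.0 / (1.0 + np.exp(-(X @ beta)))
    v = n * pi * (1.0 - pi)
    cov = np.linalg.inv((X.T * (w * v)) @ X)
    d = np.einsum("ij,jk,ik->i", X, cov, X)  # variance of each linear predictor
    x2 = np.sum(w * (y - n * pi) ** 2 / v)   # weighted Pearson X^2
    return beta, v, d, x2

def williams_phi(X, y, n, max_iter=50, tol=1e-6):
    """Iterate phi until X^2 is close to its degrees of freedom, N - p."""
    n_obs, p = X.shape
    df = n_obs - p
    beta, v, d, x2 = fit_weighted_logit(X, y, n, np.ones(n_obs))  # step 1
    if x2 <= df:                              # no evidence of over-dispersion
        return 0.0, beta, x2
    phi = (x2 - df) / np.sum((n - 1) * (1 - v * d))               # step 2
    for _ in range(max_iter):
        w = 1.0 / (1.0 + phi * (n - 1))                           # step 3
        beta, v, d, x2 = fit_weighted_logit(X, y, n, w)
        if abs(x2 - df) < tol * df:                               # step 4
            break
        a = w * (1 - w * v * d)
        phi = (x2 - np.sum(a)) / np.sum(a * (n - 1))              # step 5
    return phi, beta, x2
```

The function returns the final $\hat\phi$, the coefficient estimates from the weighted fit, and the last $X^2$ value, so a caller can check how close $X^2$ is to $n - p$ before using the weights $[1 + (n_i - 1)\hat\phi]^{-1}$ in a final model.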
(4) Davila et al. [9] introduced a new approach for modeling multivariate data with over-dispersed binomial marginals. They illustrate this approach by analyzing data using the Gaussian copula with Beta-binomial margins. To model the over-dispersion, they used the Beta-binomial model, a generalization of the binomial distribution, Casella and Berger [10]. In this model, it is supposed that $Y_i \mid P_i \sim \mathrm{Bin}(m_i, P_i)$, whereas $P_i \sim \mathrm{Beta}(\alpha_i, \beta_i)$. They then assume that each margin, $Y_i$, follows a Beta-binomial distribution. Therefore, unconditionally, the compound density of $Y_i$ with respect to the counting measure is given by

$$f(y_i; \alpha_i, \beta_i) = \binom{m_i}{y_i} \frac{B(\alpha_i + y_i,\ \beta_i + m_i - y_i)}{B(\alpha_i, \beta_i)}, \qquad y_i = 0, 1, \ldots, m_i,$$

where $\alpha_i > 0$ and $\beta_i > 0$. Conditional on $P_i$, the expectation is $E(Y_i \mid P_i) = m_i P_i$ and the conditional variance is $\operatorname{Var}(Y_i \mid P_i) = m_i P_i (1 - P_i)$. Unconditionally, with $\pi_i = \alpha_i / (\alpha_i + \beta_i)$, the variance is

$$\operatorname{Var}(Y_i) = m_i \pi_i (1 - \pi_i) \left[ 1 + \frac{m_i - 1}{\alpha_i + \beta_i + 1} \right]. \qquad (12)$$

From relation (12), we see that the marginal dispersion parameter is $1 + (m_i - 1)/(\alpha_i + \beta_i + 1)$.
Comparing relation (1) with relation (12), we note that the latter has the greater variance. In their study, compared with the multivariate normal (MVN) model, the marginal GLM and the marginal over-dispersion model (ODM), the model based on the Beta-binomial model (BBM) displayed the highest standard errors of the estimated parameters.
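The comparison between the binomial and Beta-binomial variances can be checked numerically. The sketch below computes the Beta-binomial variance directly from its probability mass function and divides it by the binomial variance $m\pi(1-\pi)$ with $\pi = \alpha/(\alpha+\beta)$; the result should agree with the standard closed form $1 + (m-1)/(\alpha+\beta+1)$. The function names and the parameter values are chosen for illustration only.

```python
from math import exp, lgamma

def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def beta_binomial_pmf(y, m, alpha, beta):
    """P(Y = y) for a Beta-binomial variable with m trials."""
    log_choose = lgamma(m + 1) - lgamma(y + 1) - lgamma(m - y + 1)
    return exp(log_choose + log_beta(y + alpha, m - y + beta) - log_beta(alpha, beta))

def variance_inflation(m, alpha, beta):
    """Ratio of the Beta-binomial variance to the binomial variance."""
    pmf = [beta_binomial_pmf(y, m, alpha, beta) for y in range(m + 1)]
    mean = sum(y * p for y, p in enumerate(pmf))
    var = sum((y - mean) ** 2 * p for y, p in enumerate(pmf))
    pi = alpha / (alpha + beta)
    return var / (m * pi * (1.0 - pi))

# Illustrative values: m = 10 trials, alpha = 2, beta = 3.
ratio = variance_inflation(m=10, alpha=2.0, beta=3.0)
closed_form = 1.0 + (10 - 1) / (2.0 + 3.0 + 1.0)
```

For any $m > 1$ the ratio exceeds one, which is the sense in which the Beta-binomial margin carries extra-binomial variation.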
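(5) The vector generalized additive model (VGAM) was introduced by Yee and Wild [2] and implemented by Yee [11] [12]. The conditional distribution of the VGAM family function for bivariate correlated binary responses, $(Y_1, Y_2)$, given some covariates, $x$, is

$$\log P(Y_1 = y_1, Y_2 = y_2 \mid x) = u_0(x) + u_1(x)\, y_1 + u_2(x)\, y_2 + u_{12}(x)\, y_1 y_2,$$

where $(\eta_1, \eta_2, \eta_3) = (u_1, u_2, u_{12})$ and the $\eta_j$, $j = 1, 2, 3$, are additive predictors. If all the functions are constrained to be linear, the resulting model is a vector generalized linear model (VGLM).
The conditional distribution of the VGAM family function for trivariate binary responses, $(Y_1, Y_2, Y_3)$, given some covariates, $x$, is

$$\log P(Y_1 = y_1, Y_2 = y_2, Y_3 = y_3 \mid x) = u_0(x) + u_1(x)\, y_1 + u_2(x)\, y_2 + u_3(x)\, y_3 + u_{12}(x)\, y_1 y_2 + u_{13}(x)\, y_1 y_3 + u_{23}(x)\, y_2 y_3.$$

Note that the third-order association parameter, $u_{123}$, for the product $y_1 y_2 y_3$ is assumed to be zero for this family, Yee and Wild [2].
The conditional distribution of the VGAM (VGLM) function for multivariate correlated binary responses, $(Y_1, \ldots, Y_k)$, given some covariates, $x$, is

$$\log P(Y_1 = y_1, \ldots, Y_k = y_k \mid x) = u_0(x) + \sum_{j} u_j(x)\, y_j + \sum_{j < l} u_{jl}(x)\, y_j y_l,$$

where $u_0(x)$ is the normalizing constant.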
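To make the role of the normalizing constant and of the association parameter concrete, the bivariate log-linear form $\log P(y_1, y_2) = u_0 + u_1 y_1 + u_2 y_2 + u_{12} y_1 y_2$ can be evaluated directly for fixed parameter values. The sketch below treats $u_1$, $u_2$ and $u_{12}$ as plain numbers (in the model they would be functions of $x$); the function name `loglinear_joint` and the numeric values are hypothetical.

```python
from itertools import product
from math import exp

def loglinear_joint(u1, u2, u12):
    """Joint probabilities of (Y1, Y2) under the bivariate log-linear model."""
    weights = {
        (y1, y2): exp(u1 * y1 + u2 * y2 + u12 * y1 * y2)
        for y1, y2 in product((0, 1), repeat=2)
    }
    total = sum(weights.values())  # equals exp(-u0), the normalizing constant
    return {cell: w / total for cell, w in weights.items()}

# Illustrative parameter values only.
probs = loglinear_joint(u1=0.2, u2=-0.5, u12=1.0)
odds_ratio = probs[(1, 1)] * probs[(0, 0)] / (probs[(1, 0)] * probs[(0, 1)])
```

Under this parameterization the odds ratio between $Y_1$ and $Y_2$ equals $\exp(u_{12})$, so $u_{12} = 0$ corresponds to independence of the two responses.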
In the next section, we suggest a new approach for estimating the dispersion parameter, $\phi$, in scalar and matrix form, and indicate how the dispersion parameter may influence the analysis of correlated binary data, especially the standard errors, the Wald statistics and the LRTs, for bivariate, trivariate and multivariate binary outcome variables associated with covariates. To fit the correlated binary data, we use the log-likelihood function of the alternative quadratic exponential form (AQEF) measure [3] in the bivariate, trivariate and multivariate cases, respectively. Link functions relate the parameters of the measure to the covariates, which enables us to use the regression model; with this notation, the log-likelihood function is written first for the bivariate AQEF measure, then for the trivariate AQEF measure, and finally for the multivariate AQEF measure.

Dispersion Parameters in Bivariate Case
In this section, we identify and estimate a fixed (scalar) value of the dispersion parameter, $\phi$, and also a matrix of dispersion parameters, in order to assess the effect of over-dispersion on the analysis of bivariate correlated binary data.

Scalar Dispersion Parameter
We can use the variance-covariance matrix of $Y_1$ and $Y_2$ to estimate a scalar dispersion parameter, $\phi$, for bivariate binary outcomes. We define the response vector, $Y = (Y_1, Y_2)^T$, and its mean vector, $\mu = (\mu_1, \mu_2)^T$. Following the GLM property, the variance-covariance matrix of $Y$ is proportional to $\phi$, which gives the estimator of $\phi$ for $n$ observations. The corresponding scaled quadratic form follows the non-central $\chi^2_{n-p}$ distribution. Under independence, this quantity follows, approximately, the $\chi^2_{n-p}$ distribution, and the estimator of $\phi$ in this case is given by (24).
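One way to picture a scalar estimate of this kind is a generalized Pearson statistic: each observation's residual vector is standardized by its nominal covariance matrix, and the sum is divided by $n - p$. The sketch below is an illustration under stated assumptions, not the estimator (24) itself: it assumes that fitted means and fitted joint probabilities $P(Y_1 = 1, Y_2 = 1)$ are available from some model, it takes the divisor $n - p$ from the degrees of freedom quoted in the text, and the function name `scalar_dispersion` is hypothetical.

```python
import numpy as np

def scalar_dispersion(Y, mu, mu11, n_params):
    """Generalized Pearson statistic divided by n - p (assumed divisor).

    Y        : (n, 2) array of binary responses
    mu       : (n, 2) array of fitted marginal means
    mu11     : (n,) array of fitted joint probabilities P(Y1 = 1, Y2 = 1)
    n_params : number of fitted regression parameters, p
    """
    Y = np.asarray(Y, dtype=float)
    mu = np.asarray(mu, dtype=float)
    mu11 = np.asarray(mu11, dtype=float)
    q = 0.0
    for y_i, m_i, m11_i in zip(Y, mu, mu11):
        cov = m11_i - m_i[0] * m_i[1]              # nominal covariance of Y1, Y2
        V = np.array([[m_i[0] * (1 - m_i[0]), cov],
                      [cov, m_i[1] * (1 - m_i[1])]])
        r = y_i - m_i                              # residual vector
        q += r @ np.linalg.solve(V, r)             # r^T V^{-1} r
    return q / (Y.shape[0] - n_params)

# Toy data under independence (mu11 = mu1 * mu2), invented for illustration.
Y = [[1, 0], [0, 1], [1, 1], [0, 0]]
mu = [[0.5, 0.5]] * 4
mu11 = [0.25] * 4
phi_hat = scalar_dispersion(Y, mu, mu11, n_params=1)
```

When the off-diagonal term is zero, the quadratic form reduces to the sum of the two marginal Pearson contributions, which matches the remark that the independent case is handled as in the univariate setting.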

Matrix of Dispersion Parameters
Now we allow different values of the dispersion parameter, namely $\phi_{11}$, $\phi_{22}$ and $\phi_{12}$, which form the matrix of dispersion parameters and its estimator. From Equation (26), the corresponding quadratic form follows the non-central $\chi^2$ distribution, and the estimator of $\phi$ is then the same as (24).
Using the estimates of the dispersion parameters, $\hat\phi_{11}$ and $\hat\phi_{22}$, and Equation (25), we can correct the data for the $i$-th observation in the bivariate case, as in relation (27).

Dispersion Parameters in Trivariate Case
We define the response vector, $Y = (Y_1, Y_2, Y_3)^T$, and its mean vector, $\mu = (\mu_1, \mu_2, \mu_3)^T$.

Scalar Dispersion Parameter
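The variance-covariance matrix of $Y$ can be written in terms of the scalar dispersion parameter, $\phi$, from which the estimator of $\phi$ for $n$ observations is obtained. The corresponding quadratic form follows the non-central $\chi^2_{n-p}$ distribution. Under independence, this quantity follows, approximately, the $\chi^2_{n-p}$ distribution, which gives the estimator of $\phi$ under independence.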

Matrix of Dispersion Parameters
The variance-covariance matrix of $Y$ can be written in terms of the matrix of dispersion parameters, whose estimator leads to a quadratic form that follows the non-central $\chi^2$ distribution. Similarly, using the estimates of the dispersion parameters, $\hat\phi_{11}$, $\hat\phi_{22}$ and $\hat\phi_{33}$, and Equation (31), we can correct the data for the $i$-th observation in the trivariate case.

Dispersion Parameters in Multivariate Case
We define the response vector, $Y = (Y_1, Y_2, \ldots, Y_k)^T$.

Scalar Dispersion Parameter
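The variance-covariance matrix of $Y$ can be written in terms of the scalar dispersion parameter, $\phi$, from which the estimator of $\phi$ for $n$ observations is obtained. The corresponding quadratic form follows the non-central $\chi^2_{n-p}$ distribution. Under independence, this quantity follows, approximately, the $\chi^2_{n-p}$ distribution, which gives the estimator of $\phi$ under independence.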

Matrix of Dispersion Parameters
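The variance-covariance matrix of $Y$ can be written in terms of the matrix of dispersion parameters, from which the estimators of the dispersion parameters, $\hat\phi_{11}, \hat\phi_{22}, \ldots, \hat\phi_{kk}$, are obtained.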

Numerical Examples
In this section, we present two examples. The first applies to bivariate correlated binary data and presents the results obtained with the AQEF measure and the VGLM measure, which are similar in the bivariate case. The second applies to trivariate binary data; here, however, the third-order association is absent from the VGAM (VGLM) measure. In both examples, we use the Hunua Ranges data, Yee [11] [12]. These data were collected from the Hunua Ranges, a small forest in southern Auckland, New Zealand. At 392 sites in the forest, the presence or absence of 17 plant species was recorded along with the altitude. Each site had an area of 200 m². The Hunua Ranges data frame has 392 rows and 18 columns. Altitude is a continuous variable, and there are binary responses (presence = 1, absence = 0) for the 17 plant species. The data frame contains the following columns: agaaus, beitaw, corlae, cyadea, cyamed, daccup, dacdac, eladen, hedarb, hohpop, kniexc, kuneri, lepsco, metrob, neslan, rhosap, vitluc and altitude (meters above sea level).

Application to Bivariate Case
Here we use the first two columns, agaaus and beitaw, as the correlated binary outcome variables, $Y_1$ and $Y_2$, respectively. The third column, corlae, is used as the binary explanatory variable, $X$.
We use the estimates $\hat\phi_{11}$ and $\hat\phi_{22}$ to modify the correlated data according to relation (27). From Table 1 and Table 2, we draw the following conclusions after modifying the correlated data by the estimates of the dispersion parameters:
1. The estimates of the regression parameters change.
2. The standard errors of the estimates of the association parameters decrease. This leads to a significant association between the two binary outcome variables, $(Y_1, Y_2)$, associated with the covariate, $x$.
3. The Wald statistic takes lower values, which confirms a significant association between the two binary outcome variables, $(Y_1, Y_2)$, associated with the covariate, $x$.
4. The LRT increases, which also confirms the conclusion drawn from the Wald statistic.
5. The estimate of the scalar dispersion parameter, $\phi$, increases.
6. The estimates of the matrix of dispersion parameters, $\phi_{11}$, $\phi_{22}$ and $\phi_{12}$, increase and are close to unity.
7. The scaled deviance increases.

Application to Trivariate Case
We use the columns cyadea, beitaw and kniexc as the dependent correlated binary variables, $Y_1$, $Y_2$ and $Y_3$, respectively, and the column altitude (meters above sea level) as the continuous explanatory variable, $X$. The estimates of the regression parameters and the tests for the association parameters can be obtained for the AQEF and VGLM measures, before and after modifying the correlated data by the estimates of the dispersion parameters, $\hat\phi_{11}$, $\hat\phi_{22}$ and $\hat\phi_{33}$, as shown in Table 3. The LRTs are compared with $\chi^2_{0.05,1} = 3.8415$.
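The critical value quoted above can be reproduced without statistical tables, because a $\chi^2$ variable with one degree of freedom is the square of a standard normal variable. The sketch below uses only the Python standard library; the helper name `lrt_is_significant` and the example statistics passed to it are hypothetical and are not values from Table 3.

```python
from statistics import NormalDist

def chi2_critical_1df(alpha=0.05):
    """Upper-alpha critical value of the chi-square distribution with 1 df."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # two-sided normal quantile
    return z ** 2

def lrt_is_significant(lrt_statistic, alpha=0.05):
    """True when a 1-df likelihood ratio statistic exceeds the critical value."""
    return lrt_statistic > chi2_critical_1df(alpha)

critical = chi2_critical_1df(0.05)
```

Each LRT for a single association parameter is then judged significant at the 5% level when it exceeds this value.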
From Table 3, we draw the following conclusions after modifying the data by the estimates of the dispersion parameters:
1. The estimates of the regression parameters change for both measures.
2. The scaled deviance increases for both measures.
3. The estimate of the scalar dispersion parameter, $\phi$, decreases for both measures.
4. The estimates of the dispersion parameters $\phi_{11}$, $\phi_{22}$ and $\phi_{33}$ decrease for both measures, but are close to unity for the AQEF measure. On the other hand, the estimates of the dispersion parameters $\phi_{12}$, $\phi_{13}$ and $\phi_{23}$ decrease for both measures, but are close to unity for the VGLM measure.
5. For the VGLM measure, the LRTs reflect significant association between the pairs of outcome variables, associated with the covariate, $x$. For the AQEF measure, the LRTs also reflect significant association between two of the pairs of outcome variables, associated with the covariate, $x$. However, no significant association is observed for the remaining pair of correlated binary outcome variables, associated with the covariate, $x$.
6. The LRT for the third-order association, which is available from the AQEF measure, reflects no significant association among the correlated binary outcome variables, $(Y_1, Y_2, Y_3)$, associated with the covariate, $x$.
So, when the correlated data are modified, the estimates of the dispersion parameters, $\phi_{11}$, $\phi_{22}$ and $\phi_{33}$, tend to unity. This leads to no significant association between the outcome variables, $Y_1$, $Y_2$ and $Y_3$, associated with the covariate, $x$.
In the multivariate case, the quadratic form for the matrix of dispersion parameters follows the non-central $\chi^2_{n-p}$ distribution. Under independence, this quantity follows, approximately, the $\chi^2_{n-p}$ distribution, and the estimator of $\phi$ is the same as (36). Similarly, using the estimates of the dispersion parameters, $\hat\phi_{11}, \hat\phi_{22}, \ldots, \hat\phi_{kk}$, and Equation (37), we can correct the data for the $i$-th observation in the multivariate case.

Table 1. Results of AQEF and VGLM before modifying the data.

Table 2. Results of AQEF and VGLM after modifying the data.

Table 3. Results before and after modifying the data.