Count data that exhibit overdispersion (the variance of the counts is larger than their mean) are commonly analyzed using discrete distributions such as the negative binomial, the Poisson inverse Gaussian, and other models. The Poisson distribution is characterized by the equality of mean and variance, whereas the negative binomial and the Poisson inverse Gaussian have variance larger than the mean and are therefore more appropriate for modeling overdispersed count data. As an alternative to these two models, we use the generalized Poisson distribution for group comparisons in the presence of multiple covariates. This problem is known as the analysis of covariance (ANCOVA) and has been solved for continuous data. Our objectives were to develop ANCOVA using the generalized Poisson distribution and to compare its goodness of fit to that of the nonparametric Generalized Additive Models. We used real-life data to show that the model performs quite satisfactorily when compared to the nonparametric Generalized Additive Models.

The Poisson distribution is commonly used to model count data. However, a restriction of this distribution is that the response variable must have a mean equal to its variance. This restriction often does not hold for biological and epidemiological data. In many applications the variance can be much larger than the mean, a phenomenon known as “overdispersion”. Overdispersion may occur due to population heterogeneity or the presence of outliers in the data [

The Negative-Binomial (NB) distribution has been used as an alternative to the Poisson distribution for modeling data that exhibit overdispersion. The NB has two parameters and a variance that is a quadratic function of the mean. The NB model has been the model of choice for the analysis of overdispersed count data. The NB regression was reviewed by Hinde and Demetrio [

In this paper we discuss several inferential statistical issues related to a modified form of the Generalized Poisson Distribution (GPD). The GPD distribution was introduced to the statistical literature by Consul and Jain [

The objectives of this paper are threefold. In Section 2, we present the model. We then assume that we have k independent samples and demonstrate how to construct statistical testing procedures on the dispersion parameters. Specifically, we first test the hypothesis of homogeneity of dispersion parameters, and thereafter we test the significance of the common dispersion parameter. In Section 3 we test the hypothesis of equality of k means in the presence of overdispersion. When covariates are measured, testing the equality of group means is equivalent to the analysis of covariance (ANCOVA) in the presence of overdispersion. In Section 4 we use the COVID-19 mortality data to draw a comparison between the MGPD and the Generalized Additive Models (GAM). We demonstrate the differences between the two analytic strategies and highlight the superiority of the MGPD in the analysis of count data exhibiting overdispersion in Section 5. General discussion is presented in Section 6.

The GPD was introduced by Consul and Jain [

$$P(Y = y) = \frac{\lambda_1 (\lambda_1 + \lambda_2 y)^{y-1}}{y!} \exp[-\lambda_1 - \lambda_2 y], \qquad \lambda_1 > 0,\ 0 \le \lambda_2 < 1 \tag{2.1}$$

The GPD whose probability function is given in (2.1) reduces to the well-known Poisson distribution when $\lambda_2 = 0$. Therefore the parameter $\lambda_2$, with the above restriction on its range, is considered the dispersion parameter. Shoukri and Mian [

$$\lambda_1 = \frac{\mu}{1 + \epsilon\mu}, \qquad \lambda_2 = \epsilon\lambda_1 \tag{2.2}$$

Using the transformations in (2.2) we therefore have:

$$P(X = x) = \frac{(1 + \epsilon x)^{x-1}}{x!}\, g^{x}(\mu, \epsilon) \exp\left[-\frac{\mu}{1 + \epsilon\mu}\right] \tag{2.3}$$

where $g(\mu, \epsilon) = \dfrac{\mu}{1 + \epsilon\mu} \exp\left[-\dfrac{\epsilon\mu}{1 + \epsilon\mu}\right]$.
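The step from (2.1) to (2.3) can be made explicit. The following short derivation is a sketch that uses only the substitutions in (2.2), namely $\lambda_2 = \epsilon\lambda_1$ and $\lambda_1 = \mu/(1+\epsilon\mu)$:

```latex
\begin{aligned}
\lambda_1(\lambda_1 + \lambda_2 y)^{y-1}
  &= \lambda_1\,\lambda_1^{\,y-1}(1 + \epsilon y)^{y-1}
   = \lambda_1^{\,y}(1 + \epsilon y)^{y-1}, \\
\exp[-\lambda_1 - \lambda_2 y]
  &= \exp[-\lambda_1]\,\bigl(\exp[-\epsilon\lambda_1]\bigr)^{y}, \\
P(Y = y)
  &= \frac{(1 + \epsilon y)^{y-1}}{y!}\,
     \bigl(\lambda_1 e^{-\epsilon\lambda_1}\bigr)^{y}\, e^{-\lambda_1}, \\
\lambda_1 e^{-\epsilon\lambda_1}
  &= \frac{\mu}{1 + \epsilon\mu}
     \exp\!\left[-\frac{\epsilon\mu}{1 + \epsilon\mu}\right]
   = g(\mu, \epsilon),
\qquad
e^{-\lambda_1} = \exp\!\left[-\frac{\mu}{1 + \epsilon\mu}\right].
\end{aligned}
```

Collecting the factors gives exactly the form in (2.3), with $g(\mu,\epsilon)$ playing the role of the base that is raised to the power of the observed count.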

For fixed ϵ , the function g ( . ) in (2.3) is the natural parameter transformation which renders the GPD a member of the linear family of exponential class (see; [

$$f(x) = h(x) \exp[\phi T(x) - A(\phi)] \tag{2.4}$$

We call the transformed GPD, the “Modified Generalized Poisson Distribution” or MGPD.

Shoukri and Mian [

$$\vartheta'_{r+1} = \sigma^2(\mu) \frac{\partial \vartheta'_r}{\partial \mu} + \mu\, \vartheta'_r \tag{2.5}$$

From (2.5) we can show that:

$$\vartheta'_0 \equiv 1, \quad \vartheta'_1 \equiv \mu = E(Y), \quad \text{and} \quad \sigma^2(\mu) = \mu(1 + \epsilon\mu)^2 \equiv \operatorname{var}(Y) \tag{2.6}$$

That is, the variance is a cubic function of the population mean. We shall deal with the situation when $\epsilon > 0$.
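The mean and variance relations in (2.6) can be checked numerically. The sketch below evaluates the probability function (2.3) on a truncated support and compares the resulting moments with $\mu$ and $\mu(1+\epsilon\mu)^2$. The function name `dmgpd`, the parameter values, and the truncation point are illustrative assumptions and are not part of the original analysis.

```r
# Probability function of the MGPD in (2.3), computed on the log scale.
# dmgpd is a hypothetical helper name used only for this illustration.
dmgpd <- function(x, mu, eps, log = FALSE) {
  log_g <- log(mu) - log(1 + eps * mu) - eps * mu / (1 + eps * mu)
  lp <- (x - 1) * log(1 + eps * x) - lgamma(x + 1) +
    x * log_g - mu / (1 + eps * mu)
  if (log) lp else exp(lp)
}

mu  <- 2      # illustrative mean
eps <- 0.1    # illustrative dispersion parameter
x   <- 0:300  # truncated support; the tail beyond 300 is negligible here

p     <- dmgpd(x, mu, eps)
total <- sum(p)                 # should be close to 1
m     <- sum(x * p)             # should be close to mu
v     <- sum((x - m)^2 * p)     # should be close to mu * (1 + eps * mu)^2

c(total = total, mean = m, variance = v, cubic = mu * (1 + eps * mu)^2)
```

Working on the log scale avoids overflow in the factorial and in the power $(1+\epsilon x)^{x-1}$ for large counts.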

Our approach for parametric estimation in this section will be for a single random sample. If Y 1 , Y 2 , ⋯ , Y n is a random sample from the GPD (2.3), Consul and Shoukri [

Equating the first two sample moments ( y ¯ , s 2 ) to their corresponding population moments

$$\bar{y} = \mu, \qquad s^2 = \frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y})^2 = \mu(1 + \epsilon\mu)^2$$

and solving for the parameters we get:

$$\tilde{\mu} = \bar{y}, \qquad \tilde{\epsilon} = (s^2)^{1/2} (\bar{y})^{-3/2} - (\bar{y})^{-1} \tag{2.7}$$
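As a small worked example of (2.7), the sketch below computes the moment estimates from a sample mean and a sample variance. The helper name `mgpd_moments` is hypothetical. The inputs are the mean (19,496) and standard deviation (14,448) reported for the first region in the summary statistics of the application section, and treating that reported standard deviation as $s$ in (2.7) is an assumption.

```r
# Moment estimators of (mu, eps) from (2.7).
# mgpd_moments is a hypothetical helper name.
mgpd_moments <- function(ybar, s2) {
  c(mu = ybar, eps = sqrt(s2) * ybar^(-3 / 2) - 1 / ybar)
}

# Summary statistics for the first region in the paper's data
est <- mgpd_moments(ybar = 19496, s2 = 14448^2)
round(est[["eps"]], 3)
```

The rounded value agrees with the dispersion estimate listed for that region, which suggests that the tabulated estimates were obtained by the method of moments.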

The variance of the moment estimators of the mean and the dispersion parameter are respectively given by Shoukri and Al-Eid [

$$\operatorname{var}(\hat{\mu}) = \mu(1 + \epsilon\mu)^2 / n \tag{2.8a}$$

$$v = \operatorname{var}(\hat{\epsilon}) = \frac{(1 + \epsilon\mu)^2}{2 n \mu^2} \left[ 1 + 2\epsilon + 3\epsilon^2 \mu \right] \tag{2.8b}$$

Suppose that we have $k$ independent random samples from (2.3), which we denote $Y_{ij} \sim GPD(\mu_i, \epsilon_i)$, with $n_i$ observations from the $i$th population $(i = 1, 2, \cdots, k)$.

We denote the variance of the estimator of $\epsilon_i$, given in Equation (2.8b), by $v_i$, and let $w_i = 1/v_i$.

Cochran [

$$Q_{esp} = \sum_{i=1}^{k} w_i (\hat{\epsilon}_i - \bar{\epsilon})^2 \Big/ \sum_{i=1}^{k} w_i \tag{2.9}$$

where

$$\bar{\epsilon} = \sum_{i=1}^{k} w_i \hat{\epsilon}_i \Big/ \sum_{i=1}^{k} w_i \tag{2.10}$$

The hypothesis $H_0: \epsilon_1 = \epsilon_2 = \epsilon_3 = \cdots = \epsilon_k = \epsilon$ of homogeneity of dispersion parameters is rejected whenever the statistic $Q_{esp}$ exceeds $Q_{\alpha, k-1}$, the upper 5% quantile of a chi-square random variable with $k - 1$ degrees of freedom.
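The homogeneity test can be sketched in a few lines of R. The function below follows (2.9) and (2.10) as printed, taking the estimated dispersion parameters and their estimated variances as inputs. The function name `q_homogeneity` and the numerical inputs are made up for illustration only.

```r
# Weighted homogeneity statistic for k dispersion estimates,
# following (2.9)-(2.10) as printed. q_homogeneity is a hypothetical name.
q_homogeneity <- function(eps_hat, v) {
  w <- 1 / v                                   # weights w_i = 1 / v_i
  eps_bar <- sum(w * eps_hat) / sum(w)         # weighted mean, (2.10)
  Q <- sum(w * (eps_hat - eps_bar)^2) / sum(w) # statistic, (2.9)
  k <- length(eps_hat)
  list(eps_bar = eps_bar, Q = Q,
       p.value = pchisq(Q, df = k - 1, lower.tail = FALSE))
}

# Illustrative (made-up) estimates and variances for k = 3 groups
out <- q_homogeneity(eps_hat = c(0.010, 0.012, 0.015),
                     v = c(1e-5, 2e-5, 1.5e-5))
out$eps_bar
out$Q
```

Passing the variances as an argument keeps the sketch independent of how $v_i$ is estimated, for example by plugging moment estimates into (2.8b).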

Here we develop a test statistic on the null hypothesis of absence of overdispersion. For the case when μ i ’s are unknown, a uniformly most powerful test for H 0 : ϵ = 0 (Poisson) versus H 1 : ϵ > 0 (GPD) cannot be obtained, however the locally powerful Neyman’s C ( α ) test can be constructed [

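$$l = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (y_{ij} - 1) \log(1 + \epsilon y_{ij}) + \sum_{i=1}^{k} n_i \bar{y}_{i.} \left[ \log \mu_i - \log(1 + \epsilon\mu_i) \right] - \sum_{i=1}^{k} n_i \mu_i \left( \frac{1 + \epsilon \bar{y}_{i.}}{1 + \epsilon\mu_i} \right) \tag{2.11}$$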

where $\bar{y}_{i.} = y_{i.} / n_i = \sum_{j=1}^{n_i} y_{ij} / n_i$.

The locally asymptotically most powerful $C(\alpha)$ test is to reject $H_0$ for large values of $(\partial l / \partial \epsilon)_{\epsilon = 0}$. From (2.11):

$$\left( \frac{\partial l}{\partial \epsilon} \right)_{\epsilon = 0} = \sum_{i=1}^{k} \sum_{j=1}^{n_i} y_{ij}(y_{ij} - 1) - \sum_{i=1}^{k} n_i \mu_i \bar{y}_{i.} - \sum_{i=1}^{k} n_i \mu_i (\bar{y}_{i.} - \mu_i) \tag{2.12}$$

Therefore, the locally asymptotically most powerful $C(\alpha)$ test is to reject $H_0$ for large values of $T$, where

$$T = \left( \frac{\partial l}{\partial \epsilon} \right)_{\epsilon = 0} = \sum_{i=1}^{k} \sum_{j=1}^{n_i} \left[ (y_{ij} - \bar{y}_{i.})^2 - y_{ij} \right] \tag{2.13}$$

The statistic (2.13) is obtained from (2.12) by replacing each μ i with root n i consistent estimator, μ ^ i . The simplest μ ^ i is the maximum likelihood estimator μ ^ i = y ¯ i . . Moran [

$$E(T) = \sum_{i=1}^{k} \left[ (n_i - 1) \mu_i (1 + \epsilon\mu_i)^2 - n_i \mu_i \right] \tag{2.14}$$

and

$$\operatorname{var}(T) = \sum_{i=1}^{k} \left\{ 2(n_i - 1) \mu_i^2 (1 + \epsilon\mu_i)^4 + \frac{1}{n_i} \left[ \mu_i (1 + \epsilon\mu_i)^4 (1 + 3\mu_i + 10\epsilon\mu_i + 15\epsilon^2\mu_i^2) - 3\mu_i^2 (1 + \epsilon\mu_i)^4 + \mu_i (1 + \epsilon\mu_i)^2 - 2\mu_i (1 + \epsilon\mu_i)^3 \right] + \frac{\mu_i}{n_i} \right\} \tag{2.15}$$

Under $H_0: \epsilon = 0$, (2.14) and (2.15) reduce respectively to $E^{\circ} = -\sum_{i=1}^{k} \mu_i$ and

$$v^{\circ} = \sum_{i=1}^{k} \left( 2(n_i - 1) \mu_i^2 + \frac{\mu_i}{n_i} \right) \tag{2.16}$$

The hypothesis $H_0: \epsilon = 0$ is rejected whenever

$Q(\epsilon = 0) = (T - E^{\circ})^2 / v^{\circ}$ exceeds $Q_{\alpha, 1}$, the upper 5% quantile of a chi-square random variable with one degree of freedom.
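The test of no overdispersion can be sketched as follows. The function computes $T$ from (2.13), replaces each $\mu_i$ by the group mean in $E^{\circ}$ and $v^{\circ}$, and returns the chi-square statistic. The function name `c_alpha_overdispersion` and the toy data are illustrative assumptions, and the last term of (2.16) is taken to be $\mu_i / n_i$.

```r
# C(alpha) test of H0: eps = 0 for k independent samples.
# groups is a list of numeric count vectors, one per group.
c_alpha_overdispersion <- function(groups) {
  n  <- sapply(groups, length)
  mu <- sapply(groups, mean)   # estimate of mu_i under the Poisson null
  # T in (2.13): sum over groups of sum_j [(y_ij - ybar_i)^2 - y_ij]
  T_stat <- sum(mapply(function(y, m) sum((y - m)^2 - y), groups, mu))
  E0 <- -sum(mu)                              # null expectation of T
  v0 <- sum(2 * (n - 1) * mu^2 + mu / n)      # null variance, (2.16)
  Q  <- (T_stat - E0)^2 / v0
  list(T = T_stat, E0 = E0, v0 = v0, Q = Q,
       p.value = pchisq(Q, df = 1, lower.tail = FALSE))
}

# Toy example with a single small sample
res <- c_alpha_overdispersion(list(c(1, 2, 3)))
res$T   # -4
res$Q   # 0.24
```

For the toy sample the mean is 2, so $T = 0 - 2 - 2 = -4$, $E^{\circ} = -2$, $v^{\circ} = 16 + 2/3$, and $Q = 4 / 16.667 = 0.24$.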

Based on the one-way layout data considered in the previous section, we would like to test the null hypothesis $H_0: \mu_1 = \cdots = \mu_k = \mu$ against $H_a$: at least two of the $\mu_i$'s are different, for all $\epsilon > 0$. The log likelihood under the hypothesis $H_a$ is given by (2.11) and will be denoted by $l_a$. The log likelihood under $H_0$ will be denoted by $l_0$ and is obtained by setting $\mu_i = \mu$ $(i = 1, 2, \cdots, k)$ in (2.11). Under $H_a$, the maximum likelihood estimator of $\mu_i$ is

$$\hat{\mu}_i = \bar{y}_{i.}$$

and the maximum likelihood estimator $\hat{\epsilon}_a$ of $\epsilon$ is the non-negative root of

$$\sum_{i=1}^{k} \left[ \sum_{j=1}^{n_i} \frac{(y_{ij} - 1) y_{ij}}{1 + \hat{\epsilon}_a y_{ij}} - \frac{n_i (\bar{y}_{i.})^2}{1 + \hat{\epsilon}_a \bar{y}_{i.}} \right] = 0 \tag{3.1}$$

Under $H_0$ the maximum likelihood estimator of the common mean $\mu$ is $\hat{\mu} = y_{..} / N = \bar{y}$, where $N = \sum_{i=1}^{k} n_i$.

The maximum likelihood estimator $\hat{\epsilon}_0$ of $\epsilon$ under $H_0$ is the positive root of

$$\sum_{i=1}^{k} \sum_{j=1}^{n_i} \frac{(y_{ij} - 1) y_{ij}}{1 + \hat{\epsilon}_0 y_{ij}} - \frac{N (\bar{y})^2}{1 + \hat{\epsilon}_0 \bar{y}} = 0 \tag{3.2}$$

A detailed discussion of the necessary and sufficient conditions for (3.1) and (3.2) to have a unique root is given in Consul and Shoukri [

Denoting the maximized log likelihood under $H_a$ by $L_a$, and that under $H_0$ by $L_0$, the likelihood ratio test statistic, which has an asymptotic chi-squared distribution with $(k - 1)$ degrees of freedom, is:

$$\lambda = 2(L_a - L_0) = 2 \left[ \sum_{i=1}^{k} \sum_{j=1}^{n_i} (y_{ij} - 1) \log \left\{ \frac{1 + \hat{\epsilon}_a y_{ij}}{1 + \hat{\epsilon}_0 y_{ij}} \right\} + \sum_{i=1}^{k} n_i \bar{y}_{i.} \log \left\{ \frac{\bar{y}_{i.} (1 + \hat{\epsilon}_0 \bar{y})}{\bar{y} (1 + \hat{\epsilon}_a \bar{y}_{i.})} \right\} \right] \tag{3.3}$$
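A minimal sketch of the likelihood ratio test follows. It solves the estimating equations for the dispersion parameter with `uniroot`, writing the subtracted term as $n_i \bar{y}_{i.}^2 / (1 + \epsilon \bar{y}_{i.})$, and then evaluates $2(L_a - L_0)$ directly from the log likelihood. The function names, the search interval `c(0, upper)`, and the toy data are assumptions made for illustration, and the sketch assumes every group mean is positive.

```r
# Log likelihood of the MGPD for observations y with fitted means mu.
gpd_loglik <- function(y, mu, eps) {
  sum((y - 1) * log(1 + eps * y) + y * (log(mu) - log(1 + eps * mu)) -
        mu * (1 + eps * y) / (1 + eps * mu) - lgamma(y + 1))
}

# Root of the score equation for eps with the means held at their MLEs.
fit_eps <- function(y, mu, upper = 10) {
  score <- function(e) sum(y * (y - 1) / (1 + e * y) - mu^2 / (1 + e * mu))
  if (score(0) <= 0) return(0)   # no overdispersion: boundary estimate
  uniroot(score, c(0, upper), tol = 1e-10)$root
}

lrt_means <- function(groups) {
  y    <- unlist(groups)
  mu_a <- rep(sapply(groups, mean), sapply(groups, length))  # group means
  mu_0 <- rep(mean(y), length(y))                            # common mean
  eps_a <- fit_eps(y, mu_a)
  eps_0 <- fit_eps(y, mu_0)
  lambda <- 2 * (gpd_loglik(y, mu_a, eps_a) - gpd_loglik(y, mu_0, eps_0))
  list(eps_a = eps_a, eps_0 = eps_0, lambda = lambda,
       p.value = pchisq(lambda, df = length(groups) - 1, lower.tail = FALSE))
}

# Toy data: two small groups of counts
toy <- list(c(0, 1, 3, 8, 12), c(2, 5, 9, 20, 30))
lrt <- lrt_means(toy)
lrt$lambda
```

If the score has the same sign at both ends of the search interval, `uniroot` stops with an error, in which case a larger `upper` would be needed.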

As an alternative to the likelihood ratio test (3.3), we present Neyman’s $C(\alpha)$ statistic, which has local optimal properties. Suppose that $\mu_i$ can be written as $\mu_i = \mu + \delta_i$ with $\delta_k = 0$. Then testing the null hypothesis $H_0: \mu_1 = \mu_2 = \cdots = \mu_k$ is equivalent to testing $H_0: \delta_i = 0$ $(i = 1, 2, 3, \cdots, k)$, where $\mu$ and $\epsilon$ are nuisance parameters. We reparametrize (2.11) and denote the resulting function by $l^*$.

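Define $\delta = (\delta_1, \delta_2, \cdots, \delta_{k-1})$, $\tau = (\tau_1, \tau_2)' = (\mu, \epsilon)$,

$$\phi_i(\tau) = \left[ \frac{\partial l^*}{\partial \delta_i} \right]_{\delta = 0}, \qquad i = 1, 2, \cdots, k - 1$$

$$\Delta_j(\tau) = \left[ \frac{\partial l^*}{\partial \tau_j} \right]_{\delta = 0}, \qquad j = 1, 2$$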

Let τ ^ be any root-n consistent estimator of τ under the null hypothesis. Moran [

$$P_{ij} = -E \left[ \frac{\partial^2 l^*}{\partial \delta_i \partial \delta_j} \right]_{\delta = 0}, \qquad Q_{ij} = -E \left[ \frac{\partial^2 l^*}{\partial \delta_i \partial \tau_j} \right]_{\delta = 0},$$

$$\text{and} \quad R_{ij} = -E \left[ \frac{\partial^2 l^*}{\partial \tau_i \partial \tau_j} \right]_{\delta = 0}.$$

Replacing $\tau$ by its estimator $\hat{\tau}$ in $F$, $P$, $Q$, and $R$, the $C(\alpha)$ test statistic is given by

$$F' (P - Q R^{-1} Q')^{-1} F \tag{3.5}$$

The asymptotic distribution of the test statistic given in (3.5) is chi-square with $k - 1$ degrees of freedom.

Now, there are two possible root-n consistent estimators of $\tau$ under $H_0$:

The first is the maximum likelihood estimator $\hat{\tau} = (\bar{y}, \hat{\epsilon}_0)'$, which on substitution gives $\Delta_j(\hat{\tau}) = 0$ $(j = 1, 2)$, and hence $F_i(\hat{\tau}) = \phi_i(\hat{\tau})$. Accordingly, (3.5) reduces to

$$C^2 = \sum_{i=1}^{k} \frac{n_i (\bar{y}_{i.} - \bar{y})^2}{\bar{y} (1 + \hat{\epsilon}_0 \bar{y})^2} \tag{3.6}$$

The hypothesis of equality of population means is thus rejected whenever C 2 exceeds Q α , k − 1 , the upper 5% quantile of a chi-square random variable with k − 1 degrees of freedom. For more details we refer the reader to [
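The $C^2$ statistic in (3.6) only needs the group sizes, the group means, the overall mean, and the null estimate of the dispersion parameter. The sketch below obtains that estimate as the root of the null score equation for $\epsilon$ and then evaluates (3.6). The function name `c2_stat`, the search interval, and the toy data are illustrative assumptions.

```r
# C(alpha) statistic (3.6) for equality of k means.
# groups is a list of numeric count vectors, one per group.
c2_stat <- function(groups, upper = 10) {
  y <- unlist(groups)
  N <- length(y)
  ybar <- mean(y)
  # Score equation for eps under H0 (common mean ybar)
  score0 <- function(e) sum(y * (y - 1) / (1 + e * y)) - N * ybar^2 / (1 + e * ybar)
  eps0 <- if (score0(0) <= 0) 0 else uniroot(score0, c(0, upper), tol = 1e-10)$root
  n_i <- sapply(groups, length)
  m_i <- sapply(groups, mean)
  C2 <- sum(n_i * (m_i - ybar)^2) / (ybar * (1 + eps0 * ybar)^2)
  list(eps0 = eps0, C2 = C2,
       p.value = pchisq(C2, df = length(groups) - 1, lower.tail = FALSE))
}

# Toy data: two small groups of counts
out <- c2_stat(list(c(0, 1, 3, 8, 12), c(2, 5, 9, 20, 30)))
out$C2
out$p.value
```

The denominator is the estimated null variance $\bar{y}(1 + \hat{\epsilon}_0 \bar{y})^2$, so with $\hat{\epsilon}_0 = 0$ the statistic reduces to the usual Poisson dispersion-type comparison of group means.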

It is well-known that ANOVA and regression are related techniques that are concerned with testing the differences in group means after adjusting for the confounding effects of potential risk factors and covariates. Since the MGPD is a member of the linear exponential family (for fixed ϵ ) Shoukri and Mian [

$$\eta(\mu_i) = X_i^T \beta \tag{4.1}$$

In Equation (4.1), $X_i$ is a set of $(p + 1)$ measured covariates, and a subset of these covariates defines a set of indicator (dummy) variables that identify categorical effects. The transformation $\eta(\cdot)$ is a monotone, differentiable function named “the link function”. To estimate $\beta_0, \beta_1, \cdots, \beta_p$, and $\epsilon$ we use the log-link, so that:

$$\mu_i(x) = \exp[X_i^T \beta]$$

The logarithm of the likelihood function will be proportional to

$$l = l(\beta, \epsilon) = \sum_{i=1}^{k} (y_i - 1) \ln(1 + \epsilon y_i) + \sum_{i=1}^{k} y_i \ln \mu_i(x) - \sum_{i=1}^{k} y_i \ln(1 + \epsilon \mu_i(x)) - \sum_{i=1}^{k} \frac{\mu_i(x)(1 + \epsilon y_i)}{1 + \epsilon \mu_i(x)} \tag{4.2}$$

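The first and second partial derivatives are given by:

$$\frac{\partial l}{\partial \epsilon} = \dot{l}_{\epsilon} = \sum_{i=1}^{k} \frac{y_i (y_i - 1)}{1 + \epsilon y_i} - \sum_{i=1}^{k} \frac{y_i \mu_i(x)}{1 + \epsilon \mu_i(x)} - \sum_{i=1}^{k} \frac{\mu_i(x)(y_i - \mu_i(x))}{(1 + \epsilon \mu_i(x))^2} \tag{4.3}$$

$$\frac{\partial l}{\partial \beta_r} = \dot{l}_r = \sum_{i=1}^{k} \frac{y_i - \mu_i(x)}{(1 + \epsilon \mu_i(x))^2} x_{ir} \tag{4.4}$$

$$\frac{\partial^2 l}{\partial \epsilon\, \partial \beta_r} = \ddot{l}_{\epsilon r} = -2 \sum_{i=1}^{k} \frac{\mu_i(x)(y_i - \mu_i(x))}{(1 + \epsilon \mu_i(x))^3} x_{ir} \tag{4.5}$$

$$\frac{\partial^2 l}{\partial \beta_r \partial \beta_s} = \ddot{l}_{rs} = -\sum_{i=1}^{k} \left[ \frac{\mu_i(x)}{(1 + \epsilon \mu_i(x))^2} x_{ir} x_{is} + \frac{2 \epsilon (y_i - \mu_i(x)) \mu_i(x)}{(1 + \epsilon \mu_i(x))^3} x_{ir} x_{is} \right] \tag{4.6}$$

$$\frac{\partial^2 l}{\partial \epsilon^2} = \ddot{l}_{\epsilon\epsilon} = -\sum_{i=1}^{k} \frac{y_i^2 (y_i - 1)}{(1 + \epsilon y_i)^2} + \sum_{i=1}^{k} \frac{y_i \mu_i^2(x)}{(1 + \epsilon \mu_i(x))^2} + 2 \sum_{i=1}^{k} \frac{\mu_i^2(x)(y_i - \mu_i(x))}{(1 + \epsilon \mu_i(x))^3} \tag{4.7}$$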
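The first derivatives can be verified numerically before they are used in an optimizer. The sketch below codes the log likelihood (4.2) and the score components with respect to $\beta_r$ and $\epsilon$, and compares the analytic score with central finite differences. The function names, the toy design matrix, the toy counts, and the evaluation point are all illustrative assumptions.

```r
# Log likelihood (4.2); par = c(beta, eps), X is the n x p design matrix.
mgpd_loglik <- function(par, y, X) {
  p <- ncol(X); beta <- par[1:p]; eps <- par[p + 1]
  mu <- as.vector(exp(X %*% beta))
  sum((y - 1) * log(1 + eps * y) + y * log(mu) - y * log(1 + eps * mu) -
        mu * (1 + eps * y) / (1 + eps * mu))
}

# Analytic score: derivatives with respect to beta (4.4) and eps (4.3).
mgpd_score <- function(par, y, X) {
  p <- ncol(X); beta <- par[1:p]; eps <- par[p + 1]
  mu <- as.vector(exp(X %*% beta))
  d_beta <- as.vector(t(X) %*% ((y - mu) / (1 + eps * mu)^2))
  d_eps  <- sum(y * (y - 1) / (1 + eps * y) - y * mu / (1 + eps * mu) -
                  mu * (y - mu) / (1 + eps * mu)^2)
  c(d_beta, d_eps)
}

# Toy data and an arbitrary evaluation point
X    <- cbind(1, seq(-1, 1, length.out = 8))
y    <- c(0, 1, 0, 2, 3, 1, 5, 4)
par0 <- c(0.5, 0.8, 0.1)

# Central finite differences, one coordinate at a time
num_grad <- sapply(seq_along(par0), function(j) {
  h <- 1e-6
  e <- replace(numeric(length(par0)), j, h)
  (mgpd_loglik(par0 + e, y, X) - mgpd_loglik(par0 - e, y, X)) / (2 * h)
})

max_diff <- max(abs(num_grad - mgpd_score(par0, y, X)))
max_diff
```

A small `max_diff` indicates that the analytic score agrees with the log likelihood, so a function like `mgpd_score` could be supplied as the `gr` argument of `optim` instead of relying on numerical differentiation.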

Taking the expected value of the negative of the second partial derivatives we get the Fisher information matrix $I$, whose elements are:

$$-E[\ddot{l}_{rs}] = I_{rs} = \sum_{i=1}^{k} \frac{\mu_i(x)}{(1 + \epsilon \mu_i(x))^2} x_{ir} x_{is}, \qquad r, s = 1, 2, \cdots, p \tag{4.8}$$

$$-E[\ddot{l}_{\epsilon r}] = I_{\epsilon r} = 0 \tag{4.9}$$

From Consul and Shoukri [

$$-E[\ddot{l}_{\epsilon\epsilon}] = i_{\epsilon\epsilon} = 2 (1 + 2\epsilon)^{-1} \sum_{i=1}^{k} \frac{\mu_i^2(x)}{(1 + \epsilon \mu_i(x))^2} \tag{4.10}$$
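Because the cross term (4.9) is zero, the information matrix is block diagonal and the standard errors of the regression coefficients depend only on the block in (4.8). The sketch below computes them for given values of $\beta$ and $\epsilon$. The function name `beta_se` and the intercept-only example are illustrative assumptions.

```r
# Standard errors of the regression coefficients from the beta-block
# of the Fisher information in (4.8).
beta_se <- function(beta, eps, X) {
  mu   <- as.vector(exp(X %*% beta))
  w    <- mu / (1 + eps * mu)^2   # weight for each observation
  info <- t(X) %*% (w * X)        # sum_i w_i x_ir x_is
  sqrt(diag(solve(info)))
}

# Intercept-only check: n = 10, mu = 4, eps = 0.5
X  <- matrix(1, nrow = 10, ncol = 1)
se <- beta_se(beta = log(4), eps = 0.5, X = X)
se
```

In the intercept-only case the information is $n\mu / (1 + \epsilon\mu)^2 = 40/9$, so the standard error of the log-mean is $\sqrt{9/40}$, which matches the delta-method approximation based on (2.8a).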

The asymptotic distributions of the regression estimators can be established using the results in [

Our approach to the data analysis, when the main interest is comparing group means in the presence of potential risk factors and confounders, is summarized in three steps. In the first step we use maximum likelihood to estimate the regression parameters using Equation (4.2), without including the groups as an independent variable. In the second step, we extract the residuals (E) of the generalized Poisson regression model, defined as:

E = observed value of the dependent variable − predicted value of the dependent variable

In the final step we test the normality and variance homogeneity of E. Thereafter, we use nonparametric ANOVA, with the residuals being the dependent variable and the groups being the independent variable, to complete the ANCOVA testing.
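The three steps can be sketched end to end on simulated data. Everything in the block below is an illustrative assumption: the data are simulated from a negative binomial model with one covariate and three groups, the dispersion parameter is optimized on the log scale to keep it positive, and the object names are hypothetical. It is not the code used for the analysis reported in this paper.

```r
set.seed(1)
n     <- 60
group <- factor(rep(c("A", "B", "C"), each = 20))
x1    <- rnorm(n)
y     <- rnbinom(n, mu = exp(1 + 0.5 * x1), size = 2)  # overdispersed counts
X     <- cbind(1, x1)

# Negative log likelihood of the MGPD regression; par = c(beta, log(eps))
negll <- function(par, y, X) {
  beta <- par[-length(par)]
  eps  <- exp(par[length(par)])
  mu   <- as.vector(exp(X %*% beta))
  -sum((y - 1) * log(1 + eps * y) + y * log(mu / (1 + eps * mu)) -
         mu * (1 + eps * y) / (1 + eps * mu) - lgamma(y + 1))
}

# Step 1: fit the regression without the group factor,
# starting from the Poisson GLM coefficients
start <- c(coef(glm(y ~ x1, family = poisson)), log(0.1))
fit   <- optim(start, negll, y = y, X = X, method = "BFGS")

# Step 2: residuals = observed minus predicted
mu_hat <- as.vector(exp(X %*% fit$par[1:2]))
res    <- y - mu_hat

# Step 3: check normality, then compare groups nonparametrically
shapiro.test(res)
kw <- kruskal.test(res ~ group)
kw
```

In this simulation the groups share the same mean structure, so a large p-value from the Kruskal-Wallis test is the expected outcome.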

Al-Gahtani et al. [

The response variable is the aggregate number of COVID-19 deaths, which we denote by “y”. We use the following set of covariates:

· Region: The factor variable which is the main effect.

· The other covariates are:

1) X_{1} = log (percentage of obese persons in a country reported in 2018) [

2) X_{2} = log (population density) [

3) X_{3} = log (number of people with colorectal cancer in a country reported in 2017) [

4) X_{4} = log (Chronic Kidney Disease—case fatality in a country as reported in 2017) [

In

Direct calculations from the summary statistics given in

Q _ e s p = 0.00004 , and the corresponding p-value = 0.999. Therefore, the hypothesis of homogeneity of dispersion parameters is supported by the data. Moreover, Q ( ϵ = 0 ) is quite large and the corresponding p-value = 0.00001.

Countries | Region name | Region Code |
---|---|---|
Peru, Ecuador, Bolivia | Andean Latin (3) | 10 |
Kazakhstan, Georgia, Armenia, Azerbaijan, Kyrgyzstan, Uzbekistan, Tajikistan | Central Asia (7) | 2 |
Czechia, Romania, Hungary, Serbia, Bulgaria, Croatia, Slovakia, Bosnia, Slovenia, North-Macedonia, Albania, Montenegro | Central Europe (12) | 5 |
Brazil, Colombia, Mexico, Panama, Costa Rica, Guatemala, Honduras, Venezuela, Paraguay, El-Salvador | Central Latin America (10) | 11 |
Dominican Republic, Puerto Rico, Jamaica | Caribbean (3) | 9 |
Ethiopia, Kenya, Uganda, Zambia, Madagascar, Mozambique, Angola, French Guinea | CESSA (8) | 13 |
Indonesia, Philippines, China, Myanmar, Malaysia, Sri Lanka, French Polynesia, Maldives | East Asia (8) | 1 |
Russia, Poland, Ukraine, Belarus, Lithuania, Latvia, Estonia | East Europe (7) | 6 |
Japan, Singapore, Republic Korea, Australia | HIAP (4) | 4 |
Iran, Iraq, Turkey, Morocco, Saudi Arabia, Israel, Jordan, United Arab, Kuwait, Qatar, Lebanon, Oman, Egypt, Occupied Palestine, Tunisia, Bahrain, Algeria, Libya, Afghanistan, Sudan | MENA (20) | 12 |
India, Bangladesh, Pakistan, Nepal | South Asia (4) | 3 |
Chile, Argentina, Uruguay | South Latin America (3) | 8 |
South Africa, Namibia, Zimbabwe | SSSA (3) | 15 |
USA, France, Spain, UK, Italy, Germany, Belgium, Netherlands, Canada, Switzerland, Portugal, Austria, Sweden, Greece, Denmark, Ireland, Norway, Luxemburg, Finland, Cyprus | West Europe and North America (20) | 7 |
Mali, Nigeria, Ghana, Cameroon, Ivory Coast, Senegal, Guinea, Cape Verde | WSSA (8) | 14 |

CSSA = Central Sub-Saharan-Africa; MENA = Middle East and North Africa; HIAP = High Income Asian Pacific; WSSA = Western Sub-Saharan-Africa; SSSA = Southern Sub-Saharan Africa.

No. | Region | n | Mean (m) | SD (s) | Dispersion estimate (eps) |
---|---|---|---|---|---|
1 | An. Latin | 3 | 19,496 | 14,448 | 0.005 |
2 | C. Asia | 7 | 1350 | 836 | 0.016 |
3 | C. Europe | 11 | 3515 | 3577 | 0.018 |
4 | C. Latin | 10 | 33,158 | 59,252 | 0.01 |
5 | Caribbean | 3 | 976 | 1177 | 0.037 |
6 | CESSA | 7 | 640 | 658 | 0.039 |
7 | E. Asia | 6 | 5452 | 6495 | 0.016 |
8 | E. Europe | 7 | 10,488 | 15,213 | 0.014 |
9 | HIAP | 4 | 920 | 934 | 0.032 |
10 | MENA | 19 | 5986 | 11,027 | 0.024 |
11 | S. Asia | 4 | 38,606 | 66,403 | 0.009 |
12 | S. Latin | 2 | 27,080 | 16,476 | 0.004 |
13 | SSSA | 3 | 7358 | 12,372 | 0.019 |
14 | W. Eur | 20 | 28,061 | 59,609 | 0.013 |
15 | WSSA | 7 | 692 | 806 | 0.043 |

Therefore, the hypothesis that the common dispersion parameter is zero is not supported by the data. The C^{2}-statistic is quite large as well, and the corresponding p-value is near zero; therefore, the hypothesis of equality of mean counts (aggregate COVID-19 deaths) in all regions is also not supported by the data.

We write a function in R for the estimation of the regression parameters. The iteration process requires starting points. We obtain the starting points by first fitting the classical Poisson regression, which is done using the following code:

out1 = glm(y ~ x1 + x2 + x3 + x4, data = data2, family = poisson)

Having obtained the parameter estimates from the Poisson regression, we use them to start the iteration process and obtain final estimates as shown in the Appendix.

The MGPD regression results are summarized in

The correlation between the observed and predicted COVID-19 death counts is (0.758).

To complete the ANCOVA testing we use the Kruskal-Wallis test, whereby the residuals of the MGPD regression model are used as the dependent variable and the “Regions”, or groups, as the independent variable. The results are summarized as follows:

Kruskal-Wallis chi-squared = 14.936, p-value = 0.1344.

Therefore, after adjusting for the covariates, there is not sufficient evidence to reject the hypothesis of equality of mean counts in COVID-19 deaths among the “Regions”.

Parameter | Estimate | SE | Z | p value |
---|---|---|---|---|
Intercept | −0.299 | 1.138 | −0.262 | 0.7900 |
X_{1} | 0.659 | 0.186 | 3.545 | 0.0004 |
X_{2} | 0.489 | 0.131 | 3.729 | 0.0002 |
X_{3} | 0.425 | 0.107 | 3.400 | 0.0002 |
X_{4} | −0.385 | 0.090 | −4.280 | 0.00002 |
ε | 0.028 | 0.002 | 13.537 | 0.000001 |

Generalized Additive Models (GAM) are a more recent development that is becoming popular as a modeling technique. They are nonparametric in nature and, even though less powerful, are quite robust against departures from the assumptions required by classical GLM regression models. GAM allow us to include non-linear smoothers in the modeling strategy. In mathematical terms, GAM solve the following equation:

$$g(\mu_i) = f_1(x_1) + f_2(x_2) + f_3(x_3) + f_4(x_4) + f_5(x_5) \tag{6.1}$$

The f j ( x j ) are smooth functions to be estimated. Equation (6.1) seems complex, but it is very simple to understand. The first thing to notice is that with GAM we are not necessarily estimating the response directly, i.e. we are not modelling y. In fact, as with GLM we have the possibility to use link functions to model non-normal response variables (and thus perform Poisson or logistic regression) [

We fitted the GAM to the data using the R package “gam”; the next two lines are the needed code:

library(gam)

agam = gam(y ~ Region + x1 + x2 + x3 + x4, data = data2)

Call: gam(formula = y ~ code + x1 + x2 + x3 + x4, data = data2).

The following results are obtained from the GAM fitting to the data:

1) Null Deviance: 135512150629 on 112 degrees of freedom;

2) Residual Deviance: 74451674616 on 96 degrees of freedom.

From which: The correlation between observed counts and predicted counts is: (1-74451674616/135512150629)^{1/2} = 0.671.
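The arithmetic behind this correlation can be reproduced directly from the two reported deviances. The short sketch below also recomputes the residual mean square on 96 degrees of freedom, which can be compared with the value given in the GAM analysis of variance table. The variable names are illustrative.

```r
null_dev  <- 135512150629   # null deviance on 112 degrees of freedom
resid_dev <- 74451674616    # residual deviance on 96 degrees of freedom

# Square root of the proportion of deviance explained
r <- sqrt(1 - resid_dev / null_dev)
round(r, 3)

# Residual mean square
resid_ms <- resid_dev / 96
resid_ms
```

The first quantity rounds to 0.671, and the second is approximately 7.7554e+08.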

The GAM results are shown in

Source | DF | Sum Sq | Mean Sq | F value | Pr(>F) |
---|---|---|---|---|---|
Region | 12 | 1.5856e+10 | 1.3213e+09 | 1.7038 | 0.078 |
x_{1} | 1 | 1.9335e+09 | 1.9335e+09 | 2.4932 | 0.118 |
x_{2} | 1 | 2.3000e+10 | 2.3000e+10 | 29.6566 | 3.963e−07*** |
x_{3} | 1 | 1.1518e+10 | 1.1518e+10 | 14.8522 | 0.0002 |
x_{4} | 1 | 8.7527e+09 | 8.7527e+09 | 11.2859 | 0.001 |
Residuals | 96 | 7.4452e+10 | 7.7554e+08 | | |

***significant at level of significance less than 0.00001.

In this paper we demonstrated the use of the MGPD as a model for the ANCOVA. We used a two-step approach. In the first step we used the regression models to assess the influence of possible confounders and covariates on the outcome of interest. Thereafter we extracted the model residuals and used these residuals as the dependent variable of a nonparametric ANOVA, with the groups being the independent predictors. We note that while there was a significant difference among the group means in the univariate analysis, this difference was not significant in the second step of the ANCOVA. We also note that the MGPD regression model showed a high correlation (0.758) between the observed counts and the model-based predicted counts, indicative of a good fit of the model to the given data. On the Q-Q plot, the model residuals show close agreement with the quantiles of the standard normal distribution. This shows that the model is quite reliable as a predictive tool, and that the distribution of the estimated regression parameters is that of a multivariate normal.

For the sake of comparison, we fitted the data using the GAM, a nonparametric regression approach. This approach deals with the covariates as factors. The GAM showed that, after adjusting for the covariates within the same model, there are no significant differences among regions. These findings are in agreement with those based on the MGPD regression. The GAM did not produce an estimate of the dispersion parameter ε. The measure of goodness of fit of the GAM was 0.671, which is lower than the 0.758 obtained for the MGPD. The MGPD model has several advantages when compared to the GAM. First, the GAM cannot be used as a predictive tool, while the MGPD model can be used to predict the mean of the response variable. Second, the residuals of the MGPD regression model have a distribution that is almost normal. This emphasizes the reliability of the likelihood-based statistical estimation of the model parameters. Finally, our two-step approach to data fitting helps avoid both overfitting and possible multicollinearity.

The authors declare no conflicts of interest regarding the publication of this paper.

Al-Eid, M. and Shoukri, M.M. (2021) Inference Procedures on the Generalized Poisson Distribution from Multiple Samples: Comparisons with Nonparametric Models for Analysis of Covariance (ANCOVA) of Count Data. Open Journal of Statistics, 11, 420-436. https://doi.org/10.4236/ojs.2021.113026

R-Code fitting the Generalized Poisson using the method of maximum likelihood.

###Notations: b_{i} are the regression parameters, mu is the mean function, “n” is the number of observations “ll” denotes the ###log-likelihood function, eta is the linear predictor ###

llik= function(y,par){

b0=par[1]

b1=par[2]

b2=par[3]

b3=par[4]

b4=par[5]

k=par[6] # k is the dispersion parameter

n=length(y)

eta=b0+b1*x1+b2*x2+b3*x3+b4*x4

mu=exp(eta)

ll= sum(y*log(mu/(1+(k*mu))))+sum((y-1)*log(1+(k*y))

+((-mu*(1+(k*y)))/ (1+(k*mu)))-lgamma(y+1))

return(-ll)

}

res=optim(par=c(1.4,.84,.07,.95,-.37,.1),llik,y=y,method="BFGS",hessian=TRUE)

theta=res$par

theta

#CALCULATING THE STANDARD ERRORS OF MLE

out2=nlm(llik,theta,y=y,hessian=TRUE)

summary(out2)

plot(data2$y,resid(out2))

data_new=data.frame(data2$y,resid(out2))

fish=out2$hessian

solve(fish)

element=diag((solve(fish)))

se=sqrt(element)

se

z=theta/se

out.GMPD=data.frame(theta,se,z)

out.GMPD

#### FINAL ESTIMATES -0.30007804 0.65884589 0.48926214 0.42536091 -0.38469265

data2$y_hat=exp(-.3+.66*data2$x1+.5*data2$x2+.425*data2$x3-.384*data2$x4)

data2$y

data2_error=data.frame(data2$y,data2$y_hat)

cor(data2$y,data2$y_hat) #####0.76

data2_error=data2$y-data2$y_hat

data2$response=sqrt((data2_error)^(1/3))

qqnorm(data2$response)

### SHAPIRO-WILK TEST OF NORMALITY ####

shapiro.test(data2$response)

library(car) # leveneTest() is provided by the car package

leveneTest(data2$response~data2$Region)

aov_result=aov(data2$response~data2$Region)

#### ANOVA ON THE RESIDUALS WITH REGION BEING THE INDEPENDENT VARIABLE, FOLLOWED BY KRUSKAL-WALLIS ####

levels(data2$Region)

aov_result=aov(data2$response~data2$Region)

summary(aov_result)

boxplot(data2_error~data2$Region,xlab="Region",ylab="GPD Residuals",main="COVID-19 Deaths")

kruskal_result=kruskal.test(data2$response~data2$Region)

###END OF CODE####.