
An excess of zeros (zero inflation, ZI) in count data is a common phenomenon that must be addressed in any analysis. The extra zeros may be a result of over-dispersion in the data, and over-dispersion is in turn often associated with zero-inflated data. Ignoring zero inflation can result in biased parameter estimates and standard errors. Depending on the selected model, different results and conclusions may be reached. In this paper, two models commonly encountered in count data are considered, namely the Zero-Inflated Poisson (ZIP) and Zero-Inflated Negative Binomial (ZINB) probability distributions. Emphasis is placed on Maximum Likelihood (ML) estimation of the model parameters. Of specific interest was the estimation of the zero-inflation parameter and, hence, the corrected frequencies. It was found that the zero-inflation parameter estimate for the Poisson model was considerably higher than that for the Negative Binomial model. From the results, however, it is suspected that the effectiveness of adjusting for the high number of zeros in both models might have been greatly affected by the inherently high variability between sites. It is therefore proposed that, in future research, the problem of heterogeneity in count data be addressed before any further analysis.

In most cases, insect count data have been modelled by one of three distributions, namely Poisson (θ), Binomial (n, θ) and Negative Binomial (k, θ). These discrete distributions fall under the category of power series distributions, which have been generalized to what is referred to as the generalized power series distribution (GPSD). Expressing these probability distributions explicitly in power series form greatly simplifies the derivation of their moments. The power series distributions have been widely studied. Noak [

A discrete random variable X will have a power series distribution given as:

$$P(X = x) = \frac{a_x \theta^x}{f(\theta)}, \quad x = 0, 1, 2, \ldots;\ a_x > 0;\ \theta > 0 \tag{1}$$

The distribution belongs to the exponential family of distributions and can generally be expressed in the form:

$$P(X = x) = \exp\left[x\,a(\theta) + c(x) + g(\theta)\right] \tag{2}$$

where a and g are functions of the unknown parameter θ and c is a function of x.

This property has been exploited in the derivation of the moments and other properties of the distribution. It can be shown (see Edwin [

$$E(X) = \frac{\theta f'(\theta)}{f(\theta)}, \qquad V(X) = \frac{\theta^2 f''(\theta)}{f(\theta)} + \frac{\theta f'(\theta)}{f(\theta)} - \left[\frac{\theta f'(\theta)}{f(\theta)}\right]^2 \tag{3}$$
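
As a quick numerical check of Equation (3), the sketch below evaluates the mean and variance from f(θ) and its first two derivatives, using the Poisson and Negative Binomial series functions. The helper name `psd_moments` and the parameter values are illustrative assumptions and are not part of the original analysis.

```python
import math

def psd_moments(theta, f, f1, f2):
    """Mean and variance of a power series distribution via Equation (3).

    f, f1, f2 are callables returning f(theta), f'(theta) and f''(theta).
    """
    ratio1 = theta * f1(theta) / f(theta)        # theta f'/f, i.e. E(X)
    ratio2 = theta ** 2 * f2(theta) / f(theta)   # theta^2 f''/f, i.e. E[X(X-1)]
    mean = ratio1
    var = ratio2 + ratio1 - ratio1 ** 2
    return mean, var

# Poisson: f(theta) = exp(theta), so every derivative is exp(theta).
theta = 2.5  # illustrative value
pois_mean, pois_var = psd_moments(theta, math.exp, math.exp, math.exp)

# Negative Binomial: f(theta) = (1 - theta)^(-k) with 0 < theta < 1.
k, t = 0.5, 0.6  # illustrative values
nb_mean, nb_var = psd_moments(
    t,
    lambda th: (1 - th) ** (-k),
    lambda th: k * (1 - th) ** (-k - 1),
    lambda th: k * (k + 1) * (1 - th) ** (-k - 2),
)
```

For the Poisson case the two moments coincide, while for the Negative Binomial case the variance exceeds the mean.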

The Poisson distribution theoretically specifies that the mean and variance are equal. However, it is quite common to have data for which the variance is far larger than the mean and the phenomenon is referred to as over-dispersion. In this case, Poisson GLM has been used to correct the anomaly. See for example, Cameron and Trivedi [^{2} distribution, the mean and variance are given as;

$$E(X) = \mu \quad \text{and} \quad \sigma^2 = \mu + \frac{\mu^2}{k} \tag{4}$$

The distribution reduces to equi-dispersion (variance equal to the mean) as k becomes large, implying convergence to the Poisson distribution. As k becomes small for a small μ, a zero-inflated Negative Binomial distribution is a consequence.

A zero-inflated statistical model is based on a zero-inflated probability distribution. It arises when the probability mass at zero exceeds that allowed under the particular family of distributions. These models have been widely studied. See, for example, Jasankul [

Count data that have an incidence of zeros greater than expected for the underlying probability distribution are modelled as:

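Distribution | $a_x$ | $\theta$ | $f(\theta)$ |
---|---|---|---|
Poisson ($\theta$) | $\frac{1}{x!}$ | $\theta$ | $e^{\theta}$ |
Binomial ($n, \theta$) | $\binom{n}{x}$ | $\frac{\theta}{1+\theta}$ | $(1+\theta)^n$ |
Logarithmic ($\theta$) | $\frac{1}{x}$ | $\theta$ | $-\ln(1-\theta)$ |
Neg. Binomial | $\binom{k+x-1}{x}$ | $\theta$ | $(1-\theta)^{-k}$ |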

$$P(X = x) = \begin{cases} \rho + (1-\rho)\,P(X = 0) & \text{for } x = 0 \\ (1-\rho)\,P(X = x) & \text{for } x = 1, 2, \ldots \end{cases} \tag{5}$$

where $0 < \rho < 1$ is the zero-inflation parameter. $P(X = x)$ in this case represents any one of the count data distributions, e.g. Poisson or Negative Binomial, written in the format presented in (1). The means and variances of these distributions may then be obtained by using the expressions in (3) for the respective probability model. For example, for the Poisson (θ) distribution:

$$P(X = x) = \begin{cases} \rho + (1-\rho)\,e^{-\theta} & \text{for } x = 0 \\ (1-\rho)\,\dfrac{e^{-\theta}\theta^{x}}{x!} & \text{for } x = 1, 2, \ldots \end{cases} \tag{6}$$

And for the Negative Binomial distribution:

$$P(X = x) = \begin{cases} \rho + (1-\rho)\,p^{k} & \text{for } x = 0 \\ (1-\rho)\dbinom{k+x-1}{x} p^{k}(1-p)^{x} & \text{for } x = 1, 2, \ldots \end{cases} \tag{7}$$
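
The zero-inflated probabilities in Equations (5) to (7) can be sketched in a few lines of Python. The function names and the parameter values below are illustrative assumptions; the log-gamma form of the binomial coefficient is used only so that non-integer k is handled.

```python
import math

def poisson_pmf(x, theta):
    return math.exp(-theta) * theta ** x / math.factorial(x)

def negbin_pmf(x, k, p):
    # Generalised binomial coefficient C(k + x - 1, x) for real k > 0, via log-gamma.
    log_coef = math.lgamma(k + x) - math.lgamma(k) - math.lgamma(x + 1)
    return math.exp(log_coef + k * math.log(p) + x * math.log(1 - p))

def zero_inflated_pmf(x, rho, base_pmf):
    """Equation (5): mix a point mass at zero (weight rho) with base_pmf."""
    base = base_pmf(x)
    return rho + (1 - rho) * base if x == 0 else (1 - rho) * base

rho, theta = 0.3, 2.0  # illustrative values
zip_p0 = zero_inflated_pmf(0, rho, lambda v: poisson_pmf(v, theta))
zip_total = sum(
    zero_inflated_pmf(x, rho, lambda v: poisson_pmf(v, theta)) for x in range(60)
)

k, p = 0.5, 0.4  # illustrative values
zinb_p0 = zero_inflated_pmf(0, rho, lambda v: negbin_pmf(v, k, p))
```

Here `zip_p0` corresponds to the first branch of Equation (6) and `zinb_p0` to the first branch of Equation (7); summing the ZIP probabilities over a long enough range should return a value very close to one.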

Using (5) and (1) and applying the usual definitions for E(X) and V(X), the mean and variance of a zero-inflated distribution are given as:

$$E(X) = (1-\rho)\,\frac{\theta f'(\theta)}{f(\theta)}, \qquad V(X) = (1-\rho)\,\theta\left\{\frac{\theta f''(\theta)}{f(\theta)} + \frac{f'(\theta)}{f(\theta)} - (1-\rho)\,\theta\left[\frac{f'(\theta)}{f(\theta)}\right]^2\right\} \tag{8}$$

where f ( θ ) is given in
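
As a worked instance of Equation (8), take the Poisson series function $f(\theta) = e^{\theta}$, for which $f'(\theta)/f(\theta) = f''(\theta)/f(\theta) = 1$. The mean and variance of the ZIP distribution then reduce to:

$$E(X) = (1-\rho)\,\theta, \qquad V(X) = (1-\rho)\,\theta\,\{\theta + 1 - (1-\rho)\,\theta\} = (1-\rho)\,\theta\,(1 + \rho\,\theta)$$

so that $V(X)/E(X) = 1 + \rho\,\theta$, which exceeds one whenever $\rho > 0$.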

In estimating the zero-inflation parameters, three methods seem to have been used prominently. The method of moments (MM) is said to provide estimates which are not very accurate. Nanjundan and Naika [

Let $X_1, X_2, \ldots, X_n$ be a random sample from a probability distribution of the form in (5). For a p-parameter ZI model, the estimates are obtained by solving the equations:

$$E(X^{j}) = \frac{1}{n}\sum_{i=1}^{n} X_i^{j} \quad \text{for } j = 1, 2, \ldots, p \tag{9}$$

For a two-parameter ZI model, Equation (9) leads to the following two equations:

$$(1-\rho)\,\frac{\theta f'(\theta)}{f(\theta)} = \bar{x}, \qquad (1-\rho)\,\theta\left\{\frac{\theta f''(\theta)}{f(\theta)} + \frac{f'(\theta)}{f(\theta)}\right\} = \frac{\sum x_i^2}{n} \tag{10}$$

Equation (10) may then be solved simultaneously to obtain estimates for ρ and θ.

The likelihood function for a ZI model given in Equation (5) may be written as:

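$$L(\theta, \rho; \underline{x}) = \prod_{i=1}^{n} \left\{\rho + (1-\rho)\,\frac{a_0}{f(\theta)}\right\}^{1-\tau_i} \left\{(1-\rho)\,\frac{a_{x_i}\theta^{x_i}}{f(\theta)}\right\}^{\tau_i} \tag{11}$$

where $x_i = 0, 1, 2, \ldots$ and $\tau_i = \begin{cases} 0 & \text{if } x_i = 0 \\ 1 & \text{if } x_i \neq 0 \end{cases}$.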

The log-likelihood function is then given as:

$$l = n_0 \ln\left\{\rho + (1-\rho)\,\frac{a_0}{f(\theta)}\right\} + \sum_i \tau_i \ln(1-\rho) + \sum_i \tau_i \ln a_{x_i} + \sum_i \tau_i x_i \ln\theta - \sum_i \tau_i \ln f(\theta) \tag{12}$$

where $n_0$ is the number of zeros in the observed sample. Solving $\partial l / \partial \rho = 0$ and $\partial l / \partial \theta = 0$ for θ and ρ, the following expressions are obtained after some simplification:

$$\hat{\rho} = \frac{n_0 f(\hat{\theta}) - n a_0}{n\,\left(f(\hat{\theta}) - a_0\right)} \tag{13}$$

$$\bar{x} = \frac{\hat{\theta} f'(\hat{\theta})}{f(\hat{\theta}) - a_0} \tag{14}$$
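
The step from Equation (12) to Equation (13) can be sketched as follows, writing $f$ for $f(\theta)$ and using $\sum_i \tau_i = n - n_0$. This is a reconstruction of the intermediate algebra, not a reproduction of the author's working:

$$\frac{\partial l}{\partial \rho} = \frac{n_0\,(1 - a_0/f)}{\rho + (1-\rho)\,a_0/f} - \frac{n - n_0}{1-\rho} = 0 \;\Longrightarrow\; n_0\,(f - a_0)(1-\rho) = (n - n_0)\,\{\rho f + (1-\rho)\,a_0\}$$

Collecting the terms in $\rho$ gives $n\,\rho\,(f - a_0) = n_0 f - n a_0$, which is Equation (13). Substituting this $\hat{\rho}$ back into the zero-class probability $\rho + (1-\rho)\,a_0/f$ returns $n_0/n$, so the fitted proportion of zeros equals the observed proportion.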

The Newton-Raphson iterative method provides a numerical solution to an equation of the form $f(\theta) = 0$. Given the current value $\theta_n$, the next estimate is given as

$$\theta_{n+1} = \theta_n - \frac{f(\theta_n)}{f'(\theta_n)}.$$
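
A minimal sketch of the Newton-Raphson update is given below. The function name `newton_raphson`, the tolerance, and the sample mean of 2.0 are hypothetical choices made for illustration; the target equation is Equation (18) divided through by $e^{\theta}$.

```python
import math

def newton_raphson(g, g_prime, start, tol=1e-10, max_iter=100):
    """Iterate t <- t - g(t)/g'(t) until the step size falls below tol."""
    t = start
    for _ in range(max_iter):
        step = g(t) / g_prime(t)
        t -= step
        if abs(step) < tol:
            return t
    raise RuntimeError("Newton-Raphson did not converge")

# Illustrative target: x_bar * (1 - exp(-t)) = t, written as g(t) = 0.
x_bar = 2.0  # hypothetical sample mean
g = lambda t: t - x_bar * (1.0 - math.exp(-t))
g_prime = lambda t: 1.0 - x_bar * math.exp(-t)
root = newton_raphson(g, g_prime, start=2.0)
```

Because this equation also has the trivial root at zero, the starting value should be taken well away from zero, for example at a moment-based estimate.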

Applying the Newton-Raphson iterative method to Equation (14), with the MM estimates from Equation (10) as initial values, the improved estimate of θ is:

$$\hat{\theta}_{r+1} = \bar{x}\left\{\frac{f(\hat{\theta}_r) - a_0}{f'(\hat{\theta}_r)}\right\} \tag{15}$$

Finally, substituting the estimate $\hat{\theta}$ from Equation (15) into Equation (13), an estimate of ρ is obtained. We shall now apply the estimation procedures outlined above to two probability distributions, namely ZIP and ZINB, as given in Equation (6) and Equation (7) respectively.

For the ZI-Poisson distribution, the corresponding function f(θ) from

$$\hat{\theta} = \frac{\sum_i f_i x_i^2}{n\bar{x}} - 1 \tag{16}$$

$$\hat{\rho} = 1 - \frac{n\bar{x}^2}{\sum_i f_i x_i^2 - n\bar{x}} \tag{17}$$
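
Equations (16) and (17) can be evaluated directly, as in the sketch below. The function name is hypothetical, and the three totals (n = 940, Σf_i x_i = 4972, Σf_i x_i² = 246976) are values worked out from the paper's egg-count frequency table for this illustration; they do not appear in the text itself.

```python
def zip_moment_estimates(n, sum_x, sum_x2):
    """Method-of-moments estimates for the ZIP model, Equations (16) and (17)."""
    x_bar = sum_x / n
    theta_hat = sum_x2 / (n * x_bar) - 1.0                 # Equation (16)
    rho_hat = 1.0 - n * x_bar ** 2 / (sum_x2 - n * x_bar)  # Equation (17)
    return theta_hat, rho_hat

# Totals derived from the egg-count frequency table (assumed inputs):
# n = sum f_i, sum_x = sum f_i x_i, sum_x2 = sum f_i x_i^2.
theta_mm, rho_mm = zip_moment_estimates(n=940, sum_x=4972, sum_x2=246976)
```

With these inputs the sketch gives values that round to the MME entries reported for the ZIP model (48.67 and 0.89).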

Likewise, the ML estimates for the ZIP model are obtained by substituting f(θ) in Equation (13) and Equation (14), resulting in:

$$\bar{x}\left(e^{\hat{\theta}} - 1\right) = \hat{\theta}\,e^{\hat{\theta}} \tag{18}$$

$$\hat{\rho} = \frac{n_0\,e^{\hat{\theta}} - n}{n\left(e^{\hat{\theta}} - 1\right)} \tag{19}$$

Using the MM estimate for θ as the initial value, Equation (18) is solved iteratively and the estimate $\hat{\theta}$ is subsequently substituted in Equation (19) to obtain $\hat{\rho}$.
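
The iterative solution of Equation (18) can be sketched as repeated substitution, which is what Equation (15) reduces to when $f(\theta) = e^{\theta}$ and $a_0 = 1$. The function name and stopping rule are hypothetical, and $\bar{x}$ is taken here to be the overall sample mean 4972/940, which is an assumption of this sketch.

```python
import math

def zip_ml_estimates(x_bar, n, n0, theta_start, tol=1e-10, max_iter=200):
    """Solve Equation (18) by repeated substitution, then apply Equation (19)."""
    theta = theta_start
    for _ in range(max_iter):
        # Equation (18) rearranged as theta = x_bar * (1 - exp(-theta)).
        theta_next = x_bar * (1.0 - math.exp(-theta))
        converged = abs(theta_next - theta) < tol
        theta = theta_next
        if converged:
            break
    else:
        raise RuntimeError("iteration did not converge")
    # Equation (19) with numerator and denominator multiplied by exp(-theta),
    # which avoids evaluating a large exp(theta).
    rho = (n0 - n * math.exp(-theta)) / (n * (1.0 - math.exp(-theta)))
    return theta, rho

theta_ml, rho_ml = zip_ml_estimates(x_bar=4972 / 940, n=940, n0=589, theta_start=48.67)
```

Under that reading of $\bar{x}$, the result rounds to the MLE entries reported for the ZIP model (5.26 and 0.62). If $\bar{x}$ is instead read as the mean of the non-zero counts, the same function applies with a different `x_bar`.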

To analyse the data using this model, we need to obtain a pooled estimate of k. Anscombe [^{2} with the sample values x ¯ and s^{2} and solve for k. For the count data the estimate of k is 0.12.
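
Reading Equation (4) with the sample mean and variance in place of μ and σ², the estimate of k can be written out as a short worked calculation, using the mean (5.29) and variance (235.01) of the egg counts reported in the paper:

$$s^2 = \bar{x} + \frac{\bar{x}^2}{k} \;\Longrightarrow\; \hat{k} = \frac{\bar{x}^2}{s^2 - \bar{x}} = \frac{5.29^2}{235.01 - 5.29} \approx 0.12$$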

Again the corresponding f ( θ ) for the Negative binomial distribution from

$$\hat{\theta} = \frac{\sum_i f_i x_i^2 - n\bar{x}}{k n \bar{x} + \sum_i f_i x_i^2} \tag{20}$$

$$\hat{\rho} = 1 - \frac{\bar{x}\left(k n \bar{x} + \sum_i f_i x_i^2\right)}{k\left(\sum_i f_i x_i^2 - n\bar{x}\right)} + \frac{\bar{x}}{k} \tag{21}$$
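
The route from Equation (10) to Equations (20) and (21) can be sketched as follows, assuming the Negative Binomial series function $f(\theta) = (1-\theta)^{-k}$ with k treated as known. The shorthand $m_2 = \sum_i f_i x_i^2 / n$ is introduced only for this sketch.

$$\frac{f'(\theta)}{f(\theta)} = \frac{k}{1-\theta}, \qquad \frac{f''(\theta)}{f(\theta)} = \frac{k(k+1)}{(1-\theta)^2}$$

Dividing the second equation of (10) by the first removes $(1-\rho)$:

$$\frac{m_2}{\bar{x}} = \frac{(k+1)\,\theta}{1-\theta} + 1 = \frac{1 + k\theta}{1-\theta} \;\Longrightarrow\; \hat{\theta} = \frac{m_2 - \bar{x}}{k\bar{x} + m_2}$$

which is Equation (20) after multiplying numerator and denominator by n. The first equation of (10) then gives

$$\hat{\rho} = 1 - \frac{\bar{x}\,(1-\hat{\theta})}{k\,\hat{\theta}} = 1 - \frac{\bar{x}}{k\,\hat{\theta}} + \frac{\bar{x}}{k}$$

and substituting $\hat{\theta}$ into the middle term yields Equation (21).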

Similarly, for the ML estimates, we substitute the respective f ( θ ) from

$$\bar{x} = \frac{k\,\hat{\theta}\,(1-\hat{\theta})^{-k-1}}{(1-\hat{\theta})^{-k} - 1} \tag{22}$$

$$\hat{\rho} = \frac{n_0\,(1-\hat{\theta})^{-k} - n}{n\left\{(1-\hat{\theta})^{-k} - 1\right\}} \tag{23}$$

Equation (22) is solved iteratively for $\hat{\theta}$ and substituted in (23) to obtain $\hat{\rho}$.
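
One way to solve Equation (22) numerically is sketched below. Bisection is used here as a simple bracketing method and is a substitute chosen for this illustration, not the iteration scheme described in the paper. The function name is hypothetical, k = 0.12 is the value reported in the paper, and $\bar{x}$ is assumed to be the overall sample mean 4972/940.

```python
def zinb_ml_estimates(x_bar, n, n0, k, tol=1e-12):
    """Solve Equation (22) for theta by bisection, then apply Equation (23)."""
    def excess(theta):
        # Right-hand side of Equation (22) minus x_bar.
        f = (1.0 - theta) ** (-k)
        f_prime = k * (1.0 - theta) ** (-k - 1.0)
        return theta * f_prime / (f - 1.0) - x_bar

    lo, hi = 1e-6, 1.0 - 1e-9  # bracket strictly inside (0, 1)
    if excess(lo) > 0 or excess(hi) < 0:
        raise ValueError("no sign change on (0, 1); check x_bar and k")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if excess(mid) < 0:
            lo = mid
        else:
            hi = mid
    theta = 0.5 * (lo + hi)
    f = (1.0 - theta) ** (-k)
    rho = (n0 * f - n) / (n * (f - 1.0))  # Equation (23)
    return theta, rho

theta_ml, rho_ml = zinb_ml_estimates(x_bar=4972 / 940, n=940, n0=589, k=0.12)
```

With these inputs the sketch returns an estimate of θ close to 0.92 and a negative estimate of ρ, in line with the zero-deflation remark made in the discussion of the results.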

The data to be used are some counts of eggs of Aphis fabae made by Dr. D. P. Jones in the course of a survey of the Eastern counties of England in 1947 and reproduced by Anscombe [

The mean and variance of the egg counts are 5.29 and 235.01, respectively, which is strong evidence of over-dispersion. A frequency distribution for the counts is given in

Eggs (X): | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|---|
Frequency: | 589 | 62 | 48 | 30 | 24 | 15 | 20 | 5 | 5 | 6 |

Eggs (X): | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 |
---|---|---|---|---|---|---|---|---|---|---|
Frequency: | 11 | 10 | 7 | 7 | 6 | 4 | 2 | 8 | 4 | 4 |

Eggs (X): | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 29 | 31 |
---|---|---|---|---|---|---|---|---|---|---|
Frequency: | 1 | 3 | 2 | 3 | 6 | 2 | 2 | 1 | 1 | 5 |

Eggs (X): | 32 | 33 | 34 | 35 | 36 | 38 | 39 | 40 | 42 | 43 |
---|---|---|---|---|---|---|---|---|---|---|
Frequency: | 2 | 3 | 2 | 5 | 1 | 1 | 2 | 1 | 1 | 1 |

Eggs (X): | 44 | 45 | 47 | 48 | 49 | 50 | 51 | 52 | 58 | 59 |
---|---|---|---|---|---|---|---|---|---|---|
Frequency: | 2 | 1 | 1 | 2 | 1 | 2 | 1 | 1 | 1 | 2 |

Eggs (X): | 65 | 66 | 70 | 82 | 83 | 84 | 105 | 110 | 120 | 123 |
---|---|---|---|---|---|---|---|---|---|---|
Frequency: | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 1 | 2 | 1 |

Eggs (X): | 148 | 163 |
---|---|---|
Frequency: | 1 | 1 |
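
The summary statistics quoted for the egg counts can be recomputed from the frequency table with the short sketch below. The variable names are hypothetical, and the use of the n − 1 divisor for the sample variance is an assumption of this sketch.

```python
# Egg counts and their frequencies, transcribed from the frequency table.
counts = [
    0, 1, 2, 3, 4, 5, 6, 7, 8, 9,
    10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
    20, 21, 22, 23, 24, 25, 26, 27, 29, 31,
    32, 33, 34, 35, 36, 38, 39, 40, 42, 43,
    44, 45, 47, 48, 49, 50, 51, 52, 58, 59,
    65, 66, 70, 82, 83, 84, 105, 110, 120, 123,
    148, 163,
]
freqs = [
    589, 62, 48, 30, 24, 15, 20, 5, 5, 6,
    11, 10, 7, 7, 6, 4, 2, 8, 4, 4,
    1, 3, 2, 3, 6, 2, 2, 1, 1, 5,
    2, 3, 2, 5, 1, 1, 2, 1, 1, 1,
    2, 1, 1, 2, 1, 2, 1, 1, 1, 2,
    1, 1, 2, 1, 1, 1, 1, 1, 2, 1,
    1, 1,
]

n = sum(freqs)                                       # number of observations
n0 = freqs[counts.index(0)]                          # number of zero counts
sum_x = sum(f * x for x, f in zip(counts, freqs))    # sum of f_i * x_i
sum_x2 = sum(f * x * x for x, f in zip(counts, freqs))  # sum of f_i * x_i^2

mean = sum_x / n
variance = (sum_x2 - n * mean ** 2) / (n - 1)        # sample variance (n - 1 divisor)
```

The totals `n`, `n0`, `sum_x` and `sum_x2` are the quantities that enter the moment and likelihood equations for ρ and θ.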

Method | ZIP model: $\hat{\theta}$ | ZIP model: $\hat{\rho}$ | ZINB model: $\hat{\theta}$ | ZINB model: $\hat{\rho}$ |
---|---|---|---|---|
MME | 48.67 | 0.89 | 0.98 | 0.12 |
MLE | 5.26 | 0.62 | 0.92 | 0.42 |

the ZIP and ZINB distributions. It is noted that the estimated value of the exponent for the ZINB model from the sample version of Equation (4) was k = 0.12, which in turn resulted in a negative estimate of the inflation parameter ρ; this may be interpreted as a case of zero-deflation. This may have been a consequence of the observed heterogeneity between sites, which was not addressed here and which ranged between 0 and 2032 within sites. As proposed by Anscombe [

It should be noted, however, that possible design errors, such as sampling practices, might also have caused the excess zeros.

Below are charts representing the Poisson frequency distribution.

therefore recommended that before implementing procedures to address the zero-inflation, the inherent heterogeneity between sites be further examined.

The author declares no conflicts of interest regarding the publication of this paper.

Sakia, R.M. (2018) Application of the Power Series Probability Distributions for the Analysis of Zero-Inflated Insect Count Data. Open Access Library Journal, 5: e4735. https://doi.org/10.4236/oalib.1104735