Comparative Assessment of Zero-Inflated Models with Application to HIV Exposed Infants Data

In a typical Kenyan HIV clinical setting, many zeros are likely to be registered during the routine monthly data collection of new HIV infections among HIV exposed infants (HEI). This is attributed to the implementation of the prevention of mother-to-child transmission (PMTCT) policies. However, even though the PMTCT policy is applied uniformly across all public health facilities, implementation naturally differs between facilities because of differences in health systems and infrastructure. This leads to structured zeros among reported positive HEI (where PMTCT implementation is optimum) and non-structured zeros among reported positive HEI (where PMTCT implementation is not optimum). Hence the classical zero-inflated and hurdle models, which do not account for the abundance of structured and non-structured zeros in the data, can give misleading results. The purpose of this study is to systematically compare the performance of various zero-inflated models, with an application to HEI data, in the context of structured and unstructured zeros. We revisit zero-inflated and hurdle models together with the Poisson and negative binomial count models, and conduct simulations that vary the sample size and the level of zero abundance. Results from the simulation study and from real data analysis of exposed infant diagnosis show the negative binomial emerging as the best-performing model when fitting data with both structured and non-structured zeros under various settings.


Introduction
Kenya has over the years implemented the World Health Organization (WHO) policy guidelines, in particular the prevention of mother-to-child transmission (PMTCT) policy, in an effort to mitigate sero-conversion among HIV exposed infants (HEI). The PMTCT policy includes averting transmission of HIV from mothers who live with HIV to their infants [1] [2] [3] [4] [5]. Research has shown that the majority of HIV sero-conversions among HEI occur during pregnancy, delivery or breastfeeding, which makes the PMTCT policy a priority in the public health sector [6] [7] [8]. An HEI can remain HIV negative if PMTCT measures are adhered to effectively. In the absence of PMTCT intervention the transmission rate is roughly between 15% and 45%; with interventions, it has been reduced to as low as 2% [3] [5]. Because of effective PMTCT intervention at different facilities, sero-conversion among HEI has reduced considerably [9] [10] [11] [12]; hence the data collected are zero-inflated (ZI) and therefore difficult to predict.
In a Kenyan HIV clinic setting, many zeros are likely to be registered during routine monthly data collection of new HIV infections among HEI. This is attributed to the implementation of PMTCT policies. Even though the PMTCT policy is effected uniformly across all public health facilities, implementation naturally differs between facilities because of differences in health systems and infrastructure. This leads to structured zeros among reported positive HEI (where PMTCT implementation is optimum) and non-structured zeros among reported positive HEI (where PMTCT implementation is sub-optimum). Failing to accommodate structured and non-structured zero-inflation may result in false inference [13]. Hence the classical ZI and hurdle models, which do not account for the abundance of structured and non-structured zeros in data, can give misleading results.
Several rigorous and non-rigorous approaches to the analysis of count data with zero inflation have been proposed by different researchers. ZI models [13] and zero-altered (ZA) models [14], the latter also known as hurdle models, model the extra zeros using logistic regression and the counts using count regression, but they do not account for both structured and non-structured zeros [15]. Javali et al. [16] carried out a study to determine factors associated with the experience of dental caries. The dataset contained abundant zeros and was analyzed using ZI models. The results showed that the zero-inflated Poisson (ZIP) model performed better than the conventional Poisson model, and that the zero-inflated negative binomial (ZINB) model performed better than the negative binomial (NB) model. In conclusion, the ZINB model performed better than the ZIP model when analyzing DMF count data. Akbarzadeh et al. [17] employed ZIP mixed models in evaluating prognostic factors of hepatitis C. The results showed that the mixed ZIP model was the best fit and was able to capture over-dispersion, serial dependence and zero-inflation in a longitudinal setting. Francois et al. [18] compared the performance of the Poisson, ZIP, NB and ZINB models by fitting lesion count data. The results showed that the NB and ZINB models were superior to the Poisson and ZIP models.
The main objective of this study is to determine the best models to use when dealing with both structured and non-structured zeros.

Materials and Methods

Zero-Inflated Poisson (ZIP) Model
The ZIP model postulates that the response Φ takes the value 0 with probability p and follows a Poisson(λ) distribution with probability (1 − p):

$$P(\Phi = 0) = p + (1-p)e^{-\lambda}, \qquad P(\Phi = k) = (1-p)\frac{e^{-\lambda}\lambda^{k}}{k!}, \quad k = 1, 2, \ldots$$

The mathematical expectation and variance of the ZIP distribution are given as:

$$E(\Phi) = (1-p)\lambda, \qquad \mathrm{Var}(\Phi) = (1-p)\lambda(1 + p\lambda).$$

Note that this distribution approaches the Poisson distribution as p approaches 0. The ZIP model has two components: one models the probability p of a structured zero using logistic regression, and the other models the Poisson mean λ. Thus, the presence of structured zeros not only gives rise to a more complex distribution, but also creates an additional link function for modeling the effect of explanatory variables on the occurrence of such zeros. In other words, the ZIP model enables us to better understand the effect of covariates by distinguishing the effect of each covariate on structural zeros from its effect on non-structural zeros.
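The ZIP mean and variance can be checked numerically. The following is a minimal Python sketch, assuming the standard ZIP parameterization with structural-zero probability p and Poisson mean λ; the function names and the values p = 0.3 and λ = 2 are illustrative assumptions, not estimates from the HEI data.

```python
import math


def zip_pmf(k, p, lam):
    """P(Y = k) for a zero-inflated Poisson (structural-zero probability p, Poisson mean lam)."""
    poisson = math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
    # A zero can be structural (probability p) or a sampling zero from the Poisson part.
    return p * (k == 0) + (1.0 - p) * poisson


def zip_moments(p, lam, upper=200):
    """Mean and variance obtained by summing the pmf over 0..upper."""
    mean = sum(k * zip_pmf(k, p, lam) for k in range(upper + 1))
    second = sum(k * k * zip_pmf(k, p, lam) for k in range(upper + 1))
    return mean, second - mean ** 2


p, lam = 0.3, 2.0  # hypothetical values chosen for illustration only
mean, var = zip_moments(p, lam)
print("numerical mean/variance:", mean, var)
print("closed form            :", (1 - p) * lam, (1 - p) * lam * (1 + p * lam))
```

The numerical and closed-form values should agree up to floating-point error, and setting p = 0 recovers the ordinary Poisson moments.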

Zero-Inflated Negative Binomial (ZINB) Model
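The ZINB model [19] [20] [21] is defined as a mixture distribution, with probability p assigned to zero-inflation and probability (1 − p) assigned to counts that follow an NB distribution:

$$P(\Phi = 0) = p + (1-p)\left(1+\frac{\lambda}{\tau}\right)^{-\tau}, \qquad P(\Phi = k) = (1-p)\frac{\Gamma(k+\tau)}{k!\,\Gamma(\tau)}\left(1+\frac{\lambda}{\tau}\right)^{-\tau}\left(1+\frac{\tau}{\lambda}\right)^{-k}, \quad k = 1, 2, \ldots$$

where λ is the mean of the underlying NB distribution, τ is a shape parameter and Φ is the response variable.
The variance of Φ is given as

$$\mathrm{Var}(\Phi) = (1-p)\lambda\left(1 + \frac{\lambda}{\tau} + p\lambda\right).$$

The distribution gets closer to the ZIP distribution as τ approaches infinity, and to the NB distribution as p approaches 0.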
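The two limiting cases can be illustrated numerically. This is a minimal Python sketch, assuming the common mean/shape parameterization of the NB distribution (mean λ, shape τ, variance λ + λ²/τ); the function names and parameter values are hypothetical and chosen only for illustration.

```python
import math


def nb_pmf(k, lam, tau):
    """NB pmf with mean lam and shape tau (assumed parameterization)."""
    log_pmf = (math.lgamma(k + tau) - math.lgamma(tau) - math.lgamma(k + 1)
               - tau * math.log1p(lam / tau)                # (1 + lam/tau)^(-tau)
               + k * (math.log(lam) - math.log(lam + tau)))  # (lam / (lam + tau))^k
    return math.exp(log_pmf)


def zinb_pmf(k, p, lam, tau):
    """Mixture of a point mass at zero (weight p) and an NB count (weight 1 - p)."""
    return p * (k == 0) + (1.0 - p) * nb_pmf(k, lam, tau)


p, lam, tau = 0.3, 2.0, 1.5  # illustrative values only
support = range(501)
mean = sum(k * zinb_pmf(k, p, lam, tau) for k in support)
var = sum(k * k * zinb_pmf(k, p, lam, tau) for k in support) - mean ** 2
print("numerical variance  :", var)
print("closed-form variance:", (1 - p) * lam * (1 + lam / tau + p * lam))

# Limiting behaviour: a very large shape parameter gives the ZIP zero probability,
# and p = 0 gives the plain NB pmf.
zip_zero = p + (1 - p) * math.exp(-lam)
print("P(0), tau = 1e7:", zinb_pmf(0, p, lam, 1e7), "vs ZIP:", zip_zero)
print("P(3), p = 0    :", zinb_pmf(3, 0.0, lam, tau), "vs NB:", nb_pmf(3, lam, tau))
```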

Zero-Altered Poisson (ZAP) Regression
Also called the Poisson hurdle (PH) model, the ZAP model has a hurdle part, which models zero against non-zero counts, and a zero-truncated Poisson part, which is used for the non-zero counts:

$$P(\Phi_i = 0) = \rho_i, \qquad P(\Phi_i = k) = (1-\rho_i)\frac{e^{-\lambda_i}\lambda_i^{k}}{k!\,(1-e^{-\lambda_i})}, \quad k = 1, 2, \ldots$$

where $\rho_i$ models all zeros. The ZAP model does not categorize the zeros in the data as structured or unstructured. Overlooking this distinction may lead to false interpretations of the results and the study findings.
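The difference between the hurdle and zero-inflated treatment of zeros can be made concrete. The sketch below, in Python, assumes the usual hurdle formulation (a single zero probability ρ plus a zero-truncated Poisson for the positive counts); the names and the values ρ = 0.88 and λ = 0.5 are illustrative assumptions.

```python
import math


def zap_pmf(k, rho, lam):
    """Hurdle (zero-altered) Poisson: every zero comes from the hurdle part."""
    if k == 0:
        return rho
    poisson = math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
    # Positive counts follow a Poisson truncated at zero, rescaled by (1 - rho).
    return (1.0 - rho) * poisson / (1.0 - math.exp(-lam))


def zip_zero_prob(p, lam):
    """For contrast: ZIP zeros mix structural zeros and Poisson sampling zeros."""
    return p + (1.0 - p) * math.exp(-lam)


rho, lam = 0.88, 0.5  # hypothetical values for illustration
total = sum(zap_pmf(k, rho, lam) for k in range(101))
mean = sum(k * zap_pmf(k, rho, lam) for k in range(101))
print("pmf sums to:", total)
print("hurdle P(0):", zap_pmf(0, rho, lam), "(does not depend on lam)")
print("ZIP P(0) with p = rho:", zip_zero_prob(rho, lam))
print("hurdle mean:", mean)
```

In the hurdle model the zero probability is ρ regardless of the count component, which is why it cannot separate structured from unstructured zeros.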

Zero-Altered Negative Binomial (ZANB) Regression
It is also known as the negative binomial logit hurdle (NBLH) model [?]. ZANB can be used in place of the Poisson hurdle when the data are over-dispersed. The ZANB model, which is an extension of the ZAP model, likewise treats all zeros alike rather than separating structured from unstructured zeros. Overlooking this distinction may lead to false interpretations of the results and the study findings.

Study Design and Setting
The PMTCT program in Kenya is coordinated by the National AIDS & STI Control Programme (NASCOP) through the government of Kenya. Kenya has been conducting exposed infant diagnosis (EID) among HEI since 2007. Resources increased, in particular in 2008-2009, when the guidelines were modified to include testing of all HEI. Since 2012 the EID testing algorithm for HEI has been implemented as follows: maternal and EID-PCR testing is conducted during the first visit for all HEI with unknown HIV status aged-after stopping of breastfeeding. The NASCOP database covers all infants receiving EID-PCR testing in Kenya. Data from the NASCOP database are publicly available and can be viewed on a national dashboard (http://eid.nascop.org/).

Study Population
From the study sampling frame, a total of 413 samples were collected from HEI who visited 60 health facilities across three cities in Kenya (Mombasa, Kisumu and Nairobi) between January 2016 and January 2017 and received PCR testing with results. HEI with missing age or older than 2 years were excluded from the analysis. HEI with other missing predictor variables were also excluded from the study.

Ethical Approval
The data used in this study are secondary and readily available from the National AIDS and STI Control Programme (NASCOP) website. No patient identification information is included in the NASCOP database. Furthermore, we obtained ethical approval from the Strathmore University Institutional Ethics Review Committee (SUIERC-0446/19).

Statistical Analysis
We conducted simulations to compare the different models. To obtain the model that best fits the data and has the lowest prediction error, stepwise regression was used for model selection, which identified the variables that were associated with the outcome of interest (EID positive). The models were also fitted to the HEI data and their performance was compared using AIC. Analysis was conducted using R Studio version 3.5.3.

Simulations
Simulated data were created with varying percentages of zeros and a fixed sample size of 500. A condition with no zero-inflation (ω = 0.00) was tested and used as a standard comparison point. The effect of over-dispersion was observed in the non-zero part. The dispersion parameter k took the pre-specified values 1, 10, 50 and 100. These values represent a practical range of dispersion for assessing the different models under study with varying distributions. The larger the value of k, the less dispersed the variable is, and it approaches a Poisson distribution when k > 10. A negative binomial distribution was used to generate the response variable, with different proportions of zeros added. Two covariates, X1 and X2, were also simulated. They were both assumed to come from a binomial distribution with µ = 4, with 1 trial for X1 and 10 trials for X2.
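The data-generating scheme described above can be sketched in Python using only the standard library. This is an illustrative sketch, not the authors' simulation code: the NB mean of 4, the zero-inflation levels 0.0 to 0.8, the seed, and the helper names `rpois`, `rnbinom` and `simulate` are all assumptions, and the covariates X1 and X2 are omitted for brevity.

```python
import math
import random


def rpois(lam, rng):
    """Poisson draw via Knuth's multiplication method (adequate for small means)."""
    threshold = math.exp(-lam)
    count, prod = 0, rng.random()
    while prod > threshold:
        count += 1
        prod *= rng.random()
    return count


def rnbinom(mu, k, rng):
    """NB draw as a gamma-Poisson mixture with mean mu and dispersion k."""
    lam = rng.gammavariate(k, mu / k)  # shape k, scale mu / k, so E[lam] = mu
    return rpois(lam, rng)


def simulate(n, omega, mu, k, seed=0):
    """n counts: an added zero with probability omega, otherwise an NB(mu, k) draw."""
    rng = random.Random(seed)
    return [0 if rng.random() < omega else rnbinom(mu, k, rng) for _ in range(n)]


N, MU = 500, 4.0  # sample size from the text; the NB mean is an assumed value
results = {}
for omega in (0.0, 0.2, 0.4, 0.6, 0.8):
    for k in (1, 10, 50, 100):
        y = simulate(N, omega, MU, k, seed=2019)
        results[(omega, k)] = sum(v == 0 for v in y) / N

for (omega, k), zero_frac in sorted(results.items()):
    print(f"omega={omega:.1f} k={k:>3} observed zero fraction={zero_frac:.3f}")
```

The observed zero fraction exceeds ω because the NB component contributes its own sampling zeros, most visibly at k = 1.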

Model Selection Criteria
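To determine the best model, the Akaike information criterion (AIC) was used. The model with the minimum AIC was considered the best fit to the data [22].
AIC is given by:

$$\mathrm{AIC} = -2\log L(\hat{\theta}) + 2q$$

where $L(\hat{\theta})$ is the maximized likelihood of the model and q is the number of estimated parameters.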
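As an illustration of how AIC trades fit against the number of parameters, the following Python sketch compares an intercept-only Poisson model with an intercept-only ZIP model on a small toy sample. The data, the function names and the crude grid search used in place of a proper optimizer are all assumptions made for this example; it is not the fitting procedure used in the study.

```python
import math


def poisson_loglik(counts, lam):
    return sum(-lam + y * math.log(lam) - math.lgamma(y + 1) for y in counts)


def zip_loglik(counts, p, lam):
    total = 0.0
    for y in counts:
        pois = math.exp(-lam + y * math.log(lam) - math.lgamma(y + 1))
        total += math.log(p * (y == 0) + (1.0 - p) * pois)
    return total


def aic(loglik, n_params):
    """AIC = -2 log L + 2 q, with q the number of estimated parameters."""
    return -2.0 * loglik + 2.0 * n_params


counts = [0, 0, 0, 1, 0, 2, 0, 0, 3, 0, 0, 4]  # toy zero-heavy counts, not HEI data

# Poisson: the maximum-likelihood estimate of the mean is the sample mean.
lam_hat = sum(counts) / len(counts)
aic_pois = aic(poisson_loglik(counts, lam_hat), n_params=1)

# ZIP: crude grid search over (p, lam) standing in for a numerical optimizer.
grid_p = [i / 100 for i in range(0, 96)]
grid_lam = [j / 20 for j in range(1, 121)]
ll_zip = max(zip_loglik(counts, p, lam) for p in grid_p for lam in grid_lam)
aic_zip = aic(ll_zip, n_params=2)

print("AIC Poisson:", round(aic_pois, 2))
print("AIC ZIP    :", round(aic_zip, 2))
print("preferred  :", "ZIP" if aic_zip < aic_pois else "Poisson")
```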

Simulation Results
The model with the lowest value of the Akaike Information Criterion (AIC) was taken as the preferable model. Under the condition of no zero inflation (ω = 0.00), the Poisson model was preferable at dispersion parameter k = 10, since it had the lowest AIC value with a low dispersion (see Table 1). When k = 1, 50 and 100 under the same condition of no zero inflation, the negative binomial was the most preferred model since it had a lower AIC than the other models.
When the data exhibited 20% zero inflation, the ZIP model was most preferred at k = 10. When the data exhibited 40% zero inflation, the most preferred model was the negative binomial at k = 1. When the data exhibited 60% zero inflation, the model with the lowest AIC was ZIP with k = 100. With 80% zeros, the most preferred model was ZAP at k = 1; Poisson had the highest AIC value and was therefore the least preferred among the models. Generally, ZAP had the lowest AIC value when the percentage of zeros was high.

Results from Empirical Data Analysis

Descriptive Statistics for Variables
Descriptive statistics, including means, frequencies and percentages, for the variables EID Positive, County, EID Testing Point, HEI prophylaxis and Maternal Prophylaxis are shown in Table 2. The median number of positive HEI recorded at the facilities was 0 (IQR = 0.13). Of the facilities sampled, 8.2% were from Kisumu county, 47.5% from Mombasa county and 44.3% from Nairobi county. HIV testing of exposed infants was mainly done when they were less than 2 months old (33.2%), since early detection of HIV infection in the child can support early treatment and special care. The HEI prophylaxis most often prescribed at the facilities was NVP + AZT (31.2%) and the least prescribed was NVP for 12 weeks (3.9%). For maternal prophylaxis, the most prescribed ARV regimen for the mothers was AZT + 3TC + ATV/r (15.3%) and the least prescribed was TDF + 3TC + DTG (0.2%).

Model Comparison Based on HEI Data
The HIV exposed infants data were fitted with the zero-inflated and zero-altered models (ZIP, ZAP, ZINB and ZANB), and the performance of the models was compared using AIC values. The models described in the methods section were used to fit the data, which had a mixture of structured and non-structured zeros. The AIC values for the different models are presented in Table 3.
For the Poisson model (referred to as model 1 in the analysis), stepwise model selection dropped the variable that was not significant (county) and retained EID Testing Point, PCR Type, Testing Point, HEI prophylaxis and Maternal Prophylaxis. For EID Testing Point, the significant category is 2-9 months: the risk is 2.39 times higher for HEI tested between 2-9 months than for those tested between 0-2 months. For PCR Type, the significant category is the 2nd/3rd PCR, which is 2.9 times as likely to give an HIV-positive result as the initial PCR. For HEI prophylaxis, in comparison with AZT for the first 6 weeks plus NVP for over 12 weeks, the risk with nevirapine during breastfeeding is 5 times higher, with nevirapine for 6 weeks (mother not breastfeeding) 6.5 times higher, with other drugs 3.2 times higher, with a combination of Sd NVP + AZT + 3TC 4.6 times higher and with Sd NVP only 3.4 times higher. For Maternal Prophylaxis, in comparison with AZT (from 14 wks or later) + Sd NVP + 3TC + AZT + 3TC for 7 days, the risk to the infant when the mother uses a combination of AZT + 3TC + EFV (Efavirenz, a capsule taken by mouth with plenty of water) is 5.6 times lower, with AZT + 3TC + LPV/r (Lopinavir/Ritonavir, which comes in tablet form) 3.6 times lower, with TDF + 3TC + ATV/r 3.5 times lower, with TDF + 3TC + LPV/r 3.9 times lower and with TDF + 3TC + NVP 3.3 times lower. The AIC value after fitting the Poisson model is 474.69, which makes it the second-best-fitting model for the EID data.
The negative binomial model (referred to as model 2 in the analysis) had the lowest AIC value, 429.19, and was therefore considered the best model. Stepwise model selection retained the following variables, which were considered significant and had an effect on the final AIC value: EID Testing Point, PCR Type, Testing Point, HEI prophylaxis and Maternal Prophylaxis.
The PCR type that was significant was the 2nd/3rd PCR, at which HIV is 2 times more likely to be detected in an HEI than at the initial PCR. For HEI prophylaxis, in comparison with AZT for 6 weeks + NVP for over 12 weeks, the risk with NVP during breastfeeding is 4.2 times higher, with nevirapine for 6 weeks (mother not breastfeeding) 5.4 times higher, with other drugs 2.3 times higher, with a combination of Sd NVP + AZT + 3TC 3.3 times higher and with Sd NVP only 2.4 times higher. For Maternal Prophylaxis, in comparison with AZT (from 14 wks or later) + Sd NVP + 3TC + AZT + 3TC for 7 days, the risk to the infant when the mother uses a combination of AZT + 3TC + ATV/r is 2.4 times higher, with AZT + 3TC + LPV/r (Lopinavir/Ritonavir, which comes in tablet form) 3.1 times lower, with TDF + 3TC + ATV/r 3.4 times lower, with TDF + 3TC + EFV 5.9 times lower, with TDF + 3TC + LPV/r 3.4 times lower and with TDF + 3TC + NVP 3 times lower.
In the ZIP model, fitting the data with all the variables and applying stepwise model selection dropped most of the variables and retained EID Testing Point and PCR Type, which were the significant variables. For EID Testing Point, the risk when testing the infant between 2-9 months is 2.9 times higher than when testing between 0-2 months. For PCR Type, HIV is 2 times less likely to be detected in an HEI at the confirmatory PCR than at the initial PCR. The model with these 2 variables gave an AIC value of 491.18.
In the ZINB model, backward stepwise regression dropped most of the variables that were not significant and left 2 variables, EID Testing Point and PCR Type. For EID Testing Point, the risk when testing the infant between 2-9 months is 2.6 times higher than when testing between 0-2 months. For PCR Type, HIV is 2.7 times less likely to be detected in an HEI at the confirmatory PCR than at the initial PCR. The AIC value for the ZINB model is 492.11, hence it is regarded as the worst model fit for the data.
In the hurdle negative binomial (ZANB) model, stepwise regression also dropped the insignificant variables and left EID Testing Point and PCR Type. For EID Testing Point, the risk when testing the infant between 2-9 months is 2.65 times higher than when testing between 0-2 months. For PCR Type, HIV is 2.7 times less likely to be detected in an HEI at the confirmatory PCR than at the initial PCR. The AIC value with the 2 variables was 491.73, which is the second-worst model fit for the data and hence not preferred.
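The ranking implied by the reported AIC values can be reproduced with a few lines of Python. The AIC figures below are the ones quoted in the text for the HEI data; the dictionary layout and the ΔAIC column (difference from the best model) are additions made here for illustration.

```python
# AIC values as reported in the text for the HEI data.
reported_aic = {
    "Poisson": 474.69,
    "NB": 429.19,
    "ZIP": 491.18,
    "ZINB": 492.11,
    "ZANB": 491.73,
}

# Rank models from lowest (preferred) to highest AIC.
ranked = sorted(reported_aic.items(), key=lambda item: item[1])
best_model, best_aic = ranked[0]

# Difference in AIC relative to the best model.
delta = {name: round(value - best_aic, 2) for name, value in reported_aic.items()}

for name, value in ranked:
    print(f"{name:<8} AIC={value:7.2f}  delta={delta[name]:6.2f}")
```

The three zero-inflated and hurdle fits differ from one another by less than one AIC unit, whereas the NB model is separated from every other model by a much larger margin.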

Discussion
Count data with a high number of zeros are commonly registered in medical research and public health, particularly the monthly number of HEI. Yip [23] and Lambert [13] proposed the ZI Poisson distribution, and Heilbron [24] utilised ZAP and NB distributions to model ZI data. Li et al. [25] derived a multivariate version of the ZIP model and used it to analyse equipment problems in the processing of electronics. Although different authors have widely used zero-inflated distributions, no practical study has systematically compared zero-inflated outcomes in HIV exposed infant settings. Because the ZI model involves the state parameter k and the parameter ρ, we conducted extensive simulations by varying the percentage of zeros and these parameters. The simulation results show that ZAP generally had the lowest AIC value when the percentage of zeros was high. This is consistent with the results from the application data because the percentage of zeros in the HEI dependent variable is 88% (see Table 8). The simulation procedure selected limited important model terms to maximize the ZI likelihood functions. In all these ZI models, EID testing point and PCR type were statistically significant. Based on the HEI data analysis, the proportion of HIV sero-conversion was higher for HEI tested between 2-9 months than for those tested earlier. Recent studies of patient outcomes showed that sero-status did not differ between boys and girls [3] [7] [9]. This could not, however, be verified in our data because we did not collect a gender covariate. Several studies have attempted to implement ZI model extensions in order to accommodate unstructured effects, e.g. [15] [24], but not in a public health setting where government policies are not implemented uniformly.
In the empirical analysis, the Poisson model (model 1) retained EID Testing Point, PCR Type, Testing Point, HEI prophylaxis and Maternal Prophylaxis after stepwise selection dropped county; the risk was 2.39 times higher for HEI tested between 2-9 months than for those tested between 0-2 months, and a 2nd/3rd PCR was 2.9 times as likely to give an HIV-positive result as the initial PCR. The NB model (model 2) had the lowest AIC value (429.19) and was considered the best model, followed by the Poisson model (474.69). The ZIP, ZINB and ZANB models each retained only EID Testing Point and PCR Type after stepwise selection; ZINB (AIC 492.11) and ZANB (AIC 491.73) were the worst and second-worst fits, respectively.
ZI models and zero-altered models give almost similar results, as shown by both the simulated data and the HEI data. In this study, the choice between the two relied heavily on the AIC value obtained from the analysis. Failure to account for zero-inflation while analyzing such data may result in false inferences. After the simulation study and the analysis of the EID data, the negative binomial emerges as the gold standard for fitting data with both structured and non-structured zeros.

Conclusion
Simulation results offer a general idea as to which model is most appropriate; however, more conditions need to be examined to establish a more accurate relationship between model selection and different levels of structured or unstructured zero-inflation. One limitation of the study is that the predictor variables for the zero component and the count component of the models were taken to be the same.
One area for further research is the issue of imbalanced covariates with missing data.
Open Journal of Statistics