Superiority of Bayesian Imputation to Mice in Logit Panel Data Models

Peter Otieno Opeyo; Weihu Cheng; Zhao Xu

doi:10.4236/ojs.2023.133017

Open Journal of Statistics > Vol.13 No.3, June 2023

Superiority of Bayesian Imputation to Mice in Logit Panel Data Models

Peter Otieno Opeyo^1,2, Weihu Cheng¹, Zhao Xu¹
¹Department of Statistics, Beijing University of Technology, Beijing, China.
²Department of Statistics and Computational Mathematics, The Technical University of Kenya, Nairobi, Kenya.
DOI: 10.4236/ojs.2023.133017 PDF HTML XML 226 Downloads 910 Views Citations

Abstract

Non-responses leading to missing data are common in most studies and causes inefficient and biased statistical inferences if ignored. When faced with missing data, many studies choose to employ complete case analysis approach to estimate the parameters of the model. This however compromises on the susceptibility of the estimates to reduced bias and minimum variance as expected. Several classical and model based techniques of imputing the missing values have been mentioned in literature. Bayesian approach to missingness is deemed superior amongst the other techniques through its natural self-lending to missing data settings where the missing values are treated as unobserved random variables that have a distribution which depends on the observed data. This paper digs up the superiority of Bayesian imputation to Multiple Imputation with Chained Equations (MICE) when estimating logistic panel data models with single fixed effects. The study validates the superiority of conditional maximum likelihood estimates for nonlinear binary choice logit panel model in the presence of missing observations. A Monte Carlo simulation was designed to determine the magnitude of bias and root mean square errors (RMSE) arising from MICE and Full Bayesian imputation. The simulation results show that the conditional maximum likelihood (ML) logit estimator presented in this paper is less biased and more efficient when Bayesian imputation is performed to curb non-responses.

Keywords

Panel Data, Imputation, Monte Carlo, Bias, Conditional Maximum Likelihood

Share and Cite:

Opeyo, P. , Cheng, W. and Xu, Z. (2023) Superiority of Bayesian Imputation to Mice in Logit Panel Data Models. Open Journal of Statistics, 13, 316-358. doi: 10.4236/ojs.2023.133017.

1. Introduction

Complications in statistical analyses due to missing values in data sets have been of major concern to researchers in virtually all fields of study. If the response variable is binary, these complications are compounded by the nonlinear nature of model specification adopted to relate the continuous covariates to the categorical response. Numerous studies show that frequentist approaches to missing data, like complete case analysis, often lead to biased estimates and substantial loss of power [1] [2] [3] [4] . The logit panel data model may therefore not be an exception to this problem.

Model based techniques of imputation have proved to be more reliable than single value imputation. One customary model-based approach is to perform multiple imputations with chained equations (MICE) [4] [5] which has been widely accepted due to its good performance and ease of use. Available statistical tools designed for MICE operate in three key stages: 1) a small number (usually five or ten) of data sets are created by imputing each missing value multiple times; 2) analyzing each of the completed data sets using known complete data methods; 3) obtaining the overall result by pooling the derived estimates to take additional uncertainty created by the missing values into account. The separation of the imputation and analysis stages of MICE calls for the need to have imputation models which contain not only other covariates in the predictor function but also the outcome [6] . When the outcome is univariate and the data is complete, this can be easily performed because the outcome is just one of the variables in the data set. However, when the outcome has a multivariate nature, such as a binary outcome study, the outcome variable inclusion function becomes subjective and may take simple or complex summaries of the long trajectories, such as including only a single observed value, average value or the area under the trajectory. Evidently, it is not easy to settle for the most adequate representation for a specific analysis model, and except in very simple situations, some of the summaries adopted discard relevant information. Comparative assessment done herein shows that the inclusion of inadequate summary measures of subjects’ trajectories can lead to increased bias.

A full Bayesian imputation technique has been observed to combine the analysis model with the imputation models without having to specify an appropriate summary measure for the outcome variable. The milestone difference to MICE is that by combining the imputation and analysis in one estimation procedure, the Bayesian approach obtains inferences on the posterior distribution of the parameters and missing covariates directly and jointly. Thereby, the whole trajectory of the longitudinal outcome is implicitly taken into account in the imputation of missing covariate values, and no summary representation is necessary. This study embraces the works of Ibrahim et al. [7] , who propose a decomposition of the likelihood into a series of univariate conditional distributions which produces the sequential full Bayesian (SFB) approach that is flexible and easy to implement as an alternative to MICE. Besides, the uncertainty about the missing values is automatically taken into account and no pooling is necessary since missing values are continuously imputed in each iteration of the estimation procedure. Stubbendick and Ibrahim [8] , in their study used the likelihood based approach that factorized the joint likelihood into a sequence of conditional distributions. This approach is akin to SFB except for the model used. In addition, other authors have shown how to apply weighted estimating equations for inference in settings with incomplete data [9] [10] . Generally, studies agree that model based imputation techniques are superior to single value imputations. However, not much literature is found on conditional logit models as it were.

In this study, we describe a few strategies to include a binary outcome in the MICE algorithm and compare them with the Bayesian approach. Each of the methods was evaluated using simulation from which biases and root mean square errors (RMSE) are compared to discern the better of the two methods.

We structure the rest of the paper as follows: Section 2 briefly describes the logit panel data model. In Section 3, we introduce the problem of missing data and describe and compare the two methods of interest, MICE, and the SFB approach. Both methods are applied to simulated data in Section 4. An evaluation of the bias and RMSE for both approaches is described in Section 5 where the findings are summarized and discussed.

2. Model Specification

2.1. The Logit Panel Data Model

A general panel data model is a special case of the generalized longitudinal model where a population unit i is observed across different time periods and takes the form

$y_{i t} = {x^{'}}_{i t} β + c_{i} + γ_{t} + u_{i t}, i = 1, \dots, N; t = 1, \dots, T$ (1)

where $y_{i t}$ is the tth observation of individual i, measured at time t, $β$ denotes the vector of k regression coefficients of the design matrix of the fixed effects $X_{i}$ , where $x_{i t}$ is a column vector that contains the tth row of that matrix. $c_{i}$ is an individual-specific time-invariant parameter while $γ_{t}$ is a time-specific individual-invariant parameter. $u_{i t}$ is an error term that is normally distributed with mean zero and variance $σ_{y}^{2}$ . In many studies, individuals or subjects are assumed to change over time, and that not any two of them have the same characteristics at a time t. As such, we only have the individual specific effects, letting $γ_{t} = 0$ , Equation (1) takes the form

$y_{i t} = {x^{'}}_{i t} β + c_{i} + u_{i t}$ (2)

The association between $c_{i}$ and $u_{i t}$ establishes whether the relation (2) is treated as fixed or random effects model. This is to say that if $c_{i}$ is correlated with $x_{i t}$ then the model has only $u_{i t}$ as the stochastic part and $c_{i}$ is treated as fixed (non-random) and consequentially, we have a fixed effect (FE) panel model. On the other hand, (2) is a random effect (RE) model, if $c_{i}$ becomes part of the stochastic part $u_{i t}$ . Cumulative Equation (2) has a total of $k + N$ parameters to be estimated as $\hat{β}$ and ${\hat{c}}_{i}$ . For FE, one does not estimate the effects of the variables that are individual specific and time-invariant but rather controls for them or “partials them out” to reduce the bias from other omitted variables.

For a continuous dependent variable $y_{i t}$ , the parameters in the panel data model can be estimated unbiasedly and efficiently. Unfortunately, however, the dependent variable may be categorical which calls for specific nonlinear functions that preserve the structure of the dependent variable to be considered. The logistic function is one such nonlinear relation that yields the logit model. Conditional maximum likelihood estimation is the most preferred method for logistic regressions. In this method, the conditional maximum likelihood “conditions” the (fixed effects) out of the likelihood function [16] . This is done by conditioning the likelihood function on the total number of events observed for each subject as was first explained by Chamberlain [11] .

Most economic studies have the dependent variable as categorical with two (or more) options indicating a success or a failure of an event. Such dependent variable is normally represented by a binary choice variable

$y_{i t} = {\begin{array}{l} 1 & iftheeventhappens \\ 0 & iftheeventfails \end{array}$ for individual i at time t. Then $y_{i t}$ follows a binomial distribution with probability of success for individual i at time t as $p_{i t}$ and

$E (y_{i t}) = 1 \times p_{i t} + 0 \times (1 - p_{i}) = p_{i t}$ (3)

This expected value of the binary outcome is modeled as a function of some explanatory variables or covariates as

$p_{i t} = P r (y_{i t} = 1) = E (y_{i t} | x_{i t}, c_{i}) = F ({x^{'}}_{i t} β + c_{i})$ (4)

where $F (\cdot)$ is a link function that relates the binary outcome to the various types of covariates in. Here, we assume strict exogeneity holds i.e. the residual $u_{i t}$ is uncorrelated with all x-variables over the entire time period spanned by the panel so that $u_{i t}$ has a zero-mean.

If F is the unit function for all ${x^{'}}_{i t} β + c_{i}$ then under random sampling, the unconditional probability that $y_{i t} = 1$ is equal to the unconditional expected value of $y_{i t}$ , i.e. $P r (= 1 | x_{i t}, c_{i}) = E (y_{i t} = 1 | x_{i t}, c_{i}; β)$

So if the binary response model above is correctly specified, we have

$\begin{array}{l} P r (y_{i t} = 1 | x_{i t}, c_{i}) = {x^{'}}_{i t} β + c_{i} \\ P r (y_{i t} = 0 | x_{i t}, c_{i}) = 1 - ({x^{'}}_{i t} β + c_{i}) \end{array}}$ (5)

Model (5) is referred to as the linear probability model (LPM) since the probability of success is expressed as a linear function of the explanatory variables in the vector $x$ and the parameters can be estimated by OLS or within estimator. A major shortfall of LPM when used to estimate the parameters for any discrete choice response variable is that we can get predicted “probabilities” either less than zero or greater than one which of course absurdly contravenes the definition of probability.

The problems of LPM can be addressed by choosing a monotonically increasing function F such that $0 < F ({x^{'}}_{i t} β + c_{i}) < 1$ and

$\begin{array}{l} P r (y_{i t} = 1 | x_{i t}, c_{i}) \to 1 as x_{i t} β + c_{i} \to \infty \\ P r (y_{i t} = 1 | x_{i t}, c_{i}) \to 0 as x_{i t} β + c_{i} \to - \infty \end{array}}$ . (7)

Thus F is a nonlinear function, and hence we cannot use a linear regression model estimation techniques. Various non-linear functions for F have been suggested in the literature and the most common ones are the logistic distribution, yielding the logit model, and the standard normal distribution, yielding the probit model.

In the logit model, F takes the form,

$F ({x^{'}}_{i t} β + c_{i}) = \frac{e^{{x^{'}}_{i t} β + c_{i}}}{1 + e^{{x^{'}}_{i t} β + c_{i}}}$ (8)

which is between zero and one for all values of ${x^{'}}_{i t} β$ . This is the cumulative distribution function (CDF) for a logistic variable whose parameters can be estimated.

2.2. Implications of the Probability Function Pr(.) for Binary Outcomes

When the outcome variable is binary by specification, its probability function Pr(.) only permits two plausible values for either a success or a failure irrespective of the functional forms of the covariates in the model. Overlooking this property of the binary outcome yields a linear probability model (LPM) (5) which has quite a number of limitations. Among these limitations are: 1) we can get predicted “probabilities” either less than zero or greater than one from the LPM which is absurd since predictions outside this range are meaningless and somewhat embarrassing; 2) conceptually, it does not make sense to say that a probability is linearly related to a continuous independent variable for all possible values. If it were, then continually increasing this explanatory variable would eventually drive $P (y = 1 | x)$ above one or below zero; 3) the residual of the LPM is heteroskedastic and the only best way of solving this problem is to obtain estimates of the standard errors that are robust to heteroskedasticity; 4) the residual is not normally distributed hence small sample inferences cannot be based on the usual suite of normality-based distributions such as the t test.

Equation (8) now represents what is known as the (cumulative) logistic distribution function which cushions against the limitations of LPM. It is easy to verify that as ${x^{'}}_{i t} β + c_{i}$ ranges from −∞ to +∞, F ranges between 0 and 1 and that F is nonlinearly related to ${x^{'}}_{i t} β + c_{i}$ (i.e., ${x^{'}}_{i t}$ ), thus satisfying the requirements for a probability function. However, in satisfying these requirements, we have created an estimation problem because F is nonlinear not only in X but also in the β’s as can be seen clearly from (8). This implies that we cannot use the familiar OLS procedure to estimate the parameters. But through linearization, this problem becomes more apparent than real. This is done by obtaining the odds ratio $\frac{P r (y_{i t} = 1 | x_{i t}, c_{i})}{P r (y_{i t} = 0 | x_{i t}, c_{i})}$ in favor of a success i.e. the ratio of the probability that $y_{i t} = 1$ to the probability that $y_{i t} = 0$ . It is realized that the logarithm of the odds ratio, is not only linear in X, but also (from the estimation viewpoint) linear in the parameters. $\ln (\frac{P r (y_{i t} = 1 | x_{i t}, c_{i})}{P r (y_{i t} = 0 | x_{i t}, c_{i})}) = {x^{'}}_{i t} β + c_{i}$ . This log of the odds ratio is called the logit, and hence the name logit model.

3. Dealing with Missing Data—Model-Based Approach for Logit Panel Data Models

The relationship between a binary outcome y and predictor variables X is a linear mixed model expressed in a standard modeling framework as Equation (1).

If the data setting is complete and balanced, the probability density function of interest is $p (y_{i t} | x_{i t}, θ_{Y | X})$ where $θ_{Y | X}$ denotes the vector of all parameters of the model. Conversely, when missingness results in some of the covariates being incomplete, $X$ is partitioned into two parts, $X = [X_{o b s} X_{m i s}]$ : the completely observed variables $X_{o b s}$ and those variables containing missing value $X_{m i s}$ . We then have a measurement model that depends on unobserved data expressed as $p (y_{i t} | x_{i t, o b s}, x_{i t, m i s}, θ_{Y | X})$ , which cannot be handled by the standard complete data methods efficiently.

Two key assumptions now need to be made so that we may not require any digression from standard approaches of handling mixed effects models. These assumptions are (a) that we only have cross-sectional covariates to impute and (b) that the missing data mechanism of the outcome is ignorable, that is, missing at random (MAR) or missing completely at random (MCAR) [12] .

As it were, despite having various frequentist techniques of imputing on missing data, we narrow our imputation to model-based techniques and give in-depth discussion on multiple imputation using chained equations (MICE) and sequential full Bayesian approach (SFB). This concept of replacing a missing datum with multiple values was hatched by Donald B. Rubin [13] . He stated that

“[…] of course (1) imputing one value for missing datum can’t be correct in general, and (2) in order to insert sensible values for a missing datum we must rely more or less on some model relating unobserved values to observed values.”

1) Multiple imputations using chained equations (MICE)

The underlying principle of MI is to divide the analysis of incomplete data into three steps: MICE has become a popular imputation technique by the fact that it allows for multivariate missing data and does not require a specific missingness pattern. Since the core principle of multiple imputations is divided into three stages (imputation, analysis, and pooling), MICE becomes very vital in addressing the initial stage of imputation.

Performing MICE under certain regularity conditions, the multivariate distribution

$p (x_{i, m i s} | y_{i}, x_{i, o b s}, θ_{X})$ (9)

with $x_{i, m i s} = {(x_{i, m i s_{1}}, \dots, x_{i, m i s_{p}})}^{'}$ and $x_{i, o b s} = {(x_{i, o b s_{1}}, \dots, x_{i, o b s_{q}})}^{'}$ can be uniquely determined by its full-conditional distributions. Van Buuren et al. [14] assert that through this unique determination, Gibbs sampling of the conditional distributions can be used to produce a sample from (9). Though the MICE procedure does not actually start from a specification of (9), it directly defines a series of conditional, predictive models of the form

$p (x_{i, m i s_{l}} | x_{i, m i s_{- l}}, x_{i, o b s}, y_{i}, θ_{X_{l}})$ (10)

which links each incomplete predictor variable $x_{i, m i s_{l}}$ , $l = 1, \dots, p$ , with other incomplete and complete predictors, $x_{i, m i s_{- l}}$ and $x_{i, o b s}$ , respectively, without forgetting the response $y_{i}$ . The predictive distributions (1.2) are drawn from the extended exponential family with linear predictor expressed as

$g_{l} {E (x_{i, m i s_{l}} | x_{i, m i s_{- l}}, x_{i, o b s}, y_{i}, γ_{l}, ξ_{l}, α_{l})} = {γ^{'}}_{l} x_{i, o b s} + {ξ^{'}}_{l} x_{i, m i s_{- l}} + {α^{'}}_{l} h ( y i )$

where $g_{l} (\cdot)$ is the one-to-one monotonic link function for the 𝓁th covariate and $γ_{l}$ , $ξ_{l}$ and $α_{l}$ are vectors of parameters relating the complete and missing covariates and the outcome to $x_{i, m i s_{l}}$ .

The function $h (\cdot)$ specifies how the outcome enters the linear predictor. In the univariate case, the default choice for $h (y_{i})$ is simply the identity function. However, when we have a multivariate $y_{i}$ , such as a binary outcome, we cannot always simply specify $α^{'} y_{i}$ because $y_{i}$ may take on two different values for an individual i at time t. Hence, it is not meaningful to use the same regression coefficient $α_{l j}$ to connect outcomes of different individuals with $x_{i, m i s_{l}}$ , and a representation needs to be found that summarizes $y_{i}$ and that has the same number of elements that also have the same interpretation for all individuals.

We choose to proposed some relations for $h (y_{i})$ in this study as below

$h (y_{i}) = 0,$ (11)

$h (y_{i}) = y_{i t},$ (12)

$h (y_{i}) = \frac{1}{T} \sum_{t = 1}^{T} y_{i t},$ (13)

$h (y_{i}) = \sum_{t = 1}^{T - 1} (s_{i t + 1} - s_{i t}) \frac{y_{i t} + y_{i t + 1}}{2} .$ (14)

From the above relations, it can be deduced that $h (y_{i})$ from:

i) (11) omits the response completely from the relation $g_{l} (\cdot)$ .

ii) (12) uses only one observation chosen from the outcomes for i^th individual.

ii) (13) uses the average of the observed outcomes.

iv) (14) uses the trapezoidal area under the curve of $y_{i}$ around the observation times t and t + 1 as exemplified in Figure 1 below.

Linear combinations of the above relations may still suffice and so there can be infinitely many options of specifying the predictor function g(·). It is not therefore a simple task to determine an appropriate $y_{i}$ to use unless the setting is too simple as well. Only in very simple settings is it possible to determine which function of $y_{i}$ is appropriate. Under a random intercept model for $y_{i}$ , (13) is the appropriate summary function in the imputation model for a normal

Figure 1. Shows the area under the trajectory used as a summary of the response variable y over the study periods.

cross-sectional covariate [15] whereas for more complex analysis models or discrete covariates, it is not straightforward to derive the appropriate summary functions.

The splendid of all inclusions of the outcome variable into the predictor function would be to one in which all available outcome variables are captured in the function $h (y_{i})$ . This however can only be possible in settings where the outcome is balanced or close to balanced and does not have a large number of repeated measurements. Alternatively, we could impute missing outcome values so that all individuals have the same number of measurements at approximately the same time points.

In order to achieve high validity of imputations using MICE, we need to ensure that (a) the imputation models are correctly specified and (b) the missing data mechanism is ignorable—MAR or MCAR [12] [16] .

2) Full Bayesian imputation

Since MICE may be faced by the challenge of choosing the best summary representation of a multivariate outcome, we can settle for a full Bayesian approach. In this approach, the complete data posterior is obtained by combining the complete data likelihood with prior information to yield

$\begin{array}{l} p (θ_{Y | X}, θ_{X}, x_{i, m i s} | y_{i t}, x_{i, o b s}) \\ \propto p (y_{i t} | x_{i, m i s}, x_{i, o b s}, θ_{Y | X}) p (x_{i, m i s} | x_{i, o b s}, θ_{X}) π (θ_{Y | X}) π ( θ X ) \end{array}$

where $θ_{X}$ is a vector containing parameters that are associated with the likelihood of the partially observed covariates X_mis, and $π (θ_{Y | X})$ and $π (θ_{X})$ are prior distributions. Ibrahim et al., [7] explain that a convenient way to specify the joint likelihood of the missing covariates $p (x_{i, m i s} | x_{i, o b s}, θ_{X})$ is to use a sequence of conditional univariate distributions

$\begin{array}{l} p (x_{i, m i s_{i}}, \dots, x_{i, m i s_{p}} | x_{i, o b s}, θ_{X}) \\ = p (x_{i, m i s_{1}} | x_{i, o b s}, θ_{X_{1}}) \prod_{l = 2}^{p} p (x_{i, m i s_{l}} | x_{i, o b s}, x_{i, m i s_{1}}, \dots, x_{i, m i s_{l - 1}}, θ_{X_{l}}) \end{array}$ (15)

with $θ_{X} = {(θ_{X_{1}}, \dots, θ_{X_{p}})}^{'}$ . It is similarly assumed that each of these distributions is again chosen from the extended exponential family and in accordance with the type of the respective variable. An advantage of this sequential way of writing the joint distribution of the covariates is that it provides a straightforward way to specify the joint distribution even when the covariates are of mixed type.

After specifying the prior distributions $π (θ_{Y | X})$ and $π (θ_{X})$ , Markov chain Monte Carlo (MCMC) methods, such as Gibbs sampling, can be used to draw samples from the joint posterior distribution of all parameters and missing values. Because all missing values are imputed in each iteration of the Gibbs sampler, the additional uncertainty created by the missing values is automatically taken into account, and no pooling is necessary.

We opt to adopt the full likelihood instead of the series of predictive models (10) due to one key advantage: that we can choose how to factorize this full likelihood. Precisely, by factorizing the joint distribution $p (y_{i t}, x_{i, m i s}, x_{i, o b s} | θ_{Y | X}, θ_{X})$ in to the conditional distribution $p (y_{i t} | x_{i, m i s}, x_{i, o b s}, θ_{Y | X})$ and the marginal distribution $p (x_{i, m i s} | x_{i, o b s}, θ_{X})$ , the joint posterior distribution can be specified without having to include the outcome into any predictor, and no summary representation $h (y_{i})$ is needed. This becomes clear when writing out the full conditional distribution of the incomplete covariates, used by the Gibbs sampler:

$\begin{array}{l} p (x_{i, m i s_{l}} | y_{i}, x_{i, o b s}, x_{i, m i s_{- l}}, θ) \\ \propto {\prod_{t = 1}^{T} p (y_{i t} | x_{i, m i s}, x_{i, o b s}, θ_{Y | X})} p (x_{i, m i s} | x_{i, o b s}, θ_{X}) π (θ_{Y | X}) π (θ_{X}) \\ \propto {\prod_{t = 1}^{T} p (y_{i t} | x_{i, m i s}, x_{i, o b s}, θ_{Y | X})} p (x_{i, m i s_{l}} | x_{i, o b s}, x_{i, m i s_{< l}}, θ_{X_{l}}) \\ {\prod_{k = l + 1}^{p} p (x_{i, m i s_{k}} | x_{i, o b s}, x_{i, m i s_{< k}}, θ_{X_{k}})} π (θ_{Y | X}) π (θ_{X_{l}}) \prod_{k = l + 1}^{p} π (θ_{X_{k}}) \end{array}$

where densities $p (y_{i t} | x_{i, m i s}, x_{i, o b s}, θ_{Y | X})$ , $p (x_{i, m i s_{l}} | x_{i, o b s}, x_{i, m i s_{< l}}, θ_{X_{l}})$ and $p (x_{i, m i s_{k}} | x_{i, o b s}, x_{i, m i s_{< k}}, θ_{X_{k}})$ are members of the extended exponential family with linear predictors expressed as

$E (y_{i t} | x_{i, m i s}, x_{i, o b s}, θ_{Y | X}) = {γ^{'}}_{y} x_{i, o b s} + {ξ^{'}}_{y} x_{i, m i s}$ (16)

$g_{l} {E (x_{i, m i s_{l}} | x_{i, o b s}, x_{i, m i s_{< l}}, θ_{X_{l}})} = {γ^{'}}_{y} x_{i, o b s} + \sum_{s = 1}^{l - 1} {ξ^{'}}_{l_{s}} x_{i, m i s_{s}}$ (17)

$g_{k} {E (x_{i, m i s_{k}} | x_{i, o b s}, x_{i, m i s_{< k}}, θ_{X_{k}})} = {γ^{'}}_{k} x_{i, o b s} + \sum_{s = 1}^{k - 1} {ξ^{'}}_{k_{s}} x_{i, m i s_{s}}, k = l + 1, \dots, p$ (18)

with T denoting the number of repeated measurements of individual i, $x_{i, m i s_{< l}} = {(x_{i, m i s_{1}}, \dots, x_{i, m i s_{l - 1}})}^{'}$ , $x_{i, m i s_{< k}} = {(x_{i, m i s_{1}}, \dots, x_{i, m i s_{k - 1}})}^{'}$ , ${θ^{'}}_{Y | X} = ({γ^{'}}_{y}, {ξ^{'}}_{y})$ , ${θ^{'}}_{X_{l}} = ({γ^{'}}_{l}, {ξ^{'}}_{l})$ and ${θ^{'}}_{X_{k}} = ({γ^{'}}_{k}, {ξ^{'}}_{k})$ .

Equation (16) represents the predictor of the linear mixed model, Equation (17) the predictor of the imputation model of $x_{m i s_{l}}$ with link function $g_{l}$ from the extended exponential family, and Equation (18) represents the predictors of the covariates that have $x_{m i s_{l}}$ as a predictive variable, with $g_{k} (.)$ being the corresponding link function. Clearly, looking at Equations (16)-(18) it is noted that none of them is dependent on the outcome variable as was the case in MICE. This puts SFB as a better option over MICE at the specification stage. Synonymously, just like MICE, the SFB approach assumes ignorable missing data mechanisms and correctly specified conditional distributions of the covariates.

It has been mentioned before that it is not obvious how the imputation models in the sequence should be ordered [17] and, from a theoretical point of view, different orderings may result in different joint distributions, leading to different results. Chen and Ibrahim [18] suggest to condition the categorical imputation models on the continuous covariates. In the context of MI, it has been suggested to impute variables in a sequence so that the missing pattern is close to monotone. Our convention is to order the imputation models in (15) according to the number of missing values, starting with the variable with the least missing values. It has been shown, however, that sequential specifications, as used in the Bayesian approach, are quite robust against changes in the ordering [18] and results may be unbiased even when the order of the sequence is misspecified as long as the imputation models fit the data well enough.

4. Monte Carlo Simulation Study

4.1. Relative Theory of Monte Carlo Study

A Monte Carlo simulation (also known as multiple probability simulation) is used to model the probability of different outcomes in a process that cannot easily be predicted due to the intervention of random variables. Monte Carlo studies, or Monte Carlo experiments, encompass a loop of computational algorithms that base on repeated and randomized sampling procedures to obtain numerical results. The major goal is therefore to use this randomness to solve modeling problems. Use of Monte Carlo studies allows us to understand the impact of risk and uncertainty. Universally, Monte Carlo simulation studies follow five key steps:

1) Setting up a probability distribution for important variables.

2) Building a cumulative probability distribution for each variable.

3) Establishing an interval of random numbers for each variable.

4) Generating random numbers.

5) Actually simulating a series of trials.

4.2. Design

In this section, we subject the theoretical analysis of MICE and SFB to hypothetical binary outcome panel data to support the hypothesis that SFB may be superior to MICE. To this end we focus on the logit estimator. This simulations aims at comparing bias and RMSE of the parameter estimates obtained by the unconditional logit estimator and the conditional logit estimator in the presence of imputed missing covariates.

To take care of the different possible features of the data, this comparison will be made for two sets of data, one complete (balanced) and the other incomplete (unbalanced) due to intermittent nonresponses. The latter data set is then balanced by imputing the missing observations using the two model based approaches described in Section 3. We considered a total of four data sets: one complete set, two imputed sets by MICE using the specifications (11) and (13) of how the response variable relates to missingness and one imputed set by SFB.

We specify all the panel sets fitted to the estimation of the following model:

$y_{i t} = 1 ({x^{'}}_{i t} β + c_{i} + u_{i t} \geq 0), i = 1, 2, \dots, N; t = 1, 2$

where ${x^{'}}_{i t}$ is a vector of five explanatory variables drawn from uniform, binomial and normal distributions and the error term $u_{i t}$ is drawn from a normal distribution. The variables’ descriptions are in Table 1. All the specified predictor variables have parameters, β₁ to β₅, of the model which are used to define the dependent variable y. These parameters were fixed as β₁ = 1, β₂ = −1, β₃ = 1, β₄ = 1 and β₅ = 1. The dependent variable, y is then calculated from the relation

$y_{i t} = 1 (c_{i} + β_{1} x_{i t}^{(1)} + β_{2} x_{i t}^{(2)} + β_{3} x_{i t}^{(3)} + β_{4} x_{i t}^{(4)} + β_{5} x_{i t}^{(5)} + v_{i t} \geq 0), i = 1, 2, \dots, n; t = 1, 2$

where $v_{i t}$ is a logistic variable given by $v_{i t} = \ln | \frac{u_{i t}}{1 + u_{i t}} |$ with $u_{i t}$ being a standard normal random variable. The fixed effects $c_{i}$ are obtained as functions of $x^{(1)}$ and T by the relation $c_{i} = \frac{\sqrt{T} \sum x^{(1)}}{n} + a_{i}$ with $a_{i}$ being a standard normal random variable as well. By simulation, the probability densities of the five variables are as shown in Figure 2 where the blue density represents the complete data and the magenta densities are from the few imputed data sets by MICE and SFB.

For in-depth comparisons, we also assess the impact of sample size on the parameter estimates obtained by these approaches herein. In order to factor in small, medium and large sample size possibilities, we mainly settled for three different values of N that were used for all sets of data fitted into the models (N = 50, N = 100 and N = 250). This was established by a subjective imposition of the expected probability of success as $P r (y_{i t} = 1 | x_{i t}, c_{i}) = 0.5$ and plausible coefficients of variation (CoV) as 0.2, 0.14 and 0.09 respectively in the relation $N ≅ \frac{1}{p r (y) \times {(C o V)}^{2}}$ . Further, we evaluate the impact of the proportion of

Table 1. Description of variables.

Figure 2. Covariate densities of complete data (blue) and imputed data (magenta).

missingness from 10% to 30% by randomly deleting the desired proportion of observations from the data set and imputing them back for each sample size.

Since we estimate a fixed effect model, the coefficients are truncated in order to ensure convergence. After 1000 iterations, the summarized results are given in tabulated as in Tables 2-4 where for both estimators (unconditional logit and conditional logit) considered, we report the mean bias, and the root mean squared error (RMSE) for all the parameter estimates. R-Studio has several inbuilt packages that make use of the Bayesian framework for imputation [14] [19] [20] [21] . However for longitudinal or panel data sets, the use of WinBUGS [21] or JAGS [22] provides a much more straightforward platform to easily handle SFB.

For each of the variables x⁽¹⁾ through x⁽⁵⁾ five different samples were generated in MICE with the densities as shown. The simulated densities tend to follow the trend of the observed (complete) density in blue for each of the study variables.

4.3. Simulation Results

Table 2. Sample size n = 50, percentage of missingness = 10%; 30%.

Table 3. Sample size n = 100, percentage of missingness = 10%; 30%.

Table 4. Sample size n = 250, percentage of missingness = 10%; 30%.

5. Discussion, Conclusions and Recommendation

Through the variation of the simulated sample sizes, the study confirms the asymptomatic properties of parameter bias. Undeniably, all the reported measures (mean bias and the root mean square errors) are observed to have an inverse relationship with increasing sample size across all the models considered. We also have established as much as MICE performs well in most standard situations, SFB yields relatively stable estimates that are robust to the proportion of missingness in a binary response panel data frame. In addition, MICE does not

Figure 3. Average bias for parameter estimates.

respond consistently with categorical or discrete data but yields consistent parameter estimates for the continuous regressor. This is also observed for SFB but improves asymptomatically. Uniquely, the magnitude of the median bias produces estimates with increased biases for the conditional logit estimator compared to the unconditional logit estimator when all three imputation techniques are performed more so when the sample size is large [23] . Figure 3 shows the superiority of SFB over MICE for both scenarios considered when handling the logistic panel data model.

Since the SFB approach uses a sequence of univariate conditional distributions to specify the joint pdf, it is advantageous because the analysis model is part of the conditional distributions and the parameters of interest are estimated simultaneously with the imputation of the missing values hence no further pooling is done as is the case in MICE. This means that specification of the univariate conditional imputation model permits a much straightforward and flexible imputation.

In summary, this paper has discussed brief estimation methods and procedures for estimating nonlinear (binary choice logit) panel data regression models with missing covariates.

The key objective of this study was singled out to be an assessment of the performances of MICE and Bayesian imputation to effect of non-responses (missingness) in the parameter estimates for logit panel data models. Although studies show that MICE performs relatively well in many standard situations, this study has demonstrated the advantage of working with full likelihood in more complex specifications of the response variable. In addition, situations where imputation with MICE does not take into account the actual structure of the measurement model may yield poorly imputed values which affect the consistency of the analysis.

A key importance of deriving the estimators is to increase the theoretical understanding of the estimators and also reduce the computational complexity while estimating logit panel models. As observed from the Monte Carlo results, unbalancedness in a data set biases the parameter estimates and the different imputation techniques employed in this study respond differently to the bias and efficiency of the estimates.

As a recommendation, further developments can be done on this study by considering other imputation techniques in MICE where the response variable enters the imputation model in other different specifications.

Appendix

R CODES FOR MONTE CARLO SIMULATION

n = 50 with 10% missingness

set.seed(12345) # Use this to make the randomly generated data the same each time you run the simulation#

N.iter = 1000

j = 1:5

A <- array(0, dim=c(N.iter,5))

B <- array(0, dim=c(N.iter,5))

C <- array(0, dim=c(N.iter,5))

D <- array(0, dim=c(N.iter,5))

E <- array(0, dim=c(N.iter,5))

F <- array(0, dim=c(N.iter,5))

G <- array(0, dim=c(N.iter,5))

H <- array(0, dim=c(N.iter,5))

for(i in 1:N.iter){

n=50 # vary n

t=2 # vary t and change time in pData

nt=n*t

b1=1

b2=-1

b3=1

b4=1

b5=1

ai=rnorm(n,0,1)

x1=rnorm(nt,0,1)

x2 <- runif(nt,0,1)

x3 <- rnorm(nt,0.5,0.5)

x4 <- rbinom(nt,2,0.65) # nt different such numbers

hit=rnorm(nt,0,1)

uit=rnorm(nt,0,1)

vit=log(abs(uit/(1+uit)))

x5 <- ifelse(x1+hit>0, 1, 0)

alphai <- sqrt(t)*sum(x1)/n+ai #alphai simulation

ci <- kronecker(alphai ,matrix(1,t,1))#kronecker of alphai

wit=ci+b1*x1+b2*x2+b3*x3+b4*x4+b5*x5

zit= wit+vit

y <- ifelse(zit>0, 1, 0)

pData <- data.frame(id = rep(paste("stdnt", 1:n, sep = "_"), each = t),time = rep(1:2, n), y,zit,ci,x1,x2,x3,x4,x5)

# Randomly Insert A Certain Proportion Of NAs Into A Dataframe #

pData2<- cbind(x1,x2,x3,x4,x5)

##################################

NAins <- NAinsert <- function(df, prop = .1){

n <- nrow(df)

m <- ncol(df)

num.to.na <- ceiling(prop*n*m)

id <- sample(0:(m*n-1), num.to.na, replace = FALSE)

rows <- id %/% m + 1

cols <- id %% m + 1

sapply(seq(num.to.na), function(x){

df[rows[x], cols[x]] <<- NA

}

)

return(df)

}

##############

# TRY IT OUT #

##############

XX<-NAins(pData2, .1)

pData3<- data.frame(id = rep(paste("stdnt", 1:n, sep = "_"), each = t),time = rep(1:2, n), y,XX)

pData3

#Scenario 1 S1 when f(yi)=0

yS1<-y*0

XXX<-data.frame(yS1,XX)

#check missingness proportions by covariate

pMiss <- function(x){sum(is.na(x))/length(x)*100}

apply (XXX,2,pMiss)

apply(XXX,1,pMiss)

#Performing MICE in XXX

library(mice)

md.pattern(XXX)

library(VIM)

aggr_plot <- aggr(XXX, col=c('navyblue','red'), numbers=TRUE, sortVars=TRUE, labels=names(XXX), cex.axis=.7, gap=3, ylab=c("Histogram of missing data","Pattern"))

tempData <- mice(XXX,m=5,maxit=50,meth='pmm',seed=500)

summary(tempData)

#to see the new imputed covariate values, say X1

tempData$imp$x1

#showing covariate densities of observed data(blue) and imputed data( magenta)

densityplot(tempData)

#getting back the completed data and binding with original y

completedData <- complete(tempData,1)

completedDataS1<- subset(completedData, select=c(x1,x2,x3,x4,x5))

XXXS1<-data.frame(y,completedDataS1)

XXXS1

#parameter estimation, unconditional logit and conditional logit for Scenario S1

glm.out1 = glm(y ~x1+x2+x3+x4+x5,family=binomial(logit), data=pData)

clogit1=clogit((y ~ (x1 +x2+x3+x4+x5)), strata (id), data = pData)

glm.out2 = glm(y ~x1+x2+x3+x4+x5,family=binomial(logit), data=XXXS1)

clogit2=clogit((y ~ (x1 +x2+x3+x4+x5)), strata (id), data = XXXS1)

#Scenario 3 S3 when f(yi)=E(yit)=p

S3=rep(mean(y),times=nt)

yS3<-data.frame(S3)

XXX3<-data.frame(yS3,XX)

#check missingness proportions by covariate

pMiss <- function(x){sum(is.na(x))/length(x)*100}

apply(XXX3,2,pMiss)

apply(XXX3,1,pMiss)

#Performing MICE in XXX3

library(mice)

md.pattern(XXX3)

library(VIM)

aggr_plot <- aggr(XXX3, col=c('navyblue','red'), numbers=TRUE, sortVars=TRUE, labels=names(XXX3), cex.axis=.7, gap=3, ylab=c("Histogram of missing data","Pattern"))

tempData3 <- mice(XXX3,m=5,maxit=50,meth='norm',seed=500)

summary(tempData3)

#to see the new imputed covariate values, say X1

tempData3$imp$x1

#showing covariate densities of observed data(blue) and imputed data( magenta)

densityplot(tempData3)

#getting back the completed data and binding with original y

completedData3 <- complete(tempData3,1)

completedDataS3<- subset(completedData3, select=c(x1,x2,x3,x4,x5))

XXXS3<-data.frame(y,completedDataS3)

XXXS3

# Impute missing values using Bayesian imputation

imputed_data <- mice(XX[, c("x1", "x2", "x3", "x4", "x5")], method = "norm.predict") # Use correct variable names and select only the predictor variables

# Extract the completed dataset from the imputed_data object

XXX4 <- complete(imputed_data)

BayesData <- data.frame(y,XXX4)

#parameter estimation, unconditional logit and conditional logit for Complete Original Data

glm.out1 = glm(y ~x1+x2+x3+x4+x5,family=binomial(logit), data=pData)

clogit1=clogit((y ~ (x1 +x2+x3+x4+x5)), strata (id), data = pData)

#parameter estimation, unconditional logit and conditional logit for Scenario S1

glm.out2 = glm(y ~x1+x2+x3+x4+x5,family=binomial(logit), data=XXXS1)

clogit2=clogit((y ~ (x1 +x2+x3+x4+x5)), strata (id), data = XXXS1)

#parameter estimation, unconditional logit and conditional logit for Scenario S3

glm.out3 = glm(y ~x1+x2+x3+x4+x5,family=binomial(logit), data=XXXS3)

clogit3=clogit((y ~ (x1 +x2+x3+x4+x5)), strata (id), data = XXXS3)

#parameter estimation, unconditional logit and conditional logit for BayesData

glm.outBayes = glm(y ~x1+x2+x3+x4+x5,family=binomial(logit), data=BayesData)

clogitBayes=clogit((y ~ (x1 +x2+x3+x4+x5)), strata (id), data = BayesData)

#Calling out coefficients

A[i,] <- glm.out1$coef[j+1]

B[i,] <- clogit1$coef[j]

C[i,] <- glm.out2$coef[j+1]

D[i,] <- clogit2$coef[j]

E[i,] <- glm.out3$coef[j+1]

F[i,] <- clogit3$coef[j]

G[i,] <- glm.outBayes$coef[j+1]

H[i,] <- clogitBayes$coef[j] }

#mean bias#

mean.biasA<- c(b1-mean(A[,1]), b2-mean(A[,2]),b3-mean(A[,3]),b4-mean(A[,4]),b5-mean(A[,5]))

mean.biasB<- c(b1-mean(B[,1]), b2-mean(B[,2]),b3-mean(B[,3]),b4-mean(B[,4]),b5-mean(B[,5]))

mean.biasC<- c(b1-mean(C[,1]), b2-mean(C[,2]),b3-mean(C[,3]),b4-mean(C[,4]),b5-mean(C[,5]))

mean.biasD<- c(b1-mean(D[,1]), b2-mean(D[,2]),b3-mean(D[,3]),b4-mean(D[,4]),b5-mean(D[,5]))

mean.biasE<- c(b1-mean(E[,1]), b2-mean(E[,2]),b3-mean(E[,3]),b4-mean(E[,4]),b5-mean(E[,5]))

mean.biasF<- c(b1-mean(F[,1]), b2-mean(F[,2]),b3-mean(F[,3]),b4-mean(F[,4]),b5-mean(F[,5]))

mean.biasG<- c(b1-mean(G[,1]), b2-mean(G[,2]),b3-mean(G[,3]),b4-mean(G[,4]),b5-mean(G[,5]))

mean.biasH<- c(b1-mean(H[,1]), b2-mean(H[,2]),b3-mean(H[,3]),b4-mean(H[,4]),b5-mean(H[,5]))

#residual errors#

rmseA<-c(sqrt(mean((b1-A[,1])^2)), sqrt(mean((b2-A[,2])^2)), sqrt(mean((b3-A[,3])^2)), sqrt(mean((b4-A[,4])^2)), sqrt(mean((b5-A[,5])^2)))

rmseB<-c(sqrt(mean((b1-B[,1])^2)), sqrt(mean((b2-B[,2])^2)), sqrt(mean((b3-B[,3])^2)), sqrt(mean((b4-B[,4])^2)), sqrt(mean((b5-B[,5])^2)))

rmseC<-c(sqrt(mean((b1-C[,1])^2)), sqrt(mean((b2-C[,2])^2)), sqrt(mean((b3-C[,3])^2)), sqrt(mean((b4-C[,4])^2)), sqrt(mean((b5-C[,5])^2)))

rmseD<-c(sqrt(mean((b1-D[,1])^2)), sqrt(mean((b2-D[,2])^2)), sqrt(mean((b3-D[,3])^2)), sqrt(mean((b4-D[,4])^2)), sqrt(mean((b5-D[,5])^2)))

rmseE<-c(sqrt(mean((b1-E[,1])^2)), sqrt(mean((b2-E[,2])^2)), sqrt(mean((b3-E[,3])^2)), sqrt(mean((b4-E[,4])^2)), sqrt(mean((b5-E[,5])^2)))

rmseF<-c(sqrt(mean((b1-F[,1])^2)), sqrt(mean((b2-F[,2])^2)), sqrt(mean((b3-F[,3])^2)), sqrt(mean((b4-F[,4])^2)), sqrt(mean((b5-F[,5])^2)))

rmseG<-c(sqrt(mean((b1-G[,1])^2)), sqrt(mean((b2-G[,2])^2)), sqrt(mean((b3-G[,3])^2)), sqrt(mean((b4-G[,4])^2)), sqrt(mean((b5-G[,5])^2)))

rmseH<-c(sqrt(mean((b1-H[,1])^2)), sqrt(mean((b2-H[,2])^2)), sqrt(mean((b3-H[,3])^2)), sqrt(mean((b4-H[,4])^2)), sqrt(mean((b5-H[,5])^2)))

mean.biasA

rmseA

mean.biasB

rmseB

mean.biasC

rmseC

mean.biasD

rmseD

mean.biasE

rmseE

mean.biasF

rmseF

mean.biasG

rmseG

mean.biasH

rmseH

n= 50 with 30% missingness

set.seed(123456) # Use this to make the randomly generated data the same each time you run the simulation#

N.iter=1000

j= 1:5

A <- array(0, dim=c(N.iter,5))

B <- array(0, dim=c(N.iter,5))

C <- array(0, dim=c(N.iter,5))

D <- array(0, dim=c(N.iter,5))

E <- array(0, dim=c(N.iter,5))

F <- array(0, dim=c(N.iter,5))

G <- array(0, dim=c(N.iter,5))

H <- array(0, dim=c(N.iter,5))

for(i in 1:N.iter){

n=50 # vary n

t=2 # vary t and change time in pData

nt=n*t

b1=1

b2=-1

b3=1

b4=1

b5=1

ai=rnorm(n,0,1)

x1=rnorm(nt,0,1)

x2 <- runif(nt,0,1)

x3 <- rnorm(nt,0.5,0.5)

x4 <- rbinom(nt,2,0.65) # nt different such numbers

hit=rnorm(nt,0,1)

uit=rnorm(nt,0,1)

vit=log(abs(uit/(1+uit)))

x5 <- ifelse(x1+hit>0, 1, 0)

alphai <- sqrt(t)*sum(x1)/n+ai #alphai simulation

ci <- kronecker(alphai ,matrix(1,t,1))#kronecker of alphai

wit=ci+b1*x1+b2*x2+b3*x3+b4*x4+b5*x5

zit= wit+vit

y <- ifelse(zit>0, 1, 0)

pData <- data.frame(id = rep(paste("stdnt", 1:n, sep = "_"), each = t),time = rep(1:2, n), y,zit,ci,x1,x2,x3,x4,x5)

# Randomly Insert A Certain Proportion Of NAs Into A Dataframe #

pData2<- cbind(x1,x2,x3,x4,x5)

##################################

NAins <- NAinsert <- function(df, prop = .3){

n <- nrow(df)

m <- ncol(df)

num.to.na <- ceiling(prop*n*m)

id <- sample(0:(m*n-1), num.to.na, replace = FALSE)

rows <- id %/% m + 1

cols <- id %% m + 1

sapply(seq(num.to.na), function(x){

df[rows[x], cols[x]] <<- NA

}

)

return(df)

}

##############

# TRY IT OUT #

##############

XX<-NAins(pData2, .3)

pData3<- data.frame(id = rep(paste("stdnt", 1:n, sep = "_"), each = t),time = rep(1:2, n), y,XX)

pData3

#Scenario 1 S1 when f(yi)=0

yS1<-y*0

XXX<-data.frame(yS1,XX)

#check missingness proportions by covariate

pMiss <- function(x){sum(is.na(x))/length(x)*100}

apply(XXX,2,pMiss)