On the Estimation of Causality in a Bivariate Dynamic Probit Model on Panel Data with Stata Software : A Technical Review

To assess causality between binary economic outcomes, we consider the estimation of a bivariate dynamic probit model on panel data that accounts for the initial conditions of the dynamic process. Because the likelihood function has an intractable form (a two-dimensional integral), we use an approximation method: the adaptive Gauss-Hermite quadrature. To improve accuracy and reduce computing time, we derive the gradient of the log-likelihood and the Hessian of the integrand. The estimation method has been implemented with the d1 method of the Stata software. We validate the estimation method empirically by applying it to a simulated data set. We also analyze the impact of the number of quadrature points on the estimates and on the duration of the estimation process. We conclude that, beyond 16 quadrature points on our simulated data set, the relative differences in the estimated coefficients are around 0.01% while the computing time grows exponentially.


Introduction
Testing Granger causality has generated a large body of literature, most of which concerns continuous dependent variables. For binary outcomes, the causality problem can also be addressed. As described by [1] for a vector of dependent variables, first-order Granger causality can be analysed as conditional independence in probability, given a set of exogenous variables and the first-order lagged dependent variables. For a binary outcome in the dependent vector, one can use a probit probability, which implies the use of a latent variable.

DOI: 10.4236/tel.2018.860831258 Theoretical Economics Letters
For panel data, since the one-way fixed effects model estimated on a finite sample necessarily yields inconsistent estimators [2], a random effects model is used. Because we aim to test for first-order Granger causality, lagged dependent variables are included as explanatory variables. For the first wave of the panel, we do not observe previous values of the dependent variables, and treating the initial observations as exogenous leads to inconsistent estimators [2]. We therefore specify another equation for the initial conditions, as described by [3]. This equation is allowed to have different explanatory variables and different idiosyncratic error terms from the dynamic equation.
This specification leads to a likelihood function with an intractable form: a two-dimensional integral with a large set of parameters to be estimated. Estimating this likelihood requires a numerical approximation of the integral, such as maximum simulated likelihood (see [4] for more details) or Gauss-Hermite quadrature (see [5] [6] [7]).
The main goal of this paper is to propose and test a method for estimating a two-equation system where the dependent variables are binary, in a panel data framework. To the best of our knowledge, no program exists to do so, especially since we provide the calculation of the Hessian matrix and the gradient vector of our maximisation program.
In this paper, we discuss the problem of testing Granger causality with a bivariate dynamic probit model taking the initial conditions into account. The paper is organized as follows. Section 2 explains the causality test method for a bivariate probit model with panel data. Section 3 describes the estimation methods available when the likelihood function has an intractable form (a two-dimensional integral in our case). Section 4 presents the calculation of the gradient with respect to the model parameters and of the Hessian matrix with respect to the random effects vector. Section 5 presents a robustness analysis of the selected estimation method based on simulations.

Testing Causality with a Bivariate Dynamic Probit Model
This section describes the causality test method in the case of binary variables. We start by presenting the general approach in time series before introducing the panel data case. We end the section with a discussion of the initial conditions problem.

Testing Causality: General Approach
The causality concept was introduced by [8] as a better predictability of a variable Y through the use of its lagged values, the lagged values of another variable Z, and some controls X. In his paper, [8] distinguishes instantaneous causality, meaning that Z_t causes Y_t (including Z_t in the model improves the predictability of Y_t), from lagged causality, meaning that lagged values of Z improve the predictability of Y_t. In this section, we rule out instantaneous causality and deal with lagged causality of one period.

(Notation: section-specific notation is defined at the beginning of each section. Otherwise, f(x)|_{x=a} denotes the value of the function or matrix f at the point a and, when not otherwise specified, ⌊a⌋ denotes the integer part of the scalar a.)
One-period Granger causality can be rephrased in terms of conditional independence. Without loss of generality, we present the univariate case for time series. Let Y_t and Z_t denote the dependent variables and X_t a set of control variables. One-period Granger non-causality from Z to Y is the conditional independence of Y_t from Z_{t−1}, given Y_{t−1} and X_t:

P(Y_t | Y_{t−1}, Z_{t−1}, X_t) = P(Y_t | Y_{t−1}, X_t).     (1)

The same kind of relationship can be written for Granger non-causality from Y to Z. As Y_t and Z_t are binary outcome variables, we can use latent variables Y* and Z* (for Y and Z respectively) and assume that Y and Z have positive outcomes (equal to 1) if their latent variables are positive. Each side of Equation (1) can then be written as a probit probability based on the corresponding latent variable.

  
To fit the joint distribution of Y and Z conditional on X (meaning that we estimate a bivariate model), we need to consider the four possible outcome configurations (Y_t, Z_t) ∈ {0, 1} × {0, 1}. For each of these configurations, the likelihood contribution is a bivariate probit probability.
Then, testing Granger non-causality in this specification amounts to testing H_0: δ_12 = 0 for Z not causing Y, and H_0: δ_21 = 0 for Y not causing Z.

Testing Causality: Panel Data Case
For panel data, two major approaches can be used. The first is to consider that the causal effect is not the same for all individuals in the panel ([9]). This approach is useful when individuals are heterogeneous or when the causal effect is not homogeneous; in this case, testing Granger non-causality is equivalent to testing that all individual-specific causality coefficients are zero. The second approach (the one used in this paper) assumes that the causal effects, if they exist, are the same for all individuals in the panel. With the same notation as in the previous case, the latent variables follow the dynamic specification of Equations (9) and (10), and testing Granger non-causality is equivalent to testing H_0: δ_12 = 0 for Z not causing Y and H_0: δ_21 = 0 for Y not causing Z. Equations (9) and (10) are thus the core of our problem: since Y and Z are binary panel outcomes and each equation includes lagged dependent variables, estimating these two equations jointly amounts to estimating a bivariate dynamic probit model.
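Written out explicitly, a reconstruction of the homogeneous-effects latent-variable system of Equations (9) and (10) is the following (the parameter names follow the list used in Section 4; the exact regressor sets X_{it} are those chosen by the authors):

```latex
\begin{aligned}
Y^{*}_{it} &= X_{it}'\beta_{1} + \gamma_{1}\,Y_{i,t-1} + \delta_{12}\,Z_{i,t-1} + \eta_{1i} + \varepsilon_{1it},\\
Z^{*}_{it} &= X_{it}'\beta_{2} + \gamma_{2}\,Z_{i,t-1} + \delta_{21}\,Y_{i,t-1} + \eta_{2i} + \varepsilon_{2it},\\
Y_{it} &= \mathbf{1}\{Y^{*}_{it} > 0\}, \qquad Z_{it} = \mathbf{1}\{Z^{*}_{it} > 0\},
\end{aligned}
```

where (η_{1i}, η_{2i}) are the individual random effects and (ε_{1it}, ε_{2it}) the idiosyncratic shocks; the cross-lag coefficients δ_12 and δ_21 are the Granger-causality parameters tested above.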

Dealing with Initial Conditions
For the first wave of the panel (the initial conditions), since we do not observe the previous states of Y and Z (no values for Y_{i,0} and Z_{i,0}), we cannot evaluate the corresponding likelihood contribution. By ignoring it in the individual likelihood, researchers also ignore the data generating process of the first wave of the panel. This amounts to assuming the data generating process of the first wave to be exogenous or in equilibrium. These assumptions hold only if the individual random effects are degenerate. If not, the initial conditions (the first wave of the panel) are explained by the individual random effects, and ignoring them leads to inconsistent parameter estimates [2]. The solution proposed by [2] for the univariate case and generalized by [3] is to estimate a static equation for the first wave of the panel (meaning that we do not introduce lagged dependent variables). In this static equation, the random effects are a linear combination of the random effects of the subsequent waves, and the idiosyncratic error terms may have a different structure from those of the dynamic equation. In the latent variables for the first wave, the vector of idiosyncratic shocks (ζ_1, ζ_2) has zero mean and covariance matrix Σ_ζ, and each λ can be interpreted as the influence of the Y random individual effects (respectively the Z random individual effects) on Z (respectively on Y) at the first wave of the panel.
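Following [3], the static initial-condition equations with factor loadings λ on the random effects can be sketched as follows (a reconstruction consistent with the parameter list of Section 4; b_1 and b_2 denote the first-wave coefficient vectors):

```latex
\begin{aligned}
Y^{*}_{i1} &= X_{i1}'b_{1} + \lambda_{11}\,\eta_{1i} + \lambda_{12}\,\eta_{2i} + \zeta_{1i},\\
Z^{*}_{i1} &= X_{i1}'b_{2} + \lambda_{21}\,\eta_{1i} + \lambda_{22}\,\eta_{2i} + \zeta_{2i},
\end{aligned}
\qquad (\zeta_{1i},\zeta_{2i})' \sim \left(0,\ \Sigma_{\zeta}\right),\quad \operatorname{corr}(\zeta_{1i},\zeta_{2i}) = \rho_{\zeta}.
```

The off-diagonal loadings λ_12 and λ_21 carry the cross-influence of each random effect on the other outcome at the first wave.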

Estimation Methods
Because the likelihood function has an intractable form (an integral), it cannot be maximized by the usual methods. We therefore turn to numerical integration methods, that is, numerical approximations of an integral. In this section we describe the two major methods and argue for one of them to estimate our likelihood function.

Gauss-Hermite Quadrature Method
The Gauss-Hermite quadrature is a numerical method used to approximate the value of an integral. The default setting is a univariate integral of the form

∫ f(x) exp(−x²) dx,

where exp(−x²) denotes the Gaussian factor.² The integral can then be approximated by

∫ f(x) exp(−x²) dx ≈ Σ_{q=1}^{Q} w_q f(x_q),

where the x_q, q = 1, …, Q are the nodes (roots of the Hermite polynomial) and the w_q, q = 1, …, Q the corresponding weights. Note that even without the Gaussian factor, one can use the Gauss-Hermite quadrature through a straightforward transformation: multiply and divide the integrand by exp(−x²). This approximation supposes that the integrand can be well approximated by a polynomial of order 2Q + 1 and that the integrand is sampled on a symmetric range centered on zero. For suitable results, these two assumptions must be taken into account.
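As an illustration (not taken from the paper), the rule above can be applied in a few lines of Python with NumPy's `hermgauss`, which returns the nodes and weights for the weight function exp(−x²):

```python
import numpy as np

# Nodes x_q and weights w_q for the weight function exp(-x^2).
Q = 16
nodes, weights = np.polynomial.hermite.hermgauss(Q)

# Example: integral of exp(-x^2) * cos(x) over the real line,
# whose closed form is sqrt(pi) * exp(-1/4).
approx = np.sum(weights * np.cos(nodes))
exact = np.sqrt(np.pi) * np.exp(-0.25)
```

For an integrand without the Gaussian factor, one would instead sum `weights * np.exp(nodes**2) * g(nodes)`, which is the multiply-and-divide transformation mentioned in the text.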
For the accuracy of the approximation, the number of quadrature points must be chosen carefully; we assume that a suitable number can be found numerically. One can start with a number Q of quadrature points, increase it, assess whether the results change significantly, and repeat this process until convergence in terms of the overall likelihood value and of the estimated coefficients. It is also important to keep in mind that increasing the number of quadrature points increases the computing time. An example of the impact of the number of quadrature points on the estimated results is given in Section 5.
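The search procedure just described (increase Q until the result stabilizes) can be sketched as follows; this is an illustrative Python loop, with the tolerance, starting point, and test integrand chosen here for the example rather than taken from the paper:

```python
import numpy as np

def ghq_integral(f, Q):
    """Gauss-Hermite approximation of the integral of f(x)*exp(-x^2)."""
    x, w = np.polynomial.hermite.hermgauss(Q)
    return np.sum(w * f(x))

def choose_Q(f, tol=1e-6, Q0=4, Qmax=64):
    """Increase Q until the relative change of the approximation falls below tol."""
    Q, prev = Q0, ghq_integral(f, Q0)
    while Q < Qmax:
        Q += 2
        cur = ghq_integral(f, Q)
        if abs(cur - prev) <= tol * max(1.0, abs(prev)):
            return Q, cur
        prev = cur
    return Q, prev

Q_star, val = choose_Q(np.cos)          # integral of cos(x)*exp(-x^2)
exact = np.sqrt(np.pi) * np.exp(-0.25)  # closed form, for comparison
```

In the paper the stopping criterion is applied to the overall likelihood value and the estimated coefficients rather than to a single integral.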
For the problem of a suitable sampling range, the adaptive Gauss-Hermite quadrature was proposed by [5] and [6]. In this approach, the integrand is rescaled by a normal density φ(x; μ, σ) with mean μ and variance σ², so that the sampling range is transformed and the new nodes and weights are

x̃_q = μ + √2 σ x_q,  w̃_q = √2 σ w_q exp(x_q²).

In [5], one chooses the normal density with posterior mean μ and posterior variance σ²: for the implementation, one can start with μ = 0 and σ = 1 and, at each iteration of the likelihood maximization, compute the posterior weighted mean and variance of the quadrature points and use them to build the nodes and weights for the next iteration. In [6], μ is chosen as the mode of the integrand f(x) and σ as the inverse of the square root of minus the second derivative of the log of the integrand evaluated at the mode.
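A small Python illustration (not from the paper) shows why the mode-based rescaling matters: for an integrand concentrated far from zero, the naive rule samples it where it is essentially zero, while the adaptive rule recovers the integral. The example integrand and the values μ = 5, σ = 0.3 are chosen for illustration:

```python
import numpy as np

def npdf(t, mu, sd):
    return np.exp(-0.5 * ((t - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

Q = 8
x, w = np.polynomial.hermite.hermgauss(Q)

f = lambda t: npdf(t, 5.0, 0.3)   # concentrated around 5; true integral = 1

# Naive rule: multiply and divide by exp(-t^2); samples f near 0 where f ~ 0.
naive = np.sum(w * np.exp(x ** 2) * f(x))

# Adaptive rule: centre on the mode mu, scale by sigma = 1/sqrt(-(log f)''(mu)).
# For this integrand: mu = 5, sigma = 0.3.
mu, sigma = 5.0, 0.3
adaptive = np.sqrt(2) * sigma * np.sum(
    w * np.exp(x ** 2) * f(mu + np.sqrt(2) * sigma * x)
)
```

With only 8 points, the naive approximation is essentially zero while the adaptive one is exact here, since the rescaled integrand is itself Gaussian.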
For multivariate integrals, the same approach is used. Without loss of generality, we discuss the bivariate case, which extends to other multivariate cases. The function to approximate is

∫∫ f(x, y) exp(−x² − y²) dx dy.

Under independence between x and y (which can be obtained through a Cholesky transformation x′ = x and y′ = ρx + √(1 − ρ²) y; see [5] or [7] for details on this and other transformations leading to similar results), the integral above can be approximated by

∫∫ f(x, y) exp(−x² − y²) dx dy ≈ Σ_q Σ_h w_q w_h f(x_q, x_h),

and the adaptive nodes and weights are derived as in the univariate case, with the mode of the integrand replacing the scalar mode and |A| denoting the determinant of the matrix A of second derivatives. [7] also suggests that the nodes with low weights (whose contributions to the integral value are not significant) can be pruned from the grid to save computing time: one sets a threshold and drops all nodes with weights lower than this scalar.
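The tensor-product construction with the Cholesky map can be illustrated in Python (an example of the technique, not the paper's code): we approximate an expectation under a correlated bivariate standard normal and check it against the known value E[XY] = ρ.

```python
import numpy as np

Q = 10
x, w = np.polynomial.hermite.hermgauss(Q)
rho = 0.6

# Tensor-product grid of independent nodes (s, t) and product weights.
S, T = np.meshgrid(x, x, indexing="ij")
W = np.outer(w, w)

# Cholesky map from independent (s, t) to correlated (X, Y);
# the factor sqrt(2) converts the exp(-s^2) weight to a standard normal.
X = np.sqrt(2) * S
Y = np.sqrt(2) * (rho * S + np.sqrt(1 - rho ** 2) * T)

# E[X*Y] under a standard bivariate normal with correlation rho equals rho.
mean_xy = np.sum(W * X * Y) / np.pi
```

Pruning in the sense of [7] would simply drop grid points where `W` falls below a chosen threshold before the sum.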

Maximum Simulated Likelihood Method
The maximum simulated likelihood method was introduced by [4] as a solution to maximization problems whose objective function is an integral. In this approach, the likelihood function is of the form

L = ∫∫ f(x, y, u₁, u₂) g(u₁, u₂) du₁ du₂,

where g(u₁, u₂) is a probability density function and f(x, y, u₁, u₂) is called the simulator: the function whose mean value over draws of u₁ and u₂ approximates the overall likelihood. Without loss of generality, we define only the two-dimensional case, which generalizes to fewer or more dimensions. For this kind of likelihood function, [4] proposed as simulator the function f(x, y, u₁, u₂) with u₁ and u₂ drawn from the density g (the probability distribution of the individual random effects). The overall likelihood can then be approximated by

L ≈ (1/D) Σ_{d=1}^{D} f(x, y, u_{1d}, u_{2d}),

where u_{1d} denotes the d-th draw of u₁ (the same definition holds for u_{2d}) and D denotes the number of draws.
To implement this method, we start by simulating bivariate standard normal draws and give them the (u₁, u₂) covariance structure. We then evaluate the simulator at these transformed draws, repeating the procedure D times; the overall likelihood is the mean of the simulator values over the transformed draws. At each iteration, once the random effects covariance matrix has been updated, we apply it to the first-stage normal draws to transform them into draws of the random effects, and use them to compute the likelihood. This process is repeated until convergence. The simulated likelihood estimator is consistent and asymptotically equivalent to the maximum likelihood estimator ([4]) if the number of draws tends to infinity faster than √N.
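A minimal one-dimensional illustration of the simulator idea (not the paper's bivariate setting) is a random-effects probit cell probability, which has a closed form to check against; the parameter values and seed are chosen for the example:

```python
import math
import random

def Phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

random.seed(12345)
beta, sigma, D = 0.5, 1.0, 200_000

# Simulator: average the integrand over D draws of the random effect u ~ N(0, sigma^2).
sim = sum(Phi(beta + sigma * random.gauss(0.0, 1.0)) for _ in range(D)) / D

# Closed form for this one-dimensional case: P(y = 1) = Phi(beta / sqrt(1 + sigma^2)).
exact = Phi(beta / math.sqrt(1.0 + sigma ** 2))
```

The Monte Carlo error shrinks at rate 1/√D, which is what makes the method expensive for the large panels discussed below.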

GHQ or MSL: What Method to Choose?
As described above, there are two main methods to estimate our likelihood function. To choose which one to implement, we consider accuracy and computing time requirements.
For our estimations, we choose the adaptive Gauss-Hermite quadrature proposed by [7], for three main reasons.
• Our data set is an unbalanced panel of 10,569 individuals observed on average over 26 years, which yields 255,206 observations. Since the simulated likelihood method requires the number of draws D to grow faster than the square root of the number of observations, we do not use it, to avoid wasting computing time.
• The Gauss-Hermite quadrature requires finding the best number of quadrature points Q, that is, the one for which the integrand is well approximated by a polynomial of order 2Q + 1. If Q is small, computing time is reduced. Our estimations are generally achieved with Q between 8 and 14, which means that at each iteration the likelihood evaluation is a weighted sum of between 8² = 64 and 14² = 196 terms.
• Using the Gauss-Hermite quadrature reduces computing time, but this time remains very long if the integrand is not sampled on a suitable range (meaning that the adaptive method has not been used): in that case, the maximization process takes two to three weeks to converge on an Intel Core i7 computer at 3.4 GHz with 8 GB of RAM. With the adaptive Gauss-Hermite quadrature, the computing time is significantly reduced, to two to three days on the same computer.

Note that the reduced convergence time mentioned above is partly due to the implementation of the first-order derivatives of the likelihood function. Using the overall log-likelihood approximated by the Liu and Pierce adaptive Gauss-Hermite quadrature method, we can obtain derivatives with respect to all model parameters, and implementing these derivatives in the maximization process allows us to use Stata's d1 method. The convergence time saved by this method is substantial. On our overall data set, with 8 quadrature points, the non-adaptive quadrature method does not achieve convergence: after 3 weeks of computation, the model underflows. With the adaptive Gauss-Hermite quadrature of [6] but without the first-order derivatives, the estimation process takes 11 days and 10 hours to converge. With the adaptive Gauss-Hermite quadrature of [6] and the implemented first-order derivatives, it converges after only 1 day and 17 hours.

Chosen Method Requirements
In this section we describe some requirements of the selected method, the adaptive Gauss-Hermite quadrature. The first is that the adaptive quadrature requires deriving the Hessian of the log of the integrand ([6]). The second is that we derive the gradient of the overall likelihood function in order to use Stata's d1 method (see [10]) for more accuracy and speed in the calculations.

Gradient Vector Calculation
The gradient of the overall log-likelihood function has been calculated to speed up the maximization process. This allows us to use Stata's d1 method, which requires the implementation of the gradient vector in addition to the log-likelihood. Starting from the individual likelihood of Equation (23) and its adaptive Gauss-Hermite approximation in the sense of [6] (with the same notation as in Section 3), the log-likelihood is differentiated with respect to the 13 parameters of the model: β₁, β₂, γ₁, γ₂, λ₁₁, λ₁₂, λ₂₁, λ₂₂, σ₁, σ₂, ρ_η, ρ_ζ and ρ̃. The first-order derivative with respect to each parameter α is obtained by differentiating the quadrature sum term by term.

For σ₁, σ₂, ρ_η, ρ_ζ and ρ̃, we use transformations of the parameters to ensure that, at every iteration of the maximization process, each σ remains positive and each ρ remains between −1 and 1. For σ we use an exponential transformation, so in the derivation we differentiate with respect to log σ; for ρ we use an arc-tangent transformation, so we differentiate with respect to the transformed parameter.

Two facts are useful in these calculations.
• A bivariate normal probability with zero means, unit variances and correlation ρ can be written as an integral whose integrand is the product of a univariate normal density and a univariate normal probability:

Φ(x, y, ρ) = ∫_{−∞}^{x} φ(t) Φ((y − ρt)/√(1 − ρ²)) dt.

• Given this representation, the first-order derivatives of Φ(x, y, ρ) with respect to x and y are, respectively,

∂Φ(x, y, ρ)/∂x = φ(x) Φ((y − ρx)/√(1 − ρ²)),  ∂Φ(x, y, ρ)/∂y = φ(y) Φ((x − ρy)/√(1 − ρ²)).
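These two identities can be checked numerically; the following Python sketch (an illustration, not the paper's Stata code) evaluates the one-dimensional representation of Φ(x, y, ρ) with a trapezoidal rule and compares it with a known closed-form value and with the analytic derivative:

```python
import math

def phi(t):
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def Phi(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def Phi2(x, y, rho, n=4000):
    """Bivariate standard normal CDF via
    Phi2(x, y, rho) = int_{-inf}^{x} phi(t) * Phi((y - rho t)/sqrt(1 - rho^2)) dt,
    integrated with the trapezoidal rule on [-8, x] (the tail below -8 is negligible)."""
    a = -8.0
    h = (x - a) / n
    c = math.sqrt(1.0 - rho * rho)
    g = lambda t: phi(t) * Phi((y - rho * t) / c)
    s = 0.5 * (g(a) + g(x)) + sum(g(a + i * h) for i in range(1, n))
    return s * h

rho = 0.5
# Check against the closed form Phi2(0, 0, rho) = 1/4 + arcsin(rho)/(2*pi).
val = Phi2(0.0, 0.0, rho)
closed = 0.25 + math.asin(rho) / (2.0 * math.pi)

# Check the derivative identity dPhi2/dx = phi(x) * Phi((y - rho x)/sqrt(1 - rho^2)).
x, y, eps = 0.4, -0.3, 1e-3
d_analytic = phi(x) * Phi((y - rho * x) / math.sqrt(1.0 - rho ** 2))
d_numeric = (Phi2(x + eps, y, rho) - Phi2(x - eps, y, rho)) / (2.0 * eps)
```

In the paper these analytic derivatives feed the gradient implementation for Stata's d1 method, avoiding numerical differentiation inside the maximizer.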

Hessian Matrix Calculation
For the requirements of the adaptive Gauss-Hermite quadrature method, we need to derive the Hessian matrix of the log of the integrand with respect to the random effects vector.³ From the individual likelihood function defined in Equation (23), we take the log of the integrand (Equation (25)) and differentiate it twice with respect to the random effects (η_{1i}, η_{2i}): the first-order derivatives are obtained by differentiating each bivariate probit term with respect to η_{1i} and η_{2i}, and the second-order derivatives follow from a further differentiation, yielding the 2 × 2 Hessian matrix. As described in Section 3.1, after having derived this Hessian matrix, we evaluate it at the mode of the integrand and use it to re-sample the integrand.
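The mode-and-curvature step can be illustrated in one dimension (the paper derives the Hessian analytically; this sketch uses finite differences purely for illustration, on a test integrand chosen for the example):

```python
import math

def mode_and_scale(logf, x0=0.0, h=1e-5, iters=50):
    """Newton search for the mode of logf, then the adaptive-quadrature scale
    sigma = 1 / sqrt(-logf''(mode)), both via central finite differences."""
    x = x0
    for _ in range(iters):
        d1 = (logf(x + h) - logf(x - h)) / (2.0 * h)
        d2 = (logf(x + h) - 2.0 * logf(x) + logf(x - h)) / (h * h)
        step = d1 / d2
        x -= step
        if abs(step) < 1e-10:
            break
    d2 = (logf(x + h) - 2.0 * logf(x) + logf(x - h)) / (h * h)
    return x, 1.0 / math.sqrt(-d2)

# Test integrand: f(t) = exp(-(t - 2)^2), so log f has mode 2 and curvature -2.
logf = lambda t: -(t - 2.0) ** 2
mu, sigma = mode_and_scale(logf)   # expected: mu = 2, sigma = 1/sqrt(2)
```

In the bivariate setting of the paper, the scalar curvature is replaced by the 2 × 2 Hessian with respect to (η_{1i}, η_{2i}), and its inverse square root rescales the quadrature grid.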

Robustness Analysis Based on Simulations
This section aims to ensure that the implemented method gives suitable results. We consider that the implemented method gives suitable results if, for a given relationship between variables, applying the estimation method to these variables recovers approximately the same coefficients. To this end, we perform an empirical robustness analysis based on simulations, using two different approaches. The first is to simulate bivariate binary variables from a specified relationship between some explanatory variables (meaning that we set the coefficients of the explanatory variables) and to estimate this relationship with the implemented method, comparing the results with the relationship specified beforehand. In the second approach, we introduce additional variables (not used in the data generating process) when estimating the relationship, and compare the new results with the first ones. The implemented method is robust if it correctly estimates the specified relationship even when other variables are introduced, and if it estimates non-significant coefficients for those other variables. Finally, the method we use to check robustness is the same as in [11].
As the implemented estimation method is a numerical approximation, the results depend on the selected number of quadrature points. We deal with the impact of the number of quadrature points on the results in the last part of this section. For a better analysis of the results, we also report the standard error of each estimated coefficient.

Simulated Relationship between Real Variables
In this section, we use variables from the French SIP (Santé et Itinéraire Professionnel) survey data set and simulate the error terms and a relationship between some selected variables. The subset of the database used in this section is an unbalanced panel of 1202 individuals with between 5 and 10 waves per individual.
We simulate the idiosyncratic error terms (ζ₁, ζ₂)′ as bivariate normal variables with zero mean, unit variances, and correlations ρ_ζ (initial-conditions equations) and ρ̃ (dynamic equations). We also simulate the individual random effects, which are time invariant, as bivariate normal variables with zero mean, correlation ρ_η, and variances σ₁² and σ₂² for the first and second components of the random effects vector. For the initial conditions (t = 1), the simulated relationship is the static one; for the subsequent waves, it is the dynamic one. The variable ill denotes having an illness episode during the year, unemp denotes being out of the labour market during the year, age denotes the age of the individual, and Male equals 1 if the individual is male and 0 otherwise. Estimation results for 16 quadrature points are displayed in Table 1. For all equations, we give the coefficients used in the DGP and those estimated by our program. As can be seen, all the DGP coefficients are very close to the estimated ones.
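The structure of such a DGP can be sketched in Python (an illustration only: the coefficient values, the single regressor `x`, the seed, and the panel dimensions below are hypothetical, whereas the paper's DGP uses the SIP variables ill, unemp, age and Male on an unbalanced panel):

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 500, 8
s1, s2, r_eta, r_eps = 0.8, 0.6, 0.4, 0.3   # hypothetical variance/correlation parameters

# Individual random effects: bivariate normal, variances s1^2, s2^2, correlation r_eta.
cov_eta = [[s1 ** 2, r_eta * s1 * s2], [r_eta * s1 * s2, s2 ** 2]]
eta = rng.multivariate_normal([0.0, 0.0], cov_eta, size=N)

x = rng.normal(size=(N, T))                 # one exogenous regressor (hypothetical)
y = np.zeros((N, T), dtype=int)
z = np.zeros((N, T), dtype=int)

cov_eps = [[1.0, r_eps], [r_eps, 1.0]]
for t in range(T):
    eps = rng.multivariate_normal([0.0, 0.0], cov_eps, size=N)
    if t == 0:
        # Static initial-condition equations: no lags, loadings on both random effects.
        ystar = 0.5 * x[:, 0] + 1.0 * eta[:, 0] + 0.3 * eta[:, 1] + eps[:, 0]
        zstar = -0.4 * x[:, 0] + 0.2 * eta[:, 0] + 1.0 * eta[:, 1] + eps[:, 1]
    else:
        # Dynamic equations: own lag plus the cross lag (the Granger term).
        ystar = 0.5 * x[:, t] + 0.6 * y[:, t - 1] + 0.4 * z[:, t - 1] + eta[:, 0] + eps[:, 0]
        zstar = -0.4 * x[:, t] + 0.5 * z[:, t - 1] + 0.0 * y[:, t - 1] + eta[:, 1] + eps[:, 1]
    y[:, t] = (ystar > 0).astype(int)
    z[:, t] = (zstar > 0).astype(int)
```

Feeding such simulated outcomes to the estimator and comparing the recovered coefficients with the ones set above is exactly the validation exercise reported in Table 1.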

Simulated Relationship with Additional Variables
In this section, we keep the same DGP as in Section 5.1 and add other variables to the estimated model, in order to evaluate the robustness of the estimation method: all estimated coefficients for the variables in the DGP should remain the same, and the coefficients of the added variables should not be significantly different from zero.

Impact of Number of Quadrature Points on Estimated Results
As the accuracy of the method depends on the number of quadrature points used in the likelihood calculation, we assess how the results are affected as this number increases. To do so, we fit the same model with different numbers of quadrature points and compute the relative differences in log-likelihood and in estimated parameters.
We fit the models using the same simulated relationship between variables as in Section 5. The results are displayed in Table 3 for the dynamic equations and in Table 4 for the initial conditions equations and the error terms covariance structure.
As can be seen from Table 3 and Table 4, as the number of quadrature points increases the changes in the results decline: the relative differences are around 0.01% for significant coefficients and 0.1%, or at most 1%, for non-significant coefficients. Beyond 16 quadrature points, the relative differences in log-likelihood and in estimated coefficients become smaller as the number of quadrature points increases; the estimates with 22 quadrature points are closer to those with 24 points than the others are. Thus, increasing the number of quadrature points further does not change the estimated coefficients significantly, while the computing time grows exponentially. For these models, estimation times on an Intel Core i5 computer at 2.5 GHz with 6 GB of RAM for the different numbers of quadrature points are given in Table 5.

Conclusions
This paper describes the bivariate dynamic probit model with endogenous initial conditions, starting with the justification of the econometric specification, then presenting the estimation method and its requirements, and ending with a robustness analysis. We calculate the derivatives of the log-likelihood function (the gradient) with respect to the 13 parameters of the model. This is the main contribution of our research, as many programs use numerical computation of the gradient vector instead of encoding its mathematically derived expression. Furthermore, for the adaptive Gauss-Hermite quadrature, we also calculate the Hessian matrix with respect to the individual random effects vector. The implementation was done in Stata: we wrote two ado-files for this purpose, using Stata's d1 method for the maximization process, which required implementing the gradient vector for the 13 parameters as well as the Hessian matrix with respect to the random effects vector for the adaptive quadrature. We also wrote two other ado-files, for the estimation of the bivariate probit on panel data and of the bivariate dynamic probit without initial conditions on panel data, written with the same method (Stata's d1 method with the adaptive Gauss-Hermite quadrature). These ado-files are available upon request.
Because the integration is two-dimensional, estimation time is very high and increases with the number of quadrature points, the number of observations, and the number of explanatory variables. For an estimated model, one should ensure that the computed results do not change significantly when the number of quadrature points is increased before using them: if the relative difference in the results is around 0.1% or less, the results can be considered stable with respect to the number of quadrature points, and there is no need to increase it further, which would increase computing time without significantly improving the results. One way to achieve a major improvement of the program is the use of a multi-core (parallel) computing scheme, which would allow the contributions to the likelihood (Equation (23)) to be computed at each quadrature point separately and simultaneously on several cores, saving time since the contributions are computed at the same time. Finally, our method gives reasonable computing durations with a real data set.
In [12], we make use of the full SIP data set with 10,569 individuals and 255,206 observations.

Notation: φ(x) denotes the univariate standard normal density; φ(x, y, ρ) the bivariate normal density with correlation ρ; Φ(x) the univariate standard normal probability function; Φ(x, y, ρ) the bivariate normal probability function with correlation ρ; and φ(x; μ, σ) the normal density with mean μ and standard deviation σ.

Table 1 .
Estimation results for the simulated data set.

Table 2 .
Columns 1 and 2 in Table 2 are the same as the corresponding columns in Table 1. In column 3 of Table 2, we provide the new results obtained with the additional variables, for comparison with the previous estimates.⁴
⁴ We do the same with columns 1′ and 2′ of Table 1 and Table 2 (new results in column 3′) and with columns 4 and 5 of both tables (new results in column 6).

Table 2 .
Estimation results for the simulated data set with added variables.

Table 3 .
Impact of the number of quadrature points on estimation results. Part A.

Table 4 .
Impact of the number of quadrature points on estimation results. Part B.

Table 5 .
Computing time for different number of quadrature points.