A Kullback-Leibler Divergence for Bayesian Model Diagnostics

This paper considers a Kullback-Leibler distance (KLD) which is asymptotically equivalent to the KLD by Goutis and Robert [1] when the reference model (in comparison to a competing fitted model) is correctly specified and that certain regularity conditions hold true (ref. Akaike [2]). We derive the asymptotic property of this Goutis-Robert-Akaike KLD under certain regularity conditions. We also examine the impact of this asymptotic property when the regularity conditions are partially satisfied. Furthermore, the connection between the Goutis-Robert-Akaike KLD and a weighted posterior predictive p-value (WPPP) is established. Finally, both the Goutis-Robert-Akaike KLD and WPPP are applied to compare models using various simulated examples as well as two cohort studies of diabetes.


Introduction
Information theory provides a general framework of developing statistical techniques for model comparison (Akaike [2]; Shannon [3]; Kullback and Leibler [4]; Lindley [5]; Bernardo [6]; Schwarz [7]).The Kullback-Leibler distance (KLD) is perhaps the most commonly used information criterion for assessing model discrepancy (Akaike [2]; Kullback and Leibler [4]; Lindley [5]; Schwarz [7]).In essence, a KLD is the expected logarithm of the ratio of the probability density functions (p.d.f.s) of two models, one being a fitted model and the other being the reference model, where the expectation is taken with respect to the reference model.Thus KLD can be viewed as a measure of the information loss in the fitted model relative to that in the reference model.KLDs that are suitable for model comparison in the Bayesian framework typically involve the integrated likelihoods of the competing models, where the integrated likelihood under each model is obtained by integrating the likelihood with respect to the prior distribution of model parameters (e.g., Lindley [5] and Schwarz [7]).KLDs based on the ratio of integrated likelihoods however have been challenged by identifying priors that are compatible under the competing models and that the resulting integrated likelihoods are proper.As a way to overcome the challenges associated with prior elicitation in calculating KLD under the Bayesian framework, one may consider the Bayesian estimate of the Kullback-Leibler projection by Goutis and Robert [1], henceforth G-R KLD.More specifically, for a given reference model indexed by parameter(s)  , the G-R KLD is the infimum KLD between the likelihood under the reference model and all possible likelihoods arising from the competing fitted model.Thus if the reference model is correctly specified, then the G-R KLD is asymptotically equivalent to the KLD between the reference model and the competing fitted model evaluated at its MLE (ref.
Akaike [2]).The Bayesian estimate of G-R KLD is obtained by integrating the G-R KLD with respect to the posterior distribution of model parameters under the reference model.First, the Bayesian estimate of G-R KLD is clearly not subject to the drawback due to impropriety of the prior as long as the posterior under the reference model is proper.Second, G-R KLD is suitable for comparing the predictivity of the competing models since it is calculated with respect to the posterior density of model parameters under the reference model.However, the G-R KLD was originally developed for comparing nested generalized linear models while assuming a known true model, and its extension to general model comparison remains limited.For example, if the refer-ence model is not correctly specified, then the G-R KLD is not necessarily reduced to to the KLD between the reference model and the competing fitted model evaluated at its maximum likelihood estimate or MLE (ref. Akaike [2]), referred to as the Goutis-Robert-Akaike KLD or G-R-A KLD a more tractable model discrepancy measure.
This paper proposes to use G-R-A KLD for assessing model discrepancy in terms of the fit of certain statistics n that is central to our inference or model diagnostic purpose.That is, we evaluate the G-R-A KLD between the probability density function (p.d.f.) of n under the reference model and that under the assumed model T T r f evaluated at its MLE (see Section 2).We investigate the (asymptotic) property of G-R-A KLD under certain regularity conditions as well as under the violation of some regularities, including non-nested models.Note that unlike G-R KLD (Goutis and Robert [1]), the G-R-A KLD considered herein does not require the reference model to be the true model, nor is the true model to be specified.Also, while G-R KLD has been limited to comparing nested generalized linear models, the G-R-A KLD seems to be more flexible for comparing nested or non-nested models that are broader than generalized linear models (see Sections 3 and 4).Theorem 1 shows that under certain regularity conditions, the asymptotic expression of the posterior estimator of G-R-A KLD is comprised of a leading term for model discrepancy in the mean of n , a term for model discrepancy in the variance of n , and a constant to penalize model complexity, plus a smaller order term.Since the first two leading terms in the G-R-A KLD estimator resemble a measure that differentiates the predictability between models and r T T r f , it is natural to study its connection with Bayesian model discrepancy measures based on predictive statistics (Guttman [8], Rubin [9], Gelman et al. [10]).In particular, we consider the posterior predictive check technique using a one-sided weighted posterior predictive p-value (WPPP).The WPPP of evaluates the predictive distribution of n T under , where , pred r n T denotes the prediction of n T under , the posteriors are derived under , and the weight is used to account for the variation of under .Theorem 2 explicitly shows that for any n satisfying certain asymptotic normality and regularity conditions, how the model discrepancy is reflected by the G-R-A KLD in connection to that by WPPP.To verify the results in Theorem 1 and Theorem 2 as well as to evaluate G-R-A KLD under partial violation of the regularity conditions, we examine the (asymptotic) property of G-R-A KLD and WPPP via both simulations and real data applications motivated by two cohort studies of diabetes.These examples include the comparison between nested models as well as non-nested models.
The paper is organized as follows.Section 2 studies the G-R-A KLD for (predictive) p.d.f.s of n between two competing models.It also derives the relationship between G-R-A KLD and WPPP.Sections 3 and 4 evaluate model fit using both G-R-A and WPPP for examples that meet all the regularity conditions required in Theorems 1 and 2 as well as for examples that meet only part of these regularity conditions.for the fitted model, both governed by  , where r  and f  are the corresponding the parameter spaces.Also, when a capital letter is used to denote for a random variable (or an estimator), the corresponding lower case is for its realization (or an estimate).Let

Define Kullback-Leibler Divergence
Consider that model adequacy is evaluated based on its fit for certain statistics n that is pertinent to the inference or model diagnostics.Stemmed from Goutis and Robert [1], we assess the relative fit between models using the KLD of the distribution of n under and that under We shall refer (1) to as the Goutis-Robert-Akaike KLD or G-R-A KLD since (1) is asymptotically equivalent to the KLD proposed by Goutis and Robert [1] when the reference model is the true model (ref. Akaike [2]).r In general, for each r    , G-R-A KLD given in (1) can be regarded as a measure of the minimal information gain in model from model r f since the minimum information loss under f is achieved at ˆf  .Note that unlike G-R KLD (Goutis and Robert [1]), (1) by definition does not require the reference model to be the true model.To understand the utility of (1) in the Bayes-ian framework, below we derive its Bayesian estimate and the associated (asymptotic) property under certain regularity conditions (see Sections 2.3 and 2.4).We also study the property of (1) when partial regularity conditions hold true using simulated data (Section 3) as well as two applications of diabetes studies (Section 4).

Bayesian Estimation of the Proposed KLD
Estimating (1) involves approximating the integral with respect to ( | )   n r t  and estimating unknown model parameters  .To account for the uncertainty of model parameters, we consider the following estimator: where n , as a function of , is used for deriving the posterior of  , and  2) is nonnegative for any given n , the closer it is to zero, the less is the information loss by fitting f , instead of , to statistic .r n To gain further insight about the utility of (2), we derive its asymptotic properties below.The approximation of (2) is tied to the assumptions of n under and T T r f , which will be described in Theorems 1 and 2. In what follows, let the statements be interpreted as "almost sure" statements.Also, define for which has a unique minimum attained at .Denote  for model parameters, and for the posterior means , and ( )   , respectively.Assume the following regularity conditions under Theorems 1 and 2.
(A1) For each x , both (A2) For all sufficiently large > 0 (A4) The prior density π( ) T be asymptotically normally distributed under both models such that and The proof of Theorem 1 is given in Appendix 1.Since model comparison in real applications can rely on the relative fit to a multi-dimensional statistic, it is useful to study whether the results in Theorem 1 are applicable to the multivariate case with a fixed dimension.Suppose that n is a p-dimensional statistic ( and independent of n) with Then following the derivation given in Appendix 1, it can be shown that

KLD vs Posterior Predictive p-Value
where and are the density functions of n T under and , respectively.We shall call (7) a onesided weighted posterior predictive p-value (or WPPP) with respect to model since as shown above, it is equivalent to the mean predictive p-value of n under over all possible posterior predicted values of n T arising from .Thus WPPP can be viewed as a model discrepancy measure since it evaluates the predictivity of under .Note that WPPP proposed here differs from the posterior predictive p-value originally proposed in Rubin [9]: WPPP calculates the predictive p-value of n under an assumed model should n originate from the reference model r , while Rubin's original proposal [9] assesses the predictive p-value of n under an assumed model.WPPP is considered here since it is more coherent with the nature of G-R-A KLD as a discrepancy measure between models and .

Remark
, the mean of n is the same both and , but the variance of n T differs between r and ).Then where 2 > 0


. Let .Suppose that and converges in probability to a positive quantity, which suggests the discrepancy between and r f .However the magnitude of the model discrepancy assessed by , and 2 ( ) In this case,  has the same statistical meaning under both f and since r . Yet the substantive meaning of converges in probability to a positive quantity (i.e.,   ), and converges in probability to 0.5 (since ( ) Examples 3.4-3.6below consider situations where n is an unbiased estimate for the parameter of interest under both T f and .However n is the MLE of certain parameter under r T f but not under .In these examples, since MLEs of the mean of n under both r T f and converge to the same in probability, r

Example 3.4 (Nested). As an extension of Example
where 2 > 0  and We conduct numerical calculation for the following combination 2  , 32 = 4,25  with   , it implies that in the ory KLD should be able to distinguish f from .However, the numerical values of KLD only deviate from 0 by for 1 < .Thus in practice, KLD can hardly be differentiated from WP which converges to 0.5 in probability r ( ) KLD r f U can detect model discrepancy in the dispersion parameter.In contrast, converges to 0.5 in probability, which is not sensitive to the discrepancy in the dispersion parameter.
( ) , and  n is greater than 1.Consider fitting the data by where , while can not be obtained in a closed form.Thus KLD r f U is evaluated numerically using eight combinations given in the table below.The results suggest that the degree to which

  , |
t n KLD r f U can distinguish f from varies by situation: closer converges to 0.5 in probability, (see Table 1).
Example 3.7 below shows that the asymptotic relationship between

  , |
t n KLD r f U and does not hold in the sense of Theorem 2 due to violation of the asymptotic normality assumption specified in (3) a ( ) and (4).
In this example, both

Application of ( , | )
t n

KLD r f U to Diabetes Studies
In this section, we apply both WPPP U ies of diabetes.In these applications, we assess the model fit to the entire dataset due to its clinical implication.
Here the selected diagnostic statistic is a multivariate statistic of variables, and it does not meet all the regularity conditions specified in Section 3. Study I originated from a clinical cohort of 507 veterans with type 2 diabetes who had poor glucose control (indicated by glycosylated hemoglobin A1c or HbA1c greater than 6.5) at the baseline (fiscal year 1999), and were all treated by metformin as the mono oral glucose-lowering agent.As the literature suggested that the glucose-lowering response due to metformin may vary by an individuals obesity status, the goal of our study was to compare models that assessed whether obesity was associated with the net change in glucose level between baseline and the end of 5-year follow-up.In this study, the empirical mean of the net change in HbA1c over the 5 year period was similar between the obese vs. non-obese groups (-0.498 vs. -0.379),yet the empirical variance was greater in the obese group (1.207 vs. 0.865).Also, the distribution of HbA1c was reasonably symmetric.Thus we considered two candidate models for fitting the HbA1c change: a mixture of normals vs. a t-distribution (note that the overall empirical variance of HbA1c change is 1.03); that is, , and 4 = 0.487  (% in the obesity group) vs.
96, suggesting that model provided a modest better fit to the data compared to model r f .This result was also consistent with Figures 1 and 2 which contrasted the empirical quantiles with predicted quantiles under r and f .Note that both ˆ( ) Thus the model discrepancy assessed by  , t KLD r f is primarily attributed to the difference in the variance assumption between and  n U r f (as evident in Figures 1 and 2).In contrast, WPPP = 0.522 suggested that the overall fit were similar between the two models since the estimated net change in HbA1c was similar between these two models.been described previously (Hazuda et al. [11]).The goal of our analyses was to compare models that assessed whether glucose control trajectory class (poorer vs. better) was associated with the lower-extremity physical functional limitation score during the first follow-up period.The lower-extremity physical functional limitation score was measured by the Short Physical Performance Battery or SPPB, which is a well-established, validated measure of physical functioning.SPPB score is a sum of three items: 8-foot walking times, repeated chair stands, and balance scores, each being a 5-point liker scale 0 -4.Hence the SPPB score is discrete in nature with a range of 0 -12.Higher SPPB scores indicate better performance and less functional limitation.Exploratory data analyses suggested that the empirical variance of SPPB (15.60 vs. 14.33) was greater than the mean (7.23 vs. 8.02) in both glucose control classes.Also, due to the   left-skewedness of the distribution of SPPB, we considered models fit to the reversed SPPB, i.e., = (12 i X  .Given the nature of SPPB distribution, we compare two candidate models: .63, suggesting that has a much better fit than r f , which is coherent to the quantile plot shown in Figures 3 and 4. Since both and r f yielded similar estimates of , the model discrepancy assessed by primarily be attributed to the difference in variance estimation between and r f (as evident in Figure 2).In contrast, WPPP = 0.539 suggested similar fit between and r f as expected due to similar estimates of under both and

Discussion
This paper considers the G-R-A KLD as given in (2).This KLD is appropriate for quantifying information discrepancy regarding n contained in the competing models and T r f .We derive the asymptotic property of the G-R-A KLD in Theorem 1, and its relationship to a weighted posterior predictive p-value (WPPP) in Theo-Copyright © 2011 SciRes.OJS ) .
Therefore, applying the arguments similar to those in Ghosal and Samanta [12]       .This completes the proof.


for the p.d.f.'s of * given  under model and 6) can be viewed as a discrepancy between and in terms of their posterior predictivity of n .We show next hown r T f  , | t KLD r f Uis related to a weighted posterior predictive p-value, a typical approach for assessing model discrepancy regarding the predictivity of n in the Bayesian framework (see Rubin[9]; Gelman et al.[10]).

T
derived in Theorem 2 below.Here we assume regularity conditions (A1)-(A5) as assumed in Theorem 1.

. Theorem 2
discussed previously through two sets of examples.Examples 3.1-3.6demonstrate simulated examples that confirm the results proved in Theorems 1 and 2. All these six examples meet the regularity conditions required, while Examples 3.3, 3.5, and 3.6 involve comparing non-nested models.Example 3.7 studies the departure from Theorems 1 and 2 due to violation of the regularity condition (A5).We let in all calculations below.
n ( , | ) t KLD r f U and to compare non-nested models in two stud-( ) r n

4. 2 .
Study II: Analysis of the Functioning Score in Older Adults with Diabetes Study II arose from the subset of 119 participants with diabetes in the San Antonio Longitudinal Study of Aging (SALSA), a community-based study of the disablement process in Mexican American and European American older adults.Details of the SALSA study design, sampling approach, recruitment and field procedures have

Figure 1 .
Figure 1.Quantile plot for HbA1c for T2DM participants with obesity in VA (open triangle: quantile estimates based on the model assuming a mixture of normal distributions solid circle: quantile estimates based on the model assuming a t-distribution solid line: empirical quantiles).

Figure 2 .
Figure 2. Quantile plot for HbA1c for T2DM participants without obesity in VA (open triangle: quantile estimates based on the model assuming a mixture of normal distributions solid circle: quantile estimates based on the model assuming a t-distribution solid line: empirical quantiles).

Figure 3 .
Figure 3. Quantile plot for SPPB for T2DM participants with good glycemic control in SALSA (open triangle: quantile estimates based on the negative binomial model; solid circle: quantile estimates based on the Poisson model; solid line: empirical quantiles).

rem 2 .Figure 4 .
Figure 4. Quantile plot for SPPB for T2DM participants with poor glycemic control in SALSA (open triangle: quantile estimates based on the negative binomial model; solid circle: quantile estimates based on the Poisson model; solid line: empirical quantiles).
The result above implies the importance of choosing to yield an effective model diagnostic based on t KLD r f U .For example, n is typically chosen to be the sufficient statistic f r T o  since it contains all the information f o  .Yet, n T clearly is not the best diagnostic statistic if the estimates of the mean f n T are unbiased under both models r and o f as