A Mixture-Based Bayesian Model Averaging Method

Bayesian model averaging (BMA) is a popular and powerful statistical method for taking account of uncertainty about model form or assumptions. Usually the long run (frequentist) performance of the resulting estimator is hard to derive. This paper proposes a mixture of priors and sampling distributions as the basis of a Bayes estimator. The frequentist properties of the new Bayes estimator are automatically derived from Bayesian decision theory. It is shown that if all competing models have the same parametric form, the new Bayes estimator reduces to the BMA estimator. The method is applied to the daily Euro to US Dollar exchange rate.


Introduction
Several models are a priori plausible in statistical modeling; it is thus quite common nowadays to apply some model selection procedure to select a single one. For an overview of frequentist model selection criteria, see Leeb and Poetscher [1], Zucchini [2], and Zucchini et al. [3]. Alternatively, one can give weights to all plausible models and work with the resulting weighted estimator. This can be done either in a frequentist approach (frequentist model averaging, FMA) or in a Bayesian context (Bayesian model averaging, BMA). References on frequentist model averaging include Nguefack-Tsague [4], Burnham and Anderson [5], Nguefack-Tsague and Zucchini [6], and Nguefack-Tsague [7]-[9].

In the Bayesian context, there are three fundamental factors in decision theory: 1) a distribution family for the observation (sampling distribution), $f(x \mid \mu)$; 2) a prior distribution for the parameter, $\pi(\mu)$; 3) a loss function associated with a decision $\delta$, $L(\mu, \delta)$, with expected loss
$$E^{\pi}[L(\mu, \delta)] = \int L(\mu, \delta)\, \pi(\mu)\, d\mu.$$
The posterior distribution of $\mu$ is given by
$$\pi(\mu \mid x) = \frac{f(x \mid \mu)\, \pi(\mu)}{\int f(x \mid \mu)\, \pi(\mu)\, d\mu}.$$
A posterior distribution and a loss function lead to an optimal decision rule (Bayes rule), together with its risk function and its frequentist properties.
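As a worked illustration of how a posterior distribution and a loss function determine a Bayes rule, consider squared error loss, $L(\mu, \delta) = (\mu - \delta)^2$. This choice of loss is an assumption made here for illustration only; the result is the standard one and is not specific to this paper. The symbol $\rho(\delta \mid x)$ for the posterior expected loss is notation introduced for this example.

```latex
\begin{align*}
\rho(\delta \mid x)
  &= \int (\mu - \delta)^2 \, \pi(\mu \mid x)\, d\mu
   = \operatorname{Var}(\mu \mid x) + \bigl(E[\mu \mid x] - \delta\bigr)^2, \\
\delta^{\ast}(x)
  &= \arg\min_{\delta} \rho(\delta \mid x) = E[\mu \mid x].
\end{align*}
```

Under this loss the Bayes rule is the posterior mean, and its posterior expected loss is the posterior variance.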

Bayesian Model Selection
Consider a situation in which some quantity of interest, μ, is to be estimated from a sample of observations that can be regarded as realizations from some unknown probability distribution, and in which, in order to do so, it is necessary to specify a model for the distribution. There are usually many alternative plausible models available and, in general, each leads to a different estimate of μ. Consider a sample of data, x, and a set of K models $\mathcal{M} = \{M_1, \ldots, M_K\}$ containing the true model $M_t$. Each $M_k$ consists of a family of distributions $P(x \mid \eta_k, M_k)$, where $\eta_k$ represents a parameter (or vector of parameters). The prior probability that $M_k$ is the true model is denoted by $P(M_k)$, and the prior distribution of the parameters of $M_k$ (given that $M_k$ is true) by $P(\eta_k \mid M_k)$. Conditioning on the data x and integrating out the parameter $\eta_k$, one obtains the following posterior model probabilities:
$$P(M_k \mid x) = \frac{P(x \mid M_k)\, P(M_k)}{\sum_{j=1}^{K} P(x \mid M_j)\, P(M_j)},$$
where
$$P(x \mid M_k) = \int P(x \mid \eta_k, M_k)\, P(\eta_k \mid M_k)\, d\eta_k$$
is the integrated likelihood under $M_k$.

Bayesian model selection involves selecting the "best" model with some selection criterion; most often the Bayesian information criterion (BIC), also known as the Schwarz criterion [10], is used. It is an asymptotic approximation of the log posterior odds when the prior odds are all equal. More information on Bayesian model selection and applications can be found in Nguefack-Tsague and Ingo [11], Guan and Stephens [12], Nguefack-Tsague [13], Carvalho and Scott [14], Fridley [15], Robert [16], Liang et al. [17], and Bernado and Smith [18]. Other variants of model selection include Nguefack-Tsague and Ingo [11], who used BMA machinery to derive a focused Bayesian information criterion (FoBMA) which selects different models for different purposes, i.e., their method depends on the parameter singled out for inference.
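The posterior model probabilities can be computed directly once the integrated likelihoods are available. The following is a minimal Python sketch, not the authors' implementation. The function names and all numerical inputs are hypothetical, and the integrated likelihood is replaced by a BIC-type approximation (maximized log-likelihood minus half the number of parameters times log n), which is an assumption of this sketch.

```python
import math

def bic_log_marginal(max_log_lik, n_params, n_obs):
    """BIC-type approximation to log P(x | M_k) (an assumption of this sketch)."""
    return max_log_lik - 0.5 * n_params * math.log(n_obs)

def posterior_model_probs(log_marginals, priors):
    """P(M_k | x) from log integrated likelihoods and prior model probabilities."""
    log_terms = [lm + math.log(p) for lm, p in zip(log_marginals, priors)]
    shift = max(log_terms)  # subtract the maximum before exponentiating (stability)
    weights = [math.exp(t - shift) for t in log_terms]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical maximized log-likelihoods for two models fitted to n = 100 points.
log_marginals = [
    bic_log_marginal(max_log_lik=-120.0, n_params=2, n_obs=100),
    bic_log_marginal(max_log_lik=-121.5, n_params=1, n_obs=100),
]
probs = posterior_model_probs(log_marginals, priors=[0.5, 0.5])
print(probs)
```

With equal prior probabilities, the ratio of the two outputs is the exponential of the difference in approximate log integrated likelihoods, so the penalty on the number of parameters can reverse the ranking given by the raw log-likelihoods.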

Bayesian Model Averaging
Let μ be a quantity of interest depending on x, for example a future observation from the same process that generated x. The idea is to use a weighted average of the estimates of μ obtained using each of the alternative models, rather than the estimate obtained using any single model. More precisely, the posterior distribution of μ is given by
$$P(\mu \mid x) = \sum_{k=1}^{K} P(\mu \mid x, M_k)\, P(M_k \mid x).$$
Note that $P(\mu \mid x)$ is a weighted average of the posterior distributions $P(\mu \mid x, M_k)$. The posterior mean and posterior variance are given by
$$E[\mu \mid x] = \sum_{k=1}^{K} \hat{\mu}_k\, P(M_k \mid x), \tag{6}$$
$$\operatorname{Var}[\mu \mid x] = \sum_{k=1}^{K} \left( \operatorname{Var}[\mu \mid x, M_k] + \hat{\mu}_k^2 \right) P(M_k \mid x) - E[\mu \mid x]^2, \tag{7}$$
where $\hat{\mu}_k = E[\mu \mid x, M_k]$.

A classical reference is Hoeting et al. [19], with an extensive framework of BMA methodology and applications for different statistical models. Various real data and simulation studies have investigated the predictive performance of BMA (Clyde [20]; Clyde and George [21]). A discussion of the issue of using BMA for dealing with model uncertainty is given in Clyde and George [21]. Nguefack-Tsague [13] uses BMA in the context of estimating a multivariate mean. Other references on BMA include Marty et al. [22] (reliable ensemble forecasting), Simmons et al. [23] (benchmark dose estimation), Fan and Wang [24] (autoregressive regression models), Corani and Mignatti [25] (presence-absence data), Tsiotas [26] (quantile regression), Baran [27] (truncated normal components), Lenkoski et al. [28] (endogenous variables), Fan et al. [29] (regression models), Madadgar [30] (integration of copulas), Koop et al. [31] (instrumental variables), and Clyde et al. [32] (variable selection).
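The BMA posterior mean and variance follow from treating the BMA posterior as a mixture of the model-specific posteriors. The sketch below illustrates this in Python. The function name is hypothetical, and so are the model-specific means and variances; the posterior model probabilities 0.83 and 0.17 are the ones reported in the application later in the paper.

```python
def bma_mean_variance(post_probs, means, variances):
    """Mean and variance of a mixture of model-specific posteriors.

    post_probs : posterior model probabilities P(M_k | x)
    means      : model-specific posterior means E[mu | x, M_k]
    variances  : model-specific posterior variances Var[mu | x, M_k]
    """
    mean = sum(p * m for p, m in zip(post_probs, means))
    second_moment = sum(p * (v + m * m)
                        for p, m, v in zip(post_probs, means, variances))
    return mean, second_moment - mean * mean

# Hypothetical per-model posterior summaries.
mean, variance = bma_mean_variance(
    post_probs=[0.83, 0.17],
    means=[1.0, 1.4],
    variances=[0.04, 0.09],
)
print(mean, variance)
```

The resulting variance is the weighted within-model variance plus the weighted spread of the model-specific means around the BMA mean, so disagreement between models inflates the BMA variance.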
Clyde and Iversen [33] developed a variant of BMA in which it is not assumed that the true model belongs to the set of competing ones (M-open framework). They developed an optimal weighting scheme and showed that their method provides more accurate predictions than any of the proxy models.
An R [34] package for BMA is now available for computational purposes; this package provides ways of carrying out BMA for linear regression, generalized linear models, and survival analysis using Cox proportional hazards models. For computations, Monte Carlo methods or approximation methods are used; thus many BMA applications are based on the BIC. As one can see from the derivation of BMA, there is no unique statistical model and no unique prior distribution associated with BMA, though these are available for each competing model. This renders the frequentist properties of BMA hard to obtain from pure Bayesian decision theory. This was the main motivation of this paper in proposing an alternative Bayesian model in which the long run properties of the resulting estimators can be automatically obtained from Bayesian decision theory. The present paper is organized as follows. Section 2 introduces the new BMA method; Section 3 provides practical examples, while Section 4 provides discussion. The paper ends with concluding remarks.

The Model
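The purpose of this section is to define a new BMA method. The prior of the quantity of interest can be defined as
$$P(\mu) = \sum_{k=1}^{K} P(\mu \mid M_k)\, P(M_k),$$
where $P(\mu \mid M_k)$ is the prior of $\mu$ under model $M_k$. The parametric statistical model $P(x \mid \mu)$ can also be defined as
$$P(x \mid \mu) = \sum_{k=1}^{K} P(x \mid \mu, M_k)\, P(M_k),$$
with $P(x \mid \mu, M_k)$ the sampling distribution under $M_k$. The use of Bayes rule leads to the posterior of the quantity of interest,
$$P(\mu \mid x) = \frac{P(x \mid \mu)\, P(\mu)}{\int P(x \mid \mu)\, P(\mu)\, d\mu}. \tag{11}$$
Once a loss function is defined, Bayesian estimates are obtained with their long and short run properties known. All the frequentist properties of Bayes rules now apply; in particular, one can find conditions under which they are consistent and admissible. This approach is referred to as mixture-based Bayesian model averaging (MBMA).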
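The construction can be illustrated numerically by evaluating the mixture prior, the mixture sampling distribution, and their normalized product on a grid. The Python sketch below is illustrative only and is not the authors' code. It assumes two zero-mean models (normal and Laplace) parameterized by their variance, exponential priors on that variance, a small made-up sample, and a simple Riemann sum for the integral; all of these choices and all function names are hypothetical.

```python
import math

def normal_pdf(x, var):
    """Density of a zero-mean normal with variance var."""
    return math.exp(-x * x / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def laplace_pdf(x, var):
    """Density of a zero-mean Laplace with variance var (scale b, var = 2 b^2)."""
    b = math.sqrt(var / 2.0)
    return math.exp(-abs(x) / b) / (2.0 * b)

def sample_likelihood(data, var, pdf):
    """Joint likelihood of an i.i.d. sample under one model."""
    return math.exp(sum(math.log(pdf(x, var)) for x in data))

def mbma_posterior(grid, data, model_probs, priors, pdfs):
    """Posterior from a mixture prior and a mixture sampling distribution."""
    step = grid[1] - grid[0]
    unnormalized = []
    for var in grid:
        prior_mix = sum(w * prior(var) for w, prior in zip(model_probs, priors))
        lik_mix = sum(w * sample_likelihood(data, var, pdf)
                      for w, pdf in zip(model_probs, pdfs))
        unnormalized.append(lik_mix * prior_mix)
    normalizer = sum(unnormalized) * step  # Riemann sum for the denominator
    return [u / normalizer for u in unnormalized]

# Hypothetical data, grid over the variance, and priors.
data = [0.3, -1.1, 0.8, -0.2, 1.9, -0.6, 0.1, -1.4]
grid = [0.05 + 0.01 * i for i in range(1000)]
priors = [lambda v: math.exp(-v), lambda v: 0.5 * math.exp(-0.5 * v)]
posterior = mbma_posterior(grid, data, [0.5, 0.5], priors,
                           [normal_pdf, laplace_pdf])

step = grid[1] - grid[0]
post_mean = sum(v * p for v, p in zip(grid, posterior)) * step
post_var = sum((v - post_mean) ** 2 * p for v, p in zip(grid, posterior)) * step
print(post_mean, post_var)
```

Note that the mixture of sampling distributions is taken at the level of the whole sample, not observation by observation, which matches the definition of the statistical model as a weighted sum over models.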

Proof.
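In the numerator of (11), since all models share the same sampling distribution and $\sum_j P(M_j) = 1$,
$$P(x \mid \mu) = \sum_{j=1}^{K} P(x \mid \mu, M_j)\, P(M_j) = P(x \mid \mu, M_k) \quad \text{for every } k.$$
The numerator of (11) is therefore
$$P(x \mid \mu)\, P(\mu) = \sum_{k=1}^{K} P(x \mid \mu, M_k)\, P(\mu \mid M_k)\, P(M_k) = \sum_{k=1}^{K} P(\mu \mid x, M_k)\, P(x \mid M_k)\, P(M_k). \tag{a}$$
Therefore, the denominator of (11) is
$$P(x) = \int P(x \mid \mu)\, P(\mu)\, d\mu = \sum_{k=1}^{K} P(x \mid M_k)\, P(M_k), \tag{b}$$
where $P(x \mid M_k) = \int P(x \mid \mu, M_k)\, P(\mu \mid M_k)\, d\mu$.

Thus in this special case, the posterior mean and variance using MBMA are those of BMA given in Equations (6) and (7).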
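The reduction of MBMA to BMA under a common sampling distribution can be checked numerically. The sketch below is a hypothetical setup: a normal sampling distribution with unit variance shared by two models that differ only in their normal priors on the mean, with made-up data and prior settings. It computes both posteriors on a grid and compares them.

```python
import math

def shared_likelihood(data, mu):
    """Likelihood of an i.i.d. N(mu, 1) sample (constant factor dropped)."""
    return math.exp(-0.5 * sum((x - mu) ** 2 for x in data))

def normal_density(mu, mean, sd):
    return math.exp(-0.5 * ((mu - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

# Hypothetical data, grid, prior model probabilities, and priors.
data = [0.4, 1.2, -0.3, 0.9, 0.5]
step = 0.01
grid = [-5.0 + step * i for i in range(1001)]
model_probs = [0.3, 0.7]
priors = [lambda m: normal_density(m, 0.0, 1.0),
          lambda m: normal_density(m, 1.0, 2.0)]
lik = [shared_likelihood(data, m) for m in grid]

# BMA: per-model posteriors weighted by posterior model probabilities.
marginals = [sum(l * prior(m) for l, m in zip(lik, grid)) * step
             for prior in priors]
evidence = sum(w * ml for w, ml in zip(model_probs, marginals))
post_model_probs = [w * ml / evidence for w, ml in zip(model_probs, marginals)]
bma = [sum(pm * l * prior(m) / ml
           for pm, prior, ml in zip(post_model_probs, priors, marginals))
       for l, m in zip(lik, grid)]

# MBMA: mixture likelihood times mixture prior, then normalized.
unnormalized = [sum(w * l for w in model_probs)
                * sum(w * prior(m) for w, prior in zip(model_probs, priors))
                for l, m in zip(lik, grid)]
normalizer = sum(unnormalized) * step
mbma = [u / normalizer for u in unnormalized]

max_gap = max(abs(a - b) for a, b in zip(bma, mbma))
print(max_gap)
```

The two grids agree up to floating-point rounding, because the mixture likelihood collapses to the common likelihood when the prior model probabilities sum to one.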

Frequentist (Long Run) Evaluation of MBMA
Evaluating the long run properties of MBMA involves studying frequentist issues, including asymptotic methods, consistency, efficiency, unbiasedness, and admissibility. Details about derivations for more general Bayes estimates can be found, e.g., in Gelman [35] (p. 83). The following are proven in Gelman [35] for any Bayes estimate, and hold in particular for MBMA. Let $J(\mu)$ be the Fisher information, $I(\mu)$ the observed information, Mod the posterior mode, and $\mu_0$ the value of the parameter that makes the model distribution closest (e.g., in the sense of Kullback-Leibler information) to the true distribution.

2) If the likelihood ($P(x \mid \mu)$) is a continuous function of $\mu$ and the true parameter value $\mu_0$ is not on the boundary of the parameter space, then, as the sample size n tends to ∞, the posterior distribution of $\mu$ approaches normality with mean $\mu_0$ and variance $(n J(\mu_0))^{-1}$, and Mod is consistent for $\mu_0$.

3) Suppose the normal approximation for the posterior distribution ($P(\mu \mid x)$) holds, Mod → $\mu_0$, and the true data distribution is included in the class of models; then $[I(\mathrm{Mod})]^{1/2} (\mu - \mathrm{Mod}) \mid \mu \sim N(0, I)$, where the $I$ on the right-hand side is the identity matrix.

4) When the truth is included in the family of models ($P(x \mid \mu)$) being fitted, the posterior mode, mean, and median are consistent and asymptotically unbiased and efficient under mild regularity conditions.

5) If the prior distribution ($P(\mu)$) is strictly positive with finite Bayes risk and the risk function is continuous, MBMA is admissible.
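The large-sample normal approximation of a posterior can be sketched numerically: locate the posterior mode and take the inverse of the observed information (minus the second derivative of the log posterior at the mode) as the approximating variance. The example below is a hypothetical one chosen so the answer can be checked in closed form: a normal model with unit variance and a normal prior with mean 0 and variance 100 on the mean. The Newton iteration with finite differences is an implementation choice of this sketch, not something taken from the paper.

```python
data = [1.0 + 0.1 * ((i % 5) - 2) for i in range(20)]  # hypothetical sample
prior_var = 100.0                                      # hypothetical prior variance

def log_posterior(mu):
    """Unnormalized log posterior: N(mu, 1) likelihood, N(0, prior_var) prior."""
    log_lik = -0.5 * sum((x - mu) ** 2 for x in data)
    log_prior = -0.5 * mu * mu / prior_var
    return log_lik + log_prior

def mode_and_observed_information(f, start=0.0, h=1e-3, iterations=20):
    """Newton search for the mode of f using central finite differences."""
    mu = start
    for _ in range(iterations):
        grad = (f(mu + h) - f(mu - h)) / (2.0 * h)
        curv = (f(mu + h) - 2.0 * f(mu) + f(mu - h)) / (h * h)
        mu -= grad / curv
    info = -(f(mu + h) - 2.0 * f(mu) + f(mu - h)) / (h * h)
    return mu, info

mode, info = mode_and_observed_information(log_posterior)
approx_var = 1.0 / info
print(mode, approx_var)
```

In this conjugate example the posterior is exactly normal, so the approximation recovers the exact posterior mean and variance; for an MBMA posterior the same recipe would only give an approximation whose quality depends on the sample size and on the posterior being unimodal.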

Predictive Performance of MBMA
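One measure of predictive performance is Good's logarithmic scoring rule [36]. From the nonnegativity of the Kullback-Leibler information divergence, it follows that if $f$ and $g$ are two probability distribution functions, then
$$E_f[\log f(x)] \ge E_f[\log g(x)].$$
Taking $f$ to be the MBMA predictive distribution and $g$ that of any single model, MBMA thus provides a better expected logarithmic score (with the expectation taken under the MBMA predictive distribution) than any single model.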
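The inequality behind the logarithmic score argument can be verified on a small discrete example. The two distributions below are hypothetical, and the function names are introduced only for this sketch.

```python
import math

def expected_log_score(f, g):
    """E_f[log g(X)] for discrete distributions f and g on the same support."""
    return sum(fi * math.log(gi) for fi, gi in zip(f, g) if fi > 0.0)

def kl_divergence(f, g):
    """Kullback-Leibler divergence of g from f, as a difference of log scores."""
    return expected_log_score(f, f) - expected_log_score(f, g)

f = [0.5, 0.3, 0.2]  # hypothetical distribution used to take expectations
g = [0.2, 0.5, 0.3]  # hypothetical competing distribution

print(expected_log_score(f, f), expected_log_score(f, g), kl_divergence(f, g))
```

The divergence is zero only when the two distributions coincide, so any competing distribution scores strictly worse in expectation under $f$.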

Applications
The Laplace distribution (the double exponential) is symmetric with fat tails (much fatter than those of the normal). It is not bell-shaped (it has a peak at $x = \theta$). Suppose that the mean is known and the quantity of interest is $\sigma^2$, with a prior with parameters $\alpha$ and $\beta$ for both models (the idea remains the same for different priors, e.g., uniform priors). Model probabilities are assigned to each model. Table 1 shows the properties of the competing models, BMA, and MBMA. Starting from equal prior probabilities for $M_1$ and $M_2$, i.e., 0.5 each, after observing the data $M_1$ is more likely to be true (0.83) than $M_2$ (0.17). While $M_1$, $M_2$, and MBMA each have a prior (over the parameter of interest) and a statistical model, BMA has neither. This implies that the frequentist properties of MBMA can be automatically derived from Bayesian decision theory (see the subsection on the frequentist evaluation of MBMA); this is not possible for BMA. The Bayesian estimates (conditional on the observations) from these models are very similar, with MBMA having the smallest conditional variance (0.03).

Discussion
In general, as for any Bayes estimate, the forms of the posterior mean and variance for MBMA are not known in advance; in a special case, the properties of MBMA are those of BMA and are given in Equations (6) and (7). Posterior distributions of MBMA are very complex, so a major challenge lies in computation. The MBMA estimate is thus computationally demanding (but feasible) since the posterior $P(\mu \mid x)$ involves many sums, especially if the number K of models is large. This is not new, as BMA faces the same drawback, though nowadays programs exist for complex computations (e.g., R [34]). Another problem is the selection of priors for both models and parameters (common to any Bayesian model). In most cases, uniform priors are used for the models, i.e., $P(M_k) = 1/K$. When the number of models is large, model search strategies are sometimes used to reduce the set of models (e.g., Occam's window method, Hoeting et al. [19]) by eliminating those that seem comparatively less compatible with the data. Most current Bayesian mixtures are based either on the priors or on the statistical model, not on both as in the new MBMA described in this paper. For example, Abd and Al-Zaydi [37] [38] used statistical mixture models for order statistics; Al-Hussaini and Hussein [39] for exponential components; Ley and Steel [40] used a mixture prior with economic applications. Other Bayesian mixtures include Schäfer et al. [41] (spatial clustering), Yao [42] (Bayesian labeling), Sabourin and Naveau [43] (extremes), and Rodríguez and Walker [44] (kernel estimation). Programming code is under development for performing model averaging using MBMA with real data and simulations, and will be available as an add-on package for R [34].
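The model-reduction step mentioned above can be sketched as follows. This is a simplified reading of Occam's window, keeping only the rule that discards models whose posterior probability is far below that of the best model. The function name, the cutoff value of 20, and the posterior model probabilities are assumptions of this sketch, and the full method described by Hoeting et al. [19] includes a further exclusion step that is not shown.

```python
def occams_window(post_probs, cutoff=20.0):
    """Keep models whose posterior probability is within a factor `cutoff`
    of the best model, then renormalize the retained probabilities."""
    best = max(post_probs)
    kept = [k for k, p in enumerate(post_probs)
            if p > 0.0 and best / p <= cutoff]
    total = sum(post_probs[k] for k in kept)
    return kept, [post_probs[k] / total for k in kept]

# Hypothetical posterior model probabilities for four models.
probs = [0.55, 0.30, 0.13, 0.02]
kept, renormalized = occams_window(probs)
print(kept, renormalized)
```

Averaging then proceeds over the retained models only, which shortens the sums in the posterior at the cost of ignoring models with little support.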

Concluding Remarks
This paper proposes a new method (with an application) for model averaging in a Bayesian context (MBMA), intended for when the main focus of a data analyst is on the long run (frequentist) performance of the Bayesian estimator. The method is based on using a mixture of priors and sampling distributions for model averaging. When conditioning on the data at hand, the popular Bayesian model averaging (BMA) should be preferable, given the computational complexity of MBMA. MBMA is especially useful when exploiting the well-known frequentist properties within the framework of Bayesian decision theory.
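In the BMA posterior distribution of μ, the k-th weight, $P(M_k \mid x)$, is the posterior probability that $M_k$ is the true model. The posterior distribution of μ, conditioned on $M_k$ being true, is given by
$$P(\mu \mid x, M_k) = \frac{P(x \mid \mu, M_k)\, P(\mu \mid M_k)}{\int P(x \mid \mu, M_k)\, P(\mu \mid M_k)\, d\mu},$$
$P(x \mid \mu, M_k)$ being the parametric statistical model for model $M_k$ (i.e., the sampling distribution of $M_k$).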

Dividing (a) by (b) yields the result.

Corollary. Suppose that all the models have identical sampling distributions, that is, $P(x \mid \mu, M_k) = P(x \mid \mu, M_j)$ for all $k$ and $j$; then MBMA reduces to BMA.

1) If the sample size is large and the posterior distribution $P(\mu \mid x)$ is unimodal and roughly symmetric, one can approximate it by a normal distribution centered at Mod with variance $[I(\mathrm{Mod})]^{-1}$.

The data are the daily foreign exchange rates, Euro versus US Dollar, from January 3 2000 till 15 2006 (the aim being their return value). The prior for $\sigma^2$ is the same for both models.