Computations involved in the Bayesian approach to practical model selection problems are usually very difficult. Computational simplifications are sometimes possible, but are not generally applicable. There is a large literature on a methodology based on information theory called Minimum Description Length (MDL). It is described here how many of these techniques are either directly Bayesian in nature, or are very good objective approximations to Bayesian solutions. First, connections between the Bayesian approach and MDL are explored theoretically; thereafter a few illustrations are provided to describe how MDL can give useful computational simplifications.

Bayesian computations can be difficult, in particular those arising in model selection problems. For instance, learning the structure of Bayesian networks is in general NP-complete ([1,2]). It is therefore meaningful to consider computationally simpler alternatives which approximate the Bayesian solutions. Sometimes direct computational simplifications are possible, as shown, for example, in [

The discussion below is organized as follows. In Section 2 we briefly describe the MDL principle, and in Sections 3 and 4 we indicate how it applies to model fitting and model checking. It is shown that a particular version of MDL is equivalent to the Bayes factor criterion of model selection. Since this is most often computationally difficult, some approximations are desirable, and it is shown next how a different version of MDL can provide such an approximation. Following this discussion, applications are presented in Section 5: the MDL approach to step-wise regression in Section 5.1, wavelet thresholding in Section 5.2, and a change-point problem in Section 5.3.

The MDL approach to model fitting can be described as follows (see [5,6]). Suppose we have some data, and consider a collection of probability models for this set of data. A model provides a better fit if it provides a more compact description of the data. In terms of coding, this means that, according to MDL, the best model is the one which provides the shortest description length for the given data. The MDL approach as discussed here is also related to the Minimum Message Length (MML) approach of [

If data x is known to arise from a probability distribution p, then (see [ ]) the optimal code length for coding x is −log p(x).

The optimal code length −log p(x) is valid only in the discrete case. To handle the continuous case later, discretize x and denote it by [x]_{δ}, where δ denotes the precision. In effect we will then be considering

−log(p(x)δ) = −log p(x) − log δ

instead of −log p(x) itself as far as coding of x is concerned, when x is one-dimensional. In the r-dimensional case, we will replace the density by the probability of the r-dimensional cube of side δ containing x, namely p(x)δ^{r}, so that the optimal code length changes to −log p(x) − r log δ.

Consider data x^{n}, and suppose

{f(x^{n} | θ), θ ∈ Θ}

is the collection of models of interest. Further, let π(θ) be a prior density for θ. Given a value of θ (or a model), the optimal code length for describing x^{n} is −log f(x^{n} | θ), but since θ is unknown, its description requires a further −log π(θ) bits on average. Therefore the optimal code length is obtained upon minimizing

−log π(θ) − log f(x^{n} | θ),

so that MDL amounts to seeking that model which minimizes the sum of

• the length, in bits, of the description of the model, and

• the length, in bits, of data when encoded with the help of the model.
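As a minimal illustrative sketch of this two-part scheme (not from the text; the grid of candidate parameters and the sequences are arbitrary), consider coding a Bernoulli sequence:

```python
import math

def two_part_codelength(x, grid_size=16):
    """Two-part MDL for a Bernoulli sequence: code the parameter p on a
    finite grid (log2(grid_size) bits), then code the data with -log2 of
    its probability under that p.  Returns the best total length in bits."""
    n, s = len(x), sum(x)
    best = float("inf")
    for i in range(1, grid_size):          # grid of candidate parameters
        p = i / grid_size
        param_bits = math.log2(grid_size)  # bits to describe the model
        data_bits = -(s * math.log2(p) + (n - s) * math.log2(1 - p))
        best = min(best, param_bits + data_bits)
    return best

# A highly regular sequence compresses better than a balanced one.
regular = [1] * 18 + [0] * 2
balanced = [1, 0] * 10
print(two_part_codelength(regular) < two_part_codelength(balanced))  # True
```

A regular sequence admits a model (p near 1) under which the data part of the code is cheap, so its total description length is smaller than that of a balanced sequence.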

Now note that the posterior density of θ given the data x^{n} is

π(θ | x^{n}) = f(x^{n} | θ) π(θ) / m(x^{n}),

where m(x^{n}) = ∫ f(x^{n} | θ) π(θ) dθ is the marginal or predictive density of x^{n}. Therefore, minimizing

−log π(θ) − log f(x^{n} | θ)

over θ is equivalent to maximizing π(θ | x^{n}). Thus MDL for estimation or model fitting is equivalent to finding the highest posterior density (HPD) estimate of θ. Note, however, that a prior is needed for these calculations. The approach that a Bayesian adopts in specifying the prior is not, in general, what is accepted by practitioners of the MDL approach. Therefore, the equivalence of the MDL and HPD approaches holds either subject to accepting the same prior, or as an asymptotic or similar approximation. MDL mostly prefers an approximately uniform prior when θ lies in a space of some fixed dimension k (the same across all models), leading to the maximum likelihood estimate (MLE). The case of model parameters of different dimensions is different and interesting. This can be easily seen in the continuous case upon discretization. Denote the discretized θ^{k} by [θ^{k}] and the discretized x^{n} by [x^{n}]. Then the two-stage code length is approximately

−log π(θ^{k}) − k log δ − log f(x^{n} | θ^{k}) − n log ε.

Here δ and ε are the precisions required to discretize θ^{k} and x^{n}, respectively. Note that the term −n log ε is common across all models, so it can be ignored. However, the term −k log δ, which involves the dimension k of θ^{k} in the model, varies and is influential. According to [6,12], δ = 1/√n is optimal (see [

Minimizing the resulting criterion, −log π(θ^{k}) + (k/2) log n − log f(x^{n} | θ^{k}), will not lead to the MLE even when π(θ^{k}) is assumed to be approximately constant. In fact, [

In spite of these desirable properties, however, MDL leads to the HPD estimate of θ, which is not the usual Bayes estimate. The posterior mean is what is generally preferred, since the error in estimation then has an immediate simple answer in the posterior standard deviation. In summary, therefore, the Bayesian approach does not seem to find attractive solutions in the MDL approach as far as estimation or model fitting is concerned, unless the models under consideration are hierarchical, having parameters of varying dimension. On the other hand, when such hierarchical models are of interest, the inference problem usually involves model selection in addition to model fitting. Thus the possible gains from studying the MDL approach are in the context of model selection, as described below.
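The equivalence between minimizing the description length and finding the HPD estimate can be checked numerically on a grid; the data, prior and grid below are illustrative assumptions, not taken from the text:

```python
import math

def normal_loglik(x, mu, sigma=1.0):
    # log-likelihood of an i.i.d. N(mu, sigma^2) sample
    return sum(-0.5 * math.log(2 * math.pi * sigma**2)
               - (xi - mu)**2 / (2 * sigma**2) for xi in x)

x = [0.9, 1.4, 0.6, 1.1, 1.3]
grid = [i / 100 for i in range(-200, 301)]      # candidate values of mu
# prior: N(0, 2^2) evaluated on the grid (unnormalised is enough)
log_prior = {m: -m**2 / (2 * 2.0**2) for m in grid}

# description length  -log pi(mu) - log f(x | mu)
dl = {m: -log_prior[m] - normal_loglik(x, m) for m in grid}
# log posterior, up to the constant  log m(x)
log_post = {m: log_prior[m] + normal_loglik(x, m) for m in grid}

mdl_est = min(dl, key=dl.get)
hpd_est = max(log_post, key=log_post.get)
print(mdl_est == hpd_est)  # the two criteria pick the same grid point
```

Since the description length is exactly the negative of the log posterior plus a constant, the minimizer of one is the maximizer of the other.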

Let us recall the Bayesian approach to model selection and express it in the following form. Let x^{n} = (x_{1}, …, x_{n}) denote the data. Suppose x^{n} has density f(x^{n} | θ), θ ∈ Θ. Consider testing

M_{0}: θ ∈ Θ_{0} versus M_{1}: θ ∈ Θ_{1},

where Θ = Θ_{0} ∪ Θ_{1} for some disjoint Θ_{0} and Θ_{1}. Let π be a prior on Θ. Then π can be expressed as

π(θ) = p_{0} g_{0}(θ) + p_{1} g_{1}(θ),

where p_{i} = π(Θ_{i}), and g_{0} and g_{1} are the conditional densities (with respect to some dominating σ-finite measure) of θ under M_{0} and M_{1}, respectively. Then

m(x^{n}) = p_{0} m_{0}(x^{n}) + p_{1} m_{1}(x^{n}),

where

m_{0}(x^{n}) = ∫_{Θ_{0}} f(x^{n} | θ) g_{0}(θ) dθ

and

m_{1}(x^{n}) = ∫_{Θ_{1}} f(x^{n} | θ) g_{1}(θ) dθ.

Note that m_{i}(x^{n}) is simply the marginal or predictive density of x^{n} under M_{i}, and m(x^{n}) is the unconditional predictive density obtained upon averaging m_{0} and m_{1}. Consequently, the posterior odds ratio of M_{0} relative to M_{1} is

P(M_{0} | x^{n}) / P(M_{1} | x^{n}) = (p_{0}/p_{1}) · BF_{01}(x^{n}),

with BF_{01}(x^{n}) = m_{0}(x^{n})/m_{1}(x^{n}) denoting the Bayes factor of M_{0} relative to M_{1}. When we compare two competing models M_{0} and M_{1}, we usually take p_{0} = p_{1} = 1/2, and hence settle upon the Bayes factor as the model selection tool. This agrees well with the intuitive notion that the model yielding the better predictive ability must be the better model for the given data.

Let us consider the MDL principle now for model selection between M_{0} and M_{1}. Once the conditional prior densities g_{0} and g_{1} are agreed upon, MDL will select the model which attains the smaller value of the code length −log m_{i}(x^{n}) between the two. This is clearly equivalent to using the Bayes factor as the model selection tool, and hence this version of MDL is equivalent to the Bayes factor criterion. In the MDL literature, this version of MDL is known as “mixture MDL”, and is distinguished from the “two-stage MDL”, which separately codes the model and the prior. The two-stage MDL can be derived as an approximation to the mixture MDL, as discussed later. See [

Example 1. Suppose X^{n} = (X_{1}, …, X_{n}) is a random sample from N(μ, σ^{2}) with σ^{2} known. We want to test

M_{0}: μ = μ_{0} versus M_{1}: μ ≠ μ_{0}.

Consider the prior N(μ_{0}, τ^{2}) on μ, with τ^{2} known, under M_{1}. Then the marginal distribution of X^{n} under M_{0} is the product of N(μ_{0}, σ^{2}) densities, and under M_{1} it is obtained by integrating μ out with respect to the N(μ_{0}, τ^{2}) prior. A continuous model and a continuous prior are considered here. Since the precision of the prior parameter is the same across all models upon discretization, we will ignore the distinction and proceed with densities. Then both the Bayes factor criterion and the MDL principle will select M_{1} over M_{0} if and only if

−log m_{1}(x^{n}) < −log m_{0}(x^{n}),

where m_{0} and m_{1} are the corresponding marginal densities. Since we are comparing two logarithms, let us switch to natural logarithms. Then

−log m_{0}(x^{n}) = (n/2) log(2πσ^{2}) + Σ_{i}(x_{i} − μ_{0})^{2} / (2σ^{2})

and

−log m_{1}(x^{n}) = (n/2) log(2πσ^{2}) + S/(2σ^{2}) + (1/2) log(1 + nτ^{2}/σ^{2}) + (x̄ − μ_{0})^{2} / (2(σ^{2}/n + τ^{2})),

where x̄ is the sample mean and S = Σ_{i}(x_{i} − x̄)^{2}. Noting that

Σ_{i}(x_{i} − μ_{0})^{2} = S + n(x̄ − μ_{0})^{2}

and

(x̄ − μ_{0})^{2}/(σ^{2}/n + τ^{2}) = [n(x̄ − μ_{0})^{2}/σ^{2}] · σ^{2}/(σ^{2} + nτ^{2}),

we obtain

log(m_{1}(x^{n})/m_{0}(x^{n})) = −(1/2) log(1 + nτ^{2}/σ^{2}) + [n(x̄ − μ_{0})^{2}/(2σ^{2})] · nτ^{2}/(σ^{2} + nτ^{2}).

Therefore M_{1} is preferred over M_{0}, either by the Bayes factor or by the mixture MDL, if and only if

n(x̄ − μ_{0})^{2}/σ^{2} > (1 + σ^{2}/(nτ^{2})) log(1 + nτ^{2}/σ^{2}).
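A sketch of Example 1 in code, under the same assumptions (known σ^{2}, N(μ_{0}, τ^{2}) prior under M_{1}); the closed-form marginal under M_{1} is checked against brute-force quadrature, and the two marginals are compared directly. The data values are made up:

```python
import math

def log_m0(x, mu0, sigma):
    # marginal (= likelihood) under M0: mu = mu0
    n = len(x)
    return (-n / 2 * math.log(2 * math.pi * sigma**2)
            - sum((xi - mu0)**2 for xi in x) / (2 * sigma**2))

def log_m1(x, mu0, sigma, tau):
    # closed form: integrate the N(mu, sigma^2) likelihood against the
    # N(mu0, tau^2) prior on mu
    n = len(x)
    xbar = sum(x) / n
    s = sum((xi - xbar)**2 for xi in x)
    v = sigma**2 / n + tau**2          # marginal variance of xbar under M1
    return (-n / 2 * math.log(2 * math.pi * sigma**2) - s / (2 * sigma**2)
            + 0.5 * math.log(sigma**2 / n) - 0.5 * math.log(v)
            - (xbar - mu0)**2 / (2 * v))

def log_m1_quadrature(x, mu0, sigma, tau, lo=-10.0, hi=10.0, steps=20000):
    # brute-force check of the closed form by a midpoint Riemann sum over mu
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        mu = lo + (i + 0.5) * h
        ll = (-len(x) / 2 * math.log(2 * math.pi * sigma**2)
              - sum((xi - mu)**2 for xi in x) / (2 * sigma**2))
        prior = math.exp(-(mu - mu0)**2 / (2 * tau**2)) / math.sqrt(2 * math.pi * tau**2)
        total += math.exp(ll) * prior * h
    return math.log(total)

x = [1.2, 0.8, 1.5, 1.1, 0.9, 1.3]
a, b = log_m1(x, 0.0, 1.0, 2.0), log_m1_quadrature(x, 0.0, 1.0, 2.0)
print(abs(a - b) < 1e-4)                # closed form matches quadrature
print(log_m1(x, 0.0, 1.0, 2.0) > log_m0(x, 0.0, 1.0))  # M1 preferred here
```

Since x̄ is far from μ_{0} = 0 relative to σ/√n, the criterion above selects M_{1}, in agreement with the inequality just derived.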

Example 2. Let us consider the previous example with σ^{2} unknown now. Suppose the prior on σ is the default π(σ) ∝ 1/σ under both models. The prior on μ under M_{1} is now assumed to depend on σ, i.e., μ | σ ~ N(μ_{0}, τ^{2}σ^{2}), where τ^{2} is assumed to be a known constant for now. Then, provided n ≥ 2, σ can be integrated out in closed form under both models, and the marginal densities m_{0}(x^{n}) and m_{1}(x^{n}) are again available explicitly. Therefore, the Bayes factor criterion or the mixture MDL reduces to a criterion which is very similar to that given in the previous example, except that σ^{2} is now replaced by an estimator σ̂^{2}.

Example 3. (Jeffreys’ Test) This is similar to the problem discussed above, except that μ under M_{1} is given the Cauchy prior, with density

π(μ | σ) = (1/(πσ)) [1 + (μ − μ_{0})^{2}/σ^{2}]^{−1}.

The prior on σ is the same as before under both models: π(σ) ∝ 1/σ. This approach was suggested by Jeffreys ([

Note that m_{0}(x^{n}) is the same as in the previous example, whereas m_{1}(x^{n}) now involves the Cauchy prior.

No closed form is available for m_{1}(x^{n}) in this case. To calculate it one can proceed as follows, as indicated in Section 2.7 of [ ]: express the Cauchy prior as a scale mixture of normals,

μ | σ, λ ~ N(μ_{0}, σ^{2}/λ), λ ~ Gamma(1/2, 1/2),

where λ is the mixing Gamma variable. Now one can integrate over μ and σ in closed form to simplify m_{1}(x^{n}). Finally, one is left with a one-dimensional integral over λ, which can be numerically computed whenever needed.
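The following sketch carries out the required one-dimensional numerical integration for a Cauchy prior on μ, in the simplified situation where σ is taken as known (an assumption made here only to keep the example short; in Jeffreys’ test σ is unknown):

```python
import math

def log_lik(x, mu, sigma):
    # i.i.d. N(mu, sigma^2) log-likelihood
    return sum(-0.5 * math.log(2 * math.pi * sigma**2)
               - (xi - mu)**2 / (2 * sigma**2) for xi in x)

def log_m1_cauchy(x, mu0=0.0, sigma=1.0, lo=-30.0, hi=30.0, steps=60000):
    """Marginal density under the Cauchy(mu0, sigma) prior on mu,
    computed by one-dimensional quadrature (no closed form exists)."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        mu = lo + (i + 0.5) * h
        prior = 1.0 / (math.pi * sigma * (1 + ((mu - mu0) / sigma) ** 2))
        total += math.exp(log_lik(x, mu, sigma)) * prior * h
    return math.log(total)

def log_m0(x, mu0=0.0, sigma=1.0):
    # under M0 the marginal is just the likelihood at mu0
    return log_lik(x, mu0, sigma)

x = [2.1, 1.7, 2.4, 1.9, 2.2]
log_bf10 = log_m1_cauchy(x) - log_m0(x)
print(log_bf10 > 0)   # data far from mu0 = 0 favour M1
```

With data concentrated near 2, the Bayes factor strongly favours M_{1}; the same quadrature, applied to the mixing variable λ, implements the scale-mixture computation described above.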

Now, let us note from the examples discussed above that an efficient computation of m_{i}(x^{n}) relies on having an explicit functional form for it. This is generally possible only when a conjugate prior is used, as in Examples 1 and 2. For other priors, such as in Example 3, some numerical approximation has to be employed. Thus we are led to considering possible approximations to the mixture MDL technique, or equivalently to the Bayes factor.

From Sections 4.3.1 and 7.1 of [ ], the marginal density m(x^{n}) can be approximated by Laplace's method. Let

l(θ) = log f(x^{n} | θ),

and denote the Hessian of −l by

H(θ) = −∂^{2} l(θ)/∂θ ∂θ′.

Also, let θ̂ denote either the MLE or the posterior mode. Then

m(x^{n}) ≈ f(x^{n} | θ̂) π(θ̂) (2π)^{k/2} |H(θ̂)|^{−1/2},

so that

−log m(x^{n}) ≈ −log f(x^{n} | θ̂) − log π(θ̂) − (k/2) log(2π) + (1/2) log |H(θ̂)|.

Ignoring terms that stay bounded as n → ∞, [ ] recommends the criterion

SIC = −log f(x^{n} | θ̂) + (1/2) log |H(θ̂)|

for implementing MDL. See [12,16-18] for further details.

If x_{1}, …, x_{n} are i.i.d. observations, then we have

H(θ̂) ≈ n I(θ̂),

where I(θ) is the Fisher information matrix per observation, and hence

(1/2) log |H(θ̂)| = (k/2) log n + (1/2) log |I(θ̂)|.

Now, ignoring terms that stay bounded as n → ∞, we obtain the Schwarz criterion ([ ]),

BIC = −log f(x^{n} | θ̂) + (k/2) log n,

which can be seen to be asymptotically equivalent to SIC.
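As an illustration of the Laplace approximation behind SIC and BIC, the sketch below applies it to the conjugate normal-mean model of Example 1, where the integrand is exactly Gaussian and the approximation is therefore exact; the data and hyperparameters are arbitrary:

```python
import math

def log_marginal_exact(x, mu0, sigma, tau):
    # conjugate closed form for the N(mu0, tau^2) prior on mu
    n = len(x)
    xbar = sum(x) / n
    s = sum((xi - xbar)**2 for xi in x)
    v = sigma**2 / n + tau**2
    return (-n / 2 * math.log(2 * math.pi * sigma**2) - s / (2 * sigma**2)
            + 0.5 * math.log(sigma**2 / n) - 0.5 * math.log(v)
            - (xbar - mu0)**2 / (2 * v))

def log_marginal_laplace(x, mu0, sigma, tau):
    # Laplace: log f + log prior at the posterior mode,
    # plus (k/2) log(2 pi) - (1/2) log |H|
    n = len(x)
    xbar = sum(x) / n
    hess = n / sigma**2 + 1 / tau**2                 # Hessian of -log(f * prior)
    mode = (n * xbar / sigma**2 + mu0 / tau**2) / hess
    log_f = sum(-0.5 * math.log(2 * math.pi * sigma**2)
                - (xi - mode)**2 / (2 * sigma**2) for xi in x)
    log_prior = (-0.5 * math.log(2 * math.pi * tau**2)
                 - (mode - mu0)**2 / (2 * tau**2))
    k = 1
    return log_f + log_prior + k / 2 * math.log(2 * math.pi) - 0.5 * math.log(hess)

x = [1.2, 0.8, 1.5, 1.1, 0.9, 1.3]
a = log_marginal_exact(x, 0.0, 1.0, 2.0)
b = log_marginal_laplace(x, 0.0, 1.0, 2.0)
print(abs(a - b) < 1e-10)   # exact agreement in the Gaussian case
```

Dropping the bounded terms (log prior and the 2π factor) from the Laplace expression and replacing (1/2) log hess by (k/2) log n gives exactly the BIC form above.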

Example 4. [ ] considers testing

M_{0}: θ ∈ Θ_{0}

versus

M_{1}: θ ∈ Θ_{1},

where Θ_{0} and Θ_{1} have different dimensions. The model selection criterion of Rissanen is SIC as described in Equations (13) and (14). However, in this problem a better approximation is employed for the mixture MDL, the mixture being the Jeffreys mixture, i.e., the conditional prior densities under M_{i} for θ are given by

π_{i}(θ) = √|I(θ)| / ∫_{Θ_{i}} √|I(η)| dη, θ ∈ Θ_{i},

where Θ_{i} is a compact subset of the relevant parameter space. Consequently, it follows that

−log m_{i}(x^{n}) ≈ −log f(x^{n} | θ̂_{i}) + (k_{i}/2) log(n/2π) + log ∫_{Θ_{i}} √|I(η)| dη,

where θ̂_{i} is the MLE of θ under M_{i}. Compare this with (15) and note that the term involving |I(θ̂_{i})| vanishes.

We would like to note here that many authors [21,22] define the MDL estimate to be the same as the HPD estimate with respect to the Jeffreys prior restricted to some compact set where its integral is finite, which is the stochastic complexity approach advocated above.

It must be emphasized that proper priors are being employed to derive the SIC criterion, and hence indeterminacy and inconsistency problems faced by techniques employing improper priors are not a difficulty in this approach. Moreover, this approach can be viewed as an implementable approximation to an objective Bayesian solution.

Now consider the two-stage MDL which codes the prior and the likelihood separately and adds the two description lengths. This approach is therefore similar to estimating the parameter with the HPD estimate when there is an informative prior, or with the MLE, but the resulting minimum description length does have interesting features. To see when and how this approach approximates the above mentioned model selection criterion, let us look at some of the specific details in the two stages of coding. See [12,13] for further details. Again, recall the setup in (4) and (5).

Stage 1. Let θ̂ be an estimate of θ, such as the posterior mean, HPD estimate or MLE under M_{i}. This needs to be coded. Consider the prior density π_{i}(θ) conditional on M_{i} being true. Usually MDL would choose a uniform density. Restrict θ to a large compact subset of the parameter space and discretize it as discussed in Section 3 with a precision of δ = 1/√n. Then the codelength required for coding θ̂ is

−log π_{i}(θ̂) + (k_{i}/2) log n.

Stage 2. Now the data x^{n} is coded using the model density f(x^{n} | θ̂). Discretization may again be needed, say with precision ε. Thus the description length for coding x^{n} will be

−log f(x^{n} | θ̂) − n log ε.

Summing these two codelengths, therefore, we obtain a total description length of

−log f(x^{n} | θ̂) − n log ε − log π_{i}(θ̂) + (k_{i}/2) log n.

Since the second term above, −n log ε, is constant over both M_{0} and M_{1}, and the third term, −log π_{i}(θ̂), stays bounded as n increases, these two terms are dropped from the MDL two-stage coding criterion. Thus, for regular parametric models, the two-stage MDL simplifies to the same criterion (for M_{i}) as BIC, namely,

−log f(x^{n} | θ̂) + (k_{i}/2) log n.

In more complicated model selection problems, the two-stage MDL will involve further steps and may differ from BIC.

It may also be seen upon comparing (19) with (15) that the performance of SIC-based MDL should be superior to the simplified two-stage MDL for moderate n, since SIC uses a better precision for coding the parameter, namely, one based on the Fisher information.

Model selection is an important part of parametric and nonparametric regression and smoothing. Variable selection in multiple linear regression, the order of the spline to fit, and wavelet thresholding are some such problems. We will briefly consider these problems to see how MDL methods can provide computationally attractive approximations to the respective Bayesian solutions.

Variable selection is an important and well studied problem in the context of normal linear models. Literature includes [23-32]. We will only touch upon this area with the specific intention of examining useful and computationally attractive approximations to some of the Bayesian methods.

Suppose we have an observation vector y^{n} on a response variable Y and also measurements on a set of p potential explanatory variables (or regressors). Following [ ], introduce the indicator vector γ = (γ_{1}, …, γ_{p}), where γ_{j} = 1 if the jth regressor is included in the model and γ_{j} = 0 otherwise. Note that γ is, then, a Bernoulli sequence associated with the set of regression coefficients. Let β_{γ} denote the vector of non-zero regression coefficients corresponding to γ, and X_{γ} the corresponding design matrix, which results in the model

y^{n} = X_{γ} β_{γ} + ε, ε ~ N_{n}(0, σ^{2} I).

Selecting the best model, then, is actually an estimation problem: find the HPD estimate of γ, starting with a prior on γ and a prior on (β_{γ}, σ) given γ. The two-stage MDL, which is the simplest, uses the criterion of minimizing

−log π(γ) − log f(y^{n} | γ, β̂_{γ}, σ̂_{γ}) + (k_{γ}/2) log n.

The MLEs for β_{γ} and σ^{2} given γ are easily available:

β̂_{γ} = (X_{γ}′ X_{γ})^{−1} X_{γ}′ y^{n}, σ̂_{γ}^{2} = R_{γ}/n, where R_{γ} = ||y^{n} − X_{γ} β̂_{γ}||^{2}.

Consider the uniform prior on γ, all 2^{p} values receiving the same weight. Using these, we can re-write the MDL criterion as the one which minimizes (as in Example 2)

(n/2) log(R_{γ}/n) + (k_{γ}/2) log n,

where k_{γ} is the number of γ_{j} equal to 1.
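A toy implementation of this two-stage criterion, minimizing (n/2) log(R_{γ}/n) + (k_{γ}/2) log n over all subsets by brute force; the synthetic data and the tiny Gaussian-elimination solver are illustrative assumptions:

```python
import itertools, math, random

def solve(A, b):
    """Gauss-Jordan elimination for the (small) normal equations."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [u - f * v for u, v in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def rss(X, y, cols):
    """Residual sum of squares for the submodel `cols` (intercept always in)."""
    Z = [[1.0] + [row[j] for j in cols] for row in X]
    k = len(Z[0])
    XtX = [[sum(Z[i][a] * Z[i][b] for i in range(len(y))) for b in range(k)]
           for a in range(k)]
    Xty = [sum(Z[i][a] * y[i] for i in range(len(y))) for a in range(k)]
    beta = solve(XtX, Xty)
    return sum((y[i] - sum(b * z for b, z in zip(beta, Z[i])))**2
               for i in range(len(y)))

def mdl(X, y, cols):
    n, k = len(y), len(cols) + 1          # +1 for the intercept
    return n / 2 * math.log(rss(X, y, cols) / n) + k / 2 * math.log(n)

random.seed(1)
n, p = 60, 4
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
# true model uses regressors 0 and 2 only
y = [2.0 + 1.5 * row[0] - 1.0 * row[2] + random.gauss(0, 0.5) for row in X]

best = min((tuple(c) for r in range(p + 1)
            for c in itertools.combinations(range(p), r)),
           key=lambda c: mdl(X, y, c))
print(best)   # the true regressors 0 and 2 should appear here
```

On simulated data with a strong signal the criterion retains the true regressors, while the (k_{γ}/2) log n term discourages spurious ones.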

We can also derive the mixture MDL or stochastic complexity of a given model. If π(β_{γ}, σ | γ) is the prior density under model γ, then

m(y^{n} | γ) = ∫ f(y^{n} | β_{γ}, σ) π(β_{γ}, σ | γ) dβ_{γ} dσ.

Applying (13) and (14) after evaluating the information matrix of the parameters, and ignoring terms that are irrelevant for model selection, one obtains (see [

If π is chosen to be the conjugate prior density, then the marginal density can be explicitly derived. Details on this, and further simplifications obtained upon using Zellner's g-prior, can be found in [

This method is only useful if one is interested in comparing a few of these models, arising out of some pre-specified subsets. Comparing all the models is not a computationally viable option for even moderate values of p, since for each model γ one has to compute the corresponding β̂_{γ} and R_{γ}.

We are more interested in a different problem, namely, whether an extra regressor should be added to an already determined model. This is the idea behind forward selection in step-wise regression. In this set-up, the model comparison problem can be stated as comparing

M_{k}: the model containing the k regressors already selected

versus

M_{k+1}: the model containing those k regressors and one candidate regressor.

This is actually a model building method, so we assume that the intercept is always included, which gives the starting model. Then we decide whether this model needs to be expanded by adding additional regressors. Thus, at step k, we have an existing model with k regressors, and we fix one of the remaining regressors as the candidate for possible selection. Now the two-stage MDL approach is straightforward. From (22) and (24), we note that M_{k+1} is to be selected if and only if

(n/2) log(R_{k+1}/n) + ((k+1)/2) log n < (n/2) log(R_{k}/n) + (k/2) log n, (27)

where (k/2) log n is the description length of the model with k regressors and R_{k} is its residual sum of squares as given in (23). A closer look at (27) reveals certain interesting facts. We need some additional notation involving the design matrices and the corresponding projection matrices; we assume that the required matrix inverses exist.

Then we note the following result, which may be found, for example, in [ ]: the residual sums of squares at successive steps satisfy

R_{k+1} = R_{k}(1 − ρ^{2}),

where ρ is simply the partial correlation coefficient between Y and the candidate regressor conditional on the k regressors already in the model. Substituting this in (27), we see that M_{k+1} is selected

if and only if (n/2) log(1 − ρ^{2}) + (1/2) log n < 0,

if and only if ρ^{2} > 1 − n^{−1/n}. (30)

This method does have some appeal, in that at each step, it tries to select that variable which has the largest partial correlation with the response (conditional on the variables which are already in the model), just like the step-wise regression method. However, unlike the stepwise regression method it does not require any stopping rule to decide whether the candidate should be added. It relies on the magnitude of the partial correlation instead.
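The selection rule (30) can be sketched as a forward-selection loop; the identity R_{k+1} = R_{k}(1 − ρ^{2}) lets us compute the squared partial correlation from residual sums of squares. The simulated data are illustrative:

```python
import math, random

def ols_rss(Z, y):
    """RSS by solving the normal equations with Gauss-Jordan elimination."""
    k, n = len(Z[0]), len(y)
    M = [[sum(Z[i][a] * Z[i][b] for i in range(n)) for b in range(k)]
         + [sum(Z[i][a] * y[i] for i in range(n))] for a in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(k):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [u - f * v for u, v in zip(M[r], M[c])]
    beta = [M[i][k] / M[i][i] for i in range(k)]
    return sum((y[i] - sum(b * z for b, z in zip(beta, Z[i])))**2 for i in range(n))

def forward_mdl(X, y):
    """Forward selection: at each step add the regressor with the largest
    squared partial correlation with Y, provided it exceeds 1 - n^(-1/n)."""
    n, p = len(y), len(X[0])
    thresh = 1 - n ** (-1 / n)
    chosen, remaining = [], list(range(p))
    design = [[1.0] for _ in range(n)]            # intercept-only start
    cur_rss = ols_rss(design, y)
    while remaining:
        best_j, best_rho2, best_rss = None, -1.0, None
        for j in remaining:
            trial = [row + [X[i][j]] for i, row in enumerate(design)]
            r = ols_rss(trial, y)
            rho2 = 1 - r / cur_rss                # squared partial correlation
            if rho2 > best_rho2:
                best_j, best_rho2, best_rss = j, rho2, r
        if best_rho2 <= thresh:
            break                                  # no candidate passes (30)
        chosen.append(best_j)
        remaining.remove(best_j)
        design = [row + [X[i][best_j]] for i, row in enumerate(design)]
        cur_rss = best_rss
    return chosen

random.seed(7)
n = 40
X = [[random.gauss(0, 1) for _ in range(5)] for _ in range(n)]
y = [1.0 + 2.0 * row[1] + 1.5 * row[3] + random.gauss(0, 1) for row in X]
sel = forward_mdl(X, y)
print(sorted(sel))   # regressors 1 and 3 carry the signal
```

The stopping behaviour is built into the threshold 1 − n^{−1/n}, so no external stopping rule is needed, exactly as noted above.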

One can also apply the stochastic complexity criterion given in (26) above. This yields a selection rule (31) which is related to the step-wise regression approach, but uses more information than just the partial correlation.

A full-fledged Bayesian approach using the g-prior can also be implemented as shown below. Note that

m_{k}(y^{n}) = ∫ f(y^{n} | β_{k}, σ) π_{k}(β_{k}, σ) dβ_{k} dσ,

and similarly for m_{k+1}(y^{n}), where π_{k} and π_{k+1}, respectively, are the prior densities under M_{k} and M_{k+1}. Taking these priors to be g-priors, namely,

β_{k} | σ, g ~ N_{k}(0, g σ^{2} (X_{k}′ X_{k})^{−1}),

along with a (proper) prior density π(g) for the hyperparameter g, we obtain expression (33) for the marginal densities, which involves one-dimensional integrals over g.

The one-dimensional integrals in (33), however, cannot be obtained in closed form. One could instead approximate them by evaluating the integrands at ĝ, where ĝ is the ML-II (cf. [

Example 5. We illustrate the MDL approach to step-wise regression by applying it to the Iowa corn yield data (see [37,38]). We have not included “year” as a regressor (it is a proxy for technological advance) and instead have considered only the weather-related regressors.

In this data set the variables are: X_{1} = Year, 1 denoting 1930, X_{2} = Pre-season precipitation, X_{3} = May temperature, X_{4} = June rain, X_{5} = June temperature, X_{6} = July rain, X_{7} = July temperature, X_{8} = August rain, X_{9} = August temperature, and Y = X_{10} = Corn Yield.

As mentioned earlier, we always keep the intercept and check whether this regression should be enlarged by adding more regressors. We first apply the two-stage MDL criterion. From (30), at step k, we consider only those regressors not already in the model which satisfy ρ^{2} ≥ 1 − n^{−1/n} (= 0.1005 in this example). From this set we pick the one with the largest ρ^{2}. The values of ρ^{2} for the relevant steps are listed below.

According to our procedure we select first, followed by and the selection ends there.

We consider the SIC criterion next. From (31), at step k, we pick the regressor with the largest value of the criterion, provided it is positive. The values for the relevant steps are given below.

According to SIC our order of selection is.

Consider the nonparametric regression problem where we have the following model for the noisy observations:

y_{i} = s(t_{i}) + ε_{i}, i = 1, …, n,

where the ε_{i} are i.i.d. errors with unknown error variance σ^{2}, and s is a function (or signal) defined on some interval. Assuming s is a smooth function satisfying certain regularity conditions (see [15,39,40]), we have the wavelet decomposition of s:

s(t) = Σ_{j,k} θ_{jk} ψ_{jk}(t), (36)

where the ψ_{jk} are the wavelet functions and θ is the corresponding vector of wavelet coefficients. We assume that the in general infinite sum in (36) can be taken to be a finite sum (or at least a very good approximation) as indicated in [

Upon applying the discrete wavelet transform (DWT) to y^{n}, we get the estimated wavelet coefficients θ̂_{j}. Consider now the equivalent model:

θ̂_{j} = θ_{j} + e_{j}, j = 1, …, M,

where the errors e_{j} are again i.i.d. normal, the DWT being an orthogonal transformation.

The model selection problem here involves determining the number of non-zero wavelet coefficients:

M_{k}: exactly k of the θ_{j} are non-zero

versus

M_{k′}: exactly k′ of the θ_{j} are non-zero,

where k and k′ range over the M wavelet coefficients of interest.

The prior distribution on the non-zero θ_{j} is assumed to be i.i.d. under M_{k}.

Since we have not identified the locations (indices) of the non-zero wavelet coefficients, we proceed as follows to describe the prior structure. With each θ_{j} we associate a binary variable γ_{j}, as in [

Finally, we let the γ_{j} be i.i.d. (with π(γ) the corresponding joint density), and define the prior structure under M_{k} accordingly.

The nuisance parameter σ, which is common under both models, is given the prior density π(σ). Then it follows that the marginal density in (40) is obtained by integrating out the θ_{j}'s and σ, and summing over γ.

Note that only those θ_{j} for which γ_{j} = 1 appear in the integral above.

The two-stage MDL approach is clearly the easiest to take in this problem. As described earlier, it approximates −log m(y^{n} | M_{k}) by coding the prior and the likelihood (both evaluated at an estimate) separately and summing the codelengths to obtain the description length. In this case, discretizing to a precision of 1/√n and ignoring terms that stay bounded as n increases, one obtains a criterion in which the first term, obtained from Stirling's approximation, codes which k of the M coefficients are non-zero, the second term codes the non-zero θ_{j}'s, and θ̂ is an estimate such as the MLE. On the other hand, computing SIC or the mixture MDL is not an impossible task either. In fact, to integrate out the θ_{j}'s in Equation (40), we argue as follows.

First, for a fixed configuration γ, integrate the likelihood against the prior on the non-zero θ_{j}'s. Now, as we argued in the Jeffreys test, we take the default prior on σ and integrate out σ also. This leaves us with an expression involving a sum over the configurations γ.

This sum over γ is interesting. Most of the contribution to it is expected from the γ with γ_{j} = 1 corresponding to the largest of the |θ̂_{j}|. The Bayes estimate, on the other hand, will arise from a weighted average of all the terms, with weights depending on the posterior probabilities of the corresponding γ. As is clear, the weighted average over the γ space is computationally very intensive when M and n are large. An appropriate approximation is indeed necessary, and MDL is important in that sense.

Even though we have justified the two-stage MDL for wavelet thresholding by showing that it is an approximation to a mixture MDL corresponding to a certain prior, a few questions related to this prior remain. First of all, the prior assumption that the θ_{j} are i.i.d. is unreasonable; wavelet coefficients corresponding to wavelets at different levels of resolution should be modeled with different variances. Specifically, the variances should decrease as the resolution level increases, to indicate the decreasing importance of the coefficients (see [40,42,43]). Secondly, wavelet coefficients tend to cluster according to resolution levels (see [

We shall now consider MDL methods for the problem of deciding whether there is a change-point in a given time series. We use the data on British road casualties available in [

We follow the approach of [

We model the change-point question as the model selection problem:

M_{0}: no change-point versus M_{1}: a change-point at r,

where r is the possible change-point. We further let the observations be i.i.d. within each regime. Note that the parameters indexing the priors are hyperparameters.

First, we approximate the likelihood function as follows, obtaining expression (43). Expanding (43) about its maximum in a Taylor series and ignoring higher-order terms, we obtain (44).

What is appealing and useful about (44) is that it is proportional to a multivariate normal likelihood function with mean vector x and a covariance matrix of known structure.

Thus hierarchical Bayesian analysis of multivariate normal linear models is applicable (see [15,36,47]). We note that the hyperparameters do not have substantial influence, and hence we treat them as fixed constants (to be chosen based on some sensitivity analysis) in the following discussion. Consequently, denoting by π_{0} and π_{1} the respective prior densities under M_{0} and M_{1}, we obtain the corresponding marginal densities.

Now, from multivariate normal theory, observe that the relevant integrals over the mean parameters are available in closed form, and subsequently the scale parameter can be integrated out as in Example 2. Thus we arrive at expression (47) for m_{0}.

Expression (47) is not available, in general, in closed form. Approaching it from the MDL angle, we look for a further approximation employing an ML-II type estimator (cf. [

For fixed values of the remaining hyperparameters, this involves only examining a smooth function of a single variable, which is a simple computational task.

We proceed exactly as above to derive m_{1} also. Partitioning the required vectors and matrices accordingly, we obtain the corresponding expression (48).

As before, the MDL technique involves deriving the ML-II estimator from (48), for fixed values of the hyperparameters. Obtaining the maximizer is very similar to the computation for m_{0}.
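As a much-simplified sketch of this computation, the code below profiles a change-point r in a Poisson count series and compares the two models by an SIC-type criterion; this replaces the hierarchical ML-II analysis above with plain maximum likelihood, and all numbers are simulated, not the casualty data:

```python
import math, random

def pois_loglik(x, lam):
    # Poisson log-likelihood up to the additive constant -sum(log x_i!)
    return sum(xi * math.log(lam) - lam for xi in x)

def changepoint_sic(x):
    """SIC-style comparison of a single Poisson rate (M0) against a
    change of rate at some point r (M1), profiling over r."""
    n = len(x)
    lam = sum(x) / n
    sic0 = -pois_loglik(x, lam) + 0.5 * math.log(n)        # k = 1
    best_r, best_sic1 = None, float("inf")
    for r in range(1, n):                                   # candidate change-points
        l1, l2 = sum(x[:r]) / r, sum(x[r:]) / (n - r)
        if l1 == 0 or l2 == 0:
            continue
        # two rate parameters, so the penalty is (2/2) log n
        sic1 = -(pois_loglik(x[:r], l1) + pois_loglik(x[r:], l2)) + math.log(n)
        if sic1 < best_sic1:
            best_r, best_sic1 = r, sic1
    return best_r, best_sic1 < sic0                         # change detected?

random.seed(11)
# intensity drops from about 12 to about 7 at time 60 (illustrative numbers)
x = ([sum(random.random() < 12/100 for _ in range(100)) for _ in range(60)]
     + [sum(random.random() < 7/100 for _ in range(100)) for _ in range(60)])
r, detected = changepoint_sic(x)
print(detected, r)   # a change should be detected near time 60
```

Profiling over r is exactly the "smooth function of a single variable" examination described above; with a clear drop in intensity the criterion detects the change, while for a series with no drop it would not.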

We have applied this technique to analyze the British road casualties data. Figures 1(a) and (b) show the criterion as a function of the change-point r for the LGV and HGV data, respectively. As mentioned previously, the hyperparameters do not seem to play any influential role; any reasonable values seem to yield similar results.

There seems to be strong evidence for a change-point in the intensity rate of casualties (induced by the ‘seatbelt law’) in the case of the LGV data, whereas this is absent in the case of the HGV data. This is evident from the very high value of the criterion near the ML-II estimate of the change-point for the LGV data.

There is a vast literature related to MDL, mostly in engineering and computer science. See [