Bayes Factor with Lindley Paradox and Two Standard Methods in Model Selection
1. Introduction
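Model selection is a central question for many statisticians [1]. To identify the "best model" for real data, different approaches have been proposed over the last century, including well-known methods such as the F-test [2], AIC, BIC [3], and Bayesian model averaging [4]. We focus on the Bayesian approach. Suppose we analyze data $D$ under a set of candidate models $M_1, \dots, M_K$. We denote by $\theta_k$ the parameter of model $M_k$ and by $p(M_k)$ its prior probability, with likelihood $p(D \mid \theta_k, M_k)$ and parameter prior $p(\theta_k \mid M_k)$. The joint posterior for model $M_k$ with parameter $\theta_k$ is proportional to $p(D \mid \theta_k, M_k)\, p(\theta_k \mid M_k)\, p(M_k)$, and integrating out $\theta_k$ gives the posterior model probability
$$p(M_k \mid D) = \frac{p(M_k) \int p(D \mid \theta_k, M_k)\, p(\theta_k \mid M_k)\, d\theta_k}{\sum_{j=1}^{K} p(M_j) \int p(D \mid \theta_j, M_j)\, p(\theta_j \mid M_j)\, d\theta_j}.$$
In a Bayesian analysis, the priors $p(M_k)$ on each model and $p(\theta_k \mid M_k)$ on the parameters of model $k$ are proper and subjective. The Bayesian solution to the model selection question is to compute the posterior probability $p(M_k \mid D)$ for each model and to choose the model that maximizes $p(M_k \mid D)$.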
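As a minimal sketch of this selection rule, the following Python snippet turns log marginal likelihoods and prior model probabilities into posterior model probabilities. The function name `posterior_model_probs` and the numerical inputs are hypothetical and chosen only for illustration; the marginal likelihoods are assumed to have been computed elsewhere.

```python
import math

def posterior_model_probs(log_marginals, priors):
    """Posterior model probabilities from log marginal likelihoods and model priors."""
    # Work on the log scale and subtract the maximum for numerical stability.
    log_terms = [lm + math.log(p) for lm, p in zip(log_marginals, priors)]
    m = max(log_terms)
    weights = [math.exp(t - m) for t in log_terms]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical log marginal likelihoods for three candidate models.
log_marginals = [-104.2, -101.7, -103.0]
priors = [1 / 3, 1 / 3, 1 / 3]

probs = posterior_model_probs(log_marginals, priors)
# Select the model with the largest posterior probability.
best = max(range(len(probs)), key=probs.__getitem__)
print(probs, "selected model index:", best)
```

With equal model priors, the ranking is driven entirely by the marginal likelihoods, which is the situation in which the Bayes factor alone decides between two models.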
However, the Bayes factor has its own limitations: by itself it only quantifies the evidence for a hypothesized model against a null model [5]. The Bayes factor is also closely tied to the priors: changing the width of the prior changes the Bayes factor. This sensitivity leads us to consider the Lindley paradox.
In Section 2, we give a simple and general explanation of the Bayes factor. In Section 3, we discuss Lindley's paradox. Section 4 contains the main theoretical part, where we give the derivations of BIC and AIC, followed by a simple example that applies both criteria.
2. Bayes Factor
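We first construct one of the most important quantities in Bayesian methods, the Bayes factor [6].
Suppose we have data $D$ and two different models $M_1$ and $M_2$ with prior probabilities $P(M_1)$ and $P(M_2)$. By the rule of conditional probability, we have:
$$P(M_k \mid D) = \frac{P(D \mid M_k)\, P(M_k)}{P(D)}, \qquad k = 1, 2.$$
Recall that the odds of an event with probability $p$ are $p / (1 - p)$. Here $P(D \mid M_k)$ is the marginal likelihood,
$$P(D \mid M_k) = \int P(D \mid \theta_k, M_k)\, \pi(\theta_k \mid M_k)\, d\theta_k,$$
where $\pi(\theta_k \mid M_k)$ denotes the prior on the parameters. Then, by Bayes' rule,
$$\frac{P(M_1 \mid D)}{P(M_2 \mid D)} = \frac{P(D \mid M_1)}{P(D \mid M_2)} \times \frac{P(M_1)}{P(M_2)},$$
where $P(D \mid M_1) / P(D \mid M_2)$ is defined as the Bayes factor; it is the ratio of the marginal likelihoods. We denote the Bayes factor as:
$$B_{12} = \frac{P(D \mid M_1)}{P(D \mid M_2)}.$$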
The Bayesian approach suits many model testing problems because it can quantify how decisively the evidence supports the null model, in contrast to p-values [7], which are usually regarded only as a measure of evidence against the null [8]. The Bayes factor (Jeffreys, 1961) [9] is the standard tool for Bayesian hypothesis testing. Assume that $f_1(D \mid \theta_1)$ and $f_2(D \mid \theta_2)$ are the likelihoods for $D$ under two competing models $M_1$ and $M_2$ with parameters $\theta_1$ and $\theta_2$, and let $\pi_1(\theta_1)$ and $\pi_2(\theta_2)$ be their prior distributions [10]. The Bayes factor for $M_1$ against $M_2$ is:
$$B_{12} = \frac{\int f_1(D \mid \theta_1)\, \pi_1(\theta_1)\, d\theta_1}{\int f_2(D \mid \theta_2)\, \pi_2(\theta_2)\, d\theta_2}.$$
When $B_{12} > 1$, the evidence from the data favors $M_1$ over $M_2$. The Bayes factor thus avoids many limitations of p-value testing, and its development for testing statistical models has applications in many areas of research [11].
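To make the ratio of marginal likelihoods concrete, here is a small sketch for a toy problem that is not taken from the paper: $k$ successes in $n$ Bernoulli trials, with $M_1$ fixing the success probability at 0.5 and $M_2$ placing a uniform prior on it. The function name and the data values are assumptions for illustration.

```python
from math import comb

def bayes_factor_fair_vs_uniform(k, n):
    """B_12 for M_1: p = 0.5 against M_2: p ~ Uniform(0, 1), given k successes in n trials."""
    # Marginal likelihood under M_1: binomial probability at p = 0.5.
    m1 = comb(n, k) * 0.5 ** n
    # Marginal likelihood under M_2: the binomial likelihood integrated over a
    # uniform prior equals 1 / (n + 1) for every k.
    m2 = 1.0 / (n + 1)
    return m1 / m2

b_balanced = bayes_factor_fair_vs_uniform(5, 10)  # data close to p = 0.5
b_extreme = bayes_factor_fair_vs_uniform(9, 10)   # data far from p = 0.5
print(b_balanced, b_extreme)
```

A value above one supports $M_1$ and a value below one supports $M_2$, so unlike a p-value the same quantity can express evidence in favor of the null model.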
3. Priors and Lindley Paradox
3.1. Introduction to Lindley Paradox
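Lindley's paradox shows how a p-value (or a number of standard deviations) used in a frequentist hypothesis test [12] can lead to a completely different inference from a Bayesian hypothesis test [13].
Problems arise when we use improper priors (priors that do not integrate to one) in null hypothesis testing and model selection. Such priors can be acceptable for other purposes, such as estimation, but not here. Consider testing the hypotheses:
$$H_0: \theta \in \Theta_0 \quad \text{versus} \quad H_1: \theta \in \Theta_1.$$
Writing $f(x \mid \theta)$ for the sampling density, we can use the following prior model:
$$\pi(\theta) = \pi_0\, g_0(\theta)\, \mathbf{1}(\theta \in \Theta_0) + \pi_1\, g_1(\theta)\, \mathbf{1}(\theta \in \Theta_1), \qquad \pi_0 + \pi_1 = 1.$$
If $g_0$ and $g_1$ are proper density functions, the posterior probability of $H_0$ is given by:
$$P(H_0 \mid x) = \frac{\pi_0 \int_{\Theta_0} f(x \mid \theta)\, g_0(\theta)\, d\theta}{\pi_0 \int_{\Theta_0} f(x \mid \theta)\, g_0(\theta)\, d\theta + \pi_1 \int_{\Theta_1} f(x \mid \theta)\, g_1(\theta)\, d\theta}.$$
Now suppose that we use improper priors, taking $g_0(\theta) = c_0$ and $g_1(\theta) = c_1$. Then:
$$P(H_0 \mid x) = \frac{\pi_0 c_0 \int_{\Theta_0} f(x \mid \theta)\, d\theta}{\pi_0 c_0 \int_{\Theta_0} f(x \mid \theta)\, d\theta + \pi_1 c_1 \int_{\Theta_1} f(x \mid \theta)\, d\theta}.$$
For model $i$, let $\ell_i = \int_{\Theta_i} f(x \mid \theta)\, d\theta$ be the marginal, or integrated, likelihood. Assume that $\pi_0 = \pi_1 = 1/2$ and write $z = c_1 / c_0$. Then we obtain:
$$P(H_0 \mid x) = \frac{c_0 \ell_0}{c_0 \ell_0 + c_1 \ell_1} = \frac{\ell_0}{\ell_0 + z\, \ell_1}.$$
Since the constants $c_0$ and $c_1$ are arbitrary, we can change the posterior arbitrarily by choosing a different $z$. Proper but vague priors cause a similar problem, because the probability of the data under a complex model with a diffuse prior is very small. As a consequence, the Bayes factor tends to favor the simpler and more sharply specified model. This is called the Lindley paradox.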
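The arbitrariness can be seen in a few lines of Python. This is a sketch under stated assumptions: equal prior weights on the two hypotheses, flat improper priors with constants $c_0$ and $c_1$, so that the posterior of the null depends on the data only through the integrated likelihoods $\ell_0$, $\ell_1$ and on the ratio $z = c_1 / c_0$. The numerical values of $\ell_0$ and $\ell_1$ are hypothetical.

```python
def posterior_null(l0, l1, z):
    """P(H_0 | x) = l0 / (l0 + z * l1), with z = c_1 / c_0 and equal prior weights."""
    return l0 / (l0 + z * l1)

# Hypothetical integrated likelihoods; the data are held fixed throughout.
l0, l1 = 0.2, 0.8

# Only the arbitrary constant ratio z changes between the three evaluations.
results = {z: posterior_null(l0, l1, z) for z in (0.01, 1.0, 100.0)}
for z, p in results.items():
    print(f"z = {z:>6}: P(H0 | x) = {p:.4f}")
```

The same data can therefore be made to support or contradict the null almost at will, which is why improper priors are not usable for this comparison.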
3.2. A Simple Model in Lindley Paradox
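Many authors [14] have discussed this so-called paradox [15] in different ways [16]. Here we look for a simple way to consider the problem. The usual point null hypothesis testing problem is to test:
$$H_0: \theta = \theta_0 \quad \text{versus} \quad H_1: \theta \neq \theta_0$$
in the normal model $X \sim N(\theta, \sigma^2)$. The prior probability of the null is $\pi_0 = P(H_0)$.
Let $g(\theta)$ be the prior distribution for the unknown parameter $\theta$ in the model under $H_1$.
The Bayes factor is given by:
$$B_{01} = \frac{f(x \mid \theta_0)}{\int f(x \mid \theta)\, g(\theta)\, d\theta}.$$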
In order to consider the paradox, we can formalise it and compare the two following normal models:
Consider a physical system where quantity X may be measured and assume. And we need to use the
to define both the priors. The prior of the null hypothesis is
supposing the
can depend on
.
Computing the Bayes factor representing the odds of the null hypothesis
is:
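In this case, prior probabilities $P(H_0)$ and $P(H_1)$ can be assigned to the two hypotheses. Given the result $x$, Bayes' theorem gives:
$$P(H_k \mid x) = \frac{P(x \mid H_k)\, P(H_k)}{P(x \mid H_0)\, P(H_0) + P(x \mid H_1)\, P(H_1)}$$
for $k = 0, 1$, where $P(H_k)$ is the prior probability, $P(x \mid H_k)$ is the conditional distribution of the data, and the denominator is the overall (marginal) distribution of $x$. $P(H_k \mid x)$ is the posterior probability of the hypothesis $H_k$. In particular, the posterior probability of the null, $P(H_0 \mid x)$, is given by:
$$P(H_0 \mid x) = \frac{P(x \mid H_0)\, P(H_0)}{P(x \mid H_0)\, P(H_0) + P(x \mid H_1)\, P(H_1)}.$$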
Then, we can use the mean value in prior distribution with
and make the rest of the prior probability as a normal distribution with variance
, so:
Evaluating the conditional probabilities:
We can evaluate
and
, overall:
With an expression of the same form as before, we can now discuss the model prior. Our approach is to measure the worth of the alternative hypothesis relative to the null hypothesis at zero. Asymptotically, if the model is misspecified, the posterior accumulates at the model that is closest to the true model in the Kullback-Leibler divergence [17]. This divergence therefore represents the loss incurred by choosing the wrong model. Because the prior is known beforehand, the expected loss can be given:
The model prior represents the loss associated with a probability statement, as determined by the self-information loss function. So the prior on the alternative model is:
The prior of the null hypothesis is
, then we can get:
Then, this applies to the category of large
and
goes to zero, so
. Therefore, this method is consistent, we do not advocate the choice of big
.
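The prior sensitivity discussed in this section can be illustrated numerically. The sketch below assumes a setup that is simpler than the one above: a sample mean $\bar{x} \sim N(\theta, \sigma^2 / n)$, the point null $H_0: \theta = 0$, and a $N(0, \tau^2)$ prior on $\theta$ under $H_1$, so that the marginal of $\bar{x}$ under $H_1$ is $N(0, \tau^2 + \sigma^2 / n)$. The names `bf01` and `tau2` are hypothetical.

```python
import math

def normal_pdf(x, var):
    """Density of N(0, var) evaluated at x."""
    return math.exp(-0.5 * x * x / var) / math.sqrt(2.0 * math.pi * var)

def bf01(xbar, n, sigma2=1.0, tau2=1.0):
    """Bayes factor for H_0: theta = 0 against H_1: theta ~ N(0, tau2)."""
    s2 = sigma2 / n
    # Under H_0 the sample mean is N(0, s2); under H_1 it is N(0, s2 + tau2).
    return normal_pdf(xbar, s2) / normal_pdf(xbar, s2 + tau2)

n = 100
xbar = 1.96 / math.sqrt(n)  # borderline significant at the two-sided 5% level

# Widen the prior under H_1 while keeping the data fixed.
factors = [bf01(xbar, n, tau2=t) for t in (1.0, 100.0, 10000.0)]
print(factors)
```

A frequentist test at the 5% level sits exactly on the rejection boundary here, while the Bayes factor favors the null more and more strongly as the prior variance grows.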
4. BIC and AIC
4.1. BIC
4.1.1. Notation (Table 1)
4.1.2. Derivation of BIC
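In this section we present the basic idea [18] behind the Bayesian information criterion (BIC) and give its derivation [4].
As shown in Section 2, $B_{12}$ is the Bayes factor for two models. We now consider a collection of models $M_i$ with marginal likelihood
$$P(y \mid M_i) = \int L(\theta_i \mid y)\, g_i(\theta_i)\, d\theta_i,$$
where $\theta_i$ is the vector of parameters in the model $M_i$, $L$ is the likelihood function and $g_i(\theta_i)$ is the p.d.f. of the prior distribution of the parameters $\theta_i$.
Let $\tilde{\theta}_i$ denote the posterior mode and let $Q(\theta_i) = \log\big(L(\theta_i \mid y)\, g_i(\theta_i)\big)$. A second-order Taylor expansion about $\tilde{\theta}_i$ gives
$$Q(\theta_i) \approx Q(\tilde{\theta}_i) + (\theta_i - \tilde{\theta}_i)^T \nabla Q(\tilde{\theta}_i) + \frac{1}{2} (\theta_i - \tilde{\theta}_i)^T H_{\tilde{\theta}_i} (\theta_i - \tilde{\theta}_i),$$
where $H_{\tilde{\theta}_i}$ is a $|\theta_i| \times |\theta_i|$ matrix such that $H_{mn} = \left. \frac{\partial^2 Q}{\partial \theta_m\, \partial \theta_n} \right|_{\theta_i = \tilde{\theta}_i}$, where $|\theta_i|$ is the number of parameters in $M_i$. Since $Q$ attains its maximum at $\tilde{\theta}_i$, the gradient term vanishes and the Hessian matrix $H_{\tilde{\theta}_i}$ is negative definite. Let us denote $\tilde{H}_{\tilde{\theta}_i} = -H_{\tilde{\theta}_i}$, and then approximate $P(y \mid M_i)$:
$$P(y \mid M_i) \approx L(\tilde{\theta}_i \mid y)\, g_i(\tilde{\theta}_i) \int \exp\left( -\frac{1}{2} (\theta_i - \tilde{\theta}_i)^T \tilde{H}_{\tilde{\theta}_i} (\theta_i - \tilde{\theta}_i) \right) d\theta_i.$$
Then, by the normalizing constant of the multivariate normal distribution,
$$P(y \mid M_i) \approx L(\tilde{\theta}_i \mid y)\, g_i(\tilde{\theta}_i)\, \frac{(2\pi)^{|\theta_i| / 2}}{|\tilde{H}_{\tilde{\theta}_i}|^{1/2}}.$$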
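The quality of this Laplace-type approximation can be checked on a one-parameter case where the integral is known in closed form. The sketch below is an illustration under assumptions not in the paper: a binomial-shaped likelihood $\theta^a (1 - \theta)^b$ with a flat prior on $(0, 1)$, whose exact integral is the Beta function $B(a + 1, b + 1)$. All function names are hypothetical.

```python
import math

def laplace_log_integral(a, b):
    """Laplace approximation to the log of the integral of theta^a (1 - theta)^b on (0, 1)."""
    mode = a / (a + b)  # maximizer of Q(theta) = a log(theta) + b log(1 - theta)
    q_mode = a * math.log(mode) + b * math.log(1.0 - mode)
    # Negative second derivative of Q at the mode (a 1 x 1 "Hessian").
    neg_hess = a / mode ** 2 + b / (1.0 - mode) ** 2
    # log[ exp(Q(mode)) * (2 pi)^(1/2) / neg_hess^(1/2) ]
    return q_mode + 0.5 * math.log(2.0 * math.pi) - 0.5 * math.log(neg_hess)

def exact_log_integral(a, b):
    """Exact value: log Beta(a + 1, b + 1)."""
    return math.lgamma(a + 1) + math.lgamma(b + 1) - math.lgamma(a + b + 2)

a, b = 30, 20  # hypothetical counts of successes and failures
approx = laplace_log_integral(a, b)
exact = exact_log_integral(a, b)
rel_err = abs(math.exp(approx - exact) - 1.0)
print(approx, exact, rel_err)
```

The relative error shrinks as the counts grow, which is the large-sample regime in which the BIC argument operates.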
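Next we use the weak law of large numbers. For given data $y = (y_1, \dots, y_n)$, $L(\theta_i \mid y) = \prod_{j=1}^{n} f(y_j \mid \theta_i)$ is the likelihood, and $L$ attains its maximum at the maximum likelihood estimate $\hat{\theta}_i$.
For large $n$ the likelihood dominates the prior, so we set $\tilde{\theta}_i \approx \hat{\theta}_i$. Then each element of the matrix $\tilde{H}$ (the negative Hessian of $Q = \log(L\, g_i)$ at the posterior mode), $\tilde{H}_{mn}$, can be expressed as:
$$\tilde{H}_{mn} = -\left. \frac{\partial^2 \log L(\theta_i \mid y)}{\partial \theta_m\, \partial \theta_n} \right|_{\theta_i = \hat{\theta}_i} = -\sum_{j=1}^{n} \left. \frac{\partial^2 \log f(y_j \mid \theta_i)}{\partial \theta_m\, \partial \theta_n} \right|_{\theta_i = \hat{\theta}_i}.$$
Then, for the Fisher information matrix $I$, we have
$$I_{mn} = -E\left[ \left. \frac{\partial^2 \log f(y_1 \mid \theta_i)}{\partial \theta_m\, \partial \theta_n} \right|_{\theta_i = \hat{\theta}_i} \right].$$
When the data $y_1, \dots, y_n$ are IID and $n$ is large, we can apply the weak law of large numbers to the random variables $X_j = -\partial^2 \log f(y_j \mid \theta_i) / \partial \theta_m\, \partial \theta_n$: we have $\frac{1}{n} \sum_{j=1}^{n} X_j \to E[X_1]$ in probability. Hence, in terms of the Fisher information matrix:
$$\tilde{H}_{mn} \approx n\, I_{mn}, \qquad |\tilde{H}| \approx n^{|\theta_i|}\, |I|,$$
where $I$ is the Fisher information matrix for a single data point $y_1$. Substituting into the approximation of $P(y \mid M_i)$, we get
$$\log P(y \mid M_i) \approx \log L(\hat{\theta}_i \mid y) + \log g_i(\hat{\theta}_i) + \frac{|\theta_i|}{2} \log 2\pi - \frac{|\theta_i|}{2} \log n - \frac{1}{2} \log |I|.$$
Keeping only the terms that grow with $n$ and multiplying by $-2$, we finally get the BIC:
$$\mathrm{BIC} = -2 \log L(\hat{\theta}_i \mid y) + |\theta_i| \log n.$$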
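A minimal sketch of the resulting criterion, assuming the common convention $\mathrm{BIC} = -2 \log L(\hat{\theta} \mid y) + |\theta| \log n$ (smaller is better) and a unit-variance normal model. The data values and function names are hypothetical.

```python
import math

def gaussian_loglik(xs, mu):
    """Log-likelihood of IID N(mu, 1) data."""
    n = len(xs)
    return -0.5 * n * math.log(2 * math.pi) - 0.5 * sum((x - mu) ** 2 for x in xs)

def bic(loglik, k, n):
    """BIC with k free parameters and n observations (smaller is better)."""
    return -2.0 * loglik + k * math.log(n)

xs = [0.9, 1.4, 0.3, 1.1, 0.8, 1.6, 0.5, 1.2]  # hypothetical data
n = len(xs)
xbar = sum(xs) / n  # maximum likelihood estimate of the mean

bic_null = bic(gaussian_loglik(xs, 0.0), 0, n)   # mean fixed at zero
bic_free = bic(gaussian_loglik(xs, xbar), 1, n)  # mean estimated from the data
print(bic_null, bic_free)
```

For this pair of models the difference `bic_null - bic_free` reduces to $n \bar{x}^2 - \log n$, so the extra parameter is accepted only when the improvement in fit exceeds the $\log n$ penalty.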
4.2. AIC
4.2.1. Notation (Table 2)
4.2.2. Derivation of AIC
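We can measure the quality of $\hat{p}_j(y) = p(y; \hat{\theta}_j)$ (as an estimate of the true density $p$) by the Kullback-Leibler distance [19]:
$$K(p, \hat{p}_j) = \int p(y) \log \frac{p(y)}{\hat{p}_j(y)}\, dy = \int p(y) \log p(y)\, dy - \int p(y) \log \hat{p}_j(y)\, dy.$$
So, we want to minimize $K(p, \hat{p}_j)$ over $j$, which is the same as maximizing
$$K_j = \int p(y) \log p(y; \hat{\theta}_j)\, dy.$$
To calculate $K_j$, we can use the Monte Carlo method to form the estimate
$$\bar{K}_j = \frac{1}{n} \sum_{i=1}^{n} \log p(Y_i; \hat{\theta}_j) = \frac{\ell_j(\hat{\theta}_j)}{n},$$
where $\ell_j$ is the log-likelihood of model $j$. However, this estimate is strongly biased because the data are used twice: first to get the MLE and second to estimate the integral by the Monte Carlo method. The bias is approximately $d_j / n$, where $d_j$ is the dimension of $\theta_j$. That means we should prove [20]
$$E[\bar{K}_j] - E[K_j] \approx \frac{d_j}{n}.$$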
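For simplicity, drop the subscript $j$. Choose $\theta_0$ such that $\theta_0 = \arg\min_{\theta} K(p, p(\cdot\,; \theta))$, and let
$$s(y, \theta) = \frac{\partial \log p(y; \theta)}{\partial \theta}, \qquad H(y, \theta) = \frac{\partial^2 \log p(y; \theta)}{\partial \theta\, \partial \theta^T}.$$
So, $s$ is the Jacobian matrix (score) of $\log p$, and $H$ is the Hessian matrix of $\log p$. Expanding $K$ and $\bar{K}$ to second order about $\theta_0$ gives
$$K \approx K_0 - \frac{1}{2n} Z_n^T J Z_n, \qquad \bar{K} \approx K_0 + A_n + \frac{1}{\sqrt{n}} Z_n^T S_n - \frac{1}{2n} Z_n^T J Z_n,$$
where $K_0 = \int p(y) \log p(y; \theta_0)\, dy$, $Z_n = \sqrt{n}\,(\hat{\theta} - \theta_0)$ and $J = -E[H(Y, \theta_0)]$, and where
$$A_n = \frac{1}{n} \sum_{i=1}^{n} \big( \log p(Y_i; \theta_0) - K_0 \big)$$
and
$$S_n = \frac{1}{n} \sum_{i=1}^{n} s(Y_i, \theta_0).$$
From the knowledge of asymptotic distributions, we have three claims [21]:
Claim 4.1 $Z_n \to N(0, J^{-1} V J^{-1})$ in distribution, where $V = \mathrm{Var}\big(s(Y, \theta_0)\big)$.
Claim 4.2 $\sqrt{n}\, S_n \approx J Z_n$.
Claim 4.3 Let $\varepsilon$ be a random vector with mean $\mu$ and covariance $\Sigma$, and $Q = \varepsilon^T A\, \varepsilon$ for a fixed matrix $A$; then $E[Q] = \mathrm{trace}(A \Sigma) + \mu^T A \mu$.
So, with these claims above, and since $E[A_n] = 0$,
$$E[\bar{K} - K] \approx E[A_n] + \frac{1}{n} E[Z_n^T J Z_n] \approx \frac{1}{n} \mathrm{trace}(J J^{-1} V J^{-1}) = \frac{\mathrm{trace}(J^{-1} V)}{n}.$$
If the model is correctly specified, then $J = V$ and $\mathrm{trace}(J^{-1} V) = d$, so the bias is approximately $d / n$. So, we define
$$\mathrm{AIC}(j) = \ell_j(\hat{\theta}_j) - d_j,$$
which equals $n$ times the bias-corrected estimate $\bar{K}_j - d_j / n$.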
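The size of this bias can be checked by simulation. The sketch below assumes the simplest possible setting, which is not taken from the paper: IID $N(0, 1)$ data fitted with a $N(\mu, 1)$ model with one free parameter, so the in-sample average log-likelihood should exceed the true expected log-likelihood by about $1 / n$ on average. The function name `optimism` and the simulation sizes are hypothetical.

```python
import math
import random

def optimism(n, reps, seed=0):
    """Average of (in-sample mean log-likelihood) - (true expected log-likelihood)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        ys = [rng.gauss(0.0, 1.0) for _ in range(n)]
        mu_hat = sum(ys) / n  # MLE of the mean
        s2 = sum((y - mu_hat) ** 2 for y in ys) / n
        # In-sample estimate: (1/n) * sum of log p(Y_i; mu_hat).
        k_bar = -0.5 * math.log(2 * math.pi) - 0.5 * s2
        # Exact expectation over a fresh Y ~ N(0, 1) of log p(Y; mu_hat).
        k_true = -0.5 * math.log(2 * math.pi) - 0.5 * (1.0 + mu_hat ** 2)
        total += k_bar - k_true
    return total / reps

n = 20
bias = optimism(n, reps=20000)
print(bias, 1 / n)
```

With one free parameter the simulated bias should land close to $1 / n$, which is the correction that the $d_j$ term in AIC removes after multiplying through by $n$.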
4.3. Example of Simple Model
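Let us consider again the example of Section 3. Take data $x_1, \dots, x_n \sim N(\theta, 1)$ with known unit variance, and compare two models, $M_0: \theta = 0$ and $M_1: \theta \in \mathbb{R}$. As in Section 3.2, we test:
$$H_0: \theta = 0 \quad \text{versus} \quad H_1: \theta \neq 0.$$
Under $H_0$, by the standard normal distribution we have
$$Z = \sqrt{n}\, \bar{x} \sim N(0, 1).$$
To control the Type I error of the test at level $\alpha$, using the Z table we reject $H_0$ if $|Z| > z_{\alpha/2}$ (we take $\alpha = 0.05$, so $z_{\alpha/2} = 1.96$). This implies that if $|\bar{x}| > 1.96 / \sqrt{n}$, we reject $H_0$.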
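The test can be written out as a short sketch, assuming unit-variance normal data and a two-sided test of a zero mean. The function name `reject_null` and the sample values are hypothetical; `statistics.NormalDist` from the Python standard library supplies the normal quantile in place of a Z table.

```python
import math
from statistics import NormalDist

def reject_null(xs, alpha=0.05):
    """Two-sided z-test of H_0: theta = 0 for IID N(theta, 1) data."""
    n = len(xs)
    xbar = sum(xs) / n
    # Critical value z_{alpha/2} of the standard normal distribution.
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    threshold = z_crit / math.sqrt(n)
    return abs(xbar) > threshold, xbar, threshold

xs = [0.9, 1.4, 0.3, 1.1, 0.8, 1.6, 0.5, 1.2]  # hypothetical data
rejected, xbar, threshold = reject_null(xs)
print(rejected, xbar, threshold)
```

The rejection rule depends on the data only through $|\bar{x}|$, which makes it directly comparable with the thresholds implied by the information criteria.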
Case 1: BIC
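In Section 4.1 we showed that $\mathrm{BIC} = -2 \log L(\hat{\theta} \mid y) + |\theta| \log n$. To compare two models, however, we can drop the parts they share and take $\mathrm{BIC}_j = \ell_j(\hat{\theta}_j) - \frac{d_j}{2} \log n$, where $\ell_j$ is the log-likelihood of model $M_j$ up to an additive constant and $d_j$ is its number of free parameters; under this convention the larger value is preferred. Thus,
$$\ell(\theta) = -\frac{1}{2} \sum_{i=1}^{n} (x_i - \theta)^2 = -\frac{n (\bar{x} - \theta)^2}{2} - \frac{n S^2}{2}.$$
For $M_0$, $d_0 = 0$ and
$$\mathrm{BIC}_0 = \ell(0) = -\frac{n \bar{x}^2}{2} - \frac{n S^2}{2},$$
and for $M_1$, $d_1 = 1$ and
$$\mathrm{BIC}_1 = \ell(\bar{x}) - \frac{1}{2} \log n = -\frac{n S^2}{2} - \frac{1}{2} \log n,$$
where $S^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2$. If we want to choose $M_1$ as the better model, then we need $\mathrm{BIC}_1 > \mathrm{BIC}_0$, in other words $|\bar{x}| > \sqrt{\log n / n}$. BIC is an estimate of a function of the posterior probability of a model under the Bayesian setup.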
Case 2: AIC
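From Section 4.2, $\mathrm{AIC}_j = \ell_j(\hat{\theta}_j) - d_j$, where, as defined above, $\ell$ is the log-likelihood, which up to an additive constant is $\ell(\theta) = -\frac{1}{2} \sum_{i=1}^{n} (x_i - \theta)^2$. We further deduce
$$\ell(\theta) = -\frac{n (\bar{x} - \theta)^2}{2} - \frac{n S^2}{2}, \qquad S^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2.$$
Thus, for $M_0$, $d_0 = 0$ and
$$\mathrm{AIC}_0 = \ell(0) = -\frac{n \bar{x}^2}{2} - \frac{n S^2}{2},$$
and for $M_1$, $d_1 = 1$ and
$$\mathrm{AIC}_1 = \ell(\bar{x}) - 1 = -\frac{n S^2}{2} - 1.$$
If we want to choose $M_1$ as the better model at this point, we need $\mathrm{AIC}_1 > \mathrm{AIC}_0$, which implies $|\bar{x}| > \sqrt{2 / n}$. AIC estimates a constant plus the relative Kullback-Leibler distance between the unknown true likelihood function and the fitted likelihood function.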
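The three selection rules in this example can be placed side by side. The sketch below assumes unit-variance normal data and the thresholds on $|\bar{x}|$ that follow from that assumption: $z_{\alpha/2} / \sqrt{n}$ for the test, $\sqrt{2 / n}$ for AIC and $\sqrt{\log n / n}$ for BIC. The function name `thresholds` and the sample sizes are hypothetical.

```python
import math
from statistics import NormalDist

def thresholds(n, alpha=0.05):
    """Values of |xbar| above which each rule prefers the model with a free mean."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return {
        "test": z_crit / math.sqrt(n),     # two-sided z-test at level alpha
        "aic": math.sqrt(2.0 / n),         # AIC_1 > AIC_0
        "bic": math.sqrt(math.log(n) / n), # BIC_1 > BIC_0
    }

table = {n: thresholds(n) for n in (10, 100, 10000)}
for n, t in table.items():
    print(n, {name: round(value, 4) for name, value in t.items()})
```

At the 5% level AIC is always more permissive than the test because $\sqrt{2} < 1.96$, whereas the BIC threshold, measured in units of $1 / \sqrt{n}$, keeps growing with $n$ and eventually becomes the most conservative of the three.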
5. Conclusion
The question of how to choose the best model, and of what the best model even is, is hard to settle. The controversy has existed for a long time and will no doubt continue. In this paper, we have discussed the Bayes factor in hypothesis testing. The Bayes factor is increasingly used in many fields of statistical research. For model selection, we would consider using the Bayes factor together with the two standard methods, AIC and BIC. However, every method has its own limitations, such as the sensitivity to priors exposed by Lindley's paradox. Although both frequentist and Bayesian statisticians have come up with new ideas, these remain hard for others to implement or understand. Moreover, from a statistical point of view, a method also needs to be general enough to apply widely. For Lindley's paradox, for example, the partial Bayes factor avoids the sensitivity to priors by using a minimal training sample from the data set to obtain a prior and then applying it to the rest of the data. The partial Bayes factor does reduce the influence of prior sensitivity, but finding the minimal training sample can itself be a hard problem. The same holds for the fractional Bayes factor: even though it improves on the way the partial Bayes factor chooses the training data, it still has many limitations that need to be considered.