The general functional form of composite likelihoods is derived by minimizing the Kullback-Leibler distance under structural constraints associated with low dimensional densities. Connections with the *I*-projection and the *maximum entropy distributions* are shown. Asymptotic properties of composite likelihood inference under the proposed information-theoretical framework are established.

The composite likelihood has been increasingly used when the full likelihood is computationally intractable or difficult to specify due to either high dimensionality or complex dependence structures. Consider a random vector X with probability density

where

As discussed in [

If the inferential interest is also on parameters prescribing a dependence structure, a pairwise composite likelihood [

Conditional composite likelihood [

There are other important variations and applications of composite likelihoods designed for various inferential purposes, such as the composite likelihood BIC for model selection in high-dimensional data in [

Since there are various composite likelihoods with different functional forms, it might be desirable to consider a unifying theme based on information-theoretic justifications. Under an information-theoretic framework, composite likelihoods can then be viewed as a class of inferential functions based on an optimal probability density under structural constraints imposed on low dimensional densities when the complete joint density is either unknown or intractable. We show that the optimal densities associated with the composite likelihood are also connected with the I-projection density, well known in probability theory, and the maximum entropy distributions in information theory. Although likelihood weights are employed in the original formulation of composite likelihood in [

This paper is organized as follows. In Section 2, we derive the composite likelihood as the optimal inferential device by minimizing the relative entropy, or Kullback-Leibler distance, under structural constraints. Asymptotic properties are established in Section 3. A discussion is given in Section 4.
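For orientation, a weighted composite likelihood built from component events $A_1, \dots, A_K$ with nonnegative weights $w_1, \dots, w_K$ can be written, in commonly used notation (an illustrative sketch; the symbols here are not taken from the displays of this paper), as

```latex
% Weighted composite likelihood and its logarithm (illustrative notation):
% each component likelihood f(x \in A_k; \theta) is a marginal or
% conditional density, and w_k \ge 0 are likelihood weights.
\[
  L_C(\theta; x) \;=\; \prod_{k=1}^{K} f(x \in A_k;\, \theta)^{\,w_k},
  \qquad
  c\ell(\theta; x) \;=\; \sum_{k=1}^{K} w_k \log f(x \in A_k;\, \theta).
\]
```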

Suppose that

The relative entropy is widely used in information theory and is also known as the I-divergence in probability theory. In [

For any probability density function (pdf)

where g is a probability density function.
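As a concrete numerical illustration (not part of the original text), the relative entropy $D(f\,\|\,g) = \sum f \log(f/g)$ between two discrete densities can be computed directly; the two probability vectors below are hypothetical.

```python
import math

def relative_entropy(f, g):
    """Discrete relative entropy (Kullback-Leibler distance) D(f || g) in nats.

    f and g are probability vectors over the same support;
    D(f || g) >= 0, with equality if and only if f == g.
    """
    return sum(p * math.log(p / q) for p, q in zip(f, g) if p > 0)

f = [0.5, 0.3, 0.2]   # hypothetical true pdf f
g = [0.4, 0.4, 0.2]   # hypothetical candidate density g

print(relative_entropy(f, g))   # strictly positive, since f != g
print(relative_entropy(f, f))   # 0.0: D(f || f) = 0
```

Note that $D(f\,\|\,g)$ is not symmetric in its arguments, which is why it is only a pseudo-metric on the space of densities.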

In statistical inference, the pdf

When significant characteristics associated with the low dimensional projections of the joint probability density function are known, it is desirable to incorporate this information formally into the statistical inference. To improve the chosen model, one might utilize constraints associated with known features under an information-theoretic framework, to be described in the following. As in [

where d is a constant vector and

If

is defined as the I-projection of

The following theorem follows immediately from the above theorem in [

Theorem 1. Given pdf’s

where, for

Then the optimal probability density function (the I-projection of

where

Similar to the I-projection, the maximum entropy distribution is also an optimal density under constraints. A well-known example is the Maxwell-Boltzmann distribution, the maximum entropy distribution under an energy constraint. Consider the following maximization problem:

in which

By applying the maximum entropy theorem in [

Theorem 2. Let

where

It is clear that the I-projection and the maximum entropy distribution can belong to the same functional class when a set of pdf's is used to formulate the constraints.
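The exponential (Gibbs) form of the maximum entropy distribution under a linear moment constraint can be illustrated numerically. The finite support and the target mean below are hypothetical (the classical tilted-die example); the multiplier is found by bisection, since the tilted mean is increasing in the multiplier.

```python
import math

def maxent_gibbs(support, lam):
    """Gibbs-form density p_i proportional to exp(lam * x_i): the maximum
    entropy distribution on `support` subject to a fixed mean."""
    w = [math.exp(lam * x) for x in support]
    z = sum(w)                      # normalizing constant
    return [wi / z for wi in w]

def solve_lambda(support, target_mean, lo=-50.0, hi=50.0):
    """Bisection for the Lagrange multiplier matching E[X] = target_mean.
    Relies on the tilted mean being monotone increasing in lam."""
    def tilted_mean(lam):
        p = maxent_gibbs(support, lam)
        return sum(pi * xi for pi, xi in zip(p, support))
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if tilted_mean(mid) < target_mean:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

support = [1, 2, 3, 4, 5, 6]        # hypothetical support (faces of a die)
lam = solve_lambda(support, 4.5)    # constrain the mean to 4.5
p = maxent_gibbs(support, lam)
print(p)  # probabilities tilted toward larger faces, as the mean constraint forces
```

Setting the target mean to the unconstrained mean (here 3.5) recovers the uniform distribution, the maximum entropy distribution with no active constraint.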

If we consider the functional space of all probability density functions satisfying certain conditions and adopt the relative entropy as a pseudo-metric, then a more natural point of view is to seek an optimal density minimizing the relative entropy under constraints characterized by the pseudo-distance between the optimal density and a collection of candidate models,

In the context of composite likelihoods, the statistical model

structure, i.e.,

The composite likelihood framework, however, is capable of going beyond this often over-simplified model.

To ensure that the optimal density reflects known key characteristics of the low dimensional densities of the true pdf, one can apply the idea of the I-projection or the maximum entropy distribution by considering the following minimization problem:

where

We now present our main theorem of this section.

Theorem 3. Given probability density functions

where, for

Then the optimal probability density function satisfying

takes the form

where

The assertion of this theorem implies that the constraints in the original I-projection can be further generalized such that they are also functionals of the probability density we seek. It can also be seen that

The optimal pdf under the current constraints belongs to the following functional class:

where

We now consider four special cases:

1) (INDEPENDENT CASE) For example, if we assume that

marginals only, and do not bring in more structural information than

if all the weights are equal to 1.

2) (CORRELATION CASE) If the constraints are defined by

The optimal density is then constructed from the marginals and all the pairwise bivariate densities. A simplified form is given by

if

3) (CONDITIONAL CASE) If the constraints are defined by

4) (SPATIAL AND TEMPORAL CASE) Weights might be most appropriate in spatial or temporal settings. Consider
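The first two special cases above can be illustrated numerically. The sketch below uses a hypothetical standard bivariate Gaussian model and a small made-up sample: the independence composite log-likelihood (Case 1) uses the marginals only, while the pairwise composite log-likelihood (Case 2) uses the bivariate density and therefore carries information about the correlation parameter.

```python
import math

def norm_logpdf(x, mu=0.0, sigma=1.0):
    """Univariate normal log-density."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma * math.sqrt(2.0 * math.pi))

def bvn_logpdf(x, y, rho):
    """Standard bivariate normal log-density with correlation rho."""
    q = (x * x - 2.0 * rho * x * y + y * y) / (1.0 - rho * rho)
    return -0.5 * q - math.log(2.0 * math.pi * math.sqrt(1.0 - rho * rho))

def independence_cl(data, w=1.0):
    """Case 1: composite log-likelihood from the two marginals only;
    it contains no information about the dependence structure."""
    return w * sum(norm_logpdf(x) + norm_logpdf(y) for x, y in data)

def pairwise_cl(data, rho, w=1.0):
    """Case 2: pairwise composite log-likelihood using the bivariate
    density, so the correlation rho becomes estimable."""
    return w * sum(bvn_logpdf(x, y, rho) for x, y in data)

# Hypothetical positively correlated sample.
data = [(0.2, 0.1), (1.0, 0.8), (-0.5, -0.7), (0.3, 0.5)]
print(independence_cl(data))
# A positively correlated sample favors rho > 0 under the pairwise CL:
print(pairwise_cl(data, 0.8) > pairwise_cl(data, 0.0))
```

Note that at $\rho = 0$ the pairwise composite log-likelihood reduces exactly to the independence composite log-likelihood, consistent with Case 1 being a special case of Case 2.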

In this section, we establish the asymptotic properties associated with the composite likelihood inference under the proposed information-theoretic framework. The consistency of the estimators is proved by following the argument in [

For clarity of presentation, we first define the following notation:

• Denote the true density function by

• Denote

with

and

• Let

• Let

Define

We make the following assumptions.

Assumption 1.

Assumption 2. For

Assumption 3. If

Assumption 4. If

Assumption 5.

Assumption 6.

Assumption 7.

We first give four lemmas before presenting the theorems regarding the limiting behavior of the weighted composite likelihood estimators.

Lemma 1. The following hold true:

(L1) Under Assumption 1,

(L2) Under Assumption 2,

(L3) Assume that Assumption 3 holds. If

(L4) Assume that Assumptions 4 and 7 hold. If

Lemma 2. Assume that Assumptions 1, 2, and 6 hold. For any

Lemma 3. Assume that Assumptions 1 - 3 hold. Then

Lemma 4. Assume that Assumptions 1, 2, 4, and 7 hold. Then

The four theorems describing the limiting behavior of the weighted composite likelihood estimators are given below.

Theorem 4. Assume that Assumptions 1 - 6 hold. Let

Theorem 5. Assume that Assumptions 1 - 7 hold. Let

for any n and for all observations. Then

Theorem 6. Assume that Assumptions 1 - 7 hold. Then

Remark 1. Note that in the proof of Theorem 4, the strong law of large numbers is used. If we prove it using the method given in [

Remark 2. For simplicity of presentation, we have assumed that

In the following we assume that λ is a constant vector. For ease of presentation, define

For convenience, denote

and

for a twice differentiable function

Assumption 8. For each

where

Assumption 9.

Assumption 10. There exists a positive number

for all

Define

and

We have the following theorem.

Theorem 7. Assume that Assumptions 1 - 10 hold. Then

Remark 3. In light of [

Remark 4. Let

By modifying the proof of Theorem 7,

The proposed information-theoretic framework provides theoretical justification for the use of composite likelihood. It also serves as a unifying theme for various seemingly different composite likelihoods and connects them with the I-projection and the maximum entropy distribution. Significant characteristics of low dimensional models are incorporated into the constraints associated with the component likelihoods. The asymptotic properties established in this article could be useful for further theoretical analysis of composite likelihoods. The findings presented here may lead to more in-depth investigations of the theoretical properties of composite likelihoods and to further connections with information theory.

Proof of Theorem 1: Let

This completes the proof.

Proof of Theorem 3: By the Lagrange method, we seek to minimize the following objective function

where

The objective function can then be rearranged so that

where

Since

where the derivative is taken with respect to g.

Thus, we have

It then follows that the optimal density function takes the form

where

Proof of Lemma 2: In view of the definition of

Proof of Lemma 3: By Lemma 1, Lemma 3 can be proved by following the proof of Lemma 2 of Wald (1949).

Proof of Lemma 4: By applying Lemma 1, Lemma 4 can be proved by following the proof of Lemma 3 of Wald (1949).

Proof of Theorem 4: By Lemmas 2 and 4, we can find a positive number

Let

Since

where

In light of (1.7)-(1.8), we have

and

Therefore,

which jointly with (1.9) implies (1.6).

Proof of Theorem 5: For any

Hence, for infinitely many n,

By Theorem 4, this event has zero probability. Thus all limit points

Proof of Theorem 7: By following the proof of Theorem 4.17 of Shao (2003), it can be shown that

Hence,

which, jointly with Slutsky’s theorem and the central limit theorem, concludes the proof of the theorem.