Theoretical Properties of Composite Likelihoods

The general functional form of composite likelihoods is derived by minimizing the Kullback-Leibler divergence under structural constraints associated with low-dimensional densities. Connections with the I-projection and with maximum entropy distributions are established, and asymptotic properties of composite likelihood inference are derived under the proposed information-theoretic framework.


Introduction
The composite likelihood has been increasingly used when the full likelihood is computationally intractable or difficult to specify due to either high dimensionality or complex dependence structures. Consider a random vector X = (X_1, …, X_p) with probability density f(x; θ). The composite likelihood proposed in [1] is defined by

CL(θ; x) = ∏_{k=1}^{K} f(x ∈ A_k; θ)^{λ_k},

where {A_k} is a collection of marginal or conditional events and the λ_k's are non-negative weights to be chosen. As discussed in [2], there are two general types of composite likelihood: marginal and conditional composite likelihood. The simplest composite likelihood is the one constructed under the independence assumption:

CL_ind(θ; x) = ∏_{i=1}^{p} f(x_i; θ).

If the inferential interest is also in parameters prescribing a dependence structure, a pairwise composite likelihood [2] [3] is defined as

CL_pair(θ; x) = ∏_{i=1}^{p−1} ∏_{j=i+1}^{p} f(x_i, x_j; θ).

A conditional composite likelihood [4] [5] can be constructed by multiplying all pairwise conditional densities:

CL_cond(θ; x) = ∏_{i=1}^{p} ∏_{j≠i} f(x_i | x_j; θ).

There are other important variations and applications of composite likelihoods designed for various inferential purposes, such as the composite likelihood BIC for model selection in high-dimensional data in [6]. A detailed discussion and review of composite likelihoods is provided in [2].
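As a concrete illustration of pairwise composite likelihood inference (this example is not from the original text; the equicorrelated normal model, sample size, and grid search are illustrative choices), the following Python sketch estimates a common correlation parameter by maximizing the pairwise composite log-likelihood:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate n draws from a p-variate standard normal with equicorrelation rho_true.
n, p, rho_true = 500, 5, 0.4
cov = np.full((p, p), rho_true) + (1 - rho_true) * np.eye(p)
X = rng.multivariate_normal(np.zeros(p), cov, size=n)

def pairwise_cl(rho, X):
    """Pairwise composite log-likelihood (equal weights) for the correlation
    parameter of a standardized equicorrelated normal."""
    ll = 0.0
    n, p = X.shape
    for i in range(p - 1):
        for j in range(i + 1, p):
            xi, xj = X[:, i], X[:, j]
            # Standardized bivariate normal log-density with correlation rho.
            q = (xi**2 - 2 * rho * xi * xj + xj**2) / (1 - rho**2)
            ll += np.sum(-np.log(2 * np.pi) - 0.5 * np.log(1 - rho**2) - 0.5 * q)
    return ll

# Maximize over a grid (a crude stand-in for a numerical optimizer).
grid = np.linspace(-0.9, 0.9, 361)
rho_hat = grid[np.argmax([pairwise_cl(r, X) for r in grid])]
```

Each pairwise term is a bivariate normal log-density, so the estimator pools information from all p(p−1)/2 pairs without ever forming the joint p-dimensional likelihood.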
Since there are various composite likelihoods with different functional forms, it is desirable to consider a unifying theme based on information-theoretic justifications. Under an information-theoretic framework, composite likelihoods can be viewed as a class of inferential functions based on the optimal probability density under structural constraints imposed on low-dimensional densities when the complete joint density is either unknown or intractable. We show that the optimal densities associated with the composite likelihood are also connected with the I-projection density, well known in probability theory, and with the maximum entropy distributions of information theory. Although likelihood weights are employed in the original formulation of the composite likelihood in [1], equal weights are often adopted for convenience. We show that adaptive likelihood weights can indeed improve the performance of composite likelihood inference with equal weights.
This paper is organized as follows. In Section 2, we derive the composite likelihood as the optimal inferential device by minimizing the relative entropy, or Kullback-Leibler divergence, under structural constraints. Asymptotic properties are established in Section 3. Discussions are given in Section 4.

I-Projection and Maximum Entropy Distribution
Suppose that g(x) and f(x) are generalized densities of a dominated set of probability measures on the measurable space (Ω, ℱ). The relative entropy is defined as

I(g, f) = ∫ g(x) log [ g(x) / f(x) ] dμ(x).

The relative entropy is widely used in information theory and is also known as the I-divergence in probability. In [7], Cover and Thomas provide an excellent account of its properties and applications in information theory and coding theory. As demonstrated in [8], the relative entropy can play an important role in statistical inference, and its geometric properties are studied in [9]. Although the relative entropy, or I-divergence, is not a metric and in general does not define a topology, Csiszár in [9] shows that certain analogies exist between properties of probability distributions and Euclidean geometry, in which the I-divergence plays the role of squared distance. It is a measure of discrepancy between the probability densities g and f.
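As a small numerical sketch (not part of the original text), the relative entropy for discrete densities can be computed directly, confirming the properties noted above: it is non-negative, vanishes when g = f, and is not symmetric. The densities g and f below are arbitrary illustrative choices.

```python
import numpy as np

def kl_divergence(g, f):
    """Relative entropy I(g, f) = sum_x g(x) log(g(x)/f(x)) for discrete
    densities, with the convention 0 * log(0/f) = 0."""
    g = np.asarray(g, dtype=float)
    f = np.asarray(f, dtype=float)
    mask = g > 0
    return float(np.sum(g[mask] * np.log(g[mask] / f[mask])))

g = np.array([0.5, 0.3, 0.2])
f = np.array([1/3, 1/3, 1/3])

print(kl_divergence(g, f))  # strictly positive since g differs from f
print(kl_divergence(g, g))  # 0.0
print(kl_divergence(f, g))  # differs from I(g, f): not symmetric, not a metric
```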
For any probability density function (pdf) f_0, Csiszár in [9] defines an I-sphere centered at f_0 with radius ρ as

S(f_0, ρ) = { g : I(g, f_0) ≤ ρ },

where g is a probability density function.
In statistical inference, the pdf f_0 is the model of choice when the true pdf is unknown. In high-dimensional or complex cases, it is highly unlikely that the assumed model f_0 is correct. When no other information on the dependence structure is available, the best model might be the one based on the independence assumption.
When significant characteristics are associated with the low-dimensional projections of the joint probability density function, it is desirable to incorporate this information formally into the statistical inference. To improve the chosen model, one might utilize constraints associated with known features under the information-theoretic framework described in the following. As in [8], one might consider minimizing I(g, f_0) subject to

E_g[t(X)] = d,

where d is a constant vector and t(x) is a measurable multivariate statistic.
If  is a convex set of pdf intersecting ( ) S f ρ , an optimal pdf g * satisfying ( ) is defined as the I-projection of 0 f on  in [9].If such a projection exists, the convexity of  guarantees its uniqueness since ( ) 0 , I g f is strictly convex in g.The following theorem follows immediately from the above theorem in [9].
Theorem 1. Let the constraint set be ℰ = { g : E_g[t_i(X)] = d_i, i = 1, …, m }, where, for each i, t_i is a measurable statistic. Then the optimal probability density function (the I-projection of f_0) takes the form

g*(x) = C(λ) f_0(x) exp( ∑_{i=1}^{m} λ_i t_i(x) ),

where C(λ) is the normalizing constant. Similar to the I-projection, the maximum entropy distribution is also an optimal density under constraints. It is also known as the Maxwell-Boltzmann distribution, the optimal probability density function under temperature constraints. Consider the following maximization problem:

max_g H(g) = −∫ g(x) log g(x) dx, subject to E_g[t_i(X)] = d_i, i = 1, …, m.

By applying the maximum entropy theorem in [7] with the constraints set as the logarithms of certain density functions, we then have the following result.
Theorem 2. Let f_0, f_1, …, f_m be a set of probability density functions. If we set t_i(x) = log f_i(x), i = 0, 1, …, m, then there exists one unique maximum entropy density function, which takes the form

g*(x) = C(λ) ∏_{i=0}^{m} f_i(x)^{λ_i},

where C(λ) is the normalizing constant. It is clear that the I-projection and the maximum entropy distribution can belong to the same functional class when a set of pdfs is used to formulate the constraints.
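Theorems 1 and 2 both land in the class of normalized geometric combinations C(λ) ∏ f_i^{λ_i}. A minimal numerical sketch (the discrete densities and weights below are hypothetical) shows how members of this class are formed:

```python
import numpy as np

def geometric_combination(densities, lam):
    """Form the density C(lam) * prod_i f_i^{lam_i} on a common discrete
    support, the functional class shared by the I-projection and the
    maximum entropy solution; normalization supplies the constant C(lam)."""
    g = np.ones_like(np.asarray(densities[0], dtype=float))
    for f_i, l_i in zip(densities, lam):
        g = g * np.asarray(f_i, dtype=float) ** l_i
    return g / g.sum()

f0 = np.array([0.7, 0.2, 0.1])   # assumed joint model (hypothetical)
f1 = np.array([0.2, 0.5, 0.3])   # constraint density (hypothetical)

g_star = geometric_combination([f0, f1], [0.5, 0.5])
```

Setting λ = (1, 0) recovers f_0 itself, while intermediate weights interpolate between the two densities on the log scale.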

Derivation of Composite Likelihood Using Pseudo-Metric
If we consider the functional space of all probability density functions satisfying certain conditions and adopt the relative entropy as a pseudo-metric, then a more natural point of view is to seek an optimal density minimizing the relative entropy under constraints characterized by the pseudo-distance between the optimal density and a collection of candidate models f_0, f_1, f_2, …, f_m. In the context of composite likelihoods, f_0 is the assumed joint statistical model, while the other pdfs are low-dimensional densities used to complete the construction of a refined model, which may or may not coincide with f_0. For example, one could assume a statistical model under an independence structure, where f_1, …, f_m are low-dimensional probability density functions. The composite likelihood framework, however, is capable of going beyond this often over-simplified model.
To ensure that the optimal density reflects some known key characteristics of the low-dimensional densities of the true pdf, one can apply the idea of the I-projection or the maximum entropy distribution by considering the following minimization problem:

min_g I(g, f_0), subject to I(g, f_i) = ρ_i, i = 1, …, m,

where the ρ_i are functions of the true joint pdf f. The constraints employed here are different from, and more natural than, those in the I-projection and maximum entropy formulations. In the original setup of the I-projection and the maximum entropy distribution, the constraints are expectations of certain statistics. The theorems of the I-projection and maximum entropy, however, are no longer applicable, as the current set of constraints involves log g(x). We now present the main theorem of this section.

Theorem 3. Given probability density functions f_0, f_1, …, f_m, where, for i = 1, …, m, the constraints are I(g, f_i) = ρ_i, the optimal probability density function satisfying

I(g*, f_0) = min_g { I(g, f_0) : I(g, f_i) = ρ_i, i = 1, …, m }

takes the form

g*(x) = C(λ) ∏_{i=0}^{m} f_i(x)^{λ_i},

where C(λ) is the normalizing constant. The assertion of this theorem implies that the constraints in the original I-projection can be further generalized so that they are also functionals of the probability density we seek. It can also be seen that each constraint defines a sphere in the functional space of all probability density functions, as in the context of the I-projection.
The optimal pdf under the current constraints belongs to the following functional class:

𝒢 = { g : g(x) = C(λ) ∏_{i=1}^{m} f_i(x)^{λ_i} },

where f_1, f_2, …, f_m are low-dimensional density functions. We now consider four special cases:

1) (INDEPENDENCE CASE) Suppose the constraints are defined by the marginals f_{[i]}(x_i); note that we use f_{[i]} to denote the marginals in order to distinguish them from the probability densities f_i used in the construction. If we set f_0(x) = ∏_{i=1}^{p} f_{[i]}(x_i), then the constraints, which are based on the marginals only, do not bring in any additional structural information beyond f_0. Therefore, the optimal density is of the form g*(x) = ∏_{i=1}^{p} f_{[i]}(x_i) if all the weights equal 1.
2) (CORRELATION CASE) If the constraints are defined by the marginals f_{[i]}(x_i) and the bivariate densities f_{[ij]}(x_i, x_j), the optimal density is then constructed from the marginals and all pairwise bivariate densities. A simplified form, with equal weights, is the pairwise composite likelihood ∏_{i<j} f_{[ij]}(x_i, x_j; θ).

3) (CONDITIONAL CASE) If the constraints are defined by the conditional densities f(x_i | x_j), we can then derive the conditional composite likelihood.
4) (SPATIAL AND TEMPORAL CASE) Weights might be most appropriate in spatial or temporal settings. Consider the differences y_{ijts} = x_{ij} − x_{ts} for some given t and i. The composite likelihood can still be derived if the Jacobian of the transformation is ignored due to its complexity. This would allow spatial and temporal correlation structures to be incorporated.
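For the conditional case (case 3), a sketch analogous to the pairwise construction can be written down; the equicorrelated normal model and all numerical settings here are illustrative assumptions, not taken from the paper. For a standardized bivariate normal pair, X_i | X_j = x ~ N(ρx, 1 − ρ²), so each conditional term has a closed form:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data from an equicorrelated standard normal (illustrative).
n, p, rho_true = 500, 4, 0.3
cov = np.full((p, p), rho_true) + (1 - rho_true) * np.eye(p)
X = rng.multivariate_normal(np.zeros(p), cov, size=n)

def conditional_cl(rho, X):
    """Conditional composite log-likelihood: sum over ordered pairs (i, j),
    i != j, of log f(x_i | x_j), where X_i | X_j = x ~ N(rho*x, 1 - rho^2)."""
    ll = 0.0
    var = 1 - rho**2
    _, p = X.shape
    for i in range(p):
        for j in range(p):
            if i != j:
                resid = X[:, i] - rho * X[:, j]
                ll += np.sum(-0.5 * np.log(2 * np.pi * var)
                             - resid**2 / (2 * var))
    return ll

# Grid-search maximizer, standing in for a numerical optimizer.
grid = np.linspace(-0.9, 0.9, 361)
rho_hat = grid[np.argmax([conditional_cl(r, X) for r in grid])]
```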

Asymptotic Properties of Composite Likelihood
In this section, we establish the asymptotic properties associated with the composite likelihood inference under the proposed information-theoretic framework.The consistency of the estimators is proved by following the argument in [10].
For clarity of presentation, we first define the following notation: • Denote the true density function by f(x), and let {f_1, …, f_k} be the set of density function components under consideration.
• Let D(·, ·) denote the distance function defined over the space of all density functions; the set of probability density functions under consideration may not contain the true density function. Assume that there is a unique minimizer ϑ* of this distance; D is chosen as the K-L divergence in this paper. • Let ϑ̂ be the estimate of ϑ maximizing the weighted composite likelihood.
We make the following assumptions.
The density functions under consideration are measurable and linearly independent in probability.
for sufficiently small ε and for sufficiently large τ.
We first give four lemmas in the following before we present the theorems regarding the limiting behavior of the weighted composite likelihood estimators.
Lemma 1. The following hold true: (L1) Under Assumption 1, f(x, ϑ) is measurable, and hence, for any ε > 0, the associated events are measurable. The four theorems describing the limiting behavior of the weighted composite likelihood estimators are given below.
Remark 1. Note that in the proof of Theorem 4, the strong law of large numbers is used. If the proof instead follows the method given in [11], the consistency of ϑ̂ may be extended to a large class of dependent observations. Remark 2. For simplicity of presentation, we have assumed that {f_1, …, f_k} are parametric. This restriction is not necessary.
In the following we assume that λ is a constant vector. For ease of presentation, let θ̂ be a solution of the estimating equations

∑_{t=1}^{n} ∂l(x_t, θ)/∂θ = 0

for a twice differentiable function l(x, θ). To investigate the limiting distribution of the composite likelihood estimator, we make the following three additional assumptions.
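To make the estimating-equation view concrete (a hypothetical numerical sketch, not the paper's example): under an independence composite likelihood with unit marginal variances, the score for a common mean μ is ∑_{i,j} (x_{ij} − μ), and its root can be found by any one-dimensional solver, here bisection.

```python
import numpy as np

rng = np.random.default_rng(1)

# Correlated data: the independence composite likelihood still identifies
# the common mean even though the working model ignores dependence.
n, p, mu_true = 400, 3, 2.0
cov = 0.5 * np.ones((p, p)) + 0.5 * np.eye(p)
X = rng.multivariate_normal(mu_true * np.ones(p), cov, size=n)

def composite_score(mu):
    """Score of the independence composite log-likelihood for a common
    mean mu (unit marginal variances assumed): sum_{i,j} (x_ij - mu)."""
    return np.sum(X - mu)

# Bisection solve of the estimating equation composite_score(mu) = 0;
# the score is strictly decreasing in mu.
lo, hi = -10.0, 10.0
for _ in range(60):
    mid = 0.5 * (lo + hi)
    if composite_score(mid) > 0:
        lo = mid
    else:
        hi = mid
mu_hat = 0.5 * (lo + hi)
```

Here the solution coincides with the grand sample mean, as the score equation implies; the bisection loop merely illustrates solving a composite score equation numerically.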
Assumption 8. For each x, l(x, θ) is twice continuously differentiable in θ. Assumption 10. There exist a positive number τ_θ and a positive function dominating the second derivatives of l(x, θ) in a neighborhood of θ. In light of [12], Assumptions 1-8 made in Theorem 7 may be replaced by assumptions similar to those assumed in Theorem 4.17 of Shao (2003).
Remark 4. Let θ̂ be the solution of the estimating equations above. This completes the proof. ◊

Proof of Theorem 3: By the Lagrange method, we seek to minimize the objective function

U(g) = ∫ g(x) log [ g(x) / f_0(x) ] dx + ∑_{i=1}^{m} η_i ( ∫ g(x) log [ g(x) / f_i(x) ] dx − ρ_i ) + η_0 ( ∫ g(x) dx − 1 ),

where η_0, η_1, …, η_m are Lagrange multipliers. The objective function can then be rearranged accordingly. Since U(g) is not a function of g′, the first-order derivative of g, the Euler-Lagrange equation is obtained by setting the derivative of the integrand with respect to g equal to zero. It then follows that the optimal density function takes the form

g*(x) = C(λ) ∏_{i=0}^{m} f_i(x)^{λ_i}.

Let ϖ_0 be the subset of ϖ consisting of all points ϑ of Ω × Ξ for which κ_0 ≤ ϑ. By Lemmas 2-3, since ϖ_0 is a closed set, there exists a finite number of points covering it.

Proof of Lemma 2: In view of the definition of ϑ*, the properties of the K-L divergence, and Lemma 1, Lemma 2 can be proved by following the proof of Lemma 1 of Wald (1949). ◊ Proof of Lemma 3: By Lemma 1, Lemma 3 can be proved by following the proof of Lemma 2 of Wald (1949). ◊ Proof of Lemma 4: By applying Lemma 1, Lemma 4 can be proved by following the proof of Lemma 3 of Wald (1949). ◊ Proof of Theorem 4: By Lemmas 2 and 4, we can find a positive number such that the required bound holds; by Theorem 4, the exceptional event has zero probability. Thus all limit points ϑ̂ of {ϑ̂_n} satisfy the inequality |ϑ̂ − ϑ*| ≤ ε with probability one, which concludes the theorem. ◊ Proof of Theorem 7: By following the proof of Theorem 4.17 of Shao (2003), it can be shown that the assertion of Theorem 7 holds.