Bayesian Analysis of Simple Random Densities

A tractable nonparametric prior over densities is introduced which is closed under sampling and exhibits proper posterior asymptotics.


Introduction
The early 1970s witnessed Bayesian inference going nonparametric with the introduction of statistical models having infinite-dimensional parameter spaces, the most conspicuous being the Dirichlet process [5], a prior on the class of all probability measures over a given sample space that trades great analytical tractability for a reduced support: as shown by Blackwell [4], its realizations are, almost surely, discrete probability measures. The posterior expectation of a Dirichlet process is a probability measure that gives positive mass to each observed value in the sample, making the plain Dirichlet process unsuitable for inferential problems such as density estimation. Many extensions of and alternatives to the Dirichlet process have been proposed [6].
In this paper we construct a prior distribution over the class of densities with respect to Lebesgue measure. Given a partition in subintervals of a bounded interval of the real line, we define a random density whose realizations have a constant value on each subinterval of the partition. The distribution of the values of the random density on each subinterval is specified by transforming and conditioning a multivariate normal distribution.
Our simple random density is the finite dimensional analogue of the stochastic processes introduced by Thorburn [10] and Lenk [7]. Computations with these stochastic processes involve an intractable normalization constant, and are restricted to values of the random density on a finite number of arbitrarily chosen domain points, demanding some kind of interpolation of the results. The finite dimensionality of our random density makes our computations more direct and transparent and gives us simpler statements and proofs.
An outline of the paper is as follows. In Section 2, we give the formal definition of a simple random density. In Section 3, we prove that the distribution of a simple random density is closed under sampling. The results of the simulations in Section 4 depict the asymptotic behavior of the posterior distribution. We extend the model hierarchically in Section 5 to deal with random partitions. Although the usual Bayes estimate of a simple random density is a discontinuous density, in Section 6 we compute smooth estimates solving a decision problem in which the states of nature are realizations of the simple random density and the actions are smooth densities of a suitable class. Additional propositions and proofs of all the results in the paper are given in Section 7.

Simple random densities
Let (Ω, F, P) be the probability space from which we induce the distributions of all random objects considered in the paper. For some integer k ≥ 1, let R^k_+ be the subset of vectors of R^k with positive components. Write R_k for the Borel sigma-field of R^k. Let λ_k denote Lebesgue measure over (R^k, R_k). We omit the indexes when k = 1. The components of a vector v ∈ R^k are written as v_1, ..., v_k.
Suppose that we have been given an interval [a, b] ⊂ R and a set of real numbers a = t_0 < t_1 < ... < t_k = b, which partition [a, b] into the subintervals [t_0, t_1), ..., [t_{k-1}, t_k]. The class of simple densities with respect to this partition consists of the nonnegative simple functions which have a constant value on each subinterval and integrate to one. Let d_i = t_i - t_{i-1}, for i = 1, ..., k, and define the map S_∆ : R^k → R by S_∆(v) = Σ_{i=1}^k d_i v_i. Each simple density f : R → R within this class can be represented as f = Σ_{i=1}^k h_i I_{[t_{i-1}, t_i)}, in which h = (h_1, ..., h_k) ∈ R^k is such that each h_i ≥ 0 and S_∆(h) = 1. The h_i's are the heights of the steps of the simple density f. From now on, let H_r = {v ∈ R^k_+ : S_∆(v) = r}. Note that, by the definition of the d_i's given above, it follows that H_r = ∅ if r ≤ 0. Moreover, define the projection on the first k - 1 coordinates π : R^k → R^{k-1} by π(v) = (v_1, ..., v_{k-1}). For a normal random vector Z = (Z_1, ..., Z_k) with mean m ∈ R^k and k × k covariance matrix Σ, denote by U ∼ L_k(m, Σ) the distribution of the lognormal random vector U = (e^{Z_1}, ..., e^{Z_k}). If Σ is nonsingular, it is easy to show that U has a density f_U(u) = (2π)^{-k/2} |Σ|^{-1/2} (u_1 ... u_k)^{-1} exp(-(1/2)(log u - m)^⊤ Σ^{-1} (log u - m)) I_{R^k_+}(u), in which |Σ| is the determinant of Σ, and we have introduced the notations log u = (log u_1, ..., log u_k)^⊤ and m = (m_1, ..., m_k)^⊤.
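The objects above can be sketched numerically. The following Python snippet is a minimal illustration under the natural readings of the formulas (the map S_∆(h) = Σ_i d_i h_i as the integral of the step function, and the step representation of f); all function names are ours, not notation from the paper.

```python
import numpy as np

def make_partition(a, b, k):
    """Uniform partition of [a, b] into k subintervals."""
    t = np.linspace(a, b, k + 1)
    d = np.diff(t)                      # widths d_1, ..., d_k
    return t, d

def S_delta(h, d):
    """S_Delta(h): the integral of the simple function with heights h."""
    return float(np.dot(d, h))

def simple_density(x, t, h):
    """Evaluate the step density with heights h at points x."""
    x = np.asarray(x, dtype=float)
    # np.searchsorted locates the subinterval containing each x
    idx = np.clip(np.searchsorted(t, x, side="right") - 1, 0, len(h) - 1)
    return np.where((x >= t[0]) & (x <= t[-1]), np.asarray(h)[idx], 0.0)

t, d = make_partition(0.0, 1.0, 4)
h = np.array([0.5, 1.5, 1.0, 1.0])      # heights with S_Delta(h) = 1
assert abs(S_delta(h, d) - 1.0) < 1e-12
```

A vector h with S_∆(h) = 1 is exactly a point of the set H_1 defined above.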
We define a random density whose realizations are simple densities with respect to the partition induced by ∆ by specifying the distribution of the random vector of its steps heights. Informally, the steps heights will have the distribution of a lognormal random vector U given that S ∆ (U) = 1. The formal definition of the random density is given in terms of a version of the conditional distribution of U given S ∆ (U) and the expression of its conditional density with respect to a dominating measure. However, we are outside the elementary case in which the joint distribution is dominated by a product measure. In fact, we have in Proposition 7.1 a simple proof that Lebesgue measure λ k+1 and the joint distribution of U and S ∆ (U) are mutually singular.
A suitable family of measures that dominate the conditional distribution of U given S ∆ (U), for each value of S ∆ (U), is described in the following lemma.
Lemma 2.1. For A ∈ R_k, define τ_r(A) = d_k^{-1} λ_{k-1}(π(A ∩ H_r)), for r ∈ R. Then, each τ_r is a measure over (R^k, R_k).
The proof of Lemma 2.1 is given in Section 7. Figure 1 gives a simple geometric interpretation of the measures τ r when the underlying partition is formed by three subintervals.
The following result is the basis for the formal definition of the random density.
Theorem 2.2. Let U ∼ L_k(m, Σ), with nonsingular Σ, and let {τ_r}_{r∈R} be the family of measures over (R^k, R_k) defined in Lemma 2.1. Then, we have that µ_{U|S_∆(U)}(A | r) = (1/f_{S_∆(U)}(r)) ∫_A f_U(u) dτ_r(u), for A ∈ R_k, is a regular version of the conditional distribution of U given S_∆(U), in which f_{S_∆(U)} is a density of S_∆(U) with respect to Lebesgue measure. The necessary lemmata and the proof of Theorem 2.2 are given in Section 7. The following definition of the random density uses the specific version of the conditional distribution constructed in Theorem 2.2.
Definition 2.3. The function ϕ = Σ_{i=1}^k H_i I_{[t_{i-1}, t_i)} is a simple random density, in which H = (H_1, ..., H_k) are the random heights of the steps of ϕ, with distribution given by µ_H(A) = µ_{U|S_∆(U)}(A | 1), for A ∈ R_k, and µ_{U|S_∆(U)} is the regular version of the conditional distribution of U given S_∆(U) obtained in Theorem 2.2. Hence, for every A ∈ R_k, we have µ_H(A) = (1/f_{S_∆(U)}(1)) ∫_A f_U(u) dτ_1(u), in which τ_1(A) = d_k^{-1} λ_{k-1}(π(A ∩ H_1)), and it holds that µ_H(H_1) = 1. We use the notation ϕ ∼ ∆(m, Σ).

Conditional model
Now, we model a set of absolutely continuous observables conditionally, given the value of a simple random density ϕ. The following lemma, proved in Section 7, describes the conditional model and determines the form of the likelihood.
Lemma 3.1. Let ϕ ∼ ∆(m, Σ), with steps heights H, and let the random variables X_1, ..., X_n be conditionally independent and identically distributed, given that H = h, with density Σ_{i=1}^k h_i I_{[t_{i-1}, t_i)}. Then, µ_{X|H}(· | h) ≪ λ_n, with f_{X|H}(x | h) = Π_{i=1}^k h_i^{c_i}, in which c_i is the number of sample points x_j belonging to the i-th subinterval of the partition. The factorization criterion implies that c = (c_1, ..., c_k) is a sufficient statistic for ϕ. That is, in this conditional model, as one should expect, all the sample information is contained in the counts of how many sample points belong to each subinterval of the partition induced by ∆.
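The sufficiency reduction can be sketched numerically: assuming the likelihood has the product form Π_i h_i^{c_i}, with c_i the count of sample points in the i-th subinterval, the log-likelihood depends on the data only through a histogram. The data and names below are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 5)            # partition of [0, 1] into k = 4 parts
h = np.array([0.5, 1.5, 1.0, 1.0])      # heights of a simple density
x = rng.uniform(0.0, 1.0, size=100)     # a sample of points in [0, 1]

# Counts per subinterval: the sufficient statistic c = (c_1, ..., c_k).
c, _ = np.histogram(x, bins=t)

# Log of the assumed likelihood prod_i h_i^{c_i}.
log_lik = float(np.sum(c * np.log(h)))

assert c.sum() == len(x)
```

Two samples with the same counts c yield the same value of log_lik, which is the content of the sufficiency claim.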
Using the notation of Lemma 3.1, and defining c = (c_1, ..., c_k)^⊤, we can prove that the prior distribution of ϕ is closed under sampling.

Theorem 3.2. Let ϕ ∼ ∆(m, Σ) and let X = (X_1, ..., X_n) be modeled as in Lemma 3.1. Then, given that X = x, we have that ϕ ∼ ∆(m*, Σ), in which m* = m + Σ c.
This result, proved in Section 7, makes the simulations of the prior and posterior distributions essentially the same, the only difference being the computation of m * .
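A numerical sketch of this conjugacy: treating the update rule m* = m + Σc (the form suggested by the matrix algebra in the proof of Theorem 3.2 in Section 7) as an assumption, the posterior differs from the prior only in its location parameter. All concrete values below are illustrative.

```python
import numpy as np

k = 4
m = np.ones(k)                                    # prior location, m_i = 1
xs = np.linspace(0.0, 1.0, k)                     # points indexing the grid
rho, theta = 1.0, 10.0
# Gaussian covariance function C(x, y) = rho * exp(-theta (x - y)^2).
Sigma = rho * np.exp(-theta * (xs[:, None] - xs[None, :]) ** 2)

c = np.array([10.0, 40.0, 30.0, 20.0])            # counts from the sample
m_star = m + Sigma @ c                            # assumed posterior location

assert m_star.shape == (k,)
```

Only m changes; the covariance Σ is carried over unchanged, which is why prior and posterior simulations are "essentially the same."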

Stochastic simulations
We summarize the distribution of a simple random density ϕ ∼ ∆(m, Σ) by an estimate φ̂ = Σ_{i=1}^k ĥ_i I_{[t_{i-1}, t_i)} of its realizations, defining B(ĥ, ε) = {v ∈ H_1 : ‖v − ĥ‖ < ε}, for ε > 0, and taking as a credible set the B(ĥ, ε) with the smallest ε such that µ_H(B(ĥ, ε)) ≥ 0.95. The Random Walk Metropolis algorithm [8] is used to draw dependent realizations of the steps of ϕ as values of a Markov chain {H^{(i)}}_{i≥0}. The two summaries are computed through ergodic means of this chain. For example, the credible set is determined with the help of the almost sure convergence of (1/N) Σ_{i=1}^N I_{B(ĥ,ε)}(H^{(i)}) to µ_H(B(ĥ, ε)). As for the parameters appearing in Definition 2.3, we take in our experiments all the m_i's equal to one, and the covariance matrix Σ = (σ_ij) is chosen in the following way: given some positive definite covariance function C : [a, b] × [a, b] → R, we induce Σ by evaluating C at points of the partition. In our examples we study the family of Gaussian covariance functions defined by C_{ρ,θ}(x, y) = ρ e^{-θ(x-y)²}, with dispersion parameter ρ > 0 and scale parameter θ > 0. We observe the same asymptotic behavior of the posterior distribution with data coming from a triangular distribution and a mixture of normals (with appropriate truncation of the sample space).
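A minimal Random Walk Metropolis sketch follows, under our assumed form of the target: the lognormal kernel restricted to H_1 times the likelihood Π_i h_i^{c_i}. The chain moves in the first k − 1 heights; the last height is determined by the constraint Σ_i d_i h_i = 1, and proposals leaving H_1 are rejected. Step size, counts, and all names are ours.

```python
import numpy as np

rng = np.random.default_rng(1)
k = 4
d = np.full(k, 0.25)                              # uniform widths on [0, 1]
xs = np.linspace(0.0, 1.0, k)
Sigma = 1.0 * np.exp(-10.0 * (xs[:, None] - xs[None, :]) ** 2)
Sigma_inv = np.linalg.inv(Sigma)
m = np.ones(k)
c = np.array([10.0, 40.0, 30.0, 20.0])            # counts from the data

def full_heights(y):
    """Complete (h_1, ..., h_k) from the free coordinates y, or None."""
    hk = (1.0 - np.dot(d[:-1], y)) / d[-1]
    if np.any(y <= 0.0) or hk <= 0.0:
        return None                               # outside H_1
    return np.append(y, hk)

def log_target(h):
    """Log of likelihood times lognormal kernel (up to a constant)."""
    l = np.log(h) - m
    return float(c @ np.log(h) - 0.5 * l @ Sigma_inv @ l - np.log(h).sum())

y = np.full(k - 1, 1.0)                           # start at uniform heights
h, lp = full_heights(y), log_target(full_heights(y))
samples = []
for _ in range(5000):
    y_new = y + 0.1 * rng.standard_normal(k - 1)  # random walk proposal
    h_new = full_heights(y_new)
    if h_new is not None:
        lp_new = log_target(h_new)
        if np.log(rng.uniform()) < lp_new - lp:   # Metropolis acceptance
            y, h, lp = y_new, h_new, lp_new
    samples.append(h)

h_bar = np.mean(samples, axis=0)                  # ergodic mean of heights
assert abs(np.dot(d, h_bar) - 1.0) < 1e-9         # the chain stays on H_1
```

Since every state satisfies the linear constraint exactly, so does the ergodic mean, which is one candidate for the estimate ĥ.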

Random partitions
Inferentially, we have a richer construction when the definition of the simple random density involves a random partition. Informally, we want a model for the random density in which the underlying partition adapts itself according to the information contained in the data.

On each graph, the black simple density is the estimate φ̂, the light gray region is a credible set with credibility level of 95%, and the dark gray curve is the data generating density.
We consider a family of uniform partitions of a given interval [a, b]. Each partition of this family will be described by a positive integer random variable K, which determines the number of subintervals in the partition. Since the parameter ρ of the family of Gaussian covariance functions used to induce Σ may have different meanings for different partitions, we treat it as a positive random variable R.

Figure 4. Posterior summaries for Example 5.1. The black simple density is the estimate φ̂, the light gray region is a credible set with credibility 95%, and the dark gray curve is the data generating density.

Explicitly, we are considering the following hierarchical model: K and R are independent. Given that K = k and R = ρ, we choose the uniform partition of the interval [a, b] into k subintervals, induce Σ_{ρ,θ} from the family of Gaussian covariance functions, and take the simple random density ϕ ∼ ∆(m, Σ_{ρ,θ}). Finally, the observables are modeled as in Lemma 3.1. This hierarchy is summarized in the following graph.
In the following example, instead of specifying priors for K and R, we define the likelihood of K and R by L_x(k, ρ) = f_{X|K,R}(x | k, ρ), whose form is obtained in Proposition 7.6, find the maximum (k̂, ρ̂) = arg max_{k,ρ} L_x(k, ρ), and use these values in the definition of the prior, determining the posterior summaries as we did in Section 4.
Example 5.1. With a sample of size 2 000 generated from a Beta(4, 2) distribution, we find the maximum of the likelihood of K and R at (k̂, ρ̂) = (9, 1.43). In Figure 4 we have the posterior summaries obtained using these values in the definition of the prior. Moreover, in the left graph of Figure 5 we have the distribution function F̂ corresponding to the estimated posterior density. For the sake of comparison, we plot in the right graph of Figure 5 some quantiles of this distribution F̂ against the quantiles of the distribution F_0 from which we generated the data.

Smooth estimates
It is possible to go beyond the discontinuous densities obtained as estimates in the last two sections and get smooth estimates of a simple random density ϕ solving a Bayesian decision problem in which the states of nature are the realizations of ϕ and the actions are smooth densities of a suitable class.
In view of Theorem 3.2, it is enough to consider the problem without data. As before, the sample space is the interval [a, b], which is partitioned according to some ∆. For a density f with respect to Lebesgue measure, we denote its L_2 norm by ‖f‖_2 = (∫ f² dλ)^{1/2}.
Then, the Bayes decision is φ̂ = Σ_{i=1}^N α̂_i g_i, in which the α̂_i minimize globally the quadratic form Q(α) = Σ_{i=1}^N Σ_{j=1}^N α_i α_j M_ij − 2 Σ_{i=1}^N α_i ∫ g_i E[ϕ] dλ, with M_ij = ∫ g_i g_j dλ, subject to the constraints α_i ≥ 0, for i = 1, ..., N, and Σ_{i=1}^N α_i = 1.

Figure 6. Example 6.2. On the right graph, the black simple density is the estimate φ̂, and the light gray region is a credible set with credibility 95%. On both graphs the dark gray curve is the data generating density. On the left graph, the black smooth density is the Bayes decision of Proposition 6.1.

We use the result of Proposition 6.1, proved in Section 7, choosing the g_i's inside a class of smooth densities that serve approximately as a basis to represent any continuous density with the specified support.
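The constrained minimization in Proposition 6.1 is a finite-dimensional quadratic program over the simplex. Here is a numerical sketch under our own discretization of the L_2 norm on a grid, solved by projected gradient descent; the candidate densities g_i, the step "estimate," and the step sizes are all illustrative, not the paper's choices.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u - css / (np.arange(len(v)) + 1) > 0)[0][-1]
    theta = css[rho] / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

grid = np.linspace(0.0, 1.0, 400)
# A step "estimate" on [0, 1] and three candidate smooth densities.
f_step = np.where(grid < 0.5, 0.5, 1.5)
G = np.stack([np.full_like(grid, 1.0),             # uniform density
              2.0 * grid,                          # Beta(2, 1) density
              2.0 * (1.0 - grid)])                 # Beta(1, 2) density

alpha = np.full(3, 1.0 / 3.0)                      # start at equal weights
step = 1e-4
for _ in range(2000):
    resid = G.T @ alpha - f_step                   # current approximation error
    alpha = project_simplex(alpha - step * (G @ resid))

assert np.all(alpha >= 0.0) and abs(alpha.sum() - 1.0) < 1e-9
```

The projection step enforces exactly the constraints α_i ≥ 0 and Σ α_i = 1, so every iterate is a genuine mixture density.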
For the next example, suppose that the support of the densities is the interval [0, 1]. Bernstein's Theorem (see [3], Theorem 6.2) states that the polynomial B_N(x) = Σ_{i=0}^N f(i/N) (N choose i) x^i (1 − x)^{N−i} approximates uniformly any continuous function f defined on [0, 1], when N → ∞. Suppose that f is a density. If we define, for i = 0, ..., N, α_i = f(i/N)/(N + 1), we can rewrite the approximating polynomial as B_N(x) = Σ_{i=0}^N α_i g_i(x), in which g_i is a density of a Beta(i + 1, N − i + 1) random variable. Hence, if we take a sufficiently large N, we expect that any continuous density with support [0, 1] will be reasonably approximated by a mixture of these g_i's.
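The Bernstein construction above can be checked directly: with g_i the Beta(i+1, N−i+1) density, the mixture Σ_i α_i g_i with α_i = f(i/N)/(N+1) reproduces B_N. The test density below is our choice for illustration.

```python
from math import comb

def beta_density(x, i, N):
    """Density of a Beta(i+1, N-i+1) random variable at x in [0, 1]."""
    return (N + 1) * comb(N, i) * x**i * (1.0 - x)**(N - i)

def bernstein_mixture(f, N, x):
    """B_N(x) written as the mixture sum_i alpha_i g_i(x)."""
    alpha = [f(i / N) / (N + 1) for i in range(N + 1)]
    return sum(a * beta_density(x, i, N) for i, a in enumerate(alpha))

f = lambda x: 2.0 * x                   # the Beta(2, 1) density on [0, 1]
for x in (0.1, 0.5, 0.9):
    assert abs(bernstein_mixture(f, 200, x) - f(x)) < 0.05
```

For this linear f the Bernstein polynomial reproduces f essentially exactly; for general continuous densities the approximation error vanishes only as N grows.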
Example 6.2. Suppose that we have a sample of 5 000 data points simulated from a truncated exponential distribution. Repeating the analysis made in Example 5.1, we find the maximum of the likelihood of K and R at (k̂, ρ̂) = (9, 0.86). The left graph of Figure 6 presents the posterior summaries. After that, we solved the problem of constrained optimization in Proposition 6.1 and found the results shown in the right graph of Figure 6.

Additional results and proofs
In this section we present some subsidiary propositions and give proofs to all the results stated in the paper.

Proposition 7.1. Lebesgue measure λ_{k+1} and the joint distribution µ_{U,S_∆(U)} of U and S_∆(U) are mutually singular.

Proof. Define the set A = {(u, r) ∈ R^k × R : r = S_∆(u)}. Then, µ_{U,S_∆(U)}(A) = 1, by definition of S_∆. On the other hand, note that λ_{k+1}(A) = 0, since this is the (k+1)-volume of the k-dimensional hyperplane defined by the set A. Since µ_{U,S_∆(U)}(A^c) = 0, the result follows.
Proof of Lemma 2.1. When r ≤ 0, the result is trivial, since in this case H_r = ∅, making τ_r a null measure. Suppose that r > 0 and let g : R^k → R^k be the function defined by g(y, r) = (y_1, ..., y_{k−1}, (r − Σ_{i=1}^{k−1} d_i y_i)/d_k), for y ∈ R^{k−1} and r ∈ R. Define h_r : R^{k−1} → R^k by h_r(y) = g(y, r). We will show that π(A ∩ H_r) = h_r^{−1}(A), for every A ∈ R_k. Suppose that y ∈ π(A ∩ H_r). Then, there is a v ∈ A ∩ H_r such that y = π(v) = (v_1, ..., v_{k−1}). Since v ∈ H_r, we have that Σ_{i=1}^k d_i v_i = r, so that v_k = (r − Σ_{i=1}^{k−1} d_i v_i)/d_k and, therefore, v = h_r(y). Since v ∈ A, it follows from the definition of the inverse image of h_r that y ∈ h_r^{−1}(A) and, therefore, we conclude that π(A ∩ H_r) ⊂ h_r^{−1}(A). To prove the other inclusion, suppose that y ∈ h_r^{−1}(A) and define v = h_r(y). Hence, v ∈ A and, by the definition of h_r, we have that v = g(y, r) = (y_1, ..., y_{k−1}, (r − Σ_{i=1}^{k−1} d_i y_i)/d_k), so that S_∆(v) = r. Since v ∈ A ∩ H_r and y = π(v), it follows that y ∈ π(A ∩ H_r). Therefore, h_r^{−1}(A) ⊂ π(A ∩ H_r). Hence, we have that τ_r(A) = d_k^{−1} λ_{k−1}(h_r^{−1}(A)), and the properties of the inverse image of h_r and the Lebesgue measure entail that each τ_r is a measure over (R^k, R_k).
Proof. Define the function T : R^k → R^k × R by T(u) = (u, S_∆(u)), so that µ_{U,S_∆(U)} = µ_U ∘ T^{−1}. Computing µ_{U,S_∆(U)}(A × B) by transforming the integral of f_U by T, we conclude that µ_{U,S_∆(U)}(A × B) = ∫_{A×B} f_U(u) dξ(u, r), in which u ∈ R^k and r ∈ R. It follows that µ_{U,S_∆(U)} ≪ ξ, and the Radon–Nikodym derivative has the desired expression.
Proof. Define f : R^k → R^k by f(v) = (v_1, ..., v_{k−1}, S_∆(v)). Hence, f is a differentiable function whose inverse is the differentiable function g defined in Lemma 2.1. The value of the Jacobian at the point v ∈ R^k is J_g(v) = d_k^{−1}. Let A ∈ R_k, y ∈ R^{k−1}, r ∈ R, and define h_r as in Lemma 2.1. When r > 0, we have already shown in the course of the proof of Lemma 2.1 that π(A ∩ H_r) = h_r^{−1}(A), for every A ∈ R_k. Remembering that, by definition, H_r ⊂ R^k_+, it follows that π(A ∩ H_r) = h_r^{−1}(A ∩ R^k_+), and we conclude that I_{π(A∩H_r)}(y) = I_{A∩R^k_+}(g(y, r)). Now suppose that r ≤ 0. In this case, since H_r = ∅, we have that I_{π(A∩H_r)}(y) = I_∅(y) = 0. As for the value of I_{A∩R^k_+}(g(y, r)), consider two subcases: since g(y, r) = (y_1, ..., y_{k−1}, (r − Σ_{i=1}^{k−1} d_i y_i)/d_k), if any of the y_i ≤ 0, then I_{A∩R^k_+}(g(y, r)) = 0; otherwise, we have (r − Σ_{i=1}^{k−1} d_i y_i)/d_k ≤ 0, and again it happens that I_{A∩R^k_+}(g(y, r)) = 0. Therefore, we conclude that in this case also I_{π(A∩H_r)}(y) = I_{A∩R^k_+}(g(y, r)). Hence, for A ∈ R_k and B ∈ R, we have that ξ(A × B) = ∫_B τ_r(A) dλ(r) = λ_k(A ∩ R^k_+ ∩ S_∆^{−1}(B)), in which y ∈ R^{k−1} and r ∈ R, the computation being carried out transforming by f and using Tonelli's Theorem. The result follows from the Product Measure Theorem and Fubini's Theorem (see [1], Theorems 2.6.2 and 2.6.4).

Proof. Let A ∈ R_k, u ∈ R^k, and r ∈ R. Let ξ be the measure defined in Lemma 7.2. We have that µ_{S_∆(U)}(B) = µ_{U,S_∆(U)}(R^k × B) = ∫_B (∫ f_U(u) dτ_r(u)) dλ(r), in which the penultimate equality follows from Lemma 7.2, and the last equality follows from Lemma 7.3. Hence, µ_{S_∆(U)} ≪ λ and the Radon–Nikodym derivative has the desired expression.
Proof of Theorem 2.2. Let µ_{U,S_∆(U)} be the joint distribution of U and S_∆(U), and let µ_{S_∆(U)} be the distribution of S_∆(U). For A ∈ R_k and B ∈ R, by the definition of conditional distribution, we have that µ_{U,S_∆(U)}(A × B) = ∫_B µ_{U|S_∆(U)}(A | r) dµ_{S_∆(U)}(r), and the expressions obtained in Lemmas 7.2 to 7.4 identify the integrand, for almost every r [λ]. Therefore, we have that µ_{U|S_∆(U)}(· | r) ≪ τ_r, for almost every r [λ], as desired. The fact that µ_{U|S_∆(U)}(H_r | r) = 1 follows immediately.
Hence, µ_{X|H}(· | h) and α_h agree on the π-system of product sets that generate R_n. Therefore, by Theorem A.26 of [9], both measures agree on the whole sigma-field R_n. It follows that µ_{X|H}(· | h) ≪ λ_n, almost surely [µ_H], and the Radon–Nikodym derivative has the desired expression.
Proof of Theorem 3.2. By Bayes' Theorem, for each A ∈ R_k, the posterior µ_{H|X}(A | x) is obtained integrating over A, with respect to τ_1, the product of the likelihood obtained in Lemma 3.1 and the expression of dµ_H/dτ_1 in Definition 2.3, using the Leibniz rule for Radon–Nikodym derivatives, and normalizing by a constant C_0 such that µ_{H|X}(H_1 | x) = 1. The remainder of the proof relies on some matrix algebra. Since, by definition, Σ is symmetric, we have that (Σ^{−1})^⊤ = (Σ^⊤)^{−1} = Σ^{−1}. Write l = log h. Since the scalar l^⊤ Σ^{−1} m is equal to its transpose (l^⊤ Σ^{−1} m)^⊤ = m^⊤ Σ^{−1} l, we have that exp(c^⊤ l) f_U(h) is, up to a constant C_1, the value at h of a density of U* ∼ L_k(m*, Σ), with m* = m + Σ c. Using this result in the expression of µ_{H|X} together with the expression of f_U, we have dµ_{H|X}(· | x)/dτ_1 (h) = C_2 f_{U*}(h), in which C_2 = (C_0 C_1)/f_{S_∆(U)}(1) and f_{U*} is a density of the random vector U* ∼ L_k(m*, Σ). We conclude that, given that X = x, the vector H has the distribution of the heights of the steps of ϕ* ∼ ∆(m*, Σ), as desired.
Proposition 7.5. Suppose that the random variables X_1, ..., X_{n+1} are modeled according to Lemma 3.1. Denote by µ_{X_i} the distribution of X_i, for i = 1, ..., n + 1. For convenience, use the notations X^{(n)} = (X_1, ..., X_n) and x^{(n)} = (x_1, ..., x_n) ∈ R^n. Then, for every A ∈ R, we have (a) µ_{X_i}(A) = ∫_A E[ϕ(y)] dλ(y), for i = 1, ..., n + 1; and (b) µ_{X_{n+1}|X^{(n)}}(A | x^{(n)}) = ∫_A E[ϕ(y) | X^{(n)} = x^{(n)}] dλ(y), almost surely [µ_{X^{(n)}}]. Proof. By Definition 2.3 and analogous computations, we obtain expressions for the distributions involved as integrals with respect to µ_H. For item (a), the fourth equality of the corresponding computation follows from Tonelli's Theorem. For item (b), for each B ∈ R_n, we compute P{X_{n+1} ∈ A, X^{(n)} ∈ B} in two ways: in the second computation, the third equality follows from the hypothesis of conditional independence and Theorem B.61 of [9], the fourth equality is a consequence of Theorem 2.6.4 of [1], and the sixth equality is due to Tonelli's Theorem. Comparing both expressions for P{X_{n+1} ∈ A, X^{(n)} ∈ B}, we get the desired result.
Proposition 7.6. Let µ_K = P ∘ K^{−1} over (N, 2^N) be the distribution of K and let µ_R = P ∘ R^{−1} over (R, R) be the distribution of R. Denote by µ_{K,R} the joint distribution of K and R, which by the independence of K and R is equal to the product measure µ_K × µ_R, and let µ_{K,R,H} be the joint distribution of K, R and H. In the hierarchical model described in Section 5, we have that µ_{X|K,R}(· | k, ρ) ≪ λ_n, almost surely [µ_{K,R}], with Radon–Nikodym derivative f_{X|K,R}(x | k, ρ) = ∫ f_{X|H}(x | h) dµ_{H|K,R}(h | k, ρ), for the f_{X|H} defined in Lemma 3.1.
Proof. Let A ∈ R n and B ∈ 2 N ⊗ R. By the definition of conditional distribution, we have P {X ∈ A, (K, R) ∈ B} = B µ X|K,R (A | k, ρ) dµ K,R (k, ρ) .
On the other hand, by arguments similar to those used in the proof of Proposition 7.5, we obtain the same probability as an integral over B of the stated Radon–Nikodym derivative, and the result follows.

Proof of Proposition 6.1. Expanding the expected loss E‖f − ϕ‖²_2 and using the linearity of the integral, we see that minimizing the expected loss is the same as solving the problem of constrained minimization of the quadratic form Q. For the matrix M = (M_ij), note that, for every non-null y = (y_1, ..., y_N)^⊤ ∈ R^N, we have y^⊤ M y = ∫ (Σ_{i=1}^N y_i g_i)² dλ > 0, in which we have used the linearity of the integral. Therefore, the matrix M is positive definite, yielding (see [2]) that the quadratic form Q is convex, and the problem of constrained minimization of Q has a single global solution (α̂_1, ..., α̂_N). Since the Bayes decision is the f that minimizes the expected loss, the result follows.