Confidence Regions with Nuisance Parameters

Jan Vrbik

doi:10.4236/ojs.2022.125039

Open Journal of Statistics > Vol.12 No.5, October 2022

Department of Mathematics and Statistics, Brock University, St. Catharines, Canada.

Abstract

Consider a distribution with several parameters whose exact values are unknown and need to be estimated using the maximum-likelihood technique. Under a regular case of estimation, it is fairly routine to construct a confidence region for all such parameters, based on the natural logarithm of the corresponding likelihood function. In this article, we investigate the case of doing this for only some of these parameters, assuming that the remaining (so called nuisance) parameters are of no interest to us. This is to be done at a chosen level of confidence, maintaining the usual accuracy of this procedure (resulting in about 1% error for samples of size , and further decreasing with 1/n). We provide a general solution to this problem, demonstrating it by many explicit examples.

Keywords

Confidence Regions, Maximum Likelihood, Nuisance Parameters, Asymptotic Distribution

Share and Cite:

Vrbik, J. (2022) Confidence Regions with Nuisance Parameters. Open Journal of Statistics, 12, 658-675. doi: 10.4236/ojs.2022.125039.

1. Introduction

There is a basic technique (expounded in detail by M. S. Bartlett—see [1] [2] and [3] —nicely summarized in [4]) for constructing confidence regions (intervals) for parameters of a specific distribution (assuming a regular case, meaning the distribution’s support is not a function of any of the distribution’s parameters) which rests on the fact that

$2 \ln L (X; \hat{θ}) - 2 \ln L (X; θ_{0})$ (1)

has approximately the chi-square distribution with K degrees of freedom (K is the number of parameters to be estimated), where

$\ln L (X; θ) : = \sum_{i = 1}^{n} \ln f (x_{i}; θ)$ (2)

$X$ is the set of n observations, individually denoted $x_{i}$ (allowing for a possibility of a multivariate distribution), $f (x; θ)$ denotes the corresponding probability density function, $\hat{θ}$ is the vector of the resulting maximum-likelihood (ML) estimators of the parameters, and $θ_{0}$ represents their true (even though unknown) values.

The proof rests on expanding the LHS of the following K-component equation

$\sum_{i = 1}^{n} \frac{\partial \ln f (x_{i}; θ)}{\partial θ} = 0$ (3)

with respect to $θ$ at $θ_{0}$ to a linear (in $θ - θ_{0}$ ) accuracy, making the answer equal to $0$ and solving for $θ$ , thereby getting

$\hat{θ} - θ_{0} ≃ M^{- 1} Y$ (4)

where

$Y : = \frac{1}{n} \sum_{i = 1}^{n} {\frac{\partial \ln f (x_{i}; θ)}{\partial θ} |}_{θ = θ_{0}}$ (5)

and

$M : = - E ({\frac{\partial^{2} \ln f (X; θ)}{\partial θ^{2}} |}_{θ = θ_{0}}) \equiv E ({\frac{\partial \ln f (X; θ)}{\partial θ} \circ \frac{\partial \ln f (X; θ)}{\partial θ} |}_{θ = θ_{0}})$ (6)

Note that $Y$ is a K-component vector, while $M$ represents a symmetric, positive-definite K by K matrix (the small circle stands for a direct product of two vectors).

Similarly expanding (1) and utilizing (4) we get

$2 n Y^{T} (\hat{θ} - θ_{0}) - n {(\hat{θ} - θ_{0})}^{T} M (\hat{θ} - θ_{0}) + \dots = \sqrt{n} Y^{T} M^{- 1} Y \sqrt{n}$ (7)

where $Y \sqrt{n}$ has (by Central Limit Theorem), approximately, a K-variate Normal distribution with the mean of $0$ and the variance-covariance matrix of $M$ ; this implies that (7) has, to the same level of approximation, the $χ_{K}^{2}$ distribution (since the components of $M^{- 1 / 2} Y \sqrt{n}$ are then asymptotically independent, each having the mean of 0 and the variance of 1).

From all this, it then follows that an approximate confidence region is found by first finding the maximum likelihood estimators $\hat{θ}$ , then making (1) equal to a critical $χ_{K}^{2}$ value and solving (usually, only graphically) for $θ_{0}$ .
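The procedure just described can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: it uses a Normal sample (whose ML estimators have closed forms) rather than the Negative Binomial example discussed next, and all function names are hypothetical. For K = 2 the critical value is available in closed form, since the CDF of $χ_{2}^{2}$ is $1 - \exp (- x / 2)$ .

```python
import math
import random

def normal_loglik(xs, mu, sigma):
    """Log-likelihood (2) for a Normal(mu, sigma) sample."""
    n = len(xs)
    ss = sum((x - mu) ** 2 for x in xs)
    return -n * math.log(sigma) - ss / (2 * sigma ** 2) - 0.5 * n * math.log(2 * math.pi)

def normal_mle(xs):
    """ML estimators of mu and sigma (note the divisor n, not n - 1)."""
    n = len(xs)
    mu_hat = sum(xs) / n
    sigma_hat = math.sqrt(sum((x - mu_hat) ** 2 for x in xs) / n)
    return mu_hat, sigma_hat

def lr_statistic(xs, mu0, sigma0):
    """Expression (1): 2 ln L(theta_hat) - 2 ln L(theta_0)."""
    mu_hat, sigma_hat = normal_mle(xs)
    return 2 * normal_loglik(xs, mu_hat, sigma_hat) - 2 * normal_loglik(xs, mu0, sigma0)

def chi2_2_critical(conf):
    """Critical value of chi-square with 2 degrees of freedom."""
    return -2 * math.log(1 - conf)

def in_region(xs, mu0, sigma0, conf=0.90):
    """True when (mu0, sigma0) lies inside the approximate confidence region."""
    return lr_statistic(xs, mu0, sigma0) <= chi2_2_critical(conf)

random.seed(1)
sample = [random.gauss(5.0, 2.0) for _ in range(200)]
mu_hat, sigma_hat = normal_mle(sample)
print(mu_hat, sigma_hat, in_region(sample, 5.0, 2.0))
```

Plotting the boundary of the region amounts to evaluating `lr_statistic` over a grid of candidate values and drawing the contour at the critical value.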

We demonstrate this by generating a random independent sample of 200 observations from a Negative Binomial distribution (with parameters $α$ and p) and constructing the corresponding 90% confidence region for the two parameters; a simple, self-explanatory Mathematica code to do exactly that looks like this:

The resulting confidence region is displayed in Figure 1.

Similarly, to test a null hypothesis which claims specific values for the K parameters, we evaluate (1) with $θ_{0}$ being the hypothesized (rather than the true) values, and check the result against the critical value of $χ_{K}^{2}$ ; something this article will not elaborate on any further.

2. Partial Confidence Regions

The aim of this article is to show how to construct a confidence region (called partial) for only some parameters of a distribution, even though all of its parameters are unknown and need to be estimated by the maximum-likelihood technique.

We should mention that there is some existing literature on using partial likelihood functions (LF), for example [5], but its goals and results bear little resemblance to ours. Similarly, articles on marginal LF (for example [6]) and conditional LF deal with only rather specialized issues while our approach is fully general; the only shared feature is the occasional use of identical terminology.

Figure 1. 90% confidence region for α and p.

Extending the technique delineated in our Introduction, we now need to find an approximate distribution of

$2 \ln L (X; \hat{θ}) - 2 \ln L (X; {\hat{Θ}}_{0})$ (8)

where only some components of ${\hat{Θ}}_{0}$ are equal to the true values of the corresponding (we call them pivotal) parameters, while the rest (the nuisance parameters; a term introduced by [7] and further explored by [8]) are set to their $\hat{θ}$ values. Knowing this distribution will then enable us to construct confidence regions (or test hypotheses) for the pivotal parameters only while ignoring the estimates of the nuisance parameters.

Since we already have a good approximation to (1), namely (7), or equivalently

$\sqrt{n} {(\hat{θ} - θ_{0})}^{T} M (\hat{θ} - θ_{0}) \sqrt{n}$ (9)

we now need a similar approximation for

$2 \ln L (X; {\hat{Θ}}_{0}) - 2 \ln L (X; θ_{0})$ (10)

and then, for the corresponding difference. To approximate (10), we go back to the LHS of (7) and replace $\hat{θ} - θ_{0}$ by ${\hat{Θ}}_{0} - θ_{0}$ (i.e. keeping the nuisance components of $\hat{θ} - θ_{0}$ and setting the pivotal components to 0) while $Y = M (\hat{θ} - θ_{0})$ , computed from (4), remains unchanged. This results in

$\begin{array}{l} 2 n {(\hat{θ} - θ_{0})}^{T} M ({\hat{Θ}}_{0} - θ_{0}) - n {({\hat{Θ}}_{0} - θ_{0})}^{T} M ({\hat{Θ}}_{0} - θ_{0}) \\ : = n {(\hat{θ} - θ_{0})}^{T} M_{0} (\hat{θ} - θ_{0}) \end{array}$ (11)

where $M_{0}$ is the original $M$ matrix with all pivotal-by-pivotal elements set to 0 (a notation to be used with other matrices as well).

This can be shown by rearranging the parameters to start with the pivotal and be followed by the nuisance ones, and visualizing the corresponding 2 by 2 block structure of the symmetric matrix $M$ . In such representation, the previous equation reads

$n {(\hat{θ} - θ_{0})}^{T} [\begin{matrix} O & ■ \\ O & ■ \end{matrix}] (\hat{θ} - θ_{0}) + n {(\hat{θ} - θ_{0})}^{T} [\begin{matrix} O & O \\ ■ & ■ \end{matrix}] (\hat{θ} - θ_{0})$

$- n {(\hat{θ} - θ_{0})}^{T} [\begin{matrix} O & O \\ O & ■ \end{matrix}] (\hat{θ} - θ_{0}) = n {(\hat{θ} - θ_{0})}^{T} [\begin{matrix} O & ■ \\ ■ & ■ \end{matrix}] (\hat{θ} - θ_{0})$

where a full square $■$ indicates keeping the original block of the $M$ matrix, while each $O$ represents a zero sub-matrix of the corresponding dimensions.
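The identity in (11) is easy to confirm numerically. The following Python sketch does so for an arbitrary (hypothetical) 3 by 3 matrix with one pivotal and two nuisance parameters; the factor n is omitted since it multiplies both sides, and the numbers carry no statistical meaning.

```python
def quad(u, A, v):
    """u^T A v for plain Python lists."""
    return sum(u[i] * A[i][j] * v[j] for i in range(len(u)) for j in range(len(v)))

def zero_pivotal_block(M, pivotal):
    """Return M_0: a copy of M with all pivotal-by-pivotal elements set to 0."""
    size = len(M)
    return [[0.0 if (i in pivotal and j in pivotal) else M[i][j]
             for j in range(size)] for i in range(size)]

# Hypothetical symmetric, positive-definite M for K = 3 parameters.
M = [[2.0, 0.5, 0.3],
     [0.5, 1.0, 0.2],
     [0.3, 0.2, 1.5]]
pivotal = {0}                                   # first parameter pivotal, the rest nuisance
d = [0.4, -0.7, 0.25]                           # plays the role of theta_hat - theta_0
d0 = [0.0 if i in pivotal else d[i] for i in range(3)]   # Theta_hat_0 - theta_0

M0 = zero_pivotal_block(M, pivotal)
lhs = 2 * quad(d, M, d0) - quad(d0, M, d0)      # first line of (11)
rhs = quad(d, M0, d)                            # second line of (11)
print(lhs, rhs)
```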

Subtracting (11) from (9) then yields the desired approximation to (8), namely

$\sqrt{n} {(\hat{θ} - θ_{0})}^{T} (M - M_{0}) (\hat{θ} - θ_{0}) \sqrt{n}$ (12)

Introducing $U = (\hat{θ} - θ_{0}) \sqrt{n}$ , which we know from our Introduction to be approximately K-variate Normal, having zero means and the variance-covariance matrix of $M^{- 1}$ , we can now find the moment generating function (MGF) of (12) by

$\begin{array}{l} \frac{\int \dots \int_{- \infty}^{\infty} \exp (- \frac{u^{T} M u}{2} + t u^{T} (M - M_{0}) u) d u}{{(2 π)}^{K / 2} \sqrt{\det (M^{- 1})}} \\ = \sqrt{\frac{1}{\det (I - 2 t I + 2 t M_{0} M^{- 1})}} = \sqrt{\frac{1}{\det (I - 2 t I + 2 t M {(M^{- 1})}_{0})}} \end{array}$ (13)

where $I$ is the K by K identity matrix. The last equality follows from the fact that

$M_{0} M^{- 1} = [\begin{matrix} M_{p, i} {(M^{- 1})}_{i, p} & \dots \\ O & I \end{matrix}] = [\begin{matrix} I - M_{p, p} {(M^{- 1})}_{p, p} & \dots \\ O & I \end{matrix}]$ (14)

and

$M {(M^{- 1})}_{0} = [\begin{matrix} M_{p, i} {(M^{- 1})}_{i, p} & O \\ \dots & I \end{matrix}]$ (15)

make the same contribution to the determinant in (13), where $I$ is now the L by L identity matrix (L being the number of pivotal parameters). In (14), we also use the fact that

$M_{p, i} {(M^{- 1})}_{i, p} + M_{p, p} {(M^{- 1})}_{p, p} = I$ (16)

where the $p$ and $i$ subscripts refer to the corresponding pivotal and/or nuisance block of the matrix.

It is easy to see that the result of (13) does not change after replacing $M$ (the variance-covariance matrix of $\sqrt{n} Y$ ) by the corresponding correlation matrix $ℂ = D^{- 1 / 2} M D^{- 1 / 2}$ , where $D$ is the main-diagonal matrix of the corresponding variances, and correspondingly replacing $M_{0}$ by $ℂ_{0} = {(D^{- 1 / 2} M D^{- 1 / 2})}_{0} = D^{- 1 / 2} M_{0} D^{- 1 / 2}$ (recall that the 0 subscript indicates setting all pivotal-pivotal elements equal to 0), since clearly

$\det (I - 2 t I + 2 t M_{0} M^{- 1}) = \det (I - 2 t I + 2 t ℂ_{0} ℂ^{- 1})$ (17)

Similarly, we can replace the asymptotic variance-covariance matrix of $\sqrt{n} (\hat{θ} - θ_{0})$ , namely $M^{- 1}$ , by its correlation matrix $\tilde{ℂ} = {\tilde{D}}^{- 1 / 2} M^{- 1} {\tilde{D}}^{- 1 / 2}$ and $M_{0}$ by ${({\tilde{ℂ}}^{- 1})}_{0} = {\tilde{D}}^{1 / 2} M_{0} {\tilde{D}}^{1 / 2}$ without affecting the value of the determinant.

Summary

Based on (14), the MGF of (12) is given by

$\prod_{l = 1}^{L} \frac{1}{\sqrt{1 - 2 t \cdot e_{l}}}$ (18)

where $e_{1}, e_{2}, \dots, e_{L}$ are the eigenvalues of $M_{p, p} {(M^{- 1})}_{p, p}$ (note that $M$ can be replaced by $ℂ$ or $\tilde{ℂ}$ , whichever is more convenient). The resulting PDF is then that of a convolution of the individual Gamma ( $\frac{1}{2},2 e_{l}$ ) distributions.

It is important to note that, when there are more pivotal than nuisance parameters (i.e. when $L > K - L$ ), the L by L matrix $M_{p, i} {(M^{- 1})}_{i, p}$ in (16) is of rank $K - L$ only (this is now determined by the number of columns of $M_{p, i}$ ), implying that $L - (K - L) = 2 L - K$ of its eigenvalues are equal to 0, and correspondingly simplifying the eigenvalues of $M_{p, p} {(M^{- 1})}_{p, p}$ ( $2 L - K$ of which will be equal to 1). Furthermore, the remaining eigenvalues of $M_{p, i} {(M^{- 1})}_{i, p}$ are the same as those of $M_{i, p} {(M^{- 1})}_{p, i}$ ; this is based on a general result stating that, when $A$ is an n by m matrix while $B$ is m by n, $A B$ and $B A$ will share all their non-zero eigenvalues, while the extra eigenvalues, if any, will be equal to 0.

This implies that, when $L > K - L$ , we can replace the eigenvalues of $M_{p, p} {(M^{- 1})}_{p, p}$ by the eigenvalues of $M_{i, i} {(M^{- 1})}_{i, i}$ (a smaller matrix), knowing that each of the remaining $2 L - K$ eigenvalues is equal to 1.
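As a small numerical illustration of the last two paragraphs, the Python sketch below takes a hypothetical 3 by 3 matrix $M$ with two pivotal and one nuisance parameter (so that $2 L - K = 1$ ), computes the eigenvalues of $M_{p, p} {(M^{- 1})}_{p, p}$ , and compares them with 1 and with the single element of $M_{i, i} {(M^{- 1})}_{i, i}$ . The helper functions are written out only to keep the example self-contained.

```python
import math

def inverse(A):
    """Gauss-Jordan inverse with partial pivoting (small matrices only)."""
    n = len(A)
    aug = [[float(v) for v in A[i]] + [1.0 if i == j else 0.0 for j in range(n)]
           for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        piv = aug[c][c]
        aug[c] = [v / piv for v in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]

def block(A, rows, cols):
    return [[A[i][j] for j in cols] for i in rows]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def eig2(A):
    """Eigenvalues (ascending) of a 2-by-2 matrix whose eigenvalues are real."""
    half_tr = (A[0][0] + A[1][1]) / 2
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    root = math.sqrt(max(half_tr ** 2 - det, 0.0))
    return [half_tr - root, half_tr + root]

M = [[2.0, 0.5, 0.3],
     [0.5, 1.0, 0.2],
     [0.3, 0.2, 1.5]]
Minv = inverse(M)
piv, nui = [0, 1], [2]                          # two pivotal, one nuisance parameter
weights = eig2(matmul(block(M, piv, piv), block(Minv, piv, piv)))
s = M[2][2] * Minv[2][2]                        # the 1-by-1 matrix M_ii (M^-1)_ii
print(weights, s)
```

The two values in `weights` are the $e_{l}$ of (18); one of them equals 1 and the other equals `s`, in agreement with the shortcut described above.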

The final note: some steps of the resulting procedure for constructing partial confidence regions may be carried out analytically, while the rest require a numerical approach. This can be observed in our subsequent examples: some avoid explicit formulas entirely, performing each step numerically (always an available option), while our last example is almost completely analytical, to facilitate investigation of the technique’s accuracy.

3. Multivariate Normal Distribution

The most important multi-parameter distribution is the Normal distribution of several (say n) random variables $X_{1}, X_{2}, \dots, X_{n}$ , collectively denoted $X$ . It is fully specified by the following parameters: n individual means (collectively denoted $μ$ ), n standard deviations $σ$ , and $(\begin{matrix} n \\ 2 \end{matrix})$ correlation coefficients $ρ_{i j}$ ( $1 \leq i < j \leq n$ ) usually collected in a symmetric matrix $ℝ$ (each $ρ_{i j}$ will appear twice, on both sides of the main diagonal, whose elements are all equal to 1). This section is a review of basic formulas relating to ML estimation of these parameters.

The natural logarithm of the corresponding PDF (aka likelihood function) is given by

$- \frac{{(x - μ)}^{T} S^{- 1} ℝ^{- 1} S^{- 1} (x - μ)}{2} - \frac{1}{2} \ln \det ℝ - \ln \det S - \frac{n}{2} \ln (2 π)$ (19)

where $S$ is a main-diagonal matrix of the n values of $σ$ . Note that $S^{- 1} (x - μ)$ , explicitly expanded, yields the following vector $〈 \frac{x_{1} - μ_{1}}{σ_{1}}, \frac{x_{2} - μ_{2}}{σ_{2}}, \dots, \frac{x_{n} - μ_{n}}{σ_{n}} 〉$ .

To find the corresponding $M$ matrix, we use the following well-known formulas

$\frac{d ℝ^{- 1}}{d ρ_{k l}} = - ℝ^{- 1} \frac{d ℝ}{d ρ_{k l}} ℝ^{- 1}$ (20)

$\frac{d \ln \det ℝ}{d ρ_{k l}} = Tr (\frac{d ℝ}{d ρ_{k l}} ℝ^{- 1})$ (21)

where Tr indicates taking the trace of the matrix. Note that all but two elements of

$\frac{d ℝ}{d ρ_{k l}} : = ℍ^{[k l]} = v^{[k]} \circ v^{[l]} + v^{[l]} \circ v^{[k]}$ (22)

are equal to 0 (the remaining two elements are equal to 1), where $v^{[k]}$ stands for a vector of $n - 1$ zeros with only its $k^{\text{th}}$ component equal to 1.

Differentiating (19) twice with respect to $μ$ and changing the sign yields

$S^{- 1} ℝ^{- 1} S^{- 1}$ (23)

which represents the $μ$ by $μ$ block of $M$ , while the $μ$ by $σ$ and $μ$ by $ρ_{i j}$ blocks are both zero sub-matrices, since $E (x - μ) = 0$ —that goes for the $σ$ by $μ$ and $ρ_{i j}$ by $μ$ blocks as well.

To find the $σ$ by $σ$ block, we first differentiate (19) with respect to $σ_{i}$ and then with respect to $σ_{j}$ (assuming $i \neq j$ ), getting

$- \frac{{(x - μ)}_{i} {(ℝ^{- 1})}_{i j} {(x - μ)}_{j} + {(x - μ)}_{j} {(ℝ^{- 1})}_{i j} {(x - μ)}_{i}}{2 σ_{i}^{2} σ_{j}^{2}}$ (24)

Reversing the sign and taking the expected value yields the following expression for the off-diagonal elements of the $σ$ by $σ$ block

$\frac{ℝ_{i j} {(ℝ^{- 1})}_{i j}}{σ_{i} σ_{j}}$ (25)

since $E ((x_{i} - μ_{i}) (x_{j} - μ_{j})) = σ_{i} σ_{j} ℝ_{i j}$ .

When differentiating with respect to $σ_{i}$ twice, the corresponding second derivative is

$\begin{array}{l} - \sum_{j = 1}^{n} \frac{{(x - μ)}_{i} {(ℝ^{- 1})}_{i j} {(x - μ)}_{j} + {(x - μ)}_{j} {(ℝ^{- 1})}_{i j} {(x - μ)}_{i}}{σ_{i}^{3} σ_{j}} \\ - \frac{{(x - μ)}_{i} {(ℝ^{- 1})}_{i i} {(x - μ)}_{i}}{σ_{i}^{4}} + \frac{1}{σ_{i}^{2}} \end{array}$ (26)

Reversing the sign and taking the expected value then yields

$2 \sum_{j = 1}^{n} \frac{{(ℝ^{- 1})}_{i j} ℝ_{j i}}{σ_{i}^{2}} + \frac{ℝ_{i i} {(ℝ^{- 1})}_{i i}}{σ_{i}^{2}} - \frac{1}{σ_{i}^{2}} = \frac{ℝ_{i i} {(ℝ^{- 1})}_{i i}}{σ_{i}^{2}} + \frac{1}{σ_{i}^{2}}$ (27)

for the main-diagonal elements of the $σ$ by $σ$ block (note that $\sum_{j = 1}^{n} {(ℝ^{- 1})}_{i j} ℝ_{j i} = I_{i i} = 1$ ).

This means that

$\frac{ℝ_{i j} {(ℝ^{- 1})}_{i j} + I_{i j}}{σ_{i} σ_{j}}$ (28)

is the resulting expression for both the diagonal and off-diagonal elements of the $σ$ by $σ$ block.

To find the $ρ_{k l}$ - $ρ_{m p}$ element of the $M$ matrix, we first need the corresponding second derivative of (19), namely

$- \frac{2 {(x - μ)}^{T} S^{- 1} ℝ^{- 1} ℍ^{[k l]} ℝ^{- 1} ℍ^{[m p]} ℝ^{- 1} S^{- 1} (x - μ)}{2} + \frac{1}{2} Tr (ℍ^{[k l]} ℝ^{- 1} ℍ^{[m p]} ℝ^{- 1})$ (29)

Reversing the sign and taking the expected value results in

$\frac{1}{2} Tr (ℍ^{[k l]} ℝ^{- 1} ℍ^{[m p]} ℝ^{- 1}) = {(ℝ^{- 1})}_{k m} {(ℝ^{- 1})}_{l p} + {(ℝ^{- 1})}_{k p} {(ℝ^{- 1})}_{l m}$ (30)

Finally, to get the $σ_{i}$ - $ρ_{k l}$ element of $M$ , the corresponding differentiation of (19) yields

$- \frac{{(x - μ)}_{i} S^{- 1} ℝ^{- 1} ℍ^{[k l]} ℝ^{- 1} S^{- 1} (x - μ) + {(x - μ)}^{T} S^{- 1} ℝ^{- 1} ℍ^{[k l]} ℝ^{- 1} S^{- 1} {(x - μ)}_{i}}{2 σ_{i}}$ (31)

which, after changing the sign and taking the expected value, results in

$\sum_{j = 1}^{n} \frac{ℝ_{i j} {(ℝ^{- 1} ℍ^{[k l]} ℝ^{- 1})}_{j i}}{σ_{i}} = \frac{{(ℍ^{[k l]} ℝ^{- 1})}_{i i}}{σ_{i}} = \frac{{(ℝ^{- 1})}_{i k} I_{i l} + {(ℝ^{- 1})}_{i l} I_{i k}}{σ_{i}}$ (32)

having a non-zero value only when $i = k$ or $i = l$ .
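The formulas of this section can be checked against a known result. The Python sketch below assembles the ( $σ_{1}, σ_{2}, ρ$ ) block of $M$ for a bivariate Normal distribution from (28), (32) and (30), inverts it, and compares the diagonal of the inverse with the familiar asymptotic variances $σ_{1}^{2} / 2$ , $σ_{2}^{2} / 2$ and ${(1 - ρ^{2})}^{2}$ (per observation). The function names and the parameter values are illustrative assumptions only.

```python
def inverse(A):
    """Gauss-Jordan inverse with partial pivoting (small matrices only)."""
    n = len(A)
    aug = [[float(v) for v in A[i]] + [1.0 if i == j else 0.0 for j in range(n)]
           for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        piv = aug[c][c]
        aug[c] = [v / piv for v in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]

def sigma_rho_block(s1, s2, rho):
    """Block of M for (sigma_1, sigma_2, rho) of a bivariate Normal distribution."""
    R = [[1.0, rho], [rho, 1.0]]
    Rinv = inverse(R)
    sig = [s1, s2]
    ident = [[1.0, 0.0], [0.0, 1.0]]
    k, l = 0, 1                                  # indices of the single correlation rho_12
    M = [[0.0] * 3 for _ in range(3)]
    for a in range(2):
        for b in range(2):                       # sigma-by-sigma elements, formula (28)
            M[a][b] = (R[a][b] * Rinv[a][b] + ident[a][b]) / (sig[a] * sig[b])
        # sigma-by-rho elements, formula (32)
        M[a][2] = M[2][a] = (Rinv[a][k] * ident[a][l] + Rinv[a][l] * ident[a][k]) / sig[a]
    # rho-by-rho element, formula (30) with (m, p) = (k, l)
    M[2][2] = Rinv[k][k] * Rinv[l][l] + Rinv[k][l] * Rinv[l][k]
    return M

s1, s2, rho = 2.0, 3.0, 0.6
V = inverse(sigma_rho_block(s1, s2, rho))        # asymptotic variance-covariance block
print(V[0][0], V[1][1], V[2][2])
```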

4. Case of Asymptotic Independence

This section discusses the situation in which all elements of the pivotal-nuisance and (consequently) nuisance-pivotal blocks of $M$ (and, correspondingly, of $M_{0}$ and $M^{- 1}$ ) are equal to 0. In that case, $M_{0} M^{- 1}$ has the following simple form: its nuisance-nuisance block is the identity matrix, the remaining three blocks are all zero sub-matrices (implying that $I - M_{0} M^{- 1}$ of (13) has an identity matrix in the pivotal-pivotal block, and zero sub-matrices elsewhere). The determinant in (13) then equals ${(1 - 2 t)}^{L}$ , so the resulting MGF of (8) equals ${(1 - 2 t)}^{- L / 2}$ , where L is the number of pivotal parameters. This means that the asymptotic distribution of (8) is $χ_{L}^{2}$ , making the construction of the corresponding partial confidence region fairly routine.

Let us add that such asymptotic independence is not uncommon; for example, when a symmetric (with respect to its mean) distribution has location and scale parameters only, the corresponding ML estimators are always asymptotically independent.

To see why, note that, in this case, the PDF of the sampled distribution can be expressed as

$f (x) = \frac{1}{σ} \cdot f_{0} (\frac{x - μ}{σ})$ (33)

where $μ$ and $σ$ are the location and scale parameters respectively, and $f_{0} (y)$ is a parameter-free PDF, symmetric with respect to 0. The off-diagonal element of $M$ is then proportional to

$\int_{- \infty}^{\infty} {f^{'}}_{0} (y) \cdot \frac{y \cdot {f^{'}}_{0} (y) + f_{0} (y)}{f_{0} (y)} d y$ (34)

due to (6); the integrand is clearly an anti-symmetric (odd) function of y, resulting in a zero integral. This enables us to find a partial confidence interval for either $μ$ or $σ$ (while ignoring the other parameter) by using the $χ_{1}^{2}$ distribution for (8).
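For the Normal distribution this partial interval can be written down explicitly: with $σ$ fixed at its ML estimate $\hat{σ}$ , expression (8) reduces to $n {(\hat{μ} - μ_{0})}^{2} / {\hat{σ}}^{2}$ , so the interval for $μ$ is $\hat{μ} \pm \hat{σ} \sqrt{c / n}$ , where c is the critical value of $χ_{1}^{2}$ . A minimal Python sketch (hypothetical function names; the critical value is obtained as the square of a standard Normal quantile):

```python
import math
from statistics import NormalDist

def chi2_1_critical(conf):
    """Critical value of chi-square with 1 degree of freedom."""
    z = NormalDist().inv_cdf(0.5 + conf / 2)
    return z * z

def partial_ci_mu(xs, conf=0.95):
    """Partial confidence interval for mu, with sigma treated as a nuisance parameter."""
    n = len(xs)
    mu_hat = sum(xs) / n
    sigma_hat = math.sqrt(sum((x - mu_hat) ** 2 for x in xs) / n)   # ML estimate
    # With sigma set to sigma_hat, (8) equals n * (mu_hat - mu0)**2 / sigma_hat**2.
    half_width = sigma_hat * math.sqrt(chi2_1_critical(conf) / n)
    return mu_hat - half_width, mu_hat + half_width

print(partial_ci_mu([1.0, 2.0, 3.0, 4.0, 5.0]))
```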

Similarly, in the case of a bivariate Normal distribution, the ML estimators of the two means on one hand, and of the two standard deviations and the correlation coefficient on the other, form two such mutually independent sets as well; this is clear from what we learned in the previous section but, due to the importance of this example, let us be more explicit and quote the well-known asymptotic correlation matrix of $\sqrt{n} (\hat{θ} - θ_{0})$ , namely

$\tilde{ℂ} = [\begin{matrix} 1 & ρ & 0 & 0 & 0 \\ ρ & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & ρ^{2} & \frac{ρ}{\sqrt{2}} \\ 0 & 0 & ρ^{2} & 1 & \frac{ρ}{\sqrt{2}} \\ 0 & 0 & \frac{ρ}{\sqrt{2}} & \frac{ρ}{\sqrt{2}} & 1 \end{matrix}]$ (35)

where the five parameters (collectively denoted $θ$ ) are the usual $μ_{1}, μ_{2}, σ_{1}, σ_{2}$ and $ρ$ (in that order); the respective asymptotic variances are $σ_{1}^{2} / n, σ_{2}^{2} / n, σ_{1}^{2} / (2 n), σ_{2}^{2} / (2 n)$ and ${(1 - ρ^{2})}^{2} / n$ .

Constructing a confidence region for either set of parameters is then quite simple (knowing that $\tilde{ℂ}$ has the above block-diagonal structure is all we need, the individual elements are irrelevant) as the following Mathematica program demonstrates.

The program produces the output presented in Figure 2 and Figure 3; the first graph is the resulting partial 95% confidence region for $μ_{1}$ and $μ_{2}$ , the second one is the 90% confidence region for $σ_{1}, σ_{2}$ and $ρ$ . This assumes that one is interested only in one or the other (not both)—if a confidence region for all five parameters is desired, one would use the basic procedure of our Introduction.

5. General Case

In general (and with no asymptotic independence to help) things get more complicated. We now go over all possible cases involving up to five parameters.

Figure 2. Confidence region for μ₁ and μ₂.

Figure 3. Confidence region for σ₁, σ₂ and ρ.

5.1. Two Parameters

In this situation, the only possibility is constructing a confidence interval for one of the two parameters, while ignoring the other. Since the $ℂ$ (or $\tilde{ℂ}$ ) matrix will always have the following form

$ℂ = [\begin{matrix} 1 & ρ \\ ρ & 1 \end{matrix}]$ (36)

we get

$ℂ_{1,1} {(ℂ^{- 1})}_{1,1} = \frac{1}{1 - ρ^{2}}$ (37)

which implies that the distribution of (8) is that of a $χ_{1}^{2}$ random variable, further divided by $1 - ρ^{2}$ . This means that, to construct a $1 - α$ confidence interval (CI) for the pivotal parameter, we make (8) equal to the corresponding critical value of $χ_{1}^{2}$ , also divided by $1 - ρ^{2}$ , substitute the ML estimate of the nuisance parameter, and solve for the pivotal one to get the two CI boundaries.

The following Mathematica program demonstrates the construction of a 90% confidence interval for the $α$ parameter of the Gamma ( $α, β$ ) distribution.

Note that in this case $ρ = 1 / \sqrt{α ψ_{1}}$ , based on

$M = [\begin{matrix} ψ_{1} & \frac{1}{β} \\ \frac{1}{β} & \frac{α}{β^{2}} \end{matrix}]$ (38)

where $ψ_{1}$ is the second derivative of $\ln Γ (α)$ , called “PolyGamma[1, α]” by Mathematica; to evaluate it, we had to use the ML estimate of $α$ , instead of its true value. That is how we deal with this problem in general; this does not change the asymptotic distribution of (8).
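The scale factor $1 / (1 - ρ^{2})$ for this Gamma example can be computed as follows. This Python sketch is not the Mathematica program referred to above; it only evaluates the factor implied by (38). Since the Python standard library has no trigamma function, a simple recurrence plus asymptotic series is used, which is an implementation choice made here for self-containment.

```python
import math

def trigamma(x):
    """psi_1(x) for x > 0: upward recurrence followed by an asymptotic series."""
    total = 0.0
    while x < 6.0:
        total += 1.0 / (x * x)        # psi_1(x) = psi_1(x + 1) + 1 / x^2
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    series = 1 + inv / 2 + inv2 * (1 / 6 - inv2 * (1 / 30 - inv2 * (1 / 42 - inv2 / 30)))
    return total + inv * series

def gamma_scale_factor(alpha_hat):
    """1 / (1 - rho^2), with rho = 1 / sqrt(alpha * psi_1(alpha)) as implied by (38)."""
    rho_squared = 1.0 / (alpha_hat * trigamma(alpha_hat))
    return 1.0 / (1.0 - rho_squared)

# Hypothetical ML estimate of alpha; the chi-square(1) critical value is
# multiplied by this factor before (8) is solved for the CI boundaries.
print(gamma_scale_factor(3.0))
```

Note that the factor does not depend on $β$ , and that it always exceeds 1, so ignoring it would produce an interval that is too short.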

5.2. Three Parameters

A three-parameter situation requires discussing two possibilities:

When constructing a CI for a single parameter (say $θ_{1}$ ), (8) has again the $χ_{1}^{2} \cdot {(ℂ^{- 1})}_{1,1} \equiv χ_{1}^{2} \cdot {({\tilde{ℂ}}^{- 1})}_{1,1}$ distribution; this follows from $ℂ_{1,1} {(ℂ^{- 1})}_{1,1}$ having only one element (its only eigenvalue), and from $ℂ_{1,1} \equiv 1$ . Note that this result (when interested in only one parameter) is true for any K.

When finding a confidence region for $θ_{2}$ and $θ_{3}$ , the distribution of (8) is a convolution (i.e. an independent sum) of $χ_{1}^{2}$ and another $χ_{1}^{2}$ , the latter multiplied by $s : = {(ℂ^{- 1})}_{1,1}$ ; this is based on the arguments following (18). The corresponding PDF is

$\frac{1}{2 \sqrt{s}} \exp (- \frac{s + 1}{4 s} \cdot y) I_{0} (\frac{s - 1}{4 s} \cdot y)$ (39)

where $I_{0}$ denotes the modified Bessel function of the first kind; note that using this PDF to find a critical value can be done only numerically (there is no analytic expression for the corresponding CDF, let alone for its inverse).
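One possible way of finding such a critical value numerically is sketched below in Python: $I_{0}$ is evaluated from its power series, the PDF (39) is integrated by Simpson's rule, and the resulting CDF is inverted by bisection. The function names, the number of integration steps and the bisection bracket are all assumptions of this sketch, not part of the original procedure.

```python
import math

def bessel_i0(x):
    """Modified Bessel function of the first kind, order 0, via its power series."""
    q = (x / 2.0) ** 2
    term, total, k = 1.0, 1.0, 0
    while term > 1e-17 * total:
        k += 1
        term *= q / (k * k)
        total += term
    return total

def pdf39(y, s):
    """PDF (39) of chi2_1 + s * chi2_1 (independent terms)."""
    return (math.exp(-(s + 1) * y / (4 * s)) * bessel_i0((s - 1) * y / (4 * s))
            / (2 * math.sqrt(s)))

def cdf39(y, s, steps=1000):
    """Composite Simpson rule on [0, y]; the PDF is finite at 0."""
    h = y / steps
    acc = pdf39(0.0, s) + pdf39(y, s)
    for j in range(1, steps):
        acc += (4 if j % 2 else 2) * pdf39(j * h, s)
    return acc * h / 3

def critical39(s, conf=0.95):
    """Solve cdf39(y) = conf for y by bisection."""
    lo, hi = 0.0, 60.0 * max(1.0, s)
    for _ in range(50):
        mid = (lo + hi) / 2
        if cdf39(mid, s) < conf:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

print(critical39(1.0), critical39(2.0))
```

When $s = 1$ the distribution reduces to $χ_{2}^{2}$ , which provides a convenient check of the routine.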

As an example, we assume sampling a distribution with the following PDF

$f (x; α, β, γ) = \frac{γ x^{α - 1} \exp (- {(\frac{x}{β})}^{γ})}{β^{α} Γ (\frac{α}{γ})} \quad \text{when } x > 0$ (40)

(each of the three parameters must be positive) and constructing a 95% confidence region for $β$ and $γ$ only. Leaving out routine details, the expression for s turns out to be

$s = 1 + \frac{α}{γ} + \frac{α^{2} ψ_{1} - (α + γ) γ}{γ^{2} + ψ_{1} (γ^{2} - α^{2} ψ_{1})}$ (41)

where $ψ_{1}$ is the second derivative of $\ln Γ$ , evaluated at $\frac{α}{γ}$ .

The following Mathematica code demonstrates the algorithm.

The confidence region for $β$ and $γ$ thus produced is displayed in Figure 4.

Figure 4. Confidence region for β and γ.

5.3. Four Parameters

A CI for $θ_{1}$ results in (8) having the $χ_{1}^{2} \cdot {(ℂ^{- 1})}_{1,1}$ distribution, as discussed previously.

The complementary task of constructing a confidence region for $θ_{2}, θ_{3}$ and $θ_{4}$ then leads to a convolution of the previous distribution and that of $χ_{2}^{2}$ (to account for the two extra eigenvalues of $I - ℂ_{p, 1} {(ℂ^{- 1})}_{1, p}$ , both equal to 1); this convolution has a PDF given by

$\frac{1}{2 \sqrt{1 - s}} \exp (- \frac{y}{2}) erf (\sqrt{\frac{(1 - s) \cdot y}{2 s}})$ (42)

Finally, to build a confidence region for $θ_{1}$ and $θ_{2}$ requires using the following PDF for (8):

$\frac{\sqrt{t_{1} t_{2}}}{2} \exp (- \frac{t_{1} + t_{2}}{4} \cdot y) I_{0} (\frac{t_{1} - t_{2}}{4} \cdot y)$ (43)

where $t_{1}$ and $t_{2}$ are the reciprocals of the two eigenvalues of $M_{p, p} {(M^{- 1})}_{p, p}$ (this makes (43) consistent with (18); it reduces to (39) when $t_{1} = 1 / s$ and $t_{2} = 1$ ); using $ℂ$ instead of $M$ is still possible, but no longer has any advantage, since critical values of (42) and (43) can again be found only numerically.

To show how to use the last formula, we assume sampling a mixture of Exponential and Normal distributions (the four parameters are: the mean of the Exponential distribution, denoted b, and its weight in the mixture, denoted a, followed by the mean c and standard deviation d of the Normal distribution). Note that in this example we also bypass an analytic solution for elements of the $M$ matrix; these can be easily computed by substituting values of the ML estimates for $θ_{0}$ in (6), and only then computing, by numerical integration, the corresponding expected value. The following Mathematica program demonstrates the complete algorithm.

This produces the 90% confidence region displayed in Figure 5 for the mean (horizontal scale) and standard deviation (vertical scale) of the Normal-distribution part of the mixture.

Figure 5. Confidence region for c and d.

5.4. Five Parameters

As in all previous cases, a confidence interval for only one (say $θ_{1}$ ) of the parameters requires using the $χ_{1}^{2}$ distribution, further multiplied by $s : = {(ℂ^{- 1})}_{1,1}$ .

For the complementary task of building a (four-dimensional) confidence region for $θ_{2}, θ_{3}, θ_{4}$ and $θ_{5}$ , the distribution of (8) becomes a convolution of $χ_{1}^{2} \cdot s$ and $χ_{3}^{2}$ ; the corresponding PDF is

$\frac{y}{4 \sqrt{s}} \exp (- \frac{s + 1}{4 s} \cdot y) [I_{0} (\frac{s - 1}{4 s} \cdot y) + I_{1} (\frac{1 - s}{4 s} \cdot y)]$ (44)

A confidence region for $θ_{1}$ and $θ_{2}$ is found using the PDF of (43), found in the case of four parameters.

The PDF used for the three-dimensional confidence region of the true values of $θ_{3}, θ_{4}$ and $θ_{5}$ is then a convolution of the last PDF and that of an independent $χ_{1}^{2}$ ; no explicit formula for the resulting PDF exists, but the following example demonstrates how to bypass this problem.

This time, we assume sampling a tri-variate Normal distribution having identical means (denoted $μ$ ) and identical standard deviations (denoted $σ$ ). We now construct a three-dimensional confidence region for $μ, σ$ and $ρ_{12}$ (while ignoring $ρ_{13}$ and $ρ_{23}$ ), using the following Mathematica code.

The resulting confidence region is shown in Figure 6.

Figure 6. Confidence region for μ, σ and ρ₁₂.
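Whenever no explicit PDF is available, one generic option (not necessarily the route taken by the program referred to above) is to approximate the critical value by simulation: according to (18), expression (8) is asymptotically distributed as $\sum_{l} e_{l} Z_{l}^{2}$ with independent standard Normal $Z_{l}$ . The Python sketch below draws from this distribution and returns an empirical quantile; the function name, the number of draws and the seed are arbitrary assumptions.

```python
import random

def simulated_critical(weights, conf=0.95, draws=100_000, seed=0):
    """Empirical conf-quantile of sum_l e_l * Z_l^2 for independent N(0, 1) Z_l."""
    rng = random.Random(seed)
    values = sorted(sum(e * rng.gauss(0.0, 1.0) ** 2 for e in weights)
                    for _ in range(draws))
    index = min(draws - 1, int(conf * draws))
    return values[index]

# Two unit weights reproduce chi-square with 2 degrees of freedom,
# whose exact 95% critical value is about 5.99.
print(simulated_critical([1.0, 1.0]))
# Hypothetical weights for three pivotal parameters.
print(simulated_critical([1.0, 1.0, 1.5]))
```

The simulation error of the quantile decreases with the square root of the number of draws, so this approach trades analytic precision for generality.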

6. Technique’s Accuracy

In this section, we go over a rather special example, allowing us to analytically find not only all ML estimators, but also an exact expression for (8). These are then used to build (still analytically) formulas for boundaries of a confidence interval for one of the parameters. Furthermore, the exact distribution of the estimator is also known, which gives us a unique opportunity to compute the error of our technique.

Let us now proceed with the actual example of constructing a confidence interval for the correlation coefficient of a bivariate Normal distribution, and computing its exact level of confidence (while ignoring the remaining four parameters).

Firstly, it is well known what the ML estimators of the two means ( $μ_{x}$ and $μ_{y}$ ), the two standard deviations ( $σ_{x}$ and $σ_{y}$ ), and the correlation coefficient $ρ$ are; we will denote them $\bar{X}, \bar{Y}, S_{x}, S_{y}$ and r respectively. The likelihood function, once we replace all five parameters by their estimators, then reads

$\begin{array}{l} - \frac{1}{2 (1 - r^{2})} \sum_{i = 1}^{n} (\frac{{(X_{i} - \bar{X})}^{2}}{S_{x}^{2}} + \frac{{(Y_{i} - \bar{Y})}^{2}}{S_{y}^{2}} - 2 r \frac{(X_{i} - \bar{X}) (Y_{i} - \bar{Y})}{S_{x} S_{y}}) \\ - \frac{n}{2} \ln (1 - r^{2}) - n \ln S_{x} - n \ln S_{y} = - n - \frac{n}{2} \ln (1 - r^{2}) - n \ln S_{x} - n \ln S_{y} \end{array}$ (45)

When similarly replacing the first four parameters only (keeping the exact value of $ρ$ ), the likelihood function becomes

$\begin{array}{l} - \frac{1}{2 (1 - ρ^{2})} \sum_{i = 1}^{n} (\frac{{(X_{i} - \bar{X})}^{2}}{S_{x}^{2}} + \frac{{(Y_{i} - \bar{Y})}^{2}}{S_{y}^{2}} - 2 ρ \frac{(X_{i} - \bar{X}) (Y_{i} - \bar{Y})}{S_{x} S_{y}}) \\ - \frac{n}{2} \ln (1 - ρ^{2}) - n \ln S_{x} - n \ln S_{y} = - n \frac{1 - ρ r}{1 - ρ^{2}} - \frac{n}{2} \ln (1 - ρ^{2}) - n \ln S_{x} - n \ln S_{y} \end{array}$ (46)

making (8) equal to

$- 2 n \frac{ρ r - ρ^{2}}{1 - ρ^{2}} - n \ln \frac{1 - r^{2}}{1 - ρ^{2}}$ (47)

It is easy to find that ${({\tilde{ℂ}}^{- 1})}_{5,5} = 1 + ρ^{2}$ , where $\tilde{ℂ}$ is the matrix of (35); this means that the random variable (47) has, approximately, the $χ_{1}^{2}$ distribution further multiplied by $1 + ρ^{2}$ . The event that (47) is less than the critical value of this distribution (say $C_{α}$ ) thus has a probability of approximately $1 - α$ . Since the sampling distribution of r is known, we can also compute the exact probability of the same inequality, thus getting the error of the approximation. This is done by the following Mathematica program (for any chosen set of values for $ρ, n$ and $α$ )

where the first two lines (a continuation of a single Mathematica statement) spell out the exact PDF of r. The program then specifies $ρ, n$ and $α$ , makes (47) equal to the critical value $C_{α}$ , and solves for r (note that this is the reverse of what is done when finding boundaries of a confidence interval for $ρ$ , given a value of r); the last line then evaluates the corresponding exact probability. Note that the PDF of r is notoriously slow in reaching its Normal limit (on which our approximation is based), making the errors of confidence intervals for $ρ$ atypically large; this example is thus close to presenting the worst-case scenario.
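The more common direction, finding the confidence interval for $ρ$ from an observed r, can be sketched in Python as follows. Expression (47) is zero at $ρ = r$ and increases monotonically on either side, so each boundary is found by bisection. The sketch assumes that the factor $1 + ρ^{2}$ is evaluated at the ML estimate r (in line with the convention of Section 5.1); the function names are hypothetical.

```python
import math
from statistics import NormalDist

def stat47(r, rho, n):
    """Expression (47)."""
    return (-2 * n * (rho * r - rho ** 2) / (1 - rho ** 2)
            - n * math.log((1 - r ** 2) / (1 - rho ** 2)))

def rho_interval(r, n, conf=0.95):
    """Approximate partial confidence interval for rho, given r and n."""
    z = NormalDist().inv_cdf(0.5 + conf / 2)
    crit = z * z * (1 + r ** 2)          # assumption: scale factor evaluated at r

    def solve(lo, hi, increasing):
        # Bisection for stat47 = crit on an interval where stat47 is monotone.
        for _ in range(200):
            mid = (lo + hi) / 2
            if (stat47(r, mid, n) < crit) == increasing:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    eps = 1e-12
    lower = solve(-1 + eps, r, increasing=False)
    upper = solve(r, 1 - eps, increasing=True)
    return lower, upper

print(rho_interval(0.5, 50))
```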

By executing the program using various values of $ρ, n$ and $1 - α$ , we get errors (in percent) presented in Table 1 (these are quoted for $n = 10, 30$ and 100 respectively).

Based on these results, we can make the following observations:

• The exact confidence level is always less than the claimed one.

• The error does not change much with the value of $ρ$ (surprisingly, the error is usually the largest at $ρ = 0$ and decreases slightly towards both extremes).

• It decreases as the confidence level goes up, but increases relative to $α$ .

• It decreases, to a good approximation, with 1/n.

Table 1. List of the technique’s errors (in %).

7. Conclusion

We have shown how to construct confidence regions for parameters of interest while ignoring one or more additional (nuisance) parameters. This is done based on ML estimates of all parameters, utilizing the corresponding likelihood function and formulas of this article. The error of the resulting procedure behaves similarly (i.e. decreasing with the first power of the sample size) to the error of ordinary confidence regions based on the $χ^{2}$ distribution of (1). Our explicit examples have covered situations involving up to five parameters in total, but a general approach for dealing with any number of parameters has been clearly delineated as well. Future research will undoubtedly supply specific details of any such multi-parameter situation, and come up with a way of simplifying, at least numerically, the resulting distributions.

Appendix: Notation and Abbreviations

• $L (X; θ)$ : the Likelihood Function; $X$ is the set of observations, $θ$ are the distribution parameters.

• LF: likelihood function.

• ML: maximum likelihood.

• LHS: left hand side.

• MGF: moment generating function.

• PDF: probability density function.

• $χ_{K}^{2}$ : the chi-square distribution with K degrees of freedom.

• $E$ : a symbol for taking expected value.

• $I$ and $O$ : identity and zero matrix, respectively.

• $M$ : the variance-covariance matrix of multivariate Normal distribution.

• $M_{0}$ : the previous matrix with all pivotal by pivotal elements set to 0.

• $M_{p, i}$ : the pivotal by nuisance (or incidental) block of $M$ .

• $v^{[k]}$ : a vector with its $k^{\text{th}}$ component equal to 1, the rest equal to 0.

• $ℍ^{[k l]}$ : a matrix with ${(k, l)}^{th}$ and ${(l, k)}^{th}$ components equal to 1, the rest equal to 0.

Conflicts of Interest

The author declares no conflicts of interest.

References

[1]	Bartlett, M.S. (1953) Approximate Confidence Intervals. Biometrika, 40, 12-19. https://doi.org/10.1093/biomet/40.1-2.12
[2]	Bartlett, M.S. (1953) Approximate Confidence Intervals II. More Than One Unknown Parameter. Biometrika, 40, 306-317. https://doi.org/10.1093/biomet/40.3-4.306
[3]	Bartlett, M.S. (1955) Approximate Confidence Intervals III. A Bias Correction. Biometrika, 42, 201-204. https://doi.org/10.1093/biomet/42.1-2.201
[4]	Kendall, M.G. and Stuart, A. (1969) The Advanced Theory of Statistics, Vol. 1, Chapter 3, Hafner Publishing Company, New York.
[5]	Cox, D.R. (1975) Partial Likelihood. Biometrika, 62, 269-276. https://doi.org/10.1093/biomet/62.2.269
[6]	Kalbfleisch, J.D. and Sprott, D.A. (1973) Marginal and Conditional Likelihoods. Sankhyā: The Indian Journal of Statistics, Series A, 35, 311-328.
[7]	Hotelling, H. (1940) The Selection of Variates for Use in Prediction with Some Comments on the General Problem of Nuisance Parameters. The Annals of Mathematical Statistics, 11, 271-283. https://doi.org/10.1214/aoms/1177731867
[8]	Basu, D. (1977) On the Elimination of Nuisance Parameters. Journal of the American Statistical Association, 72, 355-366. https://doi.org/10.1080/01621459.1977.10481002
