Applied Mathematics
Vol. 12, No. 3 (2021), Article ID: 107768, 14 pages
10.4236/am.2021.123011
Tuning of Prior Covariance in Generalized Least Squares
William Menke
Lamont-Doherty Earth Observatory of Columbia University, New York, USA
Copyright © 2021 by author(s) and Scientific Research Publishing Inc.
This work is licensed under the Creative Commons Attribution International License (CC BY 4.0).
http://creativecommons.org/licenses/by/4.0/
Received: February 5, 2021; Accepted: March 14, 2021; Published: March 17, 2021
ABSTRACT
Generalized Least Squares (least squares with prior information) requires the correct assignment of two prior covariance matrices: one associated with the uncertainty of measurements; the other with the uncertainty of prior information. These assignments often are very subjective, especially when correlations among data or among prior information are believed to occur. However, in cases in which the general form of these matrices can be anticipated up to a set of poorly-known parameters, the data and prior information may be used to better-determine (or “tune”) the parameters in a manner that is faithful to the underlying Bayesian foundation of GLS. We identify an objective function, the minimization of which leads to the best-estimate of the parameters, and provide explicit and computationally-efficient formulas for calculating the derivatives needed to implement the minimization with a gradient descent method. Furthermore, the problem is organized so that the minimization need be performed only over the space of covariance parameters, and not over the combined space of model and covariance parameters. We show that the use of trade-off curves to select the relative weight given to observations and prior information is not a form of tuning, because it does not, in general, maximize the posterior probability of the model parameters, and can lead to a different weighting than the procedure described here. We also provide several examples that demonstrate the viability of the method, and discuss both its advantages and limitations.
Keywords:
Bayesian Inference, Covariance, Error, Generalized Least Squares, Gradient Descent, Interpolation, Regularization, Trade-Off Curve, Variance
1. Introduction
Generalized Least Squares (GLS, also called least squares with prior information) is a tool for statistical inference [1] - [6] that is widely used in geotomography [7] - [12] and geophysical inversion [13] [14], as well as in other areas of the physical sciences and engineering. One of the attractive features of GLS that makes it especially useful in the imaging of multidimensional fields (for example, density, velocity, viscosity) is its ability to implement, in a natural and versatile way, prior information on the behavior of the field. Widely-used types of prior information include the field being smooth, as quantified by its low-order derivatives [15], having a specified power spectral density or autocovariance [7] [15], and satisfying a specified partial differential equation (such as the geostrophic flow equation [16] or the diffusion equation [4] ). The word “regularization” sometimes is used to describe the effect of prior information on the solution process [17].
We review the Generalized Least Squares (GLS) method here, following the notation in [6], in order to provide context and to establish nomenclature. In GLS, observations (or data) and prior information (or inferences) are combined to arrive at a best-estimate of initially-unknown model parameters (which might, for example, represent a field sampled on a regular grid). The data are assumed to satisfy the linear equation $\mathbf{d} = \mathbf{G}\mathbf{m}$, where $\mathbf{d}$ is a length-$N$ vector of data, $\mathbf{m}$ is a length-$M$ vector of model parameters, and $\mathbf{G}$ is a known $N \times M$ “kernel” matrix associated with the data. Prior information is assumed to satisfy a linear equation $\mathbf{h} = \mathbf{H}\mathbf{m}$, where $\mathbf{h}$ is a length-$K$ vector of prior values and $\mathbf{H}$ is a $K \times M$ kernel matrix associated with the prior information. GLS problems are assumed to be over-determined, with $N + K \geq M$. For observed data $\mathbf{d}^{\mathrm{obs}}$, known prior information $\mathbf{h}^{\mathrm{pri}}$ and a specified model $\mathbf{m}$, the prediction error is $\mathbf{e} = \mathbf{d}^{\mathrm{obs}} - \mathbf{G}\mathbf{m}$ and the prior information error is $\mathbf{l} = \mathbf{h}^{\mathrm{pri}} - \mathbf{H}\mathbf{m}$. These errors are assumed to be Normally-distributed with zero mean and prior covariance $\mathbf{C}_d$ and $\mathbf{C}_h$, respectively. Then, the normalized errors $\mathbf{C}_d^{-1/2}\mathbf{e}$ and $\mathbf{C}_h^{-1/2}\mathbf{l}$ are independent and identically-distributed Normal random variables with zero mean and unit variance. Bayes theorem can be used to show that the best estimate $\mathbf{m}^{\mathrm{est}}$ of the solution is the one that minimizes the generalized error $\Phi(\mathbf{m}) = E + L$, with $E = \mathbf{e}^{\mathrm{T}}\mathbf{C}_d^{-1}\mathbf{e}$ and $L = \mathbf{l}^{\mathrm{T}}\mathbf{C}_h^{-1}\mathbf{l}$ [1] [2] [5]. The solution can be expressed in a variety of equivalent forms, among which is the widely-used version [6]:
$\mathbf{m}^{\mathrm{est}} = \left[ \mathbf{G}^{\mathrm{T}} \mathbf{C}_d^{-1} \mathbf{G} + \mathbf{H}^{\mathrm{T}} \mathbf{C}_h^{-1} \mathbf{H} \right]^{-1} \left[ \mathbf{G}^{\mathrm{T}} \mathbf{C}_d^{-1} \mathbf{d}^{\mathrm{obs}} + \mathbf{H}^{\mathrm{T}} \mathbf{C}_h^{-1} \mathbf{h}^{\mathrm{pri}} \right]$ (1)
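For readers who wish to experiment with (1), a minimal numerical sketch is given below. It is not part of the original derivation; the arrays G, H, Cd, Ch, dobs and hpri are generic stand-ins for the quantities defined above, and linear solves are used in place of explicit matrix inverses.

```python
import numpy as np

def gls_solution(G, Cd, dobs, H, Ch, hpri):
    """Sketch of the GLS estimate (1):
    m = [G^T Cd^-1 G + H^T Ch^-1 H]^-1 [G^T Cd^-1 dobs + H^T Ch^-1 hpri]."""
    A = G.T @ np.linalg.solve(Cd, G) + H.T @ np.linalg.solve(Ch, H)
    b = G.T @ np.linalg.solve(Cd, dobs) + H.T @ np.linalg.solve(Ch, hpri)
    return np.linalg.solve(A, b)
```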
The assumption of linear kernels $\mathbf{G}$ and $\mathbf{H}$ is a very restrictive one. In the well-studied nonlinear generalization [1] [6], the products $\mathbf{G}\mathbf{m}$ and $\mathbf{H}\mathbf{m}$ are replaced with vector functions $\mathbf{g}(\mathbf{m})$ and $\mathbf{h}(\mathbf{m})$. Then, a common solution method is to linearize the data and prior information equations around a trial solution $\mathbf{m}^{(k)}$:
$\mathbf{G}^{(k)}\,\Delta\mathbf{m} \approx \mathbf{d}^{\mathrm{obs}} - \mathbf{g}\left(\mathbf{m}^{(k)}\right) \quad \text{and} \quad \mathbf{H}^{(k)}\,\Delta\mathbf{m} \approx \mathbf{h}^{\mathrm{pri}} - \mathbf{h}\left(\mathbf{m}^{(k)}\right)$ (2)
with $\Delta\mathbf{m} \equiv \mathbf{m} - \mathbf{m}^{(k)}$, $\left[\mathbf{G}^{(k)}\right]_{ij} = \partial g_i / \partial m_j$ and $\left[\mathbf{H}^{(k)}\right]_{ij} = \partial h_i / \partial m_j$, evaluated at $\mathbf{m}^{(k)}$. The solution is then found by iterative application of (1) to (2); that is, by the Gauss-Newton method [3]. Alternatively, a gradient-descent method [18] can be used that employs:
$\frac{\partial \Phi}{\partial \mathbf{m}} = -2\left[\mathbf{G}^{(k)}\right]^{\mathrm{T}}\mathbf{C}_d^{-1}\left[\mathbf{d}^{\mathrm{obs}} - \mathbf{g}\left(\mathbf{m}^{(k)}\right)\right] - 2\left[\mathbf{H}^{(k)}\right]^{\mathrm{T}}\mathbf{C}_h^{-1}\left[\mathbf{h}^{\mathrm{pri}} - \mathbf{h}\left(\mathbf{m}^{(k)}\right)\right]$ (3)
The latter approach is preferred for very large M, since the convergence rate of gradient descent is independent of its dimension [18], whereas the effort required to solve the $M \times M$ system (1) by a direct method scales as $M^3$ [19].
We now discuss issues related to the covariance matrices that appear in GLS. The data covariance $\mathbf{C}_d$ quantifies the uncertainty of the observations and the information covariance $\mathbf{C}_h$ quantifies the uncertainty of the prior information. Prior knowledge of the inherent accuracy of the measurement technique is needed to assign $\mathbf{C}_d$, and prior knowledge of the physically-plausible solutions, perhaps stemming from an understanding of the underlying physics, is needed to assign $\mathbf{C}_h$. These assignments are often very subjective, especially when correlations are believed to occur (that is, when $\mathbf{C}_d$ and $\mathbf{C}_h$ have non-zero off-diagonal elements). For example, one geotomographic study [7] reconstructs a two-dimensional field using a $\mathbf{C}_h$ that represents the autocovariance of the field and that depends upon a scale length q. The value of q is chosen on the basis of broad physical arguments that, while plausible, leave considerable room for subjectivity.
The matrices $\mathbf{C}_d$ and $\mathbf{C}_h$ together contain $\tfrac{1}{2}N(N+1) + \tfrac{1}{2}K(K+1)$ independent elements, many more than the $N + K$ constraints imposed by the data $\mathbf{d}^{\mathrm{obs}}$ and prior information $\mathbf{h}^{\mathrm{pri}}$. Consequently, insufficient information is available to uniquely solve for all the elements of $\mathbf{C}_d$ and $\mathbf{C}_h$. However, it sometimes may be possible to parameterize $\mathbf{C}_d$ and/or $\mathbf{C}_h$ in terms of a small number J of parameters $\mathbf{q}$, and to ask whether an initial estimate $\mathbf{q}^{(0)}$ can be improved. As long as $J \ll N + K$, adequate information may be available to determine a best estimate $\mathbf{q}^{\mathrm{est}}$. We refer to the process of determining $\mathbf{q}^{\mathrm{est}}$ as “tuning”, since in typical practice it requires that the starting covariances already be close to their true values.
As an example of a parametrized covariance, we consider the case where the model parameters $\mathbf{m}$ represent a sampled version of a continuous function $m(x)$, where x is an independent variable; that is, $m_i = m(x_i)$, with $x_i = (i - 1)\,\Delta x$ and $\Delta x$ the sampling interval. The prior information that $m(x)$ is approximately oscillatory with wavenumber q can be modeled by:
$\mathbf{h}^{\mathrm{pri}} = \mathbf{0}, \quad \mathbf{H} = \mathbf{I}, \quad \left[\mathbf{C}_h\right]_{ij} = \sigma_h^2 \exp\left(-\frac{\left|x_i - x_j\right|}{L}\right) \cos\left(q\left(x_i - x_j\right)\right)$ (4)
In this case, $\mathbf{C}_h$ approximates the autocovariance of $m(x)$, which is assumed to be stationary. The goal of tuning is to provide a best-estimate $q^{\mathrm{est}}$ of the wavenumber, as well as best estimates of the model parameters. This problem is further developed in Example 4, below.
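A sketch of how such a parameterized covariance and its q-derivative might be assembled is given below; the damped-cosine form mirrors the reconstruction of (4) above, and the damping length L and amplitude sigma_h are illustrative assumptions.

```python
import numpy as np

def oscillatory_prior_cov(x, q, sigma_h=1.0, L=10.0):
    """Stationary, approximately oscillatory autocovariance (cf. (4)) and its
    analytic derivative with respect to the wavenumber q."""
    dx = x[:, None] - x[None, :]                 # matrix of x_i - x_j
    damp = sigma_h**2 * np.exp(-np.abs(dx) / L)  # damping keeps Ch positive-definite
    Ch = damp * np.cos(q * dx)                   # prior covariance Ch(q)
    dCh_dq = -damp * dx * np.sin(q * dx)         # dCh/dq, used in the tuning derivatives
    return Ch, dCh_dq
```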
Although the GLS formulation is widely used in geotomography and geophysical imaging, the tuning of variance is typically implemented in a very limited fashion, through the use of trade-off curves [7] - [12]. In this procedure, a scalar parameter q controls the relative size of $\mathbf{C}_d$ and $\mathbf{C}_h$; that is, $\mathbf{C}_h = q\,\mathbf{C}_{h0}$, where $\mathbf{C}_{h0}$ is specified [20]. The GLS problem is then solved for a suite of qs, the functions $E(q)$ and $L(q)$ are tabulated, and the resulting trade-off curve is used to identify a solution that has acceptably low E and L (for example, Figure 1 of [20] ). As we will show below, this ad hoc procedure is not a consistent extension of GLS, because it results in a different q than the one implied by Bayes’ principle. A more consistent approach is to apply Bayes theorem directly to estimate both the model parameters $\mathbf{m}$ and the covariance parameters $\mathbf{q}$. Such an approach has been implemented in the context of ordinary least squares [21] and the Markov chain Monte Carlo (MCMC) inversion method [22] (which is a computationally-intensive alternative to GLS). An important and novel result of this paper is a computationally-efficient procedure for tuning GLS in a Bayes-consistent manner.
2. Bayesian Extension of GLS
The general process of using Bayes’ theorem to construct a posterior probability density function (p.d.f.) that depends on unknown parameters and of estimating those parameters through the maximization of probability is very well understood [23]. In the current case, the p.d.f. has M model parameters and J covariance parameters, so the maximization process (implemented, say, with a gradient ascent method) must search an $(M + J)$-dimensional space. Our main purpose here is to show that the process can be organized in a way that makes use of the GLS solution (1) and thus reduces the dimensionality of the searched space to J.
The GLS solution (1) yields the $\mathbf{m}^{\mathrm{est}}$ that minimizes the generalized error $\Phi(\mathbf{m})$, or equivalently, the $\mathbf{m}^{\mathrm{est}}$ that maximizes the Normal posterior p.d.f. $p\left(\mathbf{m} \mid \mathbf{d}^{\mathrm{obs}}\right)$:
$p\left(\mathbf{m} \mid \mathbf{d}^{\mathrm{obs}}\right) \propto p\left(\mathbf{d}^{\mathrm{obs}} \mid \mathbf{m}\right)\,p\left(\mathbf{m}\right) \propto \exp\left\{-\tfrac{1}{2}\left[E(\mathbf{m}) + L(\mathbf{m})\right]\right\}$ (5)
Here, Bayes theorem [23] is used to relate the Normal posterior p.d.f. to the Normal likelihood $p\left(\mathbf{d}^{\mathrm{obs}} \mid \mathbf{m}\right)$ and the Normal prior $p\left(\mathbf{m}\right)$. When poorly known parameters $\mathbf{q}$ are added to the problem, they must be treated as additional random variables [22]. Writing $\mathbf{q} = \left(\mathbf{q}_A, \mathbf{q}_B\right)$, with $\mathbf{q}_A$ appearing in the likelihood (through $\mathbf{C}_d$) and $\mathbf{q}_B$ appearing in the prior (through $\mathbf{C}_h$), we have:
$p\left(\mathbf{m}, \mathbf{q} \mid \mathbf{d}^{\mathrm{obs}}\right) \propto p\left(\mathbf{d}^{\mathrm{obs}} \mid \mathbf{m}, \mathbf{q}_A\right)\,p\left(\mathbf{m} \mid \mathbf{q}_B\right)\,p_A\left(\mathbf{q}_A\right)\,p_B\left(\mathbf{q}_B\right)$ (6)
Here, we have assumed that $\mathbf{q}_A$ and $\mathbf{q}_B$ are not correlated with one another. The maximization with respect to the two variables $\mathbf{m}$ and $\mathbf{q}$ can be performed as a sequence of two single-variable maximizations:
$\mathbf{m}^{\mathrm{est}}\left(\mathbf{q}\right) = \arg\max_{\mathbf{m}}\; p\left(\mathbf{m}, \mathbf{q} \mid \mathbf{d}^{\mathrm{obs}}\right)$ (7a)
$\mathbf{q}^{\mathrm{est}} = \arg\max_{\mathbf{q}}\; p\left(\mathbf{m}^{\mathrm{est}}\left(\mathbf{q}\right), \mathbf{q} \mid \mathbf{d}^{\mathrm{obs}}\right)$ (7b)
$\mathbf{m}^{\mathrm{est}} = \mathbf{m}^{\mathrm{est}}\left(\mathbf{q}^{\mathrm{est}}\right)$ (7c)
In the special case of the uniform prior $p_A\left(\mathbf{q}_A\right)\,p_B\left(\mathbf{q}_B\right) = \text{constant}$, the maximization in (7a) is the GLS solution (1) at fixed $\mathbf{q}$. For the Normal p.d.f.:
$p\left(\mathbf{m}, \mathbf{q} \mid \mathbf{d}^{\mathrm{obs}}\right) \propto \left(\det\mathbf{C}_d\right)^{-1/2}\left(\det\mathbf{C}_h\right)^{-1/2}\exp\left\{-\tfrac{1}{2}\left[E\left(\mathbf{m}, \mathbf{q}\right) + L\left(\mathbf{m}, \mathbf{q}\right)\right]\right\}$ (8)
the maximization (7b) is equivalent to the minimization of an objective function $\Psi\left(\mathbf{q}\right)$, in which E and L are evaluated at $\mathbf{m}^{\mathrm{est}}\left(\mathbf{q}\right)$, defined as:
$\Psi\left(\mathbf{q}\right) = E\left(\mathbf{q}\right) + L\left(\mathbf{q}\right) + \ln\det\mathbf{C}_d\left(\mathbf{q}\right) + \ln\det\mathbf{C}_h\left(\mathbf{q}\right)$ (9)
The quantity $E = \mathbf{e}^{\mathrm{T}}\mathbf{C}_d^{-1}\mathbf{e}$ is best computed by finding the Cholesky decomposition $\mathbf{C}_d = \mathbf{R}^{\mathrm{T}}\mathbf{R}$, the algorithm [24] for which is implemented in many software environments, including MATLAB® and PYTHON/linalg. Then, $E$ is the squared length of the vector obtained by back-solving the triangular system $\mathbf{R}^{\mathrm{T}}\mathbf{u} = \mathbf{e}$ (and similarly for L); the same factorization yields $\ln\det\mathbf{C}_d = 2\sum_i \ln R_{ii}$. The nonlinear optimization problem of minimizing Ψ can be implemented using a gradient descent method, provided that the derivative $\partial\Psi/\partial q_i$ can be calculated [18]. In the next section, we derive analytic formulas for this and related derivatives.
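As a concrete illustration of (9), the sketch below evaluates Ψ for one choice of the covariance matrices: it solves (1), forms the normalized errors with Cholesky factors, and adds the two log-determinants. All array names are generic stand-ins.

```python
import numpy as np

def gls_objective(G, Cd, dobs, H, Ch, hpri):
    """Sketch of the objective Psi in (9) for given covariance matrices.
    Returns (Psi, m_est)."""
    # GLS solution (1)
    A = G.T @ np.linalg.solve(Cd, G) + H.T @ np.linalg.solve(Ch, H)
    b = G.T @ np.linalg.solve(Cd, dobs) + H.T @ np.linalg.solve(Ch, hpri)
    m = np.linalg.solve(A, b)

    e = dobs - G @ m                         # prediction error
    l = hpri - H @ m                         # prior-information error

    Rd = np.linalg.cholesky(Cd)              # lower-triangular factor, Cd = Rd Rd^T
    Rh = np.linalg.cholesky(Ch)
    E = np.sum(np.linalg.solve(Rd, e) ** 2)  # e^T Cd^-1 e via the normalized error
    L = np.sum(np.linalg.solve(Rh, l) ** 2)  # l^T Ch^-1 l
    logdet_Cd = 2.0 * np.sum(np.log(np.diag(Rd)))
    logdet_Ch = 2.0 * np.sum(np.log(np.diag(Rh)))

    return E + L + logdet_Cd + logdet_Ch, m
```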
3. Solution Method and Formula for Derivatives
The process of simultaneously estimating the covariance parameters $\mathbf{q}$ and model parameters $\mathbf{m}$ consists of six steps. First, the analytic forms of the covariance matrices $\mathbf{C}_d\left(\mathbf{q}\right)$ and $\mathbf{C}_h\left(\mathbf{q}\right)$ are specified, and their derivatives $\partial\mathbf{C}_d/\partial q_i$ and $\partial\mathbf{C}_h/\partial q_i$ are computed analytically. Second, an initial estimate $\mathbf{q}^{(0)}$ is identified. Third, the covariance matrices $\mathbf{C}_d\left(\mathbf{q}^{(0)}\right)$ and $\mathbf{C}_h\left(\mathbf{q}^{(0)}\right)$ are inserted into (1), yielding model parameters $\mathbf{m}\left(\mathbf{q}^{(0)}\right)$. Fourth, using formulas developed below, the value of the derivative $\partial\Psi/\partial q_i$ is calculated at $\mathbf{q}^{(0)}$. Fifth, a gradient descent method employing this derivative is used to iteratively perturb $\mathbf{q}^{(0)}$ towards the minimum of Ψ at $\mathbf{q}^{\mathrm{est}}$ (in the process repeating steps three through five many times). Sixth, the estimated model parameters are computed as $\mathbf{m}^{\mathrm{est}} = \mathbf{m}\left(\mathbf{q}^{\mathrm{est}}\right)$. This process is depicted in Figure 1.
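A minimal sketch of this loop is given below. For brevity it lets scipy approximate the gradient of Ψ by finite differences rather than using the analytic derivatives of Section 3, so it illustrates only the organization of steps two through six; cov_builder and q0 are hypothetical user-supplied quantities.

```python
import numpy as np
from scipy.optimize import minimize

def tune_covariance(q0, cov_builder, G, dobs, H, hpri):
    """Sketch of steps 2-6 of the tuning procedure.
    cov_builder(q) must return the pair (Cd(q), Ch(q))."""

    def psi(q):
        Cd, Ch = cov_builder(q)                       # step 3: assemble covariances
        A = G.T @ np.linalg.solve(Cd, G) + H.T @ np.linalg.solve(Ch, H)
        b = G.T @ np.linalg.solve(Cd, dobs) + H.T @ np.linalg.solve(Ch, hpri)
        m = np.linalg.solve(A, b)                     # GLS solution (1)
        e, l = dobs - G @ m, hpri - H @ m
        E = e @ np.linalg.solve(Cd, e)
        L = l @ np.linalg.solve(Ch, l)
        return (E + L + np.linalg.slogdet(Cd)[1]
                + np.linalg.slogdet(Ch)[1])           # objective (9)

    # steps 4-5: descend on Psi (gradient approximated by finite differences here)
    q_est = minimize(psi, np.atleast_1d(q0), method="L-BFGS-B").x
    Cd, Ch = cov_builder(q_est)                       # step 6: model at the tuned q
    A = G.T @ np.linalg.solve(Cd, G) + H.T @ np.linalg.solve(Ch, H)
    b = G.T @ np.linalg.solve(Cd, dobs) + H.T @ np.linalg.solve(Ch, hpri)
    return q_est, np.linalg.solve(A, b)
```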
Our derivation of $\partial\Psi/\partial q_i$ uses three matrix derivatives, $\partial\mathbf{A}^{-1}/\partial q$, $\partial\mathbf{A}^{-1/2}/\partial q$ and $\partial\left(\ln\det\mathbf{A}\right)/\partial q$, that may be unfamiliar to some readers, so we derive them here for completeness. Let $\mathbf{A}\left(q\right)$ be a square, invertible, differentiable matrix. Differentiating $\mathbf{A}\mathbf{A}^{-1} = \mathbf{I}$ yields $\left(\partial\mathbf{A}/\partial q\right)\mathbf{A}^{-1} + \mathbf{A}\,\partial\mathbf{A}^{-1}/\partial q = \mathbf{0}$, which can be rearranged into ( [25], their (36)):
$\frac{\partial\mathbf{A}^{-1}}{\partial q} = -\mathbf{A}^{-1}\,\frac{\partial\mathbf{A}}{\partial q}\,\mathbf{A}^{-1}$ (10)
Figure 1. Schematic depiction of the solution process. (a) The GLS solution $\mathbf{m}^{\mathrm{est}}\left(\mathbf{q}\right)$ (red curve) is considered a function of the covariance parameters $\mathbf{q}$ and its derivative (blue line) at a point $\mathbf{q}^{(0)}$ is computed by analytic differentiation of GLS equation (1); (b) The objective function Ψ (colors) is considered a function of $\mathbf{q}$. The results of (a) are used to compute its gradient at the point $\mathbf{q}^{(0)}$. The gradient descent method is used to iteratively perturb this point anti-parallel to the gradient until it reaches the minimum of the objective function, resulting in the best-estimate $\mathbf{q}^{\mathrm{est}}$. This value is then used to determine a best-estimate of the model parameters $\mathbf{m}^{\mathrm{est}}$, as depicted in (a).
Similarly, differentiating $\mathbf{A}^{-1/2}\,\mathbf{A}^{-1/2} = \mathbf{A}^{-1}$ and applying (10) yields the Sylvester equation:
$\frac{\partial\mathbf{A}^{-1/2}}{\partial q}\,\mathbf{A}^{-1/2} + \mathbf{A}^{-1/2}\,\frac{\partial\mathbf{A}^{-1/2}}{\partial q} = -\mathbf{A}^{-1}\,\frac{\partial\mathbf{A}}{\partial q}\,\mathbf{A}^{-1}$ (11)
We have not been able to determine a source for this equation, but in all likelihood it has been derived previously. In practice, (11) is not significantly harder to compute than (10), because efficient algorithms for solving Sylvester equations [26] and for computing a symmetric (principal) square root [27] are widely available and implemented in many software environments, including MATLAB® and PYTHON/linalg. The derivative of $\ln\det\mathbf{A}$ is derived starting with Jacobi’s formula [12]:
$\frac{\partial}{\partial q}\det\mathbf{A} = \mathrm{tr}\left(\mathrm{adj}\left(\mathbf{A}\right)\frac{\partial\mathbf{A}}{\partial q}\right)$ (12)
where $\mathrm{adj}\left(\mathbf{A}\right)$ is the adjugate and $\mathrm{tr}$ is the trace, applying Laplace’s identity [28] $\mathrm{adj}\left(\mathbf{A}\right) = \det\left(\mathbf{A}\right)\mathbf{A}^{-1}$ and the rule $\mathrm{tr}\left(c\,\mathbf{M}\right) = c\,\mathrm{tr}\left(\mathbf{M}\right)$ (where c is a scalar and $\mathbf{M}$ is a matrix) [29]. Finally, the determinant is moved to the left-hand side and the well-known relationship $\mathrm{d}\left(\ln f\right)/\mathrm{d}q = f^{-1}\,\mathrm{d}f/\mathrm{d}q$, for a differentiable function $f\left(q\right)$, is applied, yielding ( [25], their (38)):
$\frac{\partial}{\partial q}\ln\det\mathbf{A} = \mathrm{tr}\left(\mathbf{A}^{-1}\frac{\partial\mathbf{A}}{\partial q}\right)$ (13)
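The identities (10), (11) and (13) are easy to spot-check numerically. The sketch below compares each analytic expression with a central finite difference for a randomly generated, symmetric positive-definite family A(q), using scipy's Sylvester solver and principal matrix square root; the family A(q) itself is arbitrary.

```python
import numpy as np
from scipy.linalg import solve_sylvester, sqrtm, inv

rng = np.random.default_rng(0)
B0 = rng.standard_normal((5, 5))
B1 = rng.standard_normal((5, 5))

def A_of(q):  # a smooth family of symmetric positive-definite matrices
    B = B0 + q * B1
    return B @ B.T + 5.0 * np.eye(5)

q, dq = 0.7, 1e-6
A = A_of(q)
dA = (A_of(q + dq) - A_of(q - dq)) / (2 * dq)   # finite-difference dA/dq
Ainv = inv(A)
Ainvh = inv(sqrtm(A).real)                      # principal A^(-1/2)

# (10): d(A^-1)/dq = -A^-1 dA/dq A^-1
fd10 = (inv(A_of(q + dq)) - inv(A_of(q - dq))) / (2 * dq)
print(np.allclose(fd10, -Ainv @ dA @ Ainv, atol=1e-4))

# (11): X A^-1/2 + A^-1/2 X = -A^-1 dA/dq A^-1, with X = d(A^-1/2)/dq
X = solve_sylvester(Ainvh, Ainvh, -Ainv @ dA @ Ainv)
fd11 = (inv(sqrtm(A_of(q + dq)).real) - inv(sqrtm(A_of(q - dq)).real)) / (2 * dq)
print(np.allclose(X, fd11, atol=1e-4))

# (13): d(ln det A)/dq = tr(A^-1 dA/dq)
fd13 = (np.linalg.slogdet(A_of(q + dq))[1] - np.linalg.slogdet(A_of(q - dq))[1]) / (2 * dq)
print(np.isclose(fd13, np.trace(Ainv @ dA)))
```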
We begin the main derivation by considering the case in which the data covariance $\mathbf{C}_d$ depends on a parameter vector $\mathbf{q}$, and the information covariance $\mathbf{C}_h$ is constant. The derivative of the GLS solution $\mathbf{m}^{\mathrm{est}}$ can be found by applying the chain rule to (1):
$\frac{\partial\mathbf{m}^{\mathrm{est}}}{\partial q_i} = -\mathbf{A}^{-1}\,\mathbf{G}^{\mathrm{T}}\,\mathbf{C}_d^{-1}\,\frac{\partial\mathbf{C}_d}{\partial q_i}\,\mathbf{C}_d^{-1}\left(\mathbf{d}^{\mathrm{obs}} - \mathbf{G}\,\mathbf{m}^{\mathrm{est}}\right) \quad \text{with} \quad \mathbf{A} \equiv \mathbf{G}^{\mathrm{T}}\,\mathbf{C}_d^{-1}\,\mathbf{G} + \mathbf{H}^{\mathrm{T}}\,\mathbf{C}_h^{-1}\,\mathbf{H}$ (14)
Note that we have used (10). The derivatives of the normalized prediction error $\bar{\mathbf{e}} = \mathbf{C}_d^{-1/2}\mathbf{e}$ and of the total data error $E = \bar{\mathbf{e}}^{\mathrm{T}}\bar{\mathbf{e}}$ are:
$\frac{\partial\bar{\mathbf{e}}}{\partial q_i} = \frac{\partial\mathbf{C}_d^{-1/2}}{\partial q_i}\,\mathbf{e} - \mathbf{C}_d^{-1/2}\,\mathbf{G}\,\frac{\partial\mathbf{m}^{\mathrm{est}}}{\partial q_i} \quad \text{and} \quad \frac{\partial E}{\partial q_i} = 2\,\bar{\mathbf{e}}^{\mathrm{T}}\,\frac{\partial\bar{\mathbf{e}}}{\partial q_i}$ (15)
Here, the derivative $\partial\mathbf{C}_d^{-1/2}/\partial q_i$ is found by solving the Sylvester equation that arises from (11). An alternate way of differentiating E that does not require solving a Sylvester equation is:
$\frac{\partial E}{\partial q_i} = \mathbf{e}^{\mathrm{T}}\,\frac{\partial\mathbf{C}_d^{-1}}{\partial q_i}\,\mathbf{e} - 2\,\mathbf{e}^{\mathrm{T}}\,\mathbf{C}_d^{-1}\,\mathbf{G}\,\frac{\partial\mathbf{m}^{\mathrm{est}}}{\partial q_i} \quad \text{with} \quad \frac{\partial\mathbf{C}_d^{-1}}{\partial q_i} = -\mathbf{C}_d^{-1}\,\frac{\partial\mathbf{C}_d}{\partial q_i}\,\mathbf{C}_d^{-1}$ (16)
The derivatives of the normalized error in prior information $\bar{\mathbf{l}} = \mathbf{C}_h^{-1/2}\mathbf{l}$ and of the total error $L = \bar{\mathbf{l}}^{\mathrm{T}}\bar{\mathbf{l}}$ are:
$\frac{\partial\bar{\mathbf{l}}}{\partial q_i} = -\mathbf{C}_h^{-1/2}\,\mathbf{H}\,\frac{\partial\mathbf{m}^{\mathrm{est}}}{\partial q_i} \quad \text{and} \quad \frac{\partial L}{\partial q_i} = 2\,\bar{\mathbf{l}}^{\mathrm{T}}\,\frac{\partial\bar{\mathbf{l}}}{\partial q_i}$ (17)
Finally, since $\Psi = E + L + \ln\det\mathbf{C}_d + \ln\det\mathbf{C}_h$ and only $\mathbf{C}_d$ depends on $q_i$ in this case, we have:
$\frac{\partial\Psi}{\partial q_i} = \frac{\partial E}{\partial q_i} + \frac{\partial L}{\partial q_i} + \mathrm{tr}\left(\mathbf{C}_d^{-1}\,\frac{\partial\mathbf{C}_d}{\partial q_i}\right)$ (18)
Note that we have applied (13).
Finally, we consider the case in which the information covariance $\mathbf{C}_h$ depends on parameters $\mathbf{q}$, and $\mathbf{C}_d$ is constant. Since the data and prior information play completely symmetric roles in (1), the derivatives can be obtained by interchanging the roles of $\mathbf{d}^{\mathrm{obs}}$ and $\mathbf{h}^{\mathrm{pri}}$, $\mathbf{G}$ and $\mathbf{H}$, $\mathbf{C}_d$ and $\mathbf{C}_h$, and E and L in the equations above, yielding:
$\frac{\partial\mathbf{m}^{\mathrm{est}}}{\partial q_i} = -\mathbf{A}^{-1}\,\mathbf{H}^{\mathrm{T}}\,\mathbf{C}_h^{-1}\,\frac{\partial\mathbf{C}_h}{\partial q_i}\,\mathbf{C}_h^{-1}\left(\mathbf{h}^{\mathrm{pri}} - \mathbf{H}\,\mathbf{m}^{\mathrm{est}}\right)$
$\frac{\partial E}{\partial q_i} = -2\,\mathbf{e}^{\mathrm{T}}\,\mathbf{C}_d^{-1}\,\mathbf{G}\,\frac{\partial\mathbf{m}^{\mathrm{est}}}{\partial q_i} \quad \text{and} \quad \frac{\partial L}{\partial q_i} = \mathbf{l}^{\mathrm{T}}\,\frac{\partial\mathbf{C}_h^{-1}}{\partial q_i}\,\mathbf{l} - 2\,\mathbf{l}^{\mathrm{T}}\,\mathbf{C}_h^{-1}\,\mathbf{H}\,\frac{\partial\mathbf{m}^{\mathrm{est}}}{\partial q_i}$
$\frac{\partial\Psi}{\partial q_i} = \frac{\partial E}{\partial q_i} + \frac{\partial L}{\partial q_i} + \mathrm{tr}\left(\mathbf{C}_h^{-1}\,\frac{\partial\mathbf{C}_h}{\partial q_i}\right)$ (19)
These formulas have been checked numerically.
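Such a check can be reproduced along the following lines. The sketch compares the chain-rule derivative of the GLS solution with respect to a data-covariance parameter (cf. (14), as reconstructed here) against a central finite difference on a small synthetic problem; all arrays are illustrative stand-ins.

```python
import numpy as np

rng = np.random.default_rng(1)
N, K, M = 8, 3, 3
G = rng.standard_normal((N, M)); H = np.eye(K, M)
dobs = rng.standard_normal(N);   hpri = np.zeros(K)
Ch = np.eye(K)
x = np.linspace(0.0, 1.0, N)

def Cd_of(p):            # data variance increasing linearly with x (Example 3 style)
    return np.diag(1.0 + p * x)

def m_of(p):
    Cd = Cd_of(p)
    A = G.T @ np.linalg.solve(Cd, G) + H.T @ np.linalg.solve(Ch, H)
    b = G.T @ np.linalg.solve(Cd, dobs) + H.T @ np.linalg.solve(Ch, hpri)
    return np.linalg.solve(A, b), A, Cd

p = 0.4
m, A, Cd = m_of(p)
dCd = np.diag(x)                                   # analytic dCd/dp
Cdi = np.linalg.inv(Cd)
e = dobs - G @ m
# chain-rule derivative, as in (14): dm/dp = -A^-1 G^T Cd^-1 (dCd/dp) Cd^-1 e
dm_analytic = -np.linalg.solve(A, G.T @ Cdi @ dCd @ Cdi @ e)

dp = 1e-6
dm_numeric = (m_of(p + dp)[0] - m_of(p - dp)[0]) / (2 * dp)
print(np.allclose(dm_analytic, dm_numeric, atol=1e-6))
```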
4. Examples with Discussion
In the first example, we examine the simplistic case in which the parameter q represents an overall scaling of variance; that is, $\mathbf{C}_d = q\,\mathbf{C}_{d0}$ and $\mathbf{C}_h = q\,\mathbf{C}_{h0}$, with $\mathbf{C}_{d0}$ and $\mathbf{C}_{h0}$ specified and q unknown. The solution $\mathbf{m}^{\mathrm{est}}$ is independent of q, as can be verified by substitution into (1). The parameter q can then be found by direct minimization of (9), which simplifies to:
$\Psi\left(q\right) = q^{-1}\left(E_0 + L_0\right) + \left(N + K\right)\ln q + \ln\det\mathbf{C}_{d0} + \ln\det\mathbf{C}_{h0}$ (20)
Here, we have used the rule $\det\left(q\,\mathbf{M}\right) = q^{N}\det\mathbf{M}$ [25], valid for any $N \times N$ matrix $\mathbf{M}$, and have defined $E_0 \equiv \mathbf{e}^{\mathrm{T}}\mathbf{C}_{d0}^{-1}\mathbf{e}$ and $L_0 \equiv \mathbf{l}^{\mathrm{T}}\mathbf{C}_{h0}^{-1}\mathbf{l}$. Setting $\mathrm{d}\Psi/\mathrm{d}q = -q^{-2}\left(E_0 + L_0\right) + \left(N + K\right)q^{-1}$ to zero shows that the minimum occurs when:
$q^{\mathrm{est}} = \frac{E_0 + L_0}{N + K}$ (21)
This is a generalization of the well-known maximum likelihood estimate of the variance of a sample [30]. As long as the GLS solution $\mathbf{m}^{\mathrm{est}}$ exists, the minimization of (20) is well-behaved and the overall scaling q is uniquely determined.
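A quick numerical confirmation of (21): for a small synthetic problem, the value of q that minimizes (20) on a fine grid coincides (to grid resolution) with (E0 + L0)/(N + K). All arrays below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
N, K, M = 20, 5, 5
G = rng.standard_normal((N, M)); H = np.eye(K, M)
dobs = rng.standard_normal(N);   hpri = np.zeros(K)
Cd0 = np.eye(N); Ch0 = np.eye(K)

# m_est is independent of the overall scaling q (it cancels in (1))
A = G.T @ np.linalg.solve(Cd0, G) + H.T @ np.linalg.solve(Ch0, H)
b = G.T @ np.linalg.solve(Cd0, dobs) + H.T @ np.linalg.solve(Ch0, hpri)
m = np.linalg.solve(A, b)
e, l = dobs - G @ m, hpri - H @ m
E0 = e @ np.linalg.solve(Cd0, e)
L0 = l @ np.linalg.solve(Ch0, l)

q_grid = np.linspace(0.05, 5.0, 20000)
psi = (E0 + L0) / q_grid + (N + K) * np.log(q_grid)  # objective (20), up to a constant
print(q_grid[np.argmin(psi)], (E0 + L0) / (N + K))   # grid minimum vs. formula (21)
```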
In the second example, we examine another simplistic case, in which a parameter q represents the relative weighting of the two variances; that is, $\mathbf{C}_d = \mathbf{C}_{d0}$ and $\mathbf{C}_h = q\,\mathbf{C}_{h0}$. We consider the problem of estimating the mean $m_1$ of the data, given N observations $\mathbf{d}^{\mathrm{obs}} = \bar{d}\,\mathbf{1}$ and prior information $\mathbf{h}^{\mathrm{pri}} = \mathbf{0}$ (where $\mathbf{0}$ and $\mathbf{1}$ are vectors of zeros and ones, respectively), when $\mathbf{G} = \mathbf{1}$, $\mathbf{H} = \mathbf{1}$ and $\mathbf{C}_{d0} = \mathbf{C}_{h0} = \mathbf{I}$. Applying (1), we find an estimate $m_1^{\mathrm{est}}\left(q\right)$ that depends on q; the objective function $\Psi\left(q\right)$ and its derivative $\mathrm{d}\Psi/\mathrm{d}q$ then follow from (9) and (19). The solution $q^{\mathrm{est}}$ to $\mathrm{d}\Psi/\mathrm{d}q = 0$ can be verified by direct substitution. Thus, the solution splits the difference between the observations and the prior values, and yields tuned prior variances $\mathbf{C}_d$ and $\mathbf{C}_h$ that are equal. While simplistic, this problem illustrates that, at least in some cases, GLS is capable of uniquely determining the relative sizes of $\mathbf{C}_d$ and $\mathbf{C}_h$. Because trade-off curves, as defined in the Introduction, are based on the behavior of E and L, and not on the complete objective function Ψ, the weighting parameter estimated from them will, in general, differ from $q^{\mathrm{est}}$. Consequently, the trade-off curve procedure is not consistent with the Bayesian framework upon which GLS rests.
Our third example demonstrates the tuning of the data covariance $\mathbf{C}_d$. In many cases, observational error increases during the course of an experiment, due to degradation of equipment or to worsening environmental conditions. The example demonstrates that the method is capable of accurately quantifying the fractional rate of increase p of the variance $\sigma_d^2\left(x\right)$, which is assumed to vary with position x. In our simulation, we consider N synthetic data, evenly-spaced on an interval in x, which scatter around the straight line $d\left(x\right) = m_1 + m_2 x$ (Figure 2). The covariance of the data is modeled as $\left[\mathbf{C}_d\right]_{ij} = \sigma_d^2\left(1 + p\,x_i\right)\delta_{ij}$, where $\delta_{ij}$ is the Kronecker delta; that is, the data are uncorrelated and their variance increases linearly with x. The derivative of the covariance is $\partial\left[\mathbf{C}_d\right]_{ij}/\partial p = \sigma_d^2\,x_i\,\delta_{ij}$. We have included prior information $\mathbf{h}^{\mathrm{pri}} = \mathbf{0}$ with $\mathbf{H} = \mathbf{I}$, which implements the notion that the model parameters are small. The corresponding covariance is chosen to be large, $\mathbf{C}_h = \sigma_h^2\,\mathbf{I}$ with $\sigma_h^2 \gg \sigma_d^2$, indicating that this information is weak. The goal is to tune the rate of increase of variance p and to arrive at a best-estimate of the two model parameters. The starting value is taken to be $p^{(0)} = 0$, which corresponds to uniform variance. It is successively improved by a gradient descent method that minimizes Ψ, yielding an estimated value $p^{\mathrm{est}}$. This estimate differs from the true value by about 1%. The estimated solution $\mathbf{m}^{\mathrm{est}}$ differs from the solution computed with the starting value $p^{(0)}$ by a few tenths of a percent, which may be significant in some applications.
Figure 2. Example of tuning $\mathbf{C}_d$. (a) Plot of synthetic data (red dots) and predicted data (green curve); (b) The starting value $p^{(0)} = 0$ corresponds to uniform variance (black curve). The estimate $p^{\mathrm{est}}$ corresponds to increasing variance (green curve); (c) The objective function Ψ (black curve). The starting value (black circle) is successively improved (red circles) by a gradient descent method, yielding an estimate $p^{\mathrm{est}}$ (green circle); (d) The gradient $\mathrm{d}\Psi/\mathrm{d}p$, computed using the formulas developed in the text; (e) The first model parameter $m_1$, highlighting the initial value (black circle) and estimated value (green circle); (f) Same as (e), except for the second model parameter $m_2$.
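A simulation in this spirit can be prototyped as sketched below. The numerical values (N, sigma_d, p_true, the interval in x) are chosen here for illustration and are not those of the figure; a bounded one-parameter minimization of Ψ stands in for the gradient descent described above.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(3)
N = 100
x = np.linspace(0.0, 10.0, N)
m_true = np.array([1.0, 0.5])
sigma_d, p_true = 0.3, 0.4
var_true = sigma_d**2 * (1.0 + p_true * x)                 # variance grows linearly with x
dobs = m_true[0] + m_true[1] * x + rng.standard_normal(N) * np.sqrt(var_true)

G = np.column_stack([np.ones(N), x])                       # straight-line data kernel
H = np.eye(2); hpri = np.zeros(2); Ch = 1.0e4 * np.eye(2)  # weak "smallness" prior

def psi(p):
    Cd = np.diag(sigma_d**2 * (1.0 + p * x))               # Cd(p), diagonal
    A = G.T @ np.linalg.solve(Cd, G) + H.T @ np.linalg.solve(Ch, H)
    b = G.T @ np.linalg.solve(Cd, dobs) + H.T @ np.linalg.solve(Ch, hpri)
    m = np.linalg.solve(A, b)
    e, l = dobs - G @ m, hpri - H @ m
    return (e @ np.linalg.solve(Cd, e) + l @ np.linalg.solve(Ch, l)
            + np.linalg.slogdet(Cd)[1] + np.linalg.slogdet(Ch)[1])

p_est = minimize_scalar(psi, bounds=(0.0, 2.0), method="bounded").x
print("true p:", p_true, " tuned p:", p_est)
```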
The fourth example demonstrates tuning of the information covariance $\mathbf{C}_h$. In many instances, one may need to “reconstruct” or “interpolate” a function on the basis of unevenly and sparsely sampled data. In this case, prior information on the autocovariance of the function can enable a smooth interpolation. Furthermore, it can enforce a covariance structure that may be required, say, by the underlying physics of the problem. In our example, we suppose that the function is known to be oscillatory on physical grounds, but that the wavenumber of those oscillations is known only imprecisely. The goal is to tune prior knowledge of the wavenumber to arrive at a best-estimate of the reconstructed function. In our simulation, a total of M model parameters are uniformly spaced on an interval in x and represent a sampled version of a continuous, sinusoidal function with wavenumber $q_{\mathrm{true}}$ (Figure 3). Synthetic data with uncorrelated error of variance $\sigma_d^2$ are available for N randomly-chosen points $x_{k(i)}$, where the index function $k\left(i\right)$ aligns the observations in x with the model parameters. The data kernel is $G_{ij} = \delta_{j\,k(i)}$. The prior information is given in (4), with autocovariance $\left[\mathbf{C}_h\right]_{ij} = \sigma_h^2\exp\left(-\left|x_i - x_j\right|/L\right)\cos\left(q\left(x_i - x_j\right)\right)$ and $\mathbf{h}^{\mathrm{pri}} = \mathbf{0}$. The derivative is $\partial\left[\mathbf{C}_h\right]_{ij}/\partial q = -\sigma_h^2\left(x_i - x_j\right)\exp\left(-\left|x_i - x_j\right|/L\right)\sin\left(q\left(x_i - x_j\right)\right)$. An initial guess $q^{(0)}$ is improved using a gradient descent method, yielding an estimated value $q^{\mathrm{est}}$ that differs from $q_{\mathrm{true}}$ by less than 0.01%. The reconstructed function is smooth and sinusoidal, and the fit to the data is much improved.
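A prototype of this reconstruction problem is sketched below. The damped-cosine covariance follows the form assumed for (4) above, and every numerical value (M, N, the damping length L, the noise level, the search bounds) is illustrative only.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(4)
M, N = 101, 25
x = np.linspace(0.0, 100.0, M)
q_true = 0.3
m_true = np.sin(q_true * x)                       # sinusoidal field to be reconstructed

k = np.sort(rng.choice(M, size=N, replace=False)) # indices of the sampled points
G = np.zeros((N, M)); G[np.arange(N), k] = 1.0    # sampling kernel G_ij = delta_{j,k(i)}
sigma_d = 0.05
dobs = m_true[k] + sigma_d * rng.standard_normal(N)
Cd = sigma_d**2 * np.eye(N)
H = np.eye(M); hpri = np.zeros(M)

def Ch_of(q, sigma_h=1.0, L=30.0):                # damped-cosine prior covariance (cf. (4))
    dx = x[:, None] - x[None, :]
    return sigma_h**2 * np.exp(-np.abs(dx) / L) * np.cos(q * dx)

def psi(q):
    Ch = Ch_of(q)
    A = G.T @ np.linalg.solve(Cd, G) + H.T @ np.linalg.solve(Ch, H)
    b = G.T @ np.linalg.solve(Cd, dobs) + H.T @ np.linalg.solve(Ch, hpri)
    m = np.linalg.solve(A, b)
    e, l = dobs - G @ m, hpri - H @ m
    return (e @ np.linalg.solve(Cd, e) + l @ np.linalg.solve(Ch, l)
            + np.linalg.slogdet(Cd)[1] + np.linalg.slogdet(Ch)[1])

q_est = minimize_scalar(psi, bounds=(0.2, 0.4), method="bounded").x
print("true wavenumber:", q_true, " tuned wavenumber:", q_est)
```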
Examples three and four were implemented in MATLAB® and executed in <5s on a notebook computer. They confirm the flexibility, speed and effectiveness of the method. An ability to tune prior information on autocovariance may be of special utility in seismic exploration applications, where three-dimensional waveform datasets are routinely interpolated.
Figure 3. Example of tuning $\mathbf{C}_h$. Sparsely-sampled synthetic data (red dots) are oscillatory. (a) A regularly-sampled version $\mathbf{m}^{\mathrm{est}}$ is created by imposing the oscillatory covariance $\mathbf{C}_h\left(q\right)$. With the starting value $q^{(0)}$, the reconstruction poorly fits the data (black curve). Tuning leads to a better fit (green curve with dots), as well as a precise estimate of the wavenumber $q^{\mathrm{est}}$; (b) Decrease in Ψ with iteration number during the gradient descent process.
A limitation of this overall “parametric” approach is that the solution depends on the choice of parameterization, which must be guided by prior knowledge of the general properties of the covariance matrices in the particular problem being solved. In Example 3, we were able to recognize (say, by visually examining the data plotted in Figure 2(a)) that observational error increases with x and chose a $\mathbf{C}_d\left(p\right)$ that matched this scenario. If, instead, the degree of correlation between successive data increased with x, this pattern might be less expected, more difficult to detect, and require a different parameterization, say one in which the correlation length of $\mathbf{C}_d$ grows with x.
Not every parameterization of $\mathbf{C}_d$ (or $\mathbf{C}_h$) is necessarily well-behaved. To avoid poor behavior, the parameterization must be chosen so that its determinant does not have zeros at values of $\mathbf{q}$ that would prevent the steepest descent process from converging to the global minimum. That this choice can be problematical is illustrated by the simple Toeplitz version of $\mathbf{C}_h$ (with $J = K - 1$ parameters $q_1, \cdots, q_{K-1}$):
$\mathbf{C}_h = \sigma_h^2 \begin{bmatrix} 1 & q_1 & q_2 & \cdots & q_{K-1} \\ q_1 & 1 & q_1 & \cdots & q_{K-2} \\ q_2 & q_1 & 1 & \cdots & q_{K-3} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ q_{K-1} & q_{K-2} & q_{K-3} & \cdots & 1 \end{bmatrix}$ (22)
with $-1 < q_i < 1$. This form is useful for quantifying correlations within a stationary sequence of data [31]. Yet, as is illustrated in Figure 4, this volume of allowable $\mathbf{q}$ is crossed by many surfaces on which $\det\mathbf{C}_h = 0$ and the objective function Ψ is consequently singular. Their presence suggests that the steepest descent path between a starting value $\mathbf{q}^{(0)}$ and the global minimum at $\mathbf{q}^{\mathrm{est}}$ may be very convoluted (if, indeed, such a path exists) unless $\mathbf{q}^{(0)}$ is very close to $\mathbf{q}^{\mathrm{est}}$.
Figure 4. The function $\det\mathbf{C}_h\left(\mathbf{q}\right)$ for the case given by (22). (a) The surface $\det\mathbf{C}_h = 0$ for one value of $q_3$, with the other qs randomly assigned; (b) Same as (a), but for a second value of $q_3$; (c) Same as (a), but for a third value of $q_3$; (d) Perspective view of the surfaces in the $\left(q_1, q_2, q_3\right)$ volume. The positions of the three slices in (a), (b) and (c) are noted on the $q_3$-axis (green arrows). A question posed in the text is whether, given an arbitrary point $\mathbf{q}^{(0)}$ and the global minimum of the objective function, say at $\mathbf{q}^{\mathrm{est}}$ (and with both points satisfying $\det\mathbf{C}_h > 0$), a steepest-descent path necessarily exists between them.
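The determinant zeros are easy to visualize numerically. The sketch below scans det Ch over a (q1, q2) slice of the parameter volume for a small Toeplitz matrix of the form (22) (its size and the fixed values are chosen purely for illustration) and reports the fraction of the slice on which det Ch <= 0, which is where the ln det Ch term of Ψ is singular.

```python
import numpy as np
from scipy.linalg import toeplitz

rng = np.random.default_rng(5)
K = 6                                    # illustrative size; first row is (1, q1, ..., q_{K-1})
q_rest = rng.uniform(-0.5, 0.5, K - 4)   # the "other qs" (q4, ...), randomly assigned

def det_Ch(q1, q2, q3):
    """Determinant of the K x K symmetric Toeplitz covariance of (22), sigma_h = 1."""
    first_col = np.concatenate(([1.0, q1, q2, q3], q_rest))
    return np.linalg.det(toeplitz(first_col))

# scan a (q1, q2) slice at fixed q3; points with det <= 0 are where ln det Ch blows up
q3 = 0.5
qs = np.linspace(-0.99, 0.99, 201)
D = np.array([[det_Ch(q1, q2, q3) for q2 in qs] for q1 in qs])
print("fraction of the slice with det Ch <= 0:", np.mean(D <= 0.0))
```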
5. Conclusion
Generalized Least Squares requires the assignment of two prior covariance matrices: the prior covariance $\mathbf{C}_d$ of the data and the prior covariance $\mathbf{C}_h$ of the prior information. Making these assignments is often a very subjective process. However, in cases in which the forms of these matrices can be anticipated up to a set of poorly-known parameters, information contained within the data and prior information can be used to improve knowledge of them—a process we call “tuning”. Tuning can be achieved by minimizing an objective function that depends on both the generalized error and the determinants of the covariance matrices, to arrive at a best estimate of the parameters. Analytic and computationally-tractable formulas are derived for the derivatives needed to implement the minimization via a gradient descent method. Furthermore, the problem is organized so that the minimization need be performed only over the space of covariance parameters, and not over the typically-much-larger combined space of model and covariance parameters. Although some care needs to be exercised as the covariance matrices are parametrized, the minimization is tractable and can lead to better estimates of the model parameters. An important outcome of this study is the recognition that the use of trade-off curves to determine the relative weighting of covariance—a practice ubiquitous in geophysical imaging—is not consistent with the underlying Bayesian framework of Generalized Least Squares. The strategy outlined here provides a consistent solution.
Acknowledgements
The author thanks Roger Creel for helpful discussion.
Conflicts of Interest
The author declares no conflicts of interest regarding the publication of this paper.
Cite this paper
Menke, W. (2021) Tuning of Prior Covariance in Generalized Least Squares. Applied Mathematics, 12, 157-170. https://doi.org/10.4236/am.2021.123011
References
- 1. Tarantola, A. and Valette, B. (1982) Generalized Non-Linear Inverse Problems Solved Using the Least Squares Criterion. Reviews of Geophysics and Space Physics, 20, 219-232. https://doi.org/10.1029/RG020i002p00219
- 2. Tarantola, A. and Valette, B. (1982) Inverse Problems = Quest for Information. Journal of Geophysics, 50, 159-170. https://n2t.net/ark:/88439/y048722
- 3. Menke, W. (2018) Geophysical Data Analysis: Discrete Inverse Theory. 4th Edition, Elsevier, 350 p.
- 4. Menke, W. and Menke, J. (2016) Environmental Data Analysis with MATLAB. 2nd Edition, Elsevier, 3342 p. https://doi.org/10.1016/B978-0-12-804488-9.00001-X
- 5. Tarantola, A. (2005) Inverse Problem Theory and Methods for Model Parameter Estimation. SIAM: Society for Industrial and Applied Mathematics, 342 p. https://doi.org/10.1137/1.9780898717921
- 6. Menke, W. (2014) Review of the Generalized Least Squares Method. Surveys in Geophysics, 36, 1-25. https://doi.org/10.1007/s10712-014-9303-1
- 7. Abers, G. (1994) Three-Dimensional Inversion of Regional P and S Arrival Times in the East Aleutians and Sources of Subduction Zone Gravity Highs. Journal of Geophysical Research, 99, 4395-4412. https://doi.org/10.1029/93JB03107
- 8. Schmandt, B. and Lin, F.-C. (2014) P and S Wave Tomography of the Mantle beneath the United States. Geophysical Research Letters, 41, 6342-6349. https://doi.org/10.1002/2014GL061231
- 9. Menke, W. (2005) Case Studies of Seismic Tomography and Earthquake Location in a Regional Context. Geophysical Monograph 157. American Geophysical Union, Washington DC. https://doi.org/10.1029/157GM02
- 10. Nettles, M., and Dziewonski, A.M. (2008) Radially Anisotropic Shear Velocity Structure of the Upper Mantle Globally and Beneath North America. Journal of Geophysical Research, 113, B02303. https://doi.org/10.1029/2006JB004819
- 11. Chen, W. and Ritzwoller, M.H. (2016) Crustal and Uppermost Mantle Structure Beneath the United States. Journal of Geophysical Research, 121, 4306-4342. https://doi.org/10.1002/2016JB012887
- 12. Humphreys, E.D., Dueker, K.G., Schutt, D.L. and Smith, R.B. (2000) Beneath Yellowstone: Evaluating Plume and Nonplume Models Using Teleseismic Images of the Upper Mantle. GSA Today, 10, 1-7. https://www.geosociety.org/gsatoday/archive/10/12/
- 13. Gillet, N., Schaeffer, N. and Jault, D. (2011) Rationale and Geophysical Evidence for Quasi-Geostrophic Rapid Dynamics within the Earth’s Outer Core. Physics of the Earth and Planetary Interiors, 187, 380-390. https://doi.org/10.1016/j.pepi.2011.01.005
- 14. Zhao, S. (2013) Lithosphere Thickness and Mantle Viscosity Estimated from Joint Inversion of GPS and GRACE-Derived Radial Deformation and Gravity Rates in North America. Geophysical Journal International, 194, 1455-1472. https://doi.org/10.1093/gji/ggt212
- 15. Menke, W. and Eilon, Z. (2015) Relationship between Data Smoothing and the Regularization of Inverse Problems. Pure and Applied Geophysics, 172, 2711-2726. https://doi.org/10.1007/s00024-015-1059-0
- 16. Voorhies, C.F. (1986) Steady Flows at the Top of Earth’s Core Derived from Geomagnetic Field Models. Journal of Geophysical Research, 91, 12444-12466. https://doi.org/10.1029/JB091iB12p12444
- 17. Yao, Z.S. and Roberts, R.G. (1999) A Practical Regularization for Seismic Tomography. Geophysical Journal International, 138, 293-299. https://doi.org/10.1046/j.1365-246X.1999.00849.x
- 18. Snyman, J.A. and Wilke, D.N. (2018) Practical Mathematical Optimization—Basic Optimization Theory and Gradient-Based Algorithms. Springer Optimization and Its Applications, 2nd Edition, Springer, New York, 340 p.
- 19. Hildebrand, F.B. (1987) Introduction to Numerical Analysis. 2nd Edition, Dover Publications, New York.
- 20. Zaroli, C., Sambridge, M., Lévêque, J.-J., Debayle, E. and Nolet, G. (2013) An Objective Rationale for the Choice of Regularization Parameter with Application to Global Multiple-Frequency S-Wave Tomography. Solid Earth, 4, 357-371. https://doi.org/10.5194/se-4-357-2013
- 21. Malinverno, A. and Parker, R.L. (2006) Two Ways to Quantify Uncertainty in Geophysical Inverse Problems. Geophysics, 71, W15-W27. https://doi.org/10.1190/1.2194516
- 22. Malinverno, A. and Briggs, V.A. (2004) Expanded Uncertainty Quantification in Inverse Problems: Hierarchical Bayes and Empirical Bayes. Geophysics, 69, 877-1103. https://doi.org/10.1190/1.1778243
- 23. Box, G.E.P. and Tiao, G.C. (1992) Bayesian Inference in Statistical Analysis. Wiley, New York, 589 p. https://doi.org/10.1002/9781118033197
- 24. Schmidt, E. (1973) Cholesky Factorization and Matrix Inversion, National Oceanic and Atmospheric Administration Technical Report NOS-56. US Government Printing Office, Washington DC. https://books.google.com/books?id=MiRHAQAAIAAJ
- 25. Petersen, K.B. and Pedersen, M.S. (2008) The Matrix Cookbook, 71 p. https://archive.org/details/imm3274
- 26. Bartels, R.H. and Stewart, G.W. (1972) Solution of the matrix equation AX + XB = C. Communications of the ACM, 15, 820-826. https://doi.org/10.1145/361573.361582
- 27. Higham, N.J. (1987) Computing Real Square Roots of a Real Matrix. Linear Algebra and its Applications, 88-89, 405-430. https://doi.org/10.1016/0024-3795(87)90118-2
- 28. Magnus, J.R. and Neudecker, H. (1999) Matrix Differential Calculus with Applications in Statistics and Econometrics, Revised Edition. John Wiley and Sons, New York, 424 p.
- 29. Gantmacher, F.R. (1960) The Theory of Matrices, Volume 1. Chelsea Publishing, New York, 374 p.
- 30. Fisher, R.A. (1925) Theory of Statistical Estimation. Mathematical Proceedings of the Cambridge Philosophical Society, 22, 700-725. https://doi.org/10.1017/S0305004100009580
- 31. Claerbout, J.F. (1985) Fundamentals of Geophysical Data Processing with Applications to Petroleum Prospecting. Blackwell Scientific Publishing, Oxford, UK, 267 p.