Generalized Least Squares (least squares with prior information) requires the correct assignment of two prior covariance matrices: one associated with the uncertainty of the measurements, the other with the uncertainty of the prior information. These assignments are often very subjective, especially when correlations among data or among prior information are believed to occur. However, in cases in which the general form of these matrices can be anticipated up to a set of poorly known parameters, the data and prior information may be used to better determine (or “tune”) the parameters in a manner that is faithful to the underlying Bayesian foundation of GLS. We identify an objective function, the minimization of which leads to the best estimate of the parameters, and provide explicit and computationally efficient formulas for calculating the derivatives needed to implement the minimization with a gradient descent method. Furthermore, the problem is organized so that the minimization need be performed only over the space of covariance parameters, and not over the combined space of model and covariance parameters. We show that the use of trade-off curves to select the relative weight given to observations and prior information is not a form of tuning, because it does not, in general, maximize the posterior probability of the model parameters, and can lead to a different weighting than the procedure described here. We also provide several examples that demonstrate the viability of the method, and discuss both its advantages and limitations.

Generalized Least Squares (GLS, also called least squares with prior information) is a tool for statistical inference [

We review the Generalized Least Squares (GLS) method here, following the notation in [

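$$\mathbf{m}^{\mathrm{est}} = \mathbf{Z}^{-1}\left(\mathbf{G}^T\mathbf{C}_d^{-1}\mathbf{d}^{\mathrm{obs}} + \mathbf{H}^T\mathbf{C}_h^{-1}\mathbf{h}^{\mathrm{pri}}\right) \quad\text{with}\quad \mathbf{Z} \equiv \mathbf{G}^T\mathbf{C}_d^{-1}\mathbf{G} + \mathbf{H}^T\mathbf{C}_h^{-1}\mathbf{H} \tag{1}$$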
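As a concrete illustration of (1), the following minimal Python sketch solves a two-parameter problem ($M = 2$) with diagonal covariance matrices, so that $\mathbf{Z}$ is a 2×2 matrix that can be inverted in closed form. The function name `gls_two_param` and the straight-line test problem are illustrative assumptions, not part of the original formulation.

```python
def gls_two_param(G, d_obs, var_d, H, h_pri, var_h):
    """GLS solution (1) for M = 2 and diagonal Cd, Ch (illustrative sketch).

    G and H are lists of 2-element rows; var_d and var_h hold the
    diagonal elements (variances) of Cd and Ch.
    """
    Z = [[0.0, 0.0], [0.0, 0.0]]   # Z = G^T Cd^-1 G + H^T Ch^-1 H
    b = [0.0, 0.0]                 # b = G^T Cd^-1 d_obs + H^T Ch^-1 h_pri
    for rows, values, variances in ((G, d_obs, var_d), (H, h_pri, var_h)):
        for row, value, var in zip(rows, values, variances):
            for i in range(2):
                b[i] += row[i] * value / var
                for j in range(2):
                    Z[i][j] += row[i] * row[j] / var
    det = Z[0][0] * Z[1][1] - Z[0][1] * Z[1][0]
    # m_est = Z^-1 b, using the closed-form inverse of a 2x2 matrix
    return [(Z[1][1] * b[0] - Z[0][1] * b[1]) / det,
            (Z[0][0] * b[1] - Z[1][0] * b[0]) / det]


# Hypothetical test problem: straight-line data d = m1 + m2 x, prior H = I
x = [0.0, 0.5, 1.0]
G = [[1.0, xi] for xi in x]
d_obs = [1.0 + 2.0 * xi for xi in x]   # noise-free data generated from m = (1, 2)
H = [[1.0, 0.0], [0.0, 1.0]]
h_pri = [1.0, 2.0]                     # prior information that agrees with the data
m_est = gls_two_param(G, d_obs, [0.01] * 3, H, h_pri, [1.0, 1.0])
```

Because the data and the prior information in this test problem are mutually consistent, the estimate should reproduce the generating model regardless of the variances chosen; with inconsistent inputs, the variances control how the two sources of information are balanced.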

The assumption of linear kernels G and H is a very restrictive one. In the well-studied nonlinear generalization [

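$$\begin{aligned}
\mathbf{G}^{(0)}\Delta\mathbf{m} &= \Delta\mathbf{d} \quad\text{with}\quad G^{(0)}_{ij} \equiv \left.\frac{\partial g_i}{\partial m_j}\right|_{\mathbf{m}^{(0)}} \quad\text{and}\quad \Delta\mathbf{d} \equiv \mathbf{d}^{\mathrm{obs}} - \mathbf{g}(\mathbf{m}^{(0)}) \\
\mathbf{H}^{(0)}\Delta\mathbf{m} &= \Delta\mathbf{h} \quad\text{with}\quad H^{(0)}_{ij} \equiv \left.\frac{\partial h_i}{\partial m_j}\right|_{\mathbf{m}^{(0)}} \quad\text{and}\quad \Delta\mathbf{h} \equiv \mathbf{h}^{\mathrm{pri}} - \mathbf{h}(\mathbf{m}^{(0)})
\end{aligned} \tag{2}$$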

and $\Delta\mathbf{m} = \mathbf{m} - \mathbf{m}^{(0)}$. The solution is then found by iterative application of (1) to (2); that is, by the Gauss-Newton method [

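$$\left.\nabla_{\mathbf{m}}\Phi\right|_{\mathbf{m}^{(0)}} = -2\,\mathbf{G}^{(0)T}\mathbf{C}_d^{-1}\left(\mathbf{d}^{\mathrm{obs}} - \mathbf{G}\mathbf{m}^{(0)}\right) - 2\,\mathbf{H}^{(0)T}\mathbf{C}_h^{-1}\left(\mathbf{h}^{\mathrm{pri}} - \mathbf{H}\mathbf{m}^{(0)}\right) \tag{3}$$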

The latter approach is preferred for very large M, since the convergence rate of gradient descent is independent of its dimension [^{3} [

We now discuss issues related to the covariance matrices that appear in GLS. The data covariance $\mathbf{C}_d$ quantifies the uncertainty of the observations and the information covariance $\mathbf{C}_h$ quantifies the uncertainty of the prior information. Prior knowledge of the inherent accuracy of the measurement technique is needed to assign $\mathbf{C}_d$, and prior knowledge of the physically plausible solutions, perhaps stemming from an understanding of the underlying physics, is needed to assign $\mathbf{C}_h$. These assignments are often very subjective, especially when correlations are believed to occur (that is, when $\mathbf{C}_d$ and $\mathbf{C}_h$ have non-zero off-diagonal elements). For example, one geotomographic study [

The matrices $\mathbf{C}_d$ and $\mathbf{C}_h$ together contain $\tfrac{1}{2}N(N+1) + \tfrac{1}{2}K(K+1)$ independent elements, many more than the $(N+K)$ constraints imposed by the data $\mathbf{d}$ and prior information $\mathbf{h}$. Consequently, insufficient information is available to uniquely solve for all the elements of $\mathbf{C}_d$ and $\mathbf{C}_h$. However, it sometimes may be possible to parameterize $\mathbf{C}_d(\mathbf{q})$ and/or $\mathbf{C}_h(\mathbf{q})$ in terms of $\mathbf{q} \in \mathbb{R}^J$, and to ask whether an initial estimate of $\mathbf{q}$ can be improved. As long as $(M+J) < (N+K)$, adequate information may be available to determine a best estimate $\mathbf{q}^{\mathrm{est}}$. We refer to the process of determining $\mathbf{q}^{\mathrm{est}}$ as “tuning”, since in typical practice it requires that the initial covariances already be close to their true values.

As an example of a parametrized covariance, we consider the case where the model parameters represent a sampled version of a continuous function $m(x)$, where $x \in \mathbb{R}$ is an independent variable; that is, $m_n = m(x_n)$, with $x_n \equiv n\Delta x$ and $\Delta x$ the sampling interval. The prior information that $m(x)$ is approximately oscillatory with wavenumber $q$ can be modeled by:

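$$\mathbf{H} = \mathbf{I} \quad\text{and}\quad \mathbf{h}^{\mathrm{pri}} = \mathbf{0} \quad\text{and}\quad [\mathbf{C}_h]_{nm} = \sigma_h^2\cos\left(q\,|x_n - x_m|\right) \tag{4}$$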

In this case, $\mathbf{C}_h$ approximates the autocovariance of $m(x)$, which is assumed to be stationary. The goal of tuning is to provide a best estimate $q^{\mathrm{est}}$ of the wavenumber, as well as a best estimate $\mathbf{m}^{\mathrm{est}}$ of the model parameters. This problem is further developed in Example 4, below.
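The parametrized covariance (4) and its element-wise derivative with respect to $q$ can be sketched in a few lines of Python. The function names, the grid of five points, and the values of $q$ and $\sigma_h$ below are hypothetical choices for illustration; the derivative follows from elementary differentiation of the cosine and is checked against a central finite difference.

```python
import math

def cosine_covariance(x, q, sigma_h):
    """[Ch]_nm = sigma_h^2 cos(q |x_n - x_m|), as in (4)."""
    return [[sigma_h ** 2 * math.cos(q * abs(xn - xm)) for xm in x] for xn in x]

def cosine_covariance_dq(x, q, sigma_h):
    """Element-wise derivative of (4) with respect to the wavenumber q."""
    return [[-sigma_h ** 2 * abs(xn - xm) * math.sin(q * abs(xn - xm)) for xm in x]
            for xn in x]

dx = 1.0                                   # sampling interval (assumed)
x = [n * dx for n in range(5)]             # x_n = n * dx
q, sigma_h = 0.3, 2.0                      # illustrative values
Ch = cosine_covariance(x, q, sigma_h)
dCh = cosine_covariance_dq(x, q, sigma_h)

# Central finite-difference check of the analytic derivative
eps = 1e-6
Cp = cosine_covariance(x, q + eps, sigma_h)
Cm = cosine_covariance(x, q - eps, sigma_h)
max_err = max(abs((Cp[n][m] - Cm[n][m]) / (2 * eps) - dCh[n][m])
              for n in range(5) for m in range(5))
```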

Although the GLS formulation is widely used in geotomography and geophysical imaging, the tuning of variance is typically implemented in a very limited fashion, through the use of trade-off curves [

The general process of using Bayes’ theorem to construct a posterior probability density function (p.d.f.) that depends on unknown parameters, and of estimating those parameters through the maximization of probability, is very well understood [

The GLS solution (1) yields the $\mathbf{m}$ that minimizes the generalized error $\Phi(\mathbf{m})$ or, equivalently, the $\mathbf{m}$ that maximizes the Normal posterior p.d.f. $p(\mathbf{m}\,|\,\mathbf{d}^{\mathrm{obs}}, \mathbf{h}^{\mathrm{pri}})$:

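$$\mathbf{m}^{\mathrm{est}} = \arg\max_{\mathbf{m}}\; p(\mathbf{m}\,|\,\mathbf{d}^{\mathrm{obs}}, \mathbf{h}^{\mathrm{pri}}) \quad\text{with}\quad p(\mathbf{m}\,|\,\mathbf{d}^{\mathrm{obs}}, \mathbf{h}^{\mathrm{pri}}) \propto p(\mathbf{d}^{\mathrm{obs}}\,|\,\mathbf{m})\; p(\mathbf{h}^{\mathrm{pri}}\,|\,\mathbf{m}) \tag{5}$$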

Here, Bayes theorem [

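$$\mathbf{m}^{\mathrm{est}}, \mathbf{q}^{\mathrm{est}} = \arg\max_{\mathbf{m},\,\mathbf{q}^{(d)},\,\mathbf{q}^{(h)}}\; p(\mathbf{m}, \mathbf{q}\,|\,\mathbf{d}^{\mathrm{obs}}, \mathbf{h}^{\mathrm{pri}}) \quad\text{with}\quad p(\mathbf{m}, \mathbf{q}\,|\,\mathbf{d}^{\mathrm{obs}}, \mathbf{h}^{\mathrm{pri}}) \propto p(\mathbf{d}^{\mathrm{obs}}\,|\,\mathbf{m}, \mathbf{q}^{(d)})\; p(\mathbf{h}^{\mathrm{pri}}\,|\,\mathbf{m}, \mathbf{q}^{(h)})\; p(\mathbf{q}^{(d)})\; p(\mathbf{q}^{(h)}) \tag{6}$$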

Here, we have assumed that q and m are not correlated with one another. The maximization with respect to the two variables can be performed as a sequence of two single-variable maximizations:

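$$\mathbf{m}(\mathbf{q}_0):\quad \arg\max_{\mathbf{m}}\; p(\mathbf{m}, \mathbf{q}_0\,|\,\mathbf{d}^{\mathrm{obs}}, \mathbf{h}^{\mathrm{pri}}) \quad (\text{at fixed } \mathbf{q}_0) \tag{7a}$$

$$\mathbf{q}^{\mathrm{est}}:\quad \arg\max_{\mathbf{q}_0}\; p(\mathbf{m}(\mathbf{q}_0), \mathbf{q}_0\,|\,\mathbf{d}^{\mathrm{obs}}, \mathbf{h}^{\mathrm{pri}}) \tag{7b}$$

$$\mathbf{m}^{\mathrm{est}} = \mathbf{m}(\mathbf{q}^{\mathrm{est}}) \tag{7c}$$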

In the special case of the uniform prior $p(\mathbf{q}^{(d)})\,p(\mathbf{q}^{(h)}) \propto \text{constant}$, the maximization in (7a) is the GLS solution (1) at fixed $\mathbf{q}_0$. For the Normal p.d.f.:

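$$p(\mathbf{m}(\mathbf{q}_0), \mathbf{q}_0\,|\,\mathbf{d}^{\mathrm{obs}}, \mathbf{h}^{\mathrm{pri}}) = (2\pi)^{-\frac{1}{2}(N+K)}\,(\det\mathbf{C}_d)^{-\frac{1}{2}}\,(\det\mathbf{C}_h)^{-\frac{1}{2}}\,\exp\left(-\tfrac{1}{2}E\right)\exp\left(-\tfrac{1}{2}L\right) \tag{8}$$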

the maximization (7b) is equivalent to the minimization of an objective function $\Psi(\mathbf{q})$, defined as:

$$\Psi \equiv -2\left[\ln p + \tfrac{1}{2}(N+K)\ln(2\pi)\right] = \ln(\det\mathbf{C}_d) + \ln(\det\mathbf{C}_h) + E + L \tag{9}$$

The quantity $\ln(\det\mathbf{C}_d)$ is best computed by finding the Cholesky decomposition $\mathbf{C}_d = \mathbf{D}\mathbf{D}^T$, the algorithm [
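One standard way to exploit the Cholesky factor is the identity $\ln(\det\mathbf{C}) = 2\sum_i \ln D_{ii}$, which avoids forming the determinant itself (a quantity that can overflow or underflow for large matrices). The sketch below is a minimal pure-Python illustration under that assumption; the helper names `cholesky_lower` and `log_det` are hypothetical, and a production code would normally call a linear algebra library instead.

```python
import math

def cholesky_lower(C):
    """Lower-triangular D with C = D D^T, for symmetric positive-definite C."""
    n = len(C)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(D[i][k] * D[j][k] for k in range(j))
            if i == j:
                # math.sqrt raises ValueError if C is not positive definite
                D[i][i] = math.sqrt(C[i][i] - s)
            else:
                D[i][j] = (C[i][j] - s) / D[j][j]
    return D

def log_det(C):
    """ln(det C) computed as 2 * sum_i ln(D_ii)."""
    D = cholesky_lower(C)
    return 2.0 * sum(math.log(D[i][i]) for i in range(len(C)))

C = [[4.0, 2.0], [2.0, 3.0]]   # small test matrix with det C = 8
value = log_det(C)
```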

The process of simultaneously estimating the covariance parameters $\mathbf{q}^{\mathrm{est}}$ and model parameters $\mathbf{m}^{\mathrm{est}}$ consists of six steps. First, the analytic forms of the covariance matrices $\mathbf{C}_d(\mathbf{q})$ and $\mathbf{C}_h(\mathbf{q})$ are specified, and their derivatives $\partial\mathbf{C}_d/\partial q_m$ and $\partial\mathbf{C}_h/\partial q_m$ are computed analytically. Second, an initial estimate $\mathbf{q}^{(0)}$ is identified. Third, the covariance matrices $\mathbf{C}_d(\mathbf{q}^{(0)})$ and $\mathbf{C}_h(\mathbf{q}^{(0)})$ are inserted into (1), yielding model parameters $\mathbf{m}(\mathbf{q}^{(0)})$. Fourth, using formulas developed below, the value of the derivative $\partial\Psi/\partial q_m$ is calculated at $\mathbf{q}^{(0)}$. Fifth, a gradient descent method employing $\partial\Psi/\partial q_m$ is used to iteratively perturb $\mathbf{q}^{(0)}$ towards the minimum of $\Psi$ at $\mathbf{q}^{\mathrm{est}}$ (in the process, repeating steps three through five many times). Sixth, the estimated model parameters are computed as $\mathbf{m}^{\mathrm{est}} = \mathbf{m}(\mathbf{q}^{\mathrm{est}})$. This process is depicted in
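The six steps can be sketched on a deliberately tiny problem. The Python below assumes a single model parameter ($M = 1$), $\mathbf{G}$ and $\mathbf{H}$ equal to columns of ones, and the one-parameter weighting $\mathbf{C}_d^{-1} = q\mathbf{I}$, $\mathbf{C}_h^{-1} = (1-q)\mathbf{I}$. Two simplifications are assumptions of this sketch rather than the procedure described above: the derivative in step four is approximated by a central finite difference instead of analytic formulas, and the descent uses a fixed step size. All function names are hypothetical.

```python
import math

def gls_mean(d_obs, h_pri, q):
    """Step 3: solution (1) for M = 1, G = H = ones, Cd^-1 = q I, Ch^-1 = (1 - q) I."""
    N, K = len(d_obs), len(h_pri)
    Z = q * N + (1.0 - q) * K
    return (q * sum(d_obs) + (1.0 - q) * sum(h_pri)) / Z

def objective(d_obs, h_pri, q):
    """Objective (9) evaluated at m = m(q)."""
    N, K = len(d_obs), len(h_pri)
    m = gls_mean(d_obs, h_pri, q)
    E = q * sum((d - m) ** 2 for d in d_obs)
    L = (1.0 - q) * sum((h - m) ** 2 for h in h_pri)
    # ln det Cd = -N ln q and ln det Ch = -K ln(1 - q) for this parameterization
    return -N * math.log(q) - K * math.log(1.0 - q) + E + L

def tune(d_obs, h_pri, q0, step=0.01, n_iter=500, eps=1e-6):
    """Steps 2-6, with a finite-difference derivative standing in for step 4."""
    q = q0
    for _ in range(n_iter):
        grad = (objective(d_obs, h_pri, q + eps)
                - objective(d_obs, h_pri, q - eps)) / (2.0 * eps)
        q -= step * grad                      # step 5: move downhill in Psi
        q = min(max(q, 1e-3), 1.0 - 1e-3)     # keep both covariances positive
    return q, gls_mean(d_obs, h_pri, q)       # step 6: m_est = m(q_est)

q_est, m_est = tune([1.0] * 5, [0.0] * 5, q0=0.3)
```

Note that the loop searches only over $q$; the model parameter is recomputed from (1) at every iteration rather than being treated as an additional unknown. For the inputs shown, both returned values should be close to 0.5.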

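Our derivation of $\partial\Psi/\partial q_m$ uses three matrix derivatives, $\partial\mathbf{M}^{-1}/\partial q$, $\partial\mathbf{M}^{-1/2}/\partial q$ and $\partial\ln(\det\mathbf{M})/\partial q$, that may be unfamiliar to some readers, so we derive them here for completeness. Let $\mathbf{M}(\mathbf{q})$ be a square, invertible, differentiable matrix. Differentiating $\mathbf{M}^{-1}\mathbf{M} = \mathbf{I}$ yields $[\partial\mathbf{M}^{-1}/\partial q_m]\,\mathbf{M} + \mathbf{M}^{-1}\,[\partial\mathbf{M}/\partial q_m] = \mathbf{0}$, which can be rearranged into ( [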

$$\frac{\partial\mathbf{M}^{-1}}{\partial q_m} = -\mathbf{M}^{-1}\left[\frac{\partial\mathbf{M}}{\partial q_m}\right]\mathbf{M}^{-1} \tag{10}$$

Similarly, differentiating $\mathbf{M}^{-1/2}\mathbf{M}^{-1/2} = \mathbf{M}^{-1}$ and applying (10) yields the Sylvester equation:

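$$\frac{\partial\mathbf{M}^{-1/2}}{\partial q_m}\mathbf{M}^{-1/2} + \mathbf{M}^{-1/2}\frac{\partial\mathbf{M}^{-1/2}}{\partial q_m} = \frac{\partial\mathbf{M}^{-1}}{\partial q_m} = -\mathbf{M}^{-1}\left[\frac{\partial\mathbf{M}}{\partial q_m}\right]\mathbf{M}^{-1} \tag{11}$$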

We have not been able to determine a source for this equation, but in all likelihood, it has been derived previously. In practice, (11) is not significantly harder to compute than (10), because efficient algorithms for solving Sylvester equations [

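$$\frac{\partial\det(\mathbf{M})}{\partial q} = \mathrm{tr}\left(\mathrm{adj}(\mathbf{M})\frac{\partial\mathbf{M}}{\partial q}\right) = \mathrm{tr}\left(\det(\mathbf{M})\,\mathbf{M}^{-1}\frac{\partial\mathbf{M}}{\partial q}\right) = \det(\mathbf{M})\,\mathrm{tr}\left(\mathbf{M}^{-1}\frac{\partial\mathbf{M}}{\partial q}\right) \tag{12}$$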

where $\mathrm{adj}(\cdot)$ is the adjugate and $\mathrm{tr}(\cdot)$ is the trace, applying Laplace’s identity [

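$$\frac{\partial\ln(\det\mathbf{M})}{\partial q} = \frac{1}{\det(\mathbf{M})}\frac{\partial\det(\mathbf{M})}{\partial q} = \mathrm{tr}\left(\mathbf{M}^{-1}\frac{\partial\mathbf{M}}{\partial q}\right) \tag{13}$$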
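Identities (10) and (13) are easy to spot-check numerically. The sketch below does so for a hypothetical 2×2 matrix $\mathbf{M}(q)$ chosen only for illustration, comparing each right-hand side with a central finite difference; the helper names are assumptions of this example.

```python
import math

def inv2(M):
    """Inverse of a 2x2 matrix."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [[M[1][1] / det, -M[0][1] / det], [-M[1][0] / det, M[0][0] / det]]

def matmul2(A, B):
    """Product of two 2x2 matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def logdet2(M):
    return math.log(M[0][0] * M[1][1] - M[0][1] * M[1][0])

def M_of(q):
    """Hypothetical test matrix M(q)."""
    return [[2.0 + q, q], [q, 1.0 + q * q]]

def dM_of(q):
    """Analytic derivative dM/dq of the test matrix."""
    return [[1.0, 1.0], [1.0, 2.0 * q]]

q, eps = 0.5, 1e-6
Minv = inv2(M_of(q))

# Identity (10): dM^-1/dq = -M^-1 (dM/dq) M^-1
rhs10 = [[-v for v in row] for row in matmul2(matmul2(Minv, dM_of(q)), Minv)]
Ip, Im = inv2(M_of(q + eps)), inv2(M_of(q - eps))
err10 = max(abs((Ip[i][j] - Im[i][j]) / (2 * eps) - rhs10[i][j])
            for i in range(2) for j in range(2))

# Identity (13): d ln(det M)/dq = tr(M^-1 dM/dq)
prod = matmul2(Minv, dM_of(q))
rhs13 = prod[0][0] + prod[1][1]
lhs13 = (logdet2(M_of(q + eps)) - logdet2(M_of(q - eps))) / (2 * eps)
err13 = abs(lhs13 - rhs13)
```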

We begin the main derivation by considering the case in which the data covariance $\mathbf{C}_d(\mathbf{q})$ depends on a parameter vector $\mathbf{q}$, and the information covariance $\mathbf{C}_h$ is constant. The derivative of the GLS solution can be found by applying the chain rule to (1):

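$$\begin{aligned}
\frac{\partial\mathbf{m}^{\mathrm{est}}}{\partial q_m} &= \frac{\partial\mathbf{Z}^{-1}}{\partial q_m}\mathbf{G}^T\mathbf{C}_d^{-1}\mathbf{d}^{\mathrm{obs}} + \mathbf{Z}^{-1}\mathbf{G}^T\frac{\partial\mathbf{C}_d^{-1}}{\partial q_m}\mathbf{d}^{\mathrm{obs}} + \frac{\partial\mathbf{Z}^{-1}}{\partial q_m}\mathbf{H}^T\mathbf{C}_h^{-1}\mathbf{h}^{\mathrm{pri}} \\
&= \mathbf{Z}^{-1}\left(\mathbf{G}^T\frac{\partial\mathbf{C}_d^{-1}}{\partial q_m}\mathbf{d}^{\mathrm{obs}} - \frac{\partial\mathbf{Z}}{\partial q_m}\mathbf{m}^{\mathrm{est}}\right) \\
&\text{with}\quad \frac{\partial\mathbf{Z}}{\partial q_m} = \mathbf{G}^T\frac{\partial\mathbf{C}_d^{-1}}{\partial q_m}\mathbf{G} \quad\text{and}\quad \frac{\partial\mathbf{C}_d^{-1}}{\partial q_m} = -\mathbf{C}_d^{-1}\frac{\partial\mathbf{C}_d}{\partial q_m}\mathbf{C}_d^{-1}
\end{aligned} \tag{14}$$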

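Note that we have used (10). The derivatives of the normalized prediction error $\tilde{\mathbf{e}} \equiv \mathbf{C}_d^{-1/2}(\mathbf{d}^{\mathrm{obs}} - \mathbf{G}\mathbf{m}^{\mathrm{est}})$ and of the total error $E \equiv \tilde{\mathbf{e}}^T\tilde{\mathbf{e}}$ are: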

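$$\begin{aligned}
\frac{\partial\tilde{\mathbf{e}}}{\partial q_m} &= -\mathbf{C}_d^{-1/2}\mathbf{G}\frac{\partial\mathbf{m}^{\mathrm{est}}}{\partial q_m} + \frac{\partial\mathbf{C}_d^{-1/2}}{\partial q_m}\left(\mathbf{d}^{\mathrm{obs}} - \mathbf{G}\mathbf{m}^{\mathrm{est}}\right) \quad\text{and}\quad \frac{\partial E}{\partial q_m} = 2\,\tilde{\mathbf{e}}^T\frac{\partial\tilde{\mathbf{e}}}{\partial q_m} \\
&\text{with}\quad \frac{\partial\mathbf{C}_d^{-1/2}}{\partial q_m}\mathbf{C}_d^{-1/2} + \mathbf{C}_d^{-1/2}\frac{\partial\mathbf{C}_d^{-1/2}}{\partial q_m} = -\mathbf{C}_d^{-1}\frac{\partial\mathbf{C}_d}{\partial q_m}\mathbf{C}_d^{-1}
\end{aligned} \tag{15}$$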

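Here, the Sylvester equation arises from (11). An alternate way of differentiating $E$ that does not require solving a Sylvester equation, written in terms of the unnormalized prediction error $\mathbf{e} \equiv \mathbf{d}^{\mathrm{obs}} - \mathbf{G}\mathbf{m}^{\mathrm{est}}$, is: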

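$$\frac{\partial E}{\partial q_m} = \frac{\partial}{\partial q_m}\left(\mathbf{e}^T\mathbf{C}_d^{-1}\mathbf{e}\right) = -\left(\frac{\partial\mathbf{m}^{\mathrm{est}}}{\partial q_m}\right)^T\mathbf{G}^T\mathbf{C}_d^{-1}\mathbf{e} + \mathbf{e}^T\frac{\partial\mathbf{C}_d^{-1}}{\partial q_m}\mathbf{e} - \mathbf{e}^T\mathbf{C}_d^{-1}\mathbf{G}\frac{\partial\mathbf{m}^{\mathrm{est}}}{\partial q_m} \tag{16}$$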

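The derivatives of the normalized error in prior information $\tilde{\mathbf{l}} \equiv \mathbf{C}_h^{-1/2}(\mathbf{h}^{\mathrm{pri}} - \mathbf{H}\mathbf{m}^{\mathrm{est}})$ and of the total error $L \equiv \tilde{\mathbf{l}}^T\tilde{\mathbf{l}}$ are: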

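$$\frac{\partial\tilde{\mathbf{l}}}{\partial q_m} = -\mathbf{C}_h^{-1/2}\mathbf{H}\frac{\partial\mathbf{m}^{\mathrm{est}}}{\partial q_m} \quad\text{and}\quad \frac{\partial L}{\partial q_m} = 2\,\tilde{\mathbf{l}}^T\frac{\partial\tilde{\mathbf{l}}}{\partial q_m} \tag{17}$$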

Finally, since $\Psi = \ln(\det\mathbf{C}_d) + \ln(\det\mathbf{C}_h) + E + L$, we have:

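$$\frac{\partial\Psi}{\partial q_m} = \frac{\partial\ln(\det\mathbf{C}_d)}{\partial q_m} + \frac{\partial E}{\partial q_m} + \frac{\partial L}{\partial q_m} = \mathrm{tr}\left(\mathbf{C}_d^{-1}\frac{\partial\mathbf{C}_d}{\partial q_m}\right) + \frac{\partial E}{\partial q_m} + \frac{\partial L}{\partial q_m} \tag{18}$$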

Note that we have applied (13).
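A scalar special case makes it straightforward to compare (14), (16), (17) and (18) with a finite-difference derivative of (9). The sketch below assumes $M = 1$, $\mathbf{G}$ and $\mathbf{H}$ equal to columns of ones, $\mathbf{C}_d = q\mathbf{I}$ and a fixed $\mathbf{C}_h = \sigma_h^2\mathbf{I}$; these choices, the function name and the numerical values are illustrative assumptions.

```python
import math

def psi_and_derivative(d_obs, h_pri, var_h, q):
    """Psi (9) and dPsi/dq (18) for M = 1, G = H = ones, Cd = q I, Ch = var_h I."""
    N, K = len(d_obs), len(h_pri)
    Z = N / q + K / var_h
    m = (sum(d_obs) / q + sum(h_pri) / var_h) / Z      # solution (1)
    e = [d - m for d in d_obs]                         # unnormalized prediction error
    l = [h - m for h in h_pri]                         # unnormalized prior-information error
    E = sum(v * v for v in e) / q
    L = sum(v * v for v in l) / var_h
    psi = N * math.log(q) + K * math.log(var_h) + E + L

    dCd_inv = -1.0 / q ** 2                            # (10) with dCd/dq = I
    dZ = N * dCd_inv                                   # dZ/dq, from (14)
    dm = (dCd_inv * sum(d_obs) - dZ * m) / Z           # dm_est/dq, from (14)
    dE = -2.0 * dm * sum(e) / q + dCd_inv * sum(v * v for v in e)   # (16)
    dL = -2.0 * dm * sum(l) / var_h                    # (17)
    dpsi = N / q + dE + dL                             # (18): tr(Cd^-1 dCd/dq) = N / q
    return psi, dpsi

d_obs, h_pri, var_h, q, eps = [1.1, 0.9, 1.3, 0.7], [0.0], 4.0, 0.5, 1e-6
_, analytic = psi_and_derivative(d_obs, h_pri, var_h, q)
numeric = (psi_and_derivative(d_obs, h_pri, var_h, q + eps)[0]
           - psi_and_derivative(d_obs, h_pri, var_h, q - eps)[0]) / (2.0 * eps)
```

The two values `analytic` and `numeric` should agree to within the accuracy of the finite difference.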

We now consider the case in which the information covariance $\mathbf{C}_h(\mathbf{q})$ depends on parameters $\mathbf{q}$, and $\mathbf{C}_d$ is constant. Since the data and prior information play completely symmetric roles in (1), the derivatives can be obtained by interchanging the roles of $\mathbf{C}_d$ and $\mathbf{C}_h$, $\mathbf{G}$ and $\mathbf{H}$, $\mathbf{d}^{\mathrm{obs}}$ and $\mathbf{h}^{\mathrm{pri}}$, $\tilde{\mathbf{e}}$ and $\tilde{\mathbf{l}}$, and $E$ and $L$ in the equations above, yielding:

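$$\frac{\partial\mathbf{m}^{\mathrm{est}}}{\partial q_m} = \mathbf{Z}^{-1}\left(\mathbf{H}^T\frac{\partial\mathbf{C}_h^{-1}}{\partial q_m}\mathbf{h}^{\mathrm{pri}} - \frac{\partial\mathbf{Z}}{\partial q_m}\mathbf{m}^{\mathrm{est}}\right) \quad\text{with}\quad \frac{\partial\mathbf{Z}}{\partial q_m} = \mathbf{H}^T\frac{\partial\mathbf{C}_h^{-1}}{\partial q_m}\mathbf{H} \quad\text{and}\quad \frac{\partial\mathbf{C}_h^{-1}}{\partial q_m} = -\mathbf{C}_h^{-1}\frac{\partial\mathbf{C}_h}{\partial q_m}\mathbf{C}_h^{-1}$$

$$\frac{\partial\tilde{\mathbf{e}}}{\partial q_m} = -\mathbf{C}_d^{-1/2}\mathbf{G}\frac{\partial\mathbf{m}^{\mathrm{est}}}{\partial q_m} \quad\text{and}\quad \frac{\partial E}{\partial q_m} = 2\,\tilde{\mathbf{e}}^T\frac{\partial\tilde{\mathbf{e}}}{\partial q_m}$$

$$\frac{\partial\tilde{\mathbf{l}}}{\partial q_m} = -\mathbf{C}_h^{-1/2}\mathbf{H}\frac{\partial\mathbf{m}^{\mathrm{est}}}{\partial q_m} + \frac{\partial\mathbf{C}_h^{-1/2}}{\partial q_m}\left(\mathbf{h}^{\mathrm{pri}} - \mathbf{H}\mathbf{m}^{\mathrm{est}}\right)$$

$$\frac{\partial L}{\partial q_m} = 2\,\tilde{\mathbf{l}}^T\frac{\partial\tilde{\mathbf{l}}}{\partial q_m} = -\left(\frac{\partial\mathbf{m}^{\mathrm{est}}}{\partial q_m}\right)^T\mathbf{H}^T\mathbf{C}_h^{-1}\mathbf{l} + \mathbf{l}^T\frac{\partial\mathbf{C}_h^{-1}}{\partial q_m}\mathbf{l} - \mathbf{l}^T\mathbf{C}_h^{-1}\mathbf{H}\frac{\partial\mathbf{m}^{\mathrm{est}}}{\partial q_m} \quad\text{with}\quad \mathbf{l} \equiv \mathbf{h}^{\mathrm{pri}} - \mathbf{H}\mathbf{m}^{\mathrm{est}}$$

$$\frac{\partial\mathbf{C}_h^{-1/2}}{\partial q_m}\mathbf{C}_h^{-1/2} + \mathbf{C}_h^{-1/2}\frac{\partial\mathbf{C}_h^{-1/2}}{\partial q_m} = -\mathbf{C}_h^{-1}\frac{\partial\mathbf{C}_h}{\partial q_m}\mathbf{C}_h^{-1}$$

$$\frac{\partial\ln(\det\mathbf{C}_h)}{\partial q_m} = \mathrm{tr}\left(\mathbf{C}_h^{-1}\frac{\partial\mathbf{C}_h}{\partial q_m}\right)$$

$$\frac{\partial\Psi}{\partial q_m} = \mathrm{tr}\left(\mathbf{C}_h^{-1}\frac{\partial\mathbf{C}_h}{\partial q_m}\right) + \frac{\partial E}{\partial q_m} + \frac{\partial L}{\partial q_m} \tag{19}$$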

These formulas have been checked numerically.

In the first example, we examine the simplistic case in which the parameter $q$ represents an overall scaling of variance; that is, $\mathbf{C}_d(q) = q\,\mathbf{C}_d^{(0)}$ and $\mathbf{C}_h(q) = q\,\mathbf{C}_h^{(0)}$, with specified $\mathbf{C}_d^{(0)}$ and $\mathbf{C}_h^{(0)}$. The solution $\mathbf{m}^{\mathrm{est}}$ is independent of $q$, as can be verified by substitution into (1). The parameter $q$ can then be found by direct minimization of (9), which simplifies to:

$$\Psi = \ln\left(q^N\det\mathbf{C}_d^{(0)}\right) + \ln\left(q^K\det\mathbf{C}_h^{(0)}\right) + q^{-1}E_0 + q^{-1}L_0 \tag{20}$$

Here, we have used the rule $\det(q\mathbf{M}) = q^N\det(\mathbf{M})$ [

$$\frac{\partial\Psi}{\partial q} = 0 = (N+K)\,q^{-1} - (E_0 + L_0)\,q^{-2} \quad\text{or}\quad q = \frac{E_0 + L_0}{N+K} \tag{21}$$

This is a generalization of the well-known maximum likelihood estimate of the sample variance [
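The closed-form result (21) can be checked against (20) directly. In the short Python sketch below, the values of $N$, $K$, $E_0$ and $L_0$ are hypothetical, and the terms of (20) that do not depend on $q$ are omitted because they do not affect the location of the minimum.

```python
import math

def psi_scaled(q, N, K, E0, L0):
    """Objective (20) without the q-independent terms ln det Cd0 + ln det Ch0."""
    return (N + K) * math.log(q) + (E0 + L0) / q

N, K = 20, 5
E0, L0 = 30.0, 7.5               # hypothetical errors computed with Cd0 and Ch0
q_est = (E0 + L0) / (N + K)      # closed-form minimizer (21)

# The objective should be larger on either side of q_est
psi_min = psi_scaled(q_est, N, K, E0, L0)
psi_lo = psi_scaled(0.9 * q_est, N, K, E0, L0)
psi_hi = psi_scaled(1.1 * q_est, N, K, E0, L0)
```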

In the second example, we examine another simplistic case in which a parameter $q$ represents the relative weighting of variance; that is, $\mathbf{C}_d^{-1}(q) = q\,\mathbf{I}$ and $\mathbf{C}_h^{-1}(q) = (1-q)\,\mathbf{I}$. We consider the problem of estimating the mean $m_1$ of data, given observations $\mathbf{d} = \mathbf{1}$ and prior information $\mathbf{h} = \mathbf{0}$ (where $\mathbf{0}$ and $\mathbf{1}$ are vectors of zeros and ones, respectively), when $N = K$, $M = 1$ and $\mathbf{G} = \mathbf{H} = \mathbf{1}$. Applying (1), we find that $m^{\mathrm{est}} = q$. Then, the objective function is $\Psi = \ln(q^{-N}) + \ln((1-q)^{-N}) + Nq(1-q)$ and its derivative is $\partial\Psi/\partial q = N[-q^{-1} + (1-q)^{-1} + (1-q) - q]$. The solution to $\partial\Psi/\partial q = 0$ is $q^{\mathrm{est}} = 1/2$, as can be verified by direct substitution. Thus, the solution splits the difference between the observations and the prior values, and yields prior variances $\mathbf{C}_d$ and $\mathbf{C}_h$ that are equal. While simplistic, this problem illustrates that, at least in some cases, GLS is capable of uniquely determining the relative sizes of $\mathbf{C}_d$ and $\mathbf{C}_h$. Because trade-off curves, as defined in the Introduction, are based on the behavior of $E$ and $L$, and not on the complete objective function $\Psi$, the weighting parameter $q_0$ estimated from them will in general be different from $q^{\mathrm{est}}$. Consequently, the trade-off curve procedure is not consistent with the Bayesian framework upon which GLS rests.
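The intermediate steps of this example, which are left to the reader above, can be written out as follows. This is a sketch that uses only the definitions already given (solution (1), the errors $E$ and $L$, and the stated derivative of $\Psi$); the second-derivative check is an added verification that the stationary point is a minimum.

$$\begin{aligned}
\mathbf{Z} &= \mathbf{G}^T\mathbf{C}_d^{-1}\mathbf{G} + \mathbf{H}^T\mathbf{C}_h^{-1}\mathbf{H} = qN + (1-q)N = N, \qquad m^{\mathrm{est}} = \mathbf{Z}^{-1}\left(qN + 0\right) = q \\
E &= q\sum_{i=1}^{N}(1 - q)^2 = Nq(1-q)^2, \qquad L = (1-q)\sum_{i=1}^{N}(0 - q)^2 = N(1-q)\,q^2 \\
E + L &= Nq(1-q)\left[(1-q) + q\right] = Nq(1-q) \\
\left.\frac{\partial^2\Psi}{\partial q^2}\right|_{q=1/2} &= N\left[q^{-2} + (1-q)^{-2} - 2\right]_{q=1/2} = 6N > 0
\end{aligned}$$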

Our third example demonstrates the tuning of the data covariance $\mathbf{C}_d(\mathbf{q})$. In many cases, observational error increases during the course of an experiment, due to degradation of equipment or to worsening environmental conditions. The example demonstrates that the method is capable of accurately quantifying the fractional rate of increase $p$ of the variance $\sigma_{dn}$, which is assumed to vary with position $x_n$. In our simulation, we consider $N = 201$ synthetic data, evenly spaced on the interval $0 \le x_i \le 1$, which scatter around the curve $d_i = m_1 + m_2 x_i^{1/2}$ (

The fourth example demonstrates tuning of the information covariance $\mathbf{C}_h(\mathbf{q})$. In many instances, one may need to “reconstruct” or “interpolate” a function on the basis of unevenly and sparsely sampled data. In this case, prior information on the autocovariance of the function can enable a smooth interpolation. Furthermore, it can enforce a covariance structure that may be required, say, by the underlying physics of the problem. In our example, we suppose that the function is known to be oscillatory on physical grounds, but that the wavenumber of those oscillations is known only imprecisely. The goal is to tune prior knowledge of the wavenumber to arrive at a best estimate of the reconstructed function. In our simulation, a total of $M = 101$ model parameters $m_j$ are uniformly spaced on the interval $0 \le x \le 100$ and represent a sampled version of a continuous, sinusoidal function $m(x)$ with wavenumber $p^{\mathrm{true}} = 0.1571$ (

$\sigma_h^2 = (10)^2$. The derivative is $(\partial/\partial q)[\mathbf{C}_h]_{nm} = -\sigma_h^2\,|x_n - x_m|\sin(q\,|x_n - x_m|)$. An initial guess $p_0 = 0.95\,p^{\mathrm{true}}$ is improved using a gradient descent method, yielding an estimated value of $p^{\mathrm{est}} = 0.1571$ that differs from $p^{\mathrm{true}}$ by less than 0.01%. The reconstructed function is smooth and sinusoidal and the fit to the data is much improved.

Examples three and four were implemented in MATLAB® and executed in <5s on a notebook computer. They confirm the flexibility, speed and effectiveness of the method. An ability to tune prior information on autocovariance may be of special utility in seismic exploration applications, where three-dimensional waveform datasets are routinely interpolated.

A limitation of this overall “parametric” approach is that the solution is dependent on the choice of parameterization, which must be guided by prior knowledge of the general properties of the covariance matrices in the particular problem being solved. In Example 3, we were able to recognize (say, by visually examining the data plotted in

parameterization, say, $[\mathbf{C}_d(q)]_{nm} = \sigma_d^2\exp\left[-\tfrac{1}{2}\,q\,(x_n + x_m)\,|x_n - x_m|\right]$.

Not every parameterization of $\mathbf{C}_d$ (or $\mathbf{C}_h$) is necessarily well-behaved. To avoid poor behavior, the parameterization must be chosen so that its determinant does not have zeros at values of $\mathbf{q}$ that prevent the steepest descent process from converging to the global minimum. That this choice can be problematic is illustrated by the simple Toeplitz version of $\mathbf{C}_d$ (with $N = 10$, $J = 9$):

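$$\mathbf{C}_d = \begin{bmatrix}
1 & q_1 & q_2 & q_3 & \cdots & q_9 \\
q_1 & 1 & q_1 & q_2 & \cdots & q_8 \\
q_2 & q_1 & 1 & q_1 & \cdots & q_7 \\
q_3 & q_2 & q_1 & 1 & \cdots & q_6 \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\
q_9 & q_8 & q_7 & q_6 & \cdots & 1
\end{bmatrix} \tag{22}$$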

with $|q_i| < 1$. This form is useful for quantifying correlations within a stationary sequence of data [

by many $\det\mathbf{C}_d = 0$ surfaces that correspond to surfaces of singular objective function $\Psi$. Their presence suggests that the steepest descent path between a starting value $\mathbf{q}^{(0)}$ and the global minimum at $\mathbf{q}^{\mathrm{est}}$ may be very convoluted (if, indeed, such a path exists) unless $\mathbf{q}^{(0)}$ is very close to $\mathbf{q}^{\mathrm{est}}$.
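A smaller analog of (22) shows how such singular surfaces can arise inside the region $|q_i| < 1$. The case below ($N = 3$, $J = 2$) is an illustrative assumption chosen because its determinant factors in closed form; it is not the $N = 10$ case discussed above.

$$\det\begin{bmatrix} 1 & q_1 & q_2 \\ q_1 & 1 & q_1 \\ q_2 & q_1 & 1 \end{bmatrix} = 1 - 2q_1^2 + 2q_1^2 q_2 - q_2^2 = (1 - q_2)\left(1 + q_2 - 2q_1^2\right)$$

The determinant therefore vanishes on the curve $q_2 = 2q_1^2 - 1$, which passes through the interior of the square $|q_1| < 1$, $|q_2| < 1$ (for example, at $q_1 = 0.8$, $q_2 = 0.28$). On this curve $\ln(\det\mathbf{C}_d) \to -\infty$, so the objective function is singular there, and a descent path that starts on one side of the curve cannot be expected to cross it smoothly.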

Generalized Least Squares requires the assignment of two prior covariance matrices: the prior covariance of the data and the prior covariance of the prior information. Making these assignments is often a very subjective process. However, in cases in which the forms of these matrices can be anticipated up to a set of poorly known parameters, information contained within the data and prior information can be used to improve knowledge of them, a process we call “tuning”. Tuning can be achieved by minimizing an objective function that depends on both the generalized error and the determinants of the covariance matrices to arrive at a best estimate of the parameters. Analytic and computationally tractable formulas are derived for the derivatives needed to implement the minimization via a gradient descent method. Furthermore, the problem is organized so that the minimization need be performed only over the space of covariance parameters, and not over the typically much larger space of model and covariance parameters. Although some care needs to be exercised when the covariance matrices are parametrized, the minimization is tractable and can lead to better estimates of the model parameters. An important outcome of this study is the recognition that the use of trade-off curves to determine the relative weighting of covariance, a practice ubiquitous in geophysical imaging, is not consistent with the underlying Bayesian framework of Generalized Least Squares. The strategy outlined here provides a consistent solution.

The author thanks Roger Creel for helpful discussion.

The author declares no conflicts of interest regarding the publication of this paper.

Menke, W. (2021) Tuning of Prior Covariance in Generalized Least Squares. Applied Mathematics, 12, 157-170. https://doi.org/10.4236/am.2021.123011