Applied Mathematics
Vol. 12, No. 3 (2021), Article ID: 107768, 14 pages
10.4236/am.2021.123011
Tuning of Prior Covariance in Generalized Least Squares
William Menke
Lamont-Doherty Earth Observatory of Columbia University, New York, USA
Copyright © 2021 by author(s) and Scientific Research Publishing Inc.
This work is licensed under the Creative Commons Attribution International License (CC BY 4.0).
http://creativecommons.org/licenses/by/4.0/
Received: February 5, 2021; Accepted: March 14, 2021; Published: March 17, 2021
ABSTRACT
Generalized Least Squares (least squares with prior information) requires the correct assignment of two prior covariance matrices: one associated with the uncertainty of measurements; the other with the uncertainty of prior information. These assignments often are very subjective, especially when correlations among data or among prior information are believed to occur. However, in cases in which the general form of these matrices can be anticipated up to a set of poorly-known parameters, the data and prior information may be used to better-determine (or “tune”) the parameters in a manner that is faithful to the underlying Bayesian foundation of GLS. We identify an objective function, the minimization of which leads to the best-estimate of the parameters, and provide explicit and computationally-efficient formulas for calculating the derivatives needed to implement the minimization with a gradient descent method. Furthermore, the problem is organized so that the minimization need be performed only over the space of covariance parameters, and not over the combined space of model and covariance parameters. We show that the use of trade-off curves to select the relative weight given to observations and prior information is not a form of tuning, because it does not, in general, maximize the posterior probability of the model parameters, and can lead to a different weighting than the procedure described here. We also provide several examples that demonstrate the viability of the method, and discuss both its advantages and limitations.
Keywords:
Bayesian Inference, Covariance, Error, Generalized Least Squares, Gradient Descent, Interpolation, Regularization, Trade-Off Curve, Variance
1. Introduction
Generalized Least Squares (GLS, also called least squares with prior information) is a tool for statistical inference [1] - [6] that is widely used in geotomography [7] - [12] and geophysical inversion [13] [14], as well as in other areas of the physical sciences and engineering. One of the attractive features of GLS that makes it especially useful in the imaging of multidimensional fields (for example, density, velocity, viscosity) is its ability to implement, in a natural and versatile way, prior information on the behavior of the field. Widely-used types of prior information include the field being smooth, as quantified by its low-order derivatives [15], having a specified power spectral density or autocovariance [7] [15], and satisfying a specified partial differential equation (such as the geostrophic flow equation [16] or the diffusion equation [4] ). The word “regularization” sometimes is used to describe the effect of prior information on the solution process [17].
We review the Generalized Least Squares (GLS) method here, following the notation in [6], in order to provide context and to establish nomenclature. In GLS, observations (or data) and prior information (or inferences) are combined to arrive at a best-estimate of initially-unknown model parameters (which might, for example, represent a field sampled on a regular grid). The data are assumed to satisfy the linear equation $\mathbf{d} = \mathbf{G}\mathbf{m}$, where $\mathbf{d}$ is a length-$N$ vector of data, $\mathbf{m}$ is a length-$M$ vector of model parameters, and $\mathbf{G}$ is a known $N \times M$ “kernel” matrix associated with the data. Prior information is assumed to satisfy a linear equation $\mathbf{h} = \mathbf{H}\mathbf{m}$, where $\mathbf{h}$ is a length-$K$ vector of prior values and $\mathbf{H}$ is a $K \times M$ kernel matrix associated with the prior information. GLS problems are assumed to be over-determined, with $N + K \geq M$. For observed data $\mathbf{d}^{\mathrm{obs}}$, known prior information $\mathbf{h}^{\mathrm{pri}}$ and a specified model $\mathbf{m}$, the prediction error is $\mathbf{e} = \mathbf{d}^{\mathrm{obs}} - \mathbf{G}\mathbf{m}$ and the prior information error is $\mathbf{l} = \mathbf{h}^{\mathrm{pri}} - \mathbf{H}\mathbf{m}$. These errors are assumed to be Normally-distributed with zero mean and prior covariance $\mathbf{C}_d$ and $\mathbf{C}_h$, respectively. Then, the normalized errors $\mathbf{C}_d^{-1/2}\mathbf{e}$ and $\mathbf{C}_h^{-1/2}\mathbf{l}$ are independent and identically-distributed Normal random variables with zero mean and unit variance. Bayes theorem can be used to show that the best estimate $\mathbf{m}^{\mathrm{est}}$ of the solution is the one that minimizes the generalized error $\Phi(\mathbf{m}) = E + L$, with $E = \mathbf{e}^{\mathrm{T}}\mathbf{C}_d^{-1}\mathbf{e}$ and $L = \mathbf{l}^{\mathrm{T}}\mathbf{C}_h^{-1}\mathbf{l}$ [1] [2] [5]. The solution can be expressed in a variety of equivalent forms, among which is the widely-used version [6]:
$\mathbf{m}^{\mathrm{est}} = \left[ \mathbf{G}^{\mathrm{T}} \mathbf{C}_d^{-1} \mathbf{G} + \mathbf{H}^{\mathrm{T}} \mathbf{C}_h^{-1} \mathbf{H} \right]^{-1} \left[ \mathbf{G}^{\mathrm{T}} \mathbf{C}_d^{-1} \mathbf{d}^{\mathrm{obs}} + \mathbf{H}^{\mathrm{T}} \mathbf{C}_h^{-1} \mathbf{h}^{\mathrm{pri}} \right]$ (1)
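For readers who wish to experiment with (1), a minimal numerical sketch is given below. It is not part of the original derivation; the arrays G, H, Cd, Ch, dobs and hpri are generic stand-ins for the quantities defined above, and linear solves are used in place of explicit matrix inverses.

```python
import numpy as np

def gls_solution(G, Cd, dobs, H, Ch, hpri):
    """Sketch of the GLS estimate (1):
    m = [G^T Cd^-1 G + H^T Ch^-1 H]^-1 [G^T Cd^-1 dobs + H^T Ch^-1 hpri]."""
    A = G.T @ np.linalg.solve(Cd, G) + H.T @ np.linalg.solve(Ch, H)
    b = G.T @ np.linalg.solve(Cd, dobs) + H.T @ np.linalg.solve(Ch, hpri)
    return np.linalg.solve(A, b)
```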
The assumption of linear kernels $\mathbf{G}$ and $\mathbf{H}$ is a very restrictive one. In the well-studied nonlinear generalization [1] [6], the products $\mathbf{G}\mathbf{m}$ and $\mathbf{H}\mathbf{m}$ are replaced with vector functions $\mathbf{g}(\mathbf{m})$ and $\mathbf{h}(\mathbf{m})$. Then, a common solution method is to linearize the data and prior information equations around a trial solution $\mathbf{m}^{(k)}$:
$\mathbf{G}^{(k)}\,\Delta\mathbf{m} \approx \mathbf{d}^{\mathrm{obs}} - \mathbf{g}\left(\mathbf{m}^{(k)}\right) \quad \text{and} \quad \mathbf{H}^{(k)}\,\Delta\mathbf{m} \approx \mathbf{h}^{\mathrm{pri}} - \mathbf{h}\left(\mathbf{m}^{(k)}\right)$ (2)
with $\Delta\mathbf{m} \equiv \mathbf{m} - \mathbf{m}^{(k)}$, $\left[\mathbf{G}^{(k)}\right]_{ij} = \partial g_i / \partial m_j$ and $\left[\mathbf{H}^{(k)}\right]_{ij} = \partial h_i / \partial m_j$, evaluated at $\mathbf{m}^{(k)}$. The solution is then found by iterative application of (1) to (2); that is, by the Gauss-Newton method [3]. Alternatively, a gradient-descent method [18] can be used that employs:
$\frac{\partial \Phi}{\partial \mathbf{m}} = -2\left[\mathbf{G}^{(k)}\right]^{\mathrm{T}}\mathbf{C}_d^{-1}\left[\mathbf{d}^{\mathrm{obs}} - \mathbf{g}\left(\mathbf{m}^{(k)}\right)\right] - 2\left[\mathbf{H}^{(k)}\right]^{\mathrm{T}}\mathbf{C}_h^{-1}\left[\mathbf{h}^{\mathrm{pri}} - \mathbf{h}\left(\mathbf{m}^{(k)}\right)\right]$ (3)
The latter approach is preferred for very large M, since the convergence rate of gradient descent is independent of its dimension [18], whereas the effort required to solve the $M \times M$ system (1) by a direct method scales as $M^3$ [19].
We now discuss issues related to the covariance matrices that appear in GLS. The data covariance $\mathbf{C}_d$ quantifies the uncertainty of the observations and the information covariance $\mathbf{C}_h$ quantifies the uncertainty of the prior information. Prior knowledge of the inherent accuracy of the measurement technique is needed to assign $\mathbf{C}_d$, and prior knowledge of the physically-plausible solutions, perhaps stemming from an understanding of the underlying physics, is needed to assign $\mathbf{C}_h$. These assignments are often very subjective, especially when correlations are believed to occur (that is, when $\mathbf{C}_d$ and $\mathbf{C}_h$ have non-zero off-diagonal elements). For example, one geotomographic study [7] reconstructs a two-dimensional field using a $\mathbf{C}_h$ that represents the autocovariance of the field and that depends upon a scale length q. The value of q is chosen on the basis of broad physical arguments that, while plausible, leave considerable room for subjectivity.
The matrices $\mathbf{C}_d$ and $\mathbf{C}_h$ together contain $\tfrac{1}{2}N(N+1) + \tfrac{1}{2}K(K+1)$ independent elements, many more than the $N + K$ constraints imposed by the data $\mathbf{d}^{\mathrm{obs}}$ and prior information $\mathbf{h}^{\mathrm{pri}}$. Consequently, insufficient information is available to uniquely solve for all the elements of $\mathbf{C}_d$ and $\mathbf{C}_h$. However, it sometimes may be possible to parameterize $\mathbf{C}_d$ and/or $\mathbf{C}_h$ in terms of a small number J of parameters $\mathbf{q}$, and to ask whether an initial estimate $\mathbf{q}^{(0)}$ can be improved. As long as $J \ll N + K$, adequate information may be available to determine a best estimate $\mathbf{q}^{\mathrm{est}}$. We refer to the process of determining $\mathbf{q}^{\mathrm{est}}$ as “tuning”, since in typical practice it requires that the starting covariances already be close to their true values.
As an example of a parametrized covariance, we consider the case where the model parameters $\mathbf{m}$ represent a sampled version of a continuous function $m(x)$, where x is an independent variable; that is, $m_i = m(x_i)$, with $x_i = (i - 1)\,\Delta x$ and $\Delta x$ the sampling interval. The prior information that $m(x)$ is approximately oscillatory with wavenumber q can be modeled by:
$\mathbf{h}^{\mathrm{pri}} = \mathbf{0}, \quad \mathbf{H} = \mathbf{I}, \quad \left[\mathbf{C}_h\right]_{ij} = \sigma_h^2 \exp\left(-\frac{\left|x_i - x_j\right|}{L}\right) \cos\left(q\left(x_i - x_j\right)\right)$ (4)
In this case, $\mathbf{C}_h$ approximates the autocovariance of $m(x)$, which is assumed to be stationary. The goal of tuning is to provide a best-estimate $q^{\mathrm{est}}$ of the wavenumber, as well as best estimates of the model parameters. This problem is further developed in Example 4, below.
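A sketch of how such a parameterized covariance and its q-derivative might be assembled is given below; the damped-cosine form mirrors the reconstruction of (4) above, and the damping length L and amplitude sigma_h are illustrative assumptions.

```python
import numpy as np

def oscillatory_prior_cov(x, q, sigma_h=1.0, L=10.0):
    """Stationary, approximately oscillatory autocovariance (cf. (4)) and its
    analytic derivative with respect to the wavenumber q."""
    dx = x[:, None] - x[None, :]                 # matrix of x_i - x_j
    damp = sigma_h**2 * np.exp(-np.abs(dx) / L)  # damping keeps Ch positive-definite
    Ch = damp * np.cos(q * dx)                   # prior covariance Ch(q)
    dCh_dq = -damp * dx * np.sin(q * dx)         # dCh/dq, used in the tuning derivatives
    return Ch, dCh_dq
```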
Although the GLS formulation is widely used in geotomography and geophysical imaging, the tuning of variance is typically implemented in a very limited fashion, through the use of trade-off curves [7] - [12]. In this procedure, a scalar parameter q controls the relative size of $\mathbf{C}_d$ and $\mathbf{C}_h$; that is, $\mathbf{C}_h = q\,\mathbf{C}_{h0}$, where $\mathbf{C}_{h0}$ is specified [20]. The GLS problem is then solved for a suite of qs, the functions $E(q)$ and $L(q)$ are tabulated, and the resulting trade-off curve is used to identify a solution that has acceptably low E and L (for example, Figure 1 of [20] ). As we will show below, this ad hoc procedure is not a consistent extension of GLS, because it results in a different q than the one implied by Bayes’ principle. A more consistent approach is to apply Bayes theorem directly to estimate both the model parameters $\mathbf{m}$ and the covariance parameters $\mathbf{q}$. Such an approach has been implemented in the context of ordinary least squares [21] and the Markov chain Monte Carlo (MCMC) inversion method [22] (which is a computationally-intensive alternative to GLS). An important and novel result of this paper is a computationally-efficient procedure for tuning GLS in a Bayes-consistent manner.
2. Bayesian Extension of GLS
The general process of using Bayes’ theorem to construct a posterior probability density function (p.d.f.) that depends on unknown parameters and of estimating those parameters through the maximization of probability is very well understood [23]. In the current case, the p.d.f. has M model parameters and J covariance parameters, so the maximization process (implemented, say, with a gradient ascent method) must search an $(M + J)$-dimensional space. Our main purpose here is to show that the process can be organized in a way that makes use of the GLS solution (1) and thus reduces the dimensionality of the searched space to J.
The GLS solution (1) yields the $\mathbf{m}^{\mathrm{est}}$ that minimizes the generalized error $\Phi(\mathbf{m})$, or equivalently, the $\mathbf{m}^{\mathrm{est}}$ that maximizes the Normal posterior p.d.f. $p\left(\mathbf{m} \mid \mathbf{d}^{\mathrm{obs}}\right)$:
$p\left(\mathbf{m} \mid \mathbf{d}^{\mathrm{obs}}\right) \propto p\left(\mathbf{d}^{\mathrm{obs}} \mid \mathbf{m}\right)\,p\left(\mathbf{m}\right) \propto \exp\left\{-\tfrac{1}{2}\left[E(\mathbf{m}) + L(\mathbf{m})\right]\right\}$ (5)
Here, Bayes theorem [23] is used to relate the Normal posterior p.d.f. to the Normal likelihood $p\left(\mathbf{d}^{\mathrm{obs}} \mid \mathbf{m}\right)$ and the Normal prior $p\left(\mathbf{m}\right)$. When poorly known parameters $\mathbf{q}$ are added to the problem, they must be treated as additional random variables [22]. Writing $\mathbf{q} = \left(\mathbf{q}_A, \mathbf{q}_B\right)$, with $\mathbf{q}_A$ appearing in the likelihood (through $\mathbf{C}_d$) and $\mathbf{q}_B$ appearing in the prior (through $\mathbf{C}_h$), we have:
$p\left(\mathbf{m}, \mathbf{q} \mid \mathbf{d}^{\mathrm{obs}}\right) \propto p\left(\mathbf{d}^{\mathrm{obs}} \mid \mathbf{m}, \mathbf{q}_A\right)\,p\left(\mathbf{m} \mid \mathbf{q}_B\right)\,p_A\left(\mathbf{q}_A\right)\,p_B\left(\mathbf{q}_B\right)$ (6)
Here, we have assumed that $\mathbf{q}_A$ and $\mathbf{q}_B$ are not correlated with one another. The maximization with respect to the two variables $\mathbf{m}$ and $\mathbf{q}$ can be performed as a sequence of two single-variable maximizations:
$\mathbf{m}^{\mathrm{est}}\left(\mathbf{q}\right) = \arg\max_{\mathbf{m}}\; p\left(\mathbf{m}, \mathbf{q} \mid \mathbf{d}^{\mathrm{obs}}\right)$ (7a)
$\mathbf{q}^{\mathrm{est}} = \arg\max_{\mathbf{q}}\; p\left(\mathbf{m}^{\mathrm{est}}\left(\mathbf{q}\right), \mathbf{q} \mid \mathbf{d}^{\mathrm{obs}}\right)$ (7b)
$\mathbf{m}^{\mathrm{est}} = \mathbf{m}^{\mathrm{est}}\left(\mathbf{q}^{\mathrm{est}}\right)$ (7c)
In the special case of the uniform prior $p_A\left(\mathbf{q}_A\right)\,p_B\left(\mathbf{q}_B\right) = \text{constant}$, the maximization in (7a) is the GLS solution (1) at fixed $\mathbf{q}$. For the Normal p.d.f.:
$p\left(\mathbf{m}, \mathbf{q} \mid \mathbf{d}^{\mathrm{obs}}\right) \propto \left(\det\mathbf{C}_d\right)^{-1/2}\left(\det\mathbf{C}_h\right)^{-1/2}\exp\left\{-\tfrac{1}{2}\left[E\left(\mathbf{m}, \mathbf{q}\right) + L\left(\mathbf{m}, \mathbf{q}\right)\right]\right\}$ (8)
the maximization (7b) is equivalent to the minimization of an objective function $\Psi\left(\mathbf{q}\right)$, in which E and L are evaluated at $\mathbf{m}^{\mathrm{est}}\left(\mathbf{q}\right)$, defined as:
$\Psi\left(\mathbf{q}\right) = E\left(\mathbf{q}\right) + L\left(\mathbf{q}\right) + \ln\det\mathbf{C}_d\left(\mathbf{q}\right) + \ln\det\mathbf{C}_h\left(\mathbf{q}\right)$ (9)
The quantity $E = \mathbf{e}^{\mathrm{T}}\mathbf{C}_d^{-1}\mathbf{e}$ is best computed by finding the Cholesky decomposition $\mathbf{C}_d = \mathbf{R}^{\mathrm{T}}\mathbf{R}$, the algorithm [24] for which is implemented in many software environments, including MATLAB® and PYTHON/linalg. Then, $E$ is the squared length of the vector obtained by back-solving the triangular system $\mathbf{R}^{\mathrm{T}}\mathbf{u} = \mathbf{e}$ (and similarly for L); the same factorization yields $\ln\det\mathbf{C}_d = 2\sum_i \ln R_{ii}$. The nonlinear optimization problem of minimizing Ψ can be implemented using a gradient descent method, provided that the derivative $\partial\Psi/\partial q_i$ can be calculated [18]. In the next section, we derive analytic formulas for this and related derivatives.
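As a concrete illustration of (9), the sketch below evaluates Ψ for one choice of the covariance matrices: it solves (1), forms the normalized errors with Cholesky factors, and adds the two log-determinants. All array names are generic stand-ins.

```python
import numpy as np

def gls_objective(G, Cd, dobs, H, Ch, hpri):
    """Sketch of the objective Psi in (9) for given covariance matrices.
    Returns (Psi, m_est)."""
    # GLS solution (1)
    A = G.T @ np.linalg.solve(Cd, G) + H.T @ np.linalg.solve(Ch, H)
    b = G.T @ np.linalg.solve(Cd, dobs) + H.T @ np.linalg.solve(Ch, hpri)
    m = np.linalg.solve(A, b)

    e = dobs - G @ m                         # prediction error
    l = hpri - H @ m                         # prior-information error

    Rd = np.linalg.cholesky(Cd)              # lower-triangular factor, Cd = Rd Rd^T
    Rh = np.linalg.cholesky(Ch)
    E = np.sum(np.linalg.solve(Rd, e) ** 2)  # e^T Cd^-1 e via the normalized error
    L = np.sum(np.linalg.solve(Rh, l) ** 2)  # l^T Ch^-1 l
    logdet_Cd = 2.0 * np.sum(np.log(np.diag(Rd)))
    logdet_Ch = 2.0 * np.sum(np.log(np.diag(Rh)))

    return E + L + logdet_Cd + logdet_Ch, m
```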
3. Solution Method and Formula for Derivatives
The process of simultaneously estimating the covariance parameters $\mathbf{q}$ and model parameters $\mathbf{m}$ consists of six steps. First, the analytic forms of the covariance matrices $\mathbf{C}_d\left(\mathbf{q}\right)$ and $\mathbf{C}_h\left(\mathbf{q}\right)$ are specified, and their derivatives $\partial\mathbf{C}_d/\partial q_i$ and $\partial\mathbf{C}_h/\partial q_i$ are computed analytically. Second, an initial estimate $\mathbf{q}^{(0)}$ is identified. Third, the covariance matrices $\mathbf{C}_d\left(\mathbf{q}^{(0)}\right)$ and $\mathbf{C}_h\left(\mathbf{q}^{(0)}\right)$ are inserted into (1), yielding model parameters $\mathbf{m}\left(\mathbf{q}^{(0)}\right)$. Fourth, using formulas developed below, the value of the derivative $\partial\Psi/\partial q_i$ is calculated at $\mathbf{q}^{(0)}$. Fifth, a gradient descent method employing this derivative is used to iteratively perturb $\mathbf{q}^{(0)}$ towards the minimum of Ψ at $\mathbf{q}^{\mathrm{est}}$ (in the process repeating steps three through five many times). Sixth, the estimated model parameters are computed as $\mathbf{m}^{\mathrm{est}} = \mathbf{m}\left(\mathbf{q}^{\mathrm{est}}\right)$. This process is depicted in Figure 1.
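A minimal sketch of this loop is given below. For brevity it lets scipy approximate the gradient of Ψ by finite differences rather than using the analytic derivatives of Section 3, so it illustrates only the organization of steps two through six; cov_builder and q0 are hypothetical user-supplied quantities.

```python
import numpy as np
from scipy.optimize import minimize

def tune_covariance(q0, cov_builder, G, dobs, H, hpri):
    """Sketch of steps 2-6 of the tuning procedure.
    cov_builder(q) must return the pair (Cd(q), Ch(q))."""

    def psi(q):
        Cd, Ch = cov_builder(q)                       # step 3: assemble covariances
        A = G.T @ np.linalg.solve(Cd, G) + H.T @ np.linalg.solve(Ch, H)
        b = G.T @ np.linalg.solve(Cd, dobs) + H.T @ np.linalg.solve(Ch, hpri)
        m = np.linalg.solve(A, b)                     # GLS solution (1)
        e, l = dobs - G @ m, hpri - H @ m
        E = e @ np.linalg.solve(Cd, e)
        L = l @ np.linalg.solve(Ch, l)
        return (E + L + np.linalg.slogdet(Cd)[1]
                + np.linalg.slogdet(Ch)[1])           # objective (9)

    # steps 4-5: descend on Psi (gradient approximated by finite differences here)
    q_est = minimize(psi, np.atleast_1d(q0), method="L-BFGS-B").x
    Cd, Ch = cov_builder(q_est)                       # step 6: model at the tuned q
    A = G.T @ np.linalg.solve(Cd, G) + H.T @ np.linalg.solve(Ch, H)
    b = G.T @ np.linalg.solve(Cd, dobs) + H.T @ np.linalg.solve(Ch, hpri)
    return q_est, np.linalg.solve(A, b)
```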
Our derivation of $\partial\Psi/\partial q_i$ uses three matrix derivatives, $\partial\mathbf{A}^{-1}/\partial q$, $\partial\mathbf{A}^{-1/2}/\partial q$ and $\partial\left(\ln\det\mathbf{A}\right)/\partial q$, that may be unfamiliar to some readers, so we derive them here for completeness. Let $\mathbf{A}\left(q\right)$ be a square, invertible, differentiable matrix. Differentiating $\mathbf{A}\mathbf{A}^{-1} = \mathbf{I}$ yields $\left(\partial\mathbf{A}/\partial q\right)\mathbf{A}^{-1} + \mathbf{A}\,\partial\mathbf{A}^{-1}/\partial q = \mathbf{0}$, which can be rearranged into ( [25], their (36)):
$\frac{\partial\mathbf{A}^{-1}}{\partial q} = -\mathbf{A}^{-1}\,\frac{\partial\mathbf{A}}{\partial q}\,\mathbf{A}^{-1}$ (10)
Figure 1. Schematic depiction of the solution process. (a) The GLS solution $\mathbf{m}^{\mathrm{est}}\left(\mathbf{q}\right)$ (red curve) is considered a function of the covariance parameters $\mathbf{q}$ and its derivative (blue line) at a point $\mathbf{q}^{(0)}$ is computed by analytic differentiation of GLS equation (1); (b) The objective function Ψ (colors) is considered a function of $\mathbf{q}$. The results of (a) are used to compute its gradient at the point $\mathbf{q}^{(0)}$. The gradient descent method is used to iteratively perturb this point anti-parallel to the gradient until it reaches the minimum of the objective function, resulting in the best-estimate $\mathbf{q}^{\mathrm{est}}$. This value is then used to determine a best-estimate of the model parameters $\mathbf{m}^{\mathrm{est}}$, as depicted in (a).
Similarly, differentiating $\mathbf{A}^{-1/2}\,\mathbf{A}^{-1/2} = \mathbf{A}^{-1}$ and applying (10) yields the Sylvester equation:
$\frac{\partial\mathbf{A}^{-1/2}}{\partial q}\,\mathbf{A}^{-1/2} + \mathbf{A}^{-1/2}\,\frac{\partial\mathbf{A}^{-1/2}}{\partial q} = -\mathbf{A}^{-1}\,\frac{\partial\mathbf{A}}{\partial q}\,\mathbf{A}^{-1}$ (11)
We have not been able to determine a source for this equation, but in all likelihood it has been derived previously. In practice, (11) is not significantly harder to compute than (10), because efficient algorithms for solving Sylvester equations [26] and for computing a symmetric (principal) square root [27] are widely available and implemented in many software environments, including MATLAB® and PYTHON/linalg. The derivative of $\ln\det\mathbf{A}$ is derived starting with Jacobi’s formula [12]:
$\frac{\partial}{\partial q}\det\mathbf{A} = \mathrm{tr}\left(\mathrm{adj}\left(\mathbf{A}\right)\frac{\partial\mathbf{A}}{\partial q}\right)$ (12)
where $\mathrm{adj}\left(\mathbf{A}\right)$ is the adjugate and $\mathrm{tr}$ is the trace, applying Laplace’s identity [28] $\mathrm{adj}\left(\mathbf{A}\right) = \det\left(\mathbf{A}\right)\mathbf{A}^{-1}$ and the rule $\mathrm{tr}\left(c\,\mathbf{M}\right) = c\,\mathrm{tr}\left(\mathbf{M}\right)$ (where c is a scalar and $\mathbf{M}$ is a matrix) [29]. Finally, the determinant is moved to the left-hand side and the well-known relationship $\mathrm{d}\left(\ln f\right)/\mathrm{d}q = f^{-1}\,\mathrm{d}f/\mathrm{d}q$, for a differentiable function $f\left(q\right)$, is applied, yielding ( [25], their (38)):
$\frac{\partial}{\partial q}\ln\det\mathbf{A} = \mathrm{tr}\left(\mathbf{A}^{-1}\frac{\partial\mathbf{A}}{\partial q}\right)$ (13)
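The identities (10), (11) and (13) are easy to spot-check numerically. The sketch below compares each analytic expression with a central finite difference for a randomly generated, symmetric positive-definite family A(q), using scipy's Sylvester solver and principal matrix square root; the family A(q) itself is arbitrary.

```python
import numpy as np
from scipy.linalg import solve_sylvester, sqrtm, inv

rng = np.random.default_rng(0)
B0 = rng.standard_normal((5, 5))
B1 = rng.standard_normal((5, 5))

def A_of(q):  # a smooth family of symmetric positive-definite matrices
    B = B0 + q * B1
    return B @ B.T + 5.0 * np.eye(5)

q, dq = 0.7, 1e-6
A = A_of(q)
dA = (A_of(q + dq) - A_of(q - dq)) / (2 * dq)   # finite-difference dA/dq
Ainv = inv(A)
Ainvh = inv(sqrtm(A).real)                      # principal A^(-1/2)

# (10): d(A^-1)/dq = -A^-1 dA/dq A^-1
fd10 = (inv(A_of(q + dq)) - inv(A_of(q - dq))) / (2 * dq)
print(np.allclose(fd10, -Ainv @ dA @ Ainv, atol=1e-4))

# (11): X A^-1/2 + A^-1/2 X = -A^-1 dA/dq A^-1, with X = d(A^-1/2)/dq
X = solve_sylvester(Ainvh, Ainvh, -Ainv @ dA @ Ainv)
fd11 = (inv(sqrtm(A_of(q + dq)).real) - inv(sqrtm(A_of(q - dq)).real)) / (2 * dq)
print(np.allclose(X, fd11, atol=1e-4))

# (13): d(ln det A)/dq = tr(A^-1 dA/dq)
fd13 = (np.linalg.slogdet(A_of(q + dq))[1] - np.linalg.slogdet(A_of(q - dq))[1]) / (2 * dq)
print(np.isclose(fd13, np.trace(Ainv @ dA)))
```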
We begin the main derivation by considering the case in which the data covariance $\mathbf{C}_d$ depends on a parameter vector $\mathbf{q}$, and the information covariance $\mathbf{C}_h$ is constant. The derivative of the GLS solution $\mathbf{m}^{\mathrm{est}}$ can be found by applying the chain rule to (1):
$\frac{\partial\mathbf{m}^{\mathrm{est}}}{\partial q_i} = -\mathbf{A}^{-1}\,\mathbf{G}^{\mathrm{T}}\,\mathbf{C}_d^{-1}\,\frac{\partial\mathbf{C}_d}{\partial q_i}\,\mathbf{C}_d^{-1}\left(\mathbf{d}^{\mathrm{obs}} - \mathbf{G}\,\mathbf{m}^{\mathrm{est}}\right) \quad \text{with} \quad \mathbf{A} \equiv \mathbf{G}^{\mathrm{T}}\,\mathbf{C}_d^{-1}\,\mathbf{G} + \mathbf{H}^{\mathrm{T}}\,\mathbf{C}_h^{-1}\,\mathbf{H}$ (14)
Note that we have used (10). The derivatives of the normalized prediction error $\bar{\mathbf{e}} = \mathbf{C}_d^{-1/2}\mathbf{e}$ and of the total data error $E = \bar{\mathbf{e}}^{\mathrm{T}}\bar{\mathbf{e}}$ are:
$\frac{\partial\bar{\mathbf{e}}}{\partial q_i} = \frac{\partial\mathbf{C}_d^{-1/2}}{\partial q_i}\,\mathbf{e} - \mathbf{C}_d^{-1/2}\,\mathbf{G}\,\frac{\partial\mathbf{m}^{\mathrm{est}}}{\partial q_i} \quad \text{and} \quad \frac{\partial E}{\partial q_i} = 2\,\bar{\mathbf{e}}^{\mathrm{T}}\,\frac{\partial\bar{\mathbf{e}}}{\partial q_i}$ (15)
Here, the derivative $\partial\mathbf{C}_d^{-1/2}/\partial q_i$ is found by solving the Sylvester equation that arises from (11). An alternate way of differentiating E that does not require solving a Sylvester equation is:
$\frac{\partial E}{\partial q_i} = \mathbf{e}^{\mathrm{T}}\,\frac{\partial\mathbf{C}_d^{-1}}{\partial q_i}\,\mathbf{e} - 2\,\mathbf{e}^{\mathrm{T}}\,\mathbf{C}_d^{-1}\,\mathbf{G}\,\frac{\partial\mathbf{m}^{\mathrm{est}}}{\partial q_i} \quad \text{with} \quad \frac{\partial\mathbf{C}_d^{-1}}{\partial q_i} = -\mathbf{C}_d^{-1}\,\frac{\partial\mathbf{C}_d}{\partial q_i}\,\mathbf{C}_d^{-1}$ (16)
The derivatives of the normalized error in prior information $\bar{\mathbf{l}} = \mathbf{C}_h^{-1/2}\mathbf{l}$ and of the total error $L = \bar{\mathbf{l}}^{\mathrm{T}}\bar{\mathbf{l}}$ are:
$\frac{\partial\bar{\mathbf{l}}}{\partial q_i} = -\mathbf{C}_h^{-1/2}\,\mathbf{H}\,\frac{\partial\mathbf{m}^{\mathrm{est}}}{\partial q_i} \quad \text{and} \quad \frac{\partial L}{\partial q_i} = 2\,\bar{\mathbf{l}}^{\mathrm{T}}\,\frac{\partial\bar{\mathbf{l}}}{\partial q_i}$ (17)
Finally, since $\Psi = E + L + \ln\det\mathbf{C}_d + \ln\det\mathbf{C}_h$ and only $\mathbf{C}_d$ depends on $q_i$ in this case, we have:
$\frac{\partial\Psi}{\partial q_i} = \frac{\partial E}{\partial q_i} + \frac{\partial L}{\partial q_i} + \mathrm{tr}\left(\mathbf{C}_d^{-1}\,\frac{\partial\mathbf{C}_d}{\partial q_i}\right)$ (18)
Note that we have applied (13).
Finally, we consider the case in which the information covariance $\mathbf{C}_h$ depends on parameters $\mathbf{q}$, and $\mathbf{C}_d$ is constant. Since the data and prior information play completely symmetric roles in (1), the derivatives can be obtained by interchanging the roles of $\mathbf{d}^{\mathrm{obs}}$ and $\mathbf{h}^{\mathrm{pri}}$, $\mathbf{G}$ and $\mathbf{H}$, $\mathbf{C}_d$ and $\mathbf{C}_h$, and E and L in the equations above, yielding:
$\frac{\partial\mathbf{m}^{\mathrm{est}}}{\partial q_i} = -\mathbf{A}^{-1}\,\mathbf{H}^{\mathrm{T}}\,\mathbf{C}_h^{-1}\,\frac{\partial\mathbf{C}_h}{\partial q_i}\,\mathbf{C}_h^{-1}\left(\mathbf{h}^{\mathrm{pri}} - \mathbf{H}\,\mathbf{m}^{\mathrm{est}}\right)$
$\frac{\partial E}{\partial q_i} = -2\,\mathbf{e}^{\mathrm{T}}\,\mathbf{C}_d^{-1}\,\mathbf{G}\,\frac{\partial\mathbf{m}^{\mathrm{est}}}{\partial q_i} \quad \text{and} \quad \frac{\partial L}{\partial q_i} = \mathbf{l}^{\mathrm{T}}\,\frac{\partial\mathbf{C}_h^{-1}}{\partial q_i}\,\mathbf{l} - 2\,\mathbf{l}^{\mathrm{T}}\,\mathbf{C}_h^{-1}\,\mathbf{H}\,\frac{\partial\mathbf{m}^{\mathrm{est}}}{\partial q_i}$
$\frac{\partial\Psi}{\partial q_i} = \frac{\partial E}{\partial q_i} + \frac{\partial L}{\partial q_i} + \mathrm{tr}\left(\mathbf{C}_h^{-1}\,\frac{\partial\mathbf{C}_h}{\partial q_i}\right)$ (19)
These formulas have been checked numerically.
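Such a check can be reproduced along the following lines. The sketch compares the chain-rule derivative of the GLS solution with respect to a data-covariance parameter (cf. (14), as reconstructed here) against a central finite difference on a small synthetic problem; all arrays are illustrative stand-ins.

```python
import numpy as np

rng = np.random.default_rng(1)
N, K, M = 8, 3, 3
G = rng.standard_normal((N, M)); H = np.eye(K, M)
dobs = rng.standard_normal(N);   hpri = np.zeros(K)
Ch = np.eye(K)
x = np.linspace(0.0, 1.0, N)

def Cd_of(p):            # data variance increasing linearly with x (Example 3 style)
    return np.diag(1.0 + p * x)

def m_of(p):
    Cd = Cd_of(p)
    A = G.T @ np.linalg.solve(Cd, G) + H.T @ np.linalg.solve(Ch, H)
    b = G.T @ np.linalg.solve(Cd, dobs) + H.T @ np.linalg.solve(Ch, hpri)
    return np.linalg.solve(A, b), A, Cd

p = 0.4
m, A, Cd = m_of(p)
dCd = np.diag(x)                                   # analytic dCd/dp
Cdi = np.linalg.inv(Cd)
e = dobs - G @ m
# chain-rule derivative, as in (14): dm/dp = -A^-1 G^T Cd^-1 (dCd/dp) Cd^-1 e
dm_analytic = -np.linalg.solve(A, G.T @ Cdi @ dCd @ Cdi @ e)

dp = 1e-6
dm_numeric = (m_of(p + dp)[0] - m_of(p - dp)[0]) / (2 * dp)
print(np.allclose(dm_analytic, dm_numeric, atol=1e-6))
```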
4. Examples with Discussion
In the first example, we examine the simplistic case in which the parameter q represents an overall scaling of variance; that is, $\mathbf{C}_d = q\,\mathbf{C}_{d0}$ and $\mathbf{C}_h = q\,\mathbf{C}_{h0}$, with $\mathbf{C}_{d0}$ and $\mathbf{C}_{h0}$ specified and q unknown. The solution $\mathbf{m}^{\mathrm{est}}$ is independent of q, as can be verified by substitution into (1). The parameter q can then be found by direct minimization of (9), which simplifies to:
$\Psi\left(q\right) = q^{-1}\left(E_0 + L_0\right) + \left(N + K\right)\ln q + \ln\det\mathbf{C}_{d0} + \ln\det\mathbf{C}_{h0}$ (20)
Here, we have used the rule $\det\left(q\,\mathbf{M}\right) = q^{N}\det\mathbf{M}$ [25], valid for any $N \times N$ matrix $\mathbf{M}$, and have defined $E_0 \equiv \mathbf{e}^{\mathrm{T}}\mathbf{C}_{d0}^{-1}\mathbf{e}$ and $L_0 \equiv \mathbf{l}^{\mathrm{T}}\mathbf{C}_{h0}^{-1}\mathbf{l}$. Setting $\mathrm{d}\Psi/\mathrm{d}q = -q^{-2}\left(E_0 + L_0\right) + \left(N + K\right)q^{-1}$ to zero shows that the minimum occurs when:
$q^{\mathrm{est}} = \frac{E_0 + L_0}{N + K}$ (21)
This is a generalization of the well-known maximum likelihood estimate of the variance of a sample [30]. As long as the GLS solution $\mathbf{m}^{\mathrm{est}}$ exists, the minimization of (20) is well-behaved and the overall scaling q is uniquely determined.
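A quick numerical confirmation of (21): for a small synthetic problem, the value of q that minimizes (20) on a fine grid coincides (to grid resolution) with (E0 + L0)/(N + K). All arrays below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
N, K, M = 20, 5, 5
G = rng.standard_normal((N, M)); H = np.eye(K, M)
dobs = rng.standard_normal(N);   hpri = np.zeros(K)
Cd0 = np.eye(N); Ch0 = np.eye(K)

# m_est is independent of the overall scaling q (it cancels in (1))
A = G.T @ np.linalg.solve(Cd0, G) + H.T @ np.linalg.solve(Ch0, H)
b = G.T @ np.linalg.solve(Cd0, dobs) + H.T @ np.linalg.solve(Ch0, hpri)
m = np.linalg.solve(A, b)
e, l = dobs - G @ m, hpri - H @ m
E0 = e @ np.linalg.solve(Cd0, e)
L0 = l @ np.linalg.solve(Ch0, l)

q_grid = np.linspace(0.05, 5.0, 20000)
psi = (E0 + L0) / q_grid + (N + K) * np.log(q_grid)  # objective (20), up to a constant
print(q_grid[np.argmin(psi)], (E0 + L0) / (N + K))   # grid minimum vs. formula (21)
```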
In the second example, we examine another simplistic case, in which a parameter q represents the relative weighting of the two variances; that is, $\mathbf{C}_d = \mathbf{C}_{d0}$ and $\mathbf{C}_h = q\,\mathbf{C}_{h0}$. We consider the problem of estimating the mean $m_1$ of the data, given N observations $\mathbf{d}^{\mathrm{obs}} = \bar{d}\,\mathbf{1}$ and prior information $\mathbf{h}^{\mathrm{pri}} = \mathbf{0}$ (where $\mathbf{0}$ and $\mathbf{1}$ are vectors of zeros and ones, respectively), when $\mathbf{G} = \mathbf{1}$, $\mathbf{H} = \mathbf{1}$ and $\mathbf{C}_{d0} = \mathbf{C}_{h0} = \mathbf{I}$. Applying (1), we find an estimate $m_1^{\mathrm{est}}\left(q\right)$ that depends on q; the objective function $\Psi\left(q\right)$ and its derivative $\mathrm{d}\Psi/\mathrm{d}q$ then follow from (9) and (19). The solution $q^{\mathrm{est}}$ to $\mathrm{d}\Psi/\mathrm{d}q = 0$ can be verified by direct substitution. Thus, the solution splits the difference between the observations and the prior values, and yields tuned prior variances $\mathbf{C}_d$ and $\mathbf{C}_h$ that are equal. While simplistic, this problem illustrates that, at least in some cases, GLS is capable of uniquely determining the relative sizes of $\mathbf{C}_d$ and $\mathbf{C}_h$. Because trade-off curves, as defined in the Introduction, are based on the behavior of E and L, and not on the complete objective function Ψ, the weighting parameter estimated from them will, in general, differ from $q^{\mathrm{est}}$. Consequently, the trade-off curve procedure is not consistent with the Bayesian framework upon which GLS rests.
Our third example demonstrates the tuning of the data covariance $\mathbf{C}_d$. In many cases, observational error increases during the course of an experiment, due to degradation of equipment or to worsening environmental conditions. The example demonstrates that the method is capable of accurately quantifying the fractional rate of increase p of the variance $\sigma_d^2\left(x\right)$, which is assumed to vary with position x. In our simulation, we consider N synthetic data, evenly-spaced on an interval in x, which scatter around the straight line $d\left(x\right) = m_1 + m_2 x$ (Figure 2). The covariance of the data is modeled as $\left[\mathbf{C}_d\right]_{ij} = \sigma_d^2\left(1 + p\,x_i\right)\delta_{ij}$, where $\delta_{ij}$ is the Kronecker delta; that is, the data are uncorrelated and their variance increases linearly with x. The derivative of the covariance is $\partial\left[\mathbf{C}_d\right]_{ij}/\partial p = \sigma_d^2\,x_i\,\delta_{ij}$. We have included prior information $\mathbf{h}^{\mathrm{pri}} = \mathbf{0}$ with $\mathbf{H} = \mathbf{I}$, which implements the notion that the model parameters are small. The corresponding covariance is chosen to be large, $\mathbf{C}_h = \sigma_h^2\,\mathbf{I}$ with $\sigma_h^2 \gg \sigma_d^2$, indicating that this information is weak. The goal is to tune the rate of increase of variance p and to arrive at a best-estimate of the two model parameters. The starting value is taken to be $p^{(0)} = 0$, which corresponds to uniform variance. It is successively improved by a gradient descent method that minimizes Ψ, yielding an estimated value $p^{\mathrm{est}}$. This estimate differs from the true value by about 1%. The estimated solution $\mathbf{m}^{\mathrm{est}}$ differs from the solution computed with the starting value $p^{(0)}$ by a few tenths of a percent, which may be significant in some applications.
Figure 2. Example of tuning $\mathbf{C}_d$. (a) Plot of synthetic data (red dots) and predicted data (green curve); (b) The starting value $p^{(0)} = 0$ corresponds to uniform variance (black curve). The estimate $p^{\mathrm{est}}$ corresponds to increasing variance (green curve); (c) The objective function Ψ (black curve). The starting value (black circle) is successively improved (red circles) by a gradient descent method, yielding an estimate $p^{\mathrm{est}}$ (green circle); (d) The gradient $\mathrm{d}\Psi/\mathrm{d}p$, computed using the formulas developed in the text; (e) The first model parameter $m_1$, highlighting the initial value (black circle) and estimated value (green circle); (f) Same as (e), except for the second model parameter $m_2$.
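A simulation in this spirit can be prototyped as sketched below. The numerical values (N, sigma_d, p_true, the interval in x) are chosen here for illustration and are not those of the figure; a bounded one-parameter minimization of Ψ stands in for the gradient descent described above.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(3)
N = 100
x = np.linspace(0.0, 10.0, N)
m_true = np.array([1.0, 0.5])
sigma_d, p_true = 0.3, 0.4
var_true = sigma_d**2 * (1.0 + p_true * x)                 # variance grows linearly with x
dobs = m_true[0] + m_true[1] * x + rng.standard_normal(N) * np.sqrt(var_true)

G = np.column_stack([np.ones(N), x])                       # straight-line data kernel
H = np.eye(2); hpri = np.zeros(2); Ch = 1.0e4 * np.eye(2)  # weak "smallness" prior

def psi(p):
    Cd = np.diag(sigma_d**2 * (1.0 + p * x))               # Cd(p), diagonal
    A = G.T @ np.linalg.solve(Cd, G) + H.T @ np.linalg.solve(Ch, H)
    b = G.T @ np.linalg.solve(Cd, dobs) + H.T @ np.linalg.solve(Ch, hpri)
    m = np.linalg.solve(A, b)
    e, l = dobs - G @ m, hpri - H @ m
    return (e @ np.linalg.solve(Cd, e) + l @ np.linalg.solve(Ch, l)
            + np.linalg.slogdet(Cd)[1] + np.linalg.slogdet(Ch)[1])

p_est = minimize_scalar(psi, bounds=(0.0, 2.0), method="bounded").x
print("true p:", p_true, " tuned p:", p_est)
```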
The fourth example demonstrates tuning of the information covariance $\mathbf{C}_h$. In many instances, one may need to “reconstruct” or “interpolate” a function on the basis of unevenly and sparsely sampled data. In this case, prior information on the autocovariance of the function can enable a smooth interpolation. Furthermore, it can enforce a covariance structure that may be required, say, by the underlying physics of the problem. In our example, we suppose that the function is known to be oscillatory on physical grounds, but that the wavenumber of those oscillations is known only imprecisely. The goal is to tune prior knowledge of the wavenumber to arrive at a best-estimate of the reconstructed function. In our simulation, a total of M model parameters are uniformly spaced on an interval in x and represent a sampled version of a continuous, sinusoidal function with wavenumber $q_{\mathrm{true}}$ (Figure 3). Synthetic data with uncorrelated error of variance $\sigma_d^2$ are available for N randomly-chosen points $x_{k(i)}$, where the index function $k\left(i\right)$ aligns the observations in x with the model parameters. The data kernel is $G_{ij} = \delta_{j\,k(i)}$. The prior information is given in (4), with autocovariance $\left[\mathbf{C}_h\right]_{ij} = \sigma_h^2\exp\left(-\left|x_i - x_j\right|/L\right)\cos\left(q\left(x_i - x_j\right)\right)$ and $\mathbf{h}^{\mathrm{pri}} = \mathbf{0}$. The derivative is $\partial\left[\mathbf{C}_h\right]_{ij}/\partial q = -\sigma_h^2\left(x_i - x_j\right)\exp\left(-\left|x_i - x_j\right|/L\right)\sin\left(q\left(x_i - x_j\right)\right)$. An initial guess $q^{(0)}$ is improved using a gradient descent method, yielding an estimated value $q^{\mathrm{est}}$ that differs from $q_{\mathrm{true}}$ by less than 0.01%. The reconstructed function is smooth and sinusoidal, and the fit to the data is much improved.
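A prototype of this reconstruction problem is sketched below. The damped-cosine covariance follows the form assumed for (4) above, and every numerical value (M, N, the damping length L, the noise level, the search bounds) is illustrative only.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(4)
M, N = 101, 25
x = np.linspace(0.0, 100.0, M)
q_true = 0.3
m_true = np.sin(q_true * x)                       # sinusoidal field to be reconstructed

k = np.sort(rng.choice(M, size=N, replace=False)) # indices of the sampled points
G = np.zeros((N, M)); G[np.arange(N), k] = 1.0    # sampling kernel G_ij = delta_{j,k(i)}
sigma_d = 0.05
dobs = m_true[k] + sigma_d * rng.standard_normal(N)
Cd = sigma_d**2 * np.eye(N)
H = np.eye(M); hpri = np.zeros(M)

def Ch_of(q, sigma_h=1.0, L=30.0):                # damped-cosine prior covariance (cf. (4))
    dx = x[:, None] - x[None, :]
    return sigma_h**2 * np.exp(-np.abs(dx) / L) * np.cos(q * dx)

def psi(q):
    Ch = Ch_of(q)
    A = G.T @ np.linalg.solve(Cd, G) + H.T @ np.linalg.solve(Ch, H)
    b = G.T @ np.linalg.solve(Cd, dobs) + H.T @ np.linalg.solve(Ch, hpri)
    m = np.linalg.solve(A, b)
    e, l = dobs - G @ m, hpri - H @ m
    return (e @ np.linalg.solve(Cd, e) + l @ np.linalg.solve(Ch, l)
            + np.linalg.slogdet(Cd)[1] + np.linalg.slogdet(Ch)[1])

q_est = minimize_scalar(psi, bounds=(0.2, 0.4), method="bounded").x
print("true wavenumber:", q_true, " tuned wavenumber:", q_est)
```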
Examples three and four were implemented in MATLAB® and executed in <5s on a notebook computer. They confirm the flexibility, speed and effectiveness of the method. An ability to tune prior information on autocovariance may be of special utility in seismic exploration applications, where three-dimensional waveform datasets are routinely interpolated.
Figure 3. Example of tuning $\mathbf{C}_h$. Sparsely-sampled synthetic data (red dots) are oscillatory. (a) A regularly-sampled version $\mathbf{m}^{\mathrm{est}}$ is created by imposing the oscillatory covariance $\mathbf{C}_h\left(q\right)$. With the starting value $q^{(0)}$, the reconstruction poorly fits the data (black curve). Tuning leads to a better fit (green curve with dots), as well as a precise estimate of the wavenumber $q^{\mathrm{est}}$; (b) Decrease in Ψ with iteration number during the gradient descent process.
A limitation of this overall “parametric” approach is that the solution depends on the choice of parameterization, which must be guided by prior knowledge of the general properties of the covariance matrices in the particular problem being solved. In Example 3, we were able to recognize (say, by visually examining the data plotted in Figure 2(a)) that observational error increases with x and chose a $\mathbf{C}_d\left(p\right)$ that matched this scenario. If, instead, the degree of correlation between successive data increased with x, this pattern might be less expected, more difficult to detect, and require a different parameterization, say one in which the correlation length of $\mathbf{C}_d$ grows with x.
Not every parameterization of $\mathbf{C}_d$ (or $\mathbf{C}_h$) is necessarily well-behaved. To avoid poor behavior, the parameterization must be chosen so that its determinant does not have zeros at values of $\mathbf{q}$ that would prevent the steepest descent process from converging to the global minimum. That this choice can be problematical is illustrated by the simple Toeplitz version of $\mathbf{C}_h$ (with $J = K - 1$ parameters $q_1, \cdots, q_{K-1}$):
$\mathbf{C}_h = \sigma_h^2 \begin{bmatrix} 1 & q_1 & q_2 & \cdots & q_{K-1} \\ q_1 & 1 & q_1 & \cdots & q_{K-2} \\ q_2 & q_1 & 1 & \cdots & q_{K-3} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ q_{K-1} & q_{K-2} & q_{K-3} & \cdots & 1 \end{bmatrix}$ (22)
with $-1 < q_i < 1$. This form is useful for quantifying correlations within a stationary sequence of data [31]. Yet, as is illustrated in Figure 4, this volume of allowable $\mathbf{q}$ is crossed by many surfaces on which $\det\mathbf{C}_h = 0$ and the objective function Ψ is consequently singular. Their presence suggests that the steepest descent path between a starting value $\mathbf{q}^{(0)}$ and the global minimum at $\mathbf{q}^{\mathrm{est}}$ may be very convoluted (if, indeed, such a path exists) unless $\mathbf{q}^{(0)}$ is very close to $\mathbf{q}^{\mathrm{est}}$.
Figure 4. The function $\det\mathbf{C}_h\left(\mathbf{q}\right)$ for the case given by (22). (a) The surface $\det\mathbf{C}_h = 0$ for one value of $q_3$, with the other qs randomly assigned; (b) Same as (a), but for a second value of $q_3$; (c) Same as (a), but for a third value of $q_3$; (d) Perspective view of the surfaces in the $\left(q_1, q_2, q_3\right)$ volume. The positions of the three slices in (a), (b) and (c) are noted on the $q_3$-axis (green arrows). A question posed in the text is whether, given an arbitrary point $\mathbf{q}^{(0)}$ and the global minimum of the objective function, say at $\mathbf{q}^{\mathrm{est}}$ (and with both points satisfying $\det\mathbf{C}_h > 0$), a steepest-descent path necessarily exists between them.
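The determinant zeros are easy to visualize numerically. The sketch below scans det Ch over a (q1, q2) slice of the parameter volume for a small Toeplitz matrix of the form (22) (its size and the fixed values are chosen purely for illustration) and reports the fraction of the slice on which det Ch <= 0, which is where the ln det Ch term of Ψ is singular.

```python
import numpy as np
from scipy.linalg import toeplitz

rng = np.random.default_rng(5)
K = 6                                    # illustrative size; first row is (1, q1, ..., q_{K-1})
q_rest = rng.uniform(-0.5, 0.5, K - 4)   # the "other qs" (q4, ...), randomly assigned

def det_Ch(q1, q2, q3):
    """Determinant of the K x K symmetric Toeplitz covariance of (22), sigma_h = 1."""
    first_col = np.concatenate(([1.0, q1, q2, q3], q_rest))
    return np.linalg.det(toeplitz(first_col))

# scan a (q1, q2) slice at fixed q3; points with det <= 0 are where ln det Ch blows up
q3 = 0.5
qs = np.linspace(-0.99, 0.99, 201)
D = np.array([[det_Ch(q1, q2, q3) for q2 in qs] for q1 in qs])
print("fraction of the slice with det Ch <= 0:", np.mean(D <= 0.0))
```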
5. Conclusion
Generalized Least Squares requires the assignment of two prior covariance matrices: the prior covariance $\mathbf{C}_d$ of the data and the prior covariance $\mathbf{C}_h$ of the prior information. Making these assignments is often a very subjective process. However, in cases in which the forms of these matrices can be anticipated up to a set of poorly-known parameters, information contained within the data and prior information can be used to improve knowledge of them—a process we call “tuning”. Tuning can be achieved by minimizing an objective function that depends on both the generalized error and the determinants of the covariance matrices, to arrive at a best estimate of the parameters. Analytic and computationally-tractable formulas are derived for the derivatives needed to implement the minimization via a gradient descent method. Furthermore, the problem is organized so that the minimization need be performed only over the space of covariance parameters, and not over the typically-much-larger combined space of model and covariance parameters. Although some care needs to be exercised as the covariance matrices are parametrized, the minimization is tractable and can lead to better estimates of the model parameters. An important outcome of this study is the recognition that the use of trade-off curves to determine the relative weighting of covariance—a practice ubiquitous in geophysical imaging—is not consistent with the underlying Bayesian framework of Generalized Least Squares. The strategy outlined here provides a consistent solution.
Acknowledgements
The author thanks Roger Creel for helpful discussion.
Conflicts of Interest
The author declares no conflicts of interest regarding the publication of this paper.
Cite this paper
Menke, W. (2021) Tuning of Prior Covariance in Generalized Least Squares. Applied Mathematics, 12, 157-170. https://doi.org/10.4236/am.2021.123011
References
- 1. Tarantola, A. and Valette, B. (1982) Generalized Non-Linear Inverse Problems Solved Using the Least Squares Criterion. Reviews of Geophysics and Space Physics, 20, 219-232. https://doi.org/10.1029/RG020i002p00219
- 2. Tarantola, A. and Valette, B. (1982) Inverse Problems = Quest for Information. Journal of Geophysics, 50, 159-170. https://n2t.net/ark:/88439/y048722
- 3. Menke, W. (2018) Geophysical Data Analysis: Discrete Inverse Theory. 4th Edition, Elsevier, 350 p.
- 4. Menke, W. and Menke, J. (2016) Environmental Data Analysis with MATLAB. 2nd Edition, Elsevier, 3342 p. https://doi.org/10.1016/B978-0-12-804488-9.00001-X
- 5. Tarantola, A. (2005) Inverse Problem Theory and Methods for Model Parameter Estimation. SIAM: Society for Industrial and Applied Mathematics, 342 p. https://doi.org/10.1137/1.9780898717921
- 6. Menke, W. (2014) Review of the Generalized Least Squares Method. Surveys in Geophysics, 36, 1-25. https://doi.org/10.1007/s10712-014-9303-1
- 7. Abers, G. (1994) Three-Dimensional Inversion of Regional P and S Arrival Times in the East Aleutians and Sources of Subduction Zone Gravity Highs. Journal of Geophysical Research, 99, 4395-4412. https://doi.org/10.1029/93JB03107
- 8. Schmandt, B. and Lin, F.-C. (2014) P and S Wave Tomography of the Mantle beneath the United States. Geophysical Research Letters, 41, 6342-6349. https://doi.org/10.1002/2014GL061231
- 9. Menke, W. (2005) Case Studies of Seismic Tomography and Earthquake Location in a Regional Context. Geophysical Monograph 157. American Geophysical Union, Washington DC. https://doi.org/10.1029/157GM02
- 10. Nettles, M., and Dziewonski, A.M. (2008) Radially Anisotropic Shear Velocity Structure of the Upper Mantle Globally and Beneath North America. Journal of Geophysical Research, 113, B02303. https://doi.org/10.1029/2006JB004819
- 11. Chen, W. and Ritzwoller, M.H. (2016) Crustal and Uppermost Mantle Structure Beneath the United States. Journal of Geophysical Research, 121, 4306-4342. https://doi.org/10.1002/2016JB012887
- 12. Humphreys, E.D., Dueker, K.G., Schutt, D.L. and Smith, R.B. (2000) Beneath Yellowstone: Evaluating Plume and Nonplume Models Using Teleseismic Images of the Upper Mantle. GSA Today, 10, 1-7. https://www.geosociety.org/gsatoday/archive/10/12/
- 13. Gillet, N., Schaeffer, N. and Jault, D. (2011) Rationale and Geophysical Evidence for Quasi-Geostrophic Rapid Dynamics within the Earth’s Outer Core. Physics of the Earth and Planetary Interiors, 187, 380-390. https://doi.org/10.1016/j.pepi.2011.01.005
- 14. Zhao, S. (2013) Lithosphere Thickness and Mantle Viscosity Estimated from Joint Inversion of GPS and GRACE-Derived Radial Deformation and Gravity Rates in North America. Geophysical Journal International, 194, 1455-1472. https://doi.org/10.1093/gji/ggt212
- 15. Menke, W. and Eilon, Z. (2015) Relationship between Data Smoothing and the Regularization of Inverse Problems. Pure and Applied Geophysics, 172, 2711-2726. https://doi.org/10.1007/s00024-015-1059-0
- 16. Voorhies, C.F. (1986) Steady Flows at the Top of Earth’s Core Derived from Geomagnetic Field Models. Journal of Geophysical Research, 91, 12444-12466. https://doi.org/10.1029/JB091iB12p12444
- 17. Yao, Z.S. and Roberts, R.G. (1999) A Practical Regularization for Seismic Tomography. Geophysical Journal International, 138, 293-299. https://doi.org/10.1046/j.1365-246X.1999.00849.x
- 18. Snyman, J.A. and Wilke, D.N. (2018) Practical Mathematical Optimization—Basic Optimization Theory and Gradient-Based Algorithms. Springer Optimization and Its Applications, 2nd Edition, Springer, New York, 340 p.
- 19. Hildebrand, F.B. (1987) Introduction to Numerical Analysis. 2nd Edition, Dover Publications, New York.
- 20. Zaroli, C., Sambridge, M., Lévêque, J.-J., Debayle, E. and Nolet, G. (2013) An Objective Rationale for the Choice of Regularization Parameter with Application to Global Multiple-Frequency S-Wave Tomography. Solid Earth, 4, 357-371. https://doi.org/10.5194/se-4-357-2013
- 21. Malinverno, A. and Parker, R.L. (2006) Two Ways to Quantify Uncertainty in Geophysical Inverse Problems. Geophysics, 71, W15-W27. https://doi.org/10.1190/1.2194516
- 22. Malinverno, A. and Briggs, V.A. (2004) Expanded Uncertainty Quantification in Inverse Problems: Hierarchical Bayes and Empirical Bayes. Geophysics, 69, 877-1103. https://doi.org/10.1190/1.1778243
- 23. Box, G.E.P. and Tiao, G.C. (1992) Bayesian Inference in Statistical Analysis. Wiley, New York, 589 p. https://doi.org/10.1002/9781118033197
- 24. Schmidt, E. (1973) Cholesky Factorization and Matrix Inversion, National Oceanic and Atmospheric Administration Technical Report NOS-56. US Government Printing Office, Washington DC. https://books.google.com/books?id=MiRHAQAAIAAJ
- 25. Petersen, K.B. and Pedersen, M.S. (2008) The Matrix Cookbook, 71 p. https://archive.org/details/imm3274
- 26. Bartels, R.H. and Stewart, G.W. (1972) Solution of the matrix equation AX + XB = C. Communications of the ACM, 15, 820-826. https://doi.org/10.1145/361573.361582
- 27. Higham, N.J. (1987) Computing Real Square Roots of a Real Matrix. Linear Algebra and its Applications, 88-89, 405-430. https://doi.org/10.1016/0024-3795(87)90118-2
- 28. Magnus, J.R. and Neudecker, H. (1999) Matrix Differential Calculus with Applications in Statistics and Econometrics, Revised Edition. John Wiley and Sons, New York, 424 p.
- 29. Gantmacher, F.R. (1960) The Theory of Matrices, Volume 1. Chelsea Publishing, New York, 374 p.
- 30. Fisher, R.A. (1925) Theory of Statistical Estimation. Mathematical Proceedings of the Cambridge Philosophical Society, 22, 700-725. https://doi.org/10.1017/S0305004100009580
- 31. Claerbout, J.F. (1985) Fundamentals of Geophysical Data Processing with Applications to Petroleum Prospecting. Blackwell Scientific Publishing, Oxford, UK, 267 p.