Some Properties of a Recursive Procedure for High Dimensional Parameter Estimation in Linear Model with Regularization

Theoretical results on the properties of a regularized recursive algorithm for estimating a high dimensional vector of parameters are presented and proved. The recursive character of the procedure is proposed to overcome the difficulties caused by the high dimension of the observation vector in the computation of a statistical regularized estimator. To deal with the high dimension of the vector of unknown parameters, regularization is introduced by specifying an a priori non-negative covariance structure for the vector of estimated parameters. A numerical example with Monte-Carlo simulation for a low-dimensional system, as well as state/parameter estimation in a very high dimensional oceanic model, is presented to demonstrate the efficiency of the proposed approach.


Introduction
In [1] a statistical regularized estimator is proposed for the optimal linear estimation of an unknown vector in a linear model with arbitrary non-negative covariance structure,

z = Hx + v,   (1)

where z is the p-vector of observations, H is the (p × n) observation matrix, x is the n-vector of unknown parameters to be estimated, and v is the p-vector representing the observation error.
It is assumed that E(v) = 0, where E(·) is the mathematical expectation operator, and that the covariance matrix of the joint vector (x^T, v^T)^T is specified by the blocks M, N and R; this joint covariance matrix may be singular (and hence the model (1)-(3) is called a linear model with arbitrary non-negative covariance structure). Throughout this paper let p, n be any positive integers.
No particular assumption is made regarding the probability density function of v.
In [1] the optimal linear estimator for the unknown vector x in the model (1)-(3) is defined by the expression (4), where V^T denotes the transpose of V and H^+ denotes the pseudo-inverse of H.
As in practice all the matrices H, M, N, R and the observation vector z are given only approximately, instead of the data set D = (H, M, N, R, z) we are given their δ-approximations, and hence a perturbed estimate results. As shown in [1], there are situations when the error e := x̂ − x̂_δ between the "true" estimate x̂ and its perturbed version x̂_δ may be very large even for a small data error δ. A regularization procedure has been proposed in [1] to overcome this difficulty.
In this paper we are interested in obtaining a simple recursive algorithm for computing x̂ in situations where n, p or both may be very large.
This problem is very important for many practical applications. As an example, consider data assimilation problems in meteorology and oceanography [2]. For a typical data set in oceanography, at each assimilation instant the observation vector is of very high dimension. Under such conditions it is unimaginable to compute the estimate (4) using standard numerical methods. We will show that there exists a simple algorithm to overcome these difficulties by exploiting the recursive character of the algorithm with an appropriate regularization in the form of an a priori covariance matrix M for the vector of unknown parameters x.

Problem Statement: Noise-Free Observations
First consider the model (1) and assume that v = 0. We then have the system of linear equations

z = Hx,  z ∈ R^p,   (6)

for the noise-free observations z. Suppose that the system (6) is compatible, i.e. there exists x_0 such that Hx_0 = z. The problem is to obtain a simple recursive procedure for computing a solution of the system (6) when the dimension of z is very high.

Iterative Procedure
To find a solution to Equation (6), let us introduce the following system of recursive equations (7a)-(7c), which process one row h_i of H (one component z_i of z) per iteration. In order to prove Theorem 1 we need Lemma 1.

Lemma 1. The following equalities hold: h_i M_j = 0, i = 1, …, j.

Proof. By induction. The statement holds for j = 1 by direct computation from (7c). Let the statement be true for some 1 < l < p. We show now that it is true also for l + 1. As the statement is true for l, it implies h_i M_l = 0, i = 1, …, l; since the gain at step l + 1 belongs to the range of M_l, it follows that h_i M_{l+1} = 0 for i = 1, …, l, and for i = l + 1 the equality follows by the same computation as for j = 1.

Similarly one proves that h_i x_l = z_i, i = 1, …, l.

Proof. By induction. The statement holds for j = 1 by construction of x_1. Let the statement be true for some 1 < l < p. As the statement is true for l, it implies h_i x_l = z_i, i = 1, …, l. We have to prove that h_i x_{l+1} = z_i, i = 1, …, l + 1. This however follows from Lemma 1: as h_i M_l = 0, i = 1, …, l, the correction at step l + 1 leaves the first l equations satisfied, and compatibility of (6) ensures h_{l+1} x_{l+1} = z_{l+1}.

Proof of Theorem 1. By induction. For l = 1, the fact that h_1 M_1 = 0 implies that the null space of M_1 contains at least one nonzero element, hence dim R[M_1] ≤ n − 1. We show now that a strictly smaller dimension is impossible. Suppose that the subspace

R[M_1^Δ] has dimension 1, hence any element from R[M_1^Δ] can be represented on the basis of some vector b_1. Thus any element from the subspace R[M_1'] could be written as a linear combination of fewer than n − 1 linearly independent elements. On the other hand, any element of R[M_0] must be represented as a linear combination of n linearly independent vectors. This contradiction shows that dim R[M_1] = n − 1. The same argument is valid for l = 3, 4, ….

Suppose now that the Corollary is true for some l ≥ 1; we have to prove that it holds for l + 1. The relevant subspace has dimension l + 1 (as all the rows of H are linearly independent). It follows that the dimension of the subspace R[M_{l+1}] is less than or equal to n − l − 1. In the same way as for l = 1 one can show that a strictly smaller dimension is impossible.

Comment 1. By verifying the rank of M_l, Corollary 1 allows us to check whether the computer code is correct. In particular, if H is non-singular, at the end of the iterative procedure the matrix M_n should be zero. The recursive Equations (7a)-(7c) then yield the unique solution of the equation z = Hx after n iterations.

Comment 2. Using the result (8) in Lemma 1, it is easy to see that under the conditions of Theorem 1 the recursion for M is similar to that given in (3.14.1) in [3]. The difference is that in (7c) the initial M_0 may be any SPD matrix, whereas in (3.14.1) of [3] it is assumed M_0 = I.
In the next section we shall show that the fact that M_0 may be any SPD matrix is important for obtaining the optimal (in the mean square sense) estimator for x.
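The display of the recursion (7a)-(7c) did not survive in this text, so the following sketch is an assumption: it uses the standard row-action form with gain K_i = M_{i-1} h_i^T (h_i M_{i-1} h_i^T)^+, consistent with the reduced gain quoted later for (24a)-(24d). All function and variable names are illustrative.

```python
import numpy as np

def recursive_solve(H, z, M0):
    """Row-action recursion in the spirit of (7a)-(7c): process one
    observation (row of H) per iteration, starting from an SPD M0."""
    x = np.zeros(H.shape[1])
    M = M0.copy()
    for h, zi in zip(H, z):
        s = h @ M @ h                    # scalar h_i M_{i-1} h_i^T
        if s > 1e-12:                    # pseudo-inverse of a scalar
            K = M @ h / s                # gain K_i
            x = x + K * (zi - h @ x)     # update the estimate
            M = M - np.outer(K, h @ M)   # update M_i
    return x, M
```

Consistent with Comment 1, for a non-singular H and M0 = I this recursion returns the exact solution of z = Hx, and M_n = 0 after n steps.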

Regularized Estimate
Theorem 1 says only that Equations (7a)-(7c) give a solution to (6). We are now going to study the question of whether a solution of Equations (7a)-(7c) is optimal and, if so, in what sense. Return to Equations (1)-(3) and assume that v = 0, N = 0. The optimal estimator (4) is then given in the form (10), where H^+ is the pseudo-inverse of H.

Consider first the equation z_1 = h_1 x. In this case (10) yields the optimal estimator (obtained on the basis of the first observation)

x̂_1 = M h_1^T (h_1 M h_1^T)^+ z_1,   (11)

and one can also compute the mean square error of x̂_1. If we apply Equation (3.14.1) in [3], i.e. take M = I, then instead of x̂_1 we have

x̂_1' = h_1^T (h_1 h_1^T)^+ z_1 = h_1^+ z_1.   (12)

For simplicity, let x̄ = 0. Comparing Equation (12) with Equation (11) shows that while x̂_1' is the orthogonal projection of x onto the subspace spanned by h_1^T, the estimate x̂_1 belongs to the subspace spanned by the columns of M.
Thus the algorithm (11) takes into account the fact that we know a priori that x belongs to the space R[M]. This fact is very important when the number of observations p is much less than the number of estimated parameters n, as happens in oceanic data assimilation. We now prove a stronger result saying that all the estimates x̂_i, i = 1, 2, … are projected onto the space R[M].

Theorem 2. Consider the algorithm (7). Suppose x̄ ∈ R[M]. Then all the estimates x̂_i, i = 1, 2, … belong to the space spanned by the columns of M, i.e. x̂_i ∈ R[M].
Proof. For i = 1 the statement is evident, as shown above. Suppose the statement is true for some i. We will show that it is true also for i + 1: this follows again from the equation for M_{i+1} in Equation (7). Theorem 2 says that by specifying an initial M_0 with range R[M] the algorithm (7) will produce estimates belonging to the subspace R[M]. Specification of the a priori matrix M is thus the most important task if we want the algorithm (7) to produce estimates of high quality.
Comment 3. 1) If x does not belong to R[M], then by considering e := x − x̄ in place of x, Theorem 2 remains valid for the algorithm (7) written for e. In this case at the first iteration, as e := x − x̄ represents the a priori error for x, it is natural for the algorithm to seek a correction (i.e., an estimate for the error e). In what follows, unless otherwise stated, for simplicity we assume x̄ = 0. 2) Theorem 2 says that there is a possibility to regularize the estimate when the number of observations is less than the number of estimated parameters by choosing M_0 = M. Thus the algorithm can be considered as one which finds the solution of Hx = z under the constraint x ∈ R[M]. In the procedure in Albert (1972), putting M_0 = I means that there is no constraint on x, hence the best one can do is to project x orthogonally onto the subspace R[H^T].
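The constrained-solution reading in Comment 3(2) can be illustrated numerically. In the sketch below (assuming the same gain form as quoted later for (24a)-(24d); all names are illustrative), a rank-one prior M0 = u u^T forces every estimate into span{u}.

```python
import numpy as np

def recursive_estimate(H, z, M0):
    """Recursion with an assumed gain K = M h^T (h M h^T)^+ ."""
    x, M = np.zeros(H.shape[1]), M0.copy()
    for h, zi in zip(H, z):
        s = h @ M @ h
        if s > 1e-12:
            K = M @ h / s
            x = x + K * (zi - h @ x)
            M = M - np.outer(K, h @ M)
    return x

u = np.array([1.0, 1.0, 0.0]) / np.sqrt(2.0)
M0 = np.outer(u, u)                  # rank-1 a priori covariance: R[M0] = span{u}
H = np.array([[1.0, 0.0, 0.0]])      # a single observation of the 1st component
z = np.array([2.0])
x_hat = recursive_estimate(H, z, M0) # stays in span{u}: equal 1st/2nd components
```

Observing only the first component, the estimate nevertheless fills in the second component through the prior subspace, exactly the regularizing effect described above.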

Minimal Variance Estimate
Suppose x is a random variable having mean x̄ and covariance matrix M. We then have the following result.

Theorem 3. Suppose x is a random variable having mean x̄ and covariance matrix M. Then x̂_i generated by the recursive Equation (7) is an unbiased and minimum variance estimate for x in the class of all unbiased estimates linearly dependent on x̄ and z_1, …, z_i.

Proof. Introduce for the system (6) the representation (13) and the class of all estimates x̂_i' linearly dependent on x̄ and z_1, …, z_i.
Writing out the condition for unbiasedness of (14) shows that every estimate in Equation (14) is unbiased.
Consider the minimization problem B = argmin J(B), where J(B) is the trace of the error covariance. Taking the derivative of J(B) with respect to B yields the equation for finding B. If now instead of Equation (13) we consider the system perturbed by a small α > 0 and repeat the same proof, one can show that the unbiased minimum variance estimate for x is obtained in that case as well. Using the properties of the pseudo-inverse (Theorem 3.8 in [3]), one can prove that for a very small α > 0 the estimate x̂(α) is unbiased, of minimum variance, and can be made as close as desired to the estimate produced by (7). On the other hand, applying Lemma 1 in [5] for the case of an uncorrelated sequence {v_i}, one can derive a recursive form for x̂(α). Letting α → 0 one arrives at the algorithm (18a)-(18d).

Noisy Observations
The algorithm (18a)-(18d) thus yields an unbiased minimal variance (UMV) estimate for x in the situation when α represents the observation noise variance. We want to stress that these algorithms produce the UMV estimates only if M̄ = M, where M is the true covariance of the error e := x − x̄ before the arrival of z_i, i = 1, 2, ….
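Since the display of (18a)-(18d) did not survive in this text, here is a hedged sketch of a sequential noisy-observation recursion of that type, with r_i the variance of the i-th observation noise (uncorrelated). As a consistency check it is compared with the batch minimum-variance estimator M H^T (H M H^T + R)^{-1} z for zero-mean x; for uncorrelated noise the two coincide. All names are illustrative.

```python
import numpy as np

def sequential_umv(H, z, M0, r):
    """Sequential filter in the spirit of (18a)-(18d): scalar
    observations processed one at a time with noise variances r[i]."""
    x, M = np.zeros(H.shape[1]), M0.copy()
    for h, zi, ri in zip(H, z, r):
        s = h @ M @ h + ri               # innovation variance
        K = M @ h / s                    # gain
        x = x + K * (zi - h @ x)
        M = M - np.outer(K, h @ M)
    return x

M0 = np.array([[2.0, 1.0], [1.0, 2.0]])
H = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
r = np.array([0.5, 0.5, 1.0])
z = np.array([1.0, 2.0, 2.5])

x_seq = sequential_umv(H, z, M0, r)
# batch minimum-variance estimate for zero-mean x with covariance M0
x_batch = M0 @ H.T @ np.linalg.solve(H @ M0 @ H.T + np.diag(r), z)
```

The agreement of `x_seq` and `x_batch` reflects the standard equivalence of sequential and batch linear minimum-variance estimation with uncorrelated observation noise.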

Very High Dimension of x : Simplified Algorithms
In the field of data assimilation in meteorology and oceanography the state vector x is usually of very high dimension. This happens because x is a collection of several variables defined on a three-dimensional grid. While the algorithm (7) overcomes the difficulties with the high dimension of the observation vector z (each iteration involves one component of z), due to the high dimension of x it is impossible to handle Equation (7c) to evaluate the matrices M_i (with a number of elements of order 10^12-10^14). This section is devoted to the question of how one can overcome such difficulties.
Let us consider the eigen-decomposition of M [6]:

M = U D U^T.   (19)

In (19) the columns of U are the eigenvectors of M and D is diagonal, with the eigenvalues of M as its elements. If in the algorithms (7) or (18) we put M_0 = U_1 D_1 U_1^T, where U_1 and D_1 retain only the leading eigenvectors and eigenvalues, then the algorithm (7), for example, will yield the best estimate for x projected onto the subspace spanned by those leading eigenvectors.
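A minimal sketch of the decomposition (19) and of the truncated initialization M_0 = U_1 D_1 U_1^T; the covariance matrix of the later numerical example is used here, and variable names are illustrative.

```python
import numpy as np

# Eigen-decomposition (19), M = U D U^T, and its rank-m truncation
# used to initialize the algorithm with a low-rank covariance.
M = np.array([[0.85,  0.525,  0.0],
              [0.525, 0.5625, 0.0],
              [0.0,   0.0,    0.64]])
d, U = np.linalg.eigh(M)            # eigh returns ascending eigenvalues
d, U = d[::-1], U[:, ::-1]          # reorder: leading modes first

m = 2                               # number of retained modes
M_m = U[:, :m] @ np.diag(d[:m]) @ U[:, :m].T
```

The discarded part of the trace equals the sum of the dropped eigenvalues, which quantifies what the low-rank initialization gives up.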

Main Theoretical Results
Theorem 4. Consider two algorithms of the type (7) subject to two matrices M(m) and M(n), m < n, where the columns retained in M(m) are the m leading eigenvectors of M, and let x̂(m) denote the estimate produced by the algorithm (7) subject to M(m). Then the mean squared error of x̂(n) does not exceed that of x̂(m), where the strict inequality takes place if the eigenvalues discarded at truncation order m are nonzero.

Proof
Write the representation of x in terms of the decomposition of M on the basis of its eigenvectors (for simplicity, let x̄ = 0): x = U D^{1/2} y, where y is of zero mean and has the covariance matrix I. Let y_l be a sample of y. Theorem 3 states that for all such samples the algorithm (7) will yield the estimate with the minimal variance.
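The representation x = U D^{1/2} y used here can be checked directly: the factor S = U D^{1/2} satisfies S S^T = M, so samples S y with Cov(y) = I have covariance M. A sketch (names are illustrative):

```python
import numpy as np

# The square-root factor S = U D^{1/2} reproduces Cov(x) = M:
# Cov(Sy) = S Cov(y) S^T = S S^T = U D U^T = M.
M = np.array([[0.85,  0.525,  0.0],
              [0.525, 0.5625, 0.0],
              [0.0,   0.0,    0.64]])
d, U = np.linalg.eigh(M)
S = U @ np.diag(np.sqrt(d))          # square-root factor U D^{1/2}

rng = np.random.default_rng(0)
y = rng.standard_normal((3, 5))      # five samples of y with Cov(y) = I
x_samples = S @ y                    # samples of x with covariance M
```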
In what follows we introduce the notation: M̄ is the ECM used to initialize the algorithm (7a)-(7c). The samples x_l of x come from a variable having zero mean and covariance M.
x_l(m) denotes a sample coming from a variable having zero mean and covariance M(m). There are the following different cases.

1) M̄ = M: By Theorem 3 the algorithm (7) will produce the estimates of minimal variance for all x_l. This is true also if Equation (7) is applied to the samples x_l(m): the estimates will be of minimal variance.

2) M̄ = M(m), m < n: a) For samples belonging to R[M(m)] the estimates will be of minimal variance.
b) For samples belonging to R[M] but not to R[M(m)], the estimates will not be of minimal variance.
Thus in the mean sense E‖e(n)‖ ≤ E‖e(m)‖, where e(m) := x(m) − x̂(m).
3) Consider two initializations M(m) and M(m'), m < m'. In the same way we have: a) for samples drawn with the covariance matching the initialization, the estimates are of minimal variance; otherwise the algorithm (7) will produce estimates i) of minimal variance for samples lying in the subspace of the initialized ECM, ii) of minimal variance for samples lying in the intersection of the two subspaces, and iii) not of minimal variance for the remaining samples.

Simplified Algorithm
Theorem 5. Consider the algorithm (7) subject to M_0 = U_m D_m U_m^T. Then the recursion can be carried out entirely in the reduced variables (indicated below by the subscript e), with the gain

K_e(i) = M_e(i) h_e(i)^T [h_e(i) M_e(i) h_e(i)^T]^+.
It is seen that in the algorithm (24a)-(24d) the estimate x_e(i) ∈ R^m belongs to a linear space of dimension m. In data assimilation it happens that the dimension of x may be very high while only a few leading directions (leading eigenvectors of M, representing the directions of most rapid growth of the estimation error) need to be captured. Thus the algorithm (24a)-(24d) is well adapted for solving such problems: first the initial covariance M is constructed from physical considerations or a numerical model, and next it is decomposed (numerically) to obtain an approximate decomposition. The algorithm (24a)-(24d) is very close to that studied in [7] for ensuring stability of the filter.
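A sketch of the reduced-order recursion of Theorem 5, under the assumption that (24a)-(24d) amount to running the basic recursion in the m reduced coordinates y, with x = U_m y, so that only an (m × m) covariance is stored instead of the (n × n) matrices M_i. All names are illustrative.

```python
import numpy as np

def reduced_recursive_estimate(H, z, Um, Dm):
    """Reduced-order recursion: work in m coordinates y, x = Um @ y,
    storing only an (m x m) covariance P."""
    m = Um.shape[1]
    y = np.zeros(m)
    P = np.diag(Dm)                      # reduced covariance
    for h, zi in zip(H, z):
        g = h @ Um                       # reduced observation row h_e(i)
        s = g @ P @ g
        if s > 1e-12:
            K = P @ g / s                # reduced gain K_e(i)
            y = y + K * (zi - g @ y)
            P = P - np.outer(K, g @ P)
    return Um @ y                        # lift back to the full space
```

Because the full-space covariance is U_m P U_m^T throughout, this produces the same estimate as the full recursion initialized with M_0 = U_m D_m U_m^T, at a fraction of the storage cost.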

Numerical Example
Consider the system (1) subject to the covariance

M = [ 0.85    0.525   0
      0.525   0.5625  0
      0       0       0.64 ].

Here we assume that the 1st and 3rd components of x are observed, i.e. the rows of H are the unit row vectors e_1^T and e_3^T.
The eigendecomposition of M is computed numerically. The algorithm (24a)-(24d) is applied subject to three covariance matrices M(1), M(2), M(3); the resulting schemes are denoted ALG(m). In Figure 1 we show the numerical results obtained from the Monte-Carlo simulation. There are 100 samples simulating the true x, generated by a random generator according to the normal distribution. The curve Id denotes the 4th algorithm, ALG(4), which is run subject to M = I (the identity matrix). This is equivalent to the orthogonal projection (using the pseudo-inversion of H) of z onto the subspace R[H^T]. As seen from Figure 1, the estimation error is highest in ALG(1). There is practically no difference between ALG(2) and ALG(3), which decrease the estimation error considerably (by 50%) compared to ALG(1). As to ALG(4), its performance is situated between ALG(1) and ALG(2). This experiment confirms the theoretical results and demonstrates that if we are given good a priori information on the estimated parameters, there exists a simple way to improve the quality of the estimate by appropriately introducing this a priori information in the form of the regularization matrix M.
The results produced by ALG(1) and ALG(4) also show that when the a priori information is insufficiently rich, the algorithm naturally produces estimates of poor quality. In such a situation, simply applying the orthogonal projection can yield a better result. For the present example, the reason is that using the 2nd mode u_2 allows capturing the important information contained in the second observation. Ignoring it (as ALG(1) does) is equivalent to ignoring the second observation z_2. As to the third mode u_3, it has a weak impact on the estimation since the corresponding eigenvalue λ_3 is small. That explains why ALG(2) and ALG(3) have produced almost the same results.
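The Monte-Carlo comparison of this section can be sketched as follows. The covariance, the observed components and the truncation orders come from the text; the recursion's gain form, the sample count and the seed are assumptions for illustration.

```python
import numpy as np

def estimate(H, z, M0):
    # recursion with assumed gain K = M h^T (h M h^T)^+
    x, M = np.zeros(H.shape[1]), M0.copy()
    for h, zi in zip(H, z):
        s = h @ M @ h
        if s > 1e-12:
            K = M @ h / s
            x = x + K * (zi - h @ x)
            M = M - np.outer(K, h @ M)
    return x

M = np.array([[0.85,  0.525,  0.0],
              [0.525, 0.5625, 0.0],
              [0.0,   0.0,    0.64]])
H = np.array([[1.0, 0.0, 0.0],         # 1st and 3rd components observed
              [0.0, 0.0, 1.0]])
d, U = np.linalg.eigh(M)
d, U = d[::-1], U[:, ::-1]             # leading modes first
S = U @ np.diag(np.sqrt(d))            # square root of M for sampling

priors = {m: U[:, :m] @ np.diag(d[:m]) @ U[:, :m].T for m in (1, 2, 3)}
priors["Id"] = np.eye(3)               # ALG(4): plain orthogonal projection

rng = np.random.default_rng(1)
n_samples = 2000
mse = {k: 0.0 for k in priors}
for _ in range(n_samples):
    x_true = S @ rng.standard_normal(3)
    z = H @ x_true
    for k, M0 in priors.items():
        e = estimate(H, z, M0) - x_true
        mse[k] += (e @ e) / n_samples
```

With this setup the sample mean squared errors reproduce the ordering reported for Figure 1: ALG(2) ≈ ALG(3) < Id < ALG(1).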

MICOM Model
In this section we will show importance of the regularization factor in the form of the priori covariance M in the design of a filter for systems of very high dimension.
The numerical model used in this experiment is MICOM (Miami Isopycnic Coordinate Ocean Model), exactly as presented in [8]. We recall only that the model configuration is a domain situated in the North Atlantic from 30°N to 60°N and 80°W to 44°W. The grid spacing is about 0.2° in longitude and in latitude, with a horizontal mesh indexed by i = 1, …, 140. The assimilation experiment consists of using the SSH to correct the model solution, which is initialized by an arbitrarily chosen state resulting from the control run.

Different Filters
The different filters are implemented to estimate the oceanic circulation. It is well known that determining the filter gain is one of the most important tasks in the design of a filter. For the considered problem it is impossible to apply the standard Kalman filter [9], since in the present experiment the state dimension is very large and the number of elements in the ECM is of order O(n^2). At each assimilation instant, the estimate for the system state in all filters is computed in the form (10) with the corresponding ECM M. As the number of observations p = 252 is much smaller than the dimension of the system state, the choice of M as a regularization factor has a great impact on the quality of the produced estimates. In this assimilation experiment the following filters will be employed. First, the Prediction Error Filter (PEF), whose ECM is obtained on the basis of leading real Schur vectors [8]. In parallel, two other filters will be used: the Cooper-Haines filter (CHF) [10] and an EnOI (Ensemble based Optimal Interpolation) filter [11]. Note that the ECM in the CHF is obtained on the basis of the principle of a vertical rearrangement of water parcels (see also [8]); the method conserves the water masses and maintains geostrophy. The main difference between the PEF and the EnOI lies in the way the ensembles of Prediction Error (PE) samples are generated. In the PEF, the ensemble of PE samples is generated using the sampling procedure described in [8] (it will be denoted by En(PEF)). As for the EnOI, the ensemble of background-error samples (the term used in [8]), denoted by En(EnOI), will be used. The elements of En(EnOI) are constructed according to the method in [11]: a 2-year mean of true states is used as the background field, and the error samples are calculated as differences between individual 10-day true states during this period and the background.
According to Corollary 4.1 in [4], using the hypothesis of a separable vertical-horizontal structure for the ECM, we represent M = M_v ⊗ M_h, where M_v and M_h are the ECMs of the vertical and horizontal variables respectively and ⊗ denotes the Kronecker product. In the case of sea-surface height observations, it follows from this representation that the gain factorizes accordingly: the vertical gain K_v allows the correction available at the surface to propagate into all vertical subsurface layers, while K_h represents an operator of (horizontal) optimal interpolation which interpolates the observations over all horizontal grid points at the surface. Note that the elements of M_v and the correlation length parameter in M_h are estimated by minimizing the mean distance between the data matrix M_d and M using a simultaneous perturbation stochastic approximation algorithm [12]. The data matrix M_d is obtained from samples of the leading real Schur vectors as described in [8].
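The storage benefit of the separable structure, and the reason the gain splits into a vertical part K_v and a horizontal interpolation K_h, both follow from the mixed-product property of the Kronecker product. A toy sketch (the dimensions and matrices are illustrative, not the MICOM ones):

```python
import numpy as np

rng = np.random.default_rng(2)
nv, nh = 3, 4                          # toy vertical / horizontal dimensions
A = rng.standard_normal((nv, nv)); Mv = A @ A.T   # SPD vertical ECM (toy)
B = rng.standard_normal((nh, nh)); Mh = B @ B.T   # SPD horizontal ECM (toy)

# Full ECM M = Mv ⊗ Mh has (nv*nh)^2 entries; the factors need only
# nv^2 + nh^2 numbers, which is what makes the approach tractable.
M = np.kron(Mv, Mh)

# Mixed-product property: (Mv ⊗ Mh)(a ⊗ b) = (Mv a) ⊗ (Mh b), so a
# vertical operator and a horizontal interpolation can be applied
# separately, without ever forming M explicitly.
a = rng.standard_normal(nv)
b = rng.standard_normal(nh)
full = M @ np.kron(a, b)
factored = np.kron(Mv @ a, Mh @ b)
```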
We remark that all the gain coefficients in the two filters are of identical sign, but the elements of K_v^enoi are of much smaller magnitude than those of K_v^pef. This means that the EnOI will make a smaller correction (compared to the PEF) to the forecast estimate. The two gains in (29a), (29b) will be used in the filters PEF and EnOI to assimilate the observations.
The vertical gain coefficients for the CHF are taken from [8].

Numerical Results
In Figure 3 we show the instantaneous variances of the SSH innovation produced by the three filters EnOI, CHF and PEF. It is seen that, initialized by the same initial state, while the innovation variances in the EnOI and CHF have a tendency to increase, this error remains stable for the PEF during the whole assimilation period. At the end of the assimilation, the PE in the CHF is more than two times greater than that produced by the PEF. The EnOI has produced poor estimates, with an error about two times greater than that of the CHF. For the velocity estimates the same tendency is observed, as seen from Figure 4 for the surface velocity PE. These results show that the choice of the ECM as a regularization factor on the basis of members of En(PEF) allows a much better approach to the true system state than that based on the samples taken from En(EnOI), or than that constructed on the basis of physical considerations as in the CHF. The reason is that the members of En(PEF) are by construction [8] samples of the directions along which the prediction error increases most rapidly. In other words, the correction in the PEF is designed to capture the principal components in the decomposition of the covariance of the prediction error.

Conclusion
We have presented some properties of an efficient recursive procedure for computation of a statistical regularized estimator for the optimal linear estimator in a linear model with arbitrary non-negative covariance structure.
The main objective of this paper is to obtain an algorithm which allows overcoming the difficulties associated with the high dimension of the observation vector as well as that of the estimated vector of parameters. As was seen, the recursive nature of the proposed algorithm allows dealing with the high dimension of the observation vector. By initializing the associated matrix equation with a low-rank approximate covariance, which accounts for only the first leading components of the eigenvalue decomposition of the a priori covariance matrix, the proposed algorithm greatly reduces the number of estimated parameters. The efficiency of the proposed recursive procedure has been demonstrated by numerical experiments with systems of small and of very high dimension.
A similar question has been studied concerning the choice of an adequate structure for the Error Covariance Matrix (ECM) M.
The "true" ocean is simulated by running the model from "climatology" during two years. Every ten days the sea-surface height (SSH) is stored at the grid points as observations for the assimilation experiment. The sequence of true states is then available and allows us to compute the estimation errors. The observation operator H is constant at all assimilation instants.

Figure 2 shows the estimated vertical coefficients k_l, l = 1, 2, 3, 4, obtained on the basis of the ECM spanned by the elements of En(PEF). It is seen that the estimates converge quite quickly. The vertical gain coefficients are estimated on the basis of the ECM from the two ensembles En(PEF) and En(EnOI) at iteration t = 72.

Figure 2. Vertical gain coefficients obtained during application of the sampling procedure for layer thickness correction during the iterations.

Figure 3. Performance comparison of EnOI, CHF and PEF: variance of the SSH innovation resulting from the filters EnOI, CHF and PEF.

Figure 4. The prediction error variance of the u velocity component at the surface (cm/s) resulting from the EnOI, CHF and PEF.
Theorem 1. Suppose the system (6) is compatible. Then for any finite x_0 and any symmetric positive definite (SPD) matrix M_0 we have Hx_p = z. Throughout the paper, for definiteness, R[H] denotes the linear space spanned by the columns of H.

H. S. Hoang, R. Baraille