Optimal Estimation of High-Dimensional Covariance Matrices with Missing and Noisy Data

Meiyin Wang; Wanzhou Ye

doi:10.4236/apm.2024.144013

Advances in Pure Mathematics > Vol.14 No.4, April 2024

Department of Mathematics, College of Science, Shanghai University, Shanghai, China.

Abstract

The estimation of covariance matrices is important in many fields, such as statistics. In real applications, data are frequently high-dimensional and contaminated by noise. However, most relevant studies are based on complete data. This paper studies the optimal estimation of high-dimensional covariance matrices based on missing and noisy samples under the $l_{1}$ norm. First, the model with sub-Gaussian additive noise is presented. The generalized sample covariance is then modified to define a hard thresholding estimator ${\hat{Σ}}_{F}^{t h}$ , and the minimax upper bound is derived. After that, the minimax lower bound is derived, and it is concluded that the estimator presented in this article is rate-optimal. Finally, a numerical simulation is performed. The results show that, for missing samples with sub-Gaussian noise, if the true covariance matrix is sparse, the hard thresholding estimator outperforms the traditional estimation method.

Keywords

High-Dimensional Covariance Matrix, Missing Data, Sub-Gaussian Noise, Optimal Estimation

Share and Cite:

Wang, M. and Ye, W. (2024) Optimal Estimation of High-Dimensional Covariance Matrices with Missing and Noisy Data. Advances in Pure Mathematics, 14, 214-227. doi: 10.4236/apm.2024.144013.

1. Introduction

The covariance matrix is a key component in various fields, particularly statistics. However, when dealing with many statistical situations, the covariance matrix is usually unknown. As a result, estimating the covariance matrix is extremely significant, and it is frequently utilized in signal processing, genomics, financial mathematics, and other domains. When the dimension p is fixed and the sample size n is sufficiently large, the sample covariance matrix is commonly used to estimate the true covariance matrix. However, with advancements in information technology and various other technologies, there is a growing challenge in estimating large covariance matrices. Issues such as dimensionality and noise can significantly impact the effectiveness of using the sample covariance matrix to estimate the true covariance matrix. Moreover, in the era of big data, missing data is a common occurrence, making research on the estimation of high-dimensional covariance matrices based on missing and noisy data essential.

Bickel and Levina [1] proposed thresholding as a method for estimating high-dimensional covariance matrices and proved its consistency under the operator norm. However, there was no discussion of its optimality. Cai and Zhou [2] studied the optimal estimation of sparse covariance matrices under the operator norm and Bregman divergence loss, and proved that the thresholding estimator can achieve the optimal convergence rate under the spectral norm. Cai and Zhou [3] provided the optimal estimation of sparse covariance matrices under the $l_{1}$ norm loss. The thresholding described above is also referred to as hard thresholding, and its counterpart is soft thresholding [4] [5] . On this basis, Rothman, Levina, and Zhu [6] proposed generalized thresholding and proved its consistency. Cai and Liu [7] proposed adaptive thresholding. The adaptive estimation of high-dimensional sparse precision matrices was studied by Cai, Liu, and Zhou [8] . Bandable covariance matrices were studied in depth in [3] , [9] , and [10] .

In the case of missing data, Cai and Zhang [11] assumed that the missingness was independent of the data values and studied the optimal estimation of two classes of covariance matrices. Qi [12] explored the optimal estimation of sparse covariance matrices under the $l_{1}$ norm and the Frobenius norm, respectively. In addition, the lower bound for estimating bandable covariance matrices under the spectral norm was studied based on noisy and missing data, but its optimality was not considered. Shi [13] studied the optimal estimation of bandable covariance matrices based on missing and noisy sample data.

Research on estimating high-dimensional covariance matrices is primarily based on complete data, yet related research on missing data and noisy models remains critical. The articles listed above inspired this paper's topic and methods, and this paper addresses the issues they leave open. Sparse covariance matrices are widely employed in a variety of applications, including genomics, so it is necessary to investigate the estimation of this kind of covariance matrix. The research in this paper can help to better estimate the high-dimensional covariance matrix when the sample is noisy and has missing entries. It thereby provides a theoretical basis for fields that use high-dimensional data to obtain useful information.

The remaining sections of this paper are as follows: Section 2 will provide the associated concepts and knowledge of covariance matrix estimation, which serves as the theoretical foundation for the research. In Section 3, we will study the optimal estimation of sparse covariance matrices with missing and noisy data. In Section 4, numerical simulation experiment will be performed to investigate the estimating effect of the estimator presented in Section 3. The fifth section summarizes the research content and discusses existing problems.

2. Theoretical Basis

This paper will primarily study the optimal estimation of covariance matrices under the $l_{1}$ norm. For a matrix $A = (a_{i j}) \in ℝ^{m \times n}$ , $σ (A) = \sqrt{λ (A^{H} A)}$ represents the singular values of $A$ , while $λ (A^{H} A)$ represents the eigenvalues of $A^{H} A$ . The operator norm of $A$ is defined as

${‖ A ‖}_{a} = \max_{{‖ x ‖}_{a} = 1, x \in ℝ^{n}} {‖ A x ‖}_{a} = \max_{x \neq 0, x \in ℝ^{n}} \frac{{‖ A x ‖}_{a}}{{‖ x ‖}_{a}}$

There are three common operator norms:

(1) $l_{1}$ norm: ${‖ A ‖}_{l_{1}} = {‖ A ‖}_{1} = \max_{1 \leq j \leq n} \sum_{i = 1}^{m} | a_{i j} |$ ;

(2) spectral norm: ${‖ A ‖}_{s p} = {‖ A ‖}_{2} = σ_{1} (A) = \max_{i} σ_{i} (A)$ ;

(3) $l_{\infty}$ norm: ${‖ A ‖}_{\infty} = \max_{1 \leq i \leq m} \sum_{j = 1}^{n} | a_{i j} |$ .
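As a quick numerical illustration of the three norms above, the following sketch computes each one from its definition. NumPy and the helper names `l1_norm`, `linf_norm`, and `spectral_norm` are assumptions made for illustration; the paper does not specify any software.

```python
import numpy as np

def l1_norm(A):
    # Maximum absolute column sum.
    return np.abs(A).sum(axis=0).max()

def linf_norm(A):
    # Maximum absolute row sum.
    return np.abs(A).sum(axis=1).max()

def spectral_norm(A):
    # Largest singular value of A.
    return np.linalg.svd(A, compute_uv=False).max()

A = np.array([[1.0, -2.0],
              [3.0, 4.0]])
norms = (l1_norm(A), spectral_norm(A), linf_norm(A))
```

For a symmetric matrix such as a covariance matrix, the $l_{1}$ and $l_{\infty}$ norms coincide, since column sums and row sums are the same.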

Next, we introduce sub-Gaussian random vectors. If there is a parameter $k > 0$ such that $E (e^{s X}) \leq e^{k^{2} s^{2} / 2}$ for all $s \in ℝ$ , the random variable $X$ is called a sub-Gaussian random variable with parameter $k$ , written $X \sim \mathrm{Sub} (k)$ . Sub-Gaussian random variables include Gaussian random variables with mean 0 and all bounded random variables with mean 0. If the random variable $X$ is sub-Gaussian, its sub-Gaussian norm is defined as

${‖ X ‖}_{ψ_{2}} = \sup_{p \geq 1} p^{- 1 / 2} {(E {| X |}^{p})}^{1 / p} .$

A p-dimensional random vector $X = {(X_{1}, X_{2}, \dots, X_{p})}^{T}$ is called a sub-Gaussian random vector if any linear combination of $X_{1}, X_{2}, \dots, X_{p}$ is sub-Gaussian. That is, there exists $τ > 0$ such that, for any $t > 0$ and any $v \in ℝ^{p}$ with ${‖ v ‖}_{2} = 1$ ,

$P \{ | v^{T} (Χ - E Χ) | > t \} \leq e^{- t^{2} / (2 τ)} .$

Assume that a p-dimensional random vector $Χ \in ℝ^{p}$ has mean $μ$ and covariance matrix $Σ$ . Covariance matrix estimation is the process of computing a matrix $\hat{Σ}$ from $n$ independent copies $Χ_{1}, Χ_{2}, \dots, Χ_{n} \in ℝ^{p}$ of $Χ$ and using $\hat{Σ}$ to approximate $Σ$ in a certain sense. In this paper, minimax risk is used as the standard for measuring estimation performance. Suppose the covariance matrix of $Χ_{1}$ belongs to a certain class, and let $A$ be the corresponding collection of distributions of $Χ_{1}$ . Then, under the specified matrix norm $‖ \cdot ‖$ , [3] defines the minimax risk of estimating $Σ$ over $A$ as

$\inf_{\hat{Σ}} \sup_{A} E {‖ \hat{Σ} - Σ ‖}^{2} .$

When the vector’s dimension $p$ is smaller than the sample size $n$ , the sample covariance matrix is typically utilized to estimate the true covariance matrix. The sample mean is

$\bar{Χ} = \frac{1}{n} \sum_{i = 1}^{n} Χ_{i},$

and the sample covariance matrix is

$\hat{Σ} = {({\hat{σ}}_{i j})}_{p \times p} = \frac{1}{n} \sum_{i = 1}^{n} (Χ_{i} - \bar{Χ}) {(Χ_{i} - \bar{Χ})}^{T} .$ (1)

However, as noted in Section 1, when the dimension $p$ is substantially larger than the sample size $n$ , utilizing the sample covariance matrix for estimating the true covariance matrix becomes inadequate. Based on the work of Cai et al., this paper will study the optimal estimation of high-dimensional covariance matrices based on missing and noisy data.

The missing completely at random (MCR) model is introduced below. MCR indicates that the missingness is random and independent of the data values. Suppose $\{ Χ_{1}, Χ_{2}, \dots, Χ_{n} \}$ is a complete random sample from $Χ$ . Introduce the vector $S_{k} \in {\{ 0, 1 \}}^{p}, k = 1, 2, \dots, n$ as the observation index for $Χ_{k}$ , where

$S_{j k} = \left\{ \begin{array}{ll} 1, & X_{j k} \text{ is observed}, \\ 0, & X_{j k} \text{ is missing} . \end{array} \right.$

$X_{j k}$ and $S_{j k}$ represent the $j$ th coordinate of vectors $Χ_{k}$ and $S_{k}$ , respectively.

We denote $Χ^{*} = \{ Χ_{1}^{*}, Χ_{2}^{*}, \dots, Χ_{n}^{*} \}$ as the sample with missing data, where $Χ_{i}^{*} = {(X_{1 i} S_{1 i}, X_{2 i} S_{2 i}, \dots, X_{p i} S_{p i})}^{T}$ is the $i$ th observation sample. Additionally, define

$n_{i j}^{*} : = \sum_{k = 1}^{n} S_{i k} S_{j k}, 1 \leq i, j \leq p .$

When $S_{i k} S_{j k} = 1$ , the $i$ th and $j$ th components of vector $Χ_{k}^{*}$ are observed simultaneously, whereas $S_{i k} S_{j k} = 0$ indicates that they are not. Thus, $n_{i j}^{*}$ denotes the number of samples in which the $i$ th and $j$ th components of $Χ^{*}$ are simultaneously observed. For convenience, define $n_{i}^{*} = n_{i i}^{*}$ and $n_{\min}^{*} = \min_{i, j} n_{i j}^{*}$ . It is easy to see that $n_{\min}^{*} \leq n_{i j}^{*} \leq \min \{ n_{i}^{*}, n_{j}^{*} \}$ . When the sample data are complete, $n_{i j}^{*} \equiv n$ .

For sample $Χ^{*}$ with missing data, we substitute the generalized sample mean and generalized sample covariance matrix for the traditional sample mean and covariance matrix. The generalized sample mean is defined as the following:

${\bar{Χ}}^{*} : = {({\bar{Χ}}_{i}^{*})}_{1 \leq i \leq p}, {\bar{Χ}}_{i}^{*} = \frac{1}{n_{i}^{*}} \sum_{k = 1}^{n} Χ_{i k}^{*} = \frac{1}{n_{i}^{*}} \sum_{k = 1}^{n} Χ_{i k} S_{i k},$

the generalized sample covariance matrix is defined as

${\hat{Σ}}^{*} : = {({\hat{σ}}_{i j}^{*})}_{1 \leq i, j \leq p}, {\hat{σ}}_{i j}^{*} = \frac{1}{n_{i j}^{*}} \sum_{k = 1}^{n} (Χ_{i k} - {\bar{Χ}}_{i}^{*}) (Χ_{j k} - {\bar{Χ}}_{j}^{*}) S_{i k} S_{j k} .$ (2)
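The generalized sample mean and covariance above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `generalized_sample_cov` is hypothetical, the data are stored with samples in rows (an `(n, p)` array, the transpose of the paper's indexing $X_{i k}$ ), and every pair of coordinates is assumed to be jointly observed at least once so that no $n_{i j}^{*}$ is zero.

```python
import numpy as np

def generalized_sample_cov(X, S):
    """Generalized sample mean and covariance under a 0/1 observation mask.

    X : (n, p) data array; entries where S == 0 are ignored (may be NaN).
    S : (n, p) array of 0/1 observation indicators.
    Assumes every pair of coordinates is jointly observed at least once.
    """
    S = np.asarray(S, dtype=float)
    Xs = np.where(S > 0, X, 0.0)              # zero out missing entries
    n_pair = S.T @ S                          # n*_ij: joint observation counts
    mean = Xs.sum(axis=0) / np.diag(n_pair)   # generalized sample mean
    R = (Xs - mean) * S                       # centered residuals, masked
    cov = (R.T @ R) / n_pair                  # entrywise division by n*_ij
    return mean, cov, n_pair

# Small example: the second coordinate of the second sample is missing.
X = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0]])
S = np.array([[1, 1], [1, 0], [1, 1]])
mean, cov, n_pair = generalized_sample_cov(X, S)
```

With a mask of all ones, the result reduces to the sample covariance matrix in Equation (1), which divides by $n$ .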

3. Covariance Matrix Estimation

This paper assumes that the covariance matrix is sparse, which means that the majority of its components are 0 or insignificant, and the distribution of non-zero elements is irregular. First, we introduce the parameter space $G_{q} (ρ, c_{n, p})$ of the sparse covariance matrices:

$G_{q} (ρ, c_{n, p}) = \{ Σ = {(σ_{i j})}_{1 \leq i, j \leq p} : \max_{1 \leq j \leq p} {| σ_{[k] j} |}^{q} \leq \frac{c_{n, p}}{k}, \forall k, \max_{i} σ_{i i} \leq ρ \}$ ,

where $0 \leq q < 1$ , and $σ_{[k] j}$ represents the element with the $k$ th largest absolute value in the $j$ th column of matrix $Σ$ . When $q = 0$ , each column of the matrices in $G_{q} (ρ, c_{n, p})$ has at most $c_{n, p}$ non-zero components, usually assuming $c_{n, p} \geq 1$ .
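The membership condition can be checked numerically for a given matrix. The sketch below is illustrative only: the helper name `in_sparse_class` is hypothetical, and the case $q = 0$ is read, following the remark above, as "at most $c_{n, p}$ non-zero entries per column".

```python
import numpy as np

def in_sparse_class(Sigma, q, c, rho):
    """Check whether Sigma lies in G_q(rho, c) for 0 <= q < 1 (illustrative)."""
    Sigma = np.asarray(Sigma, dtype=float)
    if np.diag(Sigma).max() > rho:
        return False
    # Absolute values of each column, sorted in decreasing order.
    sorted_abs = np.sort(np.abs(Sigma), axis=0)[::-1]
    if q == 0:
        # Interpreted as: at most c non-zero entries in every column.
        return bool((sorted_abs > 0).sum(axis=0).max() <= c)
    k = np.arange(1, Sigma.shape[0] + 1)[:, None]
    # k-th largest absolute entry of each column, raised to q, at most c / k.
    return bool(np.all(sorted_abs ** q <= c / k))
```

For example, the identity matrix belongs to the class with $c_{n, p} = 1$ and $ρ = 1$ for every admissible $q$ .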

3.1. Noisy Model

Assume the complete random vector $Χ \in ℝ^{p}$ has the covariance matrix $Σ_{X} = {(σ_{i j})}_{p \times p}$ . Using a $p$ -dimensional random vector $F$ to represent the noisy data, the noisy model can be expressed as

$F = X + ε,$ (3)

where $ε \in ℝ^{p}$ represents noise. In this section, we hope to build a $p \times p$ matrix ${\hat{Σ}}_{F}$ based on $n$ independent random noisy samples $F_{1}, F_{2}, \dots F_{n} \in ℝ^{p}$ of $F$ . We next use ${\hat{Σ}}_{F}$ to estimate the covariance matrix $Σ_{X}$ of the random vector $X$ .

The noisy sample with missing data is represented by $F^{*} = \{ F_{1}^{*}, F_{2}^{*}, \dots, F_{n}^{*} \}$ , where $F_{i}^{*} = {(F_{1 i} S_{1 i}, F_{2 i} S_{2 i}, \dots, F_{p i} S_{p i})}^{T}$ is the $i$ th observation sample. The generalized sample mean is defined as follows:

${\bar{F}}^{*} : = {({\bar{F}}_{i}^{*})}_{1 \leq i \leq p}, {\bar{F}}_{i}^{*} = \frac{1}{n_{i}^{*}} \sum_{k = 1}^{n} F_{i k}^{*} = \frac{1}{n_{i}^{*}} \sum_{k = 1}^{n} F_{i k} S_{i k},$

the generalized sample covariance matrix is defined as

${\hat{Σ}}_{F}^{*} : = {({\hat{σ}}_{i j}^{*} (F))}_{1 \leq i, j \leq p}, {\hat{σ}}_{i j}^{*} (F) = \frac{1}{n_{i j}^{*}} \sum_{k = 1}^{n} (F_{i k} - {\bar{F}}_{i}^{*}) (F_{j k} - {\bar{F}}_{j}^{*}) S_{i k} S_{j k} .$ (4)

Two new assumptions in [12] are presented below.

Assumption 1. The observation index $S : = \{ S_{1}, S_{2}, \dots, S_{n} \}$ can be random or deterministic, but it is independent of the noisy observation sample $F : = \{ F_{1}, F_{2}, \dots, F_{n} \}$ .

Assumption 2. The random vectors $F_{1}, F_{2}, \dots F_{n}$ are i.i.d., where $F_{k} = X_{k} + ε_{k}$ , and

$X_{k} = Γ Z_{k} + μ, ε_{k} = Γ^{ε} Z_{k}, k = 1, 2, \dots n .$

$μ$ represents a fixed $p$ -dimensional mean vector. $Γ, Γ^{ε} \in ℝ^{p \times q} (p \leq q)$ are fixed matrices with $Γ Γ^{T} = Σ$ and $Γ^{ε} {(Γ^{ε})}^{T} = Σ^{ε}$ . The components of the random vector $Z_{k} = {(Z_{1 k}, Z_{2 k}, \dots, Z_{q k})}^{T}$ are i.i.d. sub-Gaussian with mean 0 and variance 1. That is, there exists a parameter $τ > 0$ such that $E (e^{s Z_{i k}}) \leq e^{τ s^{2} / 2}$ for any $s > 0$ , i.e., $Z_{i k} \sim \mathrm{Sub} (τ)$ .

3.2. Upper Bound for Estimating Sparse Covariance Matrix

The hard thresholding estimator based on complete data was proposed in [1] . When most of the elements in each row or column of the true covariance matrix are zero or negligible, the elements of the sample covariance matrix below a certain threshold are set to 0 and the remaining elements are left unaltered, which reduces the estimation error. In [2] ,

$P \{ | {\hat{σ}}_{i j} - σ_{i j} | \leq t \} \geq 1 - C p^{- 8}$

for $t = γ \sqrt{\log p / n}$ , where $C$ is a constant. The threshold is therefore set to $γ \sqrt{\log p / n}$ .

In this paper, this result is extended to the case of missing and noisy data. According to Lemma 4.6 in [12] , if Assumption 1 and Assumption 2 both hold, then there are two absolute constants $C > 0$ and $c > 0$ such that

$P \{ | {\hat{σ}}_{i j}^{*} (F) - σ_{i j} | \leq x \} \geq 1 - C \exp \{ - c n_{\min}^{*} \min (\frac{x^{2}}{τ^{4} σ_{i i} σ_{j j}}, \frac{x}{τ^{2} \sqrt{σ_{i i} σ_{j j}}}) \}$ (5)

for any $x > 0$ . Since $σ_{i i} σ_{j j} \leq ρ^{2}$ , the above can be simplified to: there are constants $C > 0$ and $γ > 0$ , such that

$P \{ | {\hat{σ}}_{i j}^{*} (F) - σ_{i j} | \leq x \} \geq 1 - C \exp (- \frac{8}{γ^{2}} n_{\min}^{*} x^{2}) ,$ (6)

where $0 < x \leq ρ$ , and the constants $C$ and $γ$ depend only on $ρ$ . Note that Inequality (6) can be written as $P \{ | {\hat{σ}}_{i j}^{*} (F) - σ_{i j} | \leq x \} \geq 1 - C p^{- 8}$ when $x = γ \sqrt{\log p / n_{\min}^{*}}$ .
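The substitution behind this remark can be checked directly; this is a short worked step added for clarity. With $x = γ \sqrt{\log p / n_{\min}^{*}}$ , the exponent in Inequality (6) becomes

$\frac{8}{γ^{2}} n_{\min}^{*} x^{2} = \frac{8}{γ^{2}} n_{\min}^{*} \cdot γ^{2} \frac{\log p}{n_{\min}^{*}} = 8 \log p ,$

so that $C \exp (- 8 \log p) = C p^{- 8}$ . The restriction $x \leq ρ$ then amounts to $γ \sqrt{\log p / n_{\min}^{*}} \leq ρ$ , which holds once $n_{\min}^{*}$ is large relative to $\log p$ .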

The hard thresholding estimator ${\hat{Σ}}_{F}^{t h}$ of the covariance matrices $Σ_{X} \in G_{q} (ρ, c_{n, p})$ is defined by transforming the generalized sample covariance ${\hat{σ}}_{i j}^{*} (F)$ in Equation (4),

${\hat{Σ}}_{F}^{t h} = {({\hat{σ}}_{i j}^{t h} (F))}_{p \times p} = {({\hat{σ}}_{i j}^{*} (F) \cdot I (| {\hat{σ}}_{i j}^{*} (F) | \geq λ))}_{p \times p}, λ = γ \sqrt{\frac{\log p}{n_{\min}^{*}}},$ (7)

where $γ$ is a constant and $γ > 0$ .
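The thresholding rule in Equation (7) can be sketched as follows. The function name `hard_threshold` and the default `gamma=2.0` are assumptions made for illustration; the paper only requires $γ > 0$ and does not state a numerical value here.

```python
import numpy as np

def hard_threshold(cov, n_min, gamma=2.0):
    """Keep entries with |entry| >= lambda, set the rest to zero.

    lambda = gamma * sqrt(log(p) / n_min), where p is the dimension and
    n_min is the smallest joint observation count n*_min.
    """
    cov = np.asarray(cov, dtype=float)
    p = cov.shape[0]
    lam = gamma * np.sqrt(np.log(p) / n_min)
    return np.where(np.abs(cov) >= lam, cov, 0.0), lam

# Example: a small off-diagonal entry falls below the threshold and is removed.
cov = np.array([[1.0, 0.05],
                [0.05, 1.0]])
cov_th, lam = hard_threshold(cov, n_min=100, gamma=1.0)
```

Because the threshold depends on $n_{\min}^{*}$ rather than $n$ , heavier missingness leads to a larger threshold and therefore to more entries being set to zero.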

The following is Lemma 1, which plays an important role in studying the minimax upper bound. Lemma 1 generalizes Lemma 8 in [8] from complete to noisy and missing sample.

Lemma 1. Define the event $A_{i j} : = \{ | {\hat{σ}}_{i j}^{t h} (F) - σ_{i j} | > 4 \min (| σ_{i j} |, λ) \}$ with $λ = γ \sqrt{\log p / n_{\min}^{*}}$ . Then there is a constant $c > 0$ , depending only on $ρ$ , such that

$P (A_{i j}) \leq 2 c p^{- \frac{9}{2}} .$

Proof: First, define the event $B_{1} : = \{ | {\hat{σ}}_{i j}^{t h} (F) | \geq λ \}$ . It is easy to see that

$\begin{array}{l} B_{1} \subset \{ | {\hat{σ}}_{i j}^{t h} (F) - σ_{i j} | \geq | {\hat{σ}}_{i j}^{t h} (F) | - | σ_{i j} | \geq λ - | σ_{i j} | \}, \\ B_{1}^{c} \subset \{ | {\hat{σ}}_{i j}^{t h} (F) - σ_{i j} | \geq | σ_{i j} | - | {\hat{σ}}_{i j}^{t h} (F) | > | σ_{i j} | - λ \} . \end{array}$ (8)

According to the definition of ${\hat{σ}}_{i j}^{t h} (F)$ in Equation (7),

$| {\hat{σ}}_{i j}^{t h} (F) - σ_{i j} | = | {\hat{σ}}_{i j}^{*} (F) - σ_{i j} | I (B_{1}) + | σ_{i j} | I (B_{1}^{c}) .$ (9)

Next, we prove this lemma by cases. A simple calculation gives

$P \{ | {\hat{σ}}_{i j}^{t h} (F) - σ_{i j} | \leq 4 \min (| σ_{i j} |, λ) \} \geq \left\{ \begin{array}{ll} 1 - C p^{- 9 / 2}, & | σ_{i j} | < λ / 4, \\ 1 - C p^{- 8}, & λ / 4 \leq | σ_{i j} | \leq 2 λ, \\ 1 - 2 C p^{- 8}, & | σ_{i j} | > 2 λ . \end{array} \right.$

Therefore, there exists a constant $c > 0$ such that

$P \{ | {\hat{σ}}_{i j}^{t h} (F) - σ_{i j} | > 4 \min (| σ_{i j} |, λ) \} \leq 2 c p^{- 9 / 2}$ .

Next, we can obtain the upper bound for estimating the sparse covariance matrices by utilizing the risk properties of thresholding estimator.

Theorem 1. If Assumption 1 and Assumption 2 hold, $\log p = o (\sqrt{n_{\min}^{*}})$ and $p \geq \sqrt{n_{\min}^{*}}$ , then there is a constant $C > 0$ such that the hard thresholding estimator ${\hat{Σ}}_{F}^{t h}$ defined by Equation (7) satisfies

$\sup_{Σ_{X} \in G_{q} (ρ, c_{n, p})} E {‖ {\hat{Σ}}_{F}^{t h} - Σ_{X} ‖}_{1}^{2} \leq C c_{n, p}^{2} {(\frac{\log p}{n_{\min}^{*}})}^{1 - q} .$ (10)

Proof: It is easy to see that ${‖ {\hat{Σ}}_{F}^{t h} - Σ_{X} ‖}_{1}^{2} = {[\max_{j} \sum_{i = 1}^{p} | {\hat{σ}}_{i j}^{t h} (F) - σ_{i j} |]}^{2}$ . If event $A_{i j}$ occurs for every $i$ , then

$\sum_{i = 1}^{p} | {\hat{σ}}_{i j}^{t h} (F) - σ_{i j} | > \sum_{i = 1}^{p} 4 \min (| σ_{i j} |, λ) .$

Simple calculations show that

$\sum_{i = 1}^{p} \min (| σ_{i j} |, λ) = (\sum_{i \leq k^{'}} + \sum_{i > k^{'}}) \min (| σ_{i j} |, λ) \leq (\sum_{i \leq k^{'}} + \sum_{i > k^{'}}) \min (| σ_{[i] j} |, γ \sqrt{\frac{\log p}{n_{\min}^{*}}}) .$

According to the definition of $G_{q} (ρ, c_{n, p})$ , we know that $\max_{1 \leq j \leq p} {{| σ_{[i] j} |}^{q}} \leq c_{n, p} / i$ , so $| σ_{[i] j} | \leq {(c_{n, p} / i)}^{1 / q}$ . Select the constant $k^{'}$ to satisfy $k^{'} = ⌊ c_{n, p} {(n_{\min}^{*} / \log p)}^{q / 2} ⌋$ , so

$\begin{array}{l} \sum_{i = 1}^{p} \min (| σ_{i j} |, λ) \leq \sum_{i \leq k^{'}} \min ({(\frac{c_{n, p}}{i})}^{\frac{1}{q}}, γ \sqrt{\frac{\log p}{n_{\min}^{*}}}) + \sum_{i > k^{'}} \min ({(\frac{c_{n, p}}{i})}^{\frac{1}{q}}, γ \sqrt{\frac{\log p}{n_{\min}^{*}}}) \\ \leq k^{'} γ \sqrt{\frac{\log p}{n_{\min}^{*}}} + \sum_{i > k^{'}} {(\frac{c_{n, p}}{i})}^{\frac{1}{q}} \leq C_{1} c_{n, p} {(\frac{\log p}{n_{\min}^{*}})}^{\frac{1 - q}{2}} . \end{array}$
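The two terms in the last line can be bounded explicitly when $0 < q < 1$ . The following is only a sketch: it ignores the floor in $k^{'}$ , assumes $k^{'} \geq 1$ , and does not track constants. Comparing the tail sum with an integral gives

$\sum_{i > k^{'}} {(\frac{c_{n, p}}{i})}^{\frac{1}{q}} \leq c_{n, p}^{\frac{1}{q}} \int_{k^{'}}^{\infty} x^{- \frac{1}{q}} d x = \frac{q}{1 - q} c_{n, p}^{\frac{1}{q}} {(k^{'})}^{1 - \frac{1}{q}} .$

Substituting $k^{'} = c_{n, p} {(n_{\min}^{*} / \log p)}^{q / 2}$ yields

$c_{n, p}^{\frac{1}{q}} {(k^{'})}^{1 - \frac{1}{q}} = c_{n, p} {(\frac{n_{\min}^{*}}{\log p})}^{\frac{q - 1}{2}} = c_{n, p} {(\frac{\log p}{n_{\min}^{*}})}^{\frac{1 - q}{2}}, \quad k^{'} γ \sqrt{\frac{\log p}{n_{\min}^{*}}} = γ c_{n, p} {(\frac{\log p}{n_{\min}^{*}})}^{\frac{1 - q}{2}} ,$

so both terms are of order $c_{n, p} {(\log p / n_{\min}^{*})}^{(1 - q) / 2}$ , which is the stated bound.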

Let the matrix $D = {(d_{i j})}_{p \times p}$ satisfy $d_{i j} = | {\hat{σ}}_{i j}^{t h} (F) - σ_{i j} | I (A_{i j})$ , we have

$E {‖ {\hat{Σ}}_{F}^{t h} - Σ_{X} ‖}_{1}^{2} \leq 2 E {‖ {\hat{Σ}}_{F}^{t h} - Σ_{X} - D ‖}_{1}^{2} + 2 E {‖ D ‖}_{1}^{2} .$

Then, it is straightforward to acquire

$\begin{array}{l} 2 E {‖ {\hat{Σ}}_{F}^{t h} - Σ_{X} - D ‖}_{1}^{2} = 2 E {[\max_{j} \sum_{i = 1}^{p} | {\hat{σ}}_{i j}^{t h} (F) - σ_{i j} - ({\hat{σ}}_{i j}^{t h} (F) - σ_{i j}) I (A_{i j}) |]}^{2} \\ \leq 2 E {[\max_{j} \sum_{i = 1}^{p} | ({\hat{σ}}_{i j}^{t h} (F) - σ_{i j}) I (A_{i j}^{c}) |]}^{2} \leq C_{2} c_{n, p}^{2} {(\frac{\log p}{n_{\min}^{*}})}^{1 - q} . \end{array}$

Therefore, we only need to prove that $2 E {‖ D ‖}_{1}^{2}$ is negligible.

Firstly,

$\begin{array}{l} E {‖ D ‖}_{1}^{2} = E {[\max_{j} \sum_{i = 1}^{p} | d_{i j} |]}^{2} \leq p^{2} E {(\max_{i, j} d_{i j})}^{2} \\ = p^{2} E [(\max_{i, j} d_{i j}^{2}) I (A_{i j} \cap \{ | {\hat{σ}}_{i j}^{*} (F) | \geq λ \})] + p^{2} E [(\max_{i, j} d_{i j}^{2}) I (A_{i j} \cap \{ | {\hat{σ}}_{i j}^{*} (F) | < λ \})] \\ \leq p^{2} E [{(\max_{i, j} | {\hat{σ}}_{i j}^{*} (F) - σ_{i j} |)}^{2} I (A_{i j})] + p^{2} E [{(\max_{i, j} | σ_{i j} |)}^{2} I (A_{i j})] \\ = E_{1} + E_{2} . \end{array}$

According to the Cauchy-Schwarz inequality, ${| σ_{i j} |}^{2} \leq σ_{i i} σ_{j j}$ . Since $P (A_{i j}) \leq 2 c p^{- 9 / 2}$ and $p \geq \sqrt{n_{\min}^{*}}$ , we have

$E_{2} \leq p^{2} E (\max_{i, j} σ_{i i} σ_{j j}) P (A_{i j}) \leq p^{2} ρ^{2} 2 c p^{- \frac{9}{2}} \leq C_{3} p^{- \frac{5}{2}} \leq \frac{C_{4}}{n_{\min}^{*}} .$

In addition, $E_{1} \leq p^{2} \max_{i, j} E {| {\hat{σ}}_{i j}^{*} (F) - σ_{i j} |}^{2} P (A_{i j})$ . From Inequality (5),

$\begin{array}{l} E {| {\hat{σ}}_{i j}^{*} (F) - σ_{i j} |}^{2} = \int {| {\hat{σ}}_{i j}^{*} (F) - σ_{i j} |}^{2} d P \leq \int_{0}^{\infty} x P \{ | {\hat{σ}}_{i j}^{*} (F) - σ_{i j} | \geq x \} d x \\ \leq \int_{0}^{ρ} x P \{ | {\hat{σ}}_{i j}^{*} (F) - σ_{i j} | \geq x \} d x + \int_{ρ}^{\infty} x P \{ | {\hat{σ}}_{i j}^{*} (F) - σ_{i j} | \geq x \} d x \\ \leq ρ^{2} + \int_{ρ}^{\infty} x \exp (- C_{5} n_{\min}^{*} \frac{x}{ρ}) d x \leq ρ^{2} + \frac{ρ^{2}}{C_{5} n_{\min}^{*}} \exp (- C_{5} n_{\min}^{*}) . \end{array}$

From Lemma 1, we have $P (A_{i j}) \leq 2 c p^{- 9 / 2}$ . Together with $p \geq \sqrt{n_{\min}^{*}}$ , this gives

$\begin{array}{l} E_{1} \leq p^{2} \max_{i, j} E {| {\hat{σ}}_{i j}^{*} (F) - σ_{i j} |}^{2} P (A_{i j}) \leq p^{2} (ρ^{2} + \frac{ρ^{2}}{C_{5} n_{\min}^{*}} \exp (- C_{5} n_{\min}^{*})) 2 c p^{- 9 / 2} \\ \leq p^{2} ρ^{2} 2 c p^{- 9 / 2} + p^{2} \frac{ρ^{2}}{C_{5} n_{\min}^{*}} \exp (- C_{5} n_{\min}^{*}) 2 c p^{- 9 / 2} \\ \leq \frac{C_{4}}{n_{\min}^{*}} + \frac{C_{6}}{n_{\min}^{*}} \leq \frac{C_{7}}{n_{\min}^{*}} \end{array}$

To sum up,

$E {‖ D ‖}_{1}^{2} \leq E_{1} + E_{2} \leq \frac{C_{8}}{n_{\min}^{*}} = O (\frac{1}{n_{\min}^{*}})$ .

3.3. Lower Bound for Estimating Sparse Covariance Matrix

Before studying the lower bound, introduce some useful lemmas and symbols.

Lemma 2. Assume $P$ and $V$ are two probability measures with probability density functions $p$ and $v$ , respectively. The total variation distance between $P$ and $V$ is $V (P, V) = 1 - \int \min (d P, d V)$ . Define the total variation affinity as $‖ P \land V ‖ : = \int \min (d P, d V) = \int p (x) \land v (x) d x$ . The Kullback-Leibler divergence between $P$ and $V$ is $K L (P ‖ V) = \int p (x) \log [p (x) / v (x)] d x$ . Then $‖ P \land V ‖$ and $K L (P ‖ V)$ satisfy the following inequality:

$1 - ‖ P \land V ‖ \leq \sqrt{\frac{K L (P ‖ V)}{2}} .$ (11)
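Inequality (11) can be illustrated on a small discrete example, where the integrals become sums. This is only a numerical sanity check under assumed toy distributions; the helper names `affinity` and `kl` and the two probability vectors are hypothetical.

```python
import numpy as np

def affinity(p, v):
    # Total variation affinity: sum of pointwise minima of the two mass functions.
    return np.minimum(p, v).sum()

def kl(p, v):
    # Kullback-Leibler divergence for strictly positive discrete distributions.
    return np.sum(p * np.log(p / v))

p = np.array([0.5, 0.3, 0.2])
v = np.array([0.4, 0.4, 0.2])

lhs = 1.0 - affinity(p, v)          # total variation distance
rhs = np.sqrt(kl(p, v) / 2.0)       # bound from Inequality (11)
holds = lhs <= rhs
```

In the lower-bound proof, the inequality is used in the other direction of reading: a small Kullback-Leibler divergence forces the affinity to stay away from zero.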

Lemma 2 in [14] and Le Cam’s lemma and its corollary in [2] [12] introduced below are important tools for proving minimax lower bound.

Lemma 3 (Le Cam). Suppose $Θ = \{ θ_{0}, θ_{1}, \dots, θ_{m} \}$ is a finite set of parameters. Let $L$ be a loss function and $l_{\min} : = \min_{1 \leq i \leq m} \inf_{t} [L (t, θ_{0}) + L (t, θ_{i})]$ . Then

$\sup_{θ \in Θ} E [L (\tilde{θ}, θ)] \geq \frac{1}{2} l_{\min} ‖ P_{θ_{0}} \land \bar{P} ‖ .$

$\tilde{θ}$ is any estimator of $θ$ based on the observed values of the probability measure $P_{θ} (θ \in Θ)$ , and $\bar{P} = \frac{1}{m} \sum_{i = 1}^{m} P_{θ_{i}}$ .

Lemma 4. Let $\tilde{Σ}$ be any estimator of $Σ_{i}$ based on the collection of probability measures $\{ P_{Σ_{0}}, P_{Σ_{1}}, \dots, P_{Σ_{m}} \}$ . Then

$\sup_{1 \leq i \leq m} E {‖ \tilde{Σ} - Σ_{i} ‖}_{1} \geq \frac{1}{2} ‖ P_{Σ_{0}} \land \bar{P} ‖ \cdot \inf_{1 \leq i \leq m} {‖ Σ_{i} - Σ_{0} ‖}_{1},$

where $\bar{P} = \frac{1}{m} \sum_{i = 1}^{m} P_{Σ_{i}}$ .

Before studying the minimax risk lower bound, we construct matrices whose off-diagonal elements are all 0 except in the first row and column. Let $\mathcal{H}$ be the collection of $p \times p$ symmetric matrices in which exactly $k$ off-diagonal elements in the first row (and, by symmetry, the first column) are equal to 1 and all other elements are 0. Let $k = ⌊ c_{n, p} {(n_{0} / \log p)}^{q / 2} ⌋$ . Define

$G_{0} = \{ Σ = {(σ_{i j})}_{1 \leq i, j \leq p} : Σ = I_{p} \text{ or } Σ = I_{p} + a H, H \in \mathcal{H} \},$ (12)

where $I_{p}$ represents the identity matrix of size $p \times p$ , $a = \sqrt{δ \log p / n_{0}}$ , and $δ$ is a constant. Assuming $ρ > 1$ and $0 < δ < \min \{ 1, 1 / (4 M) \}$ , with $M$ as in Theorem 2 below, it is easy to see that $G_{0} \subset G_{q} (ρ, c_{n, p})$ .
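A single element of $G_{0}$ can be built as in the sketch below. The function name `make_sigma` and the default choice of which $k$ positions carry the non-zero entries are hypothetical; any choice of $k$ positions among the $p - 1$ off-diagonal entries of the first row gives a valid element.

```python
import numpy as np

def make_sigma(p, k, a, support=None):
    """Return (I_p + a*H, H), where H has k ones in the first row and column."""
    if support is None:
        support = np.arange(1, k + 1)   # hypothetical default: positions 1..k
    H = np.zeros((p, p))
    H[0, support] = 1.0                 # k ones in the first row
    H[support, 0] = 1.0                 # mirrored into the first column
    return np.eye(p) + a * H, H

Sigma, H = make_sigma(p=6, k=3, a=0.1)
# l1 norm (maximum absolute column sum) of the perturbation a*H.
l1_gap = np.abs(Sigma - np.eye(6)).sum(axis=0).max()
```

The first column of $a H$ has $k$ entries equal to $a$ and every other column has at most one, so the $l_{1}$ distance between such a matrix and the identity is $k a$ . This is the separation used in the lower-bound argument.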

Obtaining the lower bound requires two steps. Firstly, the subset of the parameter space constructed above is selected to simplify the proof. Secondly, calculate the total variation affinity between two probability measures.

Theorem 2. Let $1 \leq n_{0} \leq n_{\min}^{*}$ , $p \geq n_{0}^{ν} (ν > 1)$ , and $\log p \leq n_{0}$ . Assume $c_{n, p} \leq M {(n_{0} / \log p)}^{(1 - q) / 2}$ with $0 \leq q < 1, M > 0$ . Then there exists a constant $c > 0$ such that the minimax risk for estimating the covariance matrix $Σ_{X}$ satisfies

$\inf_{{\tilde{Σ}}_{F}} \sup_{Σ_{X} \in G_{q} (ρ, c_{n, p})} E {‖ {\tilde{Σ}}_{F} - Σ_{X} ‖}_{1}^{2} \geq c c_{n, p}^{2} {(\frac{\log p}{n_{0}})}^{1 - q} ,$ (13)

where ${\tilde{Σ}}_{F}$ is any estimator of $Σ_{X}$ based on the noisy sample.

Proof: Assume $G_{0} = \{ Σ_{0}, Σ_{1}, \dots, Σ_{m^{*}} \}$ has $m^{*} + 1$ elements, where $Σ_{0}$ is the identity matrix and $Σ_{i}, i = 1, \dots, m^{*}$ are the non-identity matrices. Then $m^{*} = \mathrm{Card} (G_{0}) - 1 = C_{p - 1}^{k}$ .

Assume that $X_{l}, l = 1, \dots, n \overset{\text{i.i.d.}}{\sim} N (0, Σ_{i}), i = 1, \dots, m^{*}$ , with probability measure $P_{Σ_{i}}$ and probability density function $f_{i}$ , that is, $Σ_{X} \in G_{0}$ . Let $F_{l} \overset{\text{i.i.d.}}{\sim} N (0, Σ_{i} (F))$ with $Σ_{i} (F) = Σ_{i} + s^{2} I_{p}$ , and let $P_{Σ_{i}} (F)$ be the corresponding probability measure. Since $G_{0} \subset G_{q} (ρ, c_{n, p})$ , it is easy to see that

$\inf_{{\tilde{Σ}}_{F}} \sup_{Σ_{X} \in G_{q} (ρ, c_{n, p})} E {‖ {\tilde{Σ}}_{F} - Σ_{X} ‖}_{1}^{2} \geq \inf_{{\tilde{Σ}}_{F}} \sup_{Σ_{X} \in G_{0}} E {‖ {\tilde{Σ}}_{F} - Σ_{X} ‖}_{1}^{2} .$

Therefore, to prove Inequality (13), it suffices to prove the following inequality:

$\inf_{{\tilde{Σ}}_{F}} \sup_{Σ_{X} \in G_{0}} E {‖ {\tilde{Σ}}_{F} - Σ_{X} ‖}_{1}^{2} \geq c c_{n, p}^{2} {(\frac{\log p}{n_{0}})}^{1 - q} .$ (14)

Lemma 4 shows that

$\sup_{Σ_{X} \in G_{0}} E {‖ {\tilde{Σ}}_{F} - Σ_{X} ‖}_{1} \geq \sup_{1 \leq i \leq m^{*}} E {‖ {\tilde{Σ}}_{F} - Σ_{i} ‖}_{1} \geq \frac{1}{2} ‖ P_{Σ_{0}} (F) \land \bar{P} (F) ‖ \cdot \inf_{1 \leq i \leq m^{*}} {‖ Σ_{i} (F) - Σ_{0} (F) ‖}_{1} .$

Since $a = \sqrt{δ \log p / n_{0}}$ and $k = ⌊ c_{n, p} {(n_{0} / \log p)}^{q / 2} ⌋$ , there exists a constant $c_{1} > 0$ such that

$\begin{array}{l} \inf_{1 \leq i \leq m^{*}} {‖ Σ_{i} (F) - Σ_{0} (F) ‖}_{1} = \inf_{1 \leq i \leq m^{*}} {‖ (Σ_{i} + s^{2} I_{p}) - (Σ_{0} + s^{2} I_{p}) ‖}_{1} = \inf_{1 \leq i \leq m^{*}} {‖ a H ‖}_{1} \\ = k a \geq c_{n, p} {(\frac{n_{0}}{\log p})}^{q / 2} \cdot \sqrt{δ \frac{\log p}{n_{0}}} \geq c_{1} c_{n, p} {(\frac{\log p}{n_{0}})}^{\frac{1 - q}{2}} \end{array}$ (15)

Obviously, to prove Inequality (14), we only need to prove that there is a constant $c_{2} > 0$ such that $‖ P_{Σ_{0}} (F) \land \bar{P} (F) ‖ \geq c_{2}$ .

From $F_{l} \overset{\text{i.i.d.}}{\sim} N (0, Σ_{i} (F))$ , we have

$K L (P_{Σ_{0}} (F) ‖ P_{Σ_{i}} (F)) = \frac{1}{2} [\mathrm{tr} (Σ_{i}^{- 1} (F) Σ_{0} (F)) - \log \det (Σ_{i}^{- 1} (F) Σ_{0} (F)) - p] .$
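The closed-form divergence between two zero-mean Gaussian distributions can be evaluated numerically as in the following sketch. The function name `gaussian_kl` is hypothetical, and both matrices are assumed to be symmetric positive definite; the formula is for a single observation.

```python
import numpy as np

def gaussian_kl(Sigma0, Sigma1):
    """KL( N(0, Sigma0) || N(0, Sigma1) ) for positive definite matrices."""
    p = Sigma0.shape[0]
    M = np.linalg.solve(Sigma1, Sigma0)      # Sigma1^{-1} Sigma0
    _, logdet = np.linalg.slogdet(M)         # log det, computed stably
    return 0.5 * (np.trace(M) - logdet - p)

# Example: identity versus a doubled identity in dimension 2.
kl_same = gaussian_kl(np.eye(2), np.eye(2))
kl_diff = gaussian_kl(np.eye(2), 2.0 * np.eye(2))
```

Using `slogdet` instead of taking the logarithm of a determinant avoids overflow or underflow when $p$ is large.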

Let $B = - a H$ , so that $Σ_{0} (F) = Σ_{i} (F) + B$ . Suppose the eigenvalues of $B Σ_{i}^{- 1} (F)$ are $ξ_{1}, \dots, ξ_{p}$ . Then

$\mathrm{tr} (Σ_{i}^{- 1} (F) Σ_{0} (F)) = \mathrm{tr} [Σ_{i}^{- 1} (F) (Σ_{i} (F) + B)] = \mathrm{tr} [I_{p} + B Σ_{i}^{- 1} (F)] = p + \sum_{i = 1}^{p} ξ_{i} .$ (16)

In addition, a second-order Taylor expansion of $\log (1 + ξ_{i})$ gives

$\log \det (Σ_{i}^{- 1} (F) Σ_{0} (F)) = \sum_{i = 1}^{p} \log (1 + ξ_{i}) = \mathrm{tr} (Σ_{i}^{- 1} (F) Σ_{0} (F)) - p - \sum_{i = 1}^{p} \frac{1}{2 {(1 + θ_{i})}^{2}} ξ_{i}^{2},$ (17)

where each $θ_{i}$ is a number between 0 and $ξ_{i}$ . Putting Equation (16) and Equation (17) into $K L (P_{Σ_{0}} (F) ‖ P_{Σ_{i}} (F))$ , we get

$K L (P_{Σ_{0}} (F) ‖ P_{Σ_{i}} (F)) = \frac{1}{2} [p + \sum_{i = 1}^{p} \frac{1}{2 {(1 + θ_{i})}^{2}} ξ_{i}^{2} - p] = \frac{1}{2} \sum_{i = 1}^{p} \frac{1}{2 {(1 + θ_{i})}^{2}} ξ_{i}^{2} \leq \frac{1}{4} \sum_{i = 1}^{p} ξ_{i}^{2} .$

According to Theorem 1.3 in [12] ,

$\sum_{i = 1}^{p} ξ_{i}^{2} = \mathrm{tr} ({(B Σ_{i}^{- 1} (F))}^{H} B Σ_{i}^{- 1} (F)) = {‖ B Σ_{i}^{- 1} (F) ‖}_{F}^{2} \leq {‖ Σ_{i}^{- 1} (F) ‖}_{s p}^{2} {‖ B ‖}_{F}^{2} \leq 2 k a^{2} .$

It is easy to see that $\bar{P} (F) = 1 / m^{*} \sum_{i = 1}^{m^{*}} P_{Σ_{i}} (F)$ , hence

$K L (P_{Σ_{0}} (F) ‖ \bar{P} (F)) \leq \frac{1}{m^{*}} \sum_{i = 1}^{m^{*}} K L (P_{Σ_{0}} (F) ‖ P_{Σ_{i}} (F)) \leq \frac{1}{8} .$

Lemma 2 implies

$‖ P_{Σ_{0}} (F) \land \bar{P} (F) ‖ \geq 1 - \sqrt{\frac{K L (P_{Σ_{0}} (F) ‖ \bar{P} (F))}{2}} \geq 1 - \sqrt{\frac{1}{16}} = \frac{3}{4} .$

That is, there exists a constant $c_{2} > 0$ such that $‖ P_{Σ_{0}} (F) \land \bar{P} (F) ‖ \geq c_{2}$ .

It is worth noting that Theorem 2 requires $c_{n, p} \leq M {(n_{0} / \log p)}^{(1 - q) / 2} (M > 0)$ , which is a necessary condition. If $c_{n, p} > M {(n_{0} / \log p)}^{(1 - q) / 2}$ , then

$\inf_{{\tilde{Σ}}_{F}} \sup_{Σ_{X} \in G_{q} (ρ, c_{n, p})} E {‖ {\tilde{Σ}}_{F} - Σ_{X} ‖}_{1}^{2} \geq \inf_{{\tilde{Σ}}_{F}} \sup_{Σ_{X} \in G_{q} (ρ, M {(n_{0} / \log p)}^{(1 - q) / 2})} E {‖ {\tilde{Σ}}_{F} - Σ_{X} ‖}_{1}^{2} ≳ M^{2} .$

$Σ_{X}$ does not have a consistent estimator in this case.

Theorem 1 and Theorem 2 show that the estimator ${\hat{Σ}}_{F}^{t h}$ we construct is rate-optimal over $G_{q} (ρ, c_{n, p})$ under the $l_{1}$ norm.

4. Numerical Analysis

The optimal estimation of sparse covariance matrices based on missing and noisy data is derived in Section 3. This section uses numerical simulation to compare the performance of the hard thresholding estimator ${\hat{Σ}}_{F}^{t h}$ , as defined in Section 3, against the traditional estimator.

Some symbols are presented before the numerical simulation begins. Assume the $p$ -dimensional Gaussian random vector $Χ$ has a mean of $μ$ and a covariance matrix of $Σ_{X}$ . $n$ is the number of samples, and $p$ is their dimension.

Here are the specific steps of numerical simulation.

1) Construct the sparse covariance matrix.

Assume $μ$ is a zero vector and $Σ_{X}$ is a sparse matrix ( $Σ_{X} \in G_{q} (ρ, c_{n, p})$ ). Following the construction of the sparse matrix in [11] , let

$Σ_{X} = I_{p} + (B + B^{T}) / ({‖ B + B^{T} ‖}_{1} + 0.01),$

where $B = {(b_{i j})}_{p \times p}$ , and $P (b_{i j} = - 1) = 0.1$ , $P (b_{i j} = 0) = 0.8$ , $P (b_{i j} = 1) = 0.1$ .

2) Generate random samples according to the true covariance matrix.

After $Σ_{X}$ is constructed, $n$ $p$ -dimensional random samples are first generated from the multivariate normal distribution with mean $μ$ and covariance matrix $Σ_{X}$ . The resulting $n$ samples are then subjected to noise with a sub-Gaussian distribution, followed by random missing processing. This method produces sample data with missing and sub-Gaussian noise.

3) Compare the estimation effect of different estimators.

Based on the sample data with missing entries and sub-Gaussian noise, calculate the generalized sample covariance matrix ${\hat{Σ}}_{F}^{*}$ and the hard thresholding estimator ${\hat{Σ}}_{F}^{t h}$ according to Equation (4) and Equation (7). Then compute the error between ${\hat{Σ}}_{F}^{*}$ and the true matrix $Σ_{X}$ , as well as the error between ${\hat{Σ}}_{F}^{t h}$ and the true matrix $Σ_{X}$ , under the given norm.

After determining the values of $n$ and $p$ , repeat the above three steps 1), 2), and 3) 50 times, and take the mean value of the fifty error results as the standard for evaluating the estimation effect of different estimators in this case. The performance is better when the outcome is smaller. Table 1 shows the experimental results.
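The three steps and the 50 repetitions described above can be sketched as follows. This is a hedged illustration, not the authors' simulation code: the function name `simulate_once`, the observation probability `obs_prob=0.8`, the Gaussian noise with standard deviation `noise_sd=0.1` (one particular sub-Gaussian law), the constant `gamma=2.0`, and the choice $n = 100$ , $p = 50$ are all assumptions, since the paper does not state these settings here.

```python
import numpy as np

def simulate_once(n, p, rng, obs_prob=0.8, noise_sd=0.1, gamma=2.0):
    # Step 1: sparse covariance matrix I + (B + B^T) / (||B + B^T||_1 + 0.01).
    B = rng.choice([-1.0, 0.0, 1.0], size=(p, p), p=[0.1, 0.8, 0.1])
    Bs = B + B.T
    Sigma = np.eye(p) + Bs / (np.abs(Bs).sum(axis=0).max() + 0.01)

    # Step 2: Gaussian samples, additive noise, then random missingness.
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    F = X + noise_sd * rng.normal(size=(n, p))          # assumed noise law
    S = (rng.random((n, p)) < obs_prob).astype(float)   # assumed MCR mask

    # Step 3: generalized sample covariance and hard thresholding.
    n_pair = S.T @ S                                    # joint counts n*_ij
    Fs = F * S
    mean = Fs.sum(axis=0) / np.diag(n_pair)
    R = (Fs - mean) * S
    cov = (R.T @ R) / n_pair
    lam = gamma * np.sqrt(np.log(p) / n_pair.min())
    cov_th = np.where(np.abs(cov) >= lam, cov, 0.0)

    def l1(M):
        # l1 norm: maximum absolute column sum.
        return np.abs(M).sum(axis=0).max()

    return l1(cov - Sigma), l1(cov_th - Sigma)

rng = np.random.default_rng(0)
errors = np.array([simulate_once(n=100, p=50, rng=rng) for _ in range(50)])
mean_err_sample, mean_err_threshold = errors.mean(axis=0)
```

The two averages play the role of one row of Table 1 for a single pair $(n, p)$ . The numbers obtained depend on the assumed settings and are not expected to reproduce the published values.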

The first two columns of Table 1 give the values of $n$ and $p$; the reported errors are averages over 50 runs of steps 1), 2), and 3) with $n$ and $p$ fixed. When the true covariance matrix $Σ_{X}$ is sparse, the hard thresholding estimator ${\hat{Σ}}_{F}^{t h}$ performs substantially better than the generalized sample covariance matrix ${\hat{Σ}}_{F}^{*}$ under every norm considered, especially in the high-dimensional case where $p$ is larger than $n$.

Therefore, when the dimension is very small relative to the sample size, the sample covariance matrix can be used to estimate the population covariance matrix. When estimating a high-dimensional sparse covariance matrix from data with sub-Gaussian additive noise and missing entries, the hard thresholding estimator ${\hat{Σ}}_{F}^{t h}$ given in Equation (7) is the better choice. These findings offer applied statisticians some guidance on selecting an estimation method.

Table 1. Results of estimating sparse covariance matrix.

5. Summary and Outlook

Covariance matrix estimation is crucial in statistics and other fields, and with the rapid growth of numerous technologies the estimation of high-dimensional covariance matrices has remained an active research topic.

Based on missing and noisy sample data, this paper constructs a hard thresholding estimator ${\hat{Σ}}_{F}^{t h}$ and studies its optimality. Section 3 shows that this estimator is rate-optimal. The numerical simulation in Section 4 demonstrates that it works well when the true covariance matrix is sparse. Its performance when the true covariance matrix is not sparse has not been examined.

This paper’s research has limitations and areas that require more investigation:

1) This paper focuses solely on the optimal estimation of sparse covariance matrices based on noisy and missing data. More research is needed on the optimal estimation of other common high-dimensional covariance matrices.

2) If the sub-Gaussian distribution used in this article is replaced with the broader class of sub-exponential distributions, the corresponding problems merit additional investigation.

Conflicts of Interest

The authors declare no conflicts of interest regarding the publication of this paper.

References

[1]	Bickel, P.J. and Levina, E. (2008) Covariance Regularization by Thresholding. The Annals of Statistics, 36, 2577-2604. https://doi.org/10.1214/08-AOS600
[2]	Cai, T.T. and Zhou, H.H. (2012) Optimal Rates of Convergence for Sparse Covariance Matrix Estimation. The Annals of Statistics, 40, 2389-2420. https://doi.org/10.1214/12-AOS998
[3]	Cai, T.T. and Zhou, H.H. (2012) Minimax Estimation of Large Covariance Matrices under L1-Norm. Statistica Sinica, 22, 1319-1349. https://doi.org/10.5705/ss.2010.253
[4]	Fan, J. and Li, R. (2001) Variable Selection via Nonconcave Penalized Likelihood and Its Oracle Properties. Journal of the American Statistical Association, 96, 1348-1360. https://doi.org/10.1198/016214501753382273
[5]	Zou, H. (2006) The Adaptive Lasso and Its Oracle Properties. Journal of the American Statistical Association, 101, 1418-1429. https://doi.org/10.1198/016214506000000735
[6]	Rothman, A.J., Levina, E. and Zhu, J. (2009) Generalized Thresholding of Large Covariance Matrices. Journal of the American Statistical Association, 104, 177-186. https://doi.org/10.1198/jasa.2009.0101
[7]	Cai, T.T. and Liu, W. (2011) Adaptive Thresholding for Sparse Covariance Matrix Estimation. Journal of the American Statistical Association, 106, 672-684. https://doi.org/10.1198/jasa.2011.tm10560
[8]	Cai, T.T., Liu, W. and Zhou, H.H. (2016) Estimating Sparse Precision Matrix: Optimal Rates of Convergence and Adaptive Estimation. The Annals of Statistics, 44, 455-488. https://doi.org/10.1214/13-AOS1171
[9]	Bickel, P.J. and Levina, E. (2008) Regularized Estimation of Large Covariance Matrices. The Annals of Statistics, 36, 199-227. https://doi.org/10.1214/009053607000000758
[10]	Cai, T.T., Zhang, C.H. and Zhou, H.H. (2010) Optimal Rates of Convergence for Covariance Matrix Estimation. The Annals of Statistics, 38, 2118-2144. https://doi.org/10.1214/09-AOS752
[11]	Cai, T.T. and Zhang, A. (2016) Minimax Rate-Optimal Estimation of High-Dimensional Covariance Matrices with Incomplete Data. Journal of Multivariate Analysis, 150, 55-74. https://doi.org/10.1016/j.jmva.2016.05.002
[12]	Qi, X. (2022) Low Rank Matrix Perturbation Analysis and Estimation for Two Classes of Sparse Covariance Matrices. Ph.D. Thesis, Beijing University, Beijing.
[13]	Shi, W. (2022) Optimal Estimation of Bandable Covariance Matrices Based on Noised Censored Data. MSc. Thesis, Beijing University, Beijing.
[14]	Tsybakov, A.B. (2009) Introduction to Nonparametric Estimation. Springer-Verlag, New York. https://doi.org/10.1007/b13794
