Share This Article:

Quantile Regression Based on Laplacian Manifold Regularizer with the Data Sparsity in l1 Spaces

Full-Text HTML XML Download Download as PDF (Size:434KB) PP. 786-802
DOI: 10.4236/ojs.2017.75056    67 Downloads   158 Views  

ABSTRACT

In this paper, we consider the regularized learning schemes based on l1-regularizer and pinball loss in a data dependent hypothesis space. The target is the error analysis for the quantile regression learning. There is no regularized condition with the kernel function, excepting continuity and boundness. The graph-based semi-supervised algorithm leads to an extra error term called manifold error. Part of new error bounds and convergence rates are exactly derived with the techniques consisting of l1-empirical covering number and boundness decomposition.

1. Introduction

The classical least-squares regression models have focused mainly on estimating conditional mean functions. In contrast, quantile regression can provide richer information about the conditional distribution of response variables such as stretching or compressing tails, so it is particularly useful in applications when both lower and upper or all quantiles are of interest. Over the last years, quantile regression has become a popular statistical method in various research fields, such as reference charts in medicine [1] , survival analysis [2] and economics [3] .

In addition, relative to the least-squares regression, quantile regression estimates are more robust against outliers in the response measurements. We introduce a framework for data-dependent regularization that exploits the geometry of the marginal distribution. The labeled and unlabeled data learnt from the problem constructs a framework and incorporates the framework as an additional regularization term. The framework exploits the geometry of the probability distribution that generates the data. Hence, there are two regularization terms: one controlling the complexity of the classifier in the ambient space and the other controlling the complexity as measured by geometry of the distribution in the intrinsic space.

2. The Model

In this paper, under the framework of learning theory, we study l1-regularized and manifold regularized quantile regression. Let X be a compact subset of l + u and Y . There is a probability distribution ρ on X × Y according to which examples are generated for function learning. Labeled examples are ( x , y ) pairs generated according to ρ . Unlabeled examples are simply x X drawn according to the marginal distribution ρ X of ρ . We will make a specific assumption that an identifiable relation between ρ X and the conditional ρ ( y | x ) . The conditional t-quantile is a set-valued function defined by

ρ ( { y ( , u ] } | x ) 1 τ and ρ ( { y ( u , ] } | x ) τ (2.1)

where τ ( 0,1 ) , x X and u Y .

The empirical method for estimating the conditional t-quantile function is based on the t-pinball loss

ρ τ ( r ) = { τ r , r > 0 ( τ 1 ) r , otherwise (2.2)

Then, denote the generalization error to minimize the conditional t-quantile function f ρ with the loss function ρ τ

ε τ = X × Y ρ τ ( y f ( x ) ) d ρ (2.3)

where f : X . Based on observations, the empirical risk of the function f is

ε z τ = 1 l i = 1 l ρ τ ( y i f ( x i ) ) (2.4)

Next, We assume that | f ρ τ | 1 , a . e . , x X with respect to ρ X .

In kernel-based learning, this minimization process usually takes place in a hypothesis space, Reproducing Kernel Hilbert Space (RKHS) [4] [5] H k generated by a kernel function K : X × X . In the empirical case, a graph-based regular quantile regression problem can be typically formulated in terms of the following optimization

f ^ z , γ = min f H K { i = 1 l ρ τ ( y i f ( x i ) ) + γ A f K + γ I f I } . (2.5)

By the representers theorem, the solution to (2.5) can be written as

f ^ z , γ ( x ) = i = 1 l + u α i K ( x , x i ) (2.6)

The l1-norm penalty not only shrinks the fitted coefficients toward zero but also causes some of the coefficients to be exactly zero when making γ A sufficiently large. When the data lies on a low-dimensional manifold, the graph- based method seems more effective for semi-supervised learning and many approaches have been proposed for instance Transductive SVM [6] , Measure- based Regularization [7] and so on. Then the l1-regularized and manifold regularized quantile regression are as following

f ^ z , γ ( x ) = min f H K { ε z τ ( f ) + γ A Ω ( f ) + γ I ( l + u ) 2 i , j = 1 l + u ( f ( x i ) f ( x j ) ) 2 ω i j } = min f H K { ε z τ ( f ) + γ A Ω ( f ) + γ I ( l + u ) 2 f T L f } (2.7)

where Ω ( f ) = i = 1 l + u | α i | , f = ( f ( x 1 ) , , f ( x l + u ) ) T . γ A , γ I are nonnegative regularization parameters. L = D W is the unnormalized graph Laplacian, where D is a diagonal matrix with diagonal entries D i i = j = 1 l + u ω i j . The weight ω i j is given by a similar function ω ( x i , x j ) . The more similar x i and x j , the larger ω i j should be.

3. The Restriction

Definition 3.1. The projection operator on the space of function is defined by

π ( f ) ( x ) = { 1 , if f ( x ) > 1 f ( x ) , if 1 f ( x ) 1 1 , if f ( x ) < 1 (3.1)

Hence, it is natural to measure the approximation ability by the distance π ( f ^ z , γ ) f ρ τ L ρ X p with f ρ τ 1 .

Definition 3.2. Let p ( 0, ] and q ( 1, ) . We say that ρ has a t-quantile of p-average type q if for almost all x X with respect to ρ X , there exist a τ quantile t and constants a x ( 0,2 ] , b x > 0 such that for each s [ 0, a x ] ,

ρ ( y ( t s , t ) | x ) b x s q 1 , (3.2)

ρ ( y ( t s , t ) | x ) b x s q 1 , (3.3)

and that the function ϕ : X [ 0, ] , ϕ ( x ) = b x a x q 1 satisfies ϕ 1 L ρ X p .

For p ( 0, ] and q ( 1, ) , denote

θ = min { 2 q , p p + 1 } ( 0 , 1 ] (3.4)

Lemma 3.1. If ρ has a t-quantile of p-average type q for some p ( 0, ] and q ( 1, ) , then for any measurable function f : X [ 1,1 ] , there holds

f f ρ τ L ρ X p C q , ρ { ε τ ( f ) ε τ ( f ρ τ ) } 1 q (3.5)

where C q , ρ = 2 1 1 q q 1 q [ ( b x a x q 1 ) 1 ] x X L ρ X p 1 q and p = p q p + 1

Definition 3.3. We say that the kernel function K is a C c kernel with c > 0 if there exists some constants C c > 0 , such that

| K ( t , x ) K ( t , x ) | C c | x x | c , t , x , x X (3.6)

We assume throughout the paper that K C c ( X × X ) and denote κ = sup t , x X | K ( x , t ) | < . Our approximation condition is given as

f ρ τ = L K ˜ s g ρ τ , for some 0 < s < 1 , g ρ τ L ρ X 2 ( X ) (3.7)

here, the kernel K ˜ is defined by

K ˜ ( x , y ) = X K ( x , t ) K ( y , t ) d ρ X ( t ) (3.8)

Hence, although kernel K in not positive semi-definite, K ˜ is a Mercer kernel, H K ˜ denotes the associated reproducing kernel Hilbert spaces. The kernel K ˜ defines an integral operator L K ˜ : L ρ X 2 L ρ X 2 by

L K ˜ f ( x ) = X K ˜ ( x , x ) f ( x ) d ρ X ( x ) , x X . (3.9)

Note that L K ˜ = L K L K is a self-adjoint positive operator on L ρ X 2 .Hence its s-th power L K ˜ s is well defined for any s > 0 . we take the RKHS H K ^ with

K ^ ( X , Y ) = X K ˜ ( x , t ) K ˜ ( y , t ) d ρ X ( t ) (3.10)

It is easy to see L K ^ = L K ˜ 2 , so that any function f H K ^ can be expressed as L K ˜ g for some g L ρ X 2 .

Definition 3.4. Define a Banach space H 1 = { f : f = j = 1 α j K ^ ( x , x j ) , { α j } l 1 , { x j } X } with the norm

f = inf { f = j = 1 | α j | : f = j = 1 α j K ^ ( x , x j ) , { α j } l 1 , { x j } X } (3.11)

Definition 3.5. For every η > 0 , the l2-empirical covering number of F is

N 2 ( F , η ) = min k min x X k inf { l : { f i } i = 1 l such that for all f F , there is min 1 i l d 2, x ( f , f i ) η } (3.12)

Lemma 3.2. There exist an exponent μ ( 0,2 ) and a constant c μ , K such that

l o g N 2 ( B r , η ) c μ , K η μ , η > 0 (3.13)

suppose that K C c ( X × X )

μ = { 2 n / ( n + 2 c ) , when 0 < c < 1 2 n / ( n + 2 ) , when 0 < c < 1 n / c , when c > 1 + n / 2 (3.14)

Define an operator L ω on L ρ X 2 as

L ω f ( x ) = f ( x ) p ( x ) X K ( x , x ) f ( x ) d ρ X ( x ) (3.15)

with p ( x ) = X K ( x , x ) d ρ ( x ) . The above equation tells us that

f , L ω f 2 = ( f ( x ) f ( x ) ) 2 d ρ X ( x ) d ρ X ( x ) (3.16)

Next, The performance of H K ^ approaching f ρ τ can be described through the regularizing function f γ defined as

f γ = arg min f H K ^ { ε τ ( f ) ε ( f ρ τ ) + γ A f K ^ + γ I f , L ω f } (3.17)

the above function f γ given by (3.13 ) can be expressed as

f γ = L K ˜ h γ = L K g γ (3.18)

where g γ = L K h γ . Moreover, g γ is a continuous function on X and

g γ L ρ X 2 = f γ K ˜ , f γ K ˜ κ f γ K ^ = κ h γ L ρ X 2 . (3.19)

Definition 3.6. Let F be a set of function on X , x= ( x i ) i = 1 k X k . The l 2 metric between function on X is

d 2, x ( f , g ) = { 1 k i = 1 k ( f ( x i ) g ( x i ) ) 2 } 1 2 , f , g F (3.20)

Denote the ball of radius r 1 as B r = { f H K ^ : f K ^ r } .

4. Error Analysis

4.1. Error Decomposition

Proposition 4.1. Let γ = ( γ A , γ I ) , γ A > 0 , γ I > 0 and f z , γ = i = 1 l + u α i K ( x , x i ) given by (2.6). Then

ε τ ( π ( f z , γ ) ) ε τ ( f ρ τ ) + γ A Ω ( f z , γ ) + γ I ( l + u ) 2 f z , γ T L f z , γ S 1 ( z , γ ) + S 2 ( z , γ ) + H 1 ( z , γ ) + H 2 ( z , γ ) + ( 1 + κ ) D ( γ ) (4.1.1)

where

S 1 ( z , γ ) = { ε τ ( π ( f z , γ ) ) ε τ ( f ρ τ ) } { ε z τ ( π ( f z , γ ) ) ε z τ ( f ρ τ ) } (4.1.2)

S 2 ( z , γ ) = { ε z τ ( f ^ z , γ ) ε z τ ( f ρ τ ) } { ε τ ( f ^ z , γ ) ε τ ( f ρ τ ) } (4.1.3)

H 1 ( z , γ ) = γ A Ω ( f ^ z , γ ) + γ I ( l + u ) 2 f ^ z , γ T L f ^ z , γ γ A g γ L ρ X 1 γ I ( l + u ) 2 f γ T L f γ (4.1.4)

H 2 ( z , γ ) = ε τ ( f ^ z , γ ) ε τ ( f γ ) (4.1.5)

D ( γ ) = ε τ ( f γ ) ε τ ( f ρ τ ) + γ A f γ K + γ I f γ , L ω f γ 2 (4.1.6)

M ( z , γ ) = γ I ( l + u ) 2 f γ T L f γ γ I f γ , L ω f γ 2 (4.1.7)

Proof. A direct decomposition shows that

ε τ ( π ( f z , γ ) ) ε τ ( f ρ τ ) + γ A Ω ( f z , γ ) + γ I ( l + u ) 2 f z , γ T L f z , γ = { ε τ ( π ( f z , γ ) ) ε τ ( f ρ τ ) } { ε z τ ( π ( f z , γ ) ) ε z τ ( f ρ τ ) } + { ε τ ( π ( f z , γ ) ) + γ A Ω ( f z , γ ) + γ I ( l + u ) 2 f z , γ T L f z , γ } { ε z τ ( f ^ z , γ ) + γ A Ω ( f ^ z , γ ) + γ I ( l + u ) 2 f ^ z , γ T L f ^ z , γ } + { ε z τ ( f ^ z , γ ) ε z τ ( f ρ τ ) } { ε τ ( f ^ z , γ ) ε τ ( f ρ τ ) } + ε τ ( f ^ z , γ ) ε τ ( f γ ) + ε τ ( f γ ) ε τ ( f ρ τ ) + γ A f γ K + γ I f γ , L ω f γ 2 + γ A Ω ( f ^ z , γ ) + γ I ( l + u ) 2 f ^ z , γ T L f ^ z , γ γ A g γ L ρ X 1 γ I ( l + u ) 2 f γ T L f γ + γ A g γ L ρ X 1 γ A g γ L ρ X 2 + γ I ( l + u ) 2 f γ T L f γ γ I f γ , L ω f γ 2 (4.1.8)

The fact | y | < 1 implies that ε z τ ( π ( f z , γ ) ) ε z τ ( f z , γ ) . Hence, the second item in the right-hand side of the above equation is at most 0 by the reason that f ^ z , γ H K is the minimizer of (2.7). Duo to g γ L ρ X 1 g γ L ρ X 2 , we see that the last but one item is at most 0. The fifth item is less than by the ( 1 + κ ) D ( γ ) fact that g γ L ρ X 2 = f γ K ˜ κ f γ K . Thus we complete the proof.,

4.2. Estimation of the Regularization Error

Proposition 4.2. Assume (3.7) holds, denoting γ I = γ A 2 s for some 0 < s 1 then we have

D ( γ ) C 0 γ A s (4.2.1)

where C 0 = 2 g ρ τ L ρ X 2 + 6 ω g ρ τ L ρ X 2 2 .

Proof. Denote f γ A = arg min f H K { ε τ ( f ) + γ A f K } . By proposition 2 in [8] , we get the following relationships ε τ ( f γ A ) ε τ ( f ρ τ ) + γ A f γ A K 2 g ρ τ L ρ X 2 γ A s and f γ A K γ A s 1 g ρ τ L ρ X 2 . Connected with the definition of D ( γ ) , we have

D ( γ ) ε τ ( f γ A ) ε τ ( f ρ τ ) + γ A f γ K + γ I f γ , L ω f γ 2 2 g ρ τ L ρ X 2 γ A s + γ I ( f γ A ( x ) f γ A ( x ) ) 2 ω ( x , x ) d ρ X ( x ) d ρ X ( x ) 2 g ρ τ L ρ X 2 γ A s + γ A 2 s 2 f γ A K 2 ω d ρ X ( x ) d ρ X ( x ) 2 g ρ τ L ρ X 2 γ A s + 6 ω γ A 2 ( s 1 ) g ρ τ L ρ X 2

where γ I = γ A 2 s . we derive the desired bound.,

4.3. Estimation of the Manifold Error

In this subsection, we estimation the manifold error. Denote

A z , γ = γ I l + u j = 1 l + u ( ( 1 l + u i = 1 l + u ξ 1 ( x i ) ) ( x j ) ( E ξ 1 ) ( x j ) ) (4.3.1)

B z , γ = γ I f γ 2 ( x ) ( ( 1 l + u i = 1 l + u ξ 2 ( x j ) ) ( x ) ( E ξ 2 ) ( x ) ) d ρ X ( x ) (4.3.2)

C z , γ = γ I l + u j = 1 l + u f γ ( x j ) ( ( E ξ 3 ) ( x j ) ( 1 l + u i = 1 l + u ξ 3 ( x i ) ( x j ) ) ) (4.3.3)

D z , γ = γ I f γ ( x ) ( ( E ξ 3 ) ( x ) ( 1 l + u i = 1 l + u ξ 3 ( x j ) ) ( x ) ) d ρ X ( x ) (4.3.4)

where ξ 1 ( x ) = f γ 2 ( x ) ω ( x , ) , ξ 2 ( x ) = ω ( , x ) , ξ 3 ( x ) = f γ ( x ) ω ( , x ) . So we can see that M ( z , γ ) = 2 ( A z , γ + B z , γ + C z , γ + D z , γ )

Lemma 4.1. Let ξ be a random variable on a probability space X with σ 2 = E ξ 2 satisfying ξ M ξ for some constant M ξ . Then for any 0 < δ < 1 , we have

1 l i = 1 l ξ ( z i ) E ξ 2 M ξ log ( 1 / δ ) l + 2 σ 2 log ( 1 / δ ) l (4.3.5)

Proposition 4.3. under the approximation condition (3.13), let 0 < γ A 1 and γ I = γ A 2 s for some 0 < s 1 . then for any 0 < δ < 1 with the confidence 1 δ , there holds

M ( z , γ ) 4 ω κ 4 C 0 2 2 log ( 4 / δ ) γ A s ( l + u ) 1 / 2 (4.3.6)

Proof. By the definition of ξ 1 ( x ) , we have | ξ 1 ( x ) | ω | f | γ 2 = ω | L K g γ | 2 κ g γ L ρ X 2 2 . Since g γ L ρ X 2 2 κ f γ K ^ κ C 0 γ A s 1 , there holds | ξ 1 ( x ) | κ 4 C 0 2 γ A 2 ( s 1 ) , | ξ 2 ( x ) | ω , | ξ 3 ( x ) | ω κ 2 C 0 γ A s 1 . Applying lemma 4.1, with confidence 1 4 / δ ,

A z , γ γ I ω κ 4 C 0 2 γ A 2 ( s 1 ) ( 2 l o g ( 2 / δ ) l + u + 2 l o g ( 2 / δ ) l + u ) (4.3.7)

B z , γ γ I ω κ 4 C 0 2 γ A 2 ( s 1 ) ( 2 l o g ( 2 / δ ) l + u + 2 l o g ( 2 / δ ) l + u ) (4.3.8)

C z , γ γ I ω κ 4 C 0 2 γ A 2 ( s 1 ) ( 2 l o g ( 2 / δ ) l + u + 2 l o g ( 2 / δ ) l + u ) (4.3.9)

D z , γ γ I ω κ 4 C 0 2 γ A 2 ( s 1 ) ( 2 l o g ( 2 / δ ) l + u + 2 l o g ( 2 / δ ) l + u ) (4.3.10)

Then we find the manifold error bound holds true.,

4.4. Estimation of the Hypothesis Error

This subsection is devoted to estimate the hypothesis errors. Under the assumption that the sample is i.i.d. drawn from ρ and | y | 1 a.e., we estimate H 1 , H 2 as following.

Proposition 4.4. For any 0 < δ < 1 ,with confidence 1 δ , we have

H 1 κ D ( γ ) l + u { 2 log ( 4 / δ ) l + u + 2 log ( 4 / δ ) } + 8 κ 4 2 log ( 4 / δ ) D ( γ ) 2 γ A s l + u (4.4.1)

H 2 κ 2 D ( γ ) γ A l + u { 2 log ( 4 / δ ) l + u + 2 log ( 4 / δ ) } (4.4.2)

Proof. We estimate H 1 . Recall f ^ z , γ = 1 l + u i = 1 l + u g γ ( x i ) K x i , then Ω ( f ^ z , γ ) = 1 l + u i = 1 l + u | g γ ( x i ) | .

Applying Lemma 4.1 to the random variable ξ = | g γ ( x ) | on ( X , ρ X ) with value in . There is

ξ = | g γ ( x ) | κ D ( γ ) / γ A (4.4.3)

E ξ = g γ L ρ X 1 and σ 2 ( ξ ) κ 2 D ( γ ) 2 / γ A 2 (4.4.4)

with confidence 1 δ / 2 , there holds

Ω ( f ^ z , γ ) g γ L ρ X 1 2 κ D ( γ ) l o g ( 4 / δ ) γ A ( l + u ) + 2 κ 2 D ( γ ) 2 l o g ( 4 / δ ) γ A 2 ( l + u ) (4.4.5)

since

( f z , γ ( x i ) f z , γ ( x j ) 2 ) 2 ( f γ ( x i ) f γ ( x j ) ) 2 = [ 1 l + u t = 1 l + u ( g γ ( x t ) K ( x i , x t ) g γ ( x t ) K ( x j , x t ) ) X ( g γ ( x ) K ( x i , x ) g γ ( x ) K ( x j , x ) ) d ρ X ( x ) ] × [ 1 l + u t = 1 l + u ( g γ ( x t ) K ( x i , x t ) g γ ( x t ) K ( x j , x t ) ) + X ( g γ ( x ) K ( x i , x ) g γ ( x ) K ( x j , x ) ) d ρ X ( x ) ] 2 κ 2 D ( γ ) γ A 2 log ( 2 / δ ) l + u 4 κ 2 D ( γ ) γ A 8 κ 2 2 log ( 2 / δ ) D ( γ ) 2 γ A 2 ( l + u ) 1 / 2 (4.4.6)

Then the bound of the following is derived with γ I = γ A 2 s

γ I ( l + u ) 2 f ^ z , γ T L f ^ z , γ γ I ( l + u ) 2 f γ T L f γ 8 κ 2 2 l o g ( 4 / δ ) D ( γ ) 2 γ A s ( l + u ) 1 / 2 (4.4.7)

Finally, we have

H 1 κ D ( γ ) l + u { 2 l o g ( 4 / δ ) l + u + 2 l o g ( 4 / δ ) } + 8 κ 4 2 l o g ( 4 / δ ) D ( γ ) 2 γ A s l + u (4.4.8)

The H 2 has been proved in [8] .,

4.5. Estimation of the Sample Error

Since f ^ z , γ is a function valued random variable which depends on the sample error in the data independent space H which contains all possible hypothesis spaces H K , z . Our estimations for H 1 , H 2 are based on the following concentration inequality see [8] .

Lemma 4.2. Let F be a class of measurable function on Z . Assume that there are constants B , c > 0 and β [ 0,1 ] such that f B and E f 2 c ( E f ) β for every f F . If for some a > 0 and μ ( 0,2 ) ,

l o g N 2 ( F , ζ ) a ζ μ , ζ > 0 (4.5.1)

then there exists a constant c μ depending only on μ such that for any 0 < δ < 1 , with confidence 1 δ , there holds

E f 1 l i = 1 l f ( z i ) 1 2 w 1 β ( E f ) β + c μ w + 2 ( c l o g ( 1 / δ ) l ) 1 / ( 2 β ) + 18 B l o g ( 1 / δ ) l , f F (4.5.2)

where w = max { c 2 μ 4 2 β ( a l ) 2 4 2 β + μ β , B 2 μ 2 + μ ( a l ) 2 2 + μ } .The same bound also holds true for 1 l i = 1 l f ( z i ) E f .

The following proposition which has been proved in [9] will be utilized to bound S 1 .

Proposition 4.5. suppose that ρ has a t-quantile of p-average type q for some p ( 0, ] , q ( 1, ) . Let r 1 and 0 < γ A 1 . Assume B 1 satisfies the capacity assumption (3.12) with some 0 < μ < 2 . Then, for any 0 < δ < 1 , with confidence 1 δ , there holds, for all f B r ,

{ ε τ ( π ( f ) ) ε τ ( f ρ τ ) } { ε z τ ( π ( f ) ) ε z τ ( f ρ τ ) } 1 2 C 1 1 θ r 2 μ ( 1 θ ) 2 + μ l 2 ( 1 θ ) 4 2 θ + μ θ { ε τ ( π ( f ) ) ε τ ( f ρ τ ) } θ + ( 36 + 2 C θ 1 2 θ ) l o g ( 1 / δ ) l 1 2 θ + C 2 r 2 μ 2 + μ l 2 4 2 θ + μ θ (4.5.3)

Here C 1 and C 2 are the constants depending on μ , θ , c μ , K and C θ .

The following proposition which has been proved in [9] will be utilized to bound S 2 .

Proposition 4.6. Under the assumptions of proposition 4.5. Then, for any 0 < δ < 0 ,with confidence 1 δ , there holds,

S 2 C 3 ( 1 + 1 l l o g 5 δ ) l o g ( 10 δ ) × l 2 ( 1 θ ) 4 2 θ + μ θ × ( γ A s 1 ( l + u ) 1 2 + γ A s 1 + θ ( l + u ) 1 θ 2 ) (4.5.4)

here C 3 is a constant independent of l , γ A , δ .

Proof. We consider the following function set with r 1 to bound S 2

G r = { ρ τ ( f ( x ) y ) ρ τ ( f ρ t a u ( x ) y ) : f B r } (4.5.5)

since | f ρ τ | 1 and f κ r , for any g G r , we have

| g ( z ) | | f ( x ) f ρ τ ( x ) | f + 1 κ r + 1 (4.5.6)

By Lemma 3.1, the variance-expectation condition of g ( z ) is satisfied with θ given by (3.4) and c = C θ , β = θ . Then we get

l o g N 2 ( G r , η ) c μ , K r μ η μ . (4.5.7)

Applying lemma 4.2 to G r , then for any δ ( 0,1 ) , with confidence 1 δ , there holds that, for any f B r ,

{ ε z τ ( f ) ε z τ ( f ρ τ ) } { ε τ ( f ) ε τ ( f ρ τ ) } 1 2 υ 1 θ { ε τ ( f ) ε τ ( f ρ τ ) } θ + 2 ( C θ log ( 1 / δ ) l ) 1 2 θ + 18 ( κ r + 1 ) log ( 1 / δ ) l + c μ υ . (4.5.8)

where υ = C ˜ r l 2 4 2 θ + μ θ and C ˜ = C θ 2 μ 4 2 θ + μ θ c μ , K 2 4 2 θ + μ θ + ( κ + 1 ) 2 μ 2 + μ c μ , K 2 2 + μ . From the processing of estimating H 1 , for any δ ( 0,1 ) , with confidence 1 2 δ / 5 , we have

1 l + u i = 1 l + u | g γ ( x i ) | g γ L ρ X 1 κ D ( γ ) γ A l + u ( 2 log ( 5 / δ ) l + u + 2 log ( 5 / δ ) ) (4.5.9)

which implies there exists a subset V 1 of X l + u with measure at most 2 δ / 5 such that

1 l + u i = 1 l + u g γ ( x i ) max { κ D ( γ ) γ A l + u ( 2 log ( 5 / δ ) l + u + 2 log ( 5 / δ ) ) , 1 } r γ , z X l + u \ V 1 (4.5.10)

The above inequality guarantees f ^ z , γ B r with for every x X l + u \ V 1 . By Lemma 4.2 and (4.5.8), there existing V r γ with measure at most δ / 5 such that for every x X l + u \ ( V 1 V r γ ) , we have f ^ z , γ B r γ and

{ ε z τ ( f ^ z , γ ) ε z τ ( f ρ τ ) } { ε τ ( f ^ z , γ ) ε τ ( f ρ τ ) } 1 2 C ˜ 1 θ r γ 1 θ l 2 ( 1 θ ) 4 2 θ + μ θ { ε τ ( f ^ z , γ ) ε τ ( f ρ t a u ) } θ + 18 ( κ + 1 ) r γ l 1 log 5 δ + c μ C ˜ r γ l 2 4 2 θ + μ θ + 2 ( C θ log ( 5 / δ ) l ) 1 2 θ 1 2 C ˜ 1 θ r γ 1 θ l 2 ( 1 θ ) 4 2 θ + μ θ | ε τ ( f ^ z , γ ) ε τ ( f γ ) | θ + 1 2 C ˜ 1 θ r γ 1 θ l 2 ( 1 θ ) 4 2 θ + μ θ { ε τ ( f γ ) ε τ ( f ρ τ ) } θ + 18 ( κ + 1 ) r γ l 1 log 5 δ + c μ C ˜ r γ l 2 4 2 θ + μ θ + 2 ( C θ log ( 5 / δ ) l ) 1 2 θ (4.5.11)

Proposition 4.2 implies that ε τ ( f γ ) ε τ ( f ρ τ ) D ( γ ) C 0 γ A s and

r γ ( κ + 1 ) γ A s 1 l + u ( 2 log ( 5 / δ ) l + u + 2 log ( 5 / δ ) + 1 ) (4.5.12)

The Proposition 4.4 tells that there exists a subset V 2 of X l + u with measure of at most 2 δ / 5 such that for every x X l + u \ V 2 ,

ε τ ( f ^ z , γ ) ε τ ( f γ ) κ 2 D ( γ ) γ A l + u { 2 l o g ( 5 / δ ) l + u + 2 l o g ( 5 / δ ) } (4.5.13)

Let V = V 1 V 2 V r γ . Obviously, the measure of V is at most δ and for every x X l + u \ V , the above inequalities hold. Finally, we combines (4.5.11), (4.5.12), (4.5.13), the result is completed.,

5. Total Error Bound

Proposition 5.1. suppose that ρ has a t-quantile of p-average type q for some p ( 0, + ] and q ( 1, ) , and that Approximation condition (3.7) and Capacity condition (3.12) hold. Let 0 < γ A 1, r 1 and 0 < δ < 1 . Then, there exists a subset U r of X l + u with measure at most δ such that for any x W ( r ) / U r , we have

ε τ ( π ( f z , γ ) ) ε τ ( f ρ τ ) + γ A Ω ( f z , γ ) + γ I ( l + u ) 2 f z , γ T L f z , γ C ^ r 2 μ 2 + μ l 2 4 + μ θ 2 θ + C 4 ( 2 log ( 10 / δ ) l + u + 2 log ( 10 / δ ) + 1 ) Ψ ( l , u , γ ) (5.1)

Here C ^ , C 4 are constants independent of l , u , γ A , δ and Ψ ( l , u , γ ) = γ A s + γ A s ( l + u ) 1 / 2 + γ A s 1 ( l + u ) 1 / 2 + γ A s 1 ( l + u ) 1 / 2 l 2 ( 1 θ ) 4 2 θ + μ θ + γ A s 1 + θ ( l + u ) ( 1 θ ) / 2 l 2 ( 1 θ ) 4 2 θ + μ θ

Proof. Proposition 4.4 ensures the existence U 1 of X l + u with measure at most 2 δ / 5 such that

H 1 κ D ( γ ) l + u { 2 l o g ( 10 / δ ) l + u + 2 l o g ( 10 / δ ) } + 8 κ 4 2 l o g ( 10 / δ ) D ( γ ) 2 γ A s l + u (5.2)

H 2 κ 2 D ( γ ) γ A l + u ( 2 l o g ( 10 / δ ) l + u + 2 l o g ( 10 / δ ) ) (5.3)

hold for any x X l + u / U 1 .

Proposition 4.5 tell us that there exists a subset V r of X l + u with measure at most δ / 10 , such that

S 1 1 2 C 1 1 θ r 2 μ ( 1 θ ) 2 + μ l 2 ( 1 θ ) 4 2 θ + μ θ { ε τ ( π ( f z , γ ) ) ε τ ( f ρ τ ) } θ + ( 36 + 2 C θ 1 2 θ ) l o g ( 10 δ ) l 1 2 θ + C 2 r 2 μ 2 + μ l 2 4 2 θ + μ θ (5.4)

Proposition 4.6 ensures the existence of a subset U 2 of X l + u with measure at most δ / 2 such that

S 2 C 3 ( 2 l o g ( 10 / δ ) l + u + 2 l o g ( 10 / δ ) + 1 ) × l 2 ( 1 θ ) 4 2 θ + μ θ × ( γ A s 1 ( l + u ) 1 2 + γ A s 1 + θ ( l + u ) 1 θ 2 ) , x X l + u / U 2 (5.5)

Proposition 4.3 ensures that there exists a subset U 3 of X l + u with measure almost 10 / δ such that

M ( z , γ ) 4 ω κ 4 C 0 2 2 l o g ( 10 / δ ) γ A s ( l + u ) 1 / 2 (5.6)

Takeing U r = U 1 U 2 U 3 V r , the measure of U r is at most δ , combining (5.2)-(5.6) and (4.2.1), then for every x W ( r ) / U r we get

ε τ ( π ( f z , γ ) ) ε τ ( f ρ τ ) + γ A Ω ( f z , γ ) + γ I ( l + u ) 2 f z , γ T L f z , γ + 1 2 C 4 ( 2 log ( 10 / δ ) l + u + 2 log ( 10 / δ ) + 1 ) Ψ ( l , u , γ ) + 1 2 C 1 1 θ r 2 μ ( 1 θ ) 2 + μ l 2 ( 1 θ ) 4 + μ θ 2 θ { ε τ ( π ( f z , γ ) ) ε τ ( f ρ τ ) } θ + C 2 r 2 μ 2 + μ l 2 4 + μ θ 2 θ (5.7)

Here C 1 , C 2 , C 3 , C 4 , C 5 is a constant independent of l , u , γ A , δ , and Ψ ( l , u , γ ) are as above.

Next, let t = ε τ ( π ( f z , γ ) ) ε τ ( f ρ τ ) + γ A Ω ( f z , γ ) + γ I ( l + u ) 2 f z , γ T L f z , γ . Hence, the inequality (5.6) can be expressed as

t 1 2 C 1 1 θ r 2 μ ( 1 θ ) 2 + μ l 2 ( 1 θ ) 4 + μ θ 2 θ t θ Π 0 , (5.8)

where Π is the rest terms. From Lemma 7.2 in [learning theory: an approximation theory viewpoint], the (5.8) has a unique positive solution t which can be bounds as

t max { C 1 r 2 μ 2 + μ l 2 4 + μ θ 2 θ , 2 Π } C 1 r 2 μ 2 + μ l 2 4 + μ θ 2 θ + 2 Π , (5.9)

then the result is derived.,

6. Convergence Radius and Main Result

Proposition 6.1. Under the assumptions in proposition 5.1, we take ω 0 = 2 4 2 θ + μ θ , γ A = l β with β > 0 . Then, for any 0 < δ < 1 , with confidence 1 δ , there holds

f z , γ ( ( 1 + C ^ ) 2 + μ 2 μ + ( N 0 + 1 ) C ^ 4 × ( 2 l o g ( 10 / δ ) l + u + 2 l o g ( 10 / δ ) + 1 ) ) × l β ( 1 s ) (6.1)

Proof. Applying γ A = l β with β > 0 and letting Δ = 2 μ 2 + μ to proposition 5.1, then for any r 1 , there exists a subset V r of X l + u with measure at most δ such that

f z , γ a m r Δ + b m , x W ( r ) / V r (6.2)

where the constants are given

a m = C ^ l β ω 0 b m = C ^ 4 ( 2 log ( 10 / δ ) l + u + 2 log ( 10 / δ ) + 1 ) Ψ ( l , u , γ ) / γ A b δ Ψ ( l , u , γ ) / γ A (6.3)

where C ^ , C ^ 5 is a constant independent of l , u , γ A , δ .

It follows that

W ( r ) W ( a m r Δ + b m ) V r . (6.4)

Then, we define a sequence { r ( n ) } n = 1 N by r ( 0 ) = γ A 1 and, for n 1

r ( n ) = a m ( r ( n 1 ) ) Δ + b m , n . (6.5)

Duo to f z , γ 1 / γ A ensures W ( r ( 0 ) ) = X l + u , we have

X l + u = W ( r ( 0 ) ) W ( r ( 1 ) ) V r ( 0 ) W ( r ( N ) ) ( n = 0 N 1 V r ( n ) ) . (6.6)

with ρ ( n = 0 N 1 V r ( n ) ) N δ . Hence the measure of is at least. By the iteration formula (6.5), we have

r ( N ) a m 1 + Δ + Δ 2 + + Δ N 1 ( r ( 0 ) ) Δ N + n = 1 N 1 a m 1 + Δ + Δ 2 + + Δ n 1 b m Δ n + b m = a m 1 Δ N 1 Δ γ A Δ N + n = 1 N 1 a m 1 Δ n 1 Δ b m Δ n + b m ( 1 + C ^ ) 1 1 Δ l ( β ω 0 ) 1 1 Δ + Δ N ( β 1 1 Δ ( β ω 0 ) ) + a m 1 1 Δ n = 1 N 1 ( a m 1 1 Δ b m ) Δ n + b m ( 1 + C ^ ) 2 + μ 2 μ l ( β ω 0 ) 2 + μ 2 μ + Δ N ( 2 + μ 2 μ ω 2 β μ 2 μ ) + N a m 1 1 Δ max { a m 1 1 Δ b m , 1 } + b m ( 1 + C ^ ) 2 + μ 2 μ l β ω 0 2 μ 2 + μ + Δ N ( 2 + μ 2 μ ω 0 2 β μ 2 μ ) + ( N + 1 ) b m + N C ^ 2 + μ 2 μ l ( β ω 0 ) 2 + μ 2 μ (6.7)

where

b m C ^ 4 ( 2 l o g ( 10 / δ ) l + u + 2 l o g ( 10 / δ ) + 1 ) ( l β ( 1 s ) + l β ( 2 s ) 1 / 2 + l β ( 1 s ) ( l + u ) 1 / 2 + l β ( 2 s ) ω 0 ( 1 θ ) ( l + u ) 1 / 2 + l β ( 2 s θ ) ω 0 ( 1 θ ) ( l + u ) ( 1 θ ) / 2 )

Noting that 0 < β 1 2 < ω 0 , to ensure that

( β ω 0 ) 2 + μ 2 μ + Δ N ( 2 + μ 2 μ ω 0 2 β μ 2 μ ) β ( 1 s ) (6.8)

we only need

Δ N ω 0 β Δ ω 0 Δ β ( 1 Δ ) s β (6.9)

Then we get

N m a x { l o g 2 + μ 2 μ ω 0 β Δ ω 0 Δ β ( 1 Δ ) s β , 1 } l o g 2 + μ 2 μ ω 0 Δ / 2 ω 0 1 / 2 + 1 N 0 (6.10)

Combine (6.2) and (6.10), we have

f z , γ ( ( 1 + C ^ ) 2 + μ 2 μ + ( N 0 + 1 ) C ^ 4 × ( 2 l o g ( 10 / δ ) l + u + 2 l o g ( 10 / δ ) + 1 ) ) l β ( 1 s ) , (6.11)

with

r N 0 ( ( 1 + C ^ ) 2 + μ 2 μ + ( N 0 + 1 ) C ^ 4 × ( 2 log ( 10 / δ ) l + u + 2 log ( 10 / δ ) + 1 ) ) l β ( 1 s ) (6.12)

where The bound follows by replacing δ by δ N 0 .,

Theorem 6.1. Assume (3.7) and (3.13) hold. Taking γ A = l β , 0 < β 1 2 , l = u and γ I = γ A 2 s . Suppose that ρ has a t-quantile of p average type q for some p ( 0, + ] and q ( 1, ) , p = p q p + 1 > 0 . Then for any 0 < δ < 1 , with confidence 1 δ , we have

π ( f z , γ ) f ρ τ L ρ X p q a ( ( 1 + C ^ ) 2 μ 2 + μ + ( N 0 + 1 ) C ^ 4 × ( 2 log ( 10 / δ ) l + u + 2 log ( 10 / δ ) + 1 ) ) × ( l ( ω 0 ( 1 θ ) 2 μ 2 + μ β ( 1 s ) ) + l β s ) + b ( 2 log ( 10 / δ ) l + u + 2 log ( 10 / δ ) + 1 ) × ( l β s ( l + u ) 1 / 2 + l ( ω 0 ( 1 θ ) β ( 1 s ) ) ( l + u ) 1 / 2 + l ( ω 0 ( 1 θ ) β ( 1 s θ ) ) ( l + u ) ( 1 θ ) / 2 ) , (6.13)

Proof. Applying Lemma 3.1, proposition 6.1 and proposition 5.1, with confidence 1 δ , we have

π ( f z , γ ) f ρ τ L ρ X P q C q , ρ C ^ ( ( 1 + C ^ ) 2 μ 2 + μ + ( N 0 + 1 ) C ^ 4 ( 2 log ( 10 / δ ) l + u + 2 log ( 10 / δ ) + 1 ) ) 2 μ 2 + μ × l ( ω 0 ( 1 θ ) 2 μ 2 + μ β ( 1 s ) ) + C q , ρ C 4 ( 2 log ( 10 / δ ) l + u + 2 log ( 10 / δ ) + 1 ) Ψ 2 ( l , u , γ ) a ( ( 1 + C ^ ) 2 μ 2 + μ + ( N 0 + 1 ) C ^ 4 ( 2 log ( 10 / δ ) l + u + 2 log ( 10 / δ ) + 1 ) ) × ( l ( ω 0 ( 1 θ ) 2 μ 2 + μ β ( 1 s ) ) + l β s ) + b ( 2 log ( 10 / δ ) l + u + 2 log ( 10 / δ ) + 1 ) × ( l β s ( l + u ) 1 / 2 + l ( ω 0 ( 1 θ ) β ( 1 s ) ) ( l + u ) 1 / 2 + l ( ω 0 ( 1 θ ) β ( 1 s θ ) ) ( l + u ) ( 1 θ ) / 2 ) (6.14)

Here a , b is a constant independent of l , u , δ and

Ψ 2 ( l , u , γ ) = l β s + l β s ( l + u ) 1 / 2 + l ( ω 0 ( 1 θ ) β ( 1 s ) ) ( l + u ) 1 / 2 + l ( ω 0 ( 1 θ ) β ( 1 s θ ) ) ( l + u ) ( 1 θ ) / 2 (6.15)

with β 1 2 The proof is complete.,

7. The Sparsity of the Algorithm

In this subsection, we consider the purpose of investigating sparsity of algorithm (2.7). Here the sparsity means the vanishing of some coefficients in the expansion

f ^ z , γ = i = 1 l + u α i K ( x , x i ) (7.1)

We provide a general result for the vanishing of the coefficient.

Proposition 7.1. Let | f z , γ | 1 , j { 1, , l + u } and α ^ = ( α 1 , , α l + u ) be the coefficient vector of f ^ z , γ . If

max τ , 1 τ κ γ A 8 (7.2)

ω γ I 2 ( l + u ) 2 i , t = 1 l + u | α j | ( K ( x j , x i ) K ( x j , x t ) ) 2 γ A 8 (7.3)

ω γ I ( l + u ) 2 i , t = 1 l + u | K ( x j , x i ) K ( x j , x t ) | | f ( x i ) f ( x t ) | γ A 8 (7.4)

Then we have α j = 0 .

Proof. Define the function F ( α ) in (2.7) to be optimized with α ^ = ( α 1 , , α j 1 ,0, , α l + u ) l + u . Denote α ˜ = ( α 1 , , α j 1 , 0 , , α l + u ) by substituting the jth component of α ^ to zero. Comparing F ( α ^ ) with F ( α ˜ ) , since 0 K ( x i , x j ) κ and 0 ω i j ω , we have

F ( α ^ ) F ( α ˜ ) ( γ max τ , 1 τ κ ) | α j | ( ω γ I 2 ( l + u ) 2 i , t = 1 l + u | α j | ( K ( x j , x i ) K ( x j , x t ) ) 2 ) | α j | ( ω γ I ( l + u ) 2 i , t = 1 l + u | K ( x j , x i ) K ( x j , x t ) | | f ( x i ) f ( x t ) | ) | α j | (7.5)

If (7.2)-(7.4) are satisfied, we see from F ( α ^ ) F ( α ˜ ) 0 that we must have α j = 0 . In order to estimate error f ^ z , γ f ρ τ L ρ X p , we only need to bounds ε τ ( f ^ z , γ ) ε τ ( f ρ τ ) . We thus derive the following inequality, which plays an important role in our mathematical analysis.,

8. Conclusion

In this paper, we have a discussion of the lowest convergence rate of quantile regression with manifold regularization optimizing the intrinsic structure using the unlabeled data. The main result is to establish an upper bound for the total

error showing less than O ( l ( 1 2 β ) ) . Meanwhile, the quantile regression

provides a piecewice linearity but a convex technique to overcome difficulties such as a high nonlinearity dependence on the predictor and linear suboptimal models. Finally, the sparsity is analysised in the l 1 space.

Acknowledgements

We would like to thank the referees for their constructive comments and suggestions which have improved the paper. This work was supported by the [the Natural Sciences Foundation of China] under Grant [number 71401124].

Cite this paper

Feng, R. , Chen, S. and Rong, L. (2017) Quantile Regression Based on Laplacian Manifold Regularizer with the Data Sparsity in l1 Spaces. Open Journal of Statistics, 7, 786-802. doi: 10.4236/ojs.2017.75056.

References

[1] Heagerty, P. and Pepe, M. (1999) Semiparametric Estimation of Regression Quantiles with Application to Standardizing Weight for Height and Age in U.S. Children. Journal of the Royal Statistical Society, Series C, 48, 533-551.
https://doi.org/10.1111/1467-9876.00170
[2] Koenker, R. and Geling, O. (2001) Reappraising Medfly Longevity: A Quantile Regression Survival Analysis. Journal of the American Statistical Association, 96, 458-468.
https://doi.org/10.1198/016214501753168172
[3] Koenker, R. and Hallock, K. (2001) Quantile Regression: An Introduction. Journal of Economic Perspectives, 15, 43-56.
https://doi.org/10.1257/jep.15.4.143
[4] Shi, L., Huang, X., Tian, Z. and Suykens, J.A.K. (2013) Quantile Regression with Regularization and Gaussian Kernels. Advances in Computational Mathematics, 40, 517-551.
[5] Felipe, C. and Ding, Z. (2007) Learning Theory: An Approximation Theory Viewpoint. Cambridge Monographs on Applied and Computational Mathematics. www.cambridge.org/9780521865593
[6] Joachims (1999) Transductive Inference for Text Classification Using Support Vector Machines. Proceedings of the Sixteen International Conference on Machine Learning, 200-209.
[7] Bousquet, O., Chapelle, O. and Hein, M. (1999) Measure Based Regularization. Asvances in Neural Information Processing Systems, 16.
[8] Li, M., Zhang, M. and Sun, H. (2015) Conditional Qunantile Regression with Regularization and Instensitive Pinball Loss. International Journal of Wavelets, Multiresolution and Information Processing, 13.
[9] Li, M. and Hong, W.S. (2015) Asymptotic Analysis of Quantile Regression Learning Based on Coefficient Dependent Regularization. International Journal of Wavelets, Multiresolution and Information Processing, 13, Article ID: 1550018.

  
comments powered by Disqus

Copyright © 2017 by authors and Scientific Research Publishing Inc.

Creative Commons License

This work and the related PDF file are licensed under a Creative Commons Attribution 4.0 International License.