
Online learning algorithms are attractive because they proceed by efficient iterative updates instead of solving a full optimization problem. In this paper, online learning with privacy protection is considered. A perturbation term is added to the classical online algorithm to obtain the differential privacy property. First the distribution of the perturbation term is derived, and then an error analysis for the new algorithm is performed, establishing convergence and a learning rate. From the error analysis, a theoretically justified choice of the differential privacy parameters can be made.

Online learning has been widely used in computer science recently, due to its computational efficiency and solid theoretical guarantees. Compared with classical batch learning in learning theory, online algorithms update the output using only the most recent sample point. Such algorithms are therefore very effective for practical problems and have been studied in [ ].

Our setting for online learning is introduced as follows. Let the input space $X$ be a compact metric space, and let the output satisfy $Y \in [-M, M]$ for some $M > 0$, as in a regression problem. Denote $Z := X \times Y$ as the sample space. Assume there is a probability measure $\rho$ on $Z$, which can be decomposed into the marginal distribution $\rho_X$ on $X$ and the conditional distribution $\rho(y|x)$ on $Y$ at $x \in X$. Then the regression function is defined by

$$f_\rho(x) = \int_Y y \, d\rho(y|x) \qquad (1)$$

which is indeed the conditional expectation of $y$ given $x$. The regression function minimizes the least squares generalization error (see [ ])

$$\mathcal{E}(f) := \int_Z (f(x) - y)^2 \, d\rho \qquad (2)$$

So learning algorithms aim to approximate the regression function based on samples $\{z_t = (x_t, y_t)\}_{t = 0, 1, 2, \cdots}$, which are drawn independently from the distribution $\rho$.

Let $K: X \times X \to \mathbb{R}$ be a Mercer kernel, and let $H_K$ be the induced reproducing kernel Hilbert space (RKHS, [ ]), i.e., the completion of $\mathrm{span}\{K_x : x \in X\}$, where $K_x(x') = K(x, x')$ for any $x, x' \in X$, with respect to the inner product $\langle K_x, K_{x'} \rangle_K = K(x, x')$. The corresponding norm in $H_K$ is denoted $\|\cdot\|_K$. Our online learning algorithm is then given by

$$f_{t+1} = f_t - \eta_t \big[ (f_t(x_t) - y_t) K_{x_t} + \lambda_t f_t \big], \quad t = 0, 1, 2, \cdots \qquad (3)$$

with $f_0 = 0$. Here $\eta_t > 0$ is the step size and $\lambda_t > 0$ is the regularization parameter.
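As a concrete illustration, iteration (3) can be sketched by storing $f_t$ as a kernel expansion over the observed inputs. The Gaussian kernel, the parameter values, and all function names below are illustrative assumptions, not part of the paper:

```python
import math

def gaussian_kernel(x, xp, sigma=1.0):
    # An assumed choice of Mercer kernel K(x, x'); any Mercer kernel works here.
    return math.exp(-(x - xp) ** 2 / (2.0 * sigma ** 2))

def online_step(points, coeffs, x_t, y_t, eta_t, lam_t, kernel=gaussian_kernel):
    """One update f_{t+1} = f_t - eta_t[(f_t(x_t) - y_t) K_{x_t} + lam_t f_t].

    f_t is represented as sum_i coeffs[i] * K(points[i], .), so the
    regularization shrinks all old coefficients by (1 - eta_t * lam_t)
    and the data-fit term appends one new coefficient at center x_t.
    """
    f_t_at_x = sum(c * kernel(p, x_t) for p, c in zip(points, coeffs))
    coeffs = [(1.0 - eta_t * lam_t) * c for c in coeffs]
    coeffs.append(-eta_t * (f_t_at_x - y_t))
    return points + [x_t], coeffs

def run_online(samples, theta=0.75, t0=4.0):
    # Schedules as chosen later in the paper:
    # eta_t = 1/(t+t0)^theta, lam_t = 1/(t+t0)^(1-theta).
    points, coeffs = [], []           # f_0 = 0
    for t, (x_t, y_t) in enumerate(samples):
        eta_t = 1.0 / (t + t0) ** theta
        lam_t = 1.0 / (t + t0) ** (1.0 - theta)
        points, coeffs = online_step(points, coeffs, x_t, y_t, eta_t, lam_t)
    return points, coeffs

def evaluate(points, coeffs, x, kernel=gaussian_kernel):
    return sum(c * kernel(p, x) for p, c in zip(points, coeffs))
```

For instance, repeatedly feeding the sample $(0, 1)$ drives the estimate at $0$ toward $1$, damped by the regularization term.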

When this online algorithm is applied to a private data set, it may leak sensitive information. To deal with this privacy problem, Dwork et al. introduced differential privacy in [ ]. For two data sets $z_1, z_2 \in Z^m$, define the distance

$$d(z_1, z_2) = \#\{ i = 1, \cdots, m : z_{1,i} \neq z_{2,i} \} \qquad (4)$$

Definition 1 A randomized algorithm $A: Z^m \to \mathrm{Range}(A)$ is $\epsilon$-differentially private if for every two data sets $z_1, z_2$ satisfying $d(z_1, z_2) = 1$, and every set $O \subseteq \mathrm{Range}(A)$, there holds

$$\Pr\{ A(z_1) \in O \} \le e^{\epsilon} \cdot \Pr\{ A(z_2) \in O \} \qquad (5)$$

To endow our online algorithm with the differential privacy property, a perturbation term is added to the output of (3), that is,

$$f_{t,A} = f_t + b_t \qquad (6)$$

where $b_t$ takes values in $\mathbb{R}$, with a distribution to be determined in the following analysis.

Differentially private online learning has already been studied in [ ].

In this section, a detailed analysis of the perturbation term $b_t$ in algorithm (6) is conducted. First recall the useful definition of sensitivity and the lemma proposed in [ ].

Definition 2 Denote by $\Delta f_t$ the maximal infinity norm of the difference between the outputs when the last sample point in $z$ is changed. That is, let $z = \{(x_i, y_i)\}_{i=0}^{t-1}$ and $\bar{z} = \{(x_0, y_0), (x_1, y_1), \cdots, (x_{t-2}, y_{t-2}), (\bar{x}_{t-1}, \bar{y}_{t-1})\}$, and let $f_t$ and $\bar{f}_t$ be derived from (3) accordingly; then

$$\Delta f_t := \sup_{z, \bar{z}} \| f_t - \bar{f}_t \|_\infty \qquad (7)$$

Then a result similar to that of [ ] follows.

Lemma 1 Assume $\Delta f_t$ is bounded by $C_t > 0$, and that $b_t$ has density function proportional to $\exp\{ -\epsilon |b| / C_t \}$. Then algorithm (6) provides $\epsilon$-differential privacy.

Proof. For any possible output function $r$, and $z, \bar{z}$ differing in the last element, we have

$$\Pr\{ f_{t,A} = r \} = \Pr_{b_t}\{ b_t = r - f_t \} \propto \exp\Big( -\frac{\epsilon |r - f_t|}{C_t} \Big) \qquad (8)$$

and

$$\Pr\{ \bar{f}_{t,A} = r \} = \Pr_{b_t}\{ b_t = r - \bar{f}_t \} \propto \exp\Big( -\frac{\epsilon |r - \bar{f}_t|}{C_t} \Big) \qquad (9)$$

So by triangle inequality,

$$\Pr\{ f_{t,A} = r \} \le \Pr\{ \bar{f}_{t,A} = r \} \times e^{\epsilon | f_t - \bar{f}_t | / C_t} \le e^{\epsilon} \Pr\{ \bar{f}_{t,A} = r \} \qquad (10)$$

Then the lemma is proved by a union bound.
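A quick numerical sanity check of this argument, with hypothetical values for $\epsilon$, $C_t$, $f_t$ and $\bar{f}_t$ (assumed for illustration, not taken from the paper): the ratio of the two densities never exceeds $e^\epsilon$ as long as $|f_t - \bar{f}_t| \le C_t$.

```python
import math

def laplace_density(r, center, eps, C):
    # Normalized density (eps / 2C) * exp(-eps * |r - center| / C), as in Lemma 1.
    return (eps / (2.0 * C)) * math.exp(-eps * abs(r - center) / C)

eps, C = 0.5, 1.0
f_t, f_t_bar = 0.3, 1.1          # hypothetical outputs with |f_t - f_t_bar| <= C
ratios = [
    laplace_density(r / 10.0, f_t, eps, C) / laplace_density(r / 10.0, f_t_bar, eps, C)
    for r in range(-100, 101)    # grid of candidate outputs r in [-10, 10]
]
worst = max(ratios)
# Triangle inequality: worst <= exp(eps * |f_t - f_t_bar| / C) <= e^eps.
assert worst <= math.exp(eps) + 1e-12
```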

It is clear that once an upper bound for $\Delta f_t$ is found, the distribution of $b_t$ can be derived. Set $\eta_t = 1/(t + t_0)^\theta$ and $\lambda_t = 1/(t + t_0)^{1-\theta}$ for some $t_0 > 0$ and $0 < \theta < 1$. Moreover, denote $\kappa = \sup_{x, x' \in X} \sqrt{K(x, x')}$ (finite since $K$ is a Mercer kernel on the compact metric space $X$). The next lemma is taken from [ ].

Lemma 2 If $t_0^\theta \ge \kappa^2 + 1$, then for all $t \in \mathbb{N}$, there holds

$$\| f_t \|_K \le \frac{\kappa M}{\lambda_t} \qquad (11)$$

Now the main result for differential privacy for algorithm (6) follows.

Theorem 1 Choose $\eta_t = 1/(t + t_0)^\theta$ and $\lambda_t = 1/(t + t_0)^{1-\theta}$ for some $\frac{1}{2} < \theta < 1$ and $t_0^\theta \ge \kappa^2 + 1$, and let the density function of $b_t$ be $\frac{1}{\alpha} \exp\{ -\epsilon |b| / C_t \}$ with $\alpha = 2 C_t / \epsilon$ and

$$C_t = \frac{2 \kappa^2 (\kappa^2 + 1) M}{(t - 1 + t_0)^{2\theta - 1}} \qquad (12)$$

Then algorithm (6) provides $\epsilon$-differential privacy.
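In implementation terms, the release step (6) amounts to adding Laplace noise of scale $C_t/\epsilon$ to the released value. A minimal sketch (the kernel bound $\kappa$, $M$, $\theta$, $t_0$ defaults and function names are assumptions; the sensitivity constant follows the form in (12)/(17)):

```python
import random

def sensitivity_bound(t, kappa, M, theta, t0):
    # C_t of the form 2 kappa^2 (kappa^2 + 1) M / (t - 1 + t0)^(2 theta - 1).
    return 2.0 * kappa**2 * (kappa**2 + 1.0) * M / (t - 1.0 + t0) ** (2.0 * theta - 1.0)

def sample_laplace(scale):
    # A Laplace(0, scale) draw: random sign times an Exp(1/scale) magnitude,
    # i.e. density (1 / (2 * scale)) * exp(-|b| / scale).
    magnitude = random.expovariate(1.0 / scale)
    return magnitude if random.random() < 0.5 else -magnitude

def release(f_t_value, t, eps, kappa=1.0, M=1.0, theta=0.75, t0=4.0):
    """Differentially private release f_{t,A} = f_t + b_t as in (6)."""
    scale = sensitivity_bound(t, kappa, M, theta, t0) / eps
    return f_t_value + sample_laplace(scale)
```

Note that the density in Lemma 1, proportional to $\exp\{-\epsilon|b|/C_t\}$, is exactly a Laplace density with scale $C_t/\epsilon$, so $E|b_t| = C_t/\epsilon$ as used later.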

Proof. From (3) there holds

$$f_t = f_{t-1} - \eta_{t-1} \big[ (f_{t-1}(x_{t-1}) - y_{t-1}) K_{x_{t-1}} + \lambda_{t-1} f_{t-1} \big] \qquad (13)$$

and

$$\bar{f}_t = f_{t-1} - \eta_{t-1} \big[ (f_{t-1}(\bar{x}_{t-1}) - \bar{y}_{t-1}) K_{\bar{x}_{t-1}} + \lambda_{t-1} f_{t-1} \big] \qquad (14)$$

Then

$$f_t - \bar{f}_t = \eta_{t-1} \big[ (f_{t-1}(\bar{x}_{t-1}) - \bar{y}_{t-1}) K_{\bar{x}_{t-1}} - (f_{t-1}(x_{t-1}) - y_{t-1}) K_{x_{t-1}} \big] \qquad (15)$$

From the above lemma, $\| f_{t-1} \|_K \le \kappa M / \lambda_{t-1}$ for all $t$. By the reproducing property, $f(x) = \langle f, K_x \rangle_K \le \| f \|_K \| K_x \|_K \le \kappa \| f \|_K$ (see [ ]), so

$$\| f_t - \bar{f}_t \|_K \le 2 \eta_{t-1} \Big( \frac{\kappa^2 M}{\lambda_{t-1}} + M \Big) \kappa \qquad (16)$$

Therefore

$$\Delta f_t = \sup_{z, \bar{z}} \| f_t - \bar{f}_t \|_\infty \le \frac{2 \kappa^2 (\kappa^2 + 1) M}{(t - 1 + t_0)^{2\theta - 1}} \qquad (17)$$

Setting $C_t$ to be the right-hand side of (17) and applying Lemma 1 proves the theorem.
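The sensitivity bound (17) can also be probed empirically: run iteration (3) on two sample sequences differing only in the last point and compare the sup-norm gap to $C_t$. The kernel, data, and parameter values below are illustrative assumptions (with $\kappa = M = 1$):

```python
import math

def k(x, xp):
    # Assumed Gaussian kernel with kappa = sup sqrt(K(x, x)) = 1.
    return math.exp(-(x - xp) ** 2)

def run(samples, theta=0.75, t0=4.0):
    # Iteration (3), with f_t stored as a kernel expansion.
    pts, cs = [], []
    for t, (x, y) in enumerate(samples):
        eta = 1.0 / (t + t0) ** theta
        lam = 1.0 / (t + t0) ** (1.0 - theta)
        f_at_x = sum(c * k(p, x) for p, c in zip(pts, cs))
        cs = [(1.0 - eta * lam) * c for c in cs] + [-eta * (f_at_x - y)]
        pts = pts + [x]
    return pts, cs

M, kappa, theta, t0 = 1.0, 1.0, 0.75, 4.0
z = [(0.1, 0.5), (0.4, -0.2), (0.9, 1.0)]        # |y| <= M
z_bar = z[:-1] + [(0.2, -1.0)]                   # change only the last point
(p1, c1), (p2, c2) = run(z), run(z_bar)
t = len(z)

# Approximate the sup-norm of f_t - f_t_bar on a grid covering the inputs.
gap = max(
    abs(sum(c * k(p, x / 50.0) for p, c in zip(p1, c1))
        - sum(c * k(p, x / 50.0) for p, c in zip(p2, c2)))
    for x in range(-100, 151)
)
C_t = 2.0 * kappa**2 * (kappa**2 + 1.0) * M / (t - 1.0 + t0) ** (2.0 * theta - 1.0)
assert gap <= C_t    # the empirical gap respects the theoretical bound
```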

In this section, $f_\rho \in H_K$ is assumed for simplicity. It will be shown that $f_{t,A}$ obtained from (6) still converges to the regression function $f_\rho$ for an appropriate choice of the parameter $\epsilon$, under the choice of $\eta_t$ and $\lambda_t$ as in the theorem of the last section. To this end, an error decomposition is needed. Define the operators $L_t: H_K \to H_K$ by $L_t(f) = f(x_t) K_{x_t}$ for $t = 0, 1, 2, \cdots$, and let $I$ be the identity operator. It is easy to verify that $\| L_t \| \le \kappa^2$. Since $f_\rho \in H_K$, the following decomposition can be deduced:

$$f_{t+1} - f_\rho = (I - \eta_t \lambda_t I - \eta_t L_t)(f_t - f_\rho) + \eta_t \big[ y_t K_{x_t} - (\lambda_t I + L_t) f_\rho \big] \qquad (18)$$

$$= A_t (f_t - f_\rho) + B_t = A_t \big[ A_{t-1}(f_{t-1} - f_\rho) + B_{t-1} \big] + B_t = \cdots \qquad (19)$$

$$= A_t A_{t-1} \cdots A_0 (f_0 - f_\rho) + \big[ B_t + A_t B_{t-1} + \cdots + A_t A_{t-1} \cdots A_1 B_0 \big] \qquad (20)$$

Here $A_t = I - \eta_t \lambda_t I - \eta_t L_t$ and $B_t = \eta_t [ y_t K_{x_t} - (\lambda_t I + L_t) f_\rho ]$. In the following, the first term is called the initial error and the second the sample error. The initial error is easy to bound from the analysis above. Since $t_0$ satisfies $t_0^\theta \ge \kappa^2 + 1$, $A_t$ is a positive operator with $\| A_t \| \le 1 - \eta_t \lambda_t = (t + t_0 - 1)/(t + t_0)$. Hence

$$\| A_t A_{t-1} \cdots A_0 (f_0 - f_\rho) \|_K \le \| A_t \| \| A_{t-1} \| \cdots \| A_0 \| \cdot \| f_0 - f_\rho \|_K \le \prod_{j=0}^{t} \frac{j + t_0 - 1}{j + t_0} \| f_0 - f_\rho \|_K = \frac{t_0 - 1}{t + t_0} \| f_\rho \|_K \le \frac{t_0}{t + t_0} \| f_\rho \|_K.$$
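The product of operator-norm bounds telescopes, and this can be checked in exact arithmetic: it equals $(t_0 - 1)/(t + t_0)$, which is bounded by the $t_0/(t + t_0)$ used above (the particular $t$, $t_0$ values are assumptions for the check):

```python
from fractions import Fraction

t, t0 = 50, 4
prod = Fraction(1)
for j in range(t + 1):               # j = 0, 1, ..., t
    prod *= Fraction(j + t0 - 1, j + t0)
# Every numerator cancels the previous denominator, leaving (t0 - 1)/(t + t0).
assert prod == Fraction(t0 - 1, t + t0)
assert prod <= Fraction(t0, t + t0)  # the bound used for the initial error
```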

The sample error is more difficult to handle, and the Pinelis-Bernstein inequality [ ] will be used.

Lemma 3 Let $\{\xi_i\}$ be a martingale difference sequence in a Hilbert space. Suppose that almost surely $\| \xi_i \| \le B$ and $\sum_{i=1}^t E_{i-1} \| \xi_i \|^2 \le \sigma_t^2$ for some constants $B, \sigma_t > 0$, $t = 1, 2, \cdots$. Then for any $0 < \delta < 1$, with probability at least $1 - \delta$, there holds

$$\Big\| \sum_{i=1}^t \xi_i \Big\| \le 2 \Big( \frac{B}{3} + \sigma_t \Big) \ln\Big( \frac{2}{\delta} \Big) \qquad (21)$$

Now the bound for the sample error can be derived. Notice that $\| B_t \|_K \le \eta_t [ M\kappa + (\kappa^2 + \lambda_t) \| f_\rho \|_K ] \le \eta_t [ M\kappa + (\kappa^2 + 1) \| f_\rho \|_K ]$. Set $\xi_i = A_t A_{t-1} \cdots A_i B_{i-1}$, $i = 1, 2, \cdots, t$; then

$$\| \xi_i \|_K \le \| A_t \| \| A_{t-1} \| \cdots \| A_i \| \| B_{i-1} \|_K \le \frac{i + t_0 - 1}{t + t_0} \eta_{i-1} \big[ M\kappa + (\kappa^2 + 1) \| f_\rho \|_K \big] \qquad (22)$$

$$= \frac{1}{t + t_0} \frac{1}{\lambda_{i-1}} \big[ M\kappa + (\kappa^2 + 1) \| f_\rho \|_K \big] \le \frac{1}{t + t_0} \frac{1}{\lambda_t} \big[ M\kappa + (\kappa^2 + 1) \| f_\rho \|_K \big] \qquad (23)$$

$$= \eta_t \big[ M\kappa + (\kappa^2 + 1) \| f_\rho \|_K \big]. \qquad (24)$$
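The identity and inequality chain in (22)-(24) can be verified numerically for the schedules $\eta_s = (s + t_0)^{-\theta}$, $\lambda_s = (s + t_0)^{-(1-\theta)}$ (the particular $\theta$, $t_0$, $t$ values below are assumptions for the check):

```python
theta, t0, t = 0.75, 4.0, 40

def eta(s):
    return 1.0 / (s + t0) ** theta

def lam(s):
    return 1.0 / (s + t0) ** (1.0 - theta)

for i in range(1, t + 1):
    lhs = (i + t0 - 1.0) / (t + t0) * eta(i - 1)      # factor appearing in (22)
    mid = 1.0 / ((t + t0) * lam(i - 1))               # rewritten form in (23)
    # (i - 1 + t0) * (i - 1 + t0)^(-theta) = (i - 1 + t0)^(1 - theta)
    assert abs(lhs - mid) < 1e-12
    # Since i - 1 + t0 <= t + t0 and 1 - theta > 0, this is at most eta(t) as in (24).
    assert mid <= eta(t) + 1e-12
```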

Moreover, $\sum_{i=1}^t E_{i-1} \| \xi_i \|_K^2 \le t \eta_t^2 \big[ M\kappa + (\kappa^2 + 1) \| f_\rho \|_K \big]^2$. So for any $0 < \delta < 1$, with probability at least $1 - \delta$,

$$\Big\| \sum_{i=1}^t \xi_i \Big\|_K \le \frac{8}{3} \sqrt{t}\, \eta_t \big[ M\kappa + (\kappa^2 + 1) \| f_\rho \|_K \big] \ln\Big( \frac{2}{\delta} \Big) \le \frac{8}{3} \frac{1}{t^{\theta - 1/2}} \big[ M\kappa + (\kappa^2 + 1) \| f_\rho \|_K \big] \ln\Big( \frac{2}{\delta} \Big) \qquad (25)$$

Note that $\| B_t \|_K \le \eta_t [ M\kappa + (\kappa^2 + 1) \| f_\rho \|_K ]$; hence

$$\| B_t + A_t B_{t-1} + \cdots + A_t A_{t-1} \cdots A_1 B_0 \|_K \le \frac{11}{3} \frac{1}{t^{\theta - 1/2}} \big[ M\kappa + (\kappa^2 + 1) \| f_\rho \|_K \big] \ln\Big( \frac{2}{\delta} \Big) \qquad (26)$$

Combining the initial error and sample error bounds, and applying the Markov inequality together with the fact that $E|b_{t+1}| = C_{t+1}/\epsilon$, the total error estimate is obtained.
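The Markov step can be made explicit. Given $E|b_{t+1}| = C_{t+1}/\epsilon$, a sketch of the tail bound (splitting the confidence level as $\delta/2$ is an assumption about the intended proof):

```latex
\Pr\left\{ |b_{t+1}| \ge \frac{2 C_{t+1}}{\epsilon \delta} \right\}
  \;\le\; \frac{E|b_{t+1}|}{2 C_{t+1} / (\epsilon \delta)}
  \;=\; \frac{C_{t+1}/\epsilon}{2 C_{t+1} / (\epsilon \delta)}
  \;=\; \frac{\delta}{2},
```

so with probability at least $1 - \delta/2$, $|b_{t+1}| \le 2 C_{t+1}/(\epsilon \delta)$; this Markov step is what produces the polynomial $1/\delta$ dependence in the final bound.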

Theorem 2 Choose $\eta_t$, $\lambda_t$ and $b_t$ as in the theorem of the last section. Then with confidence $1 - \delta$ ($0 < \delta < 1$), there holds

$$\| f_{t+1, A} - f_\rho \|_K \le C_\epsilon \frac{1}{t^{\theta - 1/2}} \cdot \frac{4}{\delta} \qquad (27)$$

where the constant $C_\epsilon = 4 \kappa^2 (\kappa^2 + 1) M / \epsilon + t_0 \| f_\rho \|_K + \frac{11}{3} \big( \kappa M + (\kappa^2 + 1) \| f_\rho \|_K \big)$.

In this paper, analysis is performed for the differential privacy (Theorem 1) and the generalization property (Theorem 2) of the differentially private online learning algorithm (6). Under the choice of parameters in our theorems, algorithm (6) provides $\epsilon$-differential privacy and retains a learning rate close to $1/2$, for any $\epsilon > 0$. However, this error bound is not yet satisfactory. It is an interesting problem for future work to improve the dependence on $\delta$ in the error bound from $2/\delta$ to $\ln(2/\delta)$.

This work is supported by NSFC (Nos. 11326096, 11401247), NSF of Guangdong Province in China (No. 2015A030313674), Foundation for Distinguished Young Talents in Higher Education of Guangdong, China (No. 2013LYM_0089), National Social Science Fund in China (No. 15BTJ024), Planning Fund Project of Humanities and Social Science Research in Chinese Ministry of Education (No. 14YJAZH040) and Doctor Grants of Huizhou University (No. C511.0206).

Nie, W.L. and Wang, C. (2017) A Study on Differential Private Online Learning. Journal of Computer and Communications, 5, 28-33. https://doi.org/10.4236/jcc.2017.52004