Almost Sure Convergence of Proximal Stochastic Accelerated Gradient Methods

Abstract

Proximal gradient descent and its accelerated version are effective methods for minimizing the sum of a smooth and a non-smooth function. When the smooth function can be represented as an average of multiple component functions, the stochastic proximal gradient method performs well; however, its accelerated version remains under-explored. This paper proposes a proximal stochastic accelerated gradient (PSAG) method to address problems involving a combination of smooth and non-smooth components, where the smooth part corresponds to the average of multiple block sums. At the same time, most existing convergence analyses hold only in expectation. To this end, under some mild conditions, we establish almost sure convergence of the method with unbiased gradient estimation in the non-smooth setting. Moreover, we establish that the minimum of the squared gradient mapping norm converges to zero with probability one.


1. Introduction

We consider the following composite optimization problem:

$$\min_{x \in \mathbb{R}^d} F(x) \stackrel{\text{def}}{=} f(x) + g(x), \tag{1}$$

where $g : \mathbb{R}^d \to \mathbb{R}$ is the average of the smooth functions $g_1, \dots, g_n$, i.e. $g(x) = \frac{1}{n}\sum_{i=1}^{n} g_i(x)$, and $f : \mathbb{R}^d \to (-\infty, +\infty]$ is a closed proper function that can be non-differentiable. One of the most well-studied instances of this type of problem is $\ell_1$-regularized least squares [1]:

$$\min_{x \in \mathbb{R}^d} \frac{1}{n}\sum_{i=1}^{n} \frac{1}{2}\left(a_i^{T} x - b_i\right)^2 + \lambda \|x\|_1,$$

where $\|\cdot\|_1$ denotes the standard $\ell_1$-norm.

We frequently encounter optimization problems of this nature in various fields such as machine learning, statistics, signal processing, and imaging [2] [3] [4] [5]. Specifically, we address the task of minimizing the sum of two functions: one is the average of numerous smooth component functions, while the other is a general function amenable to a straightforward proximal mapping. We assume throughout that the problem is well-posed, i.e. $\arg\min F \neq \emptyset$, and that each $g_i$ is bounded from below. Compared with the classical gradient descent (GD) method and stochastic gradient descent (SGD) method [6], the proximal gradient descent (PGD) method has a more specialized scope: it is primarily employed for objective functions that include non-differentiable components. To tackle the non-smooth optimization problem (1), we introduce PGD, which is given by the following update rule for $k = 1, 2, \dots$:

$$x_{k+1} = \operatorname{prox}_{t_k f}\left(x_k - t_k \nabla g(x_k)\right), \tag{2}$$

where $t_k$ is the step size at the $k$th iteration and the proximity operator $\operatorname{prox}_{tf}(\cdot)$ of $f$ is defined by:

$$\operatorname{prox}_{tf}(y) = \arg\min_{x \in \mathbb{R}^d}\left\{\frac{1}{2}\|x - y\|^2 + t f(x)\right\}.$$

We note that $\operatorname{prox}_{tf}(y)$ is a singleton since $f$ is proper, closed, and convex; see, for example, Beck (2017, Theorem 6.3) [7]. A key feature of PGD is that the proximal operator depends only on $f$, not on $g$. That is, $F(x) = g(x) + f(x)$ may combine a rather complex differentiable function with a less complex non-differentiable one; since $g$ is differentiable and its gradient is easy to calculate, we only need to handle the non-differentiable function $f$ through its proximal mapping, which greatly simplifies the problem. PGD can achieve an error level on the objective function of $O(1/k)$ after $k$ iterations [8].
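As an illustration (ours, not from the paper): for the $\ell_1$-regularized least squares problem above, $\operatorname{prox}_{t\lambda\|\cdot\|_1}$ is the componentwise soft-thresholding operator, so iteration (2) can be sketched in a few lines of Python. The data A, b and the parameters lam, t, and num_iters below are illustrative assumptions.

```python
import numpy as np

def soft_threshold(y, tau):
    # prox of tau * ||.||_1: componentwise shrinkage toward zero
    return np.sign(y) * np.maximum(np.abs(y) - tau, 0.0)

def pgd_lasso(A, b, lam, t, num_iters=500):
    """Proximal gradient descent (2) for (1/n) sum_i 0.5*(a_i^T x - b_i)^2 + lam*||x||_1."""
    n, d = A.shape
    x = np.zeros(d)
    for _ in range(num_iters):
        grad = A.T @ (A @ x - b) / n                 # full gradient of the smooth part g
        x = soft_threshold(x - t * grad, t * lam)    # prox step: prox_{t*lam*||.||_1}
    return x

# Illustrative data (assumed for the demo)
rng = np.random.default_rng(0)
A = rng.standard_normal((100, 20))
b = rng.standard_normal(100)
L = np.linalg.norm(A, 2) ** 2 / A.shape[0]           # Lipschitz constant of grad g
x_hat = pgd_lasso(A, b, lam=0.1, t=1.0 / L)
```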

1.1. Related Work

1.1.1. Accelerated Proximal Gradient (APG) Method

Another effective method for solving problem (1) is the accelerated proximal gradient (APG) method, initially proposed by Nesterov [8] for minimizing smooth convex functions with constraints and later extended by Beck and Teboulle [9] to composite convex objective functions. The Fast Iterative Shrinkage-Thresholding Algorithm (FISTA), an accelerated version of the proximal gradient method, has found applications in various fields, including image and signal processing. APG is an accelerated variant of the deterministic gradient descent method that incorporates an extrapolation step. One simple version selects an initial point $x_0 = x_1 \in \mathbb{R}^d$ and repeats, for $k = 1, 2, 3, \dots$:

$$y_{k+1} := x_k + \beta_k (x_k - x_{k-1}),$$

$$x_{k+1} := \operatorname{prox}_{t_k f}\left(y_{k+1} - t_k \nabla g(y_{k+1})\right),$$

where $\beta_k \in [0, 1)$ is an extrapolation parameter and $t_k$ is the usual step size.

These parameters must be chosen in specific ways to achieve accelerated convergence. One simple choice takes $\beta_k = \frac{k}{k+3}$ [10]. It remains to choose the step sizes $t_k$. When $\nabla g$ is Lipschitz continuous with constant $L$, this method can be shown to converge in objective value at rate $O(1/k^2)$ when a fixed step size $t_k = t \in (0, 1/L]$ is used [8] [9]. Following Nesterov, this method is called an accelerated or optimal first-order method, and there are several versions of such methods, such as Nesterov's [8]; the software package TFOCS [11] is based on and contains several implementations of such methods. Li et al. [12] were the first to provide APG-type algorithms for general non-convex and non-smooth problems, ensuring that every accumulation point is a critical point.
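A minimal sketch of this APG iteration for the same $\ell_1$ example (ours, reusing soft_threshold from the PGD sketch above; the fixed step size t is assumed to satisfy $t \leq 1/L$):

```python
def apg_lasso(A, b, lam, t, num_iters=500):
    """Accelerated proximal gradient with beta_k = k/(k+3); a sketch, not the paper's code."""
    n, d = A.shape
    x_prev = np.zeros(d)
    x = np.zeros(d)
    for k in range(1, num_iters + 1):
        beta = k / (k + 3.0)                         # extrapolation parameter [10]
        y = x + beta * (x - x_prev)                  # extrapolation step y_{k+1}
        grad = A.T @ (A @ y - b) / n                 # gradient of g at y_{k+1}
        x_prev, x = x, soft_threshold(y - t * grad, t * lam)
    return x
```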

1.1.2. Proximal Stochastic Gradient Descent (PSGD) Method

With the emergence of big data, the efficiency of deterministic optimization algorithms has gradually become a bottleneck. In the PGD update (2), we need to calculate the full gradient of $g(\cdot)$. When the dataset size $n$ is very large, this is expensive, and first-order stochastic gradient methods have proven very effective thanks to their low iteration complexity. Therefore, one way to reduce computation is to use stochastic algorithms that exploit the finite-sum structure of problem (1), using cheaper iterations while preserving fast convergence. When $f(x) = 0$, problem (1) reduces to the general minimization problem $\min_{x \in \mathbb{R}^d} F(x) \stackrel{\text{def}}{=} \frac{1}{n}\sum_{i=1}^{n} g_i(x)$, where $F(x)$ arises as the average of a very large number of smooth functions. This problem often arises as an approximation of the stochastic optimization loss function $F(x) = \mathbb{E}_{\xi \sim \mathcal{D}}\left[g_\xi(x)\right]$, where $\xi$ is a random variable and $g_\xi : \mathbb{R}^d \to \mathbb{R}$ is smooth for all $\xi$. First-order stochastic methods for the case of a non-smooth regularizer $f(x)$ are an active research area. Non-asymptotic convergence results were first achieved in [13]. For finite-sum problems, Reddi et al. [14] were the first to develop a proximal stochastic variance-reduced gradient algorithm with improved convergence complexity. Metel et al. [15] first presented non-asymptotic convergence results for the non-smooth non-convex constrained sparse optimization problem, giving two simple stochastic proximal gradient algorithms for stochastic and finite-sum optimization problems. Kawashima et al. [16] considered $f(x)$ to be a non-smooth quasi-convex function and achieved the same convergence complexity as Ghadimi et al. (2016) [13]. A stochastic variant of PGD is proximal stochastic gradient descent (PSGD). At each iteration $k = 1, 2, \dots$, it picks $i_k$ from $[n] = \{1, 2, \dots, n\}$ independently and identically distributed, each index with probability $\frac{1}{n}$. The PSGD update is:

$$x_{k+1} = \operatorname{prox}_{t_k f}\left(x_k - t_k \nabla g_{i_k}(x_k)\right), \tag{3}$$

where $t_k > 0$ is a sequence of step sizes (also known as learning rates). Since the index $i_k$ is sampled i.i.d. uniformly over $[n] = \{1, 2, \dots, n\}$, the gradient $\nabla g_{i_k}(x_k)$ satisfies the unbiasedness property:

$$\mathbb{E}\left[\nabla g_{i_k}(x_k)\right] = \sum_{i=1}^{n} \frac{1}{n} \nabla g_i(x_k) = \nabla g(x_k). \tag{4}$$

The advantage of PSGD over PGD lies in the fact that, at each iteration, PSGD only requires the computation of a single gradient $\nabla g_{i_k}(x_k)$, whereas each iteration of PGD evaluates all $n$ gradients $\nabla g_i$. As a result, the per-iteration computational cost of PSGD is $\frac{1}{n}$ that of PGD; that is, computing $\nabla g_{i_k}(x_k)$ is approximately $n$ times less expensive than computing $\nabla g(x_k)$. Numerous algorithms have been devised to tackle the composite optimization problem (1). Gorbunov et al. [17] provided a unified analysis covering a broad range of variants of PSGD. In a study by Cevher et al. [18], it was demonstrated that PSGD, assuming strong convexity, displays linear convergence towards a region dominated by noise.
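A sketch of update (3) for the $\ell_1$ example (again reusing soft_threshold from above; the decaying step-size schedule passed in is an illustrative assumption):

```python
def psgd_lasso(A, b, lam, step_sizes, seed=1):
    """Proximal stochastic gradient descent (3): one random component gradient per step."""
    n, d = A.shape
    x = np.zeros(d)
    rng = np.random.default_rng(seed)
    for t in step_sizes:
        i = rng.integers(n)                          # i_k uniform on {0,...,n-1}, i.i.d.
        grad_i = (A[i] @ x - b[i]) * A[i]            # gradient of the component g_i at x_k
        x = soft_threshold(x - t * grad_i, t * lam)
    return x

# e.g. a decaying schedule t_k = t0 / sqrt(k) (an illustrative choice)
x_hat = psgd_lasso(A, b, lam=0.1, step_sizes=[0.1 / np.sqrt(k) for k in range(1, 2001)])
```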

1.1.3. Convergence Criteria

The vast majority of convergence-rate results for stochastic gradient methods in the literature are stated in expectation (see, e.g., SGD, the stochastic heavy ball (SHB) method [19], stochastic Nesterov's accelerated gradient (SNAG), and so forth). However, almost sure convergence (a.s. for short, also known as "convergence with probability 1") [20] properties are important because they describe what happens to individual trajectories of the stochastic iterations, which are the instantiations of the stochastic algorithms actually used in practice. Therefore, almost sure convergence of methods based on stochastic gradients is of practical relevance. For SGD in the convex and smooth setting, Sebbouh et al. [21] provided almost sure asymptotic convergence rates for a weighted average of the iterates. The almost sure convergence of the last iterate of SGD on non-convex functions was given by Orabona [22]. Liu et al. [23] provided a unified almost sure convergence rate analysis for SGD, SHB and SNAG. Recently, Liang et al. [24] presented an almost sure convergence analysis of stochastic composite objective mirror descent (SCOMID).

Remark 1.1. It is worth noting that our convergence proof itself does not rely on the convexity of the function f; convexity of f is assumed only to ensure the continuity of the prox-grad operator, and hence the boundedness of the gradient mapping (see Remark 3.2).

1.2. Proximal Stochastic Accelerated Gradient (PSAG) Method

Table 1. Comparison of problem setting, momentum, algorithm and convergence results with some relevant literature. SCOMID = Stochastic Composite Objective Mirror Descent; PSGD = Proximal Stochastic Gradient Descent; SGD = Stochastic Gradient Descent; AdaGrad = Adaptive Stochastic Gradient Descent; SHB = Stochastic Heavy Ball; SNAG = Stochastic Nesterov's Accelerated Gradient; PSAG = Proximal Stochastic Accelerated Gradient (in this paper).

Based on the above, we now consider an accelerated version of the proximal stochastic gradient method that uses an unbiased stochastic gradient (see the last column of Table 1). The iteration of the PSAG method is given by:

$$y_{k+1} := x_k + \beta_k (x_k - x_{k-1}),$$

$$x_{k+1} := \operatorname{prox}_{t_k f}\left(y_{k+1} - t_k G_k\right), \tag{5}$$

where $\beta_k \in [0, 1)$ is an extrapolation parameter, $t_k$ is the usual step size, and $G_k = \nabla g_{i_k}(y_{k+1})$ is an unbiased estimator of the gradient, with $i_k$ picked from $[n] = \{1, 2, \dots, n\}$ independently and identically distributed, each index with probability $\frac{1}{n}$. Then we have:

$$x_{k+1} \in \arg\min_{x \in \mathbb{R}^d}\left\{t\langle G_k, x\rangle + t f(x) + \frac{1}{2}\|x - x_k\|^2\right\}, \tag{6}$$

where, in view of (5), $t = t_k$ is the stepsize, $G_k$ is the stochastic gradient, and $x_k = y_{k+1}$ is the extrapolated point. For the non-smooth composite problem (1), we establish almost sure convergence rates for PSAG showing that the minimum of the squared gradient mapping norm becomes arbitrarily close to zero with probability one. Note that the PSAG method reduces to the PSGD method when $\beta_k = 0$ in (5).
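A minimal Python sketch of iteration (5) for the $\ell_1$ example follows (reusing soft_threshold from above); the momentum value beta and the step-size schedule are illustrative assumptions, not choices prescribed by the analysis below.

```python
def psag_lasso(A, b, lam, step_sizes, beta=0.9, seed=2):
    """Proximal stochastic accelerated gradient (5): extrapolate, then one stochastic prox step."""
    n, d = A.shape
    x_prev = np.zeros(d)
    x = np.zeros(d)
    rng = np.random.default_rng(seed)
    for t in step_sizes:
        y = x + beta * (x - x_prev)                  # extrapolation: y_{k+1}
        i = rng.integers(n)                          # i_k uniform on {0,...,n-1}, i.i.d.
        G = (A[i] @ y - b[i]) * A[i]                 # unbiased estimate G_k = grad g_{i_k}(y_{k+1})
        x_prev, x = x, soft_threshold(y - t * G, t * lam)  # prox step of (5)
    return x
```

Setting beta=0.0 recovers the PSGD sketch above, matching the remark that PSAG reduces to PSGD when $\beta_k = 0$.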

The rest of this paper is organized as follows. In Section 2, we recall some definitions and known results for further analysis. Then, we present our convergence analysis of PSAG and its convergence rate in Section 3. Finally, we summarize our findings and draw conclusions in Section 4.

2. Preliminaries

2.1. Notations

The optimal solution of problem (1) is denoted by $x^*$, and the optimal set of problem (1) is non-empty and denoted by $X^*$. The optimal value of problem (1) is denoted by $F^* = F(x^*)$. In the rest of this work, for notational brevity, we write $\|\cdot\|$ for the Euclidean norm $\|\cdot\|_2$.

2.2. Definitions

Definition 2.1. (Convexity). A differentiable function $g : \mathbb{R}^d \to \mathbb{R}$ is said to be convex if, for all $x, y \in \mathbb{R}^d$,

$$g(x) \geq g(y) + \langle \nabla g(y), x - y \rangle. \tag{7}$$

Definition 2.2. (L-smoothness). A function $g : \mathbb{R}^d \to \mathbb{R}$ is said to be $L$-smooth if its gradient $\nabla g(x)$ is Lipschitz continuous with constant $L$, i.e. there exists a constant $L > 0$ such that:

$$\|\nabla g(x) - \nabla g(y)\| \leq L \|x - y\| \quad \text{for all } x, y \in \mathbb{R}^d. \tag{8}$$

It is well-known that Definition 2.2 implies the following inequality (see, e.g. Nesterov [31] (Lemma 1.2.3)):

$$g(x) \leq g(y) + \langle \nabla g(y), x - y \rangle + \frac{L}{2}\|x - y\|^2, \quad \forall x, y \in \mathbb{R}^d. \tag{9}$$

Remark 2.1. Definition 2.2 essentially implies that gradient descent with a sufficiently small step size is well behaved. Moreover, $g(x) = \frac{1}{n}\sum_{i=1}^{n} g_i(x)$ is $L$-smooth with $L = \frac{1}{n}\sum_{i=1}^{n} L_i$, where each $g_i$ is $L_i$-smooth.

Definition 2.3. (Unbiasedness). Given random iterates $\{x_k\}_{k \geq 0}$, the stochastic gradient $\nabla g_{i_k}(x_k)$ (where $i_k$ is sampled uniformly at random from $\{1, 2, \dots, n\}$) is called unbiased if:

$$\mathbb{E}\left[\nabla g_{i_k}(x_k)\right] = \nabla g(x_k) \tag{10}$$

holds.

Definition 2.4. (Prox-grad operator). For the composite optimization problem (1), we write:

$$x_k^{+} = T_{t_k}^{f,g}(x_k),$$

where $T_t^{f,g} : \operatorname{int}(\operatorname{dom}(g)) \to \mathbb{R}^d$ ($t > 0$) is the prox-grad operator defined by:

$$T_t^{f,g}(x) \triangleq \operatorname{prox}_{tf}\left(x - t \nabla g(x)\right). \tag{11}$$

Definition 2.5. (Gradient mapping). Suppose that $f, g$ satisfy the setting of problem (1). Then the gradient mapping is the operator $G_t^{f,g} : \operatorname{int}(\operatorname{dom}(g)) \to \mathbb{R}^d$ defined by:

$$G_t^{f,g}(x) \triangleq \frac{1}{t}\left(x - T_t^{f,g}(x)\right) \tag{12}$$

for any $x \in \operatorname{int}(\operatorname{dom}(g))$.
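In code, both operators are one-liners on top of the PGD sketch from Section 1 (an illustration, specialized to the lasso example and its assumed data A, b):

```python
def prox_grad_operator(x, t, A, b, lam):
    """Prox-grad operator T_t^{f,g}(x) of Definition 2.4, specialized to the lasso example."""
    grad = A.T @ (A @ x - b) / A.shape[0]
    return soft_threshold(x - t * grad, t * lam)

def gradient_mapping(x, t, A, b, lam):
    """Gradient mapping G_t^{f,g}(x) = (x - T_t^{f,g}(x)) / t of Definition 2.5."""
    return (x - prox_grad_operator(x, t, A, b, lam)) / t
```

For convex $f$, a point $x$ is a fixed point of $T_t^{f,g}$, equivalently $G_t^{f,g}(x) = 0$, exactly when $x$ is a stationary point of (1); this is why the analysis below measures progress through $\|G_t(\cdot)\|^2$.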

3. Convergence Analysis

3.1. Lemmas on Supermartingale Convergence Rates

The proof of almost sure convergence relies on the classical Robbins-Siegmund supermartingale convergence result (Theorem 1 in [32]).

Lemma 3.1. (Theorem 1, [32]) Assume that $\{X_k\}$, $\{Y_k\}$ and $\{Z_k\}$ are three non-negative sequences of random variables, $\{\gamma_k\}$ is a non-negative real sequence, and $\mathcal{F}_k$ is a σ-algebra. If $X_k, Y_k, Z_k$ are all $\mathcal{F}_k$-measurable and the following conditions hold:

1) $\mathbb{E}[Y_{k+1} \mid \mathcal{F}_k] \leq (1 + \gamma_k) Y_k - X_k + Z_k$;

2) $\sum_{k=1}^{\infty} \gamma_k < \infty$ and $\sum_{k=1}^{\infty} Z_k < \infty$ almost surely;

then $Y_k$ converges almost surely and $\sum_{k=1}^{\infty} X_k < \infty$ almost surely (a.s.).

Lemma 3.2. (Lemma 3, [23]) Assume that $\{X_k\}$ is a non-negative sequence of random variables and $t_k \geq 0$ is non-increasing. If the following conditions hold:

$$\sum_{k=1}^{\infty} \frac{t_{k+1}}{\sum_{s=1}^{k} t_s} = \infty, \qquad \sum_{k=1}^{\infty} t_{k+1} X_k < \infty \quad a.s., \tag{13}$$

then we have:

$$\min_{1 \leq i \leq k} X_i = o\left(\frac{1}{\sum_{s=1}^{k} t_s}\right) \quad a.s., \tag{14}$$

where $o$ denotes a higher-order infinitesimal, i.e. for two non-negative sequences $\{a_k\}$ and $\{b_k\}$, $a_k = o(b_k)$ if and only if $\lim_{k \to \infty} a_k / b_k = 0$.

3.2. Almost Sure Convergence Rate Analysis for the Proximal Stochastic Accelerated Gradient Method

Reviewing the PSAG iteration (5) and using Definitions 2.4 and 2.5, we rewrite PSAG as follows:

$$y_{k+1} := x_k + \beta (x_k - x_{k-1}),$$

$$x_{k+1} := y_{k+1} - t_k G_{t_k}(y_{k+1}), \tag{15}$$

where $G_{t_k}(y_{k+1}) = \frac{1}{t_k}\left(y_{k+1} - T_{t_k}^{f,g_{i_k}}(y_{k+1})\right)$, $T_{t_k}^{f,g_{i_k}}(y_{k+1}) = \operatorname{prox}_{t_k f}\left(y_{k+1} - t_k \nabla g_{i_k}(y_{k+1})\right)$, $t_k$ is the step size, $\beta = \beta_k \in [0, 1)$, and $i_k$ is picked from $[n] = \{1, 2, \dots, n\}$ independently and identically distributed, each index with probability $\frac{1}{n}$.

Lemma 3.3. Suppose that $u \in \arg\min_{x \in \mathbb{R}^d}\left\{t\langle z, x\rangle + t f(x) + \frac{1}{2}\|x - y\|^2\right\}$ for some $z, y \in \mathbb{R}^d$. If $g(x)$ is $L$-smooth, then for any $x \in \mathbb{R}^d$ we have:

$$F(u) \leq F(x) + \langle \nabla g(y) - z, u - x \rangle + \frac{1}{2t}\|x - y\|^2 - \frac{1}{2t}\|u - y\|^2 + \frac{L}{2}\|u - y\|^2. \tag{16}$$

Proof. By the optimality of $u$, we have, for any $x \in \mathbb{R}^d$:

$$t\langle z, u\rangle + t f(u) + \frac{1}{2}\|u - y\|^2 \leq t\langle z, x\rangle + t f(x) + \frac{1}{2}\|x - y\|^2. \tag{17}$$

Upon rearranging the above inequality (and dividing by $t$), we have:

$$f(u) \leq f(x) + \langle z, x - u\rangle + \frac{1}{2t}\|x - y\|^2 - \frac{1}{2t}\|u - y\|^2. \tag{18}$$

Since $g(x)$ is $L$-smooth, we can apply (9) to obtain:

$$g(u) \leq g(y) + \langle \nabla g(y), u - y\rangle + \frac{L}{2}\|u - y\|^2. \tag{19}$$

Moreover, since $g$ is convex and $L$-smooth (see, e.g., Nesterov [31], Theorem 2.1.5):

$$g(x) \geq g(y) + \langle \nabla g(y), x - y\rangle + \frac{1}{2L}\|\nabla g(x) - \nabla g(y)\|^2. \tag{20}$$

Next, upon combining (19) and (20), we arrive at:

$$\begin{aligned} g(u) &\leq g(y) + \langle \nabla g(y), u - y\rangle + \frac{L}{2}\|u - y\|^2 \\ &\leq g(x) + \langle \nabla g(y), y - x\rangle - \frac{1}{2L}\|\nabla g(x) - \nabla g(y)\|^2 + \langle \nabla g(y), u - y\rangle + \frac{L}{2}\|u - y\|^2 \\ &\leq g(x) + \langle \nabla g(y), u - x\rangle + \frac{L}{2}\|u - y\|^2. \end{aligned} \tag{21}$$

By the fact that $F(u) = g(u) + f(u)$, together with (18) and (21), we finally obtain:

$$\begin{aligned} F(u) &= g(u) + f(u) \\ &\leq g(u) + f(x) + \langle z, x - u\rangle + \frac{1}{2t}\|x - y\|^2 - \frac{1}{2t}\|u - y\|^2 \\ &\leq g(x) + \langle \nabla g(y), u - x\rangle + \frac{L}{2}\|u - y\|^2 + f(x) + \langle z, x - u\rangle + \frac{1}{2t}\|x - y\|^2 - \frac{1}{2t}\|u - y\|^2 \\ &= F(x) + \langle \nabla g(y) - z, u - x\rangle + \frac{1}{2t}\|x - y\|^2 - \frac{1}{2t}\|u - y\|^2 + \frac{L}{2}\|u - y\|^2. \end{aligned} \tag{22}$$

Theorem 3.1. Suppose that $F(x)$ is lower bounded and there exist $\sigma > 0$ and $A_k \geq 0$ such that:

$$\mathbb{E}\left[\|\nabla g(y_{k+1}) - G_k\|^2 \mid \mathcal{F}_k\right] \leq A_k \left(F(x_k) - F^*\right) + \sigma^2, \tag{23}$$

where $G_k = \nabla g_{i_k}(y_{k+1})$ and $i_k$ is picked from $[n] = \{1, 2, \dots, n\}$ independently and identically distributed, each index with probability $\frac{1}{n}$. Suppose further that the PSAG iteration (15) is employed with a non-increasing stepsize $t_k$ such that:

$$\sum_{k=1}^{\infty} t_k^2 < \infty, \quad \sum_{k=1}^{\infty} A_k t_k < \infty, \quad \sum_{k=1}^{\infty} \frac{t_{k+1}}{\sum_{s=1}^{k} t_s} = \infty \quad \text{and} \quad t_k \leq \frac{1}{4L},$$

where $A_k \geq 0$ is defined in (23). Then the sequence $\{F(x_k) - F^*\}$ converges almost surely.

If, in addition, $f : \mathbb{R}^d \to \mathbb{R}$ is convex, we obtain:

$$\min_{1 \leq i \leq k} \|G_{t_i}(y_{i+1})\|^2 = o\left(\frac{1}{\sum_{s=1}^{k} t_s}\right) \quad a.s. \tag{24}$$
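For concreteness (an illustration we add here; Corollary 3.1 below treats a specific schedule of this type), the polynomially decaying stepsize $t_k = \frac{1}{4L} k^{-(1/2+\varepsilon)}$ with $\varepsilon \in (0, \frac{1}{2})$ satisfies all four conditions whenever $A_k \leq C t_k$ for some constant $C > 0$:

$$\sum_{k=1}^{\infty} t_k^2 = \frac{1}{16L^2}\sum_{k=1}^{\infty} k^{-(1+2\varepsilon)} < \infty, \qquad \sum_{k=1}^{\infty} A_k t_k \leq C \sum_{k=1}^{\infty} t_k^2 < \infty,$$

and since $\sum_{s=1}^{k} t_s \asymp k^{1/2-\varepsilon}$ while $t_{k+1} \asymp k^{-(1/2+\varepsilon)}$, the ratio $t_{k+1} / \sum_{s=1}^{k} t_s \asymp 1/k$, whose sum diverges; finally, $t_k \leq \frac{1}{4L}$ holds since $k^{-(1/2+\varepsilon)} \leq 1$.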

Proof. We begin with the conclusion (16) of Lemma 3.3. First, by the definition of $x_k^{+}$ in Definition 2.4, let $u = x_k^{+}$ and set $t = \frac{t_k}{2}$, $z = \nabla g(x_k)$ and $y = x_k$, which meets the condition $u \in \arg\min_{x \in \mathbb{R}^d}\left\{t\langle z, x\rangle + t f(x) + \frac{1}{2}\|x - y\|^2\right\}$, i.e.

$$x_k^{+} \in \arg\min_{x \in \mathbb{R}^d}\left\{\frac{t_k}{2}\langle \nabla g(x_k), x\rangle + \frac{t_k}{2} f(x) + \frac{1}{2}\|x - x_k\|^2\right\}. \tag{25}$$

From the update in (2), the iterate $x_k \in \mathbb{R}^d$ for any $k \geq 1$. Upon applying Lemma 3.3 with $u = x_k^{+}$, $t = \frac{t_k}{2}$, $z = \nabla g(x_k)$ and $y = x_k$, for $x = x_k \in \mathbb{R}^d$, we have:

$$\begin{aligned} F(x_k^{+}) &\leq F(x_k) + \langle \nabla g(x_k) - \nabla g(x_k), x_k^{+} - x_k\rangle + \frac{1}{t_k}\|x_k - x_k\|^2 - \frac{1}{t_k}\|x_k^{+} - x_k\|^2 + \frac{L}{2}\|x_k^{+} - x_k\|^2 \\ &= F(x_k) + \left(\frac{L}{2} - \frac{1}{t_k}\right)\|x_k^{+} - x_k\|^2. \end{aligned} \tag{26}$$

Next, by the update rule (6) of the PSAG algorithm, we choose $u = x_{k+1}$ and set $t = t_k$, $z = G_k = \nabla g_{i_k}(y_{k+1})$ and $y = y_{k+1}$, which meets the condition $u \in \arg\min_{x \in \mathbb{R}^d}\left\{t\langle z, x\rangle + t f(x) + \frac{1}{2}\|x - y\|^2\right\}$, i.e.

$$x_{k+1} \in \arg\min_{x \in \mathbb{R}^d}\left\{t_k\langle G_k, x\rangle + t_k f(x) + \frac{1}{2}\|x - y_{k+1}\|^2\right\}. \tag{27}$$

Similarly, from the definition of $x_k^{+}$ in (25), the sequence $x_k^{+} \in \mathbb{R}^d$ for any $k \geq 1$. Now, upon applying Lemma 3.3 with $u = x_{k+1}$, $t = t_k$, $z = G_k$, and $y = y_{k+1}$, for $x = x_k^{+} \in \mathbb{R}^d$, we obtain:

$$\begin{aligned} F(x_{k+1}) &\leq F(x_k^{+}) + \langle \nabla g(y_{k+1}) - G_k, x_{k+1} - x_k^{+}\rangle + \frac{1}{2t_k}\|x_k^{+} - y_{k+1}\|^2 - \frac{1}{2t_k}\|x_{k+1} - y_{k+1}\|^2 + \frac{L}{2}\|x_{k+1} - y_{k+1}\|^2 \\ &= F(x_k^{+}) + \left(\frac{L}{2} - \frac{1}{2t_k}\right)\|x_{k+1} - y_{k+1}\|^2 + \frac{1}{2t_k}\|y_{k+1} - x_k^{+}\|^2 + \langle \nabla g(y_{k+1}) - G_k, x_{k+1} - x_k^{+}\rangle. \end{aligned} \tag{28}$$

Now, we consider the term $\langle \nabla g(y_{k+1}) - G_k, x_{k+1} - x_k^{+}\rangle$:

$$\begin{aligned} \langle \nabla g(y_{k+1}) - G_k, x_{k+1} - x_k^{+}\rangle &\leq \|\nabla g(y_{k+1}) - G_k\| \, \|x_{k+1} - x_k^{+}\| \\ &\leq 2 t_k \|\nabla g(y_{k+1}) - G_k\|^2 + \frac{1}{8 t_k}\|x_{k+1} - x_k^{+}\|^2 \\ &= 2 t_k \|\nabla g(y_{k+1}) - G_k\|^2 + \frac{1}{8 t_k}\|x_{k+1} - y_{k+1} + y_{k+1} - x_k^{+}\|^2 \\ &\leq 2 t_k \|\nabla g(y_{k+1}) - G_k\|^2 + \frac{1}{4 t_k}\|x_{k+1} - y_{k+1}\|^2 + \frac{1}{4 t_k}\|y_{k+1} - x_k^{+}\|^2, \end{aligned} \tag{29}$$

where we use the Cauchy-Schwarz inequality $|\langle a, b\rangle| \leq \|a\| \|b\|$ in the first inequality; the second inequality follows from $ab \leq \frac{1}{2}(a^2 + b^2)$ with $a = 2\sqrt{t_k}\,\|\nabla g(y_{k+1}) - G_k\|$ and $b = \frac{1}{2\sqrt{t_k}}\|x_{k+1} - x_k^{+}\|$; and the last inequality holds by $\|a + b\|^2 \leq 2\|a\|^2 + 2\|b\|^2$ with $a = x_{k+1} - y_{k+1}$ and $b = y_{k+1} - x_k^{+}$.

Upon substituting (29) back into (28), we further have:

$$\begin{aligned} F(x_{k+1}) &\leq F(x_k^{+}) + \left(\frac{L}{2} - \frac{1}{2t_k}\right)\|x_{k+1} - y_{k+1}\|^2 + \frac{1}{2t_k}\|y_{k+1} - x_k^{+}\|^2 + \langle \nabla g(y_{k+1}) - G_k, x_{k+1} - x_k^{+}\rangle \\ &\leq F(x_k^{+}) + \left(\frac{L}{2} - \frac{1}{2t_k}\right)\|x_{k+1} - y_{k+1}\|^2 + \frac{1}{2t_k}\|y_{k+1} - x_k^{+}\|^2 + 2t_k\|\nabla g(y_{k+1}) - G_k\|^2 \\ &\qquad + \frac{1}{4t_k}\|x_{k+1} - y_{k+1}\|^2 + \frac{1}{4t_k}\|y_{k+1} - x_k^{+}\|^2 \\ &= F(x_k^{+}) + \left(\frac{L}{2} - \frac{1}{4t_k}\right)\|x_{k+1} - y_{k+1}\|^2 + \frac{3}{4t_k}\|y_{k+1} - x_k^{+}\|^2 + 2t_k\|\nabla g(y_{k+1}) - G_k\|^2 \\ &\leq F(x_k) + \left(\frac{L}{2} - \frac{1}{t_k}\right)\|x_k^{+} - x_k\|^2 + 2t_k\|\nabla g(y_{k+1}) - G_k\|^2 + \left(\frac{L}{2} - \frac{1}{4t_k}\right)\|x_{k+1} - y_{k+1}\|^2 + \frac{3}{4t_k}\|y_{k+1} - x_k^{+}\|^2 \\ &\leq F(x_k) + 2t_k\|\nabla g(y_{k+1}) - G_k\|^2 - \frac{1}{8t_k}\|x_{k+1} - y_{k+1}\|^2 + \frac{3}{4t_k}\|y_{k+1} - x_k^{+}\|^2, \end{aligned} \tag{30}$$

where the fourth inequality uses (26), and the last inequality holds by the stepsize condition $t_k \leq \frac{1}{4L}$, i.e. $\frac{L}{2} - \frac{1}{4t_k} \leq -\frac{1}{8t_k}$ (which also gives $\frac{L}{2} - \frac{1}{t_k} \leq 0$, so that term can be dropped).

Next, we consider the term $\frac{3}{4t_k}\|y_{k+1} - x_k^{+}\|^2$. Combining Definition 2.4 with (5) and (15), we have:

$$x_k^{+} = T_{t_k}^{f,g_{i_k}}(y_{k+1}) = \operatorname{prox}_{t_k f}\left(y_{k+1} - t_k \nabla g_{i_k}(y_{k+1})\right),$$

and

$$G_{t_k}(y_{k+1}) = \frac{1}{t_k}\left(y_{k+1} - T_{t_k}^{f,g_{i_k}}(y_{k+1})\right) = \frac{1}{t_k}\left(y_{k+1} - x_k^{+}\right).$$

Due to the convexity of $f$ (which guarantees the continuity of the prox-grad operator; see Remark 3.2), there exists $M > 0$ such that:

$$\|G_{t_k}(y_{k+1})\| < M.$$

Therefore,

$$\frac{3}{4t_k}\|y_{k+1} - x_k^{+}\|^2 = \frac{3t_k}{4}\|G_{t_k}(y_{k+1})\|^2 \leq \frac{3M^2}{4} t_k. \tag{31}$$

Then, by (15), we also have:

$$\frac{1}{8t_k}\|x_{k+1} - y_{k+1}\|^2 = \frac{t_k}{8}\|G_{t_k}(y_{k+1})\|^2. \tag{32}$$

Upon combining (31) and (32) with (30), we have:

$$F(x_{k+1}) \leq F(x_k) - \frac{t_k}{8}\|G_{t_k}(y_{k+1})\|^2 + 2t_k\|\nabla g(y_{k+1}) - G_k\|^2 + \frac{3M^2}{4} t_k. \tag{33}$$

Now, taking the conditional expectation of (33) with respect to $\mathcal{F}_k$ gives:

$$\begin{aligned} \mathbb{E}[F(x_{k+1}) \mid \mathcal{F}_k] &\leq \mathbb{E}[F(x_k) \mid \mathcal{F}_k] - \frac{t_k}{8}\mathbb{E}\left[\|G_{t_k}(y_{k+1})\|^2 \mid \mathcal{F}_k\right] + 2t_k\mathbb{E}\left[\|\nabla g(y_{k+1}) - G_k\|^2 \mid \mathcal{F}_k\right] + \frac{3M^2}{4}t_k \\ &= F(x_k) - \frac{t_k}{8}\|G_{t_k}(y_{k+1})\|^2 + 2t_k\mathbb{E}\left[\|\nabla g(y_{k+1}) - G_k\|^2 \mid \mathcal{F}_k\right] + \frac{3M^2}{4}t_k \\ &\leq F(x_k) - \frac{t_k}{8}\|G_{t_k}(y_{k+1})\|^2 + 2t_k\left(A_k(F(x_k) - F^*) + \sigma^2\right) + \frac{3M^2}{4}t_k \\ &= F(x_k) - \frac{t_k}{8}\|G_{t_k}(y_{k+1})\|^2 + 2A_k t_k (F(x_k) - F^*) + \left(2\sigma^2 + \frac{3M^2}{4}\right)t_k, \end{aligned} \tag{34}$$

where the first equality holds by the fact that $x_k$ is $\mathcal{F}_k$-measurable, i.e. $\mathbb{E}[F(x_k) \mid \mathcal{F}_k] = F(x_k)$ and $\mathbb{E}\left[\|G_{t_k}(y_{k+1})\|^2 \mid \mathcal{F}_k\right] = \|G_{t_k}(y_{k+1})\|^2$, and the last inequality follows from assumption (23).

Upon subtracting $F^*$ from both sides of (34), we have:

$$\begin{aligned} \mathbb{E}[F(x_{k+1}) - F^* \mid \mathcal{F}_k] &\leq F(x_k) - F^* - \frac{t_k}{8}\|G_{t_k}(y_{k+1})\|^2 + 2A_k t_k (F(x_k) - F^*) + \left(2\sigma^2 + \frac{3M^2}{4}\right)t_k \\ &= (1 + 2A_k t_k)(F(x_k) - F^*) - \frac{t_k}{8}\|G_{t_k}(y_{k+1})\|^2 + \left(2\sigma^2 + \frac{3M^2}{4}\right)t_k. \end{aligned} \tag{35}$$

Upon multiplying (35) by $t_{k+1}$, we obtain:

$$\begin{aligned} \mathbb{E}\left[t_{k+1}(F(x_{k+1}) - F^*) \mid \mathcal{F}_k\right] &\leq (1 + 2A_k t_k)\, t_{k+1}(F(x_k) - F^*) - \frac{t_k t_{k+1}}{8}\|G_{t_k}(y_{k+1})\|^2 + \left(2\sigma^2 + \frac{3M^2}{4}\right)t_k t_{k+1} \\ &\leq (1 + 2A_k t_k)\, t_k (F(x_k) - F^*) - \frac{t_k t_{k+1}}{8}\|G_{t_k}(y_{k+1})\|^2 + \left(2\sigma^2 + \frac{3M^2}{4}\right)t_k^2, \end{aligned} \tag{36}$$

where the last inequality follows from the non-increasing behaviour of the stepsize, i.e. $t_{k+1} \leq t_k$.

Finally, let $Y_k = t_k(F(x_k) - F^*)$, $\gamma_k = 2A_k t_k$, $X_k = \frac{t_k}{8}\|G_{t_k}(y_{k+1})\|^2$ and $Z_k = \left(2\sigma^2 + \frac{3M^2}{4}\right)t_k^2$; then (36) becomes:

$$\mathbb{E}[Y_{k+1} \mid \mathcal{F}_k] \leq (1 + \gamma_k) Y_k - t_{k+1} X_k + Z_k.$$

Recalling the stepsize conditions $\sum_{k=1}^{\infty} t_k^2 < \infty$ and $\sum_{k=1}^{\infty} A_k t_k < \infty$, we know that $\sum_{k=1}^{\infty} Z_k < \infty$ and $\sum_{k=1}^{\infty} \gamma_k < \infty$. Thus, by Lemma 3.1, we have:

$$\sum_{k=1}^{\infty} t_{k+1} X_k < \infty \quad a.s. \quad \text{and} \quad \{F(x_k) - F^*\} \text{ converges} \quad a.s.$$

Finally, with the condition $\sum_{k=1}^{\infty} \frac{t_{k+1}}{\sum_{s=1}^{k} t_s} = \infty$, Lemma 3.2 yields:

$$\min_{1 \leq r \leq k} t_r \|G_{t_r}(y_{r+1})\|^2 = o\left(\frac{1}{\sum_{s=1}^{k} t_s}\right) \quad a.s. \tag{37}$$

This completes the proof.

Remark 3.1. Inequality (23) is a new variance assumption on the stochastic gradient $G_k$; it is weaker than the bounded variance assumption (i.e. (23) with $A_k = 0$) used in [33].

Remark 3.2. It is worth noting that we obtain the boundedness of the gradient mapping $G_{t_k}(y_{k+1})$ from the continuity of the prox-grad operator $T_{t_k}^{f,g_{i_k}}(y_{k+1})$, which is in turn deduced from the convexity of f.

Corollary 3.1. Following the setting of Theorem 3.1 and choosing the stepsize $t_k = \frac{\beta}{1 + \gamma\beta k^{1/2+\varepsilon}}$ for any $\gamma, \beta > 0$, where $\varepsilon \in \left(0, \frac{1}{2}\right)$, gives:

$$\min_{1 \leq r \leq k} \|G_{t_r}(y_{r+1})\|^2 = o(1) \quad a.s. \tag{38}$$

Proof. Stepsizes satisfying $t_k = \frac{\beta}{1 + \gamma\beta k^{1/2+\varepsilon}}$ have been studied in [23]. It follows that:

$$1 + \gamma\beta k^{1/2+\varepsilon} \leq k^{1/2+\varepsilon} + \gamma\beta k^{1/2+\varepsilon} = (1 + \gamma\beta) k^{1/2+\varepsilon}, \tag{39}$$

which implies that $t_k \geq \frac{\beta}{(1+\gamma\beta) k^{1/2+\varepsilon}}$. By the integral test inequality, we have:

$$\sum_{s=1}^{k} t_s \geq \sum_{s=1}^{k} \frac{\beta}{(1+\gamma\beta)\, s^{1/2+\varepsilon}} \geq \int_{1}^{k} \frac{\beta}{(1+\gamma\beta)\, x^{1/2+\varepsilon}} \, dx = \frac{\beta\left(k^{1/2-\varepsilon} - 1\right)}{(1+\gamma\beta)\left(\frac{1}{2} - \varepsilon\right)} \geq \frac{\beta\, k^{-1/2-\varepsilon}(k-1)}{1+\gamma\beta}, \tag{40}$$

where the last inequality follows from the concavity of $h(x) = x^{1/2-\varepsilon}$, so that:

$$h(y) \leq h(x) + h'(x)(y - x).$$

In other words, taking $y = 1$ and $x = k$, we get $k^{1/2-\varepsilon} - 1 \geq \left(\frac{1}{2} - \varepsilon\right) k^{-1/2-\varepsilon}(k - 1)$.

Next, combining (39) and (40), we can obtain:

$$\left(\sum_{s=1}^{k} t_s\right) t_k \geq \frac{\beta\, k^{-1/2-\varepsilon}(k-1)}{1+\gamma\beta} \cdot \frac{\beta}{(1+\gamma\beta)\, k^{1/2+\varepsilon}} = \frac{\beta^2\, k^{-1-2\varepsilon}(k-1)}{(1+\gamma\beta)^2}. \tag{41}$$

Upon setting $G_r = \|G_{t_r}(y_{r+1})\|^2$ and applying the inequality $\min_{1 \leq r \leq k} t_r G_r \geq t_k \min_{1 \leq r \leq k} G_r$ (since $t_k$ is non-increasing), we have:

$$\left(\sum_{s=1}^{k} t_s\right)\min_{1 \leq r \leq k} t_r G_r \geq \left(\sum_{s=1}^{k} t_s\right) t_k \min_{1 \leq r \leq k} G_r \geq \frac{\beta^2\, k^{-1-2\varepsilon}(k-1)}{(1+\gamma\beta)^2}\min_{1 \leq r \leq k} G_r \geq 0. \tag{42}$$

For the left-hand side of (42), we apply Theorem 3.1 (more precisely, (37)) to get:

$$\lim_{k \to \infty}\left(\sum_{s=1}^{k} t_s\right)\min_{1 \leq r \leq k} t_r G_r = 0 \quad a.s.$$

The application of the squeeze theorem in conjunction with (42) gives:

$$\lim_{k \to \infty} k^{-1-2\varepsilon}(k-1)\min_{1 \leq r \leq k} G_r = 0 \quad a.s.$$

As $\varepsilon \to 0$, we have $\frac{k-1}{k^{1+2\varepsilon}} \to 1$, which implies that:

$$\lim_{k \to \infty}\min_{1 \leq r \leq k} G_r = 0 \quad a.s.$$

Therefore, we finally obtain:

$$\min_{1 \leq r \leq k} G_r = \min_{1 \leq r \leq k}\|G_{t_r}(y_{r+1})\|^2 = o(1) \quad a.s.$$

This completes the proof.

Remark 3.3. Corollary 3.1 indicates that $\min_{1 \leq r \leq k}\|G_{t_r}(y_{r+1})\|^2$ becomes arbitrarily close to 0 with probability one.
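As an illustrative numerical check (ours, not part of the paper), the following sketch runs the PSAG iteration (15) on the lasso example above with the stepsize of Corollary 3.1 and tracks the running minimum of the squared gradient mapping norm; all parameter values are assumptions chosen for the demo, and beta_s renames the corollary's $\beta$ to avoid clashing with the momentum parameter.

```python
# Stepsize of Corollary 3.1: t_k = beta_s / (1 + gamma * beta_s * k**(1/2 + eps))
beta_s, gamma, eps, lam, momentum = 0.1, 1.0, 0.25, 0.1, 0.9
K = 20000
ts = [beta_s / (1.0 + gamma * beta_s * (k + 1) ** (0.5 + eps)) for k in range(K)]

x_prev = np.zeros(A.shape[1])
x = np.zeros(A.shape[1])
rng = np.random.default_rng(3)
running_min = np.inf
for t in ts:
    y = x + momentum * (x - x_prev)                  # y_{k+1}
    i = rng.integers(A.shape[0])                     # i_k
    G = (A[i] @ y - b[i]) * A[i]                     # G_k = grad g_{i_k}(y_{k+1})
    x_next = soft_threshold(y - t * G, t * lam)      # x_{k+1} = T_{t_k}^{f,g_{i_k}}(y_{k+1})
    gm_sq = np.linalg.norm((y - x_next) / t) ** 2    # ||G_{t_k}(y_{k+1})||^2, per (15)
    running_min = min(running_min, gm_sq)
    x_prev, x = x, x_next
print(f"running min of squared gradient mapping norm after {K} steps: {running_min:.3e}")
```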

4. Conclusion

This paper presents PSAG with unbiased gradient estimation and analyzes its almost sure convergence for solving composite optimization problems, wherein the objective function comprises a smooth component and a non-smooth component. By leveraging certain key assumptions, we have established the almost sure convergence of PSAG with unbiased gradient estimation. Furthermore, we have demonstrated that the minimum of the squared gradient mapping norm approaches zero arbitrarily closely with probability one.

Funding

This paper was supported by the Natural Science Foundation of China (Grant No. 11801455), the Sichuan Science and Technology Program (Grant No. 2023NSFSC1922), the Innovation Team Funds of China West Normal University (Grant No. KCXTD2023-3), and the Fundamental Research Funds of China West Normal University (Grant No. 23kc010).

Conflicts of Interest

The authors declare no conflicts of interest regarding the publication of this paper.

References

[1] Wright, S.J., Nowak, R.D. and Figueiredo, M.A.T. (2009) Sparse Reconstruction by Separable Approximation. IEEE Transactions on Signal Processing, 57, 2479-2493.
https://doi.org/10.1109/TSP.2009.2016892
[2] Bottou, L., Curtis, F.E. and Nocedal, J. (2018) Optimization Methods for Large-Scale Machine Learning. SIAM Review, 60, 223-311.
https://doi.org/10.1137/16M1080173
[3] Mandic, D. and Chambers, J. (2001) Recurrent Neural Networks for Prediction: Learning Algorithms, Architectures and Stability. Wiley, New York.
https://doi.org/10.1002/047084535X
[4] Shalev-Shwartz, S. and Ben-David, S. (2014) Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, New York.
https://doi.org/10.1017/CBO9781107298019
[5] Sun, R.Y. (2020) Optimization for Deep Learning: An Overview. Journal of the Operations Research Society of China, 8, 249-294.
https://doi.org/10.1007/s40305-020-00309-6
[6] Robbins, H. and Monro, S. (1951) A Stochastic Approximation Method. Annals of Mathematical Statistics, 22, 400-407.
https://doi.org/10.1214/aoms/1177729586
[7] Beck, A. (2017) First-Order Methods in Optimization. Society for Industrial and Applied Mathematics, Philadelphia.
https://doi.org/10.1137/1.9781611974997
[8] Nesterov, Y. (2013) Gradient Methods for Minimizing Composite Functions. Mathematical Programming, 140, 125-161.
https://doi.org/10.1007/s10107-012-0629-5
[9] Beck, A. and Teboulle, M. (2009) A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems. SIAM Journal on Imaging Sciences, 2, 183-202.
https://doi.org/10.1137/080716542
[10] Vandenberghe, L. (2010) Fast Proximal Gradient Methods.
http://faculty.bicmr.pku.edu.cn/~wenzw/courses/fgrad.pdf
[11] Becker, S., Candès, E. and Grant, M. (2011) Templates for Convex Cone Problems with Applications to Sparse Signal Recovery. Mathematical Programming Computation, 3, 165-218.
https://doi.org/10.1007/s12532-011-0029-5
[12] Li, H. and Lin, Z.C. (2015) Accelerated Proximal Gradient Methods for Nonconvex Programming. The 28th International Conference on Neural Information Processing Systems, Montreal, 7-12 December 2015, 379-387.
[13] Ghadimi, S., Lan, G. and Zhang, H. (2016) Mini-Batch Stochastic Approximation Methods for Nonconvex Stochastic Composite Optimization. Mathematical Programming, 155, 267-305.
https://doi.org/10.1007/s10107-014-0846-1
[14] Reddi, S.J., Sra, S., Póczos, B. and Smola, A.J. (2016) Proximal Stochastic Methods for Nonsmooth Nonconvex Finite-Sum Optimization. Advances in Neural Information Processing Systems, 29, 1145-1153.
[15] Metel, M.R. and Takeda, A. (2019) Stochastic Proximal Methods for Non-Smooth Non-Convex Constrained Sparse Optimization. Journal of Machine Learning Research, 22, 1-36.
[16] Kawashima, T. and Fujisawa, H. (2018) Stochastic Gradient Descent for Stochastic Doubly-Nonconvex Composite Optimization. arXiv: 1805.07960.
[17] Gorbunov, E., Hanzely, F. and Richtárik, P. (2020) A Unified Theory of SGD: Variance Reduction, Sampling, Quantization and Coordinate Descent. arXiv: 1905.11261.
https://arxiv.org/abs/1905.11261
[18] Cevher, V. and Vũ, B.C. (2019) On the Linear Convergence of the Stochastic Gradient Method with Constant Step-Size. Optimization Letters, 13, 1177-1187.
https://doi.org/10.1007/s11590-018-1331-1
[19] Polyak, B.T. (1964) Some Methods of Speeding up the Convergence of Iteration Methods. USSR Computational Mathematics and Mathematical Physics, 4, 1-17.
https://doi.org/10.1016/0041-5553(64)90137-5
[20] Biscarat, C.J. (1994) Almost Sure Convergence of a Class of Stochastic Algorithms. Stochastic Processes and their Applications, 50, 83-99.
https://doi.org/10.1016/0304-4149(94)90149-X
[21] Sebbouh, O., Gower, M.R. and Defazio, A. (2021) Almost Sure Convergence Rates for Stochastic Gradient Descent and Stochastic Heavy Ball. Proceedings of Machine Learning Research, 134, 1-37.
[22] Orabona, F. (2020) Almost Sure Convergence of SGD on Smooth Nonconvex Functions.
http://parameterfree.com
https://parameterfree.com/2020/10/05/almost-sure-convergence-of-sgd-on-smooth-non-convex-functions/
[23] Liu, J. and Yuan, Y. (2022) On Almost Sure Convergence Rates of Stochastic Gradient Methods. arXiv: 2202.04295.
https://arxiv.org/abs/2202.04295
[24] Liang, Y.Q., Xu, D.P., Zhang, N.M. and Mandic, D.P. (2023) Almost Sure Convergence of Stochastic Composite Objective Mirror Descent for Non-Convex Non-Smooth Optimization. Optimization Letters.
https://doi.org/10.1007/s11590-023-01972-3
[25] Khaled, A. and Richtárik, P. (2020) Better Theory for SGD in the Nonconvex World. arXiv: 2002.03329.
[26] Gower, R., Sebbouh, O. and Loizou, N. (2021) SGD for Structured Nonconvex Functions: Learning Rates, Mini-Batching and Interpolation. Proceedings of the 24th International Conference on Artificial Intelligence and Statistics, Vol. 130, 13-15 April 2021, 1315-1323.
[27] Mertikopoulos, P., Hallak, N., Kavis, A. and Cevher, V. (2020) On the Almost Sure Convergence of Stochastic Gradient Descent in Non-Convex Problems. Optimization and Control, 33, 1117-1128.
https://arxiv.org/abs/2006.11144
[28] Ward, R., Wu, X. and Bottou, L. (2020) AdaGrad Stepsizes: Sharp Convergence over Nonconvex Landscapes. Journal of Machine Learning Research, 21, 9047-9076.
[29] Alacaoglu, A., Malitsky, Y. and Cevher, V. (2020) Convergence of Adaptive Algorithms for Weakly Convex Constrained Optimization. arXiv: 2006.06650.
[30] Davis, D. and Drusvyatskiy, D. (2019) Stochastic Model-Based Minimization of Weakly Convex Functions. SIAM Journal on Optimization, 29, 207-239.
https://doi.org/10.1137/18M1178244
[31] Nesterov, Y. (2003) Introductory Lectures on Convex Optimization: A Basic Course. Springer Science & Business Media, New York.
https://doi.org/10.1007/978-1-4419-8853-9
[32] Robbins, H. and Siegmund, D. (1971) A Convergence Theorem for Non-Negative Almost Supermartingales and Some Applications. In: Rustagi, J.S., Ed., Optimizing Methods in Statistics, Academic Press, Cambridge, 233-257.
https://doi.org/10.1016/B978-0-12-604550-5.50015-8
[33] Nesterov, Y. (2018) Smooth Convex Optimization. Lectures on Convex Optimization, 137, 59-137.
https://doi.org/10.1007/978-3-319-91578-4_2
