Almost Sure Convergence of Proximal Stochastic Accelerated Gradient Methods
1. Introduction
We consider the following composite optimization problem:

$$\min_{x \in \mathbb{R}^d} F(x) := g(x) + f(x), \quad (1)$$

where $g(x) = \frac{1}{n}\sum_{i=1}^{n} g_i(x)$ is the average of the smooth functions $g_1, \dots, g_n$, and $f$ is a closed proper function that can be non-differentiable. One of the most well-studied instances of this type of problem is $\ell_1$-regularized least squares [1]:

$$\min_{x \in \mathbb{R}^d} \frac{1}{2}\|Ax - b\|^2 + \lambda \|x\|_1,$$

where $\|\cdot\|_1$ denotes the standard $\ell_1$-norm.
We frequently encounter optimization problems of this nature in various fields such as machine learning, statistics, signal processing, and imaging [2] [3] [4] [5]. Specifically, we address the task of minimizing the sum of two functions: one is the average of numerous smooth component functions, while the other is a general function amenable to a straightforward proximal mapping. We require the problem to be well-defined, i.e. the optimal value $F^\ast$ is finite, and each $g_i$ to be bounded from below. Compared with the classical gradient descent (GD) method and stochastic gradient descent (SGD) method [6], the proximal gradient descent (PGD) method has a relatively limited application scope: it is primarily employed for objective functions that include non-differentiable components. To tackle the non-smooth optimization problem (1), we introduce PGD, delineated by the following update rule for $k \ge 0$:

$$x_{k+1} = \operatorname{prox}_{\alpha_k f}\big(x_k - \alpha_k \nabla g(x_k)\big), \quad (2)$$

where $\alpha_k$ is the step size at the $k$-th iteration and the proximity operator of $f$ is defined by:

$$\operatorname{prox}_{f}(x) := \arg\min_{u \in \mathbb{R}^d} \Big\{ f(u) + \frac{1}{2}\|u - x\|^2 \Big\}.$$
We note that $\operatorname{prox}_f$ maps to a singleton since $f$ is proper, closed, and convex; see for example (Beck, Theorem 6.3, 2017 [7]). A key attribute of PGD is that the proximal operator no longer involves $g$, only $f$. That is, the objective $F$ may originally combine a very complex differentiable function $g$ with a less complex non-differentiable function $f$; with this method, we do not need to worry about $g$ (because it is differentiable and its gradient is easy to calculate), and we only have to handle the proximal operator of the non-differentiable function $f$, which greatly simplifies the problem. PGD can achieve an error level on the objective function of $O(1/k)$ after $k$ iterations [8].
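As a concrete illustration, update (2) can be sketched in a few lines for the $\ell_1$-regularized least-squares instance above, where the proximal operator of $\lambda\|\cdot\|_1$ is the well-known soft-thresholding map. This is a minimal sketch, writing `g` for the smooth least-squares term and `f` for the $\ell_1$ term; the problem data, step size $1/L$, and iteration count are illustrative choices, not taken from the paper:

```python
import numpy as np

def soft_threshold(x, t):
    # Proximal operator of t * ||.||_1: componentwise shrinkage toward zero.
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def pgd(A, b, lam, alpha, iters=500):
    # PGD update (2) for g(x) = 0.5*||Ax - b||^2, f(x) = lam*||x||_1:
    # x_{k+1} = prox_{alpha f}(x_k - alpha * grad g(x_k)).
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        grad = A.T @ (A @ x - b)          # gradient of the smooth part g
        x = soft_threshold(x - alpha * grad, alpha * lam)
    return x

# Small illustrative problem with a sparse ground truth.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
x_true = np.array([1.0, 0.0, -2.0, 0.0, 0.5])
b = A @ x_true
L = np.linalg.norm(A.T @ A, 2)            # Lipschitz constant of grad g
x_hat = pgd(A, b, lam=0.1, alpha=1.0 / L)
```

With the fixed step size $1/L$, the iterates decrease the composite objective and approach the (slightly biased) $\ell_1$-regularized solution.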
1.1. Related Work
1.1.1. Accelerated Proximal Gradient (APG) Method
Another effective method for solving problem (1) is the accelerated proximal gradient (APG) method, initially proposed by Nesterov [8] for minimizing smooth convex functions with constraints. It was later extended by Beck and Teboulle [9] to composite convex objective functions. The Fast Iterative Shrinkage-Thresholding Algorithm (FISTA), an accelerated version of the proximal gradient method, has found applications in various fields, including image and signal processing. APG represents an accelerated variant of the deterministic gradient descent method, incorporating an extrapolation step in the algorithm. One simple version involves selecting an initial point $x_0 = x_{-1} \in \mathbb{R}^d$ and repeating for $k \ge 0$:

$$y_k = x_k + \beta_k (x_k - x_{k-1}), \qquad x_{k+1} = \operatorname{prox}_{\alpha_k f}\big(y_k - \alpha_k \nabla g(y_k)\big),$$

where $\beta_k \in [0, 1)$ is an extrapolation parameter and $\alpha_k$ is the usual step size. These parameters must be chosen in specific ways to achieve the accelerated convergence. One simple choice takes $\beta_k = \frac{k}{k+3}$ [10]. It remains to choose the step sizes $\alpha_k$. When $\nabla g$ is Lipschitz continuous with constant $L$, this method can be shown to converge in objective value at the rate $O(1/k^2)$ when a fixed step size $\alpha_k = 1/L$ is used [8] [9]. Following Nesterov, this method is called an accelerated or optimal first-order method, and there are several versions of such methods, such as Nesterov [8]; the software package TFOCS [11] is based on such methods and contains several implementations. Li et al. [12] were the first to provide APG-type algorithms for general non-convex and non-smooth problems, ensuring that every accumulation point is a critical point.
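A minimal sketch of this extrapolation scheme, again on an $\ell_1$-regularized least-squares instance with the simple choice $\beta_k = k/(k+3)$ and fixed step size $1/L$; the data and iteration count are illustrative assumptions, not prescriptions from the references:

```python
import numpy as np

def soft_threshold(x, t):
    # Proximal operator of t * ||.||_1.
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def apg(A, b, lam, alpha, iters=300):
    # Accelerated proximal gradient: extrapolate, then prox-gradient step.
    x_prev = np.zeros(A.shape[1])
    x = x_prev.copy()
    for k in range(iters):
        beta = k / (k + 3.0)                 # extrapolation parameter beta_k
        y = x + beta * (x - x_prev)          # y_k = x_k + beta_k (x_k - x_{k-1})
        grad = A.T @ (A @ y - b)             # grad g evaluated at the extrapolated point
        x_prev, x = x, soft_threshold(y - alpha * grad, alpha * lam)
    return x

rng = np.random.default_rng(1)
A = rng.standard_normal((30, 8))
b = rng.standard_normal(30)
alpha = 1.0 / np.linalg.norm(A.T @ A, 2)     # fixed step size 1/L
x_apg = apg(A, b, lam=0.5, alpha=alpha)
```

The only change relative to plain PGD is the extrapolated point $y_k$ at which the gradient is evaluated; this one extra line is what yields the $O(1/k^2)$ rate in the convex case.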
1.1.2. Proximal Stochastic Gradient Descent (PSGD) Method
With the emergence of big data, the efficiency of deterministic optimization algorithms has gradually become a bottleneck. In the PGD (2), we need to calculate the full gradient $\nabla g$. When the size $n$ of the dataset is very large, this calculation is costly, and first-order stochastic gradient methods have proven to be very effective thanks to their low iteration complexity. Therefore, one way to reduce the calculation is to use stochastic algorithms that take advantage of the finite-sum structure of problem (1), using cheaper iterations while preserving fast convergence. When $f \equiv 0$, problem (1) reduces to the general minimization problem $\min_{x} g(x)$, where $g$ arises as the average of a very large number of smooth functions. This problem often arises as an approximation of the stochastic optimization loss function $g(x) = \mathbb{E}_{\xi}\left[ g(x; \xi) \right]$, where $\xi$ is a random variable and $g(\cdot\,; \xi)$ is smooth for all $\xi$. First-order stochastic methods for the case of a non-smooth regularizer $f$ are an active research area. Non-asymptotic convergence results were first achieved in [13]. For finite-sum problems, Reddi et al. [14] were the first to develop a proximal stochastic variance-reduced gradient algorithm with improved convergence complexity. Metel et al. [15] first presented non-asymptotic convergence results for the non-smooth non-convex constrained sparse optimization problem, proposing two simple stochastic proximal gradient algorithms for stochastic and finite-sum optimization problems. Kawashima et al. [16] considered $f$ as a non-smooth quasi-convex function and achieved the same convergence complexity as Ghadimi et al. (2016) [13]. A stochastic variant of PGD is proximal stochastic gradient descent (PSGD). At each iteration $k$, it picks an index $i_k$ with probability $1/n$ from $\{1, \dots, n\}$, independently and identically distributed. PSGD takes the following update:

$$x_{k+1} = \operatorname{prox}_{\alpha_k f}\big(x_k - \alpha_k \nabla g_{i_k}(x_k)\big), \quad (3)$$

where $\{\alpha_k\}$ is a sequence of step sizes (also known as learning rates). Sampling the index $i_k$ over all indices $\{1, \dots, n\}$ i.i.d., the gradient $\nabla g_{i_k}(x_k)$ satisfies the unbiased estimation:

$$\mathbb{E}\left[\nabla g_{i_k}(x_k)\right] = \nabla g(x_k). \quad (4)$$
The advantage of PSGD over PGD lies in the fact that, at each iteration, PSGD only necessitates the computation of a single gradient $\nabla g_{i_k}(x_k)$, whereas each iteration of PGD evaluates all $n$ gradients $\nabla g_i(x_k)$. As a result, the computational cost of PSGD per iteration is $1/n$ of that of PGD; that is, the computation of $\nabla g_{i_k}(x_k)$ is approximately $n$ times less expensive than that of $\nabla g(x_k)$. Numerous algorithms have been devised to tackle the composite optimization problem (1). Gorbunov et al. [17] provided a unified analysis covering a broad range of variants of PSGD. In a study by Cevher et al. [18], it was demonstrated that PSGD, assuming strong convexity, displays linear convergence towards a region dominated by noise.
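A minimal sketch of update (3), writing $g_i(x) = \frac{1}{2}(a_i^\top x - b_i)^2$ for the smooth components so that the sampled gradient is unbiased in the sense of (4); the data, the diminishing step-size schedule, and the iteration count are illustrative choices only:

```python
import numpy as np

def soft_threshold(x, t):
    # Proximal operator of t * ||.||_1.
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def psgd(A, b, lam, iters=20000, seed=0):
    # PSGD update (3): one component gradient per iteration.
    # g_i(x) = 0.5*(a_i^T x - b_i)^2, so the mean over i of grad g_i
    # equals grad g = (1/n) A^T (Ax - b), matching (4).
    n, d = A.shape
    rng = np.random.default_rng(seed)
    x = np.zeros(d)
    for k in range(iters):
        i = rng.integers(n)                    # i.i.d. uniform index, probability 1/n
        grad_i = A[i] * (A[i] @ x - b[i])      # unbiased stochastic gradient
        alpha = 1.0 / (0.01 * k + 10.0)        # diminishing step size (illustrative)
        x = soft_threshold(x - alpha * grad_i, alpha * lam)
    return x

rng = np.random.default_rng(2)
A = rng.standard_normal((50, 5))
x_true = np.array([2.0, 0.0, -1.0, 0.0, 0.0])
b = A @ x_true                                 # consistent system: noise vanishes at x_true
x_psgd = psgd(A, b, lam=0.01)
```

Each iteration touches one data row instead of all $n$, which is exactly the $1/n$ per-iteration cost saving discussed above.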
1.1.3. Convergence Criteria
The vast majority of convergence rate analysis results for stochastic gradient methods in the literature are obtained in terms of the expectation (see, e.g., SGD, the stochastic heavy ball (SHB) method [19], stochastic Nesterov's accelerated gradient (SNAG), and so forth). However, almost sure convergence (a.s. for short, also known as "convergence with probability 1") [20] properties are important, because they describe what happens to individual trajectories of the stochastic iterations, which are the instantiations of the stochastic algorithms actually used in practice. Therefore, almost sure convergence of methods based on stochastic gradients is of practical relevance. For SGD in the convex and smooth setting, Sebbouh et al. [21] provided almost sure asymptotic convergence rates for a weighted average of the iterates. The almost sure convergence of the last iterate of SGD on non-convex functions was generalized by Orabona [22]. Liu et al. [23] provided a unified almost sure convergence rate analysis for SGD, SHB and SNAG. Recently, Liang et al. [24] presented an almost sure convergence analysis of stochastic composite objective mirror descent (SCOMID).
Remark 1.1. It is worth noting that our proof does not rely on the convexity of the smooth part $g$, so convexity of $g$ is not assumed in the model. However, to ensure the continuity of the prox-grad operator, we assume that $f$ is convex.
1.2. Proximal Stochastic Accelerated Gradient (PSAG) Method
Table 1. Comparison of problem setting, momentum, algorithm and convergence results with some relevant literature. SCOMID = Stochastic Composite Objective Mirror Descent; PSGD = Proximal Stochastic Gradient Descent; SGD = Stochastic Gradient Descent; AdaGrad = Adaptive Stochastic Gradient Descent; SHB = Stochastic Heavy Ball; SNAG = Stochastic Nesterov's Accelerated Gradient; PSAG = Proximal Stochastic Accelerated Gradient (in this paper).

Based on the above, we now consider an accelerated version of the proximal stochastic gradient method and use the unbiased stochastic gradient (see the last column of Table 1). The iteration of the PSAG method is given by:
$$y_k = x_k + \beta_k (x_k - x_{k-1}), \qquad x_{k+1} = \operatorname{prox}_{\alpha_k f}\big(y_k - \alpha_k \nabla g_{i_k}(y_k)\big), \quad (5)$$

where $\beta_k$ is an extrapolation parameter, $\alpha_k$ is the usual step size, and $\nabla g_{i_k}(y_k)$ is an unbiased estimator of the gradient, where $i_k$ is randomly picked from $\{1, \dots, n\}$ with probability $1/n$, independently and identically distributed. Then, we have:

$$\mathbb{E}\left[\nabla g_{i_k}(y_k)\right] = \nabla g(y_k), \quad (6)$$

where in (5), $\alpha_k$ is the stepsize, $\nabla g_{i_k}(y_k)$ is the stochastic gradient and $y_k = x_k + \beta_k (x_k - x_{k-1})$ is an extrapolation step. For the non-smooth composite problem (1), we establish almost sure convergence rates of PSAG: the minimum of the squared gradient-mapping norm is arbitrarily close to zero with probability one. Note that the PSAG method reduces to the PSGD method when $\beta_k = 0$ in (5).
The rest of this paper is organized as follows. In Section 2, we recall some definitions and known results for further analysis. Then, we present our convergence analysis of PSAG and its convergence rate in Section 3. Finally, we summarize our findings and draw conclusions in Section 4.
2. Preliminaries
2.1. Notations
The optimal solution of problem (1) is denoted by $x^\ast$, and the optimal set of problem (1) is assumed non-empty and denoted by $X^\ast$. The optimal value of problem (1) is denoted by $F^\ast$. In the rest of this work, for notational brevity, we will omit the subscript of the norm and write $\|\cdot\|$ for $\|\cdot\|_2$.
2.2. Definitions
Definition 2.1. (Convexity). A function $f: \mathbb{R}^d \to (-\infty, +\infty]$ is said to be convex if for all $x, y \in \mathbb{R}^d$ and $\lambda \in [0, 1]$,

$$f(\lambda x + (1 - \lambda) y) \le \lambda f(x) + (1 - \lambda) f(y). \quad (7)$$
Definition 2.2. (L-smoothness). A function $g: \mathbb{R}^d \to \mathbb{R}$ is said to be $L$-smooth if the gradient $\nabla g$ is Lipschitz continuous with constant $L$, i.e. there exists a constant $L > 0$ such that:

$$\|\nabla g(x) - \nabla g(y)\| \le L \|x - y\|, \quad \text{for all } x, y \in \mathbb{R}^d. \quad (8)$$

It is well-known that Definition 2.2 implies the following inequality (see, e.g. Nesterov [31] (Lemma 1.2.3)):

$$g(y) \le g(x) + \langle \nabla g(x), y - x \rangle + \frac{L}{2}\|y - x\|^2. \quad (9)$$
Remark 2.1. Definition 2.2 essentially implies that gradient descent with a sufficiently small step size is well behaved. Moreover, $g = \frac{1}{n}\sum_{i=1}^{n} g_i$ is $L$-smooth with $L = \max_i L_i$ when each $g_i$ is $L_i$-smooth.
Definition 2.3. (Unbiasedness). Given a random iterate $x_k$, we call the stochastic gradient $\nabla g_{i_k}(x_k)$ ($i_k$ is sampled randomly from $\{1, \dots, n\}$) unbiased if:

$$\mathbb{E}\left[\nabla g_{i_k}(x_k) \mid x_k\right] = \nabla g(x_k) \quad (10)$$

holds.
Definition 2.4. (Prox-grad operator). Consider the composite optimization problem (1); the PGD update (2) can be written as $x_{k+1} = T_{\alpha_k}(x_k)$, where $T_\alpha$ is the prox-grad operator defined by:

$$T_\alpha(x) := \operatorname{prox}_{\alpha f}\big(x - \alpha \nabla g(x)\big). \quad (11)$$
Definition 2.5. (Gradient mapping). Suppose that $g, f$ satisfy the problem (1) settings. Then, the gradient mapping is the operator $G_\alpha: \mathbb{R}^d \to \mathbb{R}^d$ defined by:

$$G_\alpha(x) := \frac{1}{\alpha}\big(x - T_\alpha(x)\big) \quad (12)$$

for any $x \in \mathbb{R}^d$ and $\alpha > 0$.
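For intuition: when the non-smooth part vanishes, the gradient mapping reduces to the gradient of the smooth part, and it vanishes exactly at solutions of (1). A small numeric sketch, assuming $f = \lambda\|\cdot\|_1$ (so the prox is soft-thresholding) purely for illustration:

```python
import numpy as np

def soft_threshold(x, t):
    # Proximal operator of t * ||.||_1.
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def grad_mapping(x, grad_g, alpha, lam):
    # (12): G_alpha(x) = (x - T_alpha(x)) / alpha, with the prox-grad operator
    # (11): T_alpha(x) = prox_{alpha f}(x - alpha * grad g(x)), here f = lam*||.||_1.
    T = soft_threshold(x - alpha * grad_g(x), alpha * lam)
    return (x - T) / alpha

grad_g = lambda x: 2.0 * x  # g(x) = ||x||^2

# With lam = 0 the mapping reduces to grad g: G_alpha(1) = 2 = grad g(1).
G_smooth = grad_mapping(np.array([1.0]), grad_g, 0.1, lam=0.0)

# x = 0 minimizes x^2 + |x|, so the gradient mapping vanishes there.
G_opt = grad_mapping(np.array([0.0]), grad_g, 0.1, lam=1.0)
```

This is the quantity whose minimum squared norm the convergence results below control.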
3. Convergence Analysis
3.1. Lemmas on Supermartingale Convergence Rates
The proof about almost sure convergence relies on the classical Robbins-Siegmund supermartingale convergence result (Theorem 1 in [32] ).
Lemma 3.1. (Theorem 1, [32]) Assume that $(X_k)$, $(Y_k)$ and $(Z_k)$ are three non-negative sequences of random variables, $(\beta_k)$ is a non-negative real sequence and $(\mathcal{F}_k)$ is a filtration of $\sigma$-algebras. If $X_k, Y_k, Z_k$ are all $\mathcal{F}_k$-measurable and the following conditions hold:

1) $\sum_{k} \beta_k < \infty$ and $\sum_{k} Z_k < \infty$ almost surely.

2) $\mathbb{E}\left[X_{k+1} \mid \mathcal{F}_k\right] \le (1 + \beta_k) X_k - Y_k + Z_k$ almost surely.

Then, $X_k$ converges almost surely and $\sum_{k} Y_k < \infty$ almost surely (a.s.).
Lemma 3.2. (Lemma 3, [23]) Assume that $(X_k)$ is a non-negative sequence of random variables and $(w_k)$ is a non-increasing positive sequence. If the following condition holds:

$$\sum_{k=1}^{\infty} w_k X_k < \infty \quad \text{a.s.}, \quad (13)$$

then we have:

$$\min_{1 \le i \le k} X_i = o\!\left( \frac{1}{\sum_{i=1}^{k} w_i} \right) \quad \text{a.s.}, \quad (14)$$

where $o$ denotes the higher-order infinitesimal, i.e. for two sequences $(a_k)$ and $(b_k)$, $a_k = o(b_k)$ if and only if $\lim_{k \to \infty} a_k / b_k = 0$.
3.2. Almost Sure Convergence Rate Analysis for Stochastic Proximal Accelerated Gradient Method
Reviewing the iteration of PSAG (5) and using Definitions 2.4 and 2.5, we rewrite PSAG as follows:

$$x_{k+1} = y_k - \alpha_k \widetilde{G}_{\alpha_k}(y_k), \quad (15)$$

where $y_k = x_k + \beta_k (x_k - x_{k-1})$, $\widetilde{G}_{\alpha}(y) := \frac{1}{\alpha}\big(y - \operatorname{prox}_{\alpha f}(y - \alpha \nabla g_{i_k}(y))\big)$ is the stochastic counterpart of the gradient mapping (12), $\alpha_k$ is the step size, and $i_k$ is randomly picked from $\{1, \dots, n\}$ with probability $1/n$, independently and identically distributed.
Lemma 3.3. Suppose that
for some
. If
is L-smooth, then for any
, we have:
(16)
Proof By the optimality of u, we can show that for any
,
(17)
Upon rearranging the above equation, we have:
(18)
Since
is L-smooth, we can apply Definition 2.2 to have:
(19)
(20)
Next, upon combining (19) and (20), we arrive at:
(21)
By the fact that
, we finally obtain:
(22)
Theorem 3.1. Suppose that
is lower bounded and there exists
,
such that:
(23)
where
,
is randomly picked from $\{1, \dots, n\}$ with probability $1/n$, independently and identically distributed, and the iteration of PSAG (15) is employed with a non-increasing stepsize
such that:
and
,
where
is defined in (23), then we have:
If
is convex, we obtain:
(24)
Proof We begin with the conclusion in (16) of Lemma 3.3. Firstly, by the definition of
in Definition 2.4, let
, then set
,
and
to meet the condition
, i.e.
(25)
From the update in (2), we can show that the iterate
for any
. Upon applying Lemma 3.3 with
,
,
and
for
, we have:
(26)
Next, by the update rule in (6) of PSAG algorithm, we choose
, then set
,
and
to meet the condition
, i.e.
(27)
Similarly, from the definition of
in (25), we can show that for any
, the sequence
. Now, upon applying Lemma 3.3 with
,
,
, and
, for
, we then obtain:
(28)
Now, we consider
,
(29)
where we use the Cauchy-Schwarz inequality
in the first inequality, and the second inequality follows from
with
and
. The last inequality holds by
with
and
.
Upon substituting (29) back into (28), we further have:
(30)
where the last inequality holds by the stepsize condition
, i.e.
.
Next, we consider
and combine Definition 2.4 with (5) and (15), we have:
and
Due to the convexity of f, we know that there exists
such that:
Therefore,
(31)
Then, by (15), we also have:
(32)
Upon combining (31) and (32) with (30), we have:
(33)
Now, the conditional expectation on (33) with respect to
gives:
(34)
where the first equality holds by the fact that
is
-measurable, i.e.
and
. The last inequality follows from
.
Upon subtracting
from both sides of (34), we have:
(35)
Upon multiplying (35) by
, we can obtain:
(36)
where the last inequality follows from the non-increasing behaviour of the stepsize
, i.e.
.
Finally, let
,
,
and
, then (36) becomes:
Recalling the stepsize conditions
and
, we know that
,
. Thus, by Lemma 3.1, we have:
Finally, with the condition
of Lemma 3.2, we have:
(37)
This completes the proof.
Remark 3.1. Inequality (23) is a new variance assumption on the stochastic gradient, where
is weaker than the bounded variance assumption (i.e.
in [33] ).
Remark 3.2. It is worth noting that we get the boundedness of the gradient mapping
due to the continuity of the prox-grad operator
which is deduced by the convexity of f.
Corollary 3.1. Following the setting of Theorem 3.1 and choosing the stepsize
for any
where
gives:
(38)
Proof. Stepsizes of this form have been studied in [23]. It follows that:
(39)
which implies that
. Then, by the integral test inequality, we have:
(40)
where the last inequality follows from the concavity of
, so that:
In other words, by taking
and
, we can get
.
Next, combining (39) and (40), we can obtain:
(41)
Upon setting
, and applying the inequality
, we have:
(42)
For the left hand side of (42), we apply Theorem 3.1 to get:
The application of the squeeze theorem in conjunction with (42) gives:
When
, we have
, which implies that:
Therefore, we finally obtain:
This completes the proof.
Remark 3.3. Corollary 3.1 indicates that
is arbitrarily close to 0 with probability one.
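The integral-test step invoked in the proof, around (40), rests on the elementary bound $\sum_{i=1}^{k} i^{-s} \ge (k^{1-s} - 1)/(1-s)$ for $s \in (1/2, 1)$, which is what drives an $o(1/k^{1-s})$-type rate; assuming a stepsize of the form $\alpha_k \propto k^{-s}$ (our reading of the corollary's setting), the bound can be checked numerically:

```python
# Numeric check of the integral-test lower bound:
# sum_{i=1}^k i^(-s) >= (k^(1-s) - 1) / (1 - s) for s in (1/2, 1).
def partial_sum(k, s):
    # Partial sum of the stepsize-like sequence i^(-s).
    return sum(i ** (-s) for i in range(1, k + 1))

def integral_bound(k, s):
    # Lower bound from comparing the sum with the integral of x^(-s).
    return (k ** (1.0 - s) - 1.0) / (1.0 - s)

checks = [(k, s) for k in (10, 100, 1000) for s in (0.6, 0.7, 0.9)]
ok = all(partial_sum(k, s) >= integral_bound(k, s) for k, s in checks)
```

Since the lower bound grows like $k^{1-s}$, dividing by it is what turns the summability in (13) into the vanishing minimum in (14).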
4. Conclusion
This paper presents PSAG with unbiased gradient estimation and analyzes its almost sure convergence for solving composite optimization problems, wherein the objective function comprises a smooth component and a non-smooth component. By leveraging certain key assumptions, we have established the almost sure convergence of PSAG with unbiased gradient estimation. Furthermore, we have demonstrated that the minimum of the squared gradient mapping norm approaches zero arbitrarily closely with probability one.
Funding
This paper was supported by the Natural Science Foundation of China (Grant No. 11801455), the Sichuan Science and Technology Program (Grant No. 2023NSFSC1922), the Innovation Team Funds of China West Normal University (Grant No. KCXTD2023-3), and the Fundamental Research Funds of China West Normal University (Grant No. 23kc010).