Convergence Rate Analysis of Modified BiG-SAM for Solving Bi-Level Optimization Problems Based on S-FISTA
1. Introduction
In this paper, we mainly consider bi-level optimization problems, which derive from the Stackelberg game in game theory. A bi-level optimization problem is a special kind of optimization problem involving two levels, called the outer level and the inner level. This structure means that the objective and constraints of the outer problem depend on the solution of the inner problem. Bi-level optimization problems have a wide range of applications in many fields, including economics, engineering design, transportation planning, machine learning and so on.
Recall the classical bi-level optimization problem, in which both the outer and inner objective functions are convex. The outer level is a constrained minimization problem:
(OP)
where
is the set of minimizers of the inner objective function, which is a composite convex minimization problem, as follows,
(P1)
In this case,
is a strongly convex and differentiable function,
is a convex and continuously differentiable function and
is an extended real-valued function on
. Here,
may be a nonsmooth function. Problem (OP)-(P1) is called a simple bi-level optimization problem in [1], as opposed to the more general version of the problem; see [2].
Note that both the inner problem (P1) and the outer problem (OP) are classical convex optimization problems, which can be solved, depending on the case, by the projected gradient method, the proximal gradient algorithm, the forward-backward splitting algorithm, and so on. However, the combined problem (OP)-(P1) is difficult to handle.
Problem (OP)-(P1) can be solved either directly or indirectly. In the indirect approach, the bi-level problem is transformed into a single-level optimization problem, which is then easier to solve. A common method for solving the classical bi-level optimization problem is Tikhonov regularization [3]: for some
, solve the following regularized problem:
(1.1)
Problems (OP) and (P1) can be traced back to the work of Mangasarian and Meyer [4] in the process of developing efficient algorithms for large-scale linear programs. They proposed a modification of the Tikhonov regularization technique [3]; the underlying idea is called the finite-perturbation property, namely, finding a parameter
(Tikhonov perturbation parameter) such that for all
,
This property was initially proven by Mangasarian for the case where the inner problem is a linear program. It was then extended by Ferris et al. [5] to the case where the inner problem is a general convex optimization problem.
In [5], the authors considered the case in which
is an indicator function of a closed convex set
, and under some restrictions, they demonstrated that the optimal solution of problem (1.1) is the optimal solution of problem (OP) whenever there exists a small enough
. In practice, the value of
is unknown, which means that solving problem (1.1) must rely on a sequence of regularizing parameters
, where
as
. Solodov [6] showed that for
, there is no need to find the optimal solution of problem (1.1) with indicator
. He proposed an explicit and more tractable proximal point method for the bi-level optimization problem (OP)-(P1), in contrast to the algorithm proposed by Cabot [7], whose approximation scheme is only implicit and therefore not easy to implement. Based on the proximal point algorithm, researchers have developed various proximal point methods to solve such problems under different frameworks; see [8] [9].
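To illustrate the regularization-path idea behind (1.1), the following Python sketch solves the regularized problem by gradient descent for a decreasing sequence of parameters, warm-starting each solve from the previous one. All problem data here (a quadratic inner objective whose minimizers form a line, and the squared-norm outer objective) are hypothetical choices for illustration only.

```python
import numpy as np

# Tikhonov regularization path for a toy bi-level problem (hypothetical data):
#   inner:  min phi(x) = 0.5*(x1 + x2 - 1)^2   (minimizers: the line x1+x2=1)
#   outer:  min omega(x) = 0.5*||x||^2 over the inner solution set
# The bi-level (minimum-norm) solution is (0.5, 0.5).

def grad_phi(x):
    return (x[0] + x[1] - 1.0) * np.ones(2)

def grad_omega(x):
    return x

def solve_regularized(sigma, x0, steps=2000, t=0.4):
    """Gradient descent on the regularized objective sigma*omega + phi, cf. (1.1)."""
    x = x0.copy()
    for _ in range(steps):
        x -= t * (sigma * grad_omega(x) + grad_phi(x))
    return x

x = np.array([2.0, -1.0])
for sigma in [1.0, 0.1, 0.01, 0.001]:   # sigma_k -> 0, warm-started
    x = solve_regularized(sigma, x)

print(x)  # approaches (0.5, 0.5)
```

As the sketch shows, driving the regularization parameter to zero steers the iterates toward the inner minimizer preferred by the outer objective.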
On the other hand, the bi-level problem (OP)-(P1) can be solved by a direct approach, called the hybrid steepest descent method [10], where the sequence converges to the optimal solution according to
and
. The hybrid steepest descent method was further extended by Neto et al. to handle a more general outer objective function.
Recently, Beck et al. [11] proposed a new direct first-order method, called the Minimal Norm Gradient method, which can solve problem (OP). The authors proved that, in terms of the inner objective function, the algorithm has a
convergence rate. However, for some choices of the outer objective function
, the computation required by this method to obtain the optimal solution can be expensive. Motivated by the minimal norm gradient method, Sabach et al. [12] suggested a first-order method, called BiG-SAM, to solve the bi-level optimization problem, based on existing viscosity approximation methods [13]. From the convergence analysis of BiG-SAM, they obtain a
global convergence rate in terms of the inner objective function. In addition, Yekini et al. combined an inertial technique with BiG-SAM and proposed an inertial BiG-SAM algorithm; for more details, see [14].
In this paper, we consider a more general composite convex function as the inner objective of the bi-level optimization problem, namely,
(P2)
where
is a continuously differentiable function with
-Lipschitz continuous gradient,
is real-valued and convex, and
is an extended real-valued function. This model is rich enough to cover many interesting generic optimization models through appropriate choices of
. More details about the assumptions on these functions are given in Section 2. Let
denote the unique optimal solution of problem (OP).
This paper is organized as follows. In Section 2, we use a smoothing technique to partially smooth the inner problem (P2), construct the smoothed inner objective function (Q), and give some useful lemmas for the convergence rate analysis. In Section 3, we introduce a new BiG-SAM algorithm for solving the bi-level optimization problem with outer level (OP). In Section 4, we investigate the convergence rate of BiG-SAM for the non-smooth version of the bi-level optimization problem.
2. Motivation and Construction
In this section, we present the motivation and the design of our algorithm, as well as some useful lemmas. Recall the bi-level optimization problem, where the outer level is problem (OP) and the inner level is
(P2)
where
,
and
satisfy the following assumption.
Assumption I:
i)
is a convex and continuously differentiable function with a Lipschitz continuous gradient with constant
, i.e.,
ii)
is a
-smoothable function,
. That is, for any
,
denotes a
-smooth approximation of
with parameters
.
iii)
is a proper, lower semicontinuous and convex function.
iv)
has bounded level sets. Specifically, for any
, there exists
such that
v) Let
be the optimal set of problem (P2), and it is nonempty. Set
as the optimal value of the problem (P2).
Definition 2.1. [15] A convex function
is called
-smoothable,
if for any
, there exists a convex differentiable function
such that the following holds:
a)
for all
.
b)
is
-smooth.
The function
is called a
-smooth approximation of
with parameters
.
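As a concrete instance of Definition 2.1, the Huber function is the standard 1/μ-smooth approximation of the absolute value with parameters (α, β) = (1, 1/2); the short check below (a sketch, not part of the paper's development) verifies the sandwich property numerically.

```python
import numpy as np

# The Huber function h_mu is a (1/mu)-smooth approximation of g(x) = |x|
# with parameters (alpha, beta) = (1, 1/2) in the sense of Definition 2.1:
#   h_mu(x) <= |x| <= h_mu(x) + (1/2)*mu,  and  h_mu is (1/mu)-smooth.

def huber(x, mu):
    return np.where(np.abs(x) <= mu, x**2 / (2.0 * mu), np.abs(x) - mu / 2.0)

mu = 0.1
x = np.linspace(-2.0, 2.0, 401)
h = huber(x, mu)

lower_ok = np.all(h <= np.abs(x) + 1e-12)             # h_mu(x) <= |x|
upper_ok = np.all(np.abs(x) <= h + mu / 2.0 + 1e-12)  # |x| <= h_mu(x) + mu/2
print(lower_ok, upper_ok)  # True True
```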
According to the definition of
, combined with Definition 2.1, we can smooth
as a
-smooth function
. Then, problem (P2) becomes
For convenience, we write the above composite minimization problem in the following form:
(Q)
Remark 2.1. Here, let
be the optimal solution set of problem (Q), which is non-empty and
. When
is small enough, the optimal solution set
is equal to
. This implies that when
is small enough, the optimal solution of
over
is equivalent to the optimal solution of
over
, i.e.,
Observe that problem (P2) is a non-smooth composite problem involving two non-smooth functions. A common methodology for solving non-smooth optimization problems is to replace the original problem by a sequence of approximating smooth problems, which are then solved by direct and classical methods [16]. The main idea is to transform the nondifferentiable problem into a smooth one; there are many different smoothing approaches for various classes of non-smooth optimization problems, see [17]-[19]. Motivated by the work of Beck et al. [15], we consider partially smoothing the inner objective function (P2) and transforming it into a classical convex optimization structure. The motivation for this approach is twofold. First, from the viewpoint of the design and analysis of the related schemes, the classical composite optimization model, like (P1), where
is smooth and
is nonsmooth, can be solved by gradient-based algorithms [20] [21]. Second, in many applications [22] [23], one of the non-smooth terms in (Q) plays a key role in describing a desirable property of the decision variable
. Smoothing all the non-smooth functions in (P2) would destroy this property of
.
Since
is
-smooth, it has
-Lipschitz continuous gradient
Since
, and
also has a
-Lipschitz continuous gradient, it follows that
is a continuously differentiable convex function whose gradient is Lipschitz continuous with constant
. Thus, problem (Q) can be solved by the classical proximal gradient (PG) method, also known as the proximal forward-backward method, whose iteration is as follows:
(2.1)
where the stepsize is
. Since
is a proper, lower semicontinuous and convex function,
is called the Moreau proximal mapping, which is defined as follows:
(2.2)
In addition, the PG method (2.1) can be regarded as a fixed-point algorithm; it can be formulated as
(2.3)
which is called the prox-grad (proximal-gradient) mapping. Denote
, it is the fixed point set of
. From [24] and [22], we have the following two crucial properties.
Lemma 2.1. [12]
i) The prox-grad mapping
is nonexpansive for all
, that is,
(2.4)
ii) Fixed points of the prox-grad mapping
are optimal solutions of problem (Q) and the reverse is also true, i.e.,
(2.5)
Therefore, we have that
for all
.
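The prox-grad mapping and its fixed-point characterization in Lemma 2.1 can be illustrated on a small composite problem; the least-squares-plus-ℓ1 data below are hypothetical, chosen only because the prox of the ℓ1 norm has the familiar soft-thresholding form.

```python
import numpy as np

# Prox-grad mapping T_t(x) = prox_{t*g}(x - t*grad f(x)) for a toy composite
# problem min f(x) + g(x), with f(x) = 0.5*||A x - b||^2 and g = lam*||.||_1
# (hypothetical data). By Lemma 2.1(ii), fixed points of T_t are minimizers.

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)
lam = 0.5

def grad_f(x):
    return A.T @ (A @ x - b)

def prox_g(x, t):
    # proximal mapping of t*lam*||.||_1: componentwise soft-thresholding
    return np.sign(x) * np.maximum(np.abs(x) - t * lam, 0.0)

L = np.linalg.norm(A.T @ A, 2)   # Lipschitz constant of grad f
t = 1.0 / L

def T(x):
    return prox_g(x - t * grad_f(x), t)

# the PG method (2.1) is exactly the fixed-point iteration x <- T(x)
x = np.zeros(5)
for _ in range(5000):
    x = T(x)

print(np.linalg.norm(T(x) - x))  # ~0: x is numerically a fixed point, hence a minimizer
```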
Now, we give a key proposition, which is a significant result in the convergence rate analysis. Indeed, we consider the following quadratic approximation of
at
, namely:
which admits a unique minimizer
This implies that
From the characterization of the optimality of
, we have the following lemma.
Lemma 2.2. For any
, one has
if and only if there exists
, the subdifferential of
, such that
(2.6)
Then we have the following proposition.
Proposition 2.1. Suppose that Assumption I holds true. Let
and denote
, such that
(2.7)
Then, for any
and
, we have
(2.8)
Proof. From (2.7), we have,
(2.9)
Since
, and
are convex, it implies
where the
is defined as in Lemma 2.2. Now, summing the above inequalities, we have
(2.10)
On the other hand, from the definition of
, let
, we have
(2.11)
Now, combining (2.9) with (2.10) and (2.11), it follows that
where the first equality follows from (2.6). Thus, we complete the proof.
□
Now, we turn to the details of the outer problem (OP). Recall the formulation of (OP): it is a convex constrained optimization problem, where
is the optimal solution set of problem (Q). In general, we suppose that outer objective function (OP) satisfies the following assumptions.
Assumption II.
i)
is
-strongly convex,
.
ii)
is a continuously differentiable function such that
is Lipschitz continuous with constant
.
Since
is differentiable, we can use the gradient descent method to solve the outer problem (OP). Nevertheless, not every outer function
satisfies Assumption II (ii); that is,
may be nonsmooth. In that case, we assume
satisfies the following property.
Assumption III:
is strongly convex with parameter
and
-Lipschitz continuous.
In this case, we can rely on the Moreau envelope of
to solve the outer problem, which is denoted by
; its formula is as follows:
(2.12)
It is well-known that
is continuously differentiable on
with an
-Lipschitz continuous gradient, which is given by
(2.13)
Additionally, the Moreau envelope has another useful property:
Lemma 2.3. [12] Let
be a strongly convex function with strong convexity parameter
and let
. Then, the Moreau envelope
is strongly convex with parameter
.
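The gradient formula (2.13) can be checked numerically. In the sketch below, the strongly convex nonsmooth test function ω(u) = |u| + (s/2)u² is a hypothetical choice whose proximal mapping has a simple closed form; the envelope's derivative is compared against a finite difference.

```python
import numpy as np

# Moreau envelope of the strongly convex nonsmooth function
#   omega(u) = |u| + (s/2)*u^2   (a hypothetical test function),
# and a finite-difference check of the gradient formula (2.13):
#   grad omega_mu(x) = (x - prox_{mu*omega}(x)) / mu.

mu, s = 0.5, 1.0

def omega(u):
    return abs(u) + 0.5 * s * u**2

def prox_omega(x, mu):
    # closed form of argmin_u omega(u) + (1/(2*mu))*(u - x)^2:
    # soft-threshold at mu, then shrink by 1/(1 + mu*s)
    return np.sign(x) * max(abs(x) - mu, 0.0) / (1.0 + mu * s)

def envelope(x, mu):
    p = prox_omega(x, mu)
    return omega(p) + (p - x)**2 / (2.0 * mu)

def grad_envelope(x, mu):
    return (x - prox_omega(x, mu)) / mu   # formula (2.13)

x0, h = 1.3, 1e-6
fd = (envelope(x0 + h, mu) - envelope(x0 - h, mu)) / (2.0 * h)
err = abs(fd - grad_envelope(x0, mu))
print(err)  # ~0: (2.13) matches the numerical derivative
```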
Definition 2.2. [12] A mapping
is said to be
-contraction if there exists
such that
When
satisfies Assumption II, the following result is crucial for our derivations.
Lemma 2.4. [12] Suppose that Assumption II holds and let
be the identity operator. Then, the mapping
is a contractive operator, for all
, that is,
(2.14)
3. BiG-SAM Algorithm for Smooth Bi-Level Optimization
In this section, we introduce a new BiG-SAM algorithm for solving the bi-level optimization problem. First, we construct a general framework for the bi-level problem whose inner level is problem (Q).
3.1. The General Framework
Motivated by Sabach et al. [12], our approach also uses the Sequential Averaging Method (SAM), proposed in [13] for handling fixed-point problems. We now analyze how to use it for solving the bi-level optimization problem made up of problems (OP) and (Q). The sequence
generated by the SAM algorithm converges to a solution of the fixed-point problem [13]. The iteration is
where
is a carefully chosen sequence in
.
The above algorithm, designed in [13], aims to find a fixed point of a nonexpansive operator
, i.e.
. This point also satisfies a variational inequality:
(3.1)
where
is a contraction mapping. In this sense,
is the “best” fixed point in
. Here,
satisfies the following assumption.
Assumption IV. Let
be a sequence of real numbers in
which satisfies
,
and
.
It should be noted that Assumption IV holds true for several choices of sequences
which include, for example,
for any choice of
.
The following lemma summarizes the key results on SAM, as established in ([13], Theorem 3.2), which serve as the foundation for this paper.
Lemma 3.1. [12] Assume that
is a
-contraction and that
is a nonexpansive mapping, for which
. Let
be the sequence generated by SAM. If Assumption IV holds true, then the following assertions are valid.
i) The sequence
is bounded, in particular, for any we have, for all
, that
(3.2)
Moreover, for all
, we also have that
ii) The sequence
converges to some
.
iii) The limit point
of
, whose existence is ensured by (ii), satisfies the following variational inequality
(3.3)
3.2. SAM for Smooth Bi-Level Optimization Problem
From Section 2, we know that the inner optimization problem (P2) can be smoothed into problem (Q), which has the same structure as the inner problem in [12]. Inspired by the works in [12], we can associate the outer problem (OP) and the inner problem (Q) with the mappings
and
, respectively. Here, we know that,
i) The mapping
and its fixed-point set
are related to problem (Q) with the composite function
and the optimal solution set
.
ii) The mapping
is related to problem (OP) and the outer objective function
.
Thus, we set
as the prox-grad mapping defined in (2.3), that is, for some
we have
(3.4)
According to Lemma 2.1 and Assumption I,
is nonexpansive and
. Then, from Lemma 2.4 and Assumption II, we can construct the
-contraction mapping
as follow:
(3.5)
where
, and the contraction parameter is
.
Similarly, we use the Sequential Averaging Method (SAM) to design a new BiG-SAM algorithm for solving the bi-level optimization problems (Q) and (OP). The iteration is as follows.
New Bi-level Gradient SAM (BiG-SAM)
Input:
, and
satisfying Assumption IV.
Initialization:
.
General Step (
):
(3.6)
(3.7)
(3.8)
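The iteration (3.6)-(3.8) can be sketched on a toy bi-level problem; the quadratic inner objective (with g ≡ 0, so its prox is the identity), the squared-norm outer objective, the step sizes, and the choice α_k = 1/k are all illustrative assumptions.

```python
import numpy as np

# BiG-SAM iteration (3.6)-(3.8) on a toy bi-level problem (hypothetical data):
#   inner:  min F(x) = 0.5*(x1 + x2 - 1)^2 with g == 0 (prox of g is identity)
#   outer:  min omega(x) = 0.5*||x||^2 over the inner solution set
# The bi-level solution is the minimum-norm point (0.5, 0.5).

def grad_F(x):
    return (x[0] + x[1] - 1.0) * np.ones(2)

def grad_omega(x):
    return x

t = 0.5   # inner step size, <= 1/L_F (here L_F = 2)
c = 0.5   # outer step size for the contraction step
x = np.array([2.0, -1.0])

for k in range(1, 20001):
    y = x - t * grad_F(x)              # (3.6): prox-grad step on the inner problem
    s = x - c * grad_omega(x)          # (3.7): step on the outer objective
    alpha = 1.0 / k                    # averaging sequence satisfying Assumption IV
    x = alpha * s + (1.0 - alpha) * y  # (3.8): sequential averaging

print(x)  # slowly approaches (0.5, 0.5)
```

The vanishing weight α_k lets the inner prox-grad step dominate while the outer step slowly selects, among the inner minimizers, the one preferred by the outer objective.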
Since the new BiG-SAM algorithm is similar to the one in [12], we can obtain a similar convergence result.
Lemma 3.2. Let
be a sequence generated by the new BiG-SAM. Suppose that Assumptions I, II and IV hold true. Then, the sequence
converges to
which satisfies
(3.9)
and therefore,
is the optimal solution of problem (OP).
The proof of Lemma 3.2 is similar to that of Proposition 5 in [12].
3.3. The Global Convergence Rate of BiG-SAM
In this section, we first prove a technical result on the convergence rate of the gap between successive SAM iterations for fixed-point problems, as described in Section 3.1. Then, we use this to derive our main result: a convergence rate for BiG-SAM in terms of the values of the inner objective function.
We first present a technical lemma which will assist us in the rate of convergence proof.
Lemma 3.3. [12] Let
. Assume that
is a sequence of nonnegative real numbers which satisfy
and
where
,
is a sequence which is defined by
, and
is a sequence of real numbers such that
. Then, the sequence
satisfies
where
.
For simplicity, we denote
and
for any
. The convergence rate analysis is divided into two parts, which ultimately lead to the main conclusions of Theorem 3.1 and Theorem 3.2. Lemma 3.4 provides useful inequalities, while Proposition 3.1 shows that, by choosing an appropriate sequence
, the distance between successive elements of
is bounded by
, and the sequence converges to a fixed-point of
at the same rate.
Lemma 3.4. [12] Assume that
is a
-contraction and that
is a nonexpansive mapping, for which
. Let
and
be sequences generated by SAM. Then, for any
and any , defining
the following inequalities hold true
(3.10)
(3.11)
(3.12)
(3.13)
Now, we prove the convergence rate of the sequence
, where
is generated by SAM and the averaging parameters
, are chosen as follows.
(3.14)
where
. For simplicity, we prove our results under the assumption that
. It is important to note that all the following results remain valid for any
chosen from the interval
.
Proposition 3.1. Let
and
be sequences generated by SAM where
is defined by (3.14). Then, for any , the two sequences
and
converge to
, and the rates of convergence are given by
(3.15)
and
(3.16)
where
is defined in (3.2), and
.
Proof. From the definitions of
and
, we directly obtain:
(3.17)
where the second inequality follows from (3.10) and (3.11). Now, let and let
, then
(3.18)
where the second inequality follows from (3.12) and (3.13), as well as the definition of
, and the last inequality follows from Lemma 3.1(i). Additionally, we have
(3.19)
where the second inequality follows from Lemma 3.1(i). Let
,
and
. Then, (3.17) becomes
Note that
and, combining it with (3.18), we know that
. Now, set
. According to Lemma 3.3, we know that
which gives (3.15). The convergence rate for
is derived from the following arguments. Recall that
, then
where the second inequality is due to the previous result as well as (3.18).
□
It is not hard to see from (3.16) that the sequence generated by the BiG-SAM algorithm converges to an optimal solution of the inner problem (Q) at the rate
. In the following, we discuss an important result, namely the convergence of
. Why not discuss the convergence of
directly? Because the domain of the function
may not be feasible for
. However, since
as
and
is lower semicontinuous, we know that proving convergence of
to the optimal value also implies the convergence of
to the same value. The global convergence rate result is as follows.
Theorem 3.1. Let
and
be sequences generated by BiG-SAM. Let
be defined by (3.14). Then, for all
and
, we have that
where , and
.
Proof. From Proposition 2.1 we have, for any step-size
, that the following inequality holds:
(3.20)
For
, from Lemma 3.1(i) and Proposition 3.1, we obtain
(3.21)
Substituting (3.21) into (3.20), we get
(3.22)
We complete the proof.
□
Now, we present the complexity result of the BiG-SAM algorithm.
Theorem 3.2. Suppose that Assumption I holds. Let
for some fixed
. Let
and
be generated by BiG-SAM algorithm with smoothing parameter
Then for any
satisfying
where
, and
, it holds that
.
Proof. Using the
-smooth approximation property of
with parameters
, it follows that for any
,
(3.23)
In particular, the following two inequalities hold:
(3.24)
which, combined with (3.22) and
, yields
where , and
. Therefore, for a given
, it holds that for any
,
(3.25)
Minimizing the right-hand side w.r.t.
, we obtain
(3.26)
Plugging (3.26) into (3.25), it implies that for any
,
Thus, to make sure that
is an
-optimal solution for any
, it is enough that
satisfies
Setting
, the above inequality reduces to
which, by the fact that
, is equivalent to
We conclude that
should satisfy
In particular, if we choose
and
according to (3.26), meaning that
then for any
, it holds that
. By (3.23) and (3.24),
which along with the inequality
implies that , and hence, by Assumption I (iv), it follows that
, where
. Therefore,
. Consequently,
The second inequality follows from the fact that for any
, it holds that
. Consequently, for any
, we have
, thus establishing the desired result.
□
4. BiG-SAM for Nonsmooth Bi-level Optimization Problems
In this section, we adopt the problem (OP) described in Section 2, where the objective function
does not necessarily satisfy Assumption II but instead satisfies Assumption III.
Note that BiG-SAM cannot be directly applied to bi-level problems under Assumption III. However, we can handle this case indirectly. From the strong convexity of
, we can smooth
by the Moreau envelope
. Recall the properties of Moreau envelope in Section 2,
is continuously differentiable, with a
-Lipschitz continuous gradient,
, and is strongly convex (see Lemma 2.3). Thus,
satisfies Assumption II, which makes the BiG-SAM algorithm applicable. In this case, step (3.7) can be simplified as follows:
(4.1)
where the second equality follows from (2.13). This implies that computing
(
) requires evaluating the proximal mapping of
.
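Step (4.1) can be sketched as follows. The strongly convex nonsmooth outer function ω(u) = |u| + (σ/2)u² and all parameter choices are hypothetical; the step size is taken inside (0, 2μ), since ∇ω_μ is (1/μ)-Lipschitz, so that the mapping S is a contraction (cf. Lemma 2.4 applied to the Moreau envelope ω_μ), which the code verifies numerically on a grid.

```python
import numpy as np

# Step (4.1) for a nonsmooth outer function: the mapping
#   S(x) = x - t*grad omega_mu(x) = x - (t/mu)*(x - prox_{mu*omega}(x))
# with the hypothetical strongly convex outer function
#   omega(u) = |u| + (sigma/2)*u^2.
# grad omega_mu is (1/mu)-Lipschitz, so t is chosen in (0, 2*mu).

mu, sigma, t = 0.2, 1.0, 0.15

def prox_omega(x, mu):
    # prox of mu*(|.| + (sigma/2)(.)^2): soft-threshold, then shrink
    return np.sign(x) * np.maximum(np.abs(x) - mu, 0.0) / (1.0 + mu * sigma)

def S(x):
    return x - (t / mu) * (x - prox_omega(x, mu))   # step (4.1)

# numerical check that S contracts distances on a grid of points
pts = np.linspace(-3.0, 3.0, 13)
ratios = [abs(S(a) - S(b)) / abs(a - b) for a in pts for b in pts if a != b]
print(max(ratios) < 1.0)  # True
```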
Remark 4.1. Note that the proximal mapping of a strongly convex function is a contraction ([12], Lemma 6), making the theory in Section 3.1 applicable here. A direct consequence of Lemma 3.2 applies to the mappings:
where
and
.
Lemma 4.1. [12] Let
be a sequence generated by BiG-SAM. Under Assumptions I, III and IV, for
, the sequence
converges to
, where
satisfies:
(4.2)
Thus,
is the optimal solution of the problem (OP) with respect to the Moreau envelope
, i.e.,
where
is the set of optimal solutions of problem (Q).
Smoothing
appears not to affect the convergence rate, which is stated in terms of the inner function. From [12], we know that the convergence rate depends on the contraction parameter
. We have the following result from ([12], Lemma 6):
Let
be the required uniform accuracy in terms of the outer objective function, that is,
(4.3)
where it should be noted that
for all
. Now, we aim to determine the number of iterations
required to achieve an
-optimal solution for the inner problem, that is,
while keeping the uniform accuracy as given in (4.3). This means that
depends on both
and
.
Proposition 4.1. Let
for some fixed
. Let
and
be a sequence generated by BiG-SAM and suppose that Assumptions I, III and IV hold true. In addition, suppose that the smoothing parameter is chosen by
and
Then, (4.3) holds true and for
where
, it holds that
.
Proof. Since
is
-Lipschitz continuous (see Assumption III), it follows that the norms of the subgradients of
are bounded from above by
. Thus, from ([15], Lemma 4.2) it follows, for all
, that
Therefore, for
, we obtain that
Using the
-smooth approximation property of
with parameters
, it follows that for any
,
(4.4)
In particular, the following two inequalities hold:
(4.5)
which, combined with (3.22), yields
where , and
. Therefore, for a given
, it holds that for any
,
(4.6)
Minimizing the right-hand side w.r.t.
, we obtain
(4.7)
Plugging the above expression into (4.6), we conclude that for any
,
Thus, to guarantee that
is an
-optimal solution for any
, it is enough that
satisfies
Denoting
, the above inequality reduces to
which, by the fact that
, is equivalent to
We conclude that
should satisfy
In particular, if we choose
(4.8)
Now, substituting
,
, and
into equation (4.8), we obtain
and
according to (4.7), meaning that
then for any
it holds that
. By (4.4) and (4.5),
which along with the inequality
implies that , and hence, by Assumption I (iv), it follows that
, where
. Therefore,
. Consequently,
The second inequality follows from the fact that for any
, it holds that
. The desired result is achieved by selecting
as the upper bound derived above.
□
5. Conclusion
In this paper, we construct a novel bi-level gradient sequential averaging method (BiG-SAM) for solving a more general composite convex bi-level optimization problem, where the inner problem is to find the optimal solution of the sum of three functions: two non-smooth and one smoothable. We analyze the convergence rate of BiG-SAM in two cases, where the outer objective is smooth or non-smooth; the global convergence rate with respect to the inner objective function is
. In the future, we could further explore the convergence rate and complexity analysis with respect to the outer objective function. Additionally, we could design stochastic and parallel variants of BiG-SAM for high-dimensional data or distributed scenarios. This would help reduce computational complexity while ensuring convergence and scalability of both the outer and inner objectives in distributed environments.
NOTES
*First author.
#Corresponding author.