Convergence Rate Analysis of Modified BiG-SAM for Solving Bi-Level Optimization Problems Based on S-FISTA
1. Introduction
In this paper, we mainly consider bi-level optimization problems, which derive from the Stackelberg game in game theory. A bi-level optimization problem is a special kind of optimization problem involving two levels, called the outer level and the inner level. This structure means that the objective and constraints of the outer problem depend on the solution of the inner problem. Bi-level optimization problems have a wide range of applications in many fields, including economics, engineering design, transportation planning, machine learning and so on.
Recall the classical bi-level optimization problem, in which both the outer and inner objective functions are convex. The outer level is a constrained minimization problem:
(OP)
where
is the set of minimizers of the inner objective function, which is a composite convex minimization problem, as follows,
(P1)
In this case,
is a strongly convex and differentiable function,
is a convex and continuously differentiable function and
is an extended real-valued function on
. Here,
may be a nonsmooth function. Problem (OP)-(P1) is called a simple bi-level optimization problem in [1], as opposed to the more general version of the problem; see [2].
Note that both the inner problem (P1) and the outer problem (OP) are classical convex optimization problems, which can be solved, depending on the case, by the projected gradient method, the proximal gradient algorithm, the forward-backward splitting algorithm, and so on. However, the combined problem (OP)-(P1) is difficult to handle.
Problem (OP)-(P1) can be solved either directly or indirectly. In the indirect approach, the bi-level problem is transformed into a single-level optimization problem, which is then easier to solve. A common method for solving the classical bi-level optimization problem is Tikhonov regularization [3]: for some
, solve the following regularized problem:
(1.1)
Problems (OP) and (P1) can be traced back to the work of Mangasarian and Meyer [4] in the process of developing efficient algorithms for large-scale linear programs. They proposed a modification of the Tikhonov regularization technique [3]; the underlying idea is called the finite-perturbation property, namely, finding a parameter
(Tikhonov perturbation parameter) such that for all
,
This property was initially proven by Mangasarian for the case where the inner problem is a linear program. It was then extended by Ferris et al. [5] to the case where the inner problem is a general convex optimization problem.
In [5], the authors considered the case in which
is an indicator function of a closed convex set
, and under some restrictions, they demonstrated that the optimal solution of problem (1.1) is the optimal solution of problem (OP) whenever there exists a small enough
. In practice, the value of
is unknown, which means that solving problem (1.1) must rely on a sequence of regularizing parameters
, where
as
. Solodov [6] showed that for
, there is no need to find the optimal solution of problem (1.1) with indicator
. He proposed an explicit and more tractable proximal point method for the bi-level optimization problem (OP)-(P1), in contrast to the algorithm proposed by Cabot [7], whose approximation scheme is only implicit and therefore not easy to implement. Based on the proximal point algorithm, researchers have developed various proximal point methods to solve such problems under different frameworks; see [8] [9].
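To illustrate the regularization-path idea behind (1.1), the following Python sketch solves the regularized problem by gradient descent for a decreasing sequence of parameters, warm-starting each solve from the previous one. All problem data here (a quadratic inner objective whose minimizers form a line, and the squared-norm outer objective) are hypothetical choices for illustration only.

```python
import numpy as np

# Tikhonov regularization path for a toy bi-level problem (hypothetical data):
#   inner:  min phi(x) = 0.5*(x1 + x2 - 1)^2   (minimizers: the line x1+x2=1)
#   outer:  min omega(x) = 0.5*||x||^2 over the inner solution set
# The bi-level (minimum-norm) solution is (0.5, 0.5).

def grad_phi(x):
    return (x[0] + x[1] - 1.0) * np.ones(2)

def grad_omega(x):
    return x

def solve_regularized(sigma, x0, steps=2000, t=0.4):
    """Gradient descent on the regularized objective sigma*omega + phi, cf. (1.1)."""
    x = x0.copy()
    for _ in range(steps):
        x -= t * (sigma * grad_omega(x) + grad_phi(x))
    return x

x = np.array([2.0, -1.0])
for sigma in [1.0, 0.1, 0.01, 0.001]:   # sigma_k -> 0, warm-started
    x = solve_regularized(sigma, x)

print(x)  # approaches (0.5, 0.5)
```

As the sketch shows, driving the regularization parameter to zero steers the iterates toward the inner minimizer preferred by the outer objective.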
On the other hand, the bi-level problem (OP)-(P1) can be solved by a direct approach, called the hybrid steepest descent method [10], where the sequence converges to the optimal solution according to
and
. The hybrid steepest descent method was further extended by Neto et al. to handle a more general outer objective function.
Recently, Beck et al. [11] proposed a new direct first-order method, called the Minimal Norm Gradient method, which can solve problem (OP). The authors proved that, in terms of the inner objective function, the algorithm has a
convergence rate. However, for some choices of the outer objective function
, the computation required by this method to obtain the optimal solution can be expensive. Motivated by the minimal norm gradient method, Sabach et al. [12] suggested a first-order method, called BiG-SAM, to solve the bi-level optimization problem, based on existing viscosity approximation methods [13]. From the convergence analysis of BiG-SAM, they obtain a
global convergence rate in terms of the inner objective function. In addition, Yekini et al. combined an inertial technique with BiG-SAM and proposed an inertial BiG-SAM algorithm; for more details, see [14].
In this paper, we consider a more general composite convex function as the inner objective of the bi-level optimization problem, namely,
(P2)
where
is a continuously differentiable function with
-Lipschitz continuous gradient,
is real-valued and convex, and
is an extended real-valued function. This model is rich enough to cover many interesting generic optimization models through appropriate choices of
. More details about the assumptions on these functions are given in Section 2. Let
denote the unique optimal solution of problem (OP).
This paper is organized as follows. In Section 2, we use a smoothing technique to partially smooth the inner problem (P2), construct the smoothed inner objective function (Q), and give some useful lemmas for the convergence rate analysis. In Section 3, we introduce a new BiG-SAM algorithm for solving the bi-level optimization problem with outer level (OP). In Section 4, we investigate the convergence rate of BiG-SAM for the non-smooth version of the bi-level optimization problem.
2. Motivation and Construction
In this section, we present the motivation and the design of our algorithm, as well as some useful lemmas. Recall the bi-level optimization problem, where the outer level is problem (OP) and the inner level is
(P2)
where
,
and
satisfy the following assumption.
Assumption I:
i)
is a convex and continuously differentiable function with a Lipschitz continuous gradient with constant
, i.e.,
ii)
is a
-smoothable function,
. That is, for any
,
denotes a
-smooth approximation of
with parameters
.
iii)
is a proper, lower semicontinuous and convex function.
iv)
has bounded level sets. Specifically, for any
, there exists
such that
v) Let
be the optimal set of problem (P2), and it is nonempty. Set
as the optimal value of the problem (P2).
Definition 2.1. [15] A convex function
is called
-smoothable,
if for any
, there exists a convex differentiable function
such that the following holds:
a)
for all
.
b)
is
-smooth.
The function
is called a
-smooth approximation of
with parameters
.
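As a concrete instance of Definition 2.1, the Huber function is the standard 1/μ-smooth approximation of the absolute value with parameters (α, β) = (1, 1/2); the short check below (a sketch, not part of the paper's development) verifies the sandwich property numerically.

```python
import numpy as np

# The Huber function h_mu is a (1/mu)-smooth approximation of g(x) = |x|
# with parameters (alpha, beta) = (1, 1/2) in the sense of Definition 2.1:
#   h_mu(x) <= |x| <= h_mu(x) + (1/2)*mu,  and  h_mu is (1/mu)-smooth.

def huber(x, mu):
    return np.where(np.abs(x) <= mu, x**2 / (2.0 * mu), np.abs(x) - mu / 2.0)

mu = 0.1
x = np.linspace(-2.0, 2.0, 401)
h = huber(x, mu)

lower_ok = np.all(h <= np.abs(x) + 1e-12)             # h_mu(x) <= |x|
upper_ok = np.all(np.abs(x) <= h + mu / 2.0 + 1e-12)  # |x| <= h_mu(x) + mu/2
print(lower_ok, upper_ok)  # True True
```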
According to the definition of
, combined with Definition 2.1, we can smooth
as a
-smooth function
. Then, problem (P2) becomes
For convenience, we write the above composite minimization problem in the following form:
(Q)
Remark 2.1. Here, let
be the optimal solution set of problem (Q), which is non-empty and
. When
is small enough, the optimal solution set
is equal to
. This implies that when
is small enough, the optimal solution of
over
is equivalent to the optimal solution of
over
, i.e.,
Observe that problem (P2) is a non-smooth composite problem involving two non-smooth functions. A common methodology for solving non-smooth optimization problems is to replace the original problem by a sequence of approximating smooth problems, which are then solved by direct and classical methods [16]. The main idea is to transform the nondifferentiable problem into a smooth one; there are many different smoothing approaches for various classes of non-smooth optimization problems, see [17]-[19]. Motivated by the work of Beck et al. [15], we consider partially smoothing the inner objective function (P2) and transforming it into a classical convex optimization structure. The motivation for this approach is twofold. First, from the viewpoint of the design and analysis of the related schemes, the classical composite optimization model, like (P1), where
is smooth and
is nonsmooth, can be solved by gradient-based algorithms [20] [21]. Second, in many applications [22] [23], one of the non-smooth terms in (Q) plays a key role in describing a desirable property of the decision variable
. Smoothing all the non-smooth functions in (P2) would destroy this property of
.
Since
is
-smooth, it has
-Lipschitz continuous gradient
Since
, and
also has a
-Lipschitz continuous gradient, it follows that
is a continuously differentiable convex function whose gradient is Lipschitz continuous with constant
. Thus, problem (Q) can be solved by the classical proximal gradient (PG) method, also known as the proximal forward-backward method, whose iteration is as follows:
(2.1)
where the stepsize is
. Since
is a proper, lower semicontinuous and convex function,
is called the Moreau proximal mapping, which is defined as follows:
(2.2)
In addition, the PG method (2.1) can be regarded as a fixed-point algorithm; it can be formulated as
(2.3)
which is called the prox-grad (proximal-gradient) mapping. Denote
, it is the fixed point set of
. From [24] and [22], we have the following two crucial properties.
Lemma 2.1. [12]
i) The prox-grad mapping
is nonexpansive for all
, that is,
(2.4)
ii) Fixed points of the prox-grad mapping
are optimal solutions of problem (Q) and the reverse is also true, i.e.,
(2.5)
Therefore, we have that
for all
.
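The prox-grad mapping and its fixed-point characterization in Lemma 2.1 can be illustrated on a small composite problem; the least-squares-plus-ℓ1 data below are hypothetical, chosen only because the prox of the ℓ1 norm has the familiar soft-thresholding form.

```python
import numpy as np

# Prox-grad mapping T_t(x) = prox_{t*g}(x - t*grad f(x)) for a toy composite
# problem min f(x) + g(x), with f(x) = 0.5*||A x - b||^2 and g = lam*||.||_1
# (hypothetical data). By Lemma 2.1(ii), fixed points of T_t are minimizers.

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)
lam = 0.5

def grad_f(x):
    return A.T @ (A @ x - b)

def prox_g(x, t):
    # proximal mapping of t*lam*||.||_1: componentwise soft-thresholding
    return np.sign(x) * np.maximum(np.abs(x) - t * lam, 0.0)

L = np.linalg.norm(A.T @ A, 2)   # Lipschitz constant of grad f
t = 1.0 / L

def T(x):
    return prox_g(x - t * grad_f(x), t)

# the PG method (2.1) is exactly the fixed-point iteration x <- T(x)
x = np.zeros(5)
for _ in range(5000):
    x = T(x)

print(np.linalg.norm(T(x) - x))  # ~0: x is numerically a fixed point, hence a minimizer
```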
Now, we give a key proposition, which is a significant result in the convergence rate analysis. Indeed, we consider the following quadratic approximation of
at
, namely:
which admits a unique minimizer
This implies that
From the characterization of the optimality of
, we have the following lemma.
Lemma 2.2. For any
, one has
if and only if there exists
, the subdifferential of
, such that
(2.6)
Then we have the following proposition.
Proposition 2.1. Suppose that Assumption I holds true. Let
and denote
, such that
(2.7)
Then, for any
and
, we have
(2.8)
Proof. From (2.7), we have,
(2.9)
Since
, and
are convex, it implies
where the
is defined as in Lemma 2.2. Now, summing the above inequalities, we have
(2.10)
On the other hand, from the definition of
, let
, we have
(2.11)
Now, combining (2.9) with (2.10) and (2.11), it follows that
where the first equality follows from (2.6). Thus, we complete the proof.
□
Now, we turn to the details of the outer problem (OP). Recall the formulation of (OP): it is a convex constrained optimization problem, where
is the optimal solution set of problem (Q). In general, we suppose that outer objective function (OP) satisfies the following assumptions.
Assumption II.
i)
is
-strongly convex,
.
ii)
is a continuously differentiable function such that
is Lipschitz continuous with constant
.
Since
is differentiable, we can use the gradient descent method to solve the outer problem (OP). Nevertheless, not every outer function
satisfies Assumption II (ii); that is,
may be nonsmooth. In that case, we assume
satisfies the following property.
Assumption III:
is strongly convex with parameter
and
-Lipschitz continuous.
In this case, we can rely on the Moreau envelope of
to solve the outer problem, which is denoted by
; its formula is as follows:
(2.12)
It is well-known that
is continuously differentiable on
with an
-Lipschitz continuous gradient, which is given by
(2.13)
Additionally, the Moreau envelope has another useful property:
Lemma 2.3. [12] Let
be a strongly convex function with strong convexity parameter
and let
. Then, the Moreau envelope
is strongly convex with parameter
.
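The gradient formula (2.13) can be checked numerically. In the sketch below, the strongly convex nonsmooth test function ω(u) = |u| + (s/2)u² is a hypothetical choice whose proximal mapping has a simple closed form; the envelope's derivative is compared against a finite difference.

```python
import numpy as np

# Moreau envelope of the strongly convex nonsmooth function
#   omega(u) = |u| + (s/2)*u^2   (a hypothetical test function),
# and a finite-difference check of the gradient formula (2.13):
#   grad omega_mu(x) = (x - prox_{mu*omega}(x)) / mu.

mu, s = 0.5, 1.0

def omega(u):
    return abs(u) + 0.5 * s * u**2

def prox_omega(x, mu):
    # closed form of argmin_u omega(u) + (1/(2*mu))*(u - x)^2:
    # soft-threshold at mu, then shrink by 1/(1 + mu*s)
    return np.sign(x) * max(abs(x) - mu, 0.0) / (1.0 + mu * s)

def envelope(x, mu):
    p = prox_omega(x, mu)
    return omega(p) + (p - x)**2 / (2.0 * mu)

def grad_envelope(x, mu):
    return (x - prox_omega(x, mu)) / mu   # formula (2.13)

x0, h = 1.3, 1e-6
fd = (envelope(x0 + h, mu) - envelope(x0 - h, mu)) / (2.0 * h)
err = abs(fd - grad_envelope(x0, mu))
print(err)  # ~0: (2.13) matches the numerical derivative
```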
Definition 2.2. [12] A mapping
is said to be
-contraction if there exists
such that
When
satisfies Assumption II, the following result is crucial for our derivations.
Lemma 2.4. [12] Suppose that Assumption II holds and let
be the identity operator. Then, the mapping
is a contractive operator, for all
, that is,
(2.14)
3. BiG-SAM Algorithm for Smooth Bi-Level Optimization
In this section, we introduce a new BiG-SAM algorithm for solving the bi-level optimization problem. First, we construct a general framework for the bi-level problem whose inner level is problem (Q).
3.1. The General Framework
Motivated by Sabach et al. [12], our approach also uses the Sequential Averaging Method (SAM), proposed in [13] for handling fixed-point problems. We now analyze how to use it for solving the bi-level optimization problem made up of problems (OP) and (Q). The sequence
generated by the SAM algorithm converges to a solution of the fixed-point problem [13]. The iteration is
where
is a carefully chosen sequence in
.
The above algorithm, designed in [13], aims to find a fixed point of a nonexpansive operator
, i.e.
. This point also satisfies a variational inequality:
(3.1)
where
is a contraction mapping. In this sense,
is the “best” fixed point in
. Here,
satisfies the following assumption.
Assumption IV. Let
be a sequence of real numbers in
which satisfies
,
and
.
It should be noted that Assumption IV holds true for several choices of sequences
which include, for example,
for any choice of
.
The following lemma summarizes the key results on SAM, as established in ([13], Theorem 3.2), which serve as the foundation for this paper.
Lemma 3.1. [12] Assume that
is a
-contraction and that
is a nonexpansive mapping, for which
. Let
be the sequence generated by SAM. If Assumption IV holds true, then the following assertions are valid.
i) The sequence
is bounded, in particular, for any we have, for all
, that
(3.2)
Moreover, for all
, we also have that
ii) The sequence
converges to some
.
iii) The limit point
of
, whose existence is ensured by (ii), satisfies the following variational inequality
(3.3)
3.2. SAM for Smooth Bi-Level Optimization Problem
From Section 2, we know that the inner optimization problem (P2) can be smoothed into problem (Q), which has the same structure as the inner problem in [12]. Inspired by the works in [12], we can associate the outer problem (OP) and the inner problem (Q) with the mappings
and
, respectively. Here, we know that,
i) The mapping
and its fixed-point set
are related to problem (Q) with the composite function
and the optimal solution set
.
ii) The mapping
is related to problem (OP) and the outer objective function
.
Thus, we set
as the prox-grad mapping defined in (2.3), that is, for some
we have
(3.4)
According to Lemma 2.1 and Assumption I,
is nonexpansive and
. Then, from Lemma 2.4 and Assumption II, we can construct the
-contraction mapping
as follow:
(3.5)
where
, and the contraction parameter is
.
Similarly, we use the Sequential Averaging Method (SAM) to design a new BiG-SAM algorithm for solving the bi-level optimization problems (Q) and (OP). The iteration is as follows.
New Bi-level Gradient SAM (BiG-SAM)
Input:
, and
satisfying Assumption IV.
Initialization:
.
General Step (
):
(3.6)
(3.7)
(3.8)
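The iteration (3.6)-(3.8) can be sketched on a toy bi-level problem; the quadratic inner objective (with g ≡ 0, so its prox is the identity), the squared-norm outer objective, the step sizes, and the choice α_k = 1/k are all illustrative assumptions.

```python
import numpy as np

# BiG-SAM iteration (3.6)-(3.8) on a toy bi-level problem (hypothetical data):
#   inner:  min F(x) = 0.5*(x1 + x2 - 1)^2 with g == 0 (prox of g is identity)
#   outer:  min omega(x) = 0.5*||x||^2 over the inner solution set
# The bi-level solution is the minimum-norm point (0.5, 0.5).

def grad_F(x):
    return (x[0] + x[1] - 1.0) * np.ones(2)

def grad_omega(x):
    return x

t = 0.5   # inner step size, <= 1/L_F (here L_F = 2)
c = 0.5   # outer step size for the contraction step
x = np.array([2.0, -1.0])

for k in range(1, 20001):
    y = x - t * grad_F(x)              # (3.6): prox-grad step on the inner problem
    s = x - c * grad_omega(x)          # (3.7): step on the outer objective
    alpha = 1.0 / k                    # averaging sequence satisfying Assumption IV
    x = alpha * s + (1.0 - alpha) * y  # (3.8): sequential averaging

print(x)  # slowly approaches (0.5, 0.5)
```

The vanishing weight α_k lets the inner prox-grad step dominate while the outer step slowly selects, among the inner minimizers, the one preferred by the outer objective.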
Since the new BiG-SAM algorithm is similar to the one in [12], we can obtain a similar convergence result.
Lemma 3.2. Let
be a sequence generated by the new BiG-SAM. Suppose that Assumptions I, II and IV hold true. Then, the sequence
converges to
which satisfies
(3.9)
and therefore,
is the optimal solution of problem (OP).
The proof of Lemma 3.2 is similar to that of Proposition 5 in [12].
3.3. The Global Convergence Rate of BiG-SAM
In this section, we first prove a technical result on the convergence rate of the gap between successive SAM iterations for fixed-point problems, as described in Section 3.1. Then, we use this to derive our main result: a convergence rate for BiG-SAM in terms of the values of the inner objective function.
We first present a technical lemma which will assist us in the rate of convergence proof.
Lemma 3.3. [12] Let
. Assume that
is a sequence of nonnegative real numbers which satisfy
and
where
,
is a sequence which is defined by
, and
is a sequence of real numbers such that
. Then, the sequence
satisfies
where
.
For simplicity, we denote
and
for any
. The convergence rate analysis is divided into two parts, which ultimately lead to the main conclusions of Theorem 3.1 and Theorem 3.2. Lemma 3.4 provides useful inequalities, while Proposition 3.1 shows that, by choosing an appropriate sequence
, the distance between successive elements of
is bounded by
, and the sequence converges to a fixed-point of
at the same rate.
Lemma 3.4. [12] Assume that
is a
-contraction and that
is a nonexpansive mapping, for which
. Let
and
be sequences generated by SAM. Then, for any
and any , defining
the following inequalities hold true
(3.10)
(3.11)
(3.12)
(3.13)
Now, we prove the convergence rate of the sequence
, where
is generated by SAM and the averaging parameters
, are chosen as follows.
(3.14)
where
. For simplicity, we prove our results under the assumption that
. It is important to note that all the following results remain valid for any
chosen from the interval
.
Proposition 3.1. Let
and
be sequences generated by SAM where
is defined by (3.14). Then, for any , the two sequences
and
converge to
, and the rates of convergence are given by
(3.15)
and
(3.16)
where
is defined in (3.2), and
.
Proof. From the definitions of
and
, we directly obtain:
(3.17)
where the second inequality follows from (3.10) and (3.11). Now, let and let
, then
(3.18)
where the second inequality follows from (3.12) and (3.13), as well as the definition of
, and the last inequality follows from Lemma 3.1(i). Additionally, we have
(3.19)
where the second inequality follows from Lemma 3.1(i). Let
,
and
. Then, (3.17) becomes
Note that
and, combining it with (3.18), we know that
. Now, set
. According to Lemma 3.3, we know that
which gives (3.15). The convergence rate for
is derived from the following arguments. Recall that
, then
where the second inequality is due to the previous result as well as (3.18).
□
It is not hard to see from (3.16) that the sequence generated by the BiG-SAM algorithm converges to an optimal solution of the inner problem (Q) at the rate
. In the following, we discuss an important result, namely the convergence of
. Why not discuss the convergence of
directly? Because the domain of the function
may not be feasible for
. However, since
as
and
is lower semicontinuous, we know that proving convergence of
to the optimal value also implies the convergence of
to the same value. The global convergence rate result is as follows.
Theorem 3.1. Let
and
be sequences generated by BiG-SAM. Let
be defined by (3.14). Then, for all
and
, we have that
where , and
.
Proof. From Proposition 2.1 we have, for any step-size
, that the following inequality holds:
(3.20)
For
, from Lemma 3.1(i) and Proposition 3.1, we obtain
(3.21)
Substituting (3.21) into (3.20), we get
(3.22)
We complete the proof.
□
Now, we present the complexity result of the BiG-SAM algorithm.
Theorem 3.2. Suppose that Assumption I holds. Let
for some fixed
. Let
and
be generated by BiG-SAM algorithm with smoothing parameter
Then for any
satisfying
where
, and
, it holds that
.
Proof. Using the
-smooth approximation property of
with parameters
, it follows that for any
,
(3.23)
In particular, the following two inequalities hold:
(3.24)
which, combined with (3.22) and
, yields
where , and
. Therefore, for a given
, it holds that for any
,
(3.25)
Minimizing the right-hand side w.r.t.
, we obtain
(3.26)
Plugging (3.26) into (3.25), it implies that for any
,
Thus, to make sure that
is an
-optimal solution for any
, it is enough that
satisfies
Setting
, the above inequality reduces to
which, by the fact that
, is equivalent to
We conclude that
should satisfy
In particular, if we choose
and
according to (3.26), meaning that
then for any
, it holds that
. By (3.23) and (3.24),
which along with the inequality
implies that , and hence, by Assumption I (iv), it follows that
, where
. Therefore,
. Consequently,
The second inequality follows from the fact that for any
, it holds that
. Consequently, for any
, we have
, thus establishing the desired result.
□
4. BiG-SAM for Nonsmooth Bi-level Optimization Problems
In this section, we adopt the problem (OP) described in Section 2, where the objective function
does not necessarily satisfy Assumption II but instead satisfies Assumption III.
Note that BiG-SAM cannot be directly applied to bi-level problems under Assumption III. However, we can handle this case indirectly. From the strong convexity of
, we can smooth
by the Moreau envelope
. Recall the properties of Moreau envelope in Section 2,
is continuously differentiable, with a
-Lipschitz continuous gradient,
, and is strongly convex (see Lemma 2.3). Thus,
satisfies Assumption II, which makes the BiG-SAM algorithm applicable. In this case, step (3.7) can be simplified as follows:
(4.1)
where the second equality follows from (2.13). This implies that computing
(
) requires evaluating the proximal mapping of
.
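Step (4.1) can be sketched as follows. The strongly convex nonsmooth outer function ω(u) = |u| + (σ/2)u² and all parameter choices are hypothetical; the step size is taken inside (0, 2μ), since ∇ω_μ is (1/μ)-Lipschitz, so that the mapping S is a contraction (cf. Lemma 2.4 applied to the Moreau envelope ω_μ), which the code verifies numerically on a grid.

```python
import numpy as np

# Step (4.1) for a nonsmooth outer function: the mapping
#   S(x) = x - t*grad omega_mu(x) = x - (t/mu)*(x - prox_{mu*omega}(x))
# with the hypothetical strongly convex outer function
#   omega(u) = |u| + (sigma/2)*u^2.
# grad omega_mu is (1/mu)-Lipschitz, so t is chosen in (0, 2*mu).

mu, sigma, t = 0.2, 1.0, 0.15

def prox_omega(x, mu):
    # prox of mu*(|.| + (sigma/2)(.)^2): soft-threshold, then shrink
    return np.sign(x) * np.maximum(np.abs(x) - mu, 0.0) / (1.0 + mu * sigma)

def S(x):
    return x - (t / mu) * (x - prox_omega(x, mu))   # step (4.1)

# numerical check that S contracts distances on a grid of points
pts = np.linspace(-3.0, 3.0, 13)
ratios = [abs(S(a) - S(b)) / abs(a - b) for a in pts for b in pts if a != b]
print(max(ratios) < 1.0)  # True
```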
Remark 4.1. Note that the proximal mapping of a strongly convex function is a contraction ([12], Lemma 6), making the theory in Section 3.1 applicable here. A direct consequence of Lemma 3.2 applies to the mappings:
where
and
.
Lemma 4.1. [12] Let
be a sequence generated by BiG-SAM. Under Assumptions I, III and IV, for
, the sequence
converges to
, where
satisfies:
(4.2)
Thus,
is the optimal solution of the problem (OP) with respect to the Moreau envelope
, i.e.,
where
is the set of optimal solutions of problem (Q).
Smoothing
appears not to affect the convergence rate, which is stated in terms of the inner function. From [12], we know that the convergence rate depends on the contraction parameter
. We have the following result from ([12], Lemma 6):
Let
be the required uniform accuracy in terms of the outer objective function, that is,
(4.3)
where it should be noted that
for all
. Now, we aim to determine the number of iterations
required to achieve an
-optimal solution for the inner problem, that is,
while keeping the uniform accuracy as given in (4.3). This means that
depends on both
and
.
Proposition 4.1. Let
for some fixed
. Let
and
be a sequence generated by BiG-SAM and suppose that Assumptions I, III and IV hold true. In addition, suppose that the smoothing parameter is chosen by
and
Then, (4.3) holds true and for
where
, it holds that
.
Proof. Since
is
-Lipschitz continuous (see Assumption III), it follows that the norms of the subgradients of
are bounded from above by
. Thus, from ([15], Lemma 4.2) it follows, for all
, that
Therefore, for
, we obtain that
Using the
-smooth approximation property of
with parameters
, it follows that for any
,
(4.4)
In particular, the following two inequalities hold:
(4.5)
which, combined with (3.22), yields
where , and
. Therefore, for a given
, it holds that for any
,
(4.6)
Minimizing the right-hand side w.r.t.
, we obtain
(4.7)
Plugging the above expression into (4.6), we conclude that for any
,
Thus, to guarantee that
is an
-optimal solution for any
, it is enough that
satisfies
Denoting
, the above inequality reduces to
which, by the fact that
, is equivalent to
We conclude that
should satisfy
In particular, if we choose
(4.8)
Now, substituting
,
, and
into equation (4.8), we obtain
and
according to (4.7), meaning that
then for any
it holds that
. By (4.4) and (4.5),
which along with the inequality
implies that , and hence, by Assumption I (iv), it follows that
, where
. Therefore,
. Consequently,
The second inequality follows from the fact that for any
, it holds that
. The desired result is achieved by selecting
as the upper bound derived above.
□
5. Conclusion
In this paper, we construct a novel bi-level gradient sequential averaging method (BiG-SAM) for solving a more general composite convex bi-level optimization problem, where the inner problem is to find the optimal solution of the sum of three functions: two non-smooth and one smoothable. We analyze the convergence rate of BiG-SAM in two cases, where the outer objective is smooth or non-smooth; the global convergence rate with respect to the inner objective function is
. In the future, we could further explore the convergence rate and complexity analysis with respect to the outer objective function. Additionally, we could design stochastic and parallel variants of BiG-SAM for high-dimensional data or distributed scenarios. This would help reduce computational complexity while ensuring convergence and scalability of both the outer and inner objectives in distributed environments.
NOTES
*First author.
#Corresponding author.