1. Introduction
Convex bilevel optimization problems play an important role in many real-world applications such as image and signal processing, data science, data prediction, data classification, and artificial intelligence. For some interesting applications, we refer to the recent papers [1,2]. More precisely, we recall the concept of the convex bilevel optimization problem as follows. Let ψ and ϕ be two proper convex and lower semicontinuous functions from a real Hilbert space H into R∪{+∞}, where ϕ is smooth. In this work, we consider the following convex bilevel optimization problem:
where h:H→R is a strongly convex differentiable function with parameter s, and S⋆ is the solution set of the problem:
Problems (1.1) and (1.2) are known as outer-level and inner-level problems, respectively. It is well-known that if z⋆ satisfies the variational inequality:
then z⋆ is a solution of the outer-level problem (1.1); for more details, see [3]. Problem (1.2) is usually studied under the assumption that ∇ϕ is Lipschitz continuous with constant Lϕ, that is, there exists Lϕ>0 such that ‖∇ϕ(w)−∇ϕ(v)‖≤Lϕ‖w−v‖ for all w,v∈H.
The proximity operator proxμψ(z)=J^{∂ψ}_μ(z)=(I+μ∂ψ)−1(z), where I is the identity mapping and ∂ψ is the subdifferential of ψ, is crucial in solving problem (1.2). It is known that a point z⋆ in S⋆ is a fixed point of the operator proxμψ(I−μ∇ϕ). The following classical forward-backward splitting algorithm:
was proposed for solving problem (1.2). Later, Sabach and Shtern [4] introduced the bilevel gradient sequential averaging method (BiG-SAM), as seen in Algorithm 2. They also proved that the sequence {xk} generated by BiG-SAM converges strongly to the optimal solution z⋆ of the convex bilevel optimization problem (1.1)–(1.2). To speed up the convergence of BiG-SAM, Shehu et al. [5] employed an inertial technique proposed by Polyak [6], as defined in Algorithm 3 (iBiG-SAM). They proved a strong convergence theorem for Algorithm 3 under the weaker assumptions on {λk} given in [7], namely limk→∞λk=0 and ∑_{k=1}^∞ λk=+∞. The convergence rate of iBiG-SAM was subsequently improved by adapting the inertial technique, which resulted in the alternated inertial bilevel gradient sequential averaging method [8] (aiBiG-SAM), as seen in Algorithm 4. Examples in [8] show that the convergence behavior of aiBiG-SAM is better than that of BiG-SAM and iBiG-SAM. Recently, Jolaoso et al. [9] proposed a double inertial technique to accelerate the convergence rate of a strongly convergent 2-step inertial PPA algorithm for finding a zero of the sum of two maximal monotone operators. Yao et al. [10] also introduced a method for solving such a problem, called the weakly convergent FRB algorithm with momentum. This problem is exactly the inner-level problem of this work.
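For concreteness, the following is a minimal sketch (not taken from any of the cited works) of the classical forward-backward iteration (1.3) with a constant step size, applied to the illustrative inner-level instance ϕ(x)=(1/2)‖Ax−b‖2 and ψ(x)=α‖x‖1; the step size 1/Lϕ, the data, and all names below are assumptions for illustration only.

```python
import numpy as np

def soft_threshold(z, tau):
    # proximal operator of tau*||.||_1: componentwise shrinkage toward zero
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def forward_backward(A, b, alpha, num_iter=500):
    """Classical forward-backward splitting for the illustrative instance
    min_x 0.5*||Ax - b||^2 + alpha*||x||_1 of the inner-level problem (1.2)."""
    L = np.linalg.norm(A, 2) ** 2            # Lipschitz constant of the gradient A^T(Ax - b)
    mu = 1.0 / L                             # constant step size
    x = np.zeros(A.shape[1])
    for _ in range(num_iter):
        grad = A.T @ (A @ x - b)             # forward (gradient) step on phi
        x = soft_threshold(x - mu * grad, mu * alpha)   # backward (proximal) step on psi
    return x

# toy usage with a sparse ground truth
rng = np.random.default_rng(0)
A = rng.standard_normal((40, 100))
x_true = np.zeros(100); x_true[:5] = 1.0
x_hat = forward_backward(A, A @ x_true, alpha=0.1)
```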
It is worth noting that all methods mentioned above require a Lipschitz continuity assumption on ∇ϕ. However, finding a Lipschitz constant of ∇ϕ is sometimes difficult. To solve the inner-level problem without computing the Lipschitz constant of the gradient ∇ϕ, Cruz and Nghia [11] presented a linesearch technique (Linesearch 1) for finding a suitable step size for the forward-backward splitting algorithm. This approach requires weaker assumptions on the gradient of ϕ, as seen in the following criteria:
A1. ϕ,ψ:H→(−∞,+∞] are proper lower semicontinuous convex functions with domψ⊆domϕ;
A2. ϕ is differentiable on an open set containing domψ, and ∇ϕ is uniformly continuous on any bounded subset of domψ and maps any bounded subset of domψ to a bounded set in H.
It is observed that assumption A2 is weaker than the Lipschitz continuity assumption on ∇ϕ. Under assumptions A1 and A2, they proved that the sequence {xk} generated by (1.3), where μk is derived from Linesearch 1 (see the appendix for more detail), converges weakly to an optimal solution of the inner-level problem (1.2). Inspired by [11], several algorithms with a linesearch technique were proposed for solving problem (1.2); see [12,13,14,15,16,17], for example. Recently, Hanjing et al. [17] introduced a new linesearch technique (Linesearch 2) and a new algorithm (Algorithm 5), called the forward-backward iterative method with an inertial technical term and linesearch technique, to solve the inner-level problem (1.2). For more details on Linesearch 2 and Algorithm 5, see the appendix. They proved that the sequence {xk} generated by Algorithm 5 converges weakly to a solution of problem (1.2) under some control conditions.
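As a rough illustration of the linesearch idea (a sketch in the spirit of [11], not a verbatim transcription of Linesearch 1 or Linesearch 2, whose precise statements are in the appendix), one may backtrack on the step size μ until a criterion of the type μ‖∇ϕ(p)−∇ϕ(x)‖≤δ‖p−x‖ holds, where p is the corresponding proximal gradient point; the parameter names and the acceptance test below are assumptions for illustration.

```python
import numpy as np

def linesearch_step(x, grad_phi, prox_psi, mu0=1.0, delta=0.4, theta=0.5, max_backtracks=50):
    """Backtracking step-size rule in the spirit of [11] (illustrative sketch).
    Shrinks mu until  mu * ||grad_phi(p) - grad_phi(x)|| <= delta * ||p - x||,
    where p = prox_{mu*psi}(x - mu*grad_phi(x)); no Lipschitz constant of grad_phi is needed."""
    mu = mu0
    gx = grad_phi(x)
    p = prox_psi(x - mu * gx, mu)
    for _ in range(max_backtracks):
        if mu * np.linalg.norm(grad_phi(p) - gx) <= delta * np.linalg.norm(p - x):
            break                      # step size accepted
        mu *= theta                    # otherwise backtrack and recompute the proximal point
        p = prox_psi(x - mu * gx, mu)
    return p, mu
```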
Note that Algorithm 5 was designed to find a solution of the inner-level problem (1.2) and provides only weak convergence, whereas strong convergence is more desirable. Inspired by all of the works mentioned above, we aim to extend Algorithm 5 to solve the convex bilevel problems (1.1) and (1.2) by employing Linesearch 2 together with the viscosity approximation method. A strong convergence theorem for our developed algorithm is established under suitable conditions and assumptions. Furthermore, we apply the proposed algorithm to image restoration and data classification problems and compare its performance with other algorithms.
2. Notations and instruments for convergence analysis
In this section, we provide some important definitions, propositions, and lemmas which will be used in the next section. Let H be a real Hilbert space and X be a nonempty closed convex subset of H. Then, for each w∈H, there exists a unique element PXw in X satisfying
The mapping PX is known as the metric projection of H onto X. Moreover,
holds for all w∈H and z∈X. A mapping f:X→H is called Lipschitz continuous if there exists Lf>0 such that
If Lf∈[0,1), then f is called a contraction, and f is nonexpansive if Lf=1. The domain of a function f:H→[−∞,+∞] is denoted by domf, where domf:={v∈H:f(v)<∞}. Let {xk} be a sequence in H; we adopt the following notations:
1) xk⇀w denotes that sequence {xk} converges weakly to w∈H;
2) xk→w denotes that {xk} converges strongly to w∈H.
For each v,w∈H, the following conditions hold:
1) ‖v±w‖2=‖v‖2±2⟨v,w⟩+‖w‖2;
2) ‖v+w‖2≤‖v‖2+2⟨w,v+w⟩;
3) ‖tv+(1−t)w‖2=t‖v‖2+(1−t)‖w‖2−t(1−t)‖v−w‖2,∀t∈R.
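For instance, inequality (2) is an immediate consequence of identity (1), since 2⟨w,v+w⟩=2⟨v,w⟩+2‖w‖2:

```latex
\|v+w\|^{2} = \|v\|^{2} + 2\langle v,w\rangle + \|w\|^{2}
            = \|v\|^{2} + 2\langle w, v+w\rangle - \|w\|^{2}
            \le \|v\|^{2} + 2\langle w, v+w\rangle .
```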
Let ψ:H→(−∞,+∞] be a proper function. The subdifferential ∂ψ of ψ is defined by
If ∂ψ(u)≠∅, then ψ is subdifferentiable at u, and the elements of ∂ψ(u) are the subgradients of ψ at u. The proximal operator proxψ:H→domψ with proxψ(x):=(I+∂ψ)−1(x) is single-valued with full domain. Moreover, we have from [18] that for each x∈H and μ>0,
Let us next revisit some important properties for this work.
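Before doing so, we note a concrete instance of the proximal operator just defined (with the illustrative choice ψ=‖·‖1, which is not prescribed by the paper): prox_{μψ} is the componentwise soft-thresholding map, and the resolvent characterization x−prox_{μψ}(x)∈μ∂ψ(prox_{μψ}(x)) can be checked numerically.

```python
import numpy as np

def prox_l1(x, mu):
    # prox_{mu*||.||_1}(x) = sign(x)*max(|x| - mu, 0), i.e., the resolvent (I + mu*d||.||_1)^{-1}
    return np.sign(x) * np.maximum(np.abs(x) - mu, 0.0)

x = np.array([1.5, -0.2, 0.0, -3.0])
mu = 0.5
p = prox_l1(x, mu)                      # [1.0, 0.0, 0.0, -2.5]

# resolvent check: (x - p)/mu must be a subgradient of ||.||_1 at p,
# i.e., equal to sign(p_i) where p_i != 0 and lie in [-1, 1] where p_i == 0
g = (x - p) / mu
assert np.all(np.abs(g) <= 1.0 + 1e-12)
assert np.allclose(g[p != 0], np.sign(p[p != 0]))
```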
Lemma 1 ([19]). Let ∂ψ be a subdifferential of ψ. Then, the following hold:
1) ∂ψ is maximal monotone,
2) Gph(∂ψ):={(v,w)∈H×H:w∈∂ψ(v)} is demiclosed, i.e., if {(vk,wk)} is a sequence in Gph(∂ψ) such that vk⇀v and wk→w, then (v,w)∈ Gph(∂ψ).
Using the same idea of [4, Proposition 3], the following result can be proven.
Proposition 2. Suppose h:H→R is strongly convex with parameter s>0 and continuously differentiable such that ∇h is Lipschitz continuous with constant Lh. Then, the mapping I−t∇h is a c-contraction for all 0<t≤2/(Lh+s), where c=√(1−2stLh/(s+Lh)) and I is the identity operator.
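Before giving the proof, here is a quick numerical sanity check of this bound (the quadratic h below is an illustrative choice, not part of the paper): for h(x)=(1/2)xᵀQx with Q symmetric positive definite, s and Lh are the smallest and largest eigenvalues of Q, and I−t∇h=I−tQ should contract with factor at most c whenever 0<t≤2/(Lh+s).

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((5, 5))
Q = M @ M.T + np.eye(5)               # symmetric positive definite Hessian of h(x) = 0.5*x^T Q x
eigs = np.linalg.eigvalsh(Q)          # eigenvalues in ascending order
s, Lh = eigs[0], eigs[-1]             # strong convexity parameter and Lipschitz constant of grad h
t = 2.0 / (Lh + s)                    # largest admissible step size in Proposition 2
c = np.sqrt(1.0 - 2.0 * s * t * Lh / (s + Lh))

x, y = rng.standard_normal(5), rng.standard_normal(5)
lhs = np.linalg.norm((x - t * Q @ x) - (y - t * Q @ y))   # ||(I - t*grad h)(x) - (I - t*grad h)(y)||
assert lhs <= c * np.linalg.norm(x - y) + 1e-10           # contraction with factor c
```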
Proof: For any x,y∈H, we obtain
Using the same proof as in the case H=Rn in [20, Theorem 2.1.12], we get
From (2.2) and (2.4), we get
Lemma 3 ([21]). Let {ak} be a sequence of nonnegative real numbers satisfying
where {bk} is a sequence in (0,1) such that ∑_{k=1}^∞ bk=+∞ and {sk} is a sequence satisfying lim supk→∞ sk≤0. Then, limk→∞ ak=0.
Lemma 4 ([22]). Let {uk} be a sequence of real numbers for which there exists a subsequence {umj} of {uk} such that umj<umj+1 for all j∈N. Then there exists a nondecreasing sequence {kℓ} in N such that limℓ→∞kℓ=∞ and, for all sufficiently large ℓ∈N, the following holds:
3. Main convergence theorems
We begin this section by introducing a new accelerated algorithm (Algorithm 1), which uses a linesearch technique together with some modifications of Algorithm 5, for solving the convex bilevel minimization problems (1.1) and (1.2). Throughout this section, we let Ω be the set of all solutions of the convex bilevel problems (1.1) and (1.2), and we assume that h:H→R is a strongly convex differentiable function with parameter s such that ∇h is Lh-Lipschitz continuous and t∈(0, 2/(Lh+s)]. Suppose f:domψ→domψ is a c-contraction for some c∈(0,1). Let {γk} be a positive real sequence, {ξk} a positive sequence, and {λk} a sequence in (0,1). We propose the following Algorithm 1:
Remark 1. Our proposed algorithm uses a linesearch technique to find the step size of the proximal gradient method in order to relax the Lipschitz continuity assumption on the gradient of ϕ. Note that this linesearch technique employs two proximal evaluations, which is appropriate for algorithms that perform two proximal evaluations at each iteration; see [12,13,14,15,16,17]. It is observed that such algorithms have better convergence behavior than the others.
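To place Algorithm 1 in context, the sketch below shows only the generic BiG-SAM-type viscosity template (cf. Algorithm 2 in the appendix) with the illustrative choices ϕ(x)=(1/2)‖Ax−b‖2, ψ=α‖·‖1, and h(x)=(1/2)‖x‖2, so that f=I−t∇h=(1−t)I; it is not the exact Algorithm 1, which additionally uses an inertial term and the linesearch step size μk instead of a constant step.

```python
import numpy as np

def soft_threshold(z, tau):
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def viscosity_bilevel_sketch(A, b, alpha, t=1e-3, num_iter=300):
    """BiG-SAM-type viscosity template for the bilevel pair (1.1)-(1.2) (illustrative only).
    Each iteration mixes a forward-backward step on the inner problem with the
    contraction f(x) = (1 - t)*x coming from the outer objective h(x) = 0.5*||x||^2."""
    L = np.linalg.norm(A, 2) ** 2                 # Lipschitz constant of grad phi
    mu = 1.0 / L                                  # constant step size (Algorithm 1 uses a linesearch here)
    x = np.zeros(A.shape[1])
    for k in range(1, num_iter + 1):
        lam = 1.0 / (k + 1)                       # lambda_k -> 0 with sum lambda_k = +inf
        y = soft_threshold(x - mu * A.T @ (A @ x - b), mu * alpha)   # forward-backward step
        x = lam * (1.0 - t) * x + (1.0 - lam) * y                    # viscosity (convex-combination) step
    return x
```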
To prove the convergence results of Algorithm 1, we first establish the following lemmas.
Lemma 5. Let {xk} be a sequence generated by Algorithm 1, and v∈domψ. Then, the following inequality holds:
Proof: Let v∈domψ. It follows from (2.2) that
and
Then, by the definition of ∂ψ, we get
and
By the convexity of ϕ, we have for every x∈domϕ and y∈domψ,
which implies
and
From (3.2), (3.3), (3.5), and (3.6), we have
This together with (3.4) gives
By the definition of Linesearch 1, we get
From (3.7) and (3.8), we have
Moreover, we know that
and
By replacing (3.10) and (3.11) in (3.9), we obtain
Lemma 6. Let {xk} be a sequence generated by Algorithm 1 and S⋆≠∅. Suppose that limk→∞ ξk/λk=0. Then {xk} is bounded. Furthermore, {f(xk)}, {uk}, {yk}, and {vk} are also bounded.
Proof: Let v⋆∈S⋆. By Lemma 5, we have
which implies
By the definition of uk and since f is a contraction with constant c, we get
This together with (3.14) gives
From (3.1), we have
Using limk→∞ ξk/λk=0, we obtain limk→∞ (ηk/λk)‖yk−yk−1‖=0. Therefore, there exists N>0 such that (ηk/λk)‖yk−yk−1‖≤N for all k∈N. The above inequality implies
By induction, we have ‖xk+1−v⋆‖≤max{‖x1−v⋆‖, ‖f(v⋆)−v⋆‖/(1−c)+N/(1−c)}, and so {xk} is bounded. It follows that {f(xk)} is bounded. Combining this with the definition of uk, we obtain that {uk} is bounded. It then follows from (3.14) that {yk} and {vk} are also bounded.
Theorem 7. Let {xk} be a sequence generated by Algorithm 1 and S⋆≠∅. Suppose ϕ and ψ satisfy A1 and A2 and the following conditions hold:
1) {λk} is a positive sequence in (0,1);
2) μk≥μ for some μ∈R+;
3) limk→∞λk=0 and ∑∞k=1λk=+∞;
4) limk→∞ ξk/λk=0.
Then xk→v⋆∈S⋆, where v⋆=PS⋆f(v⋆). Moreover, if f:=I−t∇h, then xk→v⋆∈Ω.
Proof: Let v⋆∈S⋆ be such that v⋆=PS⋆f(v⋆). By (3.17), Algorithm 1, and the fact that f is a contraction with constant c, we have
Since limk→∞ ηk‖yk−yk−1‖=limk→∞ λk(ξk/λk)=0, there exists N1>0 such that ηk‖yk−yk−1‖≤N1 for all k∈N. From Lemma 6, we have ‖yk−v⋆‖≤N2 for some N2>0. Choose N̄=max{N1,N2}. By (3.21), we get
In order to verify the convergence of {xk}, we divide the proof into the following two cases.
Case 1. Suppose there exists M∈N such that ‖xk+1−v⋆‖≤‖xk−v⋆‖ for all k≥M. This implies that limk→∞‖xk−v⋆‖ exists. From (3.22), we set ak=‖xk−v⋆‖2, bk=λk(1−c), and sk=(2/(1−c))⟨f(v⋆)−v⋆,uk−v⋆⟩+(3N̄ηk/(λk(1−c)))‖yk−yk−1‖. It follows from ∑_{k=1}^∞ λk=+∞ that ∑_{k=1}^∞ bk=(1−c)∑_{k=1}^∞ λk=+∞. In addition,
Then, by limk→∞ ξk/λk=0, we get limk→∞ (3N̄ηk/(λk(1−c)))‖yk−yk−1‖=0.
To employ Lemma 3, we need to guarantee that lim supk→∞sk≤0. Since {uk} is bounded, there exists a subsequence {ukj} of {uk} such that ukj⇀w, for some w∈H, and
Next, we show that w∈S⋆. We have from (3.19) and (3.20) that
Combining this with (3.18) and (3.20), we obtain
From (3.13), we have
and
From (3.23) and (3.24), we obtain
and
Moreover, we know that
The uniform continuity of ∇ϕ and (3.26) yield
It implies, by assumption (2), that
This together with (3.26) and (3.27) yields
By the demiclosedness of Gph(∂(ϕ+ψ)), we obtain 0∈∂(ϕ+ψ)(w), and so w∈S⋆. It follows from (2.1) that
which implies that lim supk→∞sk≤0. By Lemma 3, we obtain
Case 2. Suppose that there exists a subsequence {xmj} of {xk} such that
for all j∈N. By Lemma 4, there is a nondecreasing sequence {kℓ} in N such that limℓ→∞kℓ=∞ and, for all sufficiently large ℓ∈N, the following holds:
We have from (3.25) and (3.26) that
Since {ukℓ} is bounded, there exists a weakly convergent subsequence {ukℓi} of {ukℓ} such that ukℓi⇀w⋆ for some w⋆∈H, and
The uniform continuity of ∇ϕ and (3.29) imply
Moreover, we know that
It implies, by assumption (2), that
Using (3.29) and (3.30), we get
By the demiclosedness of Gph(∂(ϕ+ψ)), we obtain 0∈∂(ϕ+ψ)(w⋆) and thus w⋆∈S⋆. This implies that
We derive from (3.22) and ‖xkℓ−v⋆‖≤‖xkℓ+1−v⋆‖ that
which implies
Consequently,
From the above inequality and ‖xℓ−v⋆‖≤‖xkℓ+1−v⋆‖, we obtain
Therefore, we can conclude that xk→v⋆. Finally, we show that v⋆ is the solution of problem (1.1). Since f:=I−t∇h, we have f(v⋆)=v⋆−t∇h(v⋆), which implies
for all x∈S⋆. This together with t>0 gives 0≤⟨∇h(v⋆),x−v⋆⟩ for all x∈S⋆. Hence, v⋆ is the solution of the outer-level problem (1.1).
4. Applications
In this section, we present experiments on image restoration and data classification problems using our algorithm, and we compare its performance with BiG-SAM, iBiG-SAM, and aiBiG-SAM. All numerical experiments are performed in MATLAB 9.6 (R2019a) on a MacBook Air (13.3-inch, 2020) with an Apple M1 processor, an 8-core GPU, and 8 GB of RAM.
4.1. Image restoration problems
In this subsection, we apply the proposed algorithm to restore true-color (RGB) images and compare its performance with BiG-SAM, iBiG-SAM, and aiBiG-SAM. Let A be a blurring operator and x an original image. If b represents an observed image, then a linear image restoration problem is defined by
where x∈Rn×1 and w denotes additive noise. Traditionally, the least absolute shrinkage and selection operator (LASSO) [23] is applied to approximate the original image x. It is given by
where α denotes a positive regularization parameter, ‖x‖1=∑_{k=1}^n |xk|, and ‖x‖2=(∑_{k=1}^n |xk|2)^{1/2}. We see that (4.2) is the inner-level problem (1.2) with ϕ(x)=‖Ax−b‖22 and ψ(x)=α‖x‖1. When a true RGB image is represented as the matrix x in the LASSO model, the sizes of A and x, as well as their entries, affect the cost of computing the product Ax and the norm ‖x‖. To mitigate this, we adopt the 2-D fast Fourier transform to convert the true RGB images into matrices instead. If W represents the 2-D fast Fourier transform and B denotes the blurring matrix such that the blurring operator A=BW, then problem (4.2) is transformed into the following problem:
where b∈Rm×n is the observed image of size m×n, and α>0 is a regularization parameter. Therefore, our proposed algorithm can be applied to solve the image restoration problem (4.1) by setting the inner-level problem as ϕ(x)=‖Ax−b‖22 and ψ(x)=α‖Wx‖1, and by choosing the outer-level objective h(x)=(1/2)‖x‖2. Next, we select all parameters satisfying the convergence theorem of each algorithm, as seen in Table 1.
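To make this setup concrete, the sketch below (with illustrative small sizes, and assuming for simplicity that W is a real orthogonal transform, since the actual 2-D FFT handling is an implementation detail) shows the two ingredients each compared algorithm needs for this model: the gradient of ϕ(x)=‖Ax−b‖22 and the proximal map of ψ(x)=α‖Wx‖1.

```python
import numpy as np

def grad_phi(A, b, x):
    # gradient of phi(x) = ||Ax - b||_2^2
    return 2.0 * A.T @ (A @ x - b)

def soft_threshold(z, tau):
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def prox_psi(x, mu, W, alpha):
    # For an orthogonal transform W, the prox of alpha*||W.||_1 is
    # W^T applied to the soft-thresholded coefficients of Wx.
    return W.T @ soft_threshold(W @ x, mu * alpha)
```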
Also, the Lipschitz constant Lϕ of ∇ϕ for BiG-SAM, iBiG-SAM, and aiBiG-SAM is calculated as the maximum eigenvalue of the matrix AᵀA. The quality of a restored image is measured by the peak signal-to-noise ratio (PSNR) in decibels (dB), which is given by
where mse=(1/K)‖xk−v⋆‖22, K denotes the number of image samples, and v⋆ denotes the original image. We select the regularization parameter α=0.00001 and consider the original image (Wat Lok Moli) of size 256×256 pixels from [24]. We employ a Gaussian blur of kernel size 9×9 with standard deviation σ=4 to construct the blurred and noisy images. The original and blurred images are shown in Figure 2. The results of deblurring the image of Wat Lok Moli after 500 iterations are reported in Table 2.
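For reference, the PSNR measure above can be computed directly as follows (assuming 8-bit images with peak value 255, the standard convention; function names are illustrative):

```python
import numpy as np

def psnr(restored, original, peak=255.0):
    """Peak signal-to-noise ratio in dB: 10*log10(peak^2 / mse),
    where mse is the mean squared error between the restored and original images."""
    mse = np.mean((np.asarray(restored, dtype=float) - np.asarray(original, dtype=float)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)
```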
As seen in Table 2, our proposed algorithm (Algorithm 1) gives a higher PSNR value than the others, which means that our algorithm achieves the best image restoration performance among the compared methods. The PSNR plots for the deblurred images up to the 500th iteration are shown in Figure 1.
The restored images of Wat Lok Moli produced by each algorithm at the 500th iteration are shown in Figure 2.
4.2. Data classification
Machine learning is crucial because it allows computers to learn from data and make decisions or predictions. There are three main types of machine learning: supervised learning, unsupervised learning, and reinforcement learning. Our work uses supervised learning, based on the extreme learning machine (ELM) [25], a single-hidden-layer feedforward neural network (SLFN) model. Reinforcement learning, by contrast, is typically used for decision-making problems in which an agent learns to perform actions in an environment so as to maximize cumulative rewards (see [26,27] for more information). However, it is not commonly used directly for data classification, which is more traditionally tackled using supervised learning techniques.
In this work, we aim to use the studied algorithm to solve a binary data classification problem. We focus on classifying the patient datasets of heart disease [28] and breast cancer [29] into classes. The details of the studied datasets are given in Table 3.
The above datasets were accessed on June 12, 2022 from https://archive.ics.uci.edu. We first recall the necessary notions for data classification problems, starting with the concept of the ELM. Suppose pk∈Rn is an input datum and qk∈Rm is its target. The training set of N samples is given by S:={(pk,qk):pk∈Rn, qk∈Rm, k=1,2,…,N}. The output of the i-th hidden node for a single hidden layer of the ELM is
where G is an activation function, ri is a bias, and wi is the weight vector connecting the input nodes to the i-th hidden node. If M denotes the number of hidden nodes, then the ELM for SLFNs gives the output function as:
where mi is the weight vector connecting the i-th hidden node to the output nodes. Thus, the hidden-layer output matrix A is given by
The main purpose of the ELM is to find a weight vector m=[m1ᵀ,…,mMᵀ]ᵀ such that
where Q=[q1ᵀ,…,qNᵀ]ᵀ is the training data. We observe from (4.5) that m=A†Q whenever the Moore–Penrose generalized inverse A† of A exists. If A† does not exist, it may be difficult to find a weight m satisfying (4.5). To overcome this situation, we utilize the following convex minimization problem (4.2) to find m:
where β is the regularization parameter and ‖(m1,m2,…,mp)‖1=∑_{i=1}^p |mi|. It follows from (4.2) that ϕ(m):=‖Am−Q‖22 and ψ(m):=β‖m‖1 are the inner-level functions of problem (1.2). To employ the proposed algorithm, BiG-SAM, iBiG-SAM, and aiBiG-SAM for data classification, we choose the outer-level function h(m)=(1/2)‖m‖2 for problem (1.1). For the datasets in Table 3, we select the sigmoid activation function G and set the number of hidden nodes to M=30. Choose t0=1 and tk+1=(1+√(1+4tk2))/2 for all k≥0. All parameters of each algorithm are chosen as in Table 4.
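As a sketch of this setup (random hidden weights and biases, the sigmoid activation, and the data dimensions below are illustrative assumptions), the hidden-layer output matrix A and the pieces of the inner-level objective can be formed as follows; any of the compared algorithms is then run on ϕ(m)=‖Am−Q‖22 and ψ(m)=β‖m‖1.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def elm_hidden_matrix(P, M, rng):
    """Hidden-layer output matrix of an ELM with M hidden nodes:
    A[k, i] = G(<w_i, p_k> + r_i) with randomly drawn weights w_i and biases r_i."""
    n_features = P.shape[1]
    W = rng.standard_normal((M, n_features))   # w_i: weights from the inputs to hidden node i
    r = rng.standard_normal(M)                 # r_i: bias of hidden node i
    return sigmoid(P @ W.T + r)

# illustrative data: N samples with n features and binary targets
rng = np.random.default_rng(0)
P = rng.standard_normal((120, 13))
Q = rng.integers(0, 2, size=(120, 1)).astype(float)
A = elm_hidden_matrix(P, M=30, rng=rng)        # 120 x 30 hidden-layer output matrix

beta = 1e-3
grad_phi = lambda m: 2.0 * A.T @ (A @ m - Q)                                  # gradient of ||Am - Q||^2
prox_psi = lambda m, mu: np.sign(m) * np.maximum(np.abs(m) - mu * beta, 0.0)  # prox of mu*beta*||.||_1
L_phi = 2.0 * np.linalg.norm(A, 2) ** 2        # Lipschitz constant used by the BiG-SAM-type methods
```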
Also, the Lipschitz constant Lϕ of ∇ϕ for BiG-SAM, iBiG-SAM, and aiBiG-SAM can be calculated as 2‖A‖2. In order to measure the prediction accuracy, we use the following formula:
where TP is the number of cases correctly identified as patients, TN is the number of cases correctly identified as healthy, FN is the number of cases incorrectly identified as healthy, and FP is the number of cases incorrectly identified as patients. In what follows, Acc Train refers to the accuracy on the training set, while Acc Test refers to the accuracy on the testing set. We present the iteration numbers and training times of the learning model for each algorithm in Table 5.
As seen in Table 5, we observe that the training time of Algorithm 1 is not significantly different from that of the other algorithms, even though it must compute the step size μk via the linesearch technique, a step the other algorithms do not require. Note that, thanks to the linesearch technique, our algorithm has better convergence behavior than the others in terms of the number of iterations. This means that the proposed algorithm attains the optimal weight more efficiently than the others. To evaluate the performance of each algorithm, we perform 10-fold cross validation, which splits the data into training and testing sets, as seen in Table 6.
In addition, the probability of making a correct positive-class classification (precision) is measured by
Also, the sensitivity of the model in identifying the positive class (recall) is estimated by
The evaluation tool is the average accuracy, which is given by
where N is the number of sets considered during the cross validation (N=10), ui is the number of correctly predicted data at fold i, and vi is the number of all data at fold i.
Let ErrM be the sum of errors in all 10 training sets, ErrK be the sum of errors in all 10 testing sets, M be the sum of all data in 10 training sets, and K be the sum of all data in 10 testing sets. Then,
where errorM%=(ErrM/M)×100% and errorK%=(ErrK/K)×100%.
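The evaluation measures above can be computed directly from the confusion-matrix counts, assuming the standard definitions that match the descriptions given earlier; a small helper (names illustrative) might look like this:

```python
def classification_metrics(TP, TN, FP, FN):
    """Accuracy, precision, and recall from confusion-matrix counts:
    accuracy = (TP+TN)/(TP+TN+FP+FN), precision = TP/(TP+FP), recall = TP/(TP+FN)."""
    accuracy = (TP + TN) / (TP + TN + FP + FN)
    precision = TP / (TP + FP)
    recall = TP / (TP + FN)
    return accuracy, precision, recall

# example: 40 true positives, 45 true negatives, 5 false positives, 10 false negatives
acc, prec, rec = classification_metrics(40, 45, 5, 10)   # 0.85, 0.888..., 0.8
```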
We show the performance of each algorithm for patient prediction of heart disease and breast cancer at the 300th iteration in Tables 7 and 8.
According to Tables 7 and 8, Algorithm 1 gives the best average accuracy on the training and testing datasets compared with BiG-SAM, iBiG-SAM, and aiBiG-SAM. We also see that our algorithm provides higher recall and precision for the diagnosis of heart disease and breast cancer. Furthermore, the proposed algorithm has the lowest percentage error in prediction.
5. Conclusions
Recently, various algorithms have been proposed for solving the convex bilevel optimization problems (1.1) and (1.2). These methods require the Lipschitz continuity of the gradient of the objective function of problem (1.2). To relax this requirement, a linesearch technique can be applied. In this work, we proposed a novel accelerated algorithm employing both linesearch and inertial techniques for solving the convex bilevel optimization problems (1.1) and (1.2). A convergence theorem for the proposed algorithm was established under suitable conditions. Furthermore, we applied our algorithm to image restoration and data classification problems. According to our experiments, the proposed algorithm is more efficient in image restoration and data classification than the other methods considered.
It is worth mentioning that, in real-world applications, if the objective function of the outer-level problem (1.1) is chosen appropriately, our algorithm can provide additional benefit and accuracy for the specific objective of data classification. Note that we used (1/2)‖x‖2 as the outer-level objective function, so our solution is the minimum-norm solution. In order to improve prediction accuracy, future work calls for new mathematical models and deep learning algorithms. Very recently, the deep extreme learning machine has emerged as an appropriate model for improving prediction accuracy; see [30,31]. However, deep extreme learning algorithms remain challenging to study and discuss. Moreover, we would like to employ our method for the prediction of noncommunicable diseases using patient data from the Sriphat Medical Center, Faculty of Medicine, Chiang Mai University.
Author contributions
Adisak Hanjing: formal analysis, investigation, resources, methodology, writing-review & editing, validation, data curation, and funding acquisition; Panadda Thongpaen: formal analysis, investigation, writing-original draft, software, visualization, data curation; Suthep Suantai: conceptualization, supervision, project administration, methodology, validation, and funding acquisition. All authors have read and agreed to the published version of the manuscript.
Use of AI tools declaration
The authors declare they have not used Artificial Intelligence (AI) tools in the creation of this article.
Acknowledgments
This work was partially supported by Chiang Mai University and the Fundamental Fund 2024 (FF030/2567), Chiang Mai University. The first author was supported by the Science Research and Innovation Fund, agreement no. FF67/P1-012. The authors would also like to thank the Rajamangala University of Technology Isan for partial financial support.
Conflict of interest
All authors declare no conflicts of interest in this paper.
Appendix A
In this section, we discuss the specific details of algorithms related to our work. These algorithms were proposed for solving convex bilevel optimization problems as follows:
Next, the details of the linesearch technique related to this work are provided as follows: