Softening Bilevel Problems Via Two-scale Gibbs Measures

We introduce a new, and elementary, approximation method for bilevel optimization problems motivated by Stackelberg leader-follower games. Our technique is based on the notion of two-scale Gibbs measures. The first scale corresponds to the cost function of the follower and the second scale to that of the leader. We explain how to choose the weights corresponding to these two scales under very general assumptions and establish rigorous Γ-convergence results. An advantage of our method is that it is applicable both to optimistic and to pessimistic bilevel problems.


Introduction
Bilevel optimization refers to mathematical programs in which one optimization problem contains another as a constraint. Such programs model decision making with a hierarchical leader-follower structure and have a natural interpretation in game theory. Bilevel problems have a long history that dates back to von Stackelberg [18] and have been intensively studied from a theoretical point of view as well as in applications to various domains, including traffic planning, security, supply chain management, principal-agent models, production planning, market deregulation, optimal taxation and parameter estimation; see the book [11], the recent surveys [7,9] and the references therein. In the context of Cournot duopolies, von Stackelberg investigated the leader-follower model where the leader firm maximizes profit under the constraint that the follower firm reacts with an optimal choice of the quantity, assumed to be unique. Later on, Leitmann [14] discussed the case where the optimal solutions of the follower's problem form a set that the leader has to take into account in order to solve her own optimization problem.
When the follower's program has several solutions, there is some ambiguity even in the definition of the leader's program. In the literature, several concepts have been considered. The optimistic (or strong) Stackelberg solution assumes a cooperative-like behavior between the agents: the leader expects the follower to choose solutions leading to the best outcome for her. On the contrary, the pessimistic (or weak) Stackelberg solution assumes that the follower always breaks ties by choosing the worst actions for the leader, which corresponds to a security strategy for her, see [2,6]. Some intermediate cases can also be considered. In [1], a cooperation degree is assumed, leading to the optimization of a convex combination of the best and the worst payoff values for the leader, while, in [16], probabilistic information about the follower's behavior is assumed, resulting in the optimization of an average payoff.
Both optimistic and pessimistic bilevel programs are challenging and often difficult to solve in practice. In the present paper, we present a new and quite simple (unconstrained) approximation scheme for such problems based on the notion of two-scale Gibbs measures. Our method is directly inspired by the classical Laplace method: Gibbs probability measures, which have a density proportional to e^{−λu} with respect to a reference measure with full support, concentrate on the set where u is minimal as λ → ∞. We refer to Hwang [13] for a fine study of the method and precise statements in smooth finite-dimensional situations. In the context of bilevel optimization, we have to take into account the objectives of both the leader and the follower, and a single parameter λ is not enough to capture the nested structure of the leader-follower problem. This is why we introduce two-scale Gibbs measures, where the first scale (with weight λ) takes into account the follower's objective and the second one (with a smaller weight to be chosen properly) takes into account the leader's objective. We investigate in detail convergence issues (both in the pointwise and in the Γ-convergence sense) and the choice of the secondary scale only in terms of the reference measure and the moduli of continuity of the leader and follower objective functions. Our method is flexible enough to cope with both the optimistic and the pessimistic case.
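The Laplace/Gibbs concentration phenomenon invoked above is easy to visualize numerically. The following Python sketch is purely illustrative and not part of the paper's analysis: the potential u is hypothetical, and the reference measure is a uniform grid discretization of Lebesgue measure on [0, 1].

```python
import numpy as np

# Illustrative sketch: Gibbs measures with density proportional to exp(-lam*u)
# w.r.t. a full-support reference measure concentrate on argmin u as lam grows.
# Here nu is a uniform grid discretization of [0, 1]; u is hypothetical.

def gibbs_weights(u_vals, lam):
    """Normalized Gibbs weights exp(-lam * u) on a discrete sample of nu."""
    w = np.exp(-lam * (u_vals - u_vals.min()))  # shift by min for stability
    return w / w.sum()

y = np.linspace(0.0, 1.0, 2001)
u = (y - 0.3) ** 2  # unique minimizer at y = 0.3

for lam in [1.0, 10.0, 1000.0]:
    w = gibbs_weights(u, lam)
    print(lam, float((w * y).sum()))  # Gibbs mean drifts toward 0.3
```

As λ grows, the printed Gibbs mean approaches the minimizer 0.3, in line with the Laplace method.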
Of course, the idea of regularizing bilevel problems is not new. In particular, the fact that adding to the lower level objective function a small multiple of the upper level objective function can be used to select solutions is classical, see [10,15]; it has been used in [17], where an efficient numerical scheme is introduced for convex bilevel problems. The more recent article [4] proposes a smoothing strategy combining logarithmic penalties and Tikhonov regularization for two-stage stochastic programming. However, the approximation approach we propose is different in nature: it consists in approximating the value function of the leader by an integral with respect to a Gibbs measure. In the Euclidean setting, this yields a smooth approximation of this value (which may not even be lsc in the pessimistic case), for which we establish convergence results for an arbitrary sampling measure with full support, provided the parameters are chosen appropriately. Even though the two-scale Gibbs measures we consider involve the exponential of the sum of the lower level objective function with a small multiple of the upper level objective function, the concentration of these measures on suitable solutions of the lower level depends in a nonlinear way on the parameters.
The paper is organized as follows: the setting of our analysis is introduced in Section 2. In Section 3, we recall some basic facts on Gibbs measures and analyze the convergence of two-scale Gibbs measures in a simple case. In Section 4, we give a construction for the weights which guarantees convergence of the corresponding two-scale Gibbs measures under general assumptions. We then establish Γ-convergence results, first for the optimistic case in Section 5 and then for the pessimistic case in Section 6. Finally, Section 7 concludes with some remarks and examples.

Setting
Throughout this paper, X (strategy set of the leader) and Y (strategy set of the follower) will be two compact metric spaces. We will also assume that the cost functions of both the leader and the follower are continuous: ϕ ∈ C(X × Y) and ψ ∈ C(X × Y) will denote the cost function of the leader (who chooses x ∈ X) and of the follower (who chooses y ∈ Y), respectively. The Stackelberg problem is the program of the leader, which reads

inf { ϕ(x, y) : x ∈ X, y ∈ argmin_Y ψ(x, ·) }.   (2.1)

Under our assumptions, it is obvious that (2.1) admits at least one solution, but finding such solutions in practice is a challenging task due to the constraint y ∈ argmin_Y ψ(x, ·). Of course, one can rewrite (2.1) as a minimization problem with respect to x only:

inf_{x ∈ X} ϕ_*(x),   (2.2)   where   ϕ_*(x) := min { ϕ(x, y) : y ∈ argmin_Y ψ(x, ·) }.   (2.3)

Problem (2.1) is usually referred to as the optimistic problem since it assumes that, in case the follower has several optimal strategies, she will break ties by choosing one which is optimal for the leader. The pessimistic problem consists, on the contrary, in assuming that the follower actually breaks ties by choosing strategies which are the worst for the leader. The corresponding bilevel pessimistic program therefore consists in

inf_{x ∈ X} ϕ^*(x),   (2.4)   where   ϕ^*(x) := max { ϕ(x, y) : y ∈ argmin_Y ψ(x, ·) }.   (2.5)

In general, the pessimistic value ϕ^* is not lower semicontinuous (lsc), so (2.4) does not necessarily admit solutions; this makes the pessimistic problem more involved than the optimistic one and requires a suitable relaxation of ϕ^*. Our goal is to approximate the bilevel problem (2.1) by a family of unconstrained ones (we will also address the approximation of the pessimistic bilevel problem (2.4) in Section 6). We shall indeed prove that the somewhat rough function ϕ_* can be approximated by a family of more regular ones defined by an integral depending on a parameter.
By approximated we mean both in the pointwise sense and in the sense of Γ-convergence, which we recall below (see [5] or [8] for an overview of Γ-convergence and its applications).

Definition 2.1 Let F : X → R and, for every λ > 0, F_λ : X → R. Then F_λ is said to Γ-converge to F as λ → +∞ if the following two conditions are satisfied:
• for every x ∈ X and every family (x_λ)_{λ>0} converging to x as λ → +∞, one has the Γ-liminf inequality:
liminf_{λ→+∞} F_λ(x_λ) ≥ F(x);
• for every x ∈ X, there exists a (so-called recovery) family (x_λ)_{λ>0} converging to x as λ → +∞ such that the following Γ-limsup inequality holds:
limsup_{λ→+∞} F_λ(x_λ) ≤ F(x).

Our approximation is a variant of the celebrated Laplace method which, as far as we know, has not been investigated in the bilevel framework. First of all, we give ourselves a Borel probability measure ν on Y (which we will denote ν ∈ P(Y)). Recall that the support of ν, denoted spt(ν), is the smallest closed subset of Y having full mass for ν. We assume that ν has full support:

spt(ν) = Y.   (2.6)

For r > 0 and y ∈ Y, we denote by B_r(y) the closed ball of radius r centered at y and set, for every r ≥ 0,

α_ν(r) := inf_{y ∈ Y} ν(B_r(y)).   (2.7)

Note that the full support assumption (2.6) and the compactness of Y ensure that α_ν(r) > 0 for every r > 0. Note however that α_ν need be neither continuous nor strictly increasing (take Y finite for instance). Given x ∈ X, λ > 0 and δ > 0, we consider the probability measure on Y

μ_{λ,δ}(dy|x) := Z_{λ,δ}(x) e^{−λ(ψ(x,y)+δϕ(x,y))} ν(dy),   (2.8)

where Z_{λ,δ}(x) is the normalizing constant which makes μ_{λ,δ}(·|x) a probability measure, i.e.

Z_{λ,δ}(x) := ( ∫_Y e^{−λ(ψ(x,y)+δϕ(x,y))} ν(dy) )^{−1}.
Our main result is that one can choose the secondary scale δ = δ_λ with

lim_{λ→+∞} δ_λ = 0 and lim_{λ→+∞} λδ_λ = +∞   (2.9)

in such a way that the family of functions x ↦ ∫_Y ϕ(x, y) μ_{λ,δ_λ}(dy|x) Γ-converges and converges pointwise to ϕ_* as λ → ∞. Our construction of δ_λ only depends on the function α_ν defined in (2.7) and on a modulus of continuity of ϕ and ψ; it will be detailed in Section 4.
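To make the construction concrete, here is a small numerical sketch of the Gibbs average defined via (2.8). All the data is hypothetical and not from the paper: a follower cost ψ with the two minimizers y = ±1, a leader cost ϕ, Y = [−1, 1], and ν uniform, discretized on a grid.

```python
import numpy as np

# Hypothetical data: follower cost psi with argmin {-1, +1}, leader cost phi.
# phi_lam(x) averages phi against (a grid discretization of) the two-scale
# Gibbs measure mu_{lam,delta}(dy|x) of (2.8).

def phi_lam(x, lam, delta, n=8001):
    y = np.linspace(-1.0, 1.0, n)
    psi = (y**2 - 1.0) ** 2        # follower cost, argmin = {-1, +1}
    phi = (y - x) ** 2             # leader cost
    e = psi + delta * phi          # two-scale potential
    w = np.exp(-lam * (e - e.min()))
    w /= w.sum()
    return float((w * phi).sum())

# For x = 0.5 the optimistic tie-break is y = +1, so the optimistic value
# is (1 - 0.5)^2 = 0.25; phi_lam(0.5) should approach it for large lam.
print(phi_lam(0.5, lam=1e4, delta=1e-2))
```

With λ = 10^4 and δ = λ^{−1/2} = 10^{−2}, the wrong minimizer y = −1 is suppressed by a factor of order e^{−λδΔϕ}, and the output is already close to 0.25.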

On standard Gibbs Measures
In this section, we temporarily leave the approximation of Stackelberg problems and focus on the asymptotic behavior of Gibbs measures. Given ν ∈ P(Y) with full support as in (2.6), λ > 0 and w ∈ C(Y), we define the Gibbs measure

ν_{λ,w}(dy) := e^{−λw(y)} ν(dy) / ∫_Y e^{−λw} dν.   (3.1)

Of course, ν_{λ,w} is unchanged if one adds a constant to w, so there is no loss of generality in normalizing w in some way, and the most natural way is to assume that its minimum is 0. The following elementary result will be used intensively in the sequel.

Lemma 3.1 Let w ∈ C(Y) and let ν_{λ,w} be defined by (3.1). Then, for every t > 0,

ν_{λ,w}({w > min_Y w + t}) ≤ e^{−λt/2} / ν({w ≤ min_Y w + t/2}).   (3.2)
Proof Since ν_{λ,w} is unchanged when a constant is added to w, we may assume that min_Y w = 0. Let us write

ν_{λ,w}({w > t}) = ∫_{{w>t}} e^{−λw} dν / ∫_Y e^{−λw} dν.

The numerator is at most e^{−λt} ν(Y) = e^{−λt}, while the denominator is bounded from below by ∫_{{w ≤ t/2}} e^{−λw} dν ≥ e^{−λt/2} ν({w ≤ t/2}), so putting everything together yields (3.2).
Since, for every λ, ν_{λ,w} is a probability measure and Y is compact, there is a sequence λ_n → ∞ and ν̄ ∈ P(Y) such that ν_{λ_n,w} weakly star converges to ν̄ as n → ∞. Normalizing w so that min_Y w = 0 and observing that the set {w > ε} is open, it follows from Portmanteau's theorem (see [3]) that

ν̄({w > ε}) ≤ liminf_{n→∞} ν_{λ_n,w}({w > ε}).

Hence, for every ε > 0, thanks to (3.2) and the fact that ν({w ≤ ε/2}) > 0 (because ν has full support and the minimum of w is 0), we get ν̄({w > ε}) = 0. Letting ε → 0+ and using the fact that w is continuous, we conclude that ν̄ concentrates on the set where w = 0. We thus recover the well-known fact that Gibbs measures concentrate on the set where the potential is minimal:

Corollary 3.2 Let w ∈ C(Y) and ν_{λ,w} be defined by (3.1). Then any weak star cluster point of ν_{λ,w} as λ → ∞ has its support in argmin_Y w.
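The exponential tail bound behind this concentration can be sanity-checked numerically. The sketch below uses a hypothetical potential w with min w = 0 and a grid discretization of the uniform measure on [0, 1]; it compares the Gibbs tail mass with the bound e^{−λt/2}/ν({w ≤ t/2}).

```python
import numpy as np

# Numerical sanity check (hypothetical w, discretized uniform nu on [0, 1]) of
# the elementary Gibbs tail bound: with min_Y w = 0,
#   nu_{lam,w}({w > t}) <= exp(-lam * t / 2) / nu({w <= t / 2}).

y = np.linspace(0.0, 1.0, 10001)
nu = np.full(y.size, 1.0 / y.size)   # uniform reference weights
w = np.abs(y - 0.4)                  # min w = 0, attained at y = 0.4

def tail_and_bound(lam, t):
    g = nu * np.exp(-lam * w)
    g /= g.sum()                     # Gibbs measure nu_{lam,w}
    tail = float(g[w > t].sum())
    bound = float(np.exp(-lam * t / 2) / nu[w <= t / 2].sum())
    return tail, bound

for lam in [10.0, 100.0]:
    tail, bound = tail_and_bound(lam, t=0.2)
    print(lam, tail, bound, tail <= bound)
```

The tail mass decays exponentially in λ and always sits below the bound, as the lemma predicts.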

Convergence of two-scale Gibbs Measures in a Simple Case
To understand how to approximate bilevel problems with Gibbs measures, we first have to understand the following question. Given two functions u and v in C(Y), we want to find a weight δ_λ with δ_λ → 0, λδ_λ → +∞ as λ → +∞ such that the two-scale Gibbs measure ν_{λ,u+δ_λ v} concentrates, as λ → ∞, on the double argmin set:

argmin_{argmin_Y u} v := { y ∈ argmin_Y u : v(y) ≤ v(z) for every z ∈ argmin_Y u }.   (3.3)

Note that, for fixed x ∈ X and for u(y) = ψ(x, y), v(y) = ϕ(x, y), the set above corresponds to the solutions of the lower level problem which minimize the leader's cost. Throughout this paragraph, as well as in Section 4, we will focus on this subproblem and will therefore consider functions u and v that depend only on the variable y ∈ Y. The convergence analysis for the leader's value function is postponed to Section 5 (optimistic case) and Section 6 (pessimistic case). We will give a general constructive answer ensuring the concentration on the double argmin set in Section 4 (depending on the moduli of continuity of u and v and the function α_ν in (2.7)). Yet, for now, we prefer to focus on a rather simple case where the explicit choice δ_λ := λ^{−1/2} works (as well as many other simple ones, see Remark 3.4 below). This simple case corresponds to the extra assumptions that both u and v are Hölder continuous and that the function α_ν is bounded from below by a power function. Denoting by dist the distance on Y and by diam(Y) its diameter, these assumptions mean that there exist C > 0 and β ∈ (0, 1] such that

max( |u(y) − u(z)|, |v(y) − v(z)| ) ≤ C dist(y, z)^β for all y, z ∈ Y,   (3.4)

and c > 0, q > 0 such that

α_ν(r) ≥ c r^q for every r ∈ [0, diam(Y)].   (3.5)

To shorten notations, let us set

ν_λ := ν_{λ, u + λ^{−1/2} v},   (3.6)

which corresponds to the two-scale Gibbs measure ν_{λ,u+δ_λ v} with δ_λ = λ^{−1/2}.

Proposition 3.3
Assume that u and v satisfy (3.4), that ν satisfies (3.5), and define ν_λ by (3.6). Then any weak star cluster point of ν_λ as λ → ∞ has its support in the double argmin set argmin_{argmin_Y u} v.
Proof To ease notations, let us normalize u and v in such a way that

min_Y u = 0 and min_{argmin_Y u} v = 0.   (3.7)

Also define w_λ := u + v/√λ and observe that (3.7) implies that min_Y w_λ ≤ 0 (evaluate w_λ at any point of the double argmin set). Let then λ_n → ∞ and ν̄ ∈ P(Y) be such that ν_{λ_n} weakly star converges to ν̄. Let ε > 0; for λ large enough, {u > ε} ⊂ {w_λ > min_Y w_λ + ε/2}. It then follows from the inequality (3.2) of Lemma 3.1 that ν_λ({u > ε}) → 0 as λ → +∞, so that, again thanks to Portmanteau's Theorem, ν̄({u > ε}) = 0 and ν̄ is supported by argmin_Y u = {u = 0}. In particular, with (3.7), v ≥ 0 on spt(ν̄). To conclude, we thus have to show that for every ε > 0, ν̄({v > ε}) = 0. Since u ≥ 0 and min_Y w_λ ≤ 0, we have

{v > ε} ⊂ { w_λ > min_Y w_λ + ε/√λ },

so using Lemma 3.1 again we get

ν_λ({v > ε}) ≤ e^{−√λ ε/2} / ν({ w_λ ≤ min_Y w_λ + ε/(2√λ) }).   (3.8)

Let y_λ be a point where w_λ achieves its minimum; then it follows from (3.4) that, for λ ≥ 1, w_λ ≤ min_Y w_λ + ε/(2√λ) in the ball of center y_λ and radius r_λ := (ε/(4C√λ))^{1/β}, so that

ν({ w_λ ≤ min_Y w_λ + ε/(2√λ) }) ≥ α_ν(r_λ) ≥ c (ε/(4C√λ))^{q/β},

where the last inequality follows from (3.5). With (3.8), this yields

ν_λ({v > ε}) ≤ c^{−1} (4C√λ/ε)^{q/β} e^{−√λ ε/2};

since for every ε > 0 the right hand side tends to 0 as λ → ∞, Portmanteau's Theorem again allows us to conclude that ν̄({v > ε}) = 0.
Remark 3.4 The choice δ_λ = λ^{−1/2} above is just for illustrative purposes and is by no means the only possible one, nor optimal in any sense. It is indeed straightforward to check, with the same proof as above, that under the assumptions of Proposition 3.3, any choice of δ_λ such that

δ_λ → 0 and λδ_λ / log(λ) → +∞ as λ → +∞   (3.9)

guarantees that the corresponding two-scale measures ν_{λ,u+δ_λ v} tend to concentrate on the double argmin set (3.3) as λ → +∞. In particular, any power choice for δ_λ, i.e. δ_λ = λ^{−γ} with γ ∈ (0, 1) (or much larger weights such as δ_λ = 1/log(λ), δ_λ = 1/log(log(λ))), ensures convergence to the double argmin set. Note that smaller weights such as δ_λ = log(λ)/λ violate condition (3.9). Assumptions (3.4) and (3.5) are essential if one wishes to use power-like weights. In Section 7, we will consider examples where α_ν is much smaller than a power function. In such cases, the choice δ_λ = λ^{−1/2} may rule out the desired convergence property (see Example 7.1). Even worse, it may be the case that no power-like weight gives convergence to the double argmin set (see Example 7.2).
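The role of the secondary scale can also be checked numerically. In the hypothetical example below (u smooth, hence Hölder, with argmin u = {−1, +1}; v selecting +1; ν uniform on [−1, 1]), the Gibbs measure with δ = 0 splits its mass between the two wells, while an admissible weight such as δ_λ = λ^{−1/2} singles out the double argmin.

```python
import numpy as np

# Hypothetical example in the setting of Proposition 3.3. Without the secondary
# scale (delta = 0) the Gibbs measure splits its mass between the two wells of
# u; with delta = lam**-0.5 it concentrates on the double argmin {+1}.

y = np.linspace(-1.0, 1.0, 8001)
u = (y**2 - 1.0) ** 2    # argmin u = {-1, +1}
v = (y - 1.0) ** 2       # v = 0 at +1 and v = 4 at -1

def mean_v(lam, delta):
    e = u + delta * v
    w = np.exp(-lam * (e - e.min()))
    w /= w.sum()
    return float((w * v).sum())

lam = 1e4
print(mean_v(lam, 0.0))           # mass split between the wells: close to 2
print(mean_v(lam, lam ** -0.5))   # concentration on the double argmin: close to 0
```

The integral of v against the two-scale measure drops from about (0 + 4)/2 = 2 (no tie-breaking) to nearly 0 (double argmin selected), which is exactly the selection effect the secondary scale is designed to produce.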

Choosing the Weights Under General Assumptions
Now we consider the general case where u and v are continuous and ν has full support. We wish to find secondary weights δ_λ satisfying (2.9) in such a way that, defining the two-scale Gibbs measures

ν_λ := ν_{λ, u + δ_λ v},   (4.1)

every weak star cluster point of ν_λ as λ → ∞ is supported by the double argmin set (3.3).
Since u is continuous and Y is compact, u is uniformly continuous on Y; hence

ω_u(t) := sup { |u(y) − u(z)| : (y, z) ∈ Y × Y, dist(y, z) ≤ t }, t ≥ 0,

satisfies ω_u(t) → 0 as t → 0+, and ω_u is a modulus of continuity of u in the sense that

|u(y) − u(z)| ≤ ω_u(dist(y, z)) for all (y, z) ∈ Y × Y,

and likewise for v and ω_v. Hence ω_1 := 2 max(ω_u, ω_v) is, for every δ ∈ [0, 1], a modulus of continuity of u + δv. Therefore, if ω_2 denotes the concave envelope of t ↦ ω_1(t) + t, we have

ω_2(t) ≥ ω_1(t) + t for every t ≥ 0.   (4.3)

By construction, ω_2 is strictly increasing and concave (hence continuous) on the whole of R_+ and ω_2(t) → 0 as t → 0+. From now on, we fix an increasing and concave modulus ω (possibly different from ω_2) such that (4.3) holds with ω in place of ω_2. We then denote by ℓ : R*_+ → R*_+ the inverse of ω, ℓ := ω^{−1}; by construction we thus have

ω(ℓ(t)) = t for every t > 0.   (4.4)

Recalling that α_ν is defined by (2.7), we define for every t > 0

θ(t) := lim_{s→t−} log( α_ν(ℓ(s)) ).

Note then that θ is nondecreasing, lsc (this is why we define it as a left limit) and θ(t) → −∞ as t → 0+.

Lemma 4.1 For λ > 0, set t_λ := inf { t > 0 : λt + θ(t) ≥ 0 } and δ_λ := 2√(t_λ). Then t_λ → 0 as λ → +∞ and δ_λ satisfies (2.9):

δ_λ → 0 and λδ_λ = 2λ√(t_λ) → +∞ as λ → +∞.   (4.8)

Proposition 4.2 Let δ_λ be defined as in Lemma 4.1 and ν_λ be defined by (4.1). Then any weak star cluster point of ν_λ as λ → ∞ has its support included in the double argmin set (3.3).
Proof Again we can normalize u and v so that (3.7) holds and set w_λ := u + δ_λ v. Since δ_λ → 0, one can proceed as in the proof of Proposition 3.3 to show that, for every ε > 0, ν_λ({u > ε}) tends to 0 as λ → +∞, and thus deduce that any weak star cluster point of ν_λ as λ → ∞ has its support included in {u = 0} = argmin_Y u. To conclude that such weak star cluster points are in fact supported by the double argmin set (3.3), it is enough to show that, for every ε > 0, ν_λ({v > ε}) tends to 0 as λ → +∞. To prove this, we observe that since u ≥ 0, {v > ε} ⊂ {w_λ ≥ δ_λ ε} and, since min_Y w_λ ≤ 0 (because of (3.7)), we have {v > ε} ⊂ {w_λ ≥ min_Y w_λ + δ_λ ε}. Since ν_λ = ν_{λ,w_λ}, our basic inequality (3.2) in Lemma 3.1 gives:

ν_λ({v > ε}) ≤ e^{−λδ_λ ε/2} / ν({ w_λ ≤ min_Y w_λ + δ_λ ε/2 }).   (4.12)

Choose now λ large enough so that δ_λ ≤ 1; doing so, ω is a modulus of continuity of w_λ. Hence, if y_λ is a minimum point of w_λ, the ball B_{ℓ(δ_λ ε/2)}(y_λ) is contained in { w_λ ≤ min_Y w_λ + δ_λ ε/2 }, hence by definition of ℓ, α_ν and θ we get

ν({ w_λ ≤ min_Y w_λ + δ_λ ε/2 }) ≥ α_ν(ℓ(δ_λ ε/2)) ≥ e^{θ(δ_λ ε/2)};

replacing in (4.12) gives

ν_λ({v > ε}) ≤ e^{−(λδ_λ ε/2 + θ(δ_λ ε/2))}.   (4.14)

For λ large enough, δ_λ ≤ ε, so that δ_λ ε/2 ≥ δ_λ²/2 = 2t_λ and, using the monotonicity of θ (together with θ(t) ≥ −λt for t > t_λ, by definition of t_λ):

λδ_λ ε/2 + θ(δ_λ ε/2) ≥ λ√(t_λ) ε + θ(2t_λ) ≥ λ√(t_λ) ε − 2λt_λ = λ√(t_λ) (ε − 2√(t_λ)),

and since the right hand side tends to +∞ as λ → +∞ thanks to (4.8), we obtain that for every ε > 0, ν_λ({v > ε}) → 0 as λ → +∞, and, removing the normalization (3.7), by construction for every ε > 0:

ν_λ({ v > min_{argmin_Y u} v + ε }) ≤ e^{−(λδ_λ ε/2 + θ(δ_λ ε/2))}.   (4.15)

Now observe that the quantity in (4.15) only depends on u and v through the function θ, i.e. through the modulus ω. So the speed at which ν_λ({v > min_{argmin_Y u} v + ε}) converges to 0 is uniform with respect to u and v admitting ω as a modulus of continuity. This uniform behavior will be useful in Section 6, which is devoted to the pessimistic case.

Remark 4.4
Let us also emphasize that what really matters for convergence is the fact that the left-hand side of (4.14) tends to 0 as λ → +∞ for every ε > 0, and not really the precise construction of δ_λ. Our construction based on Lemma 4.1 is just a cooking recipe which guarantees this property, and we are not claiming that it is optimal in any sense. In fact, the choice δ_λ = 2√(t_λ) in Lemma 4.1 is a bit arbitrary: taking δ_λ = t_λ^γ with γ ∈ (0, 1) or δ_λ = −t_λ log(t_λ) would have worked just as well.

The Optimistic Case

With δ_λ constructed as in Section 4, we consider, for every x ∈ X, the approximate value

ϕ_λ(x) := ∫_Y ϕ(x, y) μ_{λ,δ_λ}(dy|x),

where μ_{λ,δ} is the two-scale Gibbs measure defined in (2.8).

Theorem 5.1 Let ϕ_* be defined by (2.3) and ϕ_λ be as above. Then, as λ → +∞, ϕ_λ converges pointwise and Γ-converges to ϕ_* on X.

As a consequence of the previous Γ-convergence, we immediately have:

Corollary 5.2 Let ϕ_* be defined by (2.3) and ϕ_λ be as above. Then

min_X ϕ_λ → min_X ϕ_* as λ → +∞,

and if x_λ is a minimizer of ϕ_λ and x is a cluster point of (x_λ) as λ → +∞, then x is a minimizer of ϕ_*.

The Pessimistic Case
Let us now address the pessimistic case and the approximation via Gibbs measures of the pessimistic value function ϕ^* defined in (2.5). Since ϕ^* is not necessarily lsc, we have to relax it and consider its lsc envelope (that is, the largest lsc function lying below ϕ^* on X), which we denote by ϕ̄^* and which is given by

ϕ̄^*(x) := liminf_{x′→x} ϕ^*(x′).   (6.1)

The relevance of the lsc envelope of ϕ^* for pessimistic bilevel problems was already emphasized in [12] and [11].
We construct δ_λ exactly as in Section 5 and, for every x ∈ X, we consider the Gibbs measure:

μ^+_{λ,δ}(dy|x) := Z^+_{λ,δ}(x) e^{−λ(ψ(x,y) − δϕ(x,y))} ν(dy),

where Z^+_{λ,δ}(x) is the corresponding normalizing constant.
We then consider the approximations

ϕ^+_λ(x) := ∫_Y ϕ(x, y) μ^+_{λ,δ_λ}(dy|x).

Our convergence result concerning the pessimistic value is the following:

Theorem 6.1 Let ϕ^* be defined by (2.5) and ϕ^+_λ be as above. Then, as λ → +∞, ϕ^+_λ converges pointwise to ϕ^* and Γ-converges to its lsc envelope ϕ̄^* defined in (6.1).
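As a quick numerical illustration of the sign change that produces ϕ^+_λ, the following sketch evaluates the pessimistic approximation on hypothetical costs (not from the paper): ψ ties at y = ±1 and ϕ(x, y) = (y − x)², so at x = 0.5 the pessimistic value is max((1 − 0.5)², (−1 − 0.5)²) = 2.25.

```python
import numpy as np

# Sketch of the pessimistic approximation on hypothetical data: per the sign
# change described in the text, the exponent uses psi - delta * phi, so ties
# in argmin psi are broken against the leader. Y = [-1, 1], nu uniform.

def phi_plus_lam(x, lam, delta, n=8001):
    y = np.linspace(-1.0, 1.0, n)
    psi = (y**2 - 1.0) ** 2          # follower cost, ties at y = -1 and y = +1
    phi = (y - x) ** 2               # leader cost
    e = psi - delta * phi            # minus sign: worst tie-break for the leader
    w = np.exp(-lam * (e - e.min()))
    w /= w.sum()
    return float((w * phi).sum())

print(phi_plus_lam(0.5, lam=1e4, delta=1e-2))  # approaches 2.25 as lam grows
```

The same data fed to the optimistic formula (with + δϕ in the exponent) would instead return a value close to 0.25, which makes the optimistic/pessimistic gap at tie points visible.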

Examples and Remarks
In practice, our convergence results imply that the initial optimistic/pessimistic bilevel problems can be approximated by the (unconstrained) minimization of ϕ_λ/ϕ^+_λ for large λ. Since ϕ_λ and ϕ^+_λ are typically smooth, one can use, for instance, gradient descent methods to minimize these functions.
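As a toy illustration of this strategy (hypothetical costs, a grid quadrature standing in for ν, and a central finite-difference gradient; none of it from the paper), one can minimize the smooth surrogate ϕ_λ by plain gradient descent:

```python
import numpy as np

# Toy bilevel problem: psi(x,y) = (y - x)^2 gives the unique follower response
# y = x, and phi(x,y) = (x - 0.7)^2 + (y - 0.7)^2, so the bilevel optimum is
# x = 0.7. We run gradient descent on the smooth surrogate phi_lam.

def phi_lam(x, lam=2000.0, delta=None, n=2001):
    if delta is None:
        delta = lam ** -0.5               # secondary scale, as in Section 3
    y = np.linspace(-1.0, 2.0, n)         # grid quadrature standing in for nu
    psi = (y - x) ** 2
    phi = (x - 0.7) ** 2 + (y - 0.7) ** 2
    e = psi + delta * phi
    w = np.exp(-lam * (e - e.min()))
    w /= w.sum()
    return float((w * phi).sum())

x, step, h = 0.0, 0.2, 1e-4
for _ in range(200):
    grad = (phi_lam(x + h) - phi_lam(x - h)) / (2 * h)  # finite-difference gradient
    x -= step * grad
print(x)  # should land near the bilevel optimum 0.7
```

Because ϕ_λ is a smooth function of x (the Gibbs weights depend smoothly on x), the finite-difference gradient is well behaved; in higher dimensions one would replace the grid quadrature by sampling from ν.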
If one wishes to implement our approximation on concrete problems (which, at the moment, we leave for future work), a certain number of issues have to be seriously addressed, among them the choice of the reference measure ν as well as the numerical method used to efficiently compute the integrals which appear in ϕ_λ. But the first question that naturally comes to mind is the choice of the weight δ_λ for the secondary scale. Since δ_λ captures the tradeoff between the upper and the lower level, a small δ_λ will result in good accuracy for the follower's optimizing behavior but might take the leader's objective into account too slowly. We already saw that δ_λ cannot be chosen too small for the convergence to be guaranteed, but choosing it too large may affect the speed of convergence to the value function of the leader. A universally good choice for δ_λ is certainly impossible, and the aim of this final section is precisely to illustrate the behavior of our approximations on some particular examples.

A Case where δ_λ Cannot be too Small
We first consider an example where the assumption (3.5) of a power-like lower bound on α_ν is relaxed. This example shows that one cannot hope for a universal choice of δ_λ; in particular, δ_λ = λ^{−1/2} does not guarantee convergence to the double argmin set if α_ν happens to be too small near 0. Indeed, for this choice of weight, one obtains an upper bound on the mass that ν_λ gives to small neighbourhoods of 0 whose right-hand side goes to 0 as λ → +∞; we thus deduce that there exists a neighbourhood of 0 which has zero measure for any weak cluster point of ν_λ, and in particular none of these cluster points can concentrate on {0}. In an example like this one, one typically has θ(t) ∼ −1/t³, so applying the cooking recipe of Lemma 4.1 one finds t_λ ∼ λ^{−1/4}, and weights δ_λ which guarantee convergence are δ_λ = λ^{−γ} with γ ∈ (0, 1/4) or δ_λ = λ^{−1/4} log(λ).

A Case where δ_λ Cannot be Power-like
We now consider a variant of the previous example where any power choice for δ λ fails to guarantee convergence of the two-scale measure to the double argmin set.
and since γ < (1 + γ)/2, we reach the conclusion that ν_λ([−δ, δ]) tends to 0 as λ → ∞, ruling out the convergence of ν_λ to the Dirac mass at 0 (recall that {0} is the double argmin set in this example). In other words, no power-like secondary weight gives convergence. But, using Lemma 4.1 and Remark 4.4, the (very slowly decaying!) weight δ_λ = log(log(log(λ)))/log(log(λ)) ensures the desired convergence. Of course, one may think that it is crazy to use such a pathological reference measure, but a rough behavior of u, v or of the boundary of the set Y in higher dimensions may generate similar pathologies as well.

A Case Where the Pessimistic Value is not lsc
The following explicit example illustrates the convergence of the approximations in a case where the pessimistic value is not lsc. Since ϕ and ψ are Lipschitz and ν is the Lebesgue measure, any choice of δ_λ of the form δ_λ = λ^{−γ} with γ ∈ (0, 1) ensures the validity of our convergence results.
In the optimistic case, the leader minimizes the optimistic value ϕ_*(x) = x, and the solution is 0. The approximation scheme proposed in the paper can then be computed explicitly for any given λ > 0. In the pessimistic case, the leader minimizes the pessimistic value, which is not lsc at 0, and the infimum of ϕ^*, which is 0, is not achieved. Note that the lsc envelope of ϕ^* coincides with the optimistic value ϕ_*. In this case as well, the approximation can be computed explicitly for any given λ > 0. We know that ϕ^+_λ converges pointwise to ϕ^* and Γ-converges to ϕ_* as λ → +∞. The pointwise convergence is of course slower near 0, and we have tested various exponents for δ_λ (the square root as above, but also γ = 1/4, for which the convergence is even slower, and γ = 3/4, which seems to give more accurate approximations) (Figs. 1, 2, 3 and 4).

The Choice of δ_λ is Critical in Practice
Even in the case where argmin ψ(x, ·) is a singleton for every x ∈ X, so that optimistic and pessimistic solutions coincide, the optimistic and pessimistic λ-approximations converge, with different convergence speeds, to the common value ϕ_* = ϕ^*, as shown in the next example, in which X is two-dimensional and for which the choice of δ_λ seems to be crucial for practical convergence. The Stackelberg solution is x̄ = (0, 0) and ȳ = 0. Since ϕ and ψ are Lipschitz, we can choose any power function for δ_λ, δ_λ = λ^{−γ}. Our approximation is given by:

ϕ_λ(x) = ∫_Y ϕ(x, y) e^{−λψ(x,y) − λ^{1−γ}ϕ(x,y)} dy / ∫_Y e^{−λψ(x,y) − λ^{1−γ}ϕ(x,y)} dy

(and the pessimistic approximation ϕ^+_λ is given by a similar formula, just by changing the sign of the term involving ϕ), which converges as λ → +∞ to ϕ_*, which achieves its minimum at (0, 0). We illustrate the convergence with various exponents and values of λ. The convergence turns out to be very bad for γ = 1/2 but very good for γ close to 1, as in the case γ = 9/10 (Figs. 5, 6, 7, 8, 9 and 10).