Asymptotic convergence rates for averaging strategies

Parallel black box optimization consists in estimating the optimum of a function using $\lambda$ parallel evaluations of $f$. Averaging the $\mu$ best individuals among the $\lambda$ evaluations is known to provide better estimates of the optimum of a function than just picking up the best. In continuous domains, this averaging is typically just based on (possibly weighted) arithmetic means. Previous theoretical results were based on quadratic objective functions. In this paper, we extend the results to a wide class of functions, containing three times continuously differentiable functions with a unique optimum. We prove formal rates of convergence and show they are indeed better than pure random search asymptotically in $\lambda$. We validate our theoretical findings with experiments on some standard black box functions.


INTRODUCTION
Finding the minimum of a function $f$ from a set of points $(x_i)_{i \le \lambda}$ and their images $(f(x_i))_{i \le \lambda}$ is a standard task used for instance in hyper-parameter tuning [8], or control problems. While the random search estimate of the optimum consists in returning $\arg\min_{x \in \{x_1, \dots, x_\lambda\}} f(x)$, in this paper we focus on the similar strategy that consists in averaging the $\mu$ best samples, i.e. returning $\frac{1}{\mu}\sum_{i=1}^{\mu} x_{(i)}$, where $f(x_{(1)}) \le \dots \le f(x_{(\lambda)})$.
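The two strategies can be sketched in a few lines; this is our own minimal illustration (uniform sampling in a ball, as in the paper's theoretical setting, and a sphere objective), not the paper's code, and all names are illustrative:

```python
import numpy as np

def sample_ball(rng, lam, d, r=1.0):
    """Draw lam points uniformly from the closed ball B(0, r) in R^d."""
    x = rng.standard_normal((lam, d))
    x /= np.linalg.norm(x, axis=1, keepdims=True)  # uniform direction on the sphere
    u = rng.random(lam) ** (1.0 / d)               # radius via inverse-CDF sampling
    return r * x * u[:, None]

def mu_best_average(f, points, mu):
    """Average of the mu best samples; mu = 1 recovers random search."""
    order = np.argsort([f(x) for x in points])
    return points[order[:mu]].mean(axis=0)

rng = np.random.default_rng(0)
sphere = lambda x: float(x @ x)
pts = sample_ball(rng, lam=1000, d=3)
x_rs = mu_best_average(sphere, pts, mu=1)    # random search recommendation
x_avg = mu_best_average(sphere, pts, mu=25)  # mu-best averaging recommendation
```

With `mu=1` the function returns exactly the best sample, so the two strategies differ only in the number of averaged points.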
These kinds of strategies are used in many evolutionary algorithms such as CMA-ES. Although experiments show that these methods perform well, it is still not understood why taking the average of the best points actually leads to a lower regret. In [23], it is proved in the case of quadratic functions that the regret is indeed lower for the averaging strategy than for pure random search. In this paper, we extend that result by proving convergence rates for a wide class of functions, including three times continuously differentiable functions with a unique optimum.

Related work
1.1.1 Better than picking up the best. Given a finite number of samples equipped with their fitness values, we can simply pick up the best, or average the "best ones" [9,23], or apply a surrogate model [7,12,16,19,27,29]. Overall, picking up the best is quite robust, but the surrogate or the averaging usually provides better convergence rates. Using surrogate modeling is fast when the dimension is moderate and the objective function is smooth (simple regret in $O(\lambda^{-\alpha/d})$ for $\lambda$ points in dimension $d$ with $\alpha$ times differentiability, leading to superlinear rates in evolutionary computation [7]). In this paper, we are interested in the rates obtained by averaging the $\mu$ best samples for a wide class of functions. We extend the results of [23], which only hold for the sphere function.
1.1.2 Weighted averaging. Among the various forms of averaging, [30] proposed to take into account the fact that the sampling is not uniform (evolutionary algorithms in continuous domains typically use Gaussian sampling): we here simplify the analysis by considering a uniform sampling in a ball, though we acknowledge that this introduces the constraint that the optimum is indeed in the ball. [3,6] have proposed weights depending on the fitness value, though they acknowledge a moderate impact: we here consider equal weights for the $\mu$ best.
1.1.3 Choosing the selection rate. The choice of the selection rate $\mu/\lambda$ is quite debated in evolutionary computation: one can find $\mu = \lambda/7$ [17], $\mu = \lambda/2$ [11], $\mu = 0.27\lambda$ [10], $\mu = \lambda/4$ [20], $\mu = \min(d, \lambda/4)$ [18,31] and still others in [9,22]. In this paper, we focus on the selection rate when the number of samples $\lambda$ is very large, as in parallel optimization. In this case, the selection ratio tends to 0. We carefully analyze this ratio and derive convergence rates using this selection ratio.
1.1.4 Taking into account many basins. While averaging the $\mu$ best samples, the non-uniqueness of the optimum might lead to averaging points coming from different basins. Thus we consider at first the case of a unique optimum and hence a unique basin; we then aim to tackle the case where there are possibly several basins. Island models [28] have also been proposed for taking different basins into account. [23] has proposed a tool for adapting $\mu$ depending on the (non-)quasi-convexity. In the present work, we extend the methodology proposed in [23].

Outline
In the present paper, we first introduce, in Section 2, the large class of functions we will study, and establish some useful properties of these functions in Section 3. Then, in Section 4, we prove upper and lower convergence rates for random search on these functions. In Section 5, we extend [23] by showing that, asymptotically in the number of samples $\lambda$, the handled functions satisfy a better convergence rate than random search. We then extend our results to wider classes of functions in Section 6. Finally, we validate our theoretical findings experimentally and compare with other parallel optimization methods.

BEYOND QUADRATIC FUNCTIONS
In the present section, we present the assumptions under which we extend the results from [23] to the non-quadratic case. We will denote $B(0,r)$ the closed ball centered at $0$ of radius $r$ in $\mathbb{R}^d$, endowed with its canonical Euclidean norm denoted by $\|\cdot\|$. We will also denote by $\mathring{B}(0,r)$ the corresponding open ball. All other balls intervening in what follows will also follow that notation. For any subset $A \subset B(0,r)$, we will denote $\mathcal{U}(A)$ the uniform law on $A$.

Let $f : B(0,r) \to \mathbb{R}$ be a continuous function for which we would like to find an optimum point $x^\star$. The existence of such an optimum point is guaranteed by continuity on a compact set. For the sake of simplicity, we assume that $f(x^\star) = 0$. We define the $h$-level sets of $f$ as follows.

Definition 1. Let $f : B(0,r) \to \mathbb{R}$ be a continuous function. The closed sublevel set of $f$ of level $h$ is defined as $L_h = \{x \in B(0,r) : f(x) \le h\}$.

We now describe the assumptions we will make on the function $f$ that we optimize.

Assumption 1. $f : B(0,r) \to \mathbb{R}$ is a continuous function and admits a unique optimum point $x^\star$ such that $\|x^\star\| < r$. Moreover we assume that $f$ can be written $f(x) = (x - x^\star)^\top H (x - x^\star) + a(x)\,\|x - x^\star\|^{\alpha}$ for some bounded function $a$ (there exists $M > 0$ such that for all $x$, $|a(x)| \le M$), $H$ a symmetric positive definite matrix and $\alpha > 2$ a real number.
Note that $a$ is uniquely defined by the previous relation. In the following we will denote by $\lambda_1(H)$ and $\lambda_d(H)$ respectively the smallest and the largest eigenvalue of $H$. As $H$ is positive definite, we have $0 < \lambda_1(H) \le \lambda_d(H)$. We will also set $\|x\|_H = \sqrt{x^\top H x}$, which is a norm (the H-norm) on $\mathbb{R}^d$ as $H$ is symmetric positive definite. We then have $f(x) = \|x - x^\star\|_H^2 + a(x)\,\|x - x^\star\|^{\alpha}$.

Remark 1 (Why a unique optimum?). The uniqueness of the optimum is a hypothesis required to prevent the selected samples from coming from two or more wells of $f$. In that case the averaging strategy could return a spurious point, because points from the different wells would be averaged. Nonetheless, multimodal functions can be tackled using our non-quasiconvexity trick (Section 6.2).
Remark 2 (Which functions satisfy Assumption 1?). One may wonder whether Assumption 1 is restrictive or not. We can remark that three times continuously differentiable functions satisfy the assumption with $\alpha = 3$, as long as the unique optimum satisfies a strict second-order stationarity condition. Also, we will see in Section 6.1 that the results are immediately valid for strictly increasing transformations of any $f$ for which Assumption 1 holds, so that we indirectly include all piecewise linear functions as well, as long as they have a unique optimum. So the class of functions is very large and, in particular, allows non-symmetric functions to be treated, which might seem counter-intuitive at first.
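As a concrete toy instance of Assumption 1 (our own example, with arbitrary numbers): a non-quadratic $f(x) = (x - x^\star)^\top H (x - x^\star) + a(x)\|x - x^\star\|^{\alpha}$ with a bounded perturbation $a$ and $\alpha = 3$ can be built as follows.

```python
import numpy as np

d = 2
x_star = np.array([0.1, -0.2])        # unique optimum, inside the ball
H = np.array([[2.0, 0.5],
              [0.5, 1.0]])            # symmetric positive definite
alpha = 3.0                           # alpha > 2

def a(x):
    return 0.5 * np.sin(x.sum())      # bounded perturbation: |a(x)| <= 0.5

def f(x):
    y = x - x_star
    return float(y @ H @ y + a(x) * np.linalg.norm(y) ** alpha)

eigs = np.linalg.eigvalsh(H)          # lambda_1(H), lambda_d(H)
```

Near $x^\star$ the cubic perturbation is negligible compared to the quadratic part, which is exactly what makes the analysis below go through.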
We then introduce the $\mu$-best average $\bar{x}^{(\mu)} = \frac{1}{\mu}\sum_{i=1}^{\mu} x_{(i)}$. In the remainder of the paper, we will compare the standard random search algorithm (i.e. $\mu = 1$) with the algorithm that consists in returning the average of the $\mu$ best points. To this end, for functions satisfying Assumption 1, we will study the expected simple regret $\mathbb{E}_{x_1, \dots, x_\lambda \sim \mathcal{U}(B(0,r))}\left[f\left(\bar{x}^{(\mu)}\right)\right]$ (recall that $f(x^\star) = 0$).

TECHNICAL LEMMAS
In this section, we prove two technical lemmas on $f$ that will be useful to study the convergence of the algorithm. The first one shows that $f$ can be upper bounded and lower bounded by two spherical functions.

Lemma 1. Under Assumption 1, there exist two real numbers $0 < c \le C$ such that, for all $x \in B(0,r)$:
$c\,\|x - x^\star\|^2 \le f(x) \le C\,\|x - x^\star\|^2$. (1)
Moreover, any such $c$ and $C$ must satisfy $0 < c \le \lambda_1(H) \le \lambda_d(H) \le C$.
Proof. As $H$ is symmetric positive definite, we have the following classical inequality for the H-norm: $\lambda_1(H)\,\|y\|^2 \le \|y\|_H^2 \le \lambda_d(H)\,\|y\|^2$ for all $y \in \mathbb{R}^d$. Now set, for $x \in B(0,r) \setminus \{x^\star\}$, $g(x) = f(x)/\|x - x^\star\|^2$. By the above inequalities, we have $\lambda_1(H) - M\,\|x - x^\star\|^{\alpha-2} \le g(x) \le \lambda_d(H) + M\,\|x - x^\star\|^{\alpha-2}$. As $\alpha > 2$, we have $\|x - x^\star\|^{\alpha-2} \to 0$ as $x \to x^\star$, and by assumption the function $a$ is bounded as $x \to x^\star$. We thus conclude that there exists $\delta > 0$ such that, for all $x \in B(x^\star, \delta) \setminus \{x^\star\}$, $\tfrac{1}{2}\lambda_1(H) \le g(x) \le 2\lambda_d(H)$. Now, $B(0,r) \setminus \mathring{B}(x^\star, \delta)$ is a closed subset of the compact set $B(0,r)$, hence it is also compact. Moreover, by assumption $f$ is continuous on $B(0,r)$ and $f(x) > 0 = f(x^\star)$ for all $x \ne x^\star$. Hence $g$ is continuous and positive on this compact set. Thus it attains its minimum and maximum on this set and its minimum is positive. In particular, we can write, on this set, $c_0 \le g(x) \le C_0$ for some $c_0, C_0 > 0$. We now set $c = \min\{c_0, \tfrac{1}{2}\lambda_1(H)\}$. Note that $c > 0$ because $c_0 > 0$ and $\lambda_1(H) > 0$ (as $H$ is positive definite). We also set $C = \max\{C_0, 2\lambda_d(H)\}$, which is also positive. These are global bounds for $g$, which gives the first part of the result. For the second part, let $u_1$ be a normalized eigenvector associated to $\lambda_1(H)$. Then $g(x^\star + t\,u_1) = \lambda_1(H) + a(x^\star + t\,u_1)\,|t|^{\alpha-2}$. Taking the limit as $t \to 0$, we get that, if $c$ satisfies (1), then $c \le \lambda_1(H)$. Similarly, we can prove that $C \ge \lambda_d(H)$. □

Secondly, we frame $L_h$ between two ellipsoids as $h \to 0$. This lemma is a consequence of the assumptions we make on $f$.

Lemma 2. Under Assumption 1, there exists $h_0 \ge 0$ such that for $h \le h_0$, we have $E_h^- \subset L_h \subset E_h^+$, where $E_h^\pm$ are ellipsoids for the H-norm, centered at $x^\star$, of squared H-radius $h\,(1 \pm c^\pm h^{(\alpha-2)/2})$ when $h \to 0$, for some constants $c^- > 0$ and $c^+ > 0$ which are respectively a (specific) lower and upper bound for $a$.

BOUNDS FOR RANDOM SEARCH
In this section we provide upper bounds and lower bounds for the random search algorithm for functions satisfying Assumption 1. These bounds will also be useful for analyzing the convergence of the $\mu$-best approach.

Upper bound
First, we prove an upper bound for functions satisfying Assumption 1.
Lemma 3 (Upper bound for random search). Let $f$ be a function satisfying Assumption 1. There exist a constant $c_0 > 0$ and an integer $\lambda_0 \in \mathbb{N}$ such that for all integers $\lambda \ge \lambda_0$: $\mathbb{E}\left[f(x_{(1)})\right] \le c_0\,\lambda^{-2/d}$.

Proof. Let us first recall the following classical property of the expectation of a nonnegative random variable: $\mathbb{E}[X] = \int_0^{\infty} \mathbb{P}(X > t)\,dt$. By independence of the samples we have $\mathbb{P}\left(f(x_{(1)}) > t\right) = \mathbb{P}\left(f(x_1) > t\right)^{\lambda}$. Then, thanks to Lemma 1, $\mathbb{P}\left(f(x_1) > t\right) \le \mathbb{P}\left(C\,\|x_1 - x^\star\|^2 > t\right)$, and the integral can be truncated because $\|x_1 - x^\star\| \le 2r$ almost surely. Then, by definition of the uniform law and the nonnegativity of the integrand, the integral can be bounded explicitly. The first term has a closed form, in terms of the Gamma function, given in [23]. Finally, thanks to the Stirling approximation, we conclude. □

This lemma proves that the strategy consisting in returning the best sample (i.e. random search) has an upper rate of convergence of order $\lambda^{-2/d}$, which depends on the dimension $d$ of the space. It is also worth noting that this rate is common in the literature [8,27].
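The $\lambda^{-2/d}$ scaling can be checked numerically on the sphere in dimension $d = 2$, where the regret of the best of $\lambda$ uniform samples should scale like $\lambda^{-1}$. This is a sanity-check sketch of ours, not one of the paper's experiments:

```python
import numpy as np

rng = np.random.default_rng(0)
d, reps = 2, 200

def mean_best_regret(lam):
    """Empirical E[min_i f(x_i)] for f(x) = ||x||^2, x_i uniform in B(0, 1)."""
    total = 0.0
    for _ in range(reps):
        x = rng.standard_normal((lam, d))
        x /= np.linalg.norm(x, axis=1, keepdims=True)   # direction
        x *= rng.random(lam)[:, None] ** (1.0 / d)      # radius
        total += float(np.min(np.sum(x * x, axis=1)))
    return total / reps

r_small, r_big = mean_best_regret(100), mean_best_regret(1000)
slope = np.log(r_small / r_big) / np.log(10.0)   # estimate of the exponent 2/d = 1
```

In this special case the expectation is even known in closed form ($1/(\lambda+1)$), so the empirical slope should be very close to $1$.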

Lower bound
We now give a lower bound for the convergence of the random search algorithm. We also prove a conditional expectation bound that will be useful for the analysis of the $\mu$-best averaging approach. In the following, $f_{\max}$ denotes the maximum of $f$ on $B(0,r)$.

Lemma 4 (Lower bound for random search). Let $f$ be a function satisfying Assumption 1. There exist a constant $c_1 > 0$ and $\lambda_1 \in \mathbb{N}$ such that for all integers $\lambda \ge \lambda_1$, we have the following lower bound for random search: $\mathbb{E}\left[f(x_{(1)})\right] \ge c_1\,\lambda^{-2/d}$. Moreover, let $(\mu_\lambda)_{\lambda \in \mathbb{N}}$ be a sequence of integers such that $\forall \lambda \ge 2$, $1 \le \mu_\lambda \le \lambda - 1$ and $\mu_\lambda \to \infty$. Then, there exist a constant $c_2 > 0$ and $\lambda_2 \in \mathbb{N}$ such that for all $h \in [0, f_{\max}]$ and $\lambda \ge \lambda_2$, we have the following lower bound when the sampling is conditioned: $\mathbb{E}\left[f(x_{(1)}) \mid f(x_{(\mu_\lambda+1)}) = h\right] \ge c_2\,h\,\mu_\lambda^{-2/d}$.

Proof. Using Lemma 1, we get $\mathbb{P}\left(f(x_{(1)}) > t\right) \ge \mathbb{P}\left(c\,\|x_1 - x^\star\|^2 > t\right)^{\lambda}$. We can decompose the integral to obtain the announced bound, where the last inequality follows by Stirling's approximation applied to the first term and because the second term is $o(\lambda^{-2/d})$, as in the previous proof. This concludes the proof of the first part of the lemma. Let us now treat the case of the conditional inequality. Using the same first identity as above, we have $\mathbb{E}\left[f(x_{(1)}) \mid f(x_{(\mu_\lambda+1)}) = h\right] = \int_0^{\infty} \mathbb{P}\left(f(x_{(1)}) > t \mid f(x_{(\mu_\lambda+1)}) = h\right) dt$.

Remark 3. Note that if we sample $\lambda$ independent variables $x_1, \dots, x_\lambda$ while conditioning on $f(x_{(\mu_\lambda+1)}) = h$ and keep only the $\mu_\lambda$ best variables, i.e. those $x$ such that $f(x) \le h$, this is exactly equivalent to sampling $\mu_\lambda$ variables directly and independently from the $h$-level set $L_h$. This result was justified and used in [23] in their proofs.
Hence we obtain $\mathbb{P}\left(f(x_{(1)}) > t \mid f(x_{(\mu_\lambda+1)}) = h\right) = \mathbb{P}_{x \sim \mathcal{U}(L_h)}\left(f(x) > t\right)^{\mu_\lambda}$. Using Lemma 1 and the inclusion $L_h \subset B\left(x^\star, \sqrt{h/c}\right)$, which is also a consequence of Lemma 1, we can lower bound this probability explicitly in terms of $t$, $h$ and $\mu_\lambda$. We then get $\mathbb{E}\left[f(x_{(1)}) \mid f(x_{(\mu_\lambda+1)}) = h\right] \ge c_2\,h\,\mu_\lambda^{-2/d}$ for $\lambda$ sufficiently large.

□
This lemma, along with Lemma 3, proves that for any function satisfying Assumption 1, the rate of convergence of random search is exactly of order $\lambda^{-2/d}$, where $\lambda$ is the number of points sampled to estimate the optimum; for a fixed target precision, the required number of samples thus grows exponentially with the dimension $d$.

Remark 4 (Convergence of the distance to the optimum).
It is worth noting that, thanks to Lemma 1, the convergence rates are also valid for the squared distance to the optimum $x^\star$.

CONVERGENCE RATES FOR THE $\mu$-BEST AVERAGING APPROACH
In this section we focus on the case where we average the $\mu$ best samples among the $\lambda$ samples. We first prove a lemma where the sampling is conditioned on the $(\mu+1)$-th best value.
Lemma 5. Let $f$ be a function satisfying Assumption 1. There exists a constant $c_3 > 0$ such that for all $h \in [0, f_{\max}]$ and all integers $\mu, \lambda$ such that $1 \le \mu \le \lambda - 1$, we have the following conditional upper bound: $\mathbb{E}\left[f(\bar{x}^{(\mu)}) \mid f(x_{(\mu+1)}) = h\right] \le c_3\left(h^{\alpha-1} + h/\mu\right)$.

Proof. We first decompose the expectation as follows.
where we have used the same argument as in Remark 3 in the first equality. We will treat the terms (3) and (4) independently. We first look at (3). We have the following "bias-variance" decomposition.

We then have by inclusion of sets
Note that the volume of the $d$-dimensional ellipsoid $E_h^+$ satisfies $\operatorname{vol}(E_h^+) = V_d\,\rho^+(h)^d / \sqrt{\det(H)}$, with $V_d = \operatorname{vol}(B(0,1))$ and $\rho^+(h)$ the H-radius of $E_h^+$, and similarly for $E_h^-$. From this we deduce by the squeeze theorem that $\operatorname{vol}(L_h) \sim V_d\,h^{d/2} / \sqrt{\det(H)}$ as $h \to 0$.

We now decompose the integral
(because $E_h^-$ is an ellipsoid centered at $x^\star$, hence the integral of $x - x^\star$ over it is $0$). We then upper bound using the triangle inequality for the H-norm. For the last equivalent, we used a Taylor expansion for the volumes of $E_h^+$ and $E_h^-$. We conclude that there exist $h_1 > 0$ and a constant $c > 0$, not depending on $\mu$ and $\lambda$, such that for $h \le h_1$ the bias term is at most $c\,h^{\alpha-1}$. Since $h$ is upper bounded by $f_{\max}$, the previous inequality can be extended to $h \in [0, f_{\max}]$, with a possibly larger constant still not depending on $\mu$ and $\lambda$. Let us now upper bound the remainder term (4). As $\mu \le \lambda$ by assumption, we can write the remainder in terms of the samples lying in $L_h$. We have $x_1, \dots, x_\mu \in L_h \subset E_h^+$, hence by the convexity of $E_h^+$ (which is a ball for the H-norm) we also have $\bar{x}^{(\mu)} \in E_h^+$, and thus, for $h$ sufficiently small, $\|\bar{x}^{(\mu)} - x^\star\|_H \le \rho^+(h)$. Note that $\rho^+(h) \sim c_0\sqrt{h}$, thus, for $h$ sufficiently small, $\|\bar{x}^{(\mu)} - x^\star\|_H \le 1$ almost surely; hence, as $\alpha > 2$, the higher-order term is dominated by the quadratic one almost surely. Since $h$ is upper bounded, there exists a constant $c' > 0$, not depending on $\mu$ and $\lambda$, controlling the remainder for all $h \in [0, f_{\max}]$. Thus we can upper bound the remainder with the same bounds as the ones for the main term (up to constants), for any $h \in [0, f_{\max}]$. We now group the main term and the remainder term to obtain a constant $c_3 > 0$, not depending on $\mu$ and $\lambda$, such that for all $h \in [0, f_{\max}]$, $\mathbb{E}\left[f(\bar{x}^{(\mu)}) \mid f(x_{(\mu+1)}) = h\right] \le c_3\left(h^{\alpha-1} + h/\mu\right)$.

□
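The ellipsoid volume formula used in the proof of Lemma 5, $\operatorname{vol}\{x : x^\top H x \le h\} = V_d\,h^{d/2}/\sqrt{\det H}$, can be sanity-checked by Monte Carlo in $d = 2$ (where $V_2 = \pi$); the matrix and level below are arbitrary examples of ours:

```python
import numpy as np

H = np.array([[2.0, 0.5],
              [0.5, 1.0]])                       # symmetric positive definite
h = 0.3
exact = np.pi * h / np.sqrt(np.linalg.det(H))    # V_2 * h^{2/2} / sqrt(det H)

rng = np.random.default_rng(0)
half = np.sqrt(h / np.linalg.eigvalsh(H)[0])     # the ellipse fits in [-half, half]^2
pts = rng.uniform(-half, half, size=(200_000, 2))
inside = np.einsum("ij,jk,ik->i", pts, H, pts) <= h
mc = float(inside.mean()) * (2.0 * half) ** 2    # Monte Carlo area estimate
```

The bounding box follows from $x^\top H x \ge \lambda_1(H)\|x\|^2$, so the rejection-sampling estimate converges to the exact area.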
We are now set to prove our main result, which is an upper convergence rate for the $\mu$-best approach. This is the main result of the paper.

Theorem 1. Let $f$ be a function satisfying Assumption 1. Let $(\mu_\lambda)_{\lambda \in \mathbb{N}}$ be a sequence of integers such that $\forall \lambda \ge 2$, $1 \le \mu_\lambda \le \lambda - 1$ and $\mu_\lambda \to \infty$. Then, there exist two constants $c, c' > 0$ and $\tilde{\lambda} \in \mathbb{N}$ such that for $\lambda \ge \tilde{\lambda}$, we have the upper bound: $\mathbb{E}\left[f(\bar{x}^{(\mu_\lambda)})\right] \le c\,(\mu_\lambda/\lambda)^{2(\alpha-1)/d} + c'\,(\mu_\lambda/\lambda)^{2/d}\,\mu_\lambda^{-1}$. In particular, choosing $\mu_\lambda = \lceil c''\,\lambda^{2(\alpha-2)/(d+2(\alpha-2))} \rceil$ for some $c'' > 0$, we obtain: $\mathbb{E}\left[f(\bar{x}^{(\mu_\lambda)})\right] \le c'''\,\lambda^{-2(\alpha-1)/(d+2(\alpha-2))}$ for some $c''' > 0$ independent of $\lambda$.
We note that $\mu_\lambda/\lambda \to 0$ as $\lambda \to \infty$. This makes sense intuitively: we average points in a sublevel set, which is sound only if, asymptotically in $\lambda$, this sublevel set shrinks to a neighborhood of the optimum.
Proof. The random variable $f(x_{(\mu_\lambda+1)})$ takes its values in $[0, f_{\max}]$ almost surely. As such, thanks to Lemma 5, there exists a constant $c_3 > 0$ such that for all $\lambda \ge 1$: $\mathbb{E}\left[f(\bar{x}^{(\mu_\lambda)})\right] \le c_3\left(\mathbb{E}\left[f(x_{(\mu_\lambda+1)})^{\alpha-1}\right] + \mathbb{E}\left[f(x_{(\mu_\lambda+1)})\right]/\mu_\lambda\right)$. Let us first bound $\mathbb{E}\left[f(x_{(\mu_\lambda+1)})\right]$. Thanks to Lemma 4, there exist a constant $c_2 > 0$ and $\lambda_2 \in \mathbb{N}$ such that $\mathbb{E}\left[f(x_{(1)})\right] \ge c_2\,\mu_\lambda^{-2/d}\,\mathbb{E}\left[f(x_{(\mu_\lambda+1)})\right]$. Thanks to Lemma 3, there exist a constant $c_0 > 0$ and an integer $\lambda_0 \in \mathbb{N}$ such that for all integers $\lambda \ge \lambda_0$: $\mathbb{E}\left[f(x_{(1)})\right] \le c_0\,\lambda^{-2/d}$. Then finally, for $\lambda \ge \max(\lambda_0, \lambda_2)$, $\mathbb{E}\left[f(x_{(\mu_\lambda+1)})\right] \le (c_0/c_2)\,(\mu_\lambda/\lambda)^{2/d}$. For the term $\mathbb{E}\left[f(x_{(\mu_\lambda+1)})^{\alpha-1}\right]$, we write, thanks to Lemma 4, a similar conditional lower bound; then, by Jensen's inequality for the conditional expectation, we get a bound involving $\mathbb{E}\left[f(x_{(1)})^{\alpha-1}\right]$. Similarly to Lemma 3, by replacing $\|x - x^\star\|^2$ by $\|x - x^\star\|^{2(\alpha-1)}$, one can show $\mathbb{E}\left[f(x_{(1)})^{\alpha-1}\right] \le c_3'\,\lambda^{-2(\alpha-1)/d}$ for some $c_3' > 0$ independent of $\lambda$. We thus get $\mathbb{E}\left[f(x_{(\mu_\lambda+1)})^{\alpha-1}\right] \le c\,\mu_\lambda^{2(\alpha-1)/d}\,\lambda^{-2(\alpha-1)/d}$ for some $c > 0$ independent of $\lambda$, which, combined with the above bound on $\mathbb{E}\left[f(x_{(\mu_\lambda+1)})\right]$, concludes the proof of the main bound. For the final bound, it suffices to notice that this choice of $\mu_\lambda$ ensures that the two terms in the upper bound are of the same order. □

This theorem gives an asymptotic upper rate of convergence for the algorithm that consists in averaging the $\mu$ best samples to optimize a function with $\lambda$ parallel evaluations. The proof of the optimality of the rate is left as further work. We also remark that the selection ratio $\mu_\lambda/\lambda$ depends on the dimension and goes to 0 as $\lambda \to \infty$. This is natural: the level sets might be asymmetric, and keeping a constant selection rate would then give a biased estimate of the optimum (see Figure 1). However, the choice proposed for $\mu_\lambda$ is the best one can make with regard to the upper bound we obtained. We make two important remarks about the theorem.
Remark 5 (Comparison with random search). The asymptotic rate obtained for the $\mu$-best averaging approach is of order $\lambda^{-2(\alpha-1)/(d+2(\alpha-2))}$, which is strictly better than the $\lambda^{-2/d}$ rate obtained with random search as soon as $d > 2$ (because $\alpha > 2$). This theorem then proves our claim on a wide range of functions.

Figure 1: Assume that we consider a fixed ratio $\mu/\lambda$ and that $\lambda$ goes to $\infty$. The average of selected points, in an unweighted setting and with uniform sampling, converges to the center of the area corresponding to the ratio $\mu/\lambda$: we will not converge to the optimum if that optimum is not the middle of the sublevel set. This explains why we need $\mu/\lambda \to 0$ as $\lambda \to \infty$: we do not want to stay at a fixed sublevel.
Remark 6 (Comparison with [23]). [23] obtained a rate of order $\lambda^{-1}$ for the sphere function. This rate is better than the one described in Theorem 1. This comes from the bias term in Lemma 5: for the sphere function, sublevel sets are symmetric, hence the bias term equals 0, which is not the case in general for functions satisfying Assumption 1. In this paper we are able to deal with potentially non-symmetric functions. One can remark that, if the sublevel sets are symmetric, the bias term vanishes and we recover the rate of [23].
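Taking the random search rate to be $\lambda^{-2/d}$ and the $\mu$-best averaging rate to be $\lambda^{-2(\alpha-1)/(d+2(\alpha-2))}$ (our transcription of Theorem 1), a short symbolic check confirms the comparisons made in Remarks 5 and 6: the averaging exponent strictly dominates for $d > 2$, and the two coincide at $d = 2$.

```python
from fractions import Fraction

def rs_exp(d):
    """Random search regret exponent: regret ~ lambda^(-rs_exp)."""
    return Fraction(2, d)

def avg_exp(d, alpha):
    """mu-best averaging regret exponent, as transcribed from Theorem 1."""
    return Fraction(2 * (alpha - 1), d + 2 * (alpha - 2))

# Example: d = 10, alpha = 3 (three times continuously differentiable f).
gap_d10 = avg_exp(10, 3) - rs_exp(10)   # 1/3 - 1/5 > 0: averaging wins
```

As $\alpha \to \infty$ the averaging exponent tends to $1$, recovering the $\lambda^{-1}$ order reported for the sphere in [23].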

HANDLING WIDER CLASSES OF FUNCTIONS
The results we proved are valid for functions satisfying Assumption 1. In particular, such functions are regular and have a unique optimum point. In this section, we propose to extend our results to wider classes of functions.

Invariance by composition with non-decreasing functions
Mathematical results are typically proved under some smoothness assumptions; however, algorithms enjoying some invariance to monotonic transformations of the objective function do converge on wider spaces of functions as well [1]. Since the method is based on comparisons between the samples, the ranks are invariant when the function is composed with a strictly increasing function $g$. Let $f$ be a function satisfying Assumption 1 and $g$ be a strictly increasing function. Consider $h = g \circ f$. Then $h$ admits a unique minimum $x^\star$, coinciding with the one of $f$. As such, the expectation $\mathbb{E}_{x_1, \dots, x_\lambda \sim \mathcal{U}(B(0,r))}\left[\|\bar{x}^{(\mu_\lambda)} - x^\star\|^2\right]$ satisfies the same rates as in Theorem 1. This is an immediate consequence of Lemma 1. In particular, using the squared distance criterion, the rates are preserved even for potentially non-regular functions. For example, our theorem can be adapted to convex piecewise-linear functions, compositions of quadratic functions with non-differentiable increasing functions, and many others. Results based on surrogate models are not applicable here.
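The invariance is easy to check numerically: composing the objective with any strictly increasing $g$ leaves the ranks, and hence the $\mu$-best average, unchanged. This is an illustrative sketch of ours:

```python
import numpy as np

rng = np.random.default_rng(0)
pts = rng.standard_normal((200, 3))

f = lambda x: float(x @ x)   # sphere: satisfies Assumption 1
g = np.expm1                 # strictly increasing, non-linear transform

def mu_best_average(obj, points, mu=20):
    order = np.argsort([obj(x) for x in points])
    return points[order[:mu]].mean(axis=0)

x_f = mu_best_average(f, pts)
x_gf = mu_best_average(lambda x: float(g(f(x))), pts)   # h = g o f
```

Because only comparisons between fitness values are used, the two recommendations are identical.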

Beyond unique optima: the convex hull trick, revisited
One of the drawbacks of averaging strategies is that they do not work when there are two basins of optima. For instance, if the two best points $x_{(1)}$ and $x_{(2)}$ have objective values close to those of two distinct optima $x_1^\star$, $x_2^\star$ respectively, then averaging $x_{(1)}$ and $x_{(2)}$ may result in a point whose objective value is close to neither. However, in the presence of quasi-convexity this problem disappears. It thus makes sense to take into account the possible obstructions to the quasi-convexity of the function and try to counter them, while still maintaining the same basic algorithm as in the case of a unique optimum. [23] proposed to take contradictions to quasi-convexity into account by restricting the number of points used in the averaging. Based on their ideas, we propose the following heuristic.
Let us fix the number of initially selected points equal to $\mu_{\max}$. Let $x_{(1)}, \dots, x_{(\mu_{\max})}$ be these points ranked from best to worst. Define $C_i = (x_{(1)}, \dots, x_{(i)})$ and $\mathring{C}_i$ the interior of the convex hull of $C_i$. Assume that there is no tie in fitness values, that is, no $i \ne j$ such that $f(x_{(i)}) = f(x_{(j)})$. Given $\mu_{\max}$, a first option is to choose $\mu$ maximal such that
$x_{(\mu+1)} \notin \mathring{C}_{\mu}$. (5)
One can remark that $x_{(i+1)} \in \mathring{C}_i$ implies that $f$ is not quasi-convex on $B(0,r)$. However, this may not detect all cases in which $f$ is not quasi-convex on $B(0,r)$. More generally,
$x_{(j)} \in \mathring{C}_i$ for some $j > i$ implies that $f$ is not quasi-convex. (6)
If such a $j$ is not $i+1$, Eq. (5) does not detect the non-quasiconvexity: therefore, (6) detects more non-quasiconvexities than Eq. (5). Therefore we choose $\mu$ maximal such that for all $i < \mu$ and all $j > i$, $x_{(j)} \notin \mathring{C}_i$. This heuristic leads to a choice of average which is consistent with the existence of multiple basins.
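One possible implementation of this heuristic follows; this is our reading of the rule, with the strict-interior test approximated by convex-hull membership via `scipy.spatial.Delaunay`, and point sets too small or degenerate to span a full-dimensional hull skipped:

```python
import numpy as np
from scipy.spatial import Delaunay

def select_mu(points_sorted, mu_max):
    """Cap mu at the first i whose convex hull of the i best points
    contains a later (worse) point -- a contradiction to quasi-convexity."""
    d = points_sorted.shape[1]
    for i in range(d + 1, mu_max + 1):
        try:
            hull = Delaunay(points_sorted[:i])
        except Exception:                    # degenerate (flat) configurations
            continue
        later = points_sorted[i:]
        if len(later) and np.any(hull.find_simplex(later) >= 0):
            return i                         # non-quasiconvexity detected
    return mu_max

# Unimodal-like case: the best points spiral outwards, no contradiction.
t = np.arange(1, 16, dtype=float)
spiral = np.stack([0.1 * t * np.cos(t), 0.1 * t * np.sin(t)], axis=1)

# Bimodal-like case: the 4th-best point lies inside the hull of the 3 best.
clash = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, 0.2]])
```

On the spiral, every later point is farther out than the hull of the better ones, so the full budget is kept; on the clashing configuration the heuristic truncates the selection at the detected contradiction.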

EXPERIMENTS
We divide the experimental section in two parts. In the first part, we focus on validating the theoretical findings; in the second, we compare with existing optimization methods.

Validation of theoretical findings
In this section, we will assume that $r = 1$ and that the optimum $x^\star$ is sampled uniformly in the ball of radius $0.9$. We compare results on the following functions: (1) the sphere function; (2) the Rastrigin function; (3) a perturbed sphere function built with $s(z) = z$ if $z > 0$ and $s(z) = -2z$ otherwise. This last function has highly non-symmetric sublevel sets, but still satisfies Assumption 1.
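The test functions can be sketched as below. The sphere and (shifted) Rastrigin forms are standard; the exact composition of the perturbed sphere is not fully specified in the text, so the variant here, an asymmetric cubic perturbation built from the stated $s$ and chosen to keep Assumption 1 with $\alpha = 3$, is our own illustrative stand-in:

```python
import numpy as np

def sphere(x, x_star):
    return float(np.sum((x - x_star) ** 2))

def rastrigin(x, x_star, A=10.0):
    """Standard shifted Rastrigin: multimodal, many local basins."""
    y = x - x_star
    return float(A * y.size + np.sum(y * y - A * np.cos(2 * np.pi * y)))

def s(z):
    return z if z > 0 else -2.0 * z

def perturbed_sphere(x, x_star):
    """Illustrative asymmetric variant: ||y||^2 + s(y_1)^3, so alpha = 3."""
    y = x - x_star
    return float(y @ y + s(y[0]) ** 3)

x_star = np.zeros(2)
left = perturbed_sphere(np.array([-0.1, 0.0]), x_star)
right = perturbed_sphere(np.array([0.1, 0.0]), x_star)
```

The asymmetry (`left > right` at equal distance from the optimum) is what makes the bias term of Lemma 5 non-zero for this function.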
We plot in Figure 2 the regret $f(\bar{x}^{(\mu)}) - f(x^\star)$ as a function of $\mu/\lambda$ for different dimensions $d$ and numbers of samples $\lambda$. The experiments are averaged over 30 runs. We remark, for instance on the Rastrigin function, that for the $\mu$-best averaging approach to be better than random search, we need a very large number of samples as the dimension increases. Overall, these plots validate our theoretical finding that averaging a few best points leads to a better regret than only taking the best one.

Comparison with other methods
In this section, we compare averaging strategies with other standard strategies, using the Nevergrad library [26]. Figure 3 presents experimental results based on Nevergrad. Instead of the uniform sampling used in the theoretical results and in the previous experimental validation, we use Gaussian sampling in this set of experiments. Following the notation from [23], we consider two distinct averaging prefixes: • AvgXX = method XX, plus averaging of the $\mu = \lambda/(1.1d)$ best points in dimension $d$. • HAvgXX = method XX, plus averaging of the $\lambda/(1.1d)$ best points, restricted by the convex hull trick (Section 6.2).
Experimental setup. We measure the simple regret and compare methods by their average frequency of winning against the other methods. For each test case, we randomly draw the optimum as $\mathcal{N}(0, I_d)$ (multivariate standard Gaussian), with budgets $\lambda \in \{30, 100, 300, 1000, 3000, 10000, 100000\}$ and dimensions $d \in \{3, 10, 30, 100, 300, 1000, 3000\}$. Due to their evaluation time, we did not run the cases with both $d = 3000$ and $\lambda = 100000$. We evaluated on 3 different functions: the sphere function, the Griewank function, and the highly multimodal function. Previous results from the literature [13] have already shown that replacing random sampling by scrambled Hammersley sampling (i.e. modern low-discrepancy sequences compatible with high dimension) leads to better results.
Analysis of results. Analyzing the table of results from Figure 3, we observe that: • Averaging performs well overall: AvgXX is better than XX; • The quasi-convex trick from Section 6.2 does work: HAvgXX is better than AvgXX; • The rescaling strategy from [24] outperforms the ones in [14] (MetaTuneRecentering better than MetaRecentering or than PlusMiddlePoint), which are already better than standard quasi-random sampling. Quasi-Opposite sampling is also competitive.

Figure 2: All results are averaged over 30 independent runs. We observe, consistently with our theoretical results and intuition, that (i) the optimal $\mu/\lambda$ decreases as $d$ increases, (ii) we need a smaller $\mu$ when the function is multimodal (Rastrigin), (iii) we need a smaller $\mu$ in case of dissymmetry at the optimum (perturbed sphere).
We also include various other methods present in the platform, including those based on Cauchy sampling or on Hammersley points without scrambling ("Hammersley" in the name without the "Scr" prefix), or on sophisticated uses of convex hulls for estimating the location of the optimum ("HCH" in the name).

CONCLUSION
We proved that averaging $\mu > 1$ points rather than picking up the best works even for non-quadratic functions, in the sense that the convergence rate is better than the one obtained by just picking up the best point. We also proved faster rates than methods based on meta-models (such as [27]), unless the objective function is very smooth and low dimensional. We also showed that our results cover a wider family of functions (Section 6.1). We also propose a rule for choosing $\mu$, depending on $\lambda$ and the dimension $d$. This shows that the optimal $\mu/\lambda$ ratio decreases to 0 as the dimension goes to infinity, which is confirmed by Fig. 2. We also note, by comparing with [23], that in the non-symmetric case the optimal ratio should be smaller (Fig. 1), which is confirmed by our experiments on the perturbed sphere (Fig. 2). We also propose a method for adapting this $\mu$ by automatically detecting non-quasi-convexity and reducing $\mu$ accordingly, and prove that it detects more non-quasiconvexities than the method proposed in [23]. Finally, we validate the approach on a reproducible open-sourced platform (Fig. 3).

Figure 3: Rows are sorted per average winning rate and we keep the 6 best ones. Zero is a naive method just choosing zero: we see that, consistently with [14], many methods are worse than that when the dimension is huge compared to the budget.

Further work
Using density-dependent weights as in [30] should allow us to get rid of the constraint $\|x^\star\| < r$ by using Gaussian sampling instead of uniform sampling. Better rates might be obtained with rank-dependent weights as in [3]. We also leave as further work the proof of the optimality of the rate for this strategy. Moreover, we believe better rates can be obtained for smoother functions, and leave this study for further work. The case of noisy objective functions [2] is critical. The study is harder there, and good evolutionary algorithms use large populations, making the overall algorithm closer to a small number of one-shot optimization algorithms: actually, some fast algorithms mainly use learning [4,5,15]. Population control [21] is successful and its last stage looks exactly like a one-shot optimization method.