On the uniqueness of local minima for general abstract nonlinear least-squares problems

The effectiveness of the inversion of a mapping φ defined on a set C by nonlinear least-squares techniques relies on, among other things, the uniqueness of local minima of the least-squares criterion, which ensures that the numerical optimisation algorithm, if it converges, converges towards the global minimum of the least-squares functional. The author defines a number γ, depending only on C and φ, which is strictly positive if the size of φ(C) is not too large with respect to its curvature, thus yielding the uniqueness of all local minima having a value smaller than γ. The condition γ > 0 requires neither convexity of C nor any monotonicity property of φ, but involves the computation of an infimum over ∂C × ∂C of first and second derivatives of φ. Numerical applications to the estimation of two parameters in a parabolic equation are given.


1. Introduction
Consider

E = normed vector space (norm ‖·‖_E)
F = pre-Hilbert space (scalar product (·,·)_F)
C = closed, C¹-path-connected subset of E
φ = C²-mapping of C into F
z ∈ F a given point

and the optimisation problem

    find x̂ ∈ C minimising J over C (1.1)

where J is the least-squares criterion

    J(x) = ‖φ(x) − z‖²_F. (1.2)

Problem (1.1) is the general least-squares setting of the problem

    find x̂ ∈ C such that φ(x̂) = z (1.3)

when the right-hand side z does not necessarily belong to the image set φ(C).
Our goal is to find conditions on C and φ such that problem (1.1) cannot admit two distinct local minima (and hence has at most one solution), provided that the distance from z to φ(C) is taken smaller than a certain number γ > 0. (1.4)

In this paper we will be able to ensure the uniqueness only of the local minima of J having a value smaller than γ (propositions 3 and 4), though we also seek conditions ensuring uniqueness of all local minima.
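To fix ideas, the two quantities at play in (1.4) — the least-squares criterion J and the distance from the data z to the image set φ(C) — can be sketched numerically. The forward map, data and sampling of C below are illustrative stand-ins (a simple circle map, not the paper's parabolic example), and the function names are ours:

```python
import numpy as np

def J(x, phi, z):
    """Least-squares criterion J(x) = ||phi(x) - z||^2 (here F = R^m with the
    Euclidean scalar product)."""
    r = phi(x) - z
    return float(r @ r)

def dist_to_image(z, phi, C_samples):
    """Crude approximation of d(z, phi(C)) by sampling points of C."""
    return min(np.linalg.norm(phi(x) - z) for x in C_samples)

# illustrative forward map phi(x) = (cos x, sin x) on C = [0, 3]
phi = lambda x: np.array([np.cos(x), np.sin(x)])
z = np.array([1.0, 0.0])                 # data lying on the image set
C_samples = np.linspace(0.0, 3.0, 301)
print(J(0.0, phi, z))                    # 0.0: x = 0 reproduces the data
print(dist_to_image(z, phi, C_samples))  # 0.0
```

With z in φ(C) the distance vanishes; property (1.4) concerns data z whose distance to φ(C) remains below γ.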
Let us now explain our motivations. The first question: what kind of applications have motivated the author to undertake this study? The answer is: parameter estimation problems. In this application, x is the parameter, C is the set of admissible parameters, z is the observed data and φ is the parameter-to-output mapping resulting from the resolution of the model state equations and the observation operator. Our concern is primarily with overspecified inverse problems, where dim F ≥ dim E, so that we can expect the derivative φ'(x) to be more or less injective from E into F. In order to be more specific, we can give an example.
Example 1. We consider the 1D parabolic equation (1.5), where the parameters a ∈ ℝ*₊ and b ∈ ℝ have to be estimated from the measurement z ∈ L²(0, T) of the solution u at a point y against time. Here we have x = (a, b) ∈ ℝ² = E; C is a given closed subset of ℝ*₊ × ℝ, which represents the a priori knowledge of the experimenter about the parameter x = (a, b); and φ is the mapping which makes the function t ↦ u(y, t) of L²(0, T) = F correspond to a given x = (a, b) ∈ C. In this example the evaluation of φ(x) involves the resolution of the parabolic equation (1.5); the problem is obviously overspecified, as dim F = +∞ > dim E = 2.

The second question: why do we address the problem of uniqueness of local minima? The only way of actually solving the parameter estimation problems described above is to undertake the minimisation of J over C on a computer. However, optimisation algorithms are only able to find local minima over a closed set. Hence the least-squares problem (1.1) will be practically solvable by an optimisation algorithm as soon as C is closed and J has at most one local minimum over C.
This will ensure that the optimisation algorithm, once converged, will give the sought global minimum of J. One can also remark that the uniqueness of local minima implies (but is not equivalent to) the uniqueness of the solution x̂ of problem (1.1), or, in terms of parameter estimation problems, the identifiability of x̂ from the knowledge of z and C. Of course, one other extremely important practical problem is that of the stability of the solution x̂ of (1.1) with respect to perturbations of the data z: this problem will not be addressed as such in this paper, but one can remark that, when C is compact, the above uniqueness property will ensure the existence of a unique x̂ depending continuously on z as long as the distance of z to φ(C) is taken small enough.
The third (and last) question: what kind of conditions on C and φ are we looking for? The first idea is that we want data-independent conditions: for a given set C and mapping φ, we want to be able to decide whether property (1.4) holds or not. If it holds, we will get as a by-product the upper limit γ > 0 to the distance of z from φ(C) for which the uniqueness property of local minima holds. If it does not hold, the experimenter will then have to acquire more data (i.e. change the mapping φ) and/or augment the a priori available information (i.e. diminish the size of C) before checking again for property (1.4). The second idea is that such conditions will in no way be cheap! As, in view of the applications, no hypothesis will be made on the shape of C and φ (no convexity, no monotonicity), the conditions will necessarily involve exploration all over C, which of course will require a lot of computer time as soon as the dimension of C, i.e. the number of unknown parameters, increases.
Nevertheless, we believe that such a condition will be practically useful for problems with few unknown parameters, and that it will at least help to understand what happens in nonlinear least-squares problems. As a test for the forthcoming sufficient condition for (1.4) to hold, we will add to example 1 an extremely simple example.
Example 2. Determine a real number x from the measurement (z₁, z₂) of its cosine and sine. Then we have

    φ(x) = (cos x, sin x). (1.6)

Of course, one has to restrict a priori the search for x to an interval of length smaller than 2π if we want the problem to have a chance of being well posed! So suppose we take for example

    C = [0, X] with X given, X < 2π. (1.7)

Then obviously problem (1.1) has a unique global minimum as soon as d(z, φ(C)) < γ = sin(X/2), as one can see in figure 1(a) for different data z. However, one sees also in figure 1 that there may exist, beside the global minimum, a distinct local minimum (with value larger than γ!), so that the solution of (1.1) by an optimisation algorithm may fail because condition (1.4) is not satisfied! In order to satisfy condition (1.4), it is sufficient to replace condition (1.7) by the stronger condition (1.8). Then, as seen in figure 1(b), condition (1.4) holds when

    d(z, φ(C)) < sin X. (1.9)

Conditions (1.8) plus (1.9) are clearly equivalent to (1.4), and will be used as a benchmark to indicate the precision of the condition that we will derive.

To conclude this introduction, we will recall a previous result of Spiess (1969), who considered exactly the same problem, namely the uniqueness of local minima of problem (1.1), but set on an open and convex set C. In fact he gave data-dependent sufficient conditions, i.e. conditions which, for a given datum z and a given local minimum x̂, imply that x̂ is a global minimum. These conditions, when translated into data-independent conditions, read as follows.

Spiess conditions. If C is an open convex subset of E, and φ is injective and C² over C (1.10), then J has at most one local minimum over the open set C as soon as d(z, φ(C)) < γ. So γ is strictly positive, and the sufficient condition is satisfied in this example. Of course, as C is taken open,
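The failure mode of figure 1 — a second local minimum at the far end of the interval, with value larger than γ — is easy to reproduce numerically. In this sketch (our choices of data and grid; the map is that of (1.6)) we scan the criterion over C = [0, X] and classify its local minima against γ = sin(X/2):

```python
import numpy as np

X = 5.5                      # length of the search interval C = [0, X], X < 2*pi
t = 0.1                      # the "true" parameter generating the data
phi = lambda x: np.array([np.cos(x), np.sin(x)])
z = phi(t)

xs = np.linspace(0.0, X, 2001)
J = np.array([np.linalg.norm(phi(x) - z) for x in xs])

# locate local minima of the sampled criterion (interval ends included)
mins = [i for i in range(len(xs))
        if (i == 0 or J[i] <= J[i - 1]) and (i == len(xs) - 1 or J[i] <= J[i + 1])]

gamma = np.sin(X / 2)        # the threshold gamma = sin(X/2) of the introduction
for i in mins:
    print(f"x = {xs[i]:.3f}  J = {J[i]:.3f}  below gamma: {J[i] < gamma}")
```

The minimum near x = 0.1 is the global one, while the boundary minimum at x = X has value larger than γ: exactly the situation in which an optimisation algorithm started too far to the right would be trapped.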
the Spiess condition does not eliminate the local minima which may arise on the boundary of C (as in figure 1(a)), so that this condition does not fully answer our second question. However, it may give a reasonable idea of the kind of condition we are going to derive below, as they share the property of containing an infimum over a couple of points (x, y) of C and over a path (here the [x, y] interval) connecting them.
Let us now be more technical and turn to the derivation of our sufficient condition. The hypotheses and notation given at the beginning of the introduction will hold throughout the rest of the paper and will not be repeated.

How to recognise the existence of two distinct local minima
Let x, y ∈ C, x ≠ y, be two such local minima (see figure 3). Using the hypothesis that C is C¹-path-connected, we may choose one C¹ path s going from x to y. From the properties of x and y, it is clear that one can find θ'₁ such that θ₀ < θ'₁ ≤ θ₁. From here two cases may occur: (1) there exists a point θ̄ of this interval such that f''(θ̄) < 0.

Proposition 1.
The first and second derivatives of f(θ) can then be expressed in a form which, together with (3.3) and the Cauchy-Schwarz inequality, yields (3.5). We now define a function g: [θ₀, θ'₁] → ℝ by the following 1D elliptic problem (3.6). We may remark that this function g is independent of z (whereas f was not!), and that it is a positive concave function. Plugging (3.6) into (3.5) then yields

which proves that the function θ ↦ f(θ)^{1/2} − g(θ) is convex, and hence satisfies (3.8). But on the other hand, from f''(θ̄) < 0 we get, using (3.3) and the Cauchy-Schwarz inequality, the bound (4.1). Then problem (1.1) has at most one local minimum with value smaller than γ as soon as d(z, φ(C)) < γ.
This condition does not look very useful. But before simplifying it somewhat and indicating which strategy S to choose, let us explain its meaning with a simple example.
Example 3. Suppose that C is convex (and hence C¹-path-connected!) and that φ is such that numbers α > 0, β > 0 exist with (4.2) and (4.3). Then from proposition 3 we get the following (weaker) sufficient condition (4.4): problem (1.1) has at most one local minimum with value smaller than γ as soon as d(z, φ(C)) < γ.
This result was already given in Chavent (1983). We now come back to the less constraining estimate (4.1) of proposition 3.

Choice of a strategy S
The problem is now to choose the strategy S, which associates to any couple (x, y) ∈ C × C a C¹ path s from x to y, in such a way that the number γ defined by (4.1) is the largest possible (and hence the 'size' condition on C the least restrictive possible).
For given x, y ∈ C, the choice of a path going from x to y can be conceptually split into two steps: (i) choose the geometry of the path; (ii) choose the time law, i.e. the parametrisation of the path. We will choose these two items separately.

For given x, y ∈ C and a given geometry of a path from x to y, how should the time law be chosen?
We consider first a particular parametrisation ŝ(θ) of the path from x to y, where θ is the curvilinear abscissa on the image path φ∘ŝ.

Such a parametrisation satisfies, by definition,

    ‖(φ∘ŝ)'(θ)‖ = ‖φ'(ŝ(θ)) · ŝ'(θ)‖ = 1 (5.1)

and will exist as soon as φ'(x) is injective everywhere over C. At points where φ'(x) is not injective, ŝ(θ) may still exist, but ŝ'(θ) will have to be infinite. In order to compare the numbers γ̂ and γ associated by (4.1) to the two parametrisations ŝ(θ) and s(θ) of the same geometrical path, we compare the arguments of the inf in (4.1).
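Numerically, a discretised path in parameter space can be resampled so that the new parameter is (approximately) the curvilinear abscissa θ on the image path, by inverting the cumulative image arc length. The map and the path below are illustrative choices of ours:

```python
import numpy as np

def by_image_arclength(phi, path, n_out=200):
    """Resample a discretised path s (array of shape (n, dim E)) so that the
    new parameter is approximately the curvilinear abscissa theta on the
    image path phi∘s."""
    path = np.asarray(path, dtype=float)
    imgs = np.array([phi(x) for x in path])
    seg = np.linalg.norm(np.diff(imgs, axis=0), axis=1)   # image chord lengths
    theta = np.concatenate([[0.0], np.cumsum(seg)])       # cumulative image arc length
    theta_new = np.linspace(0.0, theta[-1], n_out)
    # invert theta(t) coordinate-wise by piecewise-linear interpolation
    cols = [np.interp(theta_new, theta, path[:, k]) for k in range(path.shape[1])]
    return theta_new, np.stack(cols, axis=1)

# illustrative mapping phi and straight path in parameter space
phi = lambda x: np.array([x[0], x[1] ** 2])
t = np.linspace(0.0, 1.0, 1001)
segment = np.stack([t, t], axis=1)                        # path from (0,0) to (1,1)

theta, resampled = by_image_arclength(phi, segment)
d = np.linalg.norm(np.diff([phi(x) for x in resampled], axis=0), axis=1)
print(d.std() / d.mean())   # small: image now traversed at (nearly) constant speed
```

After resampling, the image points are (nearly) equally spaced, which is the discrete counterpart of the unit-speed property (5.1).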
(i) Obviously one gets the first comparison from (5.4) and (5.6). (ii) In order to compare g(θ) and ĝ(θ), we set the change of variables (5.8) and will compare g(θ) and ĝ(θ). One first checks easily that ĝ(θ) satisfies an equation analogous to the one defining g(θ); comparing the two yields the following conclusion.

Conclusion. For a given geometric path going from x to y, the best parametrisation, when γ is defined by (4.5), is obtained when θ is the curvilinear abscissa on the image path. In other cases, in particular when γ is defined by (4.1), the curvilinear abscissa is at least the most intrinsic parametrisation.
In the following, we will omit the hat on s, θ, etc, and θ will always denote the curvilinear abscissa on the image path.

How to choose the geometrical path from x to y
For given x, y ∈ ∂C, we are now looking for a path s from x to y, which will be parametrised by the curvilinear abscissa θ along the image path φ∘s, such that the quantity (5.17) appearing in (4.1) or (5.14) is maximum.
The first remark is that the quantity (5.17) depends only on the geometrical properties of the image path φ∘s going from φ(x) to φ(y): θ is the curvilinear abscissa along this path, ρ(θ) is the radius of curvature of this path and g(θ) is defined from θ₀, θ₁ and ρ(θ).
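These geometric quantities can be estimated directly from a discretisation of the image path. For instance, the radius of curvature ρ(θ) of a curve sampled at equal arc-length steps can be approximated by second differences; a minimal sketch (names are ours), checked on a circle where ρ is known:

```python
import numpy as np

def radius_of_curvature(points, h):
    """Approximate rho(theta) at interior samples of a curve whose points are
    equally spaced (step h) in arc length: rho = 1 / ||d^2 P / d theta^2||."""
    d2 = (points[2:] - 2.0 * points[1:-1] + points[:-2]) / h ** 2
    kappa = np.linalg.norm(d2, axis=1)   # curvature at interior samples
    return 1.0 / kappa

# check on a circle of radius R = 2 (arc-length step h = R * dt)
R, dt = 2.0, 1e-3
t = np.arange(0.0, 1.0, dt)
pts = R * np.stack([np.cos(t), np.sin(t)], axis=1)
rho = radius_of_curvature(pts, R * dt)
print(rho.mean())   # ≈ 2.0
```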
So we can replace (at least conceptually!) the task of choosing a path s from x to y in C by that of choosing a path S from φ(x) to φ(y) in φ(C) in such a way that the quantity (5.17) is maximised.
In this new setting the mapping φ is used, together with the set C, only for the definition of the set φ(C) in which the sought path S has to stay.
The second remark is that, whenever the segment [φ(x), φ(y)] is fully included in φ(C), then one can choose S = [φ(x), φ(y)], which yields ρ(θ) = +∞ and g(θ) = 0, hence γ = +∞, so that S is obviously the sought optimal solution!

The third remark is that, if one chooses a path S from φ(x) to φ(y) with both large radii of curvature and a large length θ₁ − θ₀, like the one depicted in figure 7, then the corresponding function g is maximum at the point θ̄ = ½(θ₀ + θ₁). So we see that paths S with both large ρ and large length are not optimal.
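The third remark can be checked on a model case with constant radius of curvature. The exact 1D elliptic problem defining g is that of (3.6); for this sketch we take the illustrative form −g''(θ) = 1/ρ(θ) with g(θ₀) = g(θ₁) = 0 — an assumption of ours, chosen to match the stated properties of g (positive, concave, determined by θ₀, θ₁ and ρ):

```python
import numpy as np

def solve_g(rho, theta0, theta1, n=400):
    """Finite-difference solution of the (assumed) 1D elliptic problem
    -g''(theta) = 1/rho(theta) on (theta0, theta1), g(theta0) = g(theta1) = 0."""
    theta = np.linspace(theta0, theta1, n)
    h = theta[1] - theta[0]
    # dense tridiagonal system for the n-2 interior unknowns
    A = (np.diag(2.0 * np.ones(n - 2)) - np.diag(np.ones(n - 3), 1)
         - np.diag(np.ones(n - 3), -1)) / h ** 2
    g = np.zeros(n)
    g[1:-1] = np.linalg.solve(A, 1.0 / rho(theta[1:-1]))
    return theta, g

# constant radius of curvature rho = 2 on (0, 4); under the assumed form,
# g is the parabola (theta - theta0)(theta1 - theta)/(2 rho)
theta, g = solve_g(lambda th: np.full_like(th, 2.0), 0.0, 4.0)
print(theta[np.argmax(g)])   # ≈ 2.0 (the midpoint)
print(g.max())               # ≈ 4^2 / (8 * 2) = 1.0
```

Under this assumed form, g peaks at the midpoint and its maximum grows quadratically with the path length, which is the behaviour invoked in the third remark.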
From the last two remarks, we may conjecture that the optimal S is the minimum-length path going from φ(x) to φ(y) in φ(C). But this still remains to be proved.

We can propose two strategies for the choice of the path s from x to y.

Strategy 1. Determine a path s from x to y in such a way that S = φ∘s is the minimum-length path in φ(C) going from φ(x) to φ(y). This procedure may (if our conjecture is true!) yield the optimal number γ, and hence the least constraining condition on the size of C. However, from the practical point of view, such a strategy seems very difficult to implement, as one would have to solve, for each couple x, y ∈ C, a complicated optimisation problem in a high-dimensional space.
Strategy 2. Choose s as the minimum-length path in C going from x to y. This procedure is surely non-optimal, but will guarantee that the corresponding image path S = φ∘s will not be too long, as soon as upper bounds on ‖φ'(x)‖ are available. Moreover, as the set C is defined by explicit constraints, and is usually of non-void interior, the minimum-length path in C from x to y can be determined relatively easily (in many cases it will be the [x, y] interval).
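With C convex, strategy 2 takes s as the [x, y] segment, and the resulting image path length can be computed by quadrature and compared with the bound sup ‖φ'‖ · ‖y − x‖. The map below is an illustrative stand-in:

```python
import numpy as np

def image_length(phi, x, y, n=1000):
    """Length of the image path phi∘s for s the [x, y] segment (strategy 2),
    approximated by summing image chord lengths."""
    ts = np.linspace(0.0, 1.0, n + 1)
    pts = np.array([phi((1 - t) * x + t * y) for t in ts])
    return float(np.linalg.norm(np.diff(pts, axis=0), axis=1).sum())

# illustrative map; along [x, y] its derivative has norm sqrt(1 + 4 t^2),
# so sup ||phi'|| = sqrt(5) bounds the image length by sqrt(5) * ||y - x||
phi = lambda v: np.array([np.sin(v[0]), v[0] * v[1], np.cos(v[1])])
x, y = np.array([0.0, 0.0]), np.array([1.0, 1.0])
L = image_length(phi, x, y)
print(L)   # ≈ 1.479
```

Such easily computed upper bounds on the image length are what makes strategy 2 practical even though it is non-optimal.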
To conclude this section, let us see, using the very simple example 2 of §1, how close our final condition (5.14)-(5.15) with strategy 1 or 2 comes to the solution of this example.
We have seen in §1 (see figure 1 and formula (1.7)) that the least-squares functional for the search of a real number x from measurements of its cosine and sine had a unique local minimum with value smaller than γ = sin(X/2) (which of course is the global minimum) as soon as we searched for x in the interval [0, X] with X < 2π. If we now apply conditions (5.15) and (5.16) to this problem, we have to compute the argument of the infimum in (5.16) only for the path going from 0 to X. Obviously, this path has all the desired properties: θ is the curvilinear abscissa on the arc of the circle which is the image of the interval [0, X] by the φ function defined in (1.6), and s yields the minimal-length path in the image set as well as in the parameter set, so strategies 1 and 2 are equivalent here. One obtains the condition X < 2√2, which is to be compared with the best possible condition X < 2π exhibited for this example in the introduction. We see that the result is not too bad, but as 2√2 < 2π we cannot conclude whether the condition (5.15)-(5.16) with strategy 1 is optimal or not. But one may remark that 2√2 < π, which proves that, for this example, our condition γ > 0 yields in fact the uniqueness of all local minima.

Numerical application
For historical reasons, the numerical application we are going to present was not made using (5.14) with the curvilinear abscissa in the data space as parametrisation, but using a weaker version of (4.1) with a time law of constant velocity in the parameter space. The geometry of the path going from x to y was given by strategy 2 (minimum length in the parameter space) and C was taken to be convex. One then has the bound

    |g(θ)| ≤ (1/2π) ‖u‖_{L²(θ₀,θ₁)}. (6.2)

Thus a sufficient condition for (4.1) to hold is (6.3). We applied condition (6.3) to example 1 of §1. However, rather than checking, for a priori given admissible parameter sets C, whether condition (6.3) holds, we used an alternative approach: supposing we have been given by an engineer some nominal value x̄ = (ā, b̄) ∈ ℝ*₊ × ℝ of the unknown parameters, we tried to answer the question 'how large can the parameter set C be chosen around (ā, b̄) while still maintaining the uniqueness of the local minima of problem (1.1) over C?'. This amounts to finding 'around' a given x̄ the 'largest' set C̄ for which γ = 0, so that any set C strictly included in C̄ will yield a strictly positive γ. This was done by computing the values of the argument of the infimum in (6.3) for segments of increasing length centred at x̄ and lying on a finite number of straight lines going through x̄, until one reaches the zero value in each direction. At this stage, all couples [x, 2x̄ − x] ∈ ∂C that are symmetrical with respect to x̄ were tested. Then the couples (x, y) with y ≠ x were tested, eventually diminishing the length of the [x, 2x̄ − x] interval if the argument of (6.3) happens to be negative for the [x, y] segment. Of course, this procedure will produce domains dependent on the order in which the (x, y) segments are tested in the second part of the algorithm.
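The first part of the search algorithm — growing symmetric segments around x̄ along a fan of directions until the criterion reaches zero — can be sketched generically. The function `crit` below is a toy stand-in for the argument of the infimum in (6.3), which we do not reproduce, and bisection replaces the incremental growth of the segments:

```python
import numpy as np

def maximal_radii(crit, x_bar, directions, r_max=10.0, tol=1e-3):
    """For each direction d, find by bisection the largest half-length r such
    that crit(x_bar - r*d, x_bar + r*d) > 0, i.e. the couples symmetric with
    respect to x_bar still satisfy the positivity criterion."""
    radii = []
    for d in directions:
        lo, hi = 0.0, r_max
        while hi - lo > tol:
            mid = 0.5 * (lo + hi)
            if crit(x_bar - mid * d, x_bar + mid * d) > 0:
                lo = mid
            else:
                hi = mid
        radii.append(lo)
    return np.array(radii)

# toy criterion (ours): positive while both endpoints stay inside the unit disc
crit = lambda x, y: 1.0 - max(np.linalg.norm(x), np.linalg.norm(y))
x_bar = np.zeros(2)
dirs = [np.array([np.cos(a), np.sin(a)])
        for a in np.linspace(0.0, np.pi, 8, endpoint=False)]
print(maximal_radii(crit, x_bar, dirs))   # all radii ≈ 1.0
```

In the actual application, `crit` would evaluate the argument of (6.3) along the segment [x, y], and the second part of the algorithm would then test the non-symmetric couples (x, y) on the boundary found here.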
The numerical results, taken from Charles (1985), are shown in figure 8. The interesting point to be noted is that the size of the 'maximal' sets given by condition (6.3) is already plausible from a practical point of view. Using condition (5.16) would yield still larger sets, with no basic increase in computational time. On the other hand, the use of the much more restrictive condition (4.4) would lead, in this example, to a maximal set the size of a point in figure 8, and is thus inadequate for practical use.

Conclusion
We have studied the uniqueness of the local minima of general nonlinear least-squares problems, under the main hypothesis that the mapping to be inverted is C² and has an injective derivative. For this case we have derived a sufficient condition that involves a minimisation, over all 'geodesic' curves of the image set, of a quantity involving the radius of curvature of the 'geodesic' curve and a function related to this radius of curvature through the resolution of an elliptic problem (see (5.16)). This condition has been optimised among a class of possible sufficient conditions, but it is not known whether it is the best possible condition. However, numerical examples have shown that the proposed condition makes it possible to obtain practically interesting results for a two-parameter estimation problem.