On the Wasserstein distance between mutually singular measures

We study the Wasserstein distance between two measures µ, ν which are mutually singular. In particular, we are interested in minimization problems of the form W(µ, A) = inf { W(µ, ν) : ν ∈ A }, where µ is a given probability and A is contained in the class µ⊥ of probabilities that are singular with respect to µ. Several cases for A are considered; in particular, when A consists of L^1 densities bounded above by a given function, the optimal solution is given by the characteristic function of a domain. Some regularity properties of these optimal domains are also studied. Some numerical simulations are included, as well as the double minimization problem min { P(B) + kW(A, B) : |A ∩ B| = 0, |A| = |B| = 1 }, where k > 0 is a fixed constant, P(B) is the perimeter of B, and both sets A, B may vary.


Introduction
In this paper we consider, for a given probability measure µ on R^d, the optimization problem

W(µ, A) = inf { W(µ, ν) : ν ∈ A }, (1.1)

where W denotes the p-Wasserstein distance (p ≥ 1 is fixed) and A is a class of probabilities that are singular with respect to µ. For the background on mass transportation and Wasserstein distances we refer to the books [14] and [11]. Problems of this kind arise in some models of bilayer membranes, for which we refer for instance to [9] and the references therein. Here we consider only the mathematical issues, which turn out to be very rich.
When A coincides with the class µ ⊥ of all the probabilities that are singular with respect to µ, the optimization problem (1.1) becomes trivial, in the sense that W (µ, A) = 0, as shown in Proposition 3.1. The same happens when A = µ ⊥ ∩ L 1 and µ is singular with respect to the Lebesgue measure L d (see Proposition 3.2).
On the contrary, when µ has a nonzero absolutely continuous part with respect to L^d and A = µ⊥ ∩ L^1, the optimization problem (1.1) has a nontrivial generalized solution ν providing a nonzero minimal value W(µ, A) (see Proposition 3.6). If µ ∈ L^1 (or in slightly more general cases, see Remark 3.7) this probability ν can be expressed through the distance function d(x) from x to the boundary ∂S(µ) of a concentration set S(µ) of µ, namely ν = T#µ with T(x) = x − d(x)∇d(x), # being the push-forward operator. In some more regular cases S(µ) reduces to spt µ, and some explicit examples are provided in Example 3.8 and Example 3.9.

The most interesting situation occurs when in the optimization problem (1.1) we impose an upper bound on the competing probabilities ν; more precisely, we take A = A_φ = { ν ∈ µ⊥ ∩ L^1 : ν ≤ φ }, φ being a fixed nonnegative integrable function with ∫ φ dx > 1. In this case the minimum value W(µ, A_φ) is reached by a characteristic function; more precisely, in Theorem 3.10 we show that W(µ, A_φ) = W(µ, φ1_A) for a suitable set A. Under additional assumptions on µ, we show in Theorem 3.13 that this set A has a finite perimeter. This allows us to consider in Section 4 the joint minimization problem with a perimeter penalization

min { P(B) + kW(A, B) : |A ∩ B| = 0, |A| = |B| = 1 },

where k > 0 is a fixed parameter and both A and B may vary. Here we denote shortly by W(A, B) the Wasserstein distance between the characteristic functions 1_A, 1_B. We show in Theorem 4.1 that an optimal solution A*, B* exists and prove some regularity results, namely that A* has finite perimeter and that B* is a quasi-minimizer of the perimeter.
Finally, in Section 5 we present some numerical simulations in the case p = 2.

Notation and preliminaries
In the following, our ambient space is R^d; we denote by P_c the class of all probabilities on R^d with compact support. Analogously, we denote by L^p the space of p-integrable functions on R^d and, for a given nonnegative φ ∈ L^p, by L^p_φ the class of nonnegative functions u ∈ L^p with u ≤ φ. We recall the following definitions for measures on R^d.
• µ is concentrated on a Borel set A if µ(E) = µ(E ∩ A) for every Borel set E, or equivalently µ(E \ A) = 0 for every Borel set E.
• µ is absolutely continuous with respect to ν if ν(E) = 0 implies µ(E) = 0 for every Borel set E. In this case, we use the notation µ ≪ ν. By the Radon–Nikodym derivation theorem, when µ ≪ ν there exists a unique (up to ν-a.e. equivalence) nonnegative function h ∈ L^1_ν such that µ(E) = ∫_E h dν for every Borel set E. In this case, we use the notation µ = hν. The function h above can be obtained (ν-a.e.) as
h(x) = lim_{r→0} µ(B_r(x)) / ν(B_r(x)),
where B_r(x) is the ball of radius r centered at x.
• µ and ν are mutually singular if there exists a Borel set A such that µ is concentrated on A and ν is concentrated on R^d \ A, that is µ(R^d \ A) = ν(A) = 0. In this case we use the notation µ ⊥ ν. For a fixed µ ∈ P_c we denote by µ⊥ the class { ν ∈ P_c : ν ⊥ µ }.
• The Lebesgue decomposition of µ with respect to ν is the unique way of writing µ = µ_1 + µ_2 with µ_1 ≪ ν and µ_2 ⊥ ν. The measures µ_1 and µ_2 are called the absolutely continuous part and the singular part of µ with respect to ν. When the measure ν is fixed by the context, we write µ = µ_a + µ_s and, by the Radon–Nikodym derivation theorem, we then have µ_a = hν for a nonnegative h ∈ L^1_ν. Often, when no ambiguity is possible, we simply write h instead of hν, identifying an absolutely continuous measure (with respect to ν) with its L^1_ν density.
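On a finite set these notions become elementary; the following sketch (with hypothetical toy measures) illustrates the Lebesgue decomposition and the Radon–Nikodym density:

```python
# Toy illustration (hypothetical data) of the Lebesgue decomposition on a
# finite set: mu splits into a part mu_a absolutely continuous with respect
# to nu (carried by the points where nu > 0, with density h = mu_a / nu)
# and a part mu_s singular with respect to nu.
mu = {0: 0.2, 1: 0.3, 2: 0.5}
nu = {0: 0.5, 1: 0.5}

mu_a = {x: m for x, m in mu.items() if nu.get(x, 0) > 0}   # absolutely continuous part
mu_s = {x: m for x, m in mu.items() if nu.get(x, 0) == 0}  # singular part
h = {x: m / nu[x] for x, m in mu_a.items()}                # Radon-Nikodym density

print(mu_s)  # {2: 0.5}
print(h)     # {0: 0.4, 1: 0.6}
```

Here µ_a and µ_s are concentrated on disjoint sets, so the decomposition is exactly of the form µ = hν + µ_s described above.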
For a fixed p ≥ 1 and µ, ν ∈ P_c we denote by W(µ, ν) the Wasserstein distance
W(µ, ν) = inf { ( ∫ |x − y|^p dγ(x, y) )^{1/p} : γ ∈ Π(µ, ν) },
where Π(µ, ν) is the class of probabilities γ on R^d × R^d having µ and ν as marginals, that is π^1#γ = µ and π^2#γ = ν, # being the push-forward operation defined, for a map f : X → Y between two measurable spaces, as
f#µ(E) = µ(f^{−1}(E)) for every measurable set E ⊂ Y.
In the sequel, we denote by L^d the Lebesgue measure on R^d and by δ_x the Dirac measure at the point x.
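For empirical measures on the real line with the same number of equally weighted atoms, the infimum above is attained by the monotone coupling, which matches sorted samples; this gives a two-line computation (a sketch under these simplifying assumptions; the function name is ours):

```python
import numpy as np

def wasserstein_1d(xs, ys, p=2):
    """p-Wasserstein distance between two empirical measures on R with the
    same number of equally weighted atoms: the optimal plan matches the
    sorted samples monotonically."""
    xs, ys = np.sort(np.asarray(xs, float)), np.sort(np.asarray(ys, float))
    return float(np.mean(np.abs(xs - ys) ** p) ** (1.0 / p))

# two atoms at {0, 1} versus two atoms at {2, 3}: every atom moves by 2
print(wasserstein_1d([0, 1], [2, 3]))  # 2.0
```

Since every atom is displaced by the same distance in this example, the value does not depend on p.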

Some optimization problems
In this section, we fix a probability µ ∈ P_c and we consider the Wasserstein distance (p ≥ 1 is fixed) from µ to some subclasses A ⊂ µ⊥. In other words, we consider the optimization problem
W(µ, A) = inf { W(µ, ν) : ν ∈ A }.
The first case we consider is A = µ⊥.

Proposition 3.1. For every µ ∈ P_c we have W(µ, µ⊥) = 0.
Proof. Let µ_s be the singular part of µ with respect to the Lebesgue measure L^d; the measure µ_s is concentrated on a set N which is negligible with respect to L^d. Then we can find a sequence (x_n) in R^d \ N which is dense in R^d, and a sequence (µ_n) in P_c of the form µ_n = ∑_{k∈N} a_{n,k} δ_{x_k} such that µ_n → µ in the weak* convergence. By the choice of the sequence (x_n) the measures µ_n are singular with respect to µ and, since µ_n → µ weakly*, we have W(µ, µ_n) → 0, as required.
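The approximation in the proof can be observed numerically. In the hypothetical 1-D instance below, µ is the uniform measure on [0, 1] and µ_n places mass 1/n at the midpoints of the n cells; an exact computation gives W_1(µ, µ_n) = 1/(4n) → 0, and the atoms can always be perturbed off a given null set without changing the distance appreciably:

```python
# W_1 between the uniform measure mu on [0, 1] and its n-atom midpoint
# discretization: each cell [i/n, (i+1)/n] sends its mass to the midpoint,
# so W_1 = n * (integral over one cell of |x - midpoint| dx) = 1/(4n) -> 0.
def w1_uniform_vs_midpoints(n, fine=100000):
    total = 0.0
    for j in range(fine):                  # Riemann sum over a fine grid of [0, 1]
        x = (j + 0.5) / fine
        cell = min(int(x * n), n - 1)
        midpoint = (cell + 0.5) / n
        total += abs(x - midpoint) / fine
    return total

for n in (1, 2, 4, 8):
    print(n, w1_uniform_vs_midpoints(n))   # ~ 1/(4n): 0.25, 0.125, 0.0625, 0.03125
```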
In the next step, we consider the class µ⊥ ∩ L^1.

Proposition 3.2. If µ is singular with respect to the Lebesgue measure L^d, then W(µ, µ⊥ ∩ L^1) = 0.

Proof. Let N be a negligible set with respect to L^d on which µ is concentrated. We can find a sequence (x_n) in R^d \ N dense in R^d and a sequence (µ_n) in P_c of the form µ_n = ∑_{k∈N} a_{n,k} δ_{x_k} such that µ_n → µ in the weak* convergence. Since L^1 is weakly* dense in P_c, we can find L^1 functions ρ_n with compact support such that W(µ_n, ρ_n) ≤ 1/n. Since µ ⊥ L^d we have that ρ_n ∈ µ⊥ ∩ L^1, and so W(µ, µ⊥ ∩ L^1) = 0.

The situation becomes more interesting when µ is not singular with respect to L^d.

Proposition 3.3. Assume that there exist an open set A and a constant δ > 0 such that µ_a ≥ δ a.e. on A. Then W(µ, µ⊥ ∩ L^1) > 0.

Proof. Since convergence in Wasserstein distance implies weak* convergence, it is enough to exclude weak* approximation. Assume by contradiction that there exists a sequence (ρ_n) in µ⊥ ∩ L^1 with ρ_n → µ in the weak* convergence. Then we have ∫ ρ_n µ_a dx = 0 for every n ∈ N.
Since A is open, the weak* convergence gives
lim inf_n ∫_A ρ_n dx ≥ µ(A) ≥ δ|A| > 0,
which is impossible, since µ_a ≥ δ on A forces ∫_A ρ_n dx ≤ (1/δ) ∫ ρ_n µ_a dx = 0.
Example 3.5. The regularity assumption of Proposition 3.3 cannot be removed. Take indeed, in dimension one, an open dense set A ⊂ (−1, 1) with unit measure and take µ = 1_{A^c}. Since A is dense, by finite sums of Dirac masses at points of A we can weakly* approximate every measure supported in [−1, 1]; hence, approximating each Dirac mass by a smooth function, we can construct a sequence ρ_n of smooth functions, compactly supported in A, that converges to µ in the weak* convergence of measures. We then have ∫ ρ_n dµ = 0 for every n ∈ N, although µ_a = µ does not vanish: the point is that the set {µ_a > 0} = A^c has empty interior, so the assumption of Proposition 3.3 fails.
In the next proposition, we characterize the quantity W (µ, µ ⊥ ∩ L 1 ) (which is positive under the assumption of Proposition 3.3).
Proposition 3.6. Let µ ∈ P_c. Then there exists ν ∈ P_c such that
W(µ, ν) = W(µ, µ⊥ ∩ L^1). (3.1)
The measure ν is concentrated on ∂S, where S is a concentration set for µ. Moreover, if µ ∈ L^1 we have that ν is unique and given by
ν = T#µ, with T(x) = x − d(x)∇d(x), (3.2)
where d(x) denotes the distance from x to ∂S.

Proof. Let (ρ_n) be a sequence in µ⊥ ∩ L^1 such that
W(µ, ρ_n) → W(µ, µ⊥ ∩ L^1). (3.3)
Then, up to a subsequence, we may assume that ρ_n → ν weakly*, for a suitable ν ∈ P_c. Hence lim_n W(µ, ρ_n) = W(µ, ν).
In order to see that ν is concentrated on ∂S, where S is a concentration set for µ, notice that, heuristically, to achieve the minimal Wasserstein distance every point of S has to be transported out of S in the shortest way; hence ν has to be concentrated on ∂S. To prove this fact in a precise way, note that, since ρ_n ⊥ µ, there is a set R_n such that ρ_n is concentrated on R_n and µ is concentrated on R_n^c. If we take S = ∩_n R_n^c we have
ν(int S) ≤ lim inf_n ρ_n(int S) = 0,
which proves that ν is concentrated on (int S)^c. We show now that ν is concentrated on the compact set S. Otherwise, we could find a compact set K disjoint from S such that ν(K) > 0; since both K and S are compact, there is an open set U with compact closure such that dist(U, S) = 4δ > 0 and ν(U) > 0. Since U is open, taking a subsequence if necessary, we can also assume that, for some α > 0, we have ρ_n(U) ≥ α for every n. We are going to show that this implies the existence of a sequence ρ̂_n ∈ µ⊥ ∩ L^1 and of a positive constant C > 0 such that
W^p(µ, ρ̂_n) ≤ W^p(µ, ρ_n) − C for every n, (3.4)
which contradicts (3.3). We first choose an almost optimal map T_n such that T_n#ρ_n = µ and
∫ |T_n(y) − y|^p dρ_n(y) ≤ W^p(µ, ρ_n) + 2^{−n}.
For y ∈ U, let F_n(y) be a point on the segment [y, T_n(y)] lying at distance 2δ from S (measurable selection arguments imply that this map can be chosen measurable). By construction, for ρ_n-a.e. y ∈ U we then have
|T_n(y) − F_n(y)| ≤ |T_n(y) − y| − 2δ. (3.5)
Let then η be a smooth compactly supported probability density having its support in the ball B_δ of radius δ centered at the origin, and define the plan γ̂_n which, for y ∈ U, spreads the mass at T_n(y) onto the points F_n(y) + z with z distributed according to η(z) dz, and which coincides with (T_n(y), y) for y ∉ U. Obviously the first marginal of γ̂_n is µ, and we denote by ρ̂_n its second marginal; that is, ρ̂_n is the sum of the restriction of ρ_n to U^c and the convolution of η with the image by F_n of ρ_n restricted to U, so it belongs to L^1. Also ρ̂_n ∈ µ⊥, since F_n(y) + z remains at distance δ from S for ρ_n-a.e. y ∈ U and every z ∈ B_δ. Since γ̂_n ∈ Π(µ, ρ̂_n), the cost of γ̂_n bounds W^p(µ, ρ̂_n) from above; on the other hand, recalling (3.5), on U this cost improves the cost of T_n by a fixed positive amount, which, together with ρ_n(U) ≥ α > 0, proves (3.4).

Assume now that µ ∈ L^1, so that we may take S as the set of points x where the Lebesgue density of µ is strictly positive. Then the optimal transport map is given by
T(x) = x − d(x)∇d(x), (3.6)
where d(x) denotes the distance of the point x to ∂S, and so the measure ν is given by (3.2). The uniqueness of ν follows from the fact that µ ∈ L^1 and that ∇d(x) is well defined for a.e. point x ∈ S.
Remark 3.7. When µ has a singular part with respect to L^d, formula (3.2) still provides a measure ν which verifies (3.1). In this case, writing µ = µ_a + µ_s, it is easy to see that ν = µ on R^d \ S(µ_a), where S(µ_a) denotes a concentration set for µ_a; hence in this region the Wasserstein cost vanishes. On the other hand, if µ_s does not vanish on S(µ_a), it is transported onto ∂S(µ_a) by the transport map in (3.6); the only case to be made precise is when on S(µ_s) ∩ S(µ_a) the function d is not differentiable, so that ∇d is not defined. In this case these singular points x have more than one projection on ∂S(µ_a), and every choice of T(x) as one of these projections (or also as a transport plan, sending x to any subset of its projections) gives a measure ν verifying (3.1).

Example 3.8. Let µ = 1_Q be the characteristic function of a rectangle Q with sides a and b, with a ≤ b; we assume for simplicity that the center of the rectangle is at the origin, that is Q = [−b/2, b/2] × [−a/2, a/2]. Using Proposition 3.6 we obtain that the measure ν in (3.1) is concentrated on the boundary of Q and its boundary density ρ is given by
ρ = a/2 − |y| on the vertical sides, ρ = (b/2 − |x|) ∧ a/2 on the horizontal sides.
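The boundary density of Example 3.8 can be sanity-checked numerically: the total mass carried by ρ on the four sides must equal |Q| = ab. A short sketch, with the hypothetical choice a = 1, b = 2:

```python
import numpy as np

def trapezoid(f, t):
    """Explicit trapezoidal rule (exact for piecewise linear data on the grid)."""
    return float(np.sum((f[1:] + f[:-1]) / 2 * np.diff(t)))

a, b = 1.0, 2.0                                        # hypothetical side lengths, a <= b
y = np.linspace(-a / 2, a / 2, 20001)
x = np.linspace(-b / 2, b / 2, 20001)
rho_vert = a / 2 - np.abs(y)                           # density on each vertical side
rho_horiz = np.minimum(b / 2 - np.abs(x), a / 2)       # density on each horizontal side
total = 2 * trapezoid(rho_vert, y) + 2 * trapezoid(rho_horiz, x)
print(total)  # ≈ a*b = 2.0, the mass of mu = 1_Q
```

The two vertical sides carry mass a²/2 in total and the two horizontal sides carry a(b − a) + a²/2, which indeed sum to ab.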
In Figure 1, we represent the boundary density on ∂Q as a boundary thickness.

Example 3.9. Let µ = 1_{B^+} be the characteristic function of the upper half disc; we assume for simplicity that the center is at the origin, that is B^+ = { x² + y² ≤ 1, y ≥ 0 }. Using Proposition 3.6 we obtain that the measure ν in (3.1) is concentrated on the boundary of B^+, and the optimal transport map sends every point of B^+ to its nearest boundary point. The boundary density ρ on the upper (circular) part of the boundary, parametrized by the angle 0 ≤ θ ≤ π, can then be obtained by elementary calculations.
In Figure 2, we represent the boundary density on ∂B^+ as a boundary thickness.

We consider now, as a class A ⊂ µ⊥, the class
A_φ = { ν ∈ µ⊥ ∩ L^1 : ν ≤ φ },
where φ is a given nonnegative integrable function with ∫ φ dx > 1.
Theorem 3.10. For every µ ∈ P_c there exists a set A with µ(A) = 0 and such that W(µ, A_φ) = W(µ, φ1_A).

Proof. We have
W(µ, A_φ) = inf { W(µ, θφ) : 0 ≤ θ ≤ 1, ∫ θφ dx = 1, θφ ∈ µ⊥ }. (3.7)
Moreover, since µ is compactly supported, we may reduce ourselves to consider in (3.7) only functions θ supported in a ball B_R with R large enough. Then, by the weak* compactness of bounded sets in L^∞, the infimum in (3.7) is actually a minimum; we denote by θ a minimizer. We can assume that φ > 0, since on the set where φ vanishes one may take θ = 0 and there is nothing to prove. We want to show that the set {0 < θ(x) < 1} is Lebesgue negligible. There are at most countably many connected components of the support of θφ which have positive Lebesgue measure; we denote them by A_n, and we are left to show that for each n and each δ ∈ (0, 1) the set E_δ ∩ A_n, with E_δ = {δ ≤ θ(x) ≤ 1 − δ}, is Lebesgue negligible. Assume by contradiction that it has positive measure. Then, for every h ∈ L^∞ supported in E_δ ∩ A_n with ∫ hφ dx = 0, and for ε small enough, θ + εh is admissible for the above minimization problem. Using the dual Kantorovich formulation and denoting by u_ε a Kantorovich potential between µ and ν_ε := (θ + εh)φ, the minimality of θ gives
∫ φ h u_ε dx ≤ 0. (3.8)
Kantorovich potentials being uniformly continuous, it may be assumed that, up to the extraction of a subsequence, u_ε converges to some Kantorovich potential u for W(µ, θφ). A priori any such cluster point may depend on h, but it is also known (see for instance [11]) that Kantorovich potentials are defined uniquely up to an additive constant on each A_n. In other words, u does not depend on h up to possibly an additive constant, and with (3.8) we have ∫ φhu dx ≤ 0; hence, by changing h into −h, we in fact have ∫ φhu dx = 0. Since h is arbitrary, we deduce that u coincides a.e. with a constant on E_δ ∩ A_n, and we then also have ∇u = 0 a.e. on E_δ ∩ A_n. Therefore the optimal transport T from θφ to µ is the identity map on E_δ ∩ A_n, and this implies that µ ≥ δφ1_{E_δ ∩ A_n}, which clearly contradicts the fact that ∫ θφ µ_a dx = 0.
Remark 3.11. It is well known (see [11]) that when µ ∈ L^1 ∩ P_c and p > 1, the map ν ↦ W^p(µ, ν) is strictly convex (note that the cost of an optimal transport is always convex with respect to the marginals); in this case the optimal θ in (3.7) is unique, and then so is the set A.
In general, we should not expect that the set A of Theorem 3.10 has a finite perimeter, as the example below shows.
Example 3.12. Take φ = 1 and µ = ∑_{n∈N} c_n δ_{x_n}, where x_n is the center of a ball of radius r_n = (c_n/ω_d)^{1/d} (ω_d being the Lebesgue measure of the unit ball of R^d). We may choose the balls B(x_n, r_n) all disjoint, so that the set A of Theorem 3.10 coincides with ∪_{n∈N} B(x_n, r_n). We then have
L^d(A) = ∑_n c_n, P(A) = d ω_d^{1/d} ∑_n c_n^{(d−1)/d}.
To have L^d(A) = 1 and P(A) = +∞ it is now enough to choose c_n such that
∑_n c_n = 1 and ∑_n c_n^{(d−1)/d} = +∞.
A possible array of the balls B(x_n, r_n) is shown in Figure 3.

By a suitable approximation of the Dirac masses by smooth functions, we may construct a counterexample similar to Example 3.12, with µ ∈ L^∞, for which the set A of Theorem 3.10 does not have a finite perimeter. Therefore, some extra assumptions on µ are needed in order to have P(A) < +∞.

Theorem 3.13. Let p = 2 and φ = 1; let µ ∈ P_c ∩ BV be such that the set S_µ = {µ(x) > 0} has a finite perimeter. Then the set A of Theorem 3.10 has a finite perimeter.

Proof. It is enough to apply Theorem 1.2 of [6] with Ω = R^d, g = µ and f = 1_{{µ=0}}.
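The divergence in Example 3.12 can be checked numerically. In dimension d = 2, a concrete (hypothetical) choice is c_n = (6/π²)/n²: the masses sum to 1, while c_n^{1/2} behaves like 1/n, so the perimeter series diverges logarithmically:

```python
import math

# Example 3.12 in dimension d = 2 with the hypothetical choice
# c_n = (6/pi^2) / n^2: the areas sum to 1, while the perimeters
# 2*pi*r_n are proportional to 1/n, so their sum diverges like log N.
d = 2
omega_d = math.pi                                  # Lebesgue measure of the unit ball in R^2
N = 200000
c = [(6 / math.pi**2) / n**2 for n in range(1, N + 1)]
r = [(cn / omega_d) ** (1 / d) for cn in c]        # radii r_n = (c_n / omega_d)^(1/d)

area = sum(math.pi * rn**2 for rn in r)            # partial sum of the areas -> 1
perimeter = sum(2 * math.pi * rn for rn in r)      # partial sum ~ 2*sqrt(6/pi) * log N
print(area, perimeter)
```

With N = 200000 the total area is already within 10^−5 of 1, while the perimeter partial sum has grown past 30 and keeps increasing without bound.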
Remark 3.14. Both Theorem 1.2 in [6] and Theorem 3.13 are stated in the case p = 2. Similar results also hold in the case p ≥ 1, as communicated to us by S. Di Marino [7].

Perimeter penalization
In this section, we consider the minimum problem
min { P(B) + kW(A, B) : |A ∩ B| = 0, |A| = |B| = 1 }, (4.1)
where k > 0 is a fixed parameter and both sets A, B may vary. We first consider the case when the competing sets B are constrained to remain in a fixed bounded domain D.

Theorem 4.1. Let D be a bounded domain. Then for every k > 0 the minimization problem
min { P(B) + kW(A, B) : |A ∩ B| = 0, |A| = |B| = 1, B ⊂ D } (4.2)
admits a solution (A*, B*).

Proof. Let (B_n, A_n) be a minimizing sequence for the minimization problem (4.2). Since the perimeters P(B_n) are bounded and B_n ⊂ D, possibly passing to subsequences we may assume that B_n → B* strongly in L^1. Analogously, since the functions 1_{A_n} are bounded by 1 and compactly supported, we may assume that 1_{A_n} → θ̃ weakly* in L^∞, for a suitable θ̃ with 0 ≤ θ̃ ≤ 1. By Theorem 3.10, the minimization problem
min { W(B*, A) : |A| = |B*| = 1, |A ∩ B*| = 0 }
admits a solution which is the characteristic function of a set A*. By the minimality of A*, and by the lower semicontinuity of the perimeter with respect to the strong L^1 convergence, we deduce that
P(B*) + kW(A*, B*) ≤ lim inf_n [ P(B_n) + kW(A_n, B_n) ],
which concludes the proof.

In the two-dimensional case, d = 2, we can take D = R² in Theorem 4.1.
Theorem 4.2. In the case d = 2 for every k > 0 the minimization problem (4.1) admits a solution.
Proof. We can repeat the proof of Theorem 4.1 as soon as we show that, for a suitable minimizing sequence (B_n, A_n), the sets B_n remain uniformly bounded. Let then (B_n, A_n) be a minimizing sequence for problem (4.1), let B_{n,k} be the connected components of B_n and let A_{n,k} be the part of A_n that is transported onto B_{n,k}. Then we have
W(A_n, B_n) ≥ ∑_k W(Ā_{n,k}, B_{n,k}), (4.3)
where Ā_{n,k} denotes a solution of the minimum problem
min { W(B_{n,k}, A) : |A| = |B_{n,k}|, |A ∩ B_{n,k}| = 0 }.
We can now construct a new minimizing sequence (B̃_n, Ã_n) by translating the sets B_{n,k} and Ā_{n,k} in such a way that B̃_{n,k} is contained in a square of side P(B_{n,k}) and Ã_{n,k} in a concentric square of side 2P(B_{n,k}). Arranging these squares in an array, and recalling that ∑_k P(B_{n,k}) = P(B_n) is bounded along the minimizing sequence, we obtain that the sets B̃_n are uniformly bounded. The fact that (B̃_n, Ã_n) is still a minimizing sequence for problem (4.1) follows from the equality P(B̃_n) = P(B_n) and the inequality W(Ã_n, B̃_n) ≤ W(A_n, B_n), which are consequences of (4.3).
Remark 4.3. Even if we expect that a result similar to the one in Theorem 4.2 holds for every dimension, the proof we provided uses the fact that for a connected set its diameter is bounded by its perimeter, which only holds in dimension two. It would be interesting to find an alternative proof of Theorem 4.2 valid for every dimension d.
Example 4.4. In the one-dimensional case d = 1, if W is the p-Wasserstein distance (p ≥ 1), an easy calculation gives that, taking B as the union of n disjoint equal intervals (of length 1/n each) and A surrounding them symmetrically,
P(B) + kW(A, B) = 2n + k/(2n).
Then the optimal solution (B*, A*) of problem (4.1) is obtained by taking B* given by
1 interval if 0 < k ≤ 8,
2 intervals if 8 ≤ k ≤ 24,
…,
n intervals if 4(n − 1)n ≤ k ≤ 4n(n + 1).
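The optimization over n in Example 4.4 can be checked with a short script. The closed form F(n) = 2n + k/(2n) reflects the fact that each half-interval of B is translated by exactly 1/(2n), so the displacement is constant and W_p does not depend on p:

```python
def energy(n, k):
    # perimeter of n intervals in 1-D is 2n; W_p(A, B) = 1/(2n) for every p
    return 2 * n + k / (2 * n)

def optimal_n(k, n_max=1000):
    # brute-force minimizer of F(n) = 2n + k/(2n) over n = 1..n_max
    return min(range(1, n_max + 1), key=lambda n: energy(n, k))

# the optimal number of intervals switches from n to n+1 exactly at k = 4n(n+1)
print([optimal_n(k) for k in (4, 8, 16, 24, 50)])  # [1, 1, 2, 2, 4]
```

At the threshold values k = 4n(n + 1) the energies of n and n + 1 intervals coincide, matching the overlapping ranges in the example.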
We now address the regularity of optimal sets B. Let D be a bounded domain, p ≥ 1, k > 0, α > 0, and consider the (slightly more general) shape optimization problem
inf { P(B) + k ( inf { W(A, B) : |A ∩ B| = 0, |A| = 1 } )^α : B ⊂ D, |B| = 1 }. (4.4)
If B solves (4.4), it satisfies the quasi-minimality inequality
P(B) ≤ P(B′) + C |B Δ B′| for every B′ ⊂ D with |B′| = 1, (4.5)
where we have used the fact that the energy F(B′) in (4.4) is bounded and bounded away from 0. The key point is an estimate, in terms of the symmetric difference |B Δ B′|, for the Wasserstein term (Lemma 4.5): minimizing the right-hand side in A and in γ ∈ Π(1_A, 1_B) gives the desired result.
Thanks to (4.5), Lemma 4.5 and the theory of quasi-minimizers of the perimeter (see the seminal work of Tamanini [12], and Xia [13] for the case of a volume constraint, as in the present context), we have:

Theorem 4.6. If B solves (4.4), its reduced boundary ∂*B is a C^{1,1/2} hypersurface and the Hausdorff dimension of (∂B \ ∂*B) ∩ D is at most d − 8.

Numerical simulations
In this section, we give some numerical simulations of problem (3.7) in the case p = 2; Theorem 3.10 and Remark 3.11 say that there exists a unique minimizer and that the optimal density is φ multiplied by the characteristic function of a domain. The problem can be rewritten as a minimization over transport plans γ whose second marginal is µ and whose first marginal is of the form θφ with 0 ≤ θ ≤ 1. (5.1)

In the discrete setting, µ and θ are replaced by vectors supported on a finite family of points (x_i)_{1≤i≤N} which is a discretization of D, a compact subset of R² such that spt µ, spt θ ⊂ D. Then problem (5.1) becomes a finite-dimensional linear program over matrices γ = (γ_{i,j}),
min { ∑_{i,j} c_{i,j} γ_{i,j} : γ admissible },
where c_{i,j} = |x_i − x_j|² and the marginal maps are defined by (P¹γ)_i = ∑_j γ_{i,j} and (P²γ)_j = ∑_i γ_{i,j}.

This problem can easily be solved using the well-known entropic regularization method [3, 8, 2]. It consists in regularizing W² by the entropy of the transport plan: given a regularization parameter ε > 0, we solve
min { KL(γ | η_ε) : γ admissible }, (5.2)
where (η_ε)_{i,j} = e^{−c_{i,j}/ε} and KL is the Kullback–Leibler divergence
KL(γ | η) = ∑_{i,j} γ_{i,j} ( log(γ_{i,j}/η_{i,j}) − 1 ).
Solving problem (5.2) is equivalent to solving a proximal problem for G := G_1 + G_2 + G_3, where the terms G_l encode the admissibility constraints. This can be done using the proximal splitting algorithm introduced by Peyré in the setting of entropic regularization of Wasserstein gradient flows [10]; this scheme was extended in [5] to unbalanced transport problems and used in [4] to compute Cournot–Nash equilibria. It is well known that the solution of (5.2) is of the form γ_{i,j} = a_i (η_ε)_{i,j} b_j, where a, b ∈ R^N and η_ε ∈ R^{N×N}. The splitting algorithm alternates the proximal problems associated with the G_l, l ∈ {1, 2, 3}, instead of solving (5.2) directly: after initializing with scaling vectors a⁰, b⁰ and γ⁰_{i,j} = a⁰_i (η_ε)_{i,j} b⁰_j, the vectors a^k, b^k are updated iteratively for k ≥ 1 through the proximal maps, and γ^k_{i,j} = a^k_i (η_ε)_{i,j} b^k_j. We refer to [10, 5] for the convergence of this algorithm to a solution of (5.2). The advantage of this method is that the proximal maps prox^{KL}_{G_l} can be computed explicitly. We now present some numerical results obtained using this algorithm.
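The full scheme above alternates three KL-proximal steps; its basic building block, the Sinkhorn scaling iteration for the entropic problem between two fixed discrete measures, can be sketched as follows. This is a minimal illustration, not the authors' Matlab code; the 1-D grid, the Gaussian marginals and the value of ε are arbitrary choices:

```python
import numpy as np

def sinkhorn(mu, nu, C, eps=0.05, n_iter=5000):
    """Entropic OT between discrete probability vectors mu, nu with cost
    matrix C: alternately rescale the rows and columns of the Gibbs kernel
    K = exp(-C/eps) so that gamma = diag(a) K diag(b) has the prescribed
    marginals."""
    K = np.exp(-C / eps)
    a = np.ones_like(mu)
    for _ in range(n_iter):
        b = nu / (K.T @ a)   # enforce the column marginal nu
        a = mu / (K @ b)     # enforce the row marginal mu
    return a[:, None] * K * b[None, :]

x = np.linspace(0.0, 1.0, 50)
C = (x[:, None] - x[None, :]) ** 2                    # quadratic cost, p = 2
mu = np.exp(-((x - 0.3) ** 2) / 0.01); mu /= mu.sum()
nu = np.exp(-((x - 0.7) ** 2) / 0.01); nu /= nu.sum()
gamma = sinkhorn(mu, nu, C)
print((gamma * C).sum())   # regularized approximation of W^2, close to 0.4^2
```

The scheme of [10, 5] replaces the two marginal projections by three KL-proximal steps, one of which enforces the constraint 0 ≤ θ ≤ 1; the scaling structure γ = diag(a) η_ε diag(b) is the same.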
In the sequel, all computations are done with Matlab, using N = 500 × 500 points in the discretization of D = [−4, 4]², except for the triangle in Figure 5, where D = [−2, 2]². In Figures 4 and 5 we represent the regularized solution of problem (3.7), θ_ε = π²#γ^k, with k large enough and ε = 0.01. In Figure 4, µ is the white rectangle or the white half circle, φ = 1, and we remark that θ_ε is the characteristic function of the black set, as expected. In Figure 5, the initial densities are given by the characteristic functions of a triangle, of a non-convex Pacman shape, and of a disconnected domain given by a union of rectangles and ellipses. The first column represents the regularized solution of problem (3.7) with φ = 1, while in the second column φ(x, y) = (x + 1)² + 1. We remark that the solution θ_ε corresponding to the non-convex Pacman shape fills the hole in µ when φ = 1, but not entirely in the case φ(x, y) = (x + 1)² + 1 ≥ 1; in both cases the support of the solution is not connected. As expected, in the second column the regularized solutions of problem (3.7) are given by φ multiplied by the characteristic function of a set.

Figure 4: The optimal density for a rectangle and for a half circle.