Increasing the robustness and applicability of full-waveform inversion : An optimal transport distance strategy

Full-waveform inversion starts being used as a standard stage of the seismic-imaging workflow, at the exploration scale, for the reconstruction of high-resolution wave velocity models. However, its successful application still relies on the estimation of an accurate enough initial velocity model, as well as on the design of a suitable hierarchical workflow, allowing it to feed the inversion process progressively with data. These two requirements are mandatory to avoid the cycle-skipping or phase-ambiguity problem when comparing observed and synthetic data. This difficulty is due to the definition of the full-waveform inversion problem as the least-squares minimization of the data misfit. The resulting misfit function has local minima, which correspond to the interpretation of the seismic data up to one or several phase shifts. In this article, we review an alternative formulation of full-waveform inversion based on the optimal transport distance we have proposed in recent studies. We propose to use a particular instance of the optimal transport problem, which is adapted to the interpretation of real seismic data and for which we design an efficient lowcomplexity numerical strategy. Numerical results in 2D and 3D configurations (BP 2004, Chevron 2014 benchmark model, SEG/ EAGE overthrust model) show that this reformulation should yield a more convex misfit function, less prone to cycle skipping. In this study, we present a simple illustration on the Marmousi model, which illustrates how this new distance strongly relaxes the requirement on the initial model design. Starting from a rather simplistic approximation of the initial model, the method is able to reconstruct a meaningful estimation of the Marmousi model, while the standard least-squares formulation is trapped into a local, meaningless minimum. Introduction Full-waveform inversion (FWI) is a high-resolution seismicimaging technique based on the nonlinear iterative minimization of the misfit between calculated and observed data. The simple formalism of the method makes it amenable for the estimation of any subsurface mechanical parameters influencing the wave propagation, such as Pand S-wave velocities, density, attenuation, or anisotropy parameters. Nonetheless, both in industrial and academic applications, FWI is used mainly for the estimation of wave velocities (Fichtner et al., 2010; Plessix and Perkins, 2010; Sirgue et al., 2010; Tape et al., 2010; Peter et al., 2011; Zhu et al., 2012; Warner et al., 2013; Vigh et al., 2014; Borisov and Singh, 2015; Operto et al., 2015). In addition, the successful application of FWI on real data requires a kinematically compatible initial estimation of the velocity model (computed beforehand through well-established techniques, such as time tomography, L. Métivier1, 2, R. Brossier2, Q. Mérigot3, E. Oudet1, and J. Virieux2 velocity analysis, and/or geologic interpretation) as well as the design of a specific hierarchical workflow to interpret the data progressively during iterations and to converge to a meaningful subsurface model estimation (Virieux and Operto, 2009). One current trend in FWI is to improve the reconstruction of physical parameters by accounting for a more realistic physics of the wave propagation, including viscosity, elasticity, and anisotropic effects. In particular, this leads to a multiparameter inversion that should provide a much more detailed and reliable subsurface characterization, ultimately yielding the possibility to constrain microscale parameters through downscaling strategies (Dupuy et al., 2016). Another line of investigation consists of increasing the applicability range of FWI, mitigating the dependence on ad-hoc initial model build-up, and on the design of a suitable hierarchical workflow. The optimal transport strategy we promote in this study is related to this second line of investigation. The main reason for the limitation in FWI applicability is related to what is usually referred to as cycle skipping, or phase ambiguity. In standard FWI, the oscillatory seismic data is matched in the least-squares sense, where each observed sample is compared to the synthetic sample at the same position in time and/or in space. This choice is problematic: if the initial model predicts the signal with a shift larger than half a period, minimizing the least-squares distance between observed and calculated data amounts to matching the observed data up to one or several phase shifts. This yields an incorrect estimation of the subsurface model, which cannot be overcome through iterations: the optimization is locked into a local minimum. An illustration of this phenomenon, where the seismic data is considered schematically as a sinusoidal temporal signal, is presented in Figure 1. Overcoming this difficulty has been a recurrent objective since FWI’s introduction by Lailly (1983) and Tarantola (1984). Increasing the initial model’s accuracy through high-resolution tomography methods, as well as designing hierarchical workflows focusing first on low-frequency components of the data, early-arrivals, and/or short offsets, have been initial strategies proposed to challenge this issue (Kolb et al., 1986; Bunks et al., 1995; Pratt, 1999; Shipp and Singh, 2002; Sirgue and Pratt, 2004; Wang and Rao, 2009). They are still the ones implemented for real-data applications to guarantee the success of FWI. This careful tuning is case-dependent; therefore, it reduces the flexibility of FWI and requires an expert usage of FWI and preprocessing tools. Since the cycle-skipping issue results from the use of the least-squares distance to match the seismic data, numerous studies have been proposed to modify this distance within the FWI workflow: cross correlation of observed and predicted data (Luo and Schuster, 1991), separation of envelope and phase (Fichtner 1Laboratoire Jean Kuntzmann (LJK), University Grenoble Alpes, CNRS, France. 2ISTerre, University Grenoble Alpes, France. 3Laborartoire CEREMADE, University Paris-Dauphine, CNRS, France. http://dx.doi.org/10.1190/tle35121060.1.


Introduction
Full-waveform inversion (FWI) is a high-resolution seismicimaging technique based on the nonlinear iterative minimization of the misfit between calculated and observed data.The simple formalism of the method makes it amenable for the estimation of any subsurface mechanical parameters influencing the wave propagation, such as P-and S-wave velocities, density, attenuation, or anisotropy parameters.Nonetheless, both in industrial and academic applications, FWI is used mainly for the estimation of wave velocities (Fichtner et al., 2010;Plessix and Perkins, 2010;Sirgue et al., 2010;Tape et al., 2010;Peter et al., 2011;Zhu et al., 2012;Warner et al., 2013;Vigh et al., 2014;Borisov and Singh, 2015;Operto et al., 2015).In addition, the successful application of FWI on real data requires a kinematically compatible initial estimation of the velocity model (computed beforehand through well-established techniques, such as time tomography, L. Métivier 1, 2 , R. Brossier 2 , Q. Mérigot 3 , E. Oudet 1 , and J. Virieux 2 velocity analysis, and/or geologic interpretation) as well as the design of a specific hierarchical workflow to interpret the data progressively during iterations and to converge to a meaningful subsurface model estimation (Virieux and Operto, 2009).
One current trend in FWI is to improve the reconstruction of physical parameters by accounting for a more realistic physics of the wave propagation, including viscosity, elasticity, and anisotropic effects.In particular, this leads to a multiparameter inversion that should provide a much more detailed and reliable subsurface characterization, ultimately yielding the possibility to constrain microscale parameters through downscaling strategies (Dupuy et al., 2016).Another line of investigation consists of increasing the applicability range of FWI, mitigating the dependence on ad-hoc initial model build-up, and on the design of a suitable hierarchical workflow.The optimal transport strategy we promote in this study is related to this second line of investigation.
The main reason for the limitation in FWI applicability is related to what is usually referred to as cycle skipping, or phase ambiguity.In standard FWI, the oscillatory seismic data is matched in the least-squares sense, where each observed sample is compared to the synthetic sample at the same position in time and/or in space.This choice is problematic: if the initial model predicts the signal with a shift larger than half a period, minimizing the least-squares distance between observed and calculated data amounts to matching the observed data up to one or several phase shifts.This yields an incorrect estimation of the subsurface model, which cannot be overcome through iterations: the optimization is locked into a local minimum.An illustration of this phenomenon, where the seismic data is considered schematically as a sinusoidal temporal signal, is presented in Figure 1.Overcoming this difficulty has been a recurrent objective since FWI's introduction by Lailly (1983) and Tarantola (1984).Increasing the initial model's accuracy through high-resolution tomography methods, as well as designing hierarchical workflows focusing first on low-frequency components of the data, early-arrivals, and/or short offsets, have been initial strategies proposed to challenge this issue (Kolb et al., 1986;Bunks et al., 1995;Pratt, 1999;Shipp and Singh, 2002;Sirgue and Pratt, 2004;Wang and Rao, 2009).They are still the ones implemented for real-data applications to guarantee the success of FWI.This careful tuning is case-dependent; therefore, it reduces the flexibility of FWI and requires an expert usage of FWI and preprocessing tools.
Since the cycle-skipping issue results from the use of the least-squares distance to match the seismic data, numerous studies have been proposed to modify this distance within the FWI workflow: cross correlation of observed and predicted data (Luo and Schuster, 1991), separation of envelope and phase (Fichtner et al., 2008;Bŏzdag et al., 2011), deconvolution approaches (Luo and Sava, 2011;Warner and Guasch, 2014) are instances of such attempts.Image-domain techniques have also been promoted for reflected waves: the velocity accuracy is probed through the analysis of the coherency of migrated images computed in an extended domain.These images are built upon a generalized imaging condition, which introduces a fictitious dimension (time lag or subsurface offsets, for instance) along which the image should be focused at zero if the velocity is correct (see Symes, 2008, for an overview).All these methods are currently under development; none has yet been proved definitely to overcome the cycle-skipping issue.Also, the extension of these strategies to a 3D framework is not always straightforward.This is especially the case for imagedomain techniques, due to the high computational cost associated with a higher dimensional model space when repeating the construction of extended-domain images.
In this study, we review an alternative modification of the misfit function, based on the optimal transport distance we have proposed in Métivier et al. (2016aMétivier et al. ( , 2016b)).This distance comes from a very active research field in mathematics, as testified by the number of textbooks published on this topic recently (Villani, 2003(Villani, , 2008;;Ambrosio et al., 2008;Santambrogio, 2015).Our proposition is inspired by emerging applications of optimal transport in image processing (see Ferradans et al., 2014;Lellmann et al., 2014, and references therein).In Section 2, we present what this distance is, why its properties make it a suitable choice for increasing the robustness of FWI, and how it can be implemented efficiently, in 2D and 3D configurations, within a standard FWI workflow.In Section 3, we present numerical results on the synthetic Marmousi model emphasizing the interest of this approach.More realistic 2D applications on the BP 2004 salt model and the benchmark 2014 Chevron data are presented in Métivier et al. (2016a), while a 3D application on the SEG/EAGE overthrust model is proposed in Métivier et al. (2016b).Conclusion and perspectives are given in Section 4.

Optimal transport for full-waveform inversion
What is optimal transport?Optimal transport finds its roots in 1780 in the work of the French engineer Gaspard Monge, from "École des Ponts."Monge supervised a bridge-building site.Piles of sand needed to be displaced to fill in holes.Monge expressed mathematically how this displacement could be achieved optimally to minimize the effort of the workers.This was the first formulation of the optimal transport problem.More than 150 years later, a modern and suitable mathematical formulation for this problem was introduced by Kantorovich (1942), as a minimal relaxation of the Monge problem.(When the Monge problem has a solution, the Kantorovich formulation provides this solution.)In his description, the initial piles can be associated with N quantities p i , located at points x i , for i = 1, …, N. The configuration of the holes is associated with M quantities q j , located at points y j , for j = 1, …, M.An important assumption is made: the total quantity of sand requested to fill in the holes q j is exactly equal to the total quantity of available sand p i .This corresponds to the mass-conservation assumption (1) Kantorovich considers the ensemble of displacements that make it possible to fill the holes with the quantities q j from the sand piles p i .These displacements can be represented as matrices γ with N rows and M columns.An entry γ ij tells how much from the pile p i should be moved to fill in the hole q j .Mapping the ensemble of piles, p, onto the holes, q, requires that the sum of the elements of the ith row of γ is equal to p i , while the sum of the elements of the jth column of γ is equal to q j .A matrix satisfying this assumption is called a "transport plan."An example of such a transport plan is presented in Figure 2. The matrix representing the transport plan schematized in Figure 2 (2) The matrix γ is a transport plan as the sum of the elements of its rows are equal to p 1 , p 2 , and p 3 , respectively, while the sum of the elements of its columns are equal to q 1 , q 2 , q 3 , and q 4 , respectively.The interpretation of this transport plan is as follows.From the sand pile p 1 , three mass units should be move to q 1 , while two mass units should be move to q 3 .The sand pile p 2 and p 3 should be moved integrally to q 4 and q 2 , respectively.
There is an infinity of possible transport plans that allow movement of the sand piles p to the holes q.The optimal transport problem consists of computing the unique transport plan, which minimizes a function measuring the total displacement cost.This cost is the sum of the elementary costs associated with the elementary displacements.The cost of an elementary displacement between As soon as the initial shift is larger than half a period of the signal, the fit of the signal using a least-squares distance is performed up to one or several phase shifts.One may try to fit the n + 1 dashed wriggle of the top signal with the n continuous wriggle of the middle signal moving to the wrong direction.The bottom dashed signal predicts the n wriggle in less than a halfperiod leading to a correct updating direction (from Virieux and Operto, 2009).
x i and y j is measured as the product between the mass that is actually transferred, γ ij , multiplied by the distance between x i and y j .This measure implies that a balance must be found between the amount of mass that is transported and the distance along which it is transported.Mathematically, this is formulated as the linear programming problem min (3)   where the expression x i − y j denotes a distance between x i and y j (often the Euclidean distance).

Why use an optimal transport distance for full-waveform inversion?
The reason optimal transport can be an interesting tool is that the solution of the linear programming problem (3) defines a distance between the distributions p and q.This distance can be used to measure the discrepancy between any two discrete quantities satisfying the mass-conservation assumption (1).
Engquist and Froese (2014) illustrate the interesting behavior of this distance in a simple numerical experiment, where the optimal transport distance between time-shifted Ricker signals is computed.After discretization, two Ricker signals can be considered as two discrete distributions p and q.The distance between the points x i and y j where masses p i and q j are located is measured in a single dimension, which corresponds to the time axis.As the Ricker signal is not positive, some of the mass p i and q j have negative values.While this could be counterintuitive to define negative mass, it might not be a stringent difficulty for computing the optimal transport distance between the distributions p and q.Engquist and Froese (2014) propose to split the Ricker signals into their positive and negative parts, and to define the optimal transport distance between these signals as the sum of two optimal transport distances: the first computed on the positive part of the Ricker signals, the second computed on the opposite of their negative parts.Working with the dual formulation of the optimal transport distance (which is defined in the next section) also yields the possibility to consider directly positive and negative mass.This is the strategy we employ.Finally, as the Ricker signals are only shifted in time, the distribution p and q satisfy the mass-conservation assumption (1).Therefore, the optimal transport distance between two shifted-in-time Ricker signals can be computed.
When a standard least-squares norm is used, the distance between the Ricker signals with respect to the time shift, presents two local minima aside the global minimum.This is an illustration of cycle skipping: minimizing the misfit between these two signal starting from a too large time shift would result in a misinterpretation of the signal.Conversely, when the optimal transport distance is used, the distance with respect to the time shift is perfectly convex: a single minimum exists.This is illustrated in Figure 3.
How can we interpret this result?The definition of optimal transport is the key: the distance between the two signals is the minimum effort required to transport one signal onto the other.This effort depends on the distance between points from which the mass has to be exchanged.Therefore, as soon as the time shift increases, the transportation effort increases.

What are the difficulties for using an optimal transport distance for full-waveform inversion?
Using an optimal transport distance, beyond simple Ricker signals, for real seismic data, is not straightforward.The first main difficulty is related to the mass-conservation assumption (1).This assumption cannot be guaranteed for real seismic data.Indeed, there is no reason for the mass distributions coming from the discretization of observed and calculated seismic traces to satisfy this assumption.Seismic events recorded in the observed data may not be predicted by the calculated data, resulting in this disparity.This is a serious difficulty.If this assumption is not satisfied, the Kantorovich problem has no solution.In the context of exploration-scale real seismic data, the important density of sources and receivers, yielding large number of seismic traces, also raises the question of how the optimal transport should be used to measure the misfit between observed and calculated data.A straightforward use consists of considering each seismic trace independently and solving as many 1D optimal transport problems as the number of traces.In this framework, the resulting misfit functions should be the sum of the optimal transport misfit computed for each trace.
However, optimal transport yields the possibility of performing more interesting multidimensional comparisons.For years, seismic-imaging practitioners have used representations of seismic data in panels allowing clear identification of the type of waves and seismic events that have been recorded.The identification of these events is performed through the analysis of their coherency in receiver-gather or shot-gather panels.
Figure 2. Example of a transport plan for the solution of the transport problem between a distribution p 1 , p 2 , p 3 toward a distribution q 1 , q 2 , q 3 , q 4 .We may consider that the distribution p is the discrete observed data and that the distribution q is the discrete synthetic data.
Interestingly, wave velocity perturbations are responsible in this case for shifting these events not only in the time coordinate but also in the receiver or shot coordinate.Robustly accounting for the shifts between these events can be achieved by defining an optimal transport distance in a multidimensional space: for time-domain FWI, in 2D configurations, 2D transport problems should be considered (time-axis + 1D receiver array), while in 3D configurations, 3D transport problems should be considered (time-axis + 2D receiver network).An additional dimension along the source dimension could be also considered; however, this would break the standard parallelism on shots commonly used for time-domain FWI implementations.For frequencydomain FWI, however, in 2D configuration, 2D transport problems should be considered (1D source array + 1D receiver array), while in 3D, 4D transport problems could be considered (2D source network + 2D receiver network).Let us note that, for frequency-domain application, the data have to be split into their real and imaginary parts; the resulting distance should be computed as the sum of the optimal transport distance computed separately on these two quantities.This ability to account for the coherency of the seismic events not only in the time dimension but in the whole gather domain is a key feature of the optimal transport distance.
In practice, the discretization of 2D or 3D realistic-scale shot gathers yields mass distributions that can reach several tenths of millions of elements and beyond.This raises another important difficulty.Standard algorithms for solving linear programming problems such as the problem (3) have a high computational cost, not adapted to this problem scale.For instance, all known methods for solving the problem (3) exactly, assuming the data are described by integers, have a computational complexity larger than (N 2 ), where the total number of discrete samples representing a gather is denoted by N. For realistic scale applications, algorithms in linear complexity (N ) or quasilinear comple x it y (NlogN ) a re a l most mandatory.
Our contribution is the design of a method that is able to overcome these difficulties of seismic imaging -namely a strategy to estimate an optimal transport distance between mass distribution that does not satisfy the mass-conservation assumption, with a computational complexity at most quasilinear.

What is the solution we propose?
The method we propose consists of solving a modified version of the Kantorovich problem.This strategy is explained in detail in an article oriented to the geophysical community (Métivier et al., 2016a) and in an article oriented to the applied-mathematics community (Métivier et al., 2016b).Only the main ideas on which this strategy is based are presented here in order to identify salient features of this new distance definition.While the problem ( 3) is referred to as the primal problem, the associated dual problem can be expressed as finding values φ i at each discrete point of the data set subject to inequality constraints: ) , ϕ i − ϕ j ≤ x i − x j ,∀i, j = 1,..., N .(4) Please note that we compute the difference of observed and calculated values at the same point i while the inequality constraint is nonlocal with the two indexes i and j.The solution of this dual problem ( 4) is equivalent to the solution of the primal problem (3).In other words, the distance between the observed data d obs (corresponding to the previously introduced distribution p) and the data d cal (corresponding to the previously introduced distribution q) computed as solution of these two problems is the same.Please note that, in this context, the quantity x i denotes a location in the calculated or observed data sets.For a 1D time signal, it corresponds to a time index.For a 2D shot gather, it corresponds to a couple (receiver/time) of indices.It can be extended also to a 3D seismic cube if needed.The very important duality result is due to Kantorovich; a complete proof can be found in Villani (2003) or Santambrogio (2015).Exactly as for the primal problem, if the mass-conservation assumption (1) is not satisfied, the dual problem has no solution.
Our first contribution is to recognize that, however, this can be relaxed through the addition of the constraints ϕ i ≤ c into the dual problem, where the quantity c is a user-defined constant quantity.The modified dual-transport problem we consider is thus max (5) This dual-transport problem accepts a solution when the massconservation assumption ( 1) is not satisfied.The problem (5) defines the Kantorovich-Rubinstein norm d obs − d cal KR (Lellmann et al., 2014).As a second step, we focus on the particular case where, instead of the Euclidean distance x i − x j , we use the ℓ 1 distance we denote by x i − x j to measure the distance between point x i and x j .For a 2D data set, if we denote x i 1 and x i 2 the two components of the point x i , we would have The transport problem we consider is thus max The complexity of the transport problem ( 7) remains important: even if the size of the variable φ is N, the number of linear inequalities constraints to be satisfied is in (N 2 ).The interest for focusing on the Kantorovich-Rubinstein norm associated with the ℓ 1 distance is related to the fact that an equivalent formulation can be derived, involving only (N) linear constraints.This drastic reduction comes from a particular feature of the ℓ 1 distance (also known as Manhattan distance), which allows satisfaction of the linear constraints only locally.Satisfying these (N) local constraints is equivalent to satisfying the (N 2 ) global constraints.
The following equivalent formulation can thus be deduced from (7), expressed as max The problem ( 8) is a modified dual optimal transport problem, which has a solution even when the mass-conservation assumption (1) is not satisfied, and which can be expressed as a linear programming problem of size N with (N ) linear inequality constraints.This reformulation opens the way to the design of efficient numerical strategies for its solution.To this purpose, the main idea we exploit is to recast it as a nonsmooth optimization problem, which can be solved efficiently through a proximal splitting algorithm named SDMM (Combettes and Pesquet, 2011).Using this method, we design an algorithm for approximating the solution of (8) with linear complexity (N ) or quasilinear complexity (NlogN ), allowing to consider the large-scale problems imposed by full-waveform inversion, both for 2D and 3D data panels associated with 2D and 3D configurations.The full procedure is described in Métivier et al. (2016b).An illustration of the use of this optimal transport distance on the example of the shifted-in-time Ricker signal is provided in Figure 4. Compared to the least-squares distance, a single minimum is recovered.However, compared to the standard optimal transport distance used by Engquist and Froese (2014), the convexity of the misfit function with respect to the time shift is lost.This is probably due to the use of the ℓ 1 norm to measure the spatial distance between points, even if this remains to be proved theoretically.This loss of convexity could appear as a penalizing feature; however, for the practical applications we have considered, this does not seem to have an impact on the convergence of the local optimization solvers.

Implementation and numerical results
Implementation: gradient computation.The FWI problem is solved through local optimization techniques, which are based on the computation of the gradient of the misfit function.Modifying the misfit function thus requires examination of how to compute its gradient.Standard implementations of FWI are based on the adjoint-state approach: the gradient is computed as the zero-lag correlation between incident and adjoint wavefields.The adjoint wavefields are computed as the back propagation of residuals at the receivers location (Plessix, 2006).It is established that a modification of the misfit function only induces a modification of the adjoint source definition (see Brossier et al. [2010]; Luo and Sava [2011] for instance).The question for us is thus to determine what are the residuals associated with the optimal transport misfit function (8) which will act as adjoint source.
The computation of the misfit function requires the solution of the problem (8).The misfit function is the value of the quantity ∑ i ϕ i (d obs,i -d cal,i ) where the function ϕ maximizes this quantity under the constraints discussed earlier.The important result regarding the implementation of our strategy is the following: it is possible to show that the residuals associated with this misfit function is precisely the function ϕ .Therefore, the optimal transport problem has to be solved only once by iteration, which prevents drastic increase in computer time (for instance we observe a 20% increase of the computation time for the gradient, on the Marmousi example presented after).From its solution, we can extract both the value of the misfit function and the residuals which have to be back propagated for computing the adjoint fields and, therefore, building the model gradient.We recover a single minimum.However, compared to the optimal transport distance used by Engquist and Froese (2014), the convexity of the misfit function is lost.This is due to the particular formulation of the Kantorovich Rubinstein problem, which is based on a ℓ 1 measure of the distance between points.
Numerical results: Marmousi case study.We illustrate the Kantorovich-Rubinstein approach on the following Marmousi synthetic case study.The exact Marmousi model is presented in Figure 5a.A synthetic data set is computed in the 2D acoustic constant-density approximation.We use a fixed-spread surface acquisition with 128 sources each 125 m and 168 receivers each 100 m, at 50 m depth.A Ricker source function centered on 5 Hz is used to generate the synthetic data set.The frequency content of the source is high-pass filtered above 3 Hz to mimic realistic seismic data.Below 3 Hz, seismic signal is contaminated by noise, and therefore is unavailable for inversion.Two initial models are considered: the first contains the main features of the exact model, only with smoother interfaces (Figure 5b).The second is a strongly smoothed version of the exact model with very weak lateral variations and underestimated growth of the velocity in depth (Figure 5c).Starting from these two initial models, we compare the FWI results obtained using a leastsquares distance and the optimal transport distance we propose.The minimization is performed using the l-BFGS algorithm (Nocedal, 1980) implemented in the SEISCOPE optimization toolbox (Métivier and Brossier, 2016).
The results we obtain are presented in Figures 5d-5g.The convergence to a correct estimation of the P-wave velocity model is obtained using both the least-squares (Figure 5d) and optimal transport (Figure 5f) distances starting from the first initial model.A slightly better estimation of the low velocity zone at x = 11 km, z = 2.5 km is obtained using the optimal transport distance.A high-velocity artifact can be seen for the least-squares estimation in this zone.More importantly, starting from the second initial model, only the results obtained using the optimal transport distance are meaningful (Figure 5g).The poor initial approximation of the P-wave velocity is responsible for cycle skipping and the least-squares estimation converges toward a local minimum (Figure 5e).The estimation obtained with the optimal transport distance is significantly closer from the true model, despite low-velocity artifacts in the shallow part at x = km, z = 1 km, and in depth at x = 12 km, z = 3.4 km.This example illustrates the potential of optimal transport for FWI: starting from a crude approximation of the P-wave velocity, a meaningful estimation is computed.In the same configuration, FWI based on the least-squares distance fails and produces a heavily cycle-skipped estimation.

Conclusion
Optimal transport distance appears as an interesting tool for FWI based on its ability to provide a more convex misfit function.
In addition, contrary to other misfit function modifications such as cross-correlation techniques, the resolution power seems to be preserved.Its principle is based on a change in the way the distance should be computed between observed and computed seismograms.When a least-squares distance is used to compare the seismic data, only point-to-point comparisons are performed and the information on the coherency between seismic events in the shot-gather panel is not accounted for in the inversion.Optimal transport offers the possibility to account for this coherency in the inversion.Theoretical work is mandatory to better understand the more convex behavior of the modified optimal transport distance we promote.However, the ability of the optimal transport to take into account the global coherency of the data is the reason we consider it as a promising alternative to standard misfit functions for FWI.
Currently, the method has been successfully applied for a salt reconstruction problem on the 2D BP2004 model.Starting from an initial model containing no information about the presence of salt, we were able to design a suitable workflow to progressively reconstruct a complex salt body.This prompts us to continue the investigation of this strategy for imaging subsalt targets.The optimal transport method has been successfully applied also to the Chevron 2014 benchmark data set, for which results comparable with those published in the literature have been recovered, using a very simple workflow.These two applications are presented in Métivier et al. (2016a).A first 3D application on the SEG/ EAGE overthrust model is also presented in Métivier et al. (2016b).Future work will include applications on 2D and 3D real data, for instance on the Valhall data set.Methodological work is also ongoing, as other optimal transport formulation might improve further the method in terms of computational cost and misfit function convexity.

Figure 1 .
Figure 1.Schematic example of the cycle-skipping/phase-ambiguity issue on sinusoidal signals.As soon as the initial shift is larger than half a period of the signal, the fit of the signal using a least-squares distance is performed up to one or several phase shifts.One may try to fit the n + 1 dashed wriggle of the top signal with the n continuous wriggle of the middle signal moving to the wrong direction.The bottom dashed signal predicts the n wriggle in less than a halfperiod leading to a correct updating direction (fromVirieux and Operto, 2009).

Figure 3 .
Figure 3. Computation of the misfit function between two time-shifted Ricker signals depending on the time shift, using a least-squares distance and an optimal transport distance.While the least-squares distance yields a nonconvex misfit function with two local minima aside the global minimum at zero time shift, the optimal transport distance yields a perfectly convex misfit function (from Engquist and Froese, 2014).

Figure 4 .
Figure 4. Computation of the misfit function between two time-shifted Ricker signal depending on the time shift, using a least-squares distance (black) and the Kantorovich-Rubinstein distance (red).We recover a single minimum.However, compared to the optimal transport distance used byEngquist and Froese (2014), the convexity of the misfit function is lost.This is due to the particular formulation

Figure 5 .
Figure 5. Marmousi model case study.(a) Exact model, (b) initial model 1, (c) initial model 2, results obtained with the least-squares distance starting from (d) model 1, from (e) model 2, results obtained with the KR distance starting from (f) model 1, from (g) model 2.