Wasserstein Loss for Image Synthesis and Restoration

. This paper presents a novel variational approach to impose statistical constraints to the output of both image generation (to perform typically texture synthesis) and image restoration (for instance to achieve denoising and super-resolution) methods. The empirical distributions of linear or non-linear descriptors are imposed to be close to some input distributions by minimizing a Wasserstein loss, i.e. the optimal transport distance between the distributions. We advocate the use of a Wasserstein distance because it is robust when using discrete distributions without the need to resort to kernel estimators. We showcase diﬀerent estimators to tackle various image processing applications. These estimators include linear wavelet-based ﬁltering to account for simple textures, non-linear sparse coding coeﬃcients for more complicated patterns, and the image gradient to restore sharper contents. For applications to texture synthesis, the input distributions are the empirical distributions computed from an exemplar image. For image denoising and super-resolution, the estimation process is more diﬃcult; we propose to make use of parametric models and we show results using Generalized Gaussian Distributions.

1. Introduction.The statistical modeling of natural images is a long standing problem which is useful to tackle a wide range of image processing applications.In this article, we focus our attention to problems in both image synthesis and image restoration; we show how to address them with variational approaches using the statistical model as a data fidelity rather than a regularization.This allows us to obtain results that are more faithful to the targeted model (as opposed to traditional variational regularization methods) and to better restore and generate fine scale details.

Previous Works.
Texture synthesis: sampling vs. optimization.The texture synthesis problem is usually framed as a statistical estimation and re-sampling problem.Probably the most complete framework to achieve this goal is the work of Zhu et al. [74] that proposes a maximum-entropy method to estimate a generic class of Gibbs distributions from a given exemplar, leveraging only the stationarity of the thought after solution.The major drawback of this approach is that the resampling stage is slow since it necessitates the use of a Gibbs sampler.At the opposite part of the spectrum, the method of Galerne et al. [27] uses a simple statistical model (a stationary Gaussian distribution).It can only capture "micro-textures" without any geometrical pattern but can be learned and can re-synthesized very efficiently, hence enabling its use in real-time computer graphics pipelines [28].Let us note however that the state of the art in computer graphics rather relies on heuristic methods to learn and re-sample: it typically considers re-copy of groups of pixels or patches from the input exemplar [19,69,38].While these non-parametric approaches lead to high fidelity in the re-sampling, they suffer from a poor understanding of the underlying generated distribution and lead to synthesis results that copy verbatim large chunks of the input image without much randomization or "innovation"-see [1,62] for an experimental exploration of this aspect.
To speed-up (and sometime even improve) the sampling quality, a increasingly large body of literature has considered replacing random sampling from a given distribution by an optimization of a non-convex energy, usually starting from a random initialization.While there is no strong mathematical analysis of the generated distribution, the rationale is that the random exploration of local minima associated to an energy which measures deviations from statistical constraints gives in practice results that are at least as good as sampling from the maximum entropy distribution inside these constraints, as originally advocated in [74].The pioneer work [34] of Heeger and Bergen, that imposes the first order statistics of wavelet coefficients through iterative projections, may be seen as a precursor of such variational approaches.This idea was pushed much further by Portilla and Simoncelli [50] by imposing higher order statistics.Then, patch-based recopy methods have been explicitely reformulated as non-convex random optimization [35].Tartavel et al [64] also integrates patch-based ideas into a variational texture synthesis algorithm by using sparse coding in a learned dictionary.Another recent line of research, also based on the statistical modeling of non-linear image features, makes use of deep neural networks, which are thus also trained on the input exemplar [29,42].
Variational Restoration.While it is natural to think of texture synthesis as a statistical re-sampling problem rather than an optimization one, a large body of literature on image restoration (e.g.denoising or super-resolution) thinks the other way around.The idea of variational regularization, very often derived as a MAP (Maximum A Posteriori estimator), is to recover a clean and high resolution image by minimizing a functional accounting for both a fidelity term (which depends on the noise's properties) and a prior (which takes into account some knowledge about the structures of the image to recover).Two popular classes of priors are the Total Variation (TV) model of Rudin, Osher and Fatemi [57] and the sparse wavelet approximations of Donoho and Jonhstone [18].The TV model favors cartoon images having a sparse gradient distribution, and shares similarity with wavelet models that capture discontinuous images using multiscale decompositions.Total variation has been used in almost all imaging applications, including for instance denoising [57], deblurring [6], interpolation [33], or image decomposition [3].Wavelet approaches can give rise to different kinds of prior, using thresholding in orthogonal [18] or translation invariant [13] dictionaries, and used either as analysis or synthesis regularizers for inverse problems [22].It is also possible to go beyond simple first-order sparse statistical models by accounting for spatial and multiscale dependency between wavelet coefficients [59,55,51].
Beside these historically popular methods, the state of the art in denoising and (to a lesser extend) more general restoration problems is non-local processing, that operates by comparing patches in the image.This is somehow similar to the recopy methods presented above for texture synthesis applications.The non-local mean method [9] introduces this idea for denoising as a non-linear filtering operator-see also [4].It was then recast as a variational regularization [30,23] that can be used for restoration purpose [2,49].The BM3D approach [16] extends one step further these non-local ideas by merging them with sparse approximation, giving rise to state of the art results.More recently, several works have achieved even better results by modeling patches through Gaussian distributions [72,37].
Dictionary learning somehow bridges the gap between these patch-based comparison methods and sparse wavelet (or gradient) regularization.This approach was introduced by Olshausen and Fields [47] and then refined for denoising applications (see for instance [24,21]) and more general imaging problems (see for instance [43]).
Statistical constraints for restoration.A well known issue with variational MAP estimators is that the output of the method is in general not distributed according to the though-after distribution, as explained and analyzed in details in the works of Nikolova [45].A striking class of examples is given in [32] where it is argued that most thresholding operators are adapted to data which are significantly sparser that the initial prior.However, it makes sense for many applications to impose the distribution of the estimator, typically to reach a high fidelity in restoring fine scale details or geometric patterns such as edges.Variational methods usually do not come with such a guaranty on the output, which is problematic in some cases.A well known example is the output of TV regularization: it suffers from the "stair-casing" effect [45], resulting in many pixels having zero gradient values.This is usually not a desirable statistical feature and is in contradiction with a Laplacian prior on the gradient.More generally, the MAP tends to under-estimate the heavy tails of the distributions [71], hence removing some information and often resulting in the degradation of textured parts of the image.
The standard way to solve this issue is to use a conditional expectation (minimum mean square estimator) rather than a MAP.This has been proposed for instance in the case of sparse wavelet priors in [58] and for the total variation prior in [40].The literature on inverse problems using bayesian sampling is huge, and we refer for instance to [60] for an overview.By construction, these approaches are perfectly faithful to the prior distribution, which alleviates some of the drawback of MAP estimators (e.g. the staircasing effect for TV).As already noted above for synthesis application, the main issue is the high computational complexity of the corresponding probabilistic sampling methods (Markov Chain Monte Carlo methods such as Gibbs samplers).
Statistical fidelity and optimal transport.Following several recent works, we propose to explore an alternative way to impose statistical prior constraints on the output of both synthesis and restoration algorithms.This approach is more heuristic but computationally less expensive than sampling-based methods; it integrates the statistical constraints directly into a variational MAP-like approach.
Our approach takes its roots in a series of conference papers where one of us introduced statistical losses over pixels values and wavelet coefficients to perform texture synthesis and restoration [53,54].The authors of [61] use a convexification of the same energy in order to be able to compute global optima at the expense of slower algorithms.An initial version of our framework was presented in the thesis manuscript [62].A related work [20] has been done in parallel with an application to super-resolution.These previous works advocate the use of optimal transport distances (also known as Wasserstein distances) to account in a robust and simple way for discrepancies between discrete empirical distributions.We refer to the monograph of Villani for a detailed description of Wasserstein distance [67].These distances have recently gained some interest in image processing, in particular to perform color and texture manipulation-see for instance [26] and the references therein.
Let us also note that optimal transport methods differ from more standard losses to measure statistical discrepancy.The most routinely used is the Kullback-Leibler (KL) divergence which is reminiscent of estimation using maximum likelihood meth-ods.Of particular interest for our work are the recent works [11,12] that use a KL loss to account for gradient statistics in the context of restoration problems and therefore pursue the same goal as we do in Sect.4: preserving gradient statistics while restoring images.A similar approach is used by [73] to extend the Field of Experts approach of [56] with a Kullback-Leibler divergence to control the gradients' distribution.The idea of imposing the gradients' distribution is also at the core of [31] to tackle inverse problems in biomedical imaging.A chief advantage of the Wasserstein loss is that it can compare pairs of discrete empirical distributions without the need to use kernel density estimators, which is further complicated in practice by the difficult problem of selecting the bandwidth of the kernel.
1.2.Contributions.Sect. 2 introduces a novel framework to impose statistical constraints in imaging using a Wasserstein loss.The key idea is to use an optimal transport between discrete distributions of a set of features, that can easily be optimized using gradient descent methods.Sect. 3 presents a first application of this idea to perform texture synthesis, which is achieved by simply applying a memory-limited quasi-Newton (L-BFGS) optimization scheme to this statistical loss.Several examples on both linear (wavelet-based) and non-linear (sparse coding) features illustrate the ability of the method to generate random samples with prescribed features' distributions.Sect. 4 presents a second application to image restoration and in particular to denoising and super-resolution (inpainting of scattered missing pixels).This approach imposes the gradient distributions of the resulting image on top of a traditional variational estimator (we illustrate here the method with a TV regularizer).The target distributions are estimated from the single noisy observation by restricting the search to a parametric family of Generalized Gaussian Distributions.Numerical results show how this additional statistical fidelity is able to improve the restoration of fine scale details, edges, and textural features.
A Matlab implementation of the proposed framework is available online * .
2. Wasserstein Loss.This section formalizes the statistical fidelity term that is used in the remaining part of this paper both to perform variational texture synthesis (Sect.3) and image restoration with statistical constraints (Sect.4).The main novelty with respect to previous works is the use of an optimal transport distance between empirical measures (i.e.sum of Dirac's).While being non-convex, a chief advantage of this approach is that it is robust (because optimal transport is a well defined geometric distance between Dirac masses) and it is simple to compute (at least for 1-D distributions).
2.1.Wasserstein Distance.We consider discrete probability measures which are normalized sums of N Dirac's where δ t is the Dirac measure at a point t ∈ R J .In practice, such a measure is an empirical distribution of a set of N features extracted from an image, as detailed in Sect.2.2 below.
Optimal transport provides a blueprint to define classes of distances between measures defined on quite general spaces.We refer to the monograph of Villani [67] for an extended description of these distances and their properties.For the sake of simplicity and for its practical relevance to the targeted applications in imaging, we only consider optimal transport with a squared Euclidean cost and between two discrete measures (ν z , ν z ) having the same number N of Dirac masses.In this very specific setting, optimal transport is equivalent to optimal assignment and the corresponding distance, the so-called 2-Wasserstein distance, is defined as where σ runs among the set Σ N of permutations of {1 . . .N }.
In the following, we denote σ z,z an optimal σ in (2.2) and we assume an arbitrary choice if σ z,z is not the unique solution.It corresponds to an optimal assignment between each point z i and the point z σ z,z (i) .
The following proposition recalls several simple but important properties of optimal assignments.We refer to [7] for more details and proofs.It states that the optimal assignment to z can be interpreted as an orthogonal projection on the highly nonconvex and high dimensional set of all possible permutations of z .Another crucial property is that the Wasserstein loss as a function of the Dirac's position (e.g.z or z ) is smooth almost everywhere, namely as soon as the optimal permutation is unique (which is almost surely the case for points in arbitrary positions).Furthermore, the formula (2.4) for the gradient is straightforward since the Wasserstein loss is locally a simple 2 loss as long as the optimal permutation does not change.
Proposition 2.1.Let use denote One has where dist (z, C) and If σ z,z is the unique minimizer of (2.2), then the function (2.4) Computing σ z,z and hence evaluating W(ν z , ν z ) 2 and the gradient (2.4) can be achieved using standard optimal assignment algorithms, which are combinatorial optimization methods with roughly cubic complexity.These methods do not scale to large N , and are thus not usable for imaging applications.The only exception is the case of 1-D distributions (i.e. each z i is in R) as detailed in the following proposition (see [7]).
Proposition 2.2 (1D Wasserstein Distance).For J = 1, one has where σ z sorts the values of z, that is, . . .≤ z σz(i−1) ≤ z σz(i) ≤ . ... This shows that the Wasserstein loss and its corresponding gradient can be computed in O(N log(N )) operations by simply sorting the values.This is the main reason why we consider 1-D distributions in the following.In the presence of higher dimensional distributions, we replace them by a set of 1-D distributions obtained by projections.This corresponds to the "sliced" Wasserstein approximation introduced in [7].A typical illustration of this idea can be found in Sect. 4 where the 2-D distribution of the gradients of an image is replaced by K 1-D distributions of directional derivatives (4.3).

Feature-Space Statistical Loss.
A feature extractor is a map H from an image u ∈ R P having P pixels to a set of N real-value features Very often one has P = N but this needs not to be the case.The most simple feature extractors are linear maps such as: • Pixel values using the identity map H(u) = u.
• Filtering H(u) = ψ u against a "wavelet"-like filter ψ (where is the convolution).It is of course possible to consider more complicated non-linear features such as SIFT vectors [41] or structure tensors [70] to account for more complicated features in an image such as edge curvature or corners.Sect.3.3 studies a different way to extract and use non-linear features through sparse coding.
To perform texture synthesis (Sect.3) or to improve the quality of restoration methods (Sect.4), we make use of a set {H k } 0≤k≤K of feature extractors.A typical example (see Sect. 3.2 which uses it for texture synthesis) is to set H 0 = Id to be the pixel value extractor and to let the others extractors to account for a wavelet transform H k (u) = ψ k u where ψ k is a wavelet kernel at a given scale and orientation.
Given a set of input distributions (ν z k ) 0≤k≤K , we now define our Wasserstein statistical loss as (2.6) The input distributions (ν z k ) k can be learned directly from an exemplar image v by setting z k = H k (v) (as it is the case for texture synthesis in Sect.3).In cases where no input image is available or if v is degraded (as it is the case for image restoration in Sect.4), one needs to consider more advanced estimation procedures.Note that we do not put any weights in front of each Wasserstein term in the sum (2.6) since they can be absorbed in the definition of the to be used in variational methods for synthesis and restoration.We use gradienttype optimization schemes in the following sections to compute local minimizers of synthesis or restoration energies.
A nice feature of this approach is that the gradient of E (with respect to the image u) is simple to compute using the chain rule: where the gradient of the map f z is computed as detailed in (2.4).One thus needs to compute the adjoint [∂H k (u)] * of the differential ∂H k (u) of the feature extractor H k .Each time we introduce and use a new feature extractor, we thus explain how to compute this adjoint map.
3. Texture Synthesis using Linear and Sparse Features.This section applies the Wasserstein loss to the problem of texture synthesis, and explores the use of different classes of features extractors.
3.1.Variational Texture Synthesis Methods.Following several recent works (see Section 1.1) we formulate the problem of texture synthesis from an exemplar v as the one of computing a random stationary point u of a texture fidelity energy min In order to compute a randomized stationary point of this energy, we use a gradient descent method.Starting from a realization u (0) ∈ R P of a white noise, we define the iterates as where ∇E is computed using formula (2.7).
Here L ( ) is a linear operator, which is intended to approximate the inverse of the Hessian of E(•|(ν z k ) k ) at u ( ) .For the numerical applications, we use the L-BFGS algorithm [46], which is a limited-memory version of the BFGS algorithm [52].This algorithm iteratively builds a low-rank approximation L ( ) of the inverse of the Hessian.When dealing with high dimensional problems, this is a good compromise between the tractability of the gradient descent and the efficiency of Newton's algorithm.We use the Matlab implementation [48]; a convergence analysis of this algorithm in the case of a class of non-smooth functions is proposed in [39].

Linear Wavelet Features for Synthesis.
To illustrate the usefulness of the proposed synthesis scheme, we first instantiate it in a simple setting, corresponding to a wavelet transform feature extractor.This allows us to revisit and improve the celebrated texture synthesis algorithm of Heeger and Bergen [34].
Heeger and Bergen's (HB) original algorithm.For the sake of simplicity, we consider a gray-level version of the algorithm.To impose the gray levels of the input image, the first feature is the image pixel values, i.e. we set H 0 (u) = u.A steerable wavelet transform [34] computes wavelet coefficients with a linear transform where H k (u) is a sub-sampled convolution against a rotated and scaled wavelet atom.This feature extractor can be understood as multi-scale directional derivatives (i.e.derivatives of increasingly blurred version of the image u).The computation of the linear map H 1,...,K is obtained in linear time with a fast filter bank algorithm [34].
Heeger and Bergen initial algorithm alternates between matching the spatial statistics and matching the wavelet statistics.From a Gaussian white noise realization u (0) , it first modifies the gray-values statistics of the current image u ( ) so that they match those of the exemplar v by projecting onto C ν0 , i.e. ũ( ) def.
= proj C H 0 (v) (u ( ) ).It then computes the wavelet coefficients of ũ( ) and matches them with those of v The new iterates is then reconstructed from these coefficients (z where H + 1,...,K is the pseudo-inverse of the wavelet transform H 1,...,K (i.e. the reconstruction which yields the image with the closest wavelet coefficients).Note that this pseudo inverse is also computed in linear time with a fast filter bank pyramid [34].
Variational re-formulation and improvement.The iterations of the three steps (3.3), (3.4), and (3.5) should not be mistaken for an alternated projections algorithm.Indeed, since the steerable pyramid is not orthogonal (it is a redundant transform), the two last steps do not correspond to an orthogonal projection.Using the fact that the transform H 1,...,K is a tight frame (i.e. it satisfies a conservation of energy), it can however be interpreted as computing alternated projections on the coefficients (z [62].Another interpretation of this method is that it is in fact an alternated (i.e.alternatively on the pixel fidelity and then the wavelet fidelity) gradient descent of the energy (3.1) using a fixed step size, see [62].
To obtain better results and a faster numerical scheme, it thus makes sense to replace this alternated gradient descent by the L-BGFS iterations (3.2) which integrates second order information about the statistical fidelity into the optimization scheme.The only missing ingredient to implement this descent is the computation of the adjoint ∂H * k of the derivative of the feature extractors.Since these are linear maps, one has ∂H * k = H * k , and these adjoints for k = 1, . . ., K are computed for all k with a fast filter bank, which, up to a diagonal re-scaling, is the same as the one computing H + 1,...,K .Numerical illustrations.randomized by the choice of the initialization, we show the minimum and maximum energy values reached among 100 random white noise realizations for u (0) , which allows one to get a clearer picture of the energy landscape explored by the methods.In these tests, the L-BFGS always yields lower values than the other algorithms; it has also a faster convergence speed.Note that HB algorithm (corresponding to an alternated gradient descent) converges faster than gradient descent, but the final state reached by H-B is only marginally better in term of energy than the one reached by gradient descent.Figure 3.2 shows the resulting image u ( ) computed by the HB [34] and the L-BFGS algorithms after = 25 iterations.The vertical stripes in the exemplar image v give rise to long-range dependencies between the wavelet coefficients, which require many iterations to be set up properly.L-BFGS is able to generate them within 25 iterations, while the original algorithm of [34] requires more iterations to create them.
Let us finally note that while the L-BGFS approach brings some improvements with respect to the original HB algorithm, it suffers from the same drawback in term of texture modeling.Figure 3.3 shows two examples of syntheses obtained by the L-BFGS optimization method.This highlight the (well known) fact that methods based on first order statistics of multiscale decompositions are quite good at reproducing highly stochastic micro-textures (in the sense [27]) such as the exemplar on the left of the figure, but fail at generating geometric patterns, such as in the exemplar on the right.Synthesizing geometric features requires either to use higher order statistical constraints on the wavelet coefficients (see for instance [50]) or more complicated non-linear feature extractors, as exemplified in the next section.

Non-linear Sparse Features.
To improve synthesis quality, we follow the path explored in [63] and propose to use non-linear feature extractors obtained through sparse decompositions.Sparse decompositions are able to better reproduce sparse features (such as edges) and also permit the use of dictionary learning methods to capture these features in an exemplar-driven manner, thus bridging the gap between the HB algorithm and copy-based methods (see Sect. 1.1 for an overview of these approaches).
Sparse coding feature extractor.Given a dictionary D = (d i ) N i=1 of N atoms d i , a sparse approximation Z D (h) of a signal h (which may be an image u or small patches extracted from it for instance) is commonly defined using a penalized 1 regularization where the parameter α controls the sparsity of the coefficients.This corresponds to the celebrated Lasso problem [65], also known as basis pursuit [10].Note that the solution of the optimization problem (3.6) is in general not unique, so we define Z D (h) as being one of the minimizers.
It is important to realize that while the reconstruction operator Dz = i z i d i is linear, the sparse coding operator Z D is highly non-linear.More precisely, the nonsmoothness of • 1 creates a large number of zero coefficients in z, and this sparsity effect increases with α.It is precisely this non-linear effects that is important to drive the generation of sparse features (such as isolated dots or edges) in a synthesis output.
Computing the minimizer of (3.6) requires to solve a convex optimization problem.A popular way to approximate the solution of this optimization for large scale imaging problem is to use a so-called first order proximal scheme.We use in this section the Fast Iterative Soft Thresholding Algorithm (FISTA, [5]) which defines a sequence of iterates Z ( ) D (h) converging to Z D (h) as grows toward +∞.Synthesis with sparse features.We define a sparse synthesis algorithm using the L-BFGS iterations (3.2) to minimize (3.1) with a trivial pixel-domain extractor H 0 (u) = u and a (possibly decimated) set of K sparse features extractors.This is formalized as Here, Π and (S k ) k are fixed linear maps.The role of S k (z) = (z i ) i∈I k is simply to select a sub-class of coefficients indexed by I k , associated to atoms (d i ) i∈I k which are usually translations of the same template (e.g. for wavelet-type or translation invariant dictionaries).The role of Π is to act as a pre-processor.In the following, we explore two possibilities: Π being a patch-extractor to perform the sparse coding on small patches, and Π = Id so that one sparse codes the image u itself.
It is important to note that the feature is defined using the output Z ( ) D of FISTA.It is an approximation of the original Z D , the latter being well defined mathematically but numerically unaccessible.
In order to implement iterations (3.2), one needs to compute the gradient ∇E using (2.7), which in turn necessitates to compute the adjoint of the derivative of the features extractors as follow The adjoint of the selector is simply and Π * is detailed in the two specific cases bellow.
It is important to realize that the adjoint of the differential of FISTA's output [∂Z ( ) D ] * is in general very different from the adjoint of the theoretical sparse coding operator [∂Z D ] * .This is well documented in [17], where the authors argue that the derivative of sparse coding operators are unstable and computationally out of reach.In contrast, ∂Z ( ) D and its adjoint are easy to compute using automatic differentiation technics, which is equivalent to formally differentiating the iterates of FISTA.We refer to [62] for more implementation details.
Dictionary learning methods.The resulting synthesis method is quite general because of the degree of freedom in the choice of the linear maps (D, Π, (S k ) k ) and the regularization parameter α.It is beyond the scope of this paper to explore in full generality this method; we illustrate it in two specific settings bellow.
In these two examples, one first estimates D using standard dictionary learning technics (see for instance [21,25,43]) from the exemplar image v alone.One minimizes the sparse coding energy with respect to the dictionary D, i.e.
Here D is a linear set of constraints (specified bellow in each numerical example) that imposes some structural property on the dictionary (such as translation invariance) and make the problem well-posed.The standard way [21,25,43] to find a stationary point of the (highly non-convex) energy (3.7) is to iteratively minimize with respect to D and to z since the energy is convex with respect to each of these parameters alone, even if it is not jointly convex.
Numerical illustration #1: patch-based dictionaries.A popular class of image processing methods (see Sect. 1.1 for a review) makes use of the redundancy between the patches of the image.This corresponds to the best known usage of dictionary-learning approaches that apply sparse coding to these so-called patches.It reduces the computational complexity as well as the number of degrees of freedoms to design the dictionary.This corresponds to defining Π as a patch extractor, where s ∈ S runs over a fixed set of extracting position (usually located on an uniform grid on the image domain) and τ × τ is the size of the patches.Note that the adjoint operator Π * simply corresponds to reconstructing an image by tiling the patches and adding the overlapping parts: In this setting, we impose the dictionary to handle each patch independently and in the same way.That is, we consider Dz = ( Dz s ) s∈S where D = ( dk ) K k=1 is a "reduced" dictionary made of K atoms d k ∈ R τ 2 .The coefficients are thus z = (z s ) s∈S ∈ R |S|×K where z s ∈ R K are the coefficients of the patch Π s (u).The set of coefficients z = (z s ) s∈S is thus subdivided into K groups {1, . . ., N } = I 1 ∪ . . .∪ I K , each I k corresponding to the |S| coefficients associated to a given atom d k .This construction is usually the one considered in the literature [21,25,43].The learning optimization (3.7) is carried directly over the reduced dictionary D, hence making the overall problem tractable.It makes sense to design our statistical constraint model to impose the first order statistics of each of these group.
Numerical illustration #2: translation invariant dictionaries.In this part, we apply the sparse coding operator to the image itself, that is, we set Π = Id P , which is the identity map on the image space R P .It is not possible to train a generic unstructured dictionary D on the whole image u ∈ R P because optimizing and applying the dictionary would be numerically intractable, and also because the learning problem from a single exemplar image v would not be well posed because of the too many degrees of freedom.The most natural constraint is to impose the dictionary to be translation-invariant, i.e. to constrain it to be obtained by translating a set of atoms {ϕ k } K k=1 .In this case, applying the dictionary to a set of weights z = (z 1 , . . ., z K ) is where is the discrete 2-D convolution.Formula (3.8) defines a linear constraint D on the set of allowable dictionary.Note that similar shift-invariant constraints have been proposed in the dictionary learning literature, see for instance [8].The energy (3.7) can be efficiently minimized directly with respect to the filters (ϕ k ) k using fast convolution computations as detailed in [8] for instance.This specific structure segments the set of coefficients in non-overlapping consecutive indexes {1, . . ., N } = I 1 ∪ . . .I K , so that the total number of coefficients is N = KP where P is the number of pixels.
Numerical results.The synthesis results using the two setups presented above (Π being a patch extractor and Π = Id) are shown in Fig. 3.4.In order to improve the synthesis results, following the previous work [64], we have added to the Wasserstein statistical loss defined in (2.6) a frequency matching term.More precisely, we minimize the function where E(u|(ν z k ) k ) is defined in (3.1) and the second term measures the fidelity to the modulus of the Fourier transform v of the exemplar image v. Figure 3.4 shows several exemplar images, the syntheses obtained by minimizing (3.9) with the patch-based dictionary (described in the numerical illustration #1) and with the translation-invariant dictionary (described in the numerical illustration #2).In the patch-based case, the parameters are τ = 12, K = 192 (redundancy around 1.5), α = 0.1 and β = 1.In the translation-invariant case, we use τ = 12 (size of each ϕ k ), K = 5, α = 2, and β = 1/16.These results show that the switch from a fixed linear representation (as used in Section 3.2) to adaptive non-linear representations enables a more faithfully synthesis of geometrical textural patterns.in the introduction, the goal is to control the statistical properties of the restored image.As a case study, we consider the classical TV regularization.We choose to impose the statistics of the gradients of the image, since this controls both textured aspects and contrast [11].In order to estimate the true distributions of gradients from the noisy observation alone, we assume that they follow some parametric model.The Wasserstein loss is then computed between the estimated distributions and the distributions of the denoised image.We will see that such a framework is efficient to prevent some defaults of the total variation.As a surprising by-product, we also observe that the Wasserstein statistical loss can actually replace the total variation term, therefore acting both as a fidelity and a regularization term.
4.1.Wasserstein Loss for Image Restoration.We consider the problem of recovering an approximation u of some unknown image u 0 from degraded noisy measurements y = Φu 0 + η where Φ : R P → R Q is a linear operator modeling the acquisition process and η is some additive noise.We assume that η is a realization of a white noise of mean 0 and variance σ 2 Id Q .In this section, we show applications to image denoising (corresponding to Φ = Id) and give an example of denoising (that is, Φ is a nonuniform down-sampling operator).
Variational problem.Following the general setting of variational estimators for inverse problems, we consider a recovery obtained by a minimization of the form min where E(u|(ν z k ) k ) is a statistical fidelity term (2.7) and R(u) is a regularization term.The parameters λ , λ ≥ 0 controls the trade-off between the statistical fidelity term and the regularization term.For the sake of normalization, we choose some λ ∈ [0, 1] to define λ def.
= λ √ 2P when using TV regularization (4.2).The constraint set accounts for the presence of a Gaussian noise of variance σ 2 , and the parameter α is typically chosen close to 1.In the following, we use the classical total variation (TV) regularization [57] R(u) where (∂ i,1 u, ∂ i,2 u) is the usual forward finite difference approximation of the gradient of u at pixel i.While the TV penalty is not the state of the art in image denoising and restoration (see Sect. 1.1 for a review), it is a good benchmark to exemplify the main features of our approach.
As detailed in Sect.1.1, traditional variational approaches and in particular TV regularization have difficulties to restore fine-scale textural details.To account for such high frequency content, it thus makes sense to study and to impose the statistical distribution of local derivatives of the image.To define the statistical fidelity E(u|(ν z k ) k ), we thus use gradient-domain linear features extractors H k that corresponds to K directional derivatives In the numerical simulations, we use K = 4 directional derivatives.
A delicate issue to be able to apply the variational method (4.1) is to define the distribution (ν z k ) K k=1 from the degraded observation y alone.We detail how we achieve this in the two considered scenarios (denoising and super-resolution) in Sect.4.2.
Minimization algorithm.The optimization (4.1) is both non-convex (because of the Wasserstein loss E(•|(ν z k ) k )) and non-smooth (because of the 1 norm in the definition of the TV regularization (4.2)).We apply the first order proximal algorithm of [14].While this algorithm is originally designed to handle convex functional, it has been applied with success to non-convex functionals [15] although there is no proof of convergence in this case.In our setting, we also reached the conclusion that this algorithm was able to converge even in the presence of a smooth non-convex term in the energy to be minimized.
The algorithm of [14] minimizes a function of the form F (u) + G(u) + H(L(u)) where F is a smooth function, L is a linear operator, and (G, H) are possibly nonsmooth functions for which one is able to compute the so-called proximal operator.We refer to [14] for the detailed exposition of the iterations of the algorithm.
In our case, we use the splitting is the discretized gradient operator, and H(p 1 , p 2 ) = i p 2  1,i + p 2 2,i is the 1 norm of 2-D vector fields.The proximal operators of G and H are respectively the orthogonal projection on C α and the soft thresholding operator (see [14] for more details).The gradient of F is computed using (2.7), and making use of the fact that where the adjoints (∂ * 1 , ∂ * 2 ) of the forward derivatives are minus the backward derivatives.

Estimation of the Prior Distributions.
In contrast to the texture synthesis problem studied in Sect.3, the estimation of the target distributions (ν z k ) k is a delicate problem for restoration applications since one does not have direct access to a clean image.Ideally, each ν z k should be a good approximation of the (unknown) distribution ν H k (u0) associated with the image to recover.The estimation of ν z k from the noisy observations y is in itself a difficult statistical problem: we do not aim at covering it in full generality.
To make this estimation well-posed, a common practice is to assume some parametric form for the distributions.To account for the sparsity of the derivatives of natural images, a popular parametric model [44] is the Generalized Gaussian Distribution (GGD).This model has been successfully used in particular for image restoration [59] or texture modeling [66].We thus impose the discrete distribution ν z k to be the best match of a continuous GGD density g p k ,s k of parameter (p k , s k ).The GGD density is where C p,s is a normalizing constant, s controls the variance of the distribution, and p is the shape parameter (controlling the sparsity of the distribution, e.g.p = 2 corresponds to a Gaussian distribution).
For each k, the estimation of ν z k is thus achieved in two steps: (i) estimation of the parameters (p k , s k ) from the input y ; (ii) estimation of the quantized positions z k ∈ R N from the estimated (p k , s k ).
Parameters estimation.We first treat the case where Φ = Id so that y = u 0 + η.One thus has H k (y) = H k (u 0 ) + η k where η k def.
= H k (η) is the realization of a (colored) Gaussian noise for which each entry has a variance 2σ 2 (by linearity of the differentiation operator).
The known distribution of the coefficients of H k (y) is thus equal to a convolution between the unknown distributions of H k (u 0 ) and a Gaussian distribution of variance 2σ 2 .The estimation problem is thus a well known deconvolution problem, which has been studied extensively in the case of a 1-D parametric GGD distribution.We use the method described in [62] which performs a matching between the cumulants of the GGD + Gaussian noise and the empirical cumulants computed from H k (y).The cumulants of order 2 and 4 (the two first non-zero cumulants) yield two equations whose unknown are (p k , s k ), hence providing an estimation of these parameters.
In the case of inpainting, Φ is a diagonal masking operator, that is, some pixels are unknown.The parameters (p k , s k ) of the GGD modeling H k (y) are obtained in the same way as before; the only difference is that the empirical cumulants of H k (y) are computed using only the subset of known values of the gradient H k (y).
Thanks to the homogeneity property of the cumulants, the deblurring case could also be handled in a similar way.However, this approach rapidly becomes unstable when the convolution kernel is getting bigger.This is because the gradients are highly damaged as the image is blurred, making their distributions difficult to estimate in the presence of noise.
Quantized positions estimation.The quantized positions z k should be chosen to minimize the discrepancy between the continuous distributions G p k ,s k and the discrete distribution ν z k .For 1-D distributions, the solution to this optimal quantization problem is well-known and corresponds to the posterior expectation on each quantile of the distributions: where t i def.
= G −1 p k ,s k (i/N ) are the quantiles and G p k ,s k the cumulative distribution of the GGD.Then as proved in [62].Hence the Wasserstein distance to the GGD is equal to the distance to the discretization z k , up to an additive constant which has no impact on the minimization problem.

Numerical Results
. This section presents several restoration results obtained using our approach in the case of denoising.We then compare such results to other approaches and in particular to NL-Bayes [37] and to the gradient histogram matching [11].We also study the ability of our approach to preserve the gradient and the color distributions, in contrast with other methods.More results are available in [62].
To be efficient on complex scenes, our approach would require a former segmentation step such as the one presented in [11]: this way, we would be able to estimate a different GGD model for each region of the image, and thus have a texture term which is adaptive with respect to each region of the image.
We first present the result of our algorithm on several uniform textures.In the following experiments, the fidelity constant of (4.1) is set to α = 0.7: this value yields a visually good trade-off between denoising and over-smoothing in the standard TV case.The optimization algorithm is stopped after 200 iterations, which we found sufficient for the considered problem.When not specified, the size of the images is P = 128 × 128 pixels and their values are in [0, 1].
In the set of experiments given in Fig. 4.1, we present from left to right: the original noise-free image; the observed image, contaminated by a white Gaussian noise of standard deviation 15% of the image amplitude, i.e. σ = 0.15 for images whose dynamic is [0, 1]; the result of the TV denoising algorithm [57], which is a special case of our algorithm with λ = 1; the result of our algorithm with only the texture fidelity term (without TV), which corresponds to the case λ = 0; the result of our algorithm with a trade-off 0 < λ < 1 which we called "hybrid" method, and in particular with λ = 1‰ = 1/1000 which is experimentally well suited on our set of images.
The TV approach (3rd column) removes most of the textures, resulting in smooth images, sometimes with a cartoon aspect.On the contrary, the approach corresponding to λ = 0 (4th column) does not destroy any textural content at the cost of some residual noise.It is interesting to note that in this experiment the texture term (Wasserstein statistical loss) is used without additional regularization term.The hybrid approach (5th column) make use of a TV regularization term to eliminate more noise.However, the pure Wasserstein loss (λ = 0) approach and the hybrid one produce very similar results on several images.
A few examples of inpainting are shown on Fig. 4.2: in addition to a white Gaussian noise of standard deviation σ = 0.1, 40% of the pixels has been lost (i.e. the degradation operator Φ is a non-uniform down-sampling).We do not focus more on this case here, instead we refer to the former results presented in [62].
Influence of the trade-off parameter.Figure 4.3 further illustrates the effect of the parameter λ on the denoised image.The parameter varies from λ = 1 (TV algorithm) to λ = 0 (Wasserstein term only, no TV term).The visually optimal result is usually obtained as a trade-off between these two terms, since the TV term over-smooths the image and destroys textures, while the GGD term only does not always provide a strong enough regularization.As stressed in the previous experiment, the TV is not always necessary and the Wasserstein term alone may perform well, especially in the case of textures with fine details.
Comparison with state of the art approaches.We now compare our results to other denoising approaches: first the state-of-the-art denoising algorithm NL-Bayes [37] and the inverse problem algorithm PLE [72]; then we display some comparisons with the results from [11] which introduced the idea of preserving the gradient distribution to restore images.
Patch-based approaches.Figure 4.4 compares our results to both NL-Bayes [37] and PLE [72].Online demos [36,68] are available for these two algorithms.They both produce a final image that is highly denoised.Our approach does not remove as much noise, but it preserves more details and produces a sharper image.
A reason for this behavior is that NL-Bayes and PLE use the self-similarity of the image (i.e. its intrinsic redundancy): they assume that the patches of the images provide a redundant information.They perform denoising by enforcing the intrapatch similarities while removing the singular elements of the patches.Although these methods produce state-of-the-art results, some details are lost as a result of an increase of the self-similarity of the image.Our texture fidelity term enforces the gradient distribution to follow the estimated law: it prevents the image to be smoother than expected by preserving the amount of granularity of the image.
Comparison with the approach of [11].We compare on Fig. 4.5 our results to the ones from [11].The authors of the latter paper propose to enforce the gradient distribution using an approach that is different from ours, based on a Kullback-Leibler divergence or alternatively on an accumulation of deviation terms (see [11] for more details).They argue that these terms succeed in imposing the distribution, however the choice of these terms and their effect are purely empirical.In comparison, the Wasserstein distance is a robust mathematical tool permitting us to set the gradient distribution without requiring additional cost functions.Since no code is available for the approach from [11], the images shown on Fig. 4.5 are directly extracted from the PDF version of [11], hence they may suffer from JPG compression artifacts.Estimator [72], denoising using the NL-Bayes algorithm [37], denoising using our approach (with λ = 1/1000).The average PSNR are 23.5 dB for PLE, 23.8 dB for NL-Bayes, and 22.7 for us.Although the PSNR is lower, our approach is better at preserving the texture and details. whose gradient histograms have too thin tails.Our hybrid algorithm respects the heavy tails of the distribution to preserve details and textures as much as possible, while producing a small peak around zero as the result of the TV regularization.

Conclusion.
In this paper, we have introduced a generic framework to impose statistical constraints in variational problems routinely encountered in imaging sciences.We show in particular applications to image synthesis, denoising and superresolution.We believe that a key aspect and contribution of our work is the use of a Wasserstein loss between discrete probability distributions, which is both simple to use and robust.It is of course possible to consider more advanced image processing problems, and we refer for instance to [54] where the same framework is used to perform texture mixing by minimizing a sum of Wasserstein losses learned from several input exemplars.
The proposed approach is generic and could be associated with other regularization terms for image restoration, for instance the non-local variational approaches from [23,30].It can also act as a post-processing step, possibly in combination with approaches such as NL-Bayes [37].Last, as mentioned above, the approach can be generalized to other restoration tasks than denoising.While the extension to interpolation or super-resolution problems is relatively easy to develop, other tasks such as image deblurring may face some difficult estimation problems.

(3. 3 )Fig. 3 . 1 :
Fig. 3.1: Display of log(E(u ( ) |(ν z k ) k )) as a function of for the three tested algorithms: (i) HB algorithm [34] (in blue, with a display of the half-iterations ũ( ) in black); (ii) gradient descent algorithm (in green); L-BFGS algorithm (in red).Each algorithm is represented with two curves, showing the minimum and maximum energy values obtained among 100 realization of the initialization u (0) .

4 .
Gradient Statistics for Image Restoration.This section applies the Wasserstein loss defined in Sect. 2 to the problem of image restoration.As explained

Figure 4 15 Fig. 4 . 4 :
Fig.4.4:Comparison to state-of-the-art denoising approaches on color images.From left to right: original image, noisy observation, denoising using the Piecewise Linear Estimator[72], denoising using the NL-Bayes algorithm[37], denoising using our approach (with λ = 1/1000).The average PSNR are 23.5 dB for PLE, 23.8 dB for NL-Bayes, and 22.7 for us.Although the PSNR is lower, our approach is better at preserving the texture and details.