Approximating the total variation with finite differences or finite elements

We present and compare various types of discretizations which have been proposed to approximate the total variation (mostly, of a grey-level image in two dimensions). We discuss the properties of finite-difference and finite-element based approaches and compare their merits, in particular in terms of error estimates and quality of the reconstruction.


The total variation
In mathematical and computational imaging, images are represented by (possibly multi-channel) intensity functions, and typical inverse problems aim at recovering such functions from measurements. Such problems are usually ill-posed and require some a priori knowledge of the class of signals one wishes to recover; this is usually expressed by means of a "regularizer", which measures how likely an image is. The total variation was proposed as a regularizer for imaging in the 1990s [68]. It was introduced as a substitute for more common quadratic penalizations of the gradient, which prevent large jumps in the solution and are therefore unsuited for recovering intensity functions representing images: indeed, these are discontinuous as they jump across the edges and boundaries of objects. This regularizer is not so widely used nowadays, as its performance is below that of many state-of-the-art reconstruction methods (in particular those based on deep learning or other non-local techniques); yet, being convex and relatively simple, it is still useful and popular for large scale inverse problems (typically, medical imaging applications) or in low noise regimes. We refer to [25,23] for a quick introduction to the total variation for imaging and to [32] for more numerical details and algorithms. In fact, these notes are not only focused on imaging applications, as one might need to minimize the total variation for other purposes, such as computing minimal surfaces, equilibrium configurations of mixtures with surface tension, etc. Our purpose is rather to discuss, from a numerical analysis point of view, the discretization of the total variation functional. For simplicity, we will mostly stick to the two-dimensional case, yet most of the results we state are not limited to this setting. We will particularly focus on the quality (in terms of error bounds, when available) and the isotropic or anisotropic behaviour of various possible schemes.
Error estimates are neither standard nor easy to compute for problems involving the total variation, as the usual proofs rely on regularity properties of the continuous solutions, while the minimizers of the problems we will be interested in are not regular, and not even, in general, continuous. We will try to mention as exhaustively as possible what is known in this context.

Definition, basic properties
We consider throughout this note a bounded open set Ω ⊂ ℝ^d. In most practical applications, Ω will be a planar rectangle. Classically, the total variation is defined for a function u ∈ L¹(Ω) by duality [2,44,46] as follows:

  TV(u; Ω) = sup { ∫_Ω u(x) div φ(x) dx : φ ∈ C¹_c(Ω; ℝ^d), |φ(x)| ≤ 1 ∀x ∈ Ω }.   (1)

If u is a smooth function (or, more generally, in the class W^{1,1}(Ω)), one sees after an integration by parts that this is nothing else than ∫_Ω |∇u| dx; however it is also finite if u = χ_E is the characteristic function of a smooth set E. Indeed, using Green's formula, the integral is ∫_{∂E∩Ω} φ·ν_E dH^{d−1}, where here ν_E is an inner normal to E and H^{d−1} the (d−1)-dimensional surface ("Hausdorff") measure, see for instance [2,44] for a definition. Then the supremum (which is reached when φ ≈ ν_E) yields TV(χ_E; Ω) = H^{d−1}(∂E ∩ Ω) (the perimeter of E in Ω), and indeed (1) is used, for u = χ_E, as defining a generalization of the perimeter for any measurable set E [46]. Such a set is then called a "set with finite perimeter" in Ω (or "Caccioppoli set") as soon as Per(E; Ω) := TV(χ_E; Ω) < +∞. In general, it is clear (by Riesz' theorem) that TV(u; Ω) < +∞ if and only if Du, the distributional derivative of u, is a bounded Radon measure; one then says that u has bounded variation, u ∈ BV(Ω). For u ∈ BV(Ω), it is shown (see for instance [2,44,74,45]) that the measure Du decomposes as

  Du = ∇u dx + (u⁺ − u⁻) ν_u H^{d−1}|_{J_u} + D^c u   (2)

where: ∇u is the density of the absolutely continuous part of Du with respect to the d-dimensional Lebesgue measure; J_u is the "jump set" (a (d−1)-dimensional subset), on which H^{d−1}-a.e. one can define a normal vector ν_u(x), an upper value u⁺(x) and a lower value u⁻(x) (the simplest way is by blow up, defining J_u as the set of points x where the functions y ↦ u(x + ρy), ρ > 0, converge in L¹(B₁) as ρ → 0 to y ↦ u⁺(x)χ_{{y·ν_u(x)>0}} + u⁻(x)χ_{{y·ν_u(x)<0}}). The remaining part D^c u, or Cantor part, can be seen as made of infinitely many infinitesimal jumps. One has then obviously

  TV(u; Ω) = |Du|(Ω) = ∫_Ω |∇u| dx + ∫_{J_u} |u⁺ − u⁻| dH^{d−1} + |D^c u|(Ω).   (3)

One sees from the remarks above that this functional (which, thanks to (1), is trivially convex and lower-semicontinuous with respect to the distributional convergence of u) is finite for functions which are possibly discontinuous, which is the main reason for its success in imaging.
Indeed, if represents the intensity values of an image (in 2D or 3D), using the total variation as a prior or regularizer for inverse problems will possibly promote images with discontinuities, as expected across the edges or boundaries of objects (we refer to [25,24,23] for a mathematical study of the properties of the total variation as a regularizer).
From the numerical point of view, this raises important difficulties. How easy is it to precisely discretize a discontinuity and faithfully estimate its total variation? This is the main issue we will address in this small note, where we will review a few options proposed in the literature and discuss their merits and drawbacks.

Variational problems involving the total variation
To simplify, in order to assess the quality of a discretization we will concentrate on two simple reconstruction problems: the so-called "Rudin-Osher-Fatemi" or "ROF" denoising problem and the "inpainting" problem. The first was proposed in [68] as an approach for image denoising (from the application's point of view, it is however a bit too elementary to be useful beyond the case of very low noise): assuming an observation g ∈ L^∞(Ω) is the sum of a "clean" image u(x) and an oscillatory signal n(x) (in the discrete setting, a random Gaussian noise with known variance σ²), the idea is to find u as the function with minimal total variation TV(u; Ω) among the functions satisfying ∫_Ω (u − g)² dx = σ². In practice, it is shown that the problem can be written (for an appropriate, and not obvious, choice of a parameter λ > 0, see for instance [29]) as follows:

  min_u λ TV(u; Ω) + (1/2) ∫_Ω (u − g)² dx.   (4)

Even if this problem is elementary, one good reason for focusing on it is that its solution corresponds to evaluating the "proximity operator" of the total variation at g, and can be used as a basic brick in many other minimization algorithms involving this functional. Additionally, since we will focus mostly on the discretization of the first term in this problem, our study will apply with little or no change to many other second terms, such as appear in more complex inverse problems. A typical denoising result obtained by (4) is shown in Fig. 1. We will also make use of the convex dual problem to (4), which is easily shown to be:

  min { (1/2) ∫_Ω |g + λ div z|² dx : z ∈ L^∞(Ω; ℝ^d), |z| ≤ 1 a.e., z·ν = 0 on ∂Ω },   (5)

where the condition z·ν = 0 on the boundary is understood in a weak sense, namely div(z χ_Ω) ∈ L²(ℝ^d). The solution u to (4) and a solution z to (5) are linked through the Euler-Lagrange equation:

  u = g + λ div z,  |z| ≤ 1,  z·Du = |Du|.   (6)

The last equation in (6) is in the sense of Anzellotti [3]: in case u is differentiable, it implies that z(x) = ∇u(x)/|∇u(x)| wherever ∇u(x) ≠ 0. See [27] for more precise results in dimension 2 and 3.
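Since the solution of (4) is exactly the proximity operator of λTV evaluated at g, it can be instructive to compute it in a 1D toy setting. The following sketch (our own minimal implementation, not taken from the text; NumPy, Neumann boundary conditions) solves the discrete 1D analogue by projected gradient on the dual problem (5):

```python
import numpy as np

def Dt(p):
    """Adjoint of the forward-difference operator D(u) = np.diff(u),
    i.e. a discrete negative divergence (Neumann boundary conditions)."""
    return np.concatenate(([-p[0]], -np.diff(p), [p[-1]]))

def rof_1d(g, lam, n_iter=5000):
    """Solve min_u lam*TV(u) + 0.5*||u - g||^2 (a 1D analogue of (4))
    by projected gradient on the dual: min { 0.5*||g - lam*Dt(p)||^2 :
    |p_i| <= 1 }, with u = g - lam*Dt(p) (i.e. u = g + lam*div p)."""
    p = np.zeros(len(g) - 1)
    step = 1.0 / (4.0 * lam)          # safe step: ||D||^2 <= 4
    for _ in range(n_iter):
        u = g - lam * Dt(p)
        p = np.clip(p + step * np.diff(u), -1.0, 1.0)
    return g - lam * Dt(p)

# Denoising a step: each plateau of length m loses lam/m of contrast,
# the classical "loss of contrast" of the ROF model.
m, lam = 10, 0.5
g = np.concatenate((np.zeros(m), np.ones(m)))
u = rof_1d(g, lam)
print(u[0], u[-1])   # close to lam/m = 0.05 and 1 - lam/m = 0.95
```

The computed solution is piecewise constant, with the jump shrunk but not smeared, as expected from the total variation prior; the mean of the signal is preserved exactly at every iteration.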

FIGURE 2
Original image, same image with 65% missing lines, total variation-based inpainting, elastica-based inpainting [33]

Variational "inpainting" is a bit more tricky to introduce. Formally, the problem consists in recovering missing data in an image by minimizing some functional, with the constraint that the image is known outside of some mask. Using the total variation for solving this problem is a terrible idea, as one can easily see that the best choice will consist in minimizing the length of each level line inside the reconstructed region, thanks to the celebrated "co-area" formula, which relates the total variation of a function and the perimeter of its level sets:

  TV(u; Ω) = ∫_{−∞}^{+∞} Per({u > s}; Ω) ds.

This is not very wise: see Fig. 2 for an example, where we compare TV-inpainting with a variant which also penalizes the curvature of the level sets, as proposed in [63,57,17,33]. A simplified view of the TV-inpainting problem consists in minimizing the total variation with a Dirichlet boundary condition, formally:

  min { TV(u; Ω) : u = g on ∂Ω }.

In fact, it is well known that this needs to be relaxed (as a minimizing sequence (u_n) with u_n = g on ∂Ω for all n could converge to a limit with a different trace), and should rather be written (assuming here that Ω is for instance a Lipschitz domain, in practice a rectangle or a disk, as things would be more complicated in the presence of interior boundaries):

  min_{u ∈ BV(Ω)} |Du|(Ω) + ∫_{∂Ω} |u − g| dH^{d−1}.   (7)

Here, u in the second integral is the trace of u on the boundary (see for instance [46, Chap. 2]). The energy in (7) coincides with the total variation if the trace of u is equal to g on ∂Ω; if not, it counts in addition, as part of the energy and according to the penalization of the jump part in (3), the jump from the trace u(x) to the value g(x) on the boundary.
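The co-area formula is easy to check numerically in one dimension, where the discrete total variation (the sum of absolute successive differences) satisfies it exactly; a small sketch (the test signal is arbitrary):

```python
import numpy as np

def tv_1d(u):
    """Discrete 1D total variation: sum of absolute successive differences."""
    return np.abs(np.diff(u)).sum()

def coarea_1d(u, n_levels=10000):
    """Approximate the co-area integral: integrate over the levels s the
    'perimeter' (number of jumps) of the level set {u > s}."""
    smin, smax = u.min(), u.max()
    ds = (smax - smin) / n_levels
    levels = smin + (np.arange(n_levels) + 0.5) * ds   # midpoint rule
    total = 0.0
    for s in levels:
        chi = (u > s).astype(float)
        total += np.abs(np.diff(chi)).sum() * ds
    return total

u = np.array([0.0, 2.0, 1.0, 3.0])
print(tv_1d(u), coarea_1d(u))   # both approximately 5.0
```

Here the level sets {u > s} change only at s = 1 and s = 2, and integrating their number of jumps over s recovers the total variation.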
The dual formulation is a bit different from the definition in (1), as it is natural in that case to consider dual fields which do not vanish near the boundary ∂Ω: formally, one sees that

  ∫_Ω u div z dx = −∫_Ω z·Du + ∫_{∂Ω} u z·ν_Ω dH^{d−1}

(where ν_Ω is the outer normal to ∂Ω), and one can show that the energy in (7) can be defined as

  sup { ∫_Ω u div z dx − ∫_{∂Ω} g z·ν_Ω dH^{d−1} : z ∈ C¹(Ω̄; ℝ^d), |z| ≤ 1 }.   (8)

In order to assess the quality of a discretization, we will particularly be interested in (7) whenever Ω ⊂ ℝ² is a square and g = χ_{{x·ν ≥ t}} for some direction ν ∈ S¹ and some t ∈ ℝ. In this case, it is easily seen that the (unique) solution is u = g (that is, a straight line is the shortest path between two points...). The current note aims at reviewing a few of the discretization schemes which have been proposed in the literature to numerically solve problems such as (4) and (7). We will consider scalar (grey-level valued) 2-dimensional images. Most of what we will discuss can be extended without pain to vectorial (multispectral or color) images, although then the metric which is used to measure distances in the data space starts to be highly relevant, hence also the norm used for measuring the gradient tensor in (4), see for instance [47,42]. Also, most of what we will describe extends trivially to three or higher dimensional problems. What we will not address here, or only minimally, on the other hand, are
• the algorithmic issues, except to discuss the relative complexity of various discretizations from the optimization point of view (see for instance [32] for an overview);
• graph based approaches, which in general can be very efficiently optimized by combinatorial optimization techniques [52,26], yet approximate, in the continuous limit, "crystalline" total variations (whose anisotropy is polyhedral rather than Euclidean).
From the numerical analysis point of view, these approaches yield much better error estimates (see in particular [11] for a study); however, the behaviour of the limiting functional can be very different from what is expected with the isotropic total variation (1) (in particular, the set of solutions for a crystalline version of (7) can be very large, contrary to the isotropic case where generic uniqueness holds). The use of such approaches for approximating isotropic perimeters has been studied by many authors, starting from [14,15], and comparisons with "continuous" approaches are found for instance in [53,56].

Standard discretization
In what follows, we consider to simplify a rectangular or square domain Ω and address the issue of discretizing problems such as (4). In most of the inverse problems and imaging literature, the isotropic total variation (1) is simply discretized as

  TV_h(u) = h Σ_{i,j} √( (u_{i+1,j} − u_{i,j})² + (u_{i,j+1} − u_{i,j})² ).   (9)

Here h > 0 represents the size of the pixels, and u = (u_{i,j})_{i=1,…,M; j=1,…,N} is a matrix representing the values of a (rectangular) image. We denote P_{i,j} the square "pixel" of size h² centered at (ih − h/2, jh − h/2), for i, j integer indices. The standard convention in (9) on the boundary is to replace with zero the differences which involve indices which are out of range (e.g., u_{M+1,j} − u_{M,j} = 0 for all j, etc.); however, for addressing the Dirichlet case (8) one should rather replace the out-of-range values of u with the discretization of the boundary condition g.
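In code, (9) reads as follows (a minimal NumPy sketch with the zero boundary convention just described; for an axis-aligned discontinuity it returns the exact interface length):

```python
import numpy as np

def tv_h(u, h=1.0):
    """Discretization (9): h * sum_ij sqrt(dx^2 + dy^2), with forward
    differences; out-of-range differences are set to zero."""
    dx = np.zeros_like(u); dx[:-1, :] = u[1:, :] - u[:-1, :]
    dy = np.zeros_like(u); dy[:, :-1] = u[:, 1:] - u[:, :-1]
    return h * np.sqrt(dx**2 + dy**2).sum()

# Characteristic function of a half-plane with vertical interface, h = 1:
u = np.zeros((10, 10)); u[5:, :] = 1.0
print(tv_h(u))   # 10.0, the exact length of the discontinuity
```

The scaling by h is consistent with the continuous functional: each pixel contributes |∇u| ≈ (difference)/h times the pixel area h².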
The convergence of TV_h to (1) is in fact quite weak and needs to be understood in a variational sense (that is, for minimization problems), and more precisely in the sense of "Γ-convergence":

Proposition 2.1. Let h = 1/N, Ω = (0,1)², and consider, for u ∈ L¹(Ω), the functional

  F_h(u) = TV_h(u), given by (9), if u is constant on each pixel P_{i,j},   (10)

and +∞ else. Then, as h → 0, F_h Γ-converges to:

  F(u) = TV(u; Ω) if u ∈ BV(Ω), +∞ else.   (11)

The proof of this result is easy and relies in part on a duality argument, as developed in [36, Appendix] and [19] for variants. We refer to [38,16] for the definitions and main properties of Γ-convergence. It means in practice that minimizers of discrete energies such as

  min_u λ TV_h(u) + (h²/2) Σ_{i,j} (u_{i,j} − g_{h,i,j})²,   (12)

where g_h = (g_{h,i,j}) is any consistent discretization of the function g ∈ L^∞(Ω), such as

  g_{h,i,j} = (1/h²) ∫_{P_{i,j}} g(x) dx,   (13)

will converge to the minimizer of (4) as h → 0. This numerical analysis result is a bit weak, as it does not really specify how fast the discretization converges to the continuous limit. It turns out that one cannot expect a very fast convergence, as typical solutions to (4) have discontinuities [22]. In fact, if u ∈ BV(Ω) and one considers its piecewise constant projection

  u_h = Σ_{i,j} ( (1/h²) ∫_{P_{i,j}} u dx ) χ_{P_{i,j}},

one can show an estimate of the form ‖u − u_h‖_{L¹(Ω)} ≤ C h |Du|(Ω), where C depends on the dimension. If in addition u is bounded (as the solution to (4) is, since one easily shows that ‖u‖_∞ ≤ ‖g‖_∞), one deduces ‖u − u_h‖²_{L²(Ω)} ≤ 2‖u‖_∞ ‖u − u_h‖_{L¹(Ω)}, that is, an error ‖u − u_h‖_{L²} of order √h. For u with discontinuities, one cannot hope for a better result: indeed, if for instance u = χ_{{x·ν ≥ t}}, a simple computation shows that ‖u − u_h‖_{L²} ≈ √h, unless ν = (1,0) or (0,1), in which case the error may be reduced to 0 (if the discontinuity falls precisely between two pixels), or be of lower order.
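The √h behaviour of the projection error can be observed numerically (a sketch; the angle, subsampling factor and grid sizes are arbitrary choices of ours):

```python
import numpy as np

def l2_proj_error(n, angle=0.3):
    """L2 distance between the indicator of a slanted half-plane and its
    piecewise constant pixel averages on an n x n grid over (0,1)^2; the
    exact averages are approximated by subsampling each pixel."""
    s = 8                              # subsamples per pixel, per direction
    N = n * s
    x = (np.arange(N) + 0.5) / N
    X, Y = np.meshgrid(x, x, indexing="ij")
    chi = (np.cos(angle) * X + np.sin(angle) * Y >= 0.5).astype(float)
    avg = chi.reshape(n, s, n, s).mean(axis=(1, 3))    # pixel averages
    diff = chi - np.kron(avg, np.ones((s, s)))
    return np.sqrt((diff**2).mean())

for n in (16, 32, 64, 128):
    print(n, l2_proj_error(n))   # decreases roughly like 1/sqrt(n) = sqrt(h)
```

Only the O(1/h) pixels cut by the discontinuity contribute, each by O(h²), which is exactly the heuristic behind the √h rate.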
Error bounds for (12) are significantly larger than this theoretical limit. The most precise study is found in [71,54,55]. The main result of Wang and Lucier [71, Thm 4.2] shows that if u_h is the minimizer of (12) and u the minimizer of (4), then

  ‖u_h − u‖²_{L²(Ω)} ≤ C ω(h^α)   (14)

(here u_h is identified with the piecewise constant function with value u_{h,i,j} on the pixel P_{i,j}), where α > 0 and ω is a measure of the smoothness of the function g. In particular, if g ∈ BV(Ω) ∩ L^∞(Ω), then g ∈ Lip(1/2, L²(Ω)) and the rate of convergence in (14) is O(h^{1/6}). The proof in [71] relies on a refined implementation of a relatively standard method for the numerical analysis of nonsmooth problems: the solution is smoothed at a certain scale and then discretized; thanks to the smoothness, precise error rates can be derived for the discretization of the energy. Then, one selects the optimal smoothing scale, which depends on h and g, to obtain the bound (14). The same proof extends to many similar discretizations of the total variation; in particular, in [71], the result also applies to symmetrized variants of (9), as well as to an "upwind" variant introduced in [28], see Section 2.2 for details.

FIGURE 4
Original image g: the characteristic function of a disk. Solution of (12). Details showing the anisotropic behaviour: the direction in which (9) best approximates the total variation of the discontinuity is better resolved (right).
On the other hand, it is known that (9) suffers from important anisotropy issues and is not very precise. This is illustrated in Fig. 3, where a discrete approximation of the inpainting problem (7) is solved with this discretization for several directions of the discontinuity: in some of them, the output is very blurry. This is due to the fact that a sharper result would have a very high discrete energy (and is related to the fact that the convergence of TV_h to TV is not pointwise). Problem (4) is also affected by this issue. In Fig. 4, g is simply the characteristic function of a disk, and so is the solution, up to a slight loss of contrast (cf. for instance [58]). However, one sees that the boundary of the disk is sharper in some directions than in others. The anisotropy is even more striking if g is the characteristic function of a square, see Fig. 5, and leads to quite bad results.
A precise study of this issue is presented in [20]. The following experiment is studied: one considers a domain Ω which is a periodic strip, on which the datum g is the characteristic function of a half-plane whose boundary is either oblique or aligned with the grid axes. The idea is that in both cases, by translational invariance, both problems (4) and (12) boil down to a 1D total variation regularization problem which is easy to analyse. It is then shown that while in the second (axis-aligned) case the recovery is perfect (the unique solution u_h is precisely g_h), in the first the difference between the discrete and continuous energies is of order h^{2/3}; and while it is not yet proven that ‖u_h − u‖_{L²} ≈ h^{1/3} as h → 0, an almost optimal signal which achieves this rate is built (see [20, Sec. 3]).
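The direction dependence of (9) is easy to reproduce directly on sharp discontinuities (a sketch of ours; the values in the comments ignore boundary effects of order h):

```python
import numpy as np

def tv_h(u, h):
    """Discretization (9), out-of-range differences set to zero."""
    dx = np.zeros_like(u); dx[:-1, :] = u[1:, :] - u[:-1, :]
    dy = np.zeros_like(u); dy[:, :-1] = u[:, 1:] - u[:, :-1]
    return h * np.sqrt(dx**2 + dy**2).sum()

n = 256; h = 1.0 / n
i, j = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
vert  = (i >= n // 2).astype(float)    # vertical interface, length 1
diag1 = (i + j >= n).astype(float)     # anti-diagonal interface, length sqrt(2)
diag2 = (i >= j).astype(float)         # main diagonal interface, length sqrt(2)
print(tv_h(vert, h))    # 1.0: exact
print(tv_h(diag1, h))   # ~ sqrt(2): exact for this orientation
print(tv_h(diag2, h))   # ~ 2: overestimates sqrt(2) by ~41%
```

The one-sided differences in (9) favour one diagonal orientation over the other, which is precisely the anisotropy discussed above.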

FIGURE 5
Original image g: the characteristic function of a square. Here the anisotropic character of (9) is striking (right: equalized version of the solution of (12) shown in the middle).
While the anisotropy issue is easily corrected, as we will see in the next Section 2.2, it is not clear that this error rate can be improved by considering simple variants of this discrete energy. Actually, the best reason for using the discretization (9) is that it is simple and easy to implement. In particular, Problem (4) can be easily solved by many standard proximal algorithms for nonsmooth convex optimization, including with acceleration, such as the accelerated primal-dual method proposed by Zhu and Chan [73] (see also [43]) and its variant (with convergence guarantees and rates) [30, Alg. 2], or an accelerated proximal gradient descent on the dual problem, in particular the "FISTA" algorithm [13], see also [60,62] or the variant in [61]. A slight smoothing of the total variation (replacing TV_h with a "Huber" total variation, quadratic where the differences are nearly zero) makes the problem even more regular, and the above algorithms then converge at a very efficient linear rate (cf. [60, (2.2.19) and Thm 2.2.3]); see [32] for a general description.
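As an illustration of such schemes, here is a minimal (non-accelerated) primal-dual iteration for the discrete problem (12), in the spirit of the methods cited above (a sketch with our own step sizes; the factor h is absorbed into λ):

```python
import numpy as np

def grad(u):
    gx = np.zeros_like(u); gx[:-1, :] = u[1:, :] - u[:-1, :]
    gy = np.zeros_like(u); gy[:, :-1] = u[:, 1:] - u[:, :-1]
    return gx, gy

def div(px, py):
    """Discrete divergence, minus the adjoint of grad."""
    dx = np.zeros_like(px)
    dx[0, :] = px[0, :]; dx[1:-1, :] = px[1:-1, :] - px[:-2, :]
    dx[-1, :] = -px[-2, :]
    dy = np.zeros_like(py)
    dy[:, 0] = py[:, 0]; dy[:, 1:-1] = py[:, 1:-1] - py[:, :-2]
    dy[:, -1] = -py[:, -2]
    return dx + dy

def rof_pdhg(g, lam, n_iter=500):
    """Primal-dual iterations for min_u lam*TV_h(u) + 0.5*||u - g||^2,
    written as min_u max_{|p| <= lam} <grad u, p> + 0.5*||u - g||^2."""
    tau = sigma = 0.3          # tau*sigma*||grad||^2 <= 0.72 < 1
    u = g.copy(); ub = u.copy()
    px = np.zeros_like(g); py = np.zeros_like(g)
    for _ in range(n_iter):
        gx, gy = grad(ub)
        px, py = px + sigma * gx, py + sigma * gy
        nrm = np.maximum(1.0, np.sqrt(px**2 + py**2) / lam)  # project |p|<=lam
        px, py = px / nrm, py / nrm
        u_old = u
        u = (u + tau * (div(px, py) + g)) / (1.0 + tau)      # prox of data term
        ub = 2 * u - u_old
    return u

def energy(u, g, lam):
    gx, gy = grad(u)
    return lam * np.sqrt(gx**2 + gy**2).sum() + 0.5 * ((u - g)**2).sum()

rng = np.random.default_rng(0)
g = np.zeros((32, 32)); g[8:24, 8:24] = 1.0
g += 0.1 * rng.standard_normal(g.shape)
u = rof_pdhg(g, lam=0.3)
print(energy(u, g, 0.3) < energy(g, g, 0.3))   # True: the energy decreased
```

The mean of the image is preserved exactly by each iteration (div sums to zero), as it should be for the continuous problem with Neumann boundary conditions.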

Improving the standard discretization
It is not hard to reduce the anisotropic character of discretization (9) by relatively simple techniques, usually at the expense of (slightly) increasing the complexity of the discretization. We mention in this section a few straightforward techniques. All produce energies which are invariant under a 90° rotation, however without clearly improving the quality of the approximation in terms of precision (all the variants in this section can be analysed by the methods of [71] and enjoy similar approximation properties).
The first variant, introduced and studied in [71,54], consists simply in averaging the four possible rotations of (9), defining TV_h^avg(u) as:

  TV_h^avg(u) = (h/4) Σ_{i,j} [ √((u_{i+1,j}−u_{i,j})² + (u_{i,j+1}−u_{i,j})²) + √((u_{i+1,j}−u_{i,j})² + (u_{i,j}−u_{i,j−1})²)
                + √((u_{i,j}−u_{i−1,j})² + (u_{i,j+1}−u_{i,j})²) + √((u_{i,j}−u_{i−1,j})² + (u_{i,j}−u_{i,j−1})²) ].   (15)

As before, the differences involving indices out of range are set to zero. Note that this cannot solve the issue of the precision and of the smoothing of discontinuities, since in this case all slanted sharp discontinuities are overestimated by the energy. This is visible in both Figures 6 and 7 (left).
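A sketch of the averaged scheme (15); the overestimation of slanted discontinuities is already visible on a 45° interface, for which a direct computation (ours, not from the text) gives (2+√2)/2 ≈ 1.707 instead of the exact length √2:

```python
import numpy as np

def tv_avg(u, h):
    """Variant (15): average of the four rotations of the one-sided scheme;
    out-of-range differences are set to zero."""
    fx = np.zeros_like(u); fx[:-1, :] = u[1:, :] - u[:-1, :]   # forward in i
    bx = np.zeros_like(u); bx[1:, :]  = u[1:, :] - u[:-1, :]   # backward in i
    fy = np.zeros_like(u); fy[:, :-1] = u[:, 1:] - u[:, :-1]   # forward in j
    by = np.zeros_like(u); by[:, 1:]  = u[:, 1:] - u[:, :-1]   # backward in j
    return (h / 4) * (np.sqrt(fx**2 + fy**2) + np.sqrt(fx**2 + by**2)
                    + np.sqrt(bx**2 + fy**2) + np.sqrt(bx**2 + by**2)).sum()

n = 256; h = 1.0 / n
i, j = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
d1 = tv_avg((i + j >= n).astype(float), h)   # anti-diagonal interface
d2 = tv_avg((i >= j).astype(float), h)       # main diagonal interface
print(d1, d2)   # both ~ (2+sqrt(2))/2 ~ 1.707 > sqrt(2)
```

Both diagonal orientations now get the same (over)estimated energy, which illustrates the restored 90° invariance, and at the same time the remaining lack of pointwise convergence.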
The "upwind" variant of [28] is defined by

  TV_h^up(u) = h Σ_{i,j} √( (u_{i,j}−u_{i+1,j})₊² + (u_{i,j}−u_{i−1,j})₊² + (u_{i,j}−u_{i,j+1})₊² + (u_{i,j}−u_{i,j−1})₊² ).   (16)

Here, the notation t₊ stands for the positive part max{t, 0}. It is experimentally demonstrated in [28] that this greatly improves upon the anisotropy issues raised by (9). The error analysis of [71] is also carried out for this variant, with the same result.
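A sketch of the upwind scheme (16); one can check that it recovers exactly the length of ±45° discontinuities, consistently with the pointwise convergence properties discussed in the text:

```python
import numpy as np

def tv_upwind(u, h):
    """Upwind variant (16): only the differences 'flowing down' from each
    pixel count; edge padding makes out-of-range differences vanish."""
    up = np.pad(u, 1, mode="edge")
    c = up[1:-1, 1:-1]
    return h * np.sqrt(
        np.maximum(c - up[2:, 1:-1], 0)**2 + np.maximum(c - up[:-2, 1:-1], 0)**2
      + np.maximum(c - up[1:-1, 2:], 0)**2 + np.maximum(c - up[1:-1, :-2], 0)**2
    ).sum()

n = 256; h = 1.0 / n
i, j = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
d1 = tv_upwind((i + j >= n).astype(float), h)   # anti-diagonal interface
d2 = tv_upwind((i >= j).astype(float), h)       # main diagonal interface
print(d1, d2)   # both ~ sqrt(2): exact on +-45 degree interfaces
```

Note that only the pixels on the upper side of the jump contribute, which is also why TV_h^up(−u) ≠ TV_h^up(u) in general.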
In addition, one can see that convergence to TV does hold pointwise for the characteristic function of a half-plane bounded by a horizontal, vertical or ±45° diagonal line, so that one could expect a better behaviour in terms of precision (no smoothing is required to approximate such a discontinuity, and probably less smoothing than with (9) in many other directions); the analysis in [71] might thus be suboptimal for this approach, yet a more precise analysis is not available. A strange feature of (16), though, is that in general TV_h^up(−u) ≠ TV_h^up(u). In Figure 6 we see that using this latter version effectively improves the inpainting result of Fig. 3 in the directions where (9) fails, while using the previous variant (15) gives a very smooth reconstruction in all directions. We mention a last discretization also enjoying this isotropy property, introduced in [31,36], in which the differences are grouped on the 2 × 2 squares of the grid. As before, it does not converge pointwise for oblique discontinuities, and it is unclear whether one could hope for better error estimates than those obtained in [71]. The only result proved for sure is again its Γ-convergence to the total variation (1), see [36, Appendix]. This approach was introduced for optimization purposes: indeed, dualizing (12) with this discretization yields independent problems on the "even" and "odd" squares (indexed by pairs of indices i, j both even or both odd), which can be solved in parallel, and an accelerated alternating descent method yields extremely fast results on parallel hardware (compare with Figure 4: in particular, the last two approaches yield sharper results and are to be preferred).

The "Shannon" total variation [1]
While the (more) isotropic variants of (9) rely on heuristic arguments to make the approximation more isotropic, a few authors have tried to address this problem with a somewhat deeper analysis. An interesting attempt is found in [1] (see also [59]), introduced as a "Shannon" total variation. The idea is to view an image as a low frequency representation of a continuous signal and to estimate the total variation of the latter. This needs to be approximated: a possibility is to over-sample an n × n image (u_{i,j}) as an sn × sn image by "zero-padding" the discrete Fourier coefficients (that is, considering the coefficients û_{k,l}, −n/2 ≤ k, l ≤ n/2, as the low frequencies of the larger signal, the other frequencies being set to 0). A striking property of such a definition is that images with low total variation can be interpolated via zero-padding without creating further artefacts (this is clearly the main reason to do so, see [1] for many examples). On the other hand (and, in some sense, as a consequence), sharp discontinuities cannot be represented by sharply discontinuous discrete functions (whose interpolation would necessarily be altered by an important "ringing" phenomenon).
Then, one can compute the (true) gradient of the interpolated signal by differentiating under the summation sign in the interpolation formula (18), and, given an oversampling factor s ≥ 2, one evaluates this gradient at the points of the s-times finer grid and sums its norm, which defines the "Shannon" total variation of u. The paper [1] provides formulas for efficiently computing this discrete gradient and its adjoint (see [1, Prop. 8]) by means of two FFTs, and these can be used to implement standard proximal algorithms for solving discretizations of problems such as (4) or (7).
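The construction can be sketched as follows (our own minimal implementation and normalization, restricted to odd n to avoid the ambiguous Nyquist frequency; [1] provides efficient and more careful formulas):

```python
import numpy as np

def shannon_tv(u, s=3):
    """Sketch of the 'Shannon' total variation: u (n x n, n odd) is seen as
    the low frequencies of a band-limited signal on (0,1)^2; its exact
    gradient is computed in Fourier and |grad| is summed over an s-times
    oversampled grid (zero-padding of the spectrum)."""
    n = u.shape[0]
    assert n % 2 == 1, "odd sizes avoid the ambiguous Nyquist frequency"
    U = np.fft.fft2(u)
    k = np.fft.fftfreq(n) * n                    # integer frequencies
    KX, KY = np.meshgrid(k, k, indexing="ij")
    GX, GY = 2j * np.pi * KX * U, 2j * np.pi * KY * U  # spectra of d/dx, d/dy

    def upsample(F):
        big = np.zeros((s * n, s * n), dtype=complex)
        half = n // 2
        idx = np.r_[0:half + 1, s * n - half:s * n]    # low-frequency slots
        big[np.ix_(idx, idx)] = F
        return np.fft.ifft2(big).real * s * s          # trig. interpolation

    gx, gy = upsample(GX), upsample(GY)
    return np.sqrt(gx**2 + gy**2).sum() / (s * n) ** 2  # Riemann sum of |grad|

# Band-limited example u(x, y) = sin(2*pi*x): total variation = 4
n = 31
x = np.arange(n) / n
u = np.sin(2 * np.pi * x)[:, None] * np.ones(n)[None, :]
print(shannon_tv(u))   # close to 4.0
```

On band-limited data such as this sine, the estimate matches the continuous total variation; on a sharp indicator it would instead report the (larger) total variation of the ringing interpolate.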
In Fig. 13e we show a denoising result obtained with this approach (for s = 3). While the result is nice and does not seem to favor any direction, it is a bit less sharp than with other discretizations of the total variation (cf. Figs. 13-14). We refer to [1] for more applications (in particular to zooming, which is, by construction, where this approach is superior to many others). On the other hand, this approach is clearly not adapted to inpainting very sharp edges, as such objects can precisely not be "Shannon"-interpolated without a strong ringing effect, and experiments will always produce smooth edges, in all directions.

FINITE ELEMENTS BASED APPROXIMATIONS
In order to introduce discretizations with better approximation properties and smaller error bounds, it is tempting to rely on the machinery of finite elements. We now discuss this point of view and show that it can actually be fruitful, provided one does not stick to the most straightforward approximation spaces.

P1 discretizations
The most natural approaches rely on a conforming "P1" representation: in this case, the domain is discretized with a simplicial mesh (triangles in 2D) and the functions are assumed to be piecewise affine, continuous, and defined by their values at the vertices of the triangulation. The natural basis for this finite element space is of course made of the functions which are 1 at one node and zero at all other nodes of the mesh. To simplify, we assume the dimension is 2 (although this does not really matter in this section). Assuming T_h = {T_1, …, T_K} is a triangulation of Ω (which for simplicity we assume is a polygon, typically a square or rectangle) with triangles of maximal edge length ≤ h, we denote P1(T_h) the space of such P1 functions. This approximation is "conforming", and the total variation at scale h is simply given by the total variation itself:

  TV_h(u_h) = Σ_{T∈T_h} |T| |∇u_h|_T| = ∫_Ω |∇u_h| dx,   (19)

where ∇u_h|_T denotes the constant gradient of u_h on the triangle T. This representation enjoys slightly better error estimates than the finite-difference based approximations, as shown in particular in [10] (see also [4]). We consider as before the minimization of the energy:

  min λ TV_h(u_h) + (1/2) ∫_Ω (u_h − g)² dx   (20)

for u_h ∈ P1(T_h), with TV_h given by (19). Then, we have the following result:

Proposition 3.1 ([10]). Let u_h be the minimizer of (20) and, as before, u the minimizer of (4), and assume Ω is star-shaped. Then one has:

  ‖u_h − u‖_{L²(Ω)} ≤ C h^{1/4}.

Here the assumption that Ω is star-shaped is used to show that an appropriate smoothing u_ε of u at scale ε > 0 will satisfy |Du_ε|(Ω) ≤ |Du|(Ω) + Cε. The statement follows by comparing the energy of u_h with the energy of the projection of u_ε on P1(T_h), for the choice ε = h^{1/2}. It is also observed that if TV_h were replaced with a "TV-diminishing" approximation of the total variation (i.e., with TV_h(Π_h u) ≤ TV(u; Ω) for Π_h u the projection of u onto P1(T_h)), then the estimate could be improved to h^{1/2}; such an approximation is however not known in this context, see [11] for a diminishing interpolation for the anisotropic ℓ¹-based total variation.
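On a structured mesh where every grid square is split into two triangles, the energy (19) is computed directly from the vertex values; a sketch (the exactness on affine functions is a simple sanity check, not a statement from the text):

```python
import numpy as np

def p1_tv(z, n):
    """Energy (19) on the structured triangulation of (0,1)^2: each of the
    n*n grid squares is split along the same diagonal into two triangles;
    z has shape (n+1, n+1) and holds the vertex values."""
    h = 1.0 / n
    # constant gradients on the "lower" and "upper" triangles of each square
    lx = (z[1:, :-1] - z[:-1, :-1]) / h
    ly = (z[:-1, 1:] - z[:-1, :-1]) / h
    ux = (z[1:, 1:] - z[:-1, 1:]) / h
    uy = (z[1:, 1:] - z[1:, :-1]) / h
    area = h * h / 2
    return area * (np.sqrt(lx**2 + ly**2).sum() + np.sqrt(ux**2 + uy**2).sum())

# Affine function u = 2x + 3y: TV over (0,1)^2 is sqrt(13), recovered exactly
n = 16
v = np.arange(n + 1) / n
X, Y = np.meshgrid(v, v, indexing="ij")
print(p1_tv(2 * X + 3 * Y, n))   # sqrt(13) ~ 3.6056
```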
Numerical methods for solving the discrete problems have been studied and compared in particular in [8,9,69] and strategies for mesh refinement based on a posteriori error estimates are found in [5,9].
One must observe that the P1 discretization suffers from more or less the same drawbacks as the finite difference based approaches mentioned before. The analysis is very simple: consider a finite-perimeter set E ⊂ Ω ⊂ ℝ², that is, a measurable set with |Dχ_E|(Ω) < ∞ (to simplify, we even consider a smooth set). 5. That is, we assume Ω̄ ⊂ ∪_{k=1}^K T̄_k, that the closed triangles are non-overlapping, and that they share only common edges and/or common vertices.
Consider also triangulations T_h with maximal diameter h > 0, and the basic projection u_h of u = χ_E onto P1(T_h) consisting in letting u_h(v) = u(v) for any vertex v, assuming that no vertex is on ∂E (which, up to an infinitesimal translation, can always be assumed). Then, given a triangle T ∈ T_h crossing the boundary ∂E, one will have two vertices with value 0 and one with value 1, or the opposite. It follows that the gradient ∇u_h|_T will be a vector orthogonal to the edge of T joining the two vertices with equal values, of magnitude the inverse of the corresponding height. In particular, |T||∇u_h|_T| will be precisely the length of the mid-segment parallel to this edge which joins the midpoints of the two other edges, and in the end the total variation ∫_Ω |∇u_h| dx will be precisely the perimeter of the polygonal set Ê bounded by these segments. In general (and in the limit h → 0), the polygonal line ∂Ê is much longer than ∂E, so that one cannot expect a pointwise convergence of TV_h(u_h) to TV(χ_E; Ω). This means again that, as with finite differences, a characteristic function (and in general any discontinuity) needs to be smoothed in order to have its energy well approximated at the discrete level. This is visible in the example of Fig. 9, which shows the solution u_h of (20) with g the characteristic function of the unit disk, computed using FreeFem++ [48].
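This loss of pointwise convergence can be observed numerically: sampling the characteristic function of a disk at the vertices of finer and finer structured meshes, the discrete energy (19) stays bounded away from the true perimeter (a sketch; the radius and grid sizes are arbitrary choices of ours):

```python
import numpy as np

def p1_tv(z, n):
    """TV (19) on the structured triangulation of (0,1)^2 (two triangles per
    grid square); z holds the (n+1) x (n+1) vertex values."""
    h = 1.0 / n
    lx = (z[1:, :-1] - z[:-1, :-1]) / h; ly = (z[:-1, 1:] - z[:-1, :-1]) / h
    ux = (z[1:, 1:] - z[:-1, 1:]) / h;   uy = (z[1:, 1:] - z[1:, :-1]) / h
    return h * h / 2 * (np.sqrt(lx**2 + ly**2).sum()
                        + np.sqrt(ux**2 + uy**2).sum())

# Vertex sampling of the characteristic function of a disk of radius 0.3:
# the discrete TV does NOT converge to the perimeter 2*pi*0.3 ~ 1.885
for n in (64, 128, 256):
    v = np.arange(n + 1) / n
    X, Y = np.meshgrid(v, v, indexing="ij")
    chi = ((X - 0.5)**2 + (Y - 0.5)**2 <= 0.3**2).astype(float)
    print(n, p1_tv(chi, n))   # stays well above the true perimeter
```

Refining the mesh does not cure the overestimation: the limit is the length of the mid-segment polygon measured with the mesh-induced anisotropy, not the Euclidean perimeter.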
In the next sections we describe recent approaches for tackling this issue. The first series is based on non-conforming finite element representations of the functions. The second is related, but can be seen as a careful discretization of the dual problem.

Denoising with non-conforming FEM
In a recent attempt to improve over the standard P1 discretization, it has been suggested in [34] to consider instead non-conforming P1 discretizations of the functions, following similar attempts for various nonlinear problems [49,72,41,65,64,21]. A survey of such elements is found in [18]. The non-conforming P1, or Crouzeix-Raviart (CR), functions over a mesh T_h of a domain Ω are functions u_h which are affine inside each T ∈ T_h, but required to be continuous only at the center of each facet of T. We will denote CR(T_h) the (finite dimensional) vector space of such functions.
The natural projection Π_h u of a function u ∈ W^{1,1}(Ω) onto CR functions is obtained by averaging u on each facet of the mesh and assigning this value to the center of the corresponding facet. It is easily shown, by Green's formula, that this produces a piecewise affine (not continuous) function such that on each T ∈ T_h, denoting ∇Π_h u the (constant) gradient of Π_h u on T, one has

  ∇Π_h u = (1/|T|) ∫_T ∇u dx.   (21)

In other words, projecting onto CR functions is equivalent to considering a "P0" (piecewise constant) approximation of ∇u. The situation is almost the same for a function u ∈ BV(Ω). One just needs to be a bit cautious: in case the jump set of u has positive H^{d−1}-measure on a facet, "averaging" u on this facet does not really make sense (for instance, one could average the trace of u from either one side or the other, which might differ).
The simplest way to deal with this (technical) issue is to define more precisely the elements T ∈ T_h as possibly semi-closed / semi-open simplices (and not just, as usual, open or closed simplices). Precisely, to each pair of elements T, T′ with H^{d−1}(T̄ ∩ T̄′) > 0 (d being as before the dimension of the space), we associate (arbitrarily) to one and only one of the elements T or T′, say T, the common (d−1)-dimensional facet f = T̄ ∩ T̄′; that is, we assume that f ⊂ T and f ∩ T′ = ∅.
Then, the average of u on the facet f is understood as the average on f of the trace of u|_{T\f}. In this sense, if H^{d−1}(J_u ∩ f) > 0, so that in particular |Du|(f) > 0, the value Du(f) measures (thanks to the decomposition (2)) the possible discrepancy between the average on f of the trace of u|_{T\f} and that of the trace of u|_{T′}, and one obtains as before the formula

  ∇Π_h u = Du(T)/|T| on each T ∈ T_h,   (22)

using that Du(T) = Du(T\f) + Du(f). Now, the interesting point about this construction is that it obviously follows from Jensen's inequality that, for any convex potential Φ : ℝ^d → ℝ ∪ {+∞},

  Σ_{T∈T_h} |T| Φ(∇Π_h u) ≤ ∫_Ω Φ(Du),   (23)

and in particular this holds for the standard total variation, which corresponds to Φ = |·|. This makes it interesting to define, for such Φ, the discrete functional

  TV_h^Φ(u_h) = Σ_{T∈T_h} |T| Φ(∇u_h|_T)   (24)

over all CR functions u_h on the mesh T_h. Indeed, as we just observed, one has TV_h^Φ(Π_h u) ≤ ∫_Ω Φ(Du) if Π_h u is the projection of u onto CR functions. This is interesting because of the following result:

Proposition 3.2 ([34]). Assume in addition that Φ is coercive. Then, as h → 0, the functionals TV_h^Φ Γ-converge to u ↦ ∫_Ω Φ(Du).

The proof of this result is identical to the proof of [34, Prop. 3.1]. A corollary (or part of the proof) is the following pointwise convergence:

Corollary 3.3. For u ∈ BV(Ω), denote Π_h u its projection onto CR functions over the mesh T_h, defined as above. Then lim_{h→0} TV_h^Φ(Π_h u) = ∫_Ω Φ(Du).

This is interesting: even if it does not guarantee any precise error estimate, it makes it more likely that the solution of a discretized problem is close, or equal, to the solution of the corresponding continuous problem.
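The Jensen inequality step just described can be illustrated numerically: averaging a gradient field over each cell (square cells here instead of simplices, for simplicity; the smooth potential is an arbitrary choice of ours) can only decrease the integral of a convex function of it:

```python
import numpy as np

def jensen_check(s=8, n=32):
    """Compare the integral of |grad u| with the sum, over coarse cells, of
    |cell average of grad u| (the CR projection averages the gradient per
    element, as in (21)); by Jensen, the second is never larger."""
    N = n * s
    x = (np.arange(N) + 0.5) / N
    X, Y = np.meshgrid(x, x, indexing="ij")
    # gradient of the smooth function u = sin(2*pi*x)*sin(2*pi*y)/(2*pi)
    gx = np.cos(2 * np.pi * X) * np.sin(2 * np.pi * Y)
    gy = np.sin(2 * np.pi * X) * np.cos(2 * np.pi * Y)
    fine = np.sqrt(gx**2 + gy**2).mean()            # ~ integral of |grad u|
    ax = gx.reshape(n, s, n, s).mean(axis=(1, 3))   # cellwise averages
    ay = gy.reshape(n, s, n, s).mean(axis=(1, 3))
    coarse = np.sqrt(ax**2 + ay**2).mean()          # sum |T| |avg grad|
    return coarse, fine

c, f = jensen_check()
print(c <= f)   # True: the averaged-gradient energy is never larger
```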
We apply this construction to Φ = |·| (so that ∫_Ω Φ(Du) = TV(u; Ω)), defining in this way a Crouzeix-Raviart total variation TV_h^CR. A remark is that for u ∈ BV(Ω), the CR projection Π_h u also satisfies Π_h u ∈ BV(Ω), so one could consider CR functions as "conforming" in the BV setting. We claim however that the approximation TV_h^CR is still "non-conforming", in the sense that it does not count the jumps of u_h, carried by the facets of the mesh, as part of its derivative, as is done for instance in [50,67], where the total variation is approximated by the full functional |Du_h|(Ω) evaluated on the approximate CR functions. In the latter case, one has to expect a mesh dependency phenomenon similar to what is illustrated in Fig. 8 for the approximation of the total variation, and should therefore also expect a smoothing of the discontinuities, as observed in Fig. 9.

FIGURE 10
"ROF" denoising of the characteristic function of the unit disk with Crouzeix-Raviart finite elements (right: detail). While in Fig. 9 the computed energy estimates a perimeter of 6.37, in this experiment the estimated perimeter is ≈ 6.29.

The difference between using (20) and the same problem with the variant TV_h^CR of the discrete total variation is striking in Fig. 10. Now, the orientation of the discontinuity is better rendered and the transition region is sharper (in our particular example, about 2 elements, while in Fig. 9 the discontinuity is smoothed over about 6-7 elements). On the other hand, as a drawback, the values of the function are not well recovered at the discontinuity. However, in general, and particularly for this example, the precision of the method is improved in this framework. One finds in particular in [34] the following result (Prop. 3.4): if the dual problem (5) admits a Lipschitz continuous solution z, then the P0 average of u_h converges to that of u with a better rate than in the conforming case. Here, the P0 average is simply the piecewise constant function obtained by averaging u_h over each element, or equivalently, since u_h is affine on each element, by assigning to each element the value of u_h at its center. Unfortunately, it is unclear when one can expect (5) to have a Lipschitz continuous solution. Even in 2D, assuming that g ∈ L^∞(Ω) is not sufficient, yet it is standard that this is the case in the setting of Fig. 10. The proof of Prop. 3.4 is based on standard convex duality, and makes use in particular of an interesting discrete duality property enjoyed by the Crouzeix-Raviart total variation, which we briefly present in the next Section 3.2.2.
Other discontinuous Galerkin (DG) type approximations have been introduced, recently and less recently, for tackling such problems; see in particular [7] for a very general framework and further estimates which generalize Prop. 3.4. We also mention [50], where a conforming approximation is studied which is a special limiting case of [7], closer to the classical work of Repin [67] (see also [9]), and for which we should not expect an error bound as good as in Prop. 3.4. An interesting variant, also based on a DG representation, is described in [70], where the functions are sampled after convolution with a mask and mesh adaption is used, leading to an expression which also enjoys strong isotropy properties, and which could share similarities with the approximations presented later on (see Sec. 4.2 and 4.3). However, for the latter approaches, it is quite unclear right now how to obtain sound error estimates.

Duality
Let us first introduce some additional notation and definitions. In what follows, given a mesh T_h and u ∈ L¹(Ω), we denote by Π⁰_h(u) the (L²) projection of u onto P0 functions, that is, functions which are constant on each element T ∈ T_h; it is defined as usual by averaging u over each element. We denote by P0(T_h) the vector space of the P0 functions over T_h. We also define the space of "lowest order Raviart-Thomas" vector fields subject to the mesh T_h, denoted RT0(T_h). A basis of this space is composed of the affine vector fields ψ_{T,T'} which vanish on all but two elements T, T' (or one boundary element), have flux |σ| through the common facet σ = T ∩ T' (or the boundary facet, in case of a boundary element) and zero flux through the other facets. If a_T (resp., a_{T'}) is the vertex of T (resp., T') opposite to σ, this vector field (up to switching the sign and/or exchanging T, T') is given on T by ψ_{T,T'}(x) = (x − a_T)/h_T, where h_T = d|T|/|σ| is the height of the element T over the facet σ (and similarly on T'). One sees that ψ_{T,T'} is not continuous; however, its normal flux through σ is one (|σ| after integration on the facet) and does not jump, so that its divergence is a (piecewise constant) function in the sense of distributions. Hence, a generic vector field φ_h : Ω → R^d in RT0(T_h) is of the form of a sum of terms φ_σ ψ_{T,T'} plus boundary terms, where the first sum is over pairs of elements T, T' such that σ = T ∩ T' ≠ ∅, and the second over elements T such that σ = T ∩ ∂Ω ≠ ∅ (with ψ_T defined as above, but on T only). Here φ_σ ∈ R are the pointwise fluxes from T to T' across the common facet (oriented in an arbitrary but fixed direction). Such a field φ_h has a true divergence (in the weak sense). We also define the subspace RT0_0(T_h) of fields φ_h ∈ RT0(T_h) with φ_h · ν = 0 on ∂Ω, that is, fields with only the sum over the interior facets σ = T ∩ T' above, and for which φ_σ = 0 for σ ⊂ ∂Ω. Defining now, for u⁰_h ∈ P0(T_h), the discrete total variation by duality, as a supremum of ∫_Ω u⁰_h div φ_h over constrained fields φ_h ∈ RT0_0(T_h), one has the following result: We must observe a slight difference between this result and the result as stated in [34], where u⁰_h is assumed to be already the projection of a CR function. This is due to the fact, recently proved in [12, Cor. 3.2], that the operator Π⁰_h maps CR(T_h) onto P0(T_h), so that it does not make any difference to assume u⁰_h is any P0 function. The above mentioned paper [12] studies in more depth the connections between CR and RT spaces. This duality property has interesting extensions, studied by S. Bartels in [6] in the framework of the more general DG approximations of [7] (see also [34, App. A] for an elementary generalization).
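As a quick sanity check of the RT0 basis description above, here is a small sketch (our own, with our notation: σ the facet, a_T the opposite vertex, h_T the height over σ) verifying on one concrete triangle that the field (x − a_T)/h_T has unit normal trace on σ and constant divergence |σ|/|T|.

```python
import numpy as np

# Sketch (ours): on a triangle T with facet sigma and opposite vertex a_T,
# the RT0 basis field restricted to T is psi(x) = (x - a_T)/h_T, where
# h_T = 2|T|/|sigma| is the height of T over sigma (d = 2 here).  We check
# the unit normal trace on sigma and the divergence |sigma|/|T|.
cross2 = lambda u, v: u[0] * v[1] - u[1] * v[0]

p, q = np.array([0.0, 0.0]), np.array([1.0, 0.2])   # endpoints of sigma
a_T = np.array([0.3, 1.1])                          # vertex opposite sigma
sigma = q - p
area = 0.5 * abs(cross2(sigma, a_T - p))            # |T|
h_T = 2 * area / np.linalg.norm(sigma)              # height of T over sigma

psi = lambda x: (x - a_T) / h_T
n = np.array([sigma[1], -sigma[0]]) / np.linalg.norm(sigma)
if np.dot(n, p - a_T) < 0:                          # orient n away from a_T
    n = -n

for t in (0.0, 0.3, 1.0):                           # anywhere on sigma
    assert np.isclose(psi(p + t * sigma) @ n, 1.0)  # unit normal trace
# div psi = 2/h_T is constant on T and equals |sigma|/|T|
assert np.isclose(2 / h_T, np.linalg.norm(sigma) / area)
```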

Inpainting with Crouzeix-Raviart finite elements
It turns out that the Crouzeix-Raviart total variation also enjoys a remarkable isotropic behaviour for inpainting problems. We consider an arbitrary (polygonal) domain Ω and problem (7) with g = χ_{{x·ν ≥ t}} for some direction ν ∈ S^{d−1} and t ∈ R, that is, the characteristic function of a half-space. Then, one has the following striking result: Let u_h ∈ CR(T_h) be the projection of the exact solution u onto CR functions. Then for any v_h ∈ CR(T_h) with v_h = u_h at the midpoints of the boundary facets (on ∂Ω), one has: J_h(v_h) ≥ J_h(u_h).
In other words, the discrete projection of the exact solution of the inpainting problem solves the discrete inpainting problem, for any direction ν. Unfortunately, what the Proposition does not say is that, in general, there could be many other solutions, and it is very unlikely to numerically find u_h by solving the discrete inpainting problem. This issue is studied in detail in [34, Sec. 3.4] and we refer to that paper for a numerical illustration. We will see later on that Prop. 3.6 relies on general properties which can be shared by other discretizations, see Sec. 4.1.

DUAL DISCRETIZATIONS
We now turn to discrete total variations defined rather by a dual formulation and appropriate constraints on dual fields, that is, by an appropriate discretization of (1) rather than by a primal energy. The CR approximation, which enjoys (22), could fall into this category; however, for the formulations we will now describe, there is no obvious primal formulation such as (21) (for Φ = |·|).

Raviart-Thomas duals
The first and most natural way to define a discrete total variation by a dual formula is to let, given the mesh T_h and u_h ∈ P0(T_h), the discrete total variation be the supremum of ∫_Ω u_h div φ_h over fields φ_h ∈ RT0(T_h) with |φ_h| ≤ 1 a.e.; this is definition (23). Observe, actually, that this formula also makes sense if u_h is replaced with u ∈ L¹(Ω), and depends only on Π⁰_h(u). In particular, it is easy to show (comparing with (1)) that one will have, for any u ∈ L¹(Ω), a value no larger than the total variation J(u;Ω). It turns out that, with this property, and the fact that if z is a Lipschitz dual field for problem (5) then one can easily "project" it onto an admissible dual field for the definition (23) (up to some small error), one can reproduce almost verbatim the proof of Prop. 3.4 and get the same estimate for this variant; see [20] for details.
In the same way, one can also show a variant of Prop. 3.6, valid for this approximation. To make the statement precise, we assume Ω is a rectangle (or a polygon) and we consider a mesh T_h over Ω, made of simplices (triangles in 2D). Then, we introduce the discrete version of (7)-(8), given g ∈ L¹(∂Ω): the functional is given by a dual formula of the same type as (23), now incorporating the boundary datum g. Of course, since div φ_h ∈ P0(T_h), this depends only on the P0 projection of u and defines, in particular, a Dirichlet discrete total variation on P0(T_h).
Integrating by parts, the expression in the sup is the same as the expression in (7), and equality is reached whenever φ_h · Du = |Du| (in the sense of measures) and φ_h · ν_Ω(x) = sign(g(x) − u(x)) a.e. on the boundary (which is of course ensured when u(x) = g(x)). But it turns out that φ_h · ν_Ω is constant on each facet σ := T ∩ ∂Ω of a boundary element T ∈ T_h, so that the boundary integral ∫_σ (u − g) φ_h · ν_Ω dH^{d−1} also vanishes if, on average on σ, u = g.
It follows that if one considers, as in Prop. 3.6, g = χ_{{x·ν ≥ t}} for some ν ∈ S^{d−1} and t ∈ R, and one lets ḡ_h be the piecewise constant function on ∂Ω given, on each boundary element, by the average of g on the facet, and u_h = Π⁰_h(u), then, on the one hand, one has for any other v ∈ BV(Ω), considering the constant vector field z ≡ ν and using (8), that u is minimizing the total variation for its own boundary condition (this is of course well known!). On the other hand, since z ≡ ν belongs to RT0(T_h) and has norm less than 1, one also has, for v_h ∈ P0(T_h), a matching lower bound on the discrete energy, which is clearly attained for the constant field ν and has value J(u; Ω). We have proved the following: Proposition 4.1. For any ν ∈ S^{d−1} and t ∈ R, with u, ḡ_h, u_h defined as above, u_h is a solution of the discrete inpainting problem. Moreover, the value of this problem is the same as the value J(u; Ω) of the minimization problem over v ∈ BV(Ω).
Hence again, in theory, the inpainting problem should be solved perfectly, whatever the direction of the edge, by such a discretization. In practice, it seems this is not necessarily the case, as there could be again (as shown in [34] for Crouzeix-Raviart discretizations) other, more diffuse solutions to the same problem, and for strange reasons this approach is not good at recovering missing edges. This is illustrated later on (Fig. 11), for the variant we now introduce.

Cubic meshes
On the other hand, the implementation of (23) is not straightforward and optimization might be more difficult than with the previous formulations. An easier variant (which is the one actually studied in [20] and was first introduced and analysed, in a slightly different form, in [39,40]) consists in replacing the simplicial mesh T ℎ with a cubic (square in 2D) mesh based on the pixel representation of the images.
We also introduce dual variables defined on the edges of the mesh, representing average fluxes through these edges: z¹_{i+1/2,j} is a horizontal average flux "in between" pixels (i+1,j) and (i,j), while z²_{i,j+1/2} is a vertical average flux between pixels (i,j+1) and (i,j). We denote z = (z¹, z²) this dual variable. It is naturally extended into a Raviart-Thomas vector field z(x) = (z¹(x), z²(x)) by letting, in each pixel, z¹(x) be the affine interpolation in the horizontal direction between z¹_{i−1/2,j} and z¹_{i+1/2,j} (and similarly for z²(x) in the vertical direction), assuming the boundary fluxes vanish. The Raviart-Thomas total variation on cubic meshes is then defined as in (23); observe that this is well defined, again, for any u ∈ L¹(Ω), even if it depends only on Π⁰_h(u). This variant on cubic meshes enjoys exactly the same properties as the previous definition based on simplicial meshes; in particular, the optimal error bound in Prop. 3.4 holds with the same proof. One also easily defines, as before, a Dirichlet version of this Raviart-Thomas total variation, suitable for inpainting problems: then, Prop. 4.1 is easily seen to hold also in this case. We show in Fig. 13c (see also the detail in Fig. 14) an example of denoising with this variant, where one sees that the reconstruction is very sharp and probably closest to the continuum limit, compared with the other discretizations which we introduce later on (with results displayed in Figs. 13d and 13h). Unfortunately, this is not true for the inpainting problem (cf. the "RT" row in Fig. 11), despite the theoretical result of Prop. 4.1. It seems that here there are many possible solutions with optimal energy, and that the optimization will not necessarily pick the sharpest one.
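To make the pixel construction concrete, here is a small sketch (our own, with made-up flux values) of the Raviart-Thomas extension inside one square pixel, checking that its divergence is constant and equal to the finite difference of the edge fluxes.

```python
import numpy as np

# Sketch (ours): inside a pixel of side h, z1 interpolates affinely in x
# between the left/right edge fluxes and z2 affinely in y between the
# bottom/top fluxes; the divergence of this extension is constant on the
# pixel and is the finite difference of the fluxes.
h = 0.5
zl, zr = 0.3, -0.7          # z1 at the left/right edges of the pixel
zb, zt = 0.1, 0.4           # z2 at the bottom/top edges

z1 = lambda x: zl + (zr - zl) * x / h       # 0 <= x <= h
z2 = lambda y: zb + (zt - zb) * y / h       # 0 <= y <= h

div = (zr - zl) / h + (zt - zb) / h         # constant divergence
eps = 1e-6                                   # numerical partial derivatives
num = (z1(0.2 + eps) - z1(0.2 - eps)) / (2 * eps) \
    + (z2(0.3 + eps) - z2(0.3 - eps)) / (2 * eps)
assert np.isclose(num, div)
```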
Using (26), we see that the discrete total variation introduced in this section can also be defined by a fully discrete expression. In this expression, the constraints correspond to four norm constraints on the vectors (z¹_{i±1/2,j}, z²_{i,j±1/2}), which are equivalent to requiring that the Raviart-Thomas extension z(x) has norm less than one almost everywhere in the pixel. In order to pave the way for the following sections, let us introduce four operators A^{±,±}, acting on the duals z and defined at each pixel (i,j), which select the corresponding corner values. Then the above definition takes the form of a supremum under the four constraints ∥A^{±,±} z∥_{2,∞} ≤ 1, where ∥A^{±,±} z∥_{2,∞} is the maximum (over i, j) of the 2-norm of the 2-vector (z¹_{i±1/2,j}, z²_{i,j±1/2}). We now introduce a variant of this definition, which first appeared in [51] and was further analysed and implemented in [37].
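The equivalence between the pointwise bound on the extension and the four corner constraints can be checked directly. A sketch (our own, for one unit pixel with random fluxes):

```python
import numpy as np

# Sketch (ours): within a pixel, z1 depends only on x and z2 only on y, and
# both are affine, so |z(x,y)|^2 = z1(x)^2 + z2(y)^2 is convex in each
# variable separately and is maximized at a corner.  Hence |z| <= 1 a.e. in
# the pixel is equivalent to the four corner constraints
# |(z1_{i+-1/2,j}, z2_{i,j+-1/2})| <= 1.
rng = np.random.default_rng(1)
zl, zr, zb, zt = rng.uniform(-1, 1, 4)       # edge fluxes, unit pixel

z1 = lambda x: zl + (zr - zl) * x
z2 = lambda y: zb + (zt - zb) * y

xs = np.linspace(0.0, 1.0, 101)
grid_max = max(np.hypot(z1(x), z2(y)) for x in xs for y in xs)
corner_max = max(np.hypot(a, b) for a in (zl, zr) for b in (zb, zt))
assert np.isclose(grid_max, corner_max)
```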

Hintermüller et al / Condat's approach
In [51,37] a variant of (28) is proposed in the following way: using the notation of [37], one introduces averaging operators acting on the dual variables (cf. (30)) and defines J̃_h(u) by the same supremum as before, but with the norm constraints enforced on the averaged duals. Enforcing the constraints in this way, it is no longer guaranteed that the Raviart-Thomas interpolation of the dual field is feasible, and in general, J̃_h(u) ≥ J_h(u). In particular, up to now, no error bound is known for this approximation; in addition, whether it is even consistent was not clear until the recent work [35] (see Prop. 4.2 in Sec. 4.3). Yet, it turns out to be much more "isotropic" than any other approximation introduced so far, in particular for the inpainting problem. Numerous experiments are reported in [37], but the most striking results are the inpainting results shown in [35, Fig. 2], which demonstrate that this approximation is able to recover sharp missing edges in almost any direction, see Fig. 11.

FIGURE 11
(Compare [35, Fig. 2].) Reconstruction quality of various discretization schemes for the problem of inpainting a straight discontinuity (FD is (9), RT is (28), CD is (29)). The numbers above the images refer to the reconstruction error with respect to the ground truth, measured in PSNR.
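A hedged sketch of the averaged constraint, as we read the construction of [51,37] (the function and variable names below are ours, not those of the papers): the two horizontal fluxes and the two vertical fluxes of each pixel are averaged, and a single 2-norm constraint per pixel is imposed on the averaged pair, in place of the four corner-wise constraints of (28).

```python
import numpy as np

# Hedged sketch (ours) of the averaged per-pixel constraint: since averaging
# is a convex combination, any field feasible for the corner-wise constraints
# of (28) remains feasible here, which is why the variant gives a value >=
# the Raviart-Thomas one.
def averaged_constraint_ok(z1, z2):
    """z1: (m, n+1) horizontal edge fluxes; z2: (m+1, n) vertical fluxes."""
    a1 = 0.5 * (z1[:, :-1] + z1[:, 1:])      # average onto the m x n pixels
    a2 = 0.5 * (z2[:-1, :] + z2[1:, :])
    return bool(np.all(np.hypot(a1, a2) <= 1.0))

m, n = 3, 4
ones1, ones2 = np.ones((m, n + 1)), np.ones((m + 1, n))
# fluxes of size 0.7 give averaged vectors of norm ~0.99 <= 1: feasible
assert averaged_constraint_ok(0.7 * ones1, 0.7 * ones2)
# all fluxes equal to 1 give averaged vectors of norm sqrt(2) > 1: rejected
assert not averaged_constraint_ok(ones1, ones2)
```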
We then have the following consistency result: We refer to [35] for a more precise statement, in particular concerning the boundary conditions. In addition to Proposition 4.2, a compactness result for sequences bounded in energy is also shown. Now, what is interesting in this framework is that all coefficients which satisfy (31) allow one to define a consistent total variation. This means that, given a specific task such as inpainting, one can try to "learn" the coefficients which best solve this task. This is a bi-level optimization program, which is hard to solve, but it is stated in relatively low dimension (as the parameters which are learned are just the coefficients of the convolution operators).


TABLE 1
Filters corresponding, respectively, to definitions (9), (28), and (29) of a discrete total variation, with which the inpainting results of Fig. 11 have been computed.

As an illustration, we show in Table 2 a pair of filters learned to best solve the inpainting task of Fig. 11. In these experiments, we consider small filters: the horizontal one has 2 × 3 components and the vertical one 3 × 2 (cf. the coefficients in (30)). In this setting, the filters corresponding to Fig. 11 are shown in Table 1. The results produced by the learned filters are much sharper than with the ad hoc discretizations, see Fig. 12.
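Since the comparisons in Figs. 11-12 are scored in PSNR, we recall the standard definition in a short snippet (ours; the peak value 1.0 is an assumption that intensities lie in [0, 1]).

```python
import numpy as np

# Standard PSNR, as used to score reconstructions against the ground truth
# in Figs. 11-12 (the peak value is an assumption on the image range).
def psnr(x, ref, peak=1.0):
    mse = np.mean((np.asarray(x, float) - np.asarray(ref, float)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

# a uniform error of 0.1 on a [0, 1] image gives exactly 20 dB
assert np.isclose(psnr(np.full((4, 4), 0.4), np.full((4, 4), 0.5)), 20.0)
```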
In Figures 13-14 we show the ROF denoising (Problem (4)) of the image in Fig. 13a, with various choices of the discrete total variation. Figs. 13b-13e show the results with the handcrafted discretizations discussed so far. In the three last figures (13f-13h), the discrete total variation is the learned variant J̃_h, with various strategies for learning the filters. In Fig. 13f, the filters are learned by trying to best recover original images from their noisy versions. Observe that it produces visually the best reconstruction, with sharp edges but a result which does not look too artificial and almost no visible influence of the underlying pixels (the quality of Fig. 13d is almost as good, but staircasing is reduced in 13f without altering the important details, see in particular the detail in Fig. 14). The filters in Fig. 13g are learned by trying to best recover the exact minimizer of (4) for the characteristic functions of disks of various radii. The edges are sharper than in the previous case, but the result is a bit less pleasing. In the last experiment, the filters are learned for inpainting (cf.

FIGURE 12
Reconstruction quality for the filters in Table 2. The results are much sharper, in all directions, than in Fig. 11, and the PSNR values much higher.
the same as in the previous figure, or as with the Raviart-Thomas total variation (Fig. 13c): very sharp edges and crisp details, maybe too aggressive for a natural image. See also Fig. 14, which shows an enlarged detail from Fig. 13.
In [35], many more experiments are carried out, and the process for learning filters for denoising synthetic or natural images is explained in further detail, still in the context of consistent discrete total variations. On the other hand, clearly, proving error estimates for these general filters remains an open question.

CONCLUSION AND PERSPECTIVES
In these notes, we have reviewed various approaches to consistently approximate, in an implementable way, the isotropic total variation. We have shown that different points of view can lead to qualitatively very different results, and sometimes to provably optimal or sub-optimal error estimates. In addition, the choice of a discretization should probably depend on the particular task the functional is used for. These ideas and approaches should be kept in mind when trying