A Low-Rank Approach to Oﬀ-The-Grid Sparse Super-Resolution

,


Introduction
Sparse super-resolution problems consist in recovering pointwise sources from low-resolution and possibly noisy measurements, a typical example being deconvolution. Such issues arise naturally in fields like astronomical imaging [40], fluorescence microscopy [24,42] or seismic imaging [29], where it may be crucial to correct the physical blur introduced by sensing devices, due to diffraction or photonic noise for instance. They are also related to several statistical problems, for instance compressive statistical learning [21], or Gaussian mixture estimation [39].
The tasks we target in this work are framed as inverse problems over the Banach space of bounded Radon measures M(T d ) on the d-dimensional torus T d = R d /Z d . We aim at retrieving a discrete Radon measure µ 0 = r k=1 a 0,k δ x 0,k (a 0,k ∈ C, and x 0,k ∈ T d ), given linear measurements y in some separable Hilbert space H, with y = Φµ 0 + w = y 0 + w ∈ H (1) where Φ : M(T d ) → H is a known linear operator, and w ∈ H some unknown noise.

Beurling LASSO
Although this problem is severly ill-posed, sparse estimates are obtained by solving the following optimization program, known as the BLASSO [10] argmin µ∈M(T d )

2λ
Φµ where the total variation norm |µ|(T d ) of a measure µ ∈ M(T d ) is defined as and extends the ℓ 1 -norm to the infinite-dimensional space of measures, favoring in particular the emergence of Dirac masses in the solution. The parameter λ should be adapted to the noise level; in this work we focus on the case where λ > 0. The main asset here is that no assumption is made about the support of the measure, contrary to early superresolution models where it was assumed to live on some discrete grid [13,14]. This grid-free approach has been at the core of several recent works [3,4,11], and has offered beneficial mathematical insight on the problem, leading to a better understanding of the impact of minimal separation distance [45] or of the signs of the spikes [4,10], and to sharp criteria for stable spikes recovery [12,17].
However, the infinite-dimensionality of (P λ (y)) poses a major numerical challenge, and off-the-grid super-resolution, although analytically better, remains difficult to perform in practice. This paper introduces a new mehod to solve this optimization problem in a scalable way.

Related works
Several approaches to solve (P λ (y)) have been proposed in the literature.

Greedy approaches
In [3], the authors propose to iteratively estimate the support of the initial measure using a conditional gradient (also known as Frank-Wolfe) scheme, which consists in greedy stepwise additions of new spikes. Frank-Wolfe algorithm is well fitted to operate directly over measures and is relatively inexpensive, but converges slowly. To remedy the matter, one can add nonconvex corrective updates as a final step [3,2]. This achieves state-of-the-art result in several applications, among which fluorescence microscopy [2].
Semidefinite approaches (P λ (y)) may also be solved by lifting it to a semidefinite program. Semidefinite approaches for the total variation minimization problem were originally introduced in [4,47], but for unidimensional measures (i.e. d = 1) only. The multivariate case on the other hand raises a far more challenging problem, that may be solved using the so-called Lasserre's hierarchy [32], consisting in a sequence of increasingly better semidefinite approximations of (P λ (y)) or its dual [11,15].
The numerical interest of semidefinite programs holds in that they benefit from efficient solvers; however, these are limited to matrices of size a few hundreds. In our case this represents a major impediment, since the matrices involved in the semidefinite relaxations of (P λ (y)) are typically of size f d c × f d c , where f c is the cutoff frequency of the filter. Up to now, semidefinite relaxations have essentially been successful for combinatorial problems like the MaxCut problem [20], in which the number of variables is high and the degree of the polynomials involved in the constraints is low (it is equal to 2 in the MaxCut case). Imaging problems on the contrary typically involve polynomials with low numbers of variables (essentially the same as the dimension d), but high degrees (f c ), and to our knowledge, no efficient way of solving the hierarchy in those settings has been proposed yet.
Non variational approaches Although we focus here on ℓ 1 -regularization techniques, there is also a vast literature on non-convex or non-varational superresolution schemes. Many of these methods derive from the same idea proposed by Prony to encode the positions of the spikes as zeros of some polynomial, see [30] for a review. This is the case for MUSIC [44], ESPRIT [41], or finite rate of innovation [49], among others. They remain difficult to extend to multivariate settings, see for instance [31].

Contributions
Our first contribution, detailed in Section 2, is a spectral approximation scheme of the forward operator. This is a cornerstone of our approach. The approximation result is formulated in Prop. 1.
Next, in Section 3, we propose a framework of Lasserre-inspired semidefinite liftings for (P λ (y)), that generalizes the approach of [47] to multi-dimensional settings. More specifically, we derive a hierarchy of semidefinite programs approaching (P λ (y)) in arbitrary dimension (Prop. 2). Then, following works by Curto and Fialkow [8] and Laurent [35], Theorem 1 provides a practical criterion to detect the collapsing of the hierarchy.
Our third and main contribution is the FFW algorithm, detailed in Section 5, a Fourier-based Frank-Wolfe approach for these liftings, which requires only O(f d c log f c ) elementary computations per iterations, making it scalable for super-resolution problems in dimension greater than one.
The code to reproduce the numerical results is available online 1 .

Notations
Measures We consider measures defined over the torus T d = (R/Z) d . Note that in practice, in order to avoid periodization artifacts, one can use zeropadding or symmetrization at the boundary of the signal and still consider a periodic setting. We denote by M(T d ) the space of bounded Radon measures on T d , endowed with its weak-* topology. It is the topological dual of the space C (T d ) of continuous functions on T d .

Linear operators
We consider a linear operator Φ : where ϕ : T d → H is assumed to be smooth. We denote by L the set of operators of this form. A typical instance is a convolution operator, where ϕ(x) =φ(· − x), withφ ∈ L 2 (T d ) and H = L 2 (T d ). An important example is ideal low-pass filtering, as considered in [4], which is a convolution with the Dirichlet kernelφ D (x) def.
= k∈Ωc e 2iπ k, x , where for some cutoff frequency f c ∈ N * . This is equivalently modeled in the Fourier domain by taking ϕ(x) = (e −2iπ k, x ) k∈Ωc , and H = C |Ωc| . Φ is then simply the Fourier operator F c , defined as Other commonly encountered imaging problems involve non translation-invariant operators, such as subsampled convolution. Given a sampling domain S ⊂ T d , this is modelled as ϕ(x) = (ψ(s − x)) s∈S . In that case, H = C |S| . Lastly, we also consider in this paper more difficult settings, where the kernel is defined as ϕ(x) = (ψ(s, x)) s∈S , with H = C |S| . An example of non translation-invariant operator, studied in Section 2.2, are foveated measurements [6].
Trigonometric polynomials, Laurent polynomials A multivariate Laurent polynomial of degree ℓ is defined as where z k must be understood as z k1 the set of Laurent polynomials of degree ℓ. The restriction to the unit circle of a Laurent polynomial is a trigonometric polynomial. When working with multivariate polynomials, one need to choose some ordering on the monomials. Typical examples of monomial orderings are the lexicographic order, or the graded lexicographic order. In our case, we use the colexicographic order.

Fourier approximation of operators
In this section we detail a spectral approximation method for the forward operator Φ, that is consistent with semidefinite approaches (see Section 3) to solve (P λ (y)). Our model encompasses deconvolution problems, as well as nonconvolution ones -in particular, we focus on the case of subsampled convolution and foveation (see Section 2.2), which are common in imaging problems.

Spectral approximation operator
Let Φ ∈ L be an operator of the form (2), with kernel ϕ (in the following, we assume that ϕ is sufficiently smooth, namely ϕ ∈ C j (T d ; H), with j ⌊ d 2 ⌋ + 1). It is possible to define a Fourier series expansion of ϕ, using the coefficients We refer to the paper of Kandil [28] for the theory of the Fourier series of Hilbert-valued functions. In particular, as H is separable, it is shown that the Parseval formula holds, i.e. for all ψ ∈ L 2 (T d ; H), As a consequence, the following equality holds strongly in L 2 (T d ; H), Now, given f c ∈ N, we are interested in approximating Φ with some operator Φ c whose kernel ϕ c has spectrum supported on Ω c , i.e.
Definition 2 (Spectral approximation operator). The spectral approximation of Φ is the operator Φ c ∈ L c defined by From Parseval's equality (5), we see that Φ c is the best approximation of Φ in L c in terms of the Hilbert-Schmidt operator norm, Ψ → Ψ def.
The next result shows that the solution of the BLASSO when replacing Φ with Φ c approximate the solutions of (P λ (y)).
For each f c ∈ N, let Φ c ∈ L c be its spectral approximation operator, and let µ fc be a minimizer of E fc (µ) def.
. Then, the sequence (µ fc ) fc∈N has accumulation points in the weak-* topology and each of them is a solution to (P λ (y)).
Proof. By definition of E fc and µ fc , As the total variation ball (of radius 1 2λ ||y|| 2 ) is compact and metrizable for the weak-* topology, the sequence (µ fc ) fc∈N has accumulation points. Let µ ⋆ be any accumulation point and let us denote by (µ n ) n∈N any subsequence which converges towards µ ⋆ in the weak- * sense. We denote by Φ n (resp. ϕ n ) the corresponding operator (resp. kernel). Let µ ∈ M(T d ) be any Radon measure.
We note that, as hence the series k∈Z d c k (ϕ)e k converges uniformly on T d towards ϕ. As a result, ϕ n converges uniformly towards ϕ, and Moreover, for any h ∈ H, The first term vanishes by uniform convergence of ϕ n and boundedness of (µ n ) n∈N , whereas the second one vanishes by the weak-* convergence of µ n towards µ ⋆ . Hence, Φ n µ n ⇀ Φµ ⋆ weakly in H, and To complete the proof, we note that by definition of µ n , we have and passing to the inferior limit, we get, As this is true for any µ ∈ M(T d ), we deduce that µ ⋆ is a solution to (P λ (y)).
By construction, Φ c µ only depends on the Fourier coefficients c k (µ) for k ∈ Ω c . Indeed, for any µ ∈ M(T d ), This leads to the following definition.
Definition 3 (Spectral approximation factorization). The spectral approximation operator may be factorized as = c −k (ϕ). In particular, when H is of finite dimension N , the matrix of A(ϕ) ∈ M N,|Ωc| (C) is given by We call this matrix the spectral approximation matrix of Φ.
In the rest of the paper, we therefore focus on solving the problem (P λ (y))

Examples
We give a few instances of problems for which the matrix A(ϕ) may be computed. Each case discussed below is illustrated in Fig. 1.
Deconvolution The convolution with some kernelφ ∈ L 2 (T d ), obtained with ϕ(x) =φ(· − x), is equivalently obtained by multiplying the Fourier coefficients of the measure with those ofφ. In other words, one might equivalently choose H = ℓ 2 (Z d ), and ϕ(x) = c k (φ)e −2iπ k, x k∈Z d . The spectral approximation matrix of a convolution operator is thus simply the diagonal matrix A(ϕ) = Diag (c k (φ)) k∈Ωc .

Dirichlet Gaussian
Gaussian foveation Subsampled Gaussian Figure 1: Examples of measurements. In the first two cases (left), the observations y live in the Fourier domain, and we plot F * c y. In the last two cases (right), y lives on a grid G, and can be plotted directly.
which is also, by Poisson summation formula,φ(x) = k∈Z dĝ(k)e −2iπ k, x , whereĝ denotes the continuous Fourier transform of g. Hence the approximation matrix reads A(ϕ) = Diag(ĝ(k)) k∈Ωc , withĝ(k) = (2π) The error induced by the spectral approximation in the Gaussian case is described in Fig. 2. It can be made negligible with f c sufficiently large.
Microscopy In practical cases, one often only has access to convolution measurements over some sampling grid G. In fluorescence microscopy for instance [22], the observations are accurately described as subsampled Gaussian measurements. In that case, given some convolution kernelφ ∈ L 2 (T d ) (typically a which leads to Remark 1. When the grid G is regular, multiplication by A(ϕ) or A(ϕ) * may be computed efficiently using fast Fourier transforms, see Section 5.4.
Foveation In more general cases, ϕ may be defined as in which case the lines of A(ϕ) consist in the Fourier coefficients of x →φ(s, x) at frequencies taken in Ω c . Section 6 studies foveation operators [6], whereφ takes the formφ for some smoothing function g, typically a Gaussian, and some (positive) covariance function σ. Fig. 1 shows an example of Gaussian foveated measurements, for which the matrix A(ϕ) has no closed-form but may be approximated numerically using the discrete Fourier transform. The approximated foveation kernel is displayed in Fig. 3.

Semidefinite hierarchies
In this section, we generalize the semidefinite programming formulation of the atomic norm used in [47] to the multidimensional case. As in polynomial optimization [33,11], the multidimensional case is much more involved than the one-dimensional one, and needs the introduction of a hierarchy of semidefinite programs. Loosely speaking, these semidefinite approaches consist in replacing measures with (infinite) moment sequences, assuming they are compactly supported. The so-called semidefinite hierarchies then result from truncating these moments, and, in the case where the domain is furthermore semi-algebraic, invoking semidefinite characterizations of moment sequences, see e.g. [8].
First, to make the connection between (P λ (y)) and the atomic norm minimization problem of [47] explicit, one can see that (P λ (y)) is actually equivalent to min z∈C (2fc +1) d (12) Therefore, given z ∈ C (2fc+1) d , we focus in this section on the constrained problem Note that, by compactness and lower semi-continuity, the above problem has indeed a minimum. Moreover, its value is the so-called atomic norm of z introduced in [5]. The purpose of the present section is to approximate (Q 0 (z)) with problems involving only a finite number of moments of an optimal measure µ and its absolute value |µ|.

Generalized Toeplitz matrices, moment matrices.
Let ℓ f c , and m def.
Definition 4 (Generalized Toeplitz matrix). We say that R ∈ C m×m is a generalized Toeplitz matrix (also called Toeplitz-block Toeplitz, or mulitlevel Toeplitz), denoted by R ∈ T m , if for every multi-indices i, j, k ∈ −ℓ, ℓ d such that ||i + k|| ∞ ℓ and ||j + k|| ∞ ℓ, If R is the trigonometric moment matrix of some measure µ, it obviously satisfies R ∈ T m , as If the ordering on the multi-indices is colexicographical, the generalized Toeplitz property rewrites where Here θ kj denotes the (2ℓ + 1) × (2ℓ + 1) Toeplitz matrix with ones on its k j -th diagonal and zeros everywhere else, and ⊗ stands for the Kronecker product. For instance, with 2 × 2 matrices, for d = 2 and k = (−1, 0), one has

Studied relaxation
Now, we consider for ℓ f c , and m def.
In the rest of the paper, we write Remark 2 (Alternative form). In fact, as noted in [38], it is possible to show by an homogeneity argument that the term τ must be chosen equal to 1 m Tr(R). Indeed the positive semi-definiteness constraint in (a) is equivalent to the following three conditions (see for instance [8]) Therefore, at optimality, τ = α, Rα , and replacing α, R with tα, 1/tR for t > 0 yields another feasible point with energy 1 2 1 tm Tr(R) + tτ . Minimizing that quantity over t yields the equality of the two terms. Therefore, (Q Incidentally, notice that at optimality the rank of the large matrix in (a) and (a') is equal to rank(R).
Moreover, lim ℓ→+∞ min (Q (ℓ) 0 (z)) = min (Q 0 (z)). The proof is a straightforward adaptation of the approach used in real polynomial optimization using Lasserre hierarchies [32]. We include it for the sake of completeness.
Proof. First, we note that if τ , R ′ and z ′ are feasible for (Q (ℓ+1) 0 (z)), then τ , R andz are feasible for (Q (ℓ) 0 (z)), where R and z respectively denote the restrictions of R ′ and z ′ to Ω ℓ . Since R ′ is constant on its diagonals, we get That yields the first inequality. Now, for the second inequality, let µ ∈ M(T d ) such that (Fµ) k = z k , and ξ(x) def.
= dµ d|µ| (x) be its sign (defined |µ| almost everywhere). For any ℓ f c , consider R ∈ C m×m , z ∈ C m and τ defined by which yields (a). As a result, R, z, τ is admissible for (Q (ℓ) 0 (z)) with energy ). To prove that the limit of the sequence is indeed (min (Q 0 (z))), let us consider the dual problem to (Q 0 (z)), It is possible to check that (D 0 (z)) always has a solution [16] and that strong duality holds (see for instance [4]), max (D 0 (z)) = min (Q 0 (z)).
On the other hand, one may show that a dual problem to (Q (ℓ) 0 (z)) is given by As before, there exists a solution to (D (ℓ) 0 (z)) and max (D ). Now, let ε > 0, let p be a solution to (D 0 (z)) and let p ε def.
which yields the claimed convergence.

Tightness of the relaxation and low rank property
The next proposition discusses the equality case between (Q (ℓ) 0 (z)) and (Q 0 (z)), referred to as collapsing of the hierarchy, by interpreting R as a moment matrix.
for all i, j ∈ Ω ℓ . In particular, if µ is a discrete measure with cardinal r, then rank R r.
Proof. Assume that min (Q (ℓ) 0 (z)) = min (Q 0 (z)), and let µ be a solution to (Q 0 (z)). Define R, z and τ by (17). As in the proof of Proposition 2, we see that (R, z, τ ) is admissible for (Q (ℓ) 0 (z)), with energy 1 2 Thus R is a sum of at most r rank one matrices, and rank R r.
Remark 3 (Low rank solutions). It is important to note that, as proved in [47,Prop. 2.1], the equality min (Q (fc) 0 (z)) = min (Q 0 (z)) always holds for d = 1. Therefore, if one seeks to recover a sparse measure µ = r i=1 a i δ xi solution to (Q 0 (z)), then some solution to min (Q (fc) 0 (z)) has rank (at most) r. In dimension d = 2, it is known that the Lasserre hierarchy collapses at some order ℓ that may be arbitrarily large [11,Section 4]. In our experiments however, we have always observed min (Q (fc) 0 (z)) = min (Q 0 (z)). This is why our approach, exposed in Section 5, focuses on capturing a low-rank solution to (Q (fc) 0 (z)). To conclude this section, we give a practical criterion to detect collapsing. As observed by Curto and Fialkow [8], the flatness property (see definition below) is essential when trying to determine whether some matrix is the moment matrix of a measure.
Given a matrix R ∈ H m , we write the block decomposition Definition 5 (Flat matrix). Let R ∈ H m . In the decomposition (20), we say that R is flat (or that R is a flat extension of A) if rank R = rank A.
The following result adapts Theorem 7.7 in [8] to our setting. As it is not exactly the one used in [8] (especially in terms of multi-dimensional degree), we provide a proof in Appendix A. Theorem 1. Let ℓ 2, and R ∈ H + m ∩T m be a positive semi-definite generalized Toeplitz matrix. If R is flat, then there exists a positive (rank R)-sparse Borel measure ν such that R is the moment matrix of ν.

Support reconstruction
We now discuss the problem of recovering the support {x 1 , . . . , x r } of the underlying positive measure ν = r j=1 a j δ xj from its moment matrix R ℓ (ν).
A first approach, as in [27], is to use the method proposed in [23]. In this section, we present a slightly different procedure, leveraging the low-rank factorization R ℓ (ν) = U U * of the moment matrix that our algorithm provides, see Section 5. It follows the method exposed in [33,Section 4.3], and is summarized in Algorithm 1.
By construction of moment matrices, one has , and D = diag(a 1 , . . . , a r ). In the following, we write v(x 1 ), . . . , v(x r ) the columns of V (x). LetŨ be the reduced column echelon form of U , i.e.Ũ has the form It is obtained by Gaussian elimination with column pivoting. The matrices V (x) andŨ span the same linear space. In particular, for any 1 j r, v(x j ) ∈ ImŨ . Therefore, there exists w( and one can verify that where γ 1 , . . . , γ r are the indices of the lines containing the pivots elements inŨ . The recovery procedure thus reduces to solving the system of (trigonometric) for x ∈ T d . A popular method to solve such problems is the Stetter-Möller method [37], also known as the eigenvalue method, which relates the solutions of (22) to the eigenvalues of so-called multiplication matrices, that can be build directly from the matrixŨ . The multiplication matrices in the basis B def.
= {e −2iπ γ1, · , . . . , e −2iπ γr, · } are the matrices N n , 1 n d, that satisfy In practice, the matrices N n can be read straightforwardly using (21) by selecting the rows indexed by the monomials e −2iπ γj +en, x inŨ , for j = 1, . . . , r. Then, using (23), one can retrieve the coordinates of each point of the support by computing the eigenvalues of each matrix N n , n = 1, . . . , d. However, it is not clear how to put these coordinates together for recovering the positions x 1 , . . . , x r ∈ T d . To this end, it is better to consider the eigenvectors w(x j ), j = 1, . . . , r. Indeed, these eigenvectors are common to each matrices N n . Therefore, if one consider the matrix N = d n=1 λ n N n , where λ n are random real numbers such that n λ n = 1, then all the eigenspaces of N are 1-dimensional with probability 1, and spanned by the vectors w(x j ), j = 1, . . . , r. Thus, writing the Schur decomposition N = QT Q * , where Q = q 1 , . . . , q r is orthogonal and T is upper triangular, yields from where we deduce the positions x j , for 1 j r. = q ′ j N n q j , j = 1, . . . , r, n = 1, . . . , d return x j,n = − 1 2π arg z j,n mod 1, j = 1, . . . , r, n = 1, . . . , d

Semidefinite relaxation for the BLASSO
In view of Remark 2, concatenating the minimization in z and in (R, τ ), we are led to solve the following problem where m = (2ℓ + 1) d . The problem (P (ℓ) λ (y)) is the semidefinite relaxation of (P λ (y)) at order ℓ.
As it appears, the size of the above semidefinite program is m 2 (2f c +1) 2d . Therefore, usual interior points methods are limited when d > 1, or even when d = 1 for large values of f c . In the rest of this paper, we introduce a method which scales well with the dimension d.
Thanks to the low-rank property highlighted in Section 3.3, the search space of (P (ℓ) λ (y)) may be restricted to rank-deficient matrices. Such geometry is well exploited by conditional gradient algorithms. However, these methods are not able to handle the SDP contraint (a) together with the linear constraint (c). Furthermore, the intersection between manifolds of fixed rank matrices and the linear space defined by (c) quickly becomes unanalyzable for matrices larger than 2 × 2, and non-convex optimization schemes on this search space would likely be difficult to implement. Instead, we propose to smooth the geometry of the problem.

Toeplitz penalization
To overcome the difficulty induced by constraint (c) in (P (ℓ) λ (y)), we introduce a penalized version of (P (ℓ) λ (y)) -which can also be seen as a perturbation of the atomic norm regularizer used in [47,46].
Let P Tm be the projector on the set T m ; when working with the colexicographical order, as we do in our numerical simulations, P Tm takes the form where the matrix Θ k are introduced in Section 3.1. In dimension one, the operator P Tm replaces each entry of R by the mean of the corresponding diagonal. We consider the following program: where the parameter ρ controls the penalization of the Toeplitz constraint. We write f λ,ρ the objective of this problem, and R λ,ρ a solution. We study numerically the validity of this approach. For simplicity, the numerical experiments of this section are all in dimension one. , with respect to ρ. Results are averaged over 100 random trials of positive 3-sparse initial measures; the minimal separation distance, i.e. the minimal distance between two consecutive spikes, is larger than 1/(10f c ) in all the cases.

Sensitivity analysis
Although R λ,ρ obviously differs from the true solution R λ , it is possible to show, following an approach similar to [48], that under some mild nondegeneracy hypothesis on η λ,ρ , and for small enough values of ρ, R λ,ρ is sufficiently close to R λ to allow accurate support reconstruction. In particular, both matrices have the same rank. Numerical observations confirm that this regime exists: Figure 4 shows the evolution of rank R λ,ρ with respect to ρ. We see that the rank of R λ,ρ remains stable for low values of ρ, and equal to the sparsity of the initial measure.

Support recovery
We observe numerical evidences of the robustness of the extraction procedure described in Section 3.4 in this penalized setting: although the solutions of (P (ℓ) λ,ρ (y)) do not exactly satisfy the generalized Toeplitz property, and therefore are not moment matrices, Algorithm 1 still yields a good estimation of the support of µ λ . In particular, in a satisfyingly large regime of values of parameter ρ, the eigenvalues of the multiplication matrices (see Section 3.4) remain very stable: in Fig. 5, we consider a 1D setting, and we plot arg z j (ρ), where z j (ρ), 1 j r are the eigenvalues of the multiplication matrix of Algorithm 1, extracted from a solution of (P (ℓ) λ,ρ (y)). The width of the line is defined as | log(|1 − |z j ||)|, so that the thicker the line is, the closer z j (ρ) is to the unit circle. We see that the extraction procedure is very stable, up to a certain point, which coincides with the first rank drop of R λ,ρ .

FFT-based Frank-Wolfe (FFW)
In this section, we propose an efficient numerical scheme for solving (P (ℓ) λ (y)), that takes advantage of the low-rank property of the solutions as well as of the convolutive structure of the Toeplitz constraint. Given a matrix R of the form (15), we consider in the rest of the paper the normalized objective function where C 0 = 2λ/ y 2 H .

Frank-Wolfe
The Frank-Wolfe algorithm [19], a.k.a. conditional gradient, aims at minimizing a convex and continuously differentiable function f over a compact convex subset K of a vector space. The essence of the method is as follows: linearize f at the current position R t , solve the auxiliary linear problem of minimizing S → ∇f (R t ), S on K, and move towards the minimizer to obtain the next position. This scheme ensures the sparsity of its iterates, since the solution after k iterations is a convex combination of at most k atoms. We refer to [26] for a detailed overview of the method. Since no Hilbertian structure is required, it is a good candidate to solve problems in Banach space [3]. Moreover, in many cases, the linear minimization oracle may be computed efficiently, as it amounts to extracting an extremal point of the set K.
Over the semidefinite cone In our case, K is the positive semidefinite cone, which is linearly spanned by unit-rank matrices. It is not bounded (hence not compact), but one can restrict (P (ℓ) λ,ρ (y)) over a bounded subset of the cone by noticing that for any solution R ⋆ of (P (ℓ) λ,ρ (y)), one has with equality if and only if S ′ = ( α i )e 1 e h 1 , hence the desired result.

The FFW algorithm
To solve (P (ℓ) λ (y)), we apply a plain Frank-Wolfe scheme, to which we also add a non-convex corrective step similar to [2], see Algorithm 2. This last step compensates the slow convergence of Frank-Wolfe.
Linear minimization oracle As mentioned above, the linear minimization step of Frank-Wolfe in our case simply amounts to computing a minor eigenvector of the gradient of f (or more specifically of J −1/2 m ∇f J −1/2 m , but we omit this scaling in the following for simplicity) at the current iterate. In practice, we perform this step efficiently with power iterations, see Section 5.3. In order to determine the lowest singular value of ∇f (simultaneously with a corresponding singular vector), we often need to run the algorithm two times consecutively: if after the first run, a negative eigenvalue λ 1 < 0 is returned, this is indeed the lowest one (since power iterations retrieve the eigenvalue whose magnitude is the greatest); if not, we re-run the algorithm on ∇f −λ 1 I, and add the resulting value to λ 1 to obtain the lowest singular value of ∇f .
Low-rank storage To take advantage of the low-rank structure of the solutions, we store our iterates as R = UU * , and work only with the factor U . This is a cornerstone of our approach, since in practice the matrix R is too large to be stored entirely. Furthermore, this factorization allows an efficient implementation of several steps of FFW (see Section 5.3) that considerably lowers the complexity of the algorithm.
Consequently, at each step of the algorithm, the update consists in adding a column (namely a leading eigenvector of ∇f , as mentioned above) at the end of the matrix U , thus increasing by one the rank of R each time.
Non-convex corrective step The non-convex step that we add after each Frank-Wolfe update consists, as in [2], in a gradient descent on F : U → f (UU h ). The idea is to continuously move the eigenstructue of the iterate in the manifold of fixed rank matrices to improve the value of the functional. This is similar to the celebrated Burer-Monteiro non-convex method for low-rank minimization , which has proven to be vey efficient in practice [1]. We use a limited-memory BFGS descent in our implementation.
Stopping criterion It is known [26] that if S is a solution of the linear minimization oracle, then it satisfies the inequality R − S, ∇f (R) f (R) − f (R ⋆ ) for any R. We use this property as a stopping criterion, ceasing the iterations if R − S, ∇f (R) goes below some tolerance ε.

Fast-Fourier-Transform-based computations
In what follows, we consider a variable R of the form (15), of rank r, such that R = UU * where with U 1 ∈ C m×r and u ∈ C r . In particular, this yields R = U 1 U * 1 , z = U * 1 ζ and τ = ||ζ|| 2 . For clarity, we keep using R, z and τ in our expressions, but in practice only U 1 and ζ are stored. Finally, we assume that the matrices are indexed following the colexicographic order, so that generalized Toeplitz matrices may be written in the form (14). In this section only, we use the same symbol F to refer to the discrete Fourier transform (instead of the discrete-time Fourier transform).
Power Iterations Computing a minor eigenvector of ∇f can be done using power iterations, which consist in recursively applying ∇f to a vector. Given the form (24) of the objective, its gradient at R reads so that multiplying it with a vector w = w 1 ω (w 1 ∈ C m , ω ∈ C) yields (∇f ) w = C 0 2m While the first three terms in the sum above are quite straightforward to compute (remember that Rw 1 is computed as U 1 (U * 1 w 1 )), evaluating P Tm (R)w 1 on the other hand can be costly. The next two propositions show that it can actually be performed in O(m log m) operations using only fast Fourier transforms.
Proof. Since ||P Tm (U 1 U * 1 )|| 2 2 = s,t∈Ω ℓ |u s−t | 2 , we have: Minimizing this quantity with respect to u leads to which yields the desired result.
Proof. The product T w 1 is the (aperiodic) convolution of u and w 1 , and may be formulated as a periodic convolution between u and a zero-padded version of w 1 , hence the result.
We conclude this section by giving the closed-form expression for the linesearch coefficients α r and β r in step 2 of Algorithm 2, assuming in our notations that U r = U and v r = w. Proposition 7. With the same notations as before, let and let ∆ = α, β ∈ [0, 1] 2 ; α + β 1 . Then, the solutions α r , β r of the linesearch of step 2 in Algorithm 2 are given by Proof. This is a straightforward consequence of minimizing the quadratic form f (αR+βww * ) = c 11 α 2 +c 22 β 2 +c 12 αβ+c 1 α+c 2 β+ 1 2λ ||y|| 2 H over ∆. Furthermore, it is realistic to consider only the three cases above (in particular, one never has c 11 = c 22 = 0 or 4c 11 c 22 − c 2 12 = 0 in practice).

Implementation of the approximation matrix A
When dealing with problems that are not deconvolution, such as subsampled convolution or foveation,the approximation matrix A(ϕ) is a full matrix of size |G| × (2f c + 1) d , G being the grid on which the measurements live, making it numerically difficult to handle in practice. We give an efficient way to implement multiplication with A(ϕ) in the case of subsampled convolution, when the grid is a regular lattice. In other cases (foveation in particular), one needs to store the full matrix.
We consider convolution measurements over the grid G def.
The approximation matrix has the form A(ϕ) = (c −k (φ)e 2iπ k, t ) t∈Γ,k∈Ωc , see Section 2. Let S ↑ q and S ↓ q be respectively the upsampling and downsampling (by a factor q) operators. For K 2f c + 1, let Pad K : C |Ωc| → C K d , be defined as and let Restr Ωc : C K d → C |Ωc| be its adjoint. Finally, let q ∈ N, such that q ⌈ 2fc+1 L ⌉, and let e c def.
= e − 2iπ L fc, n n∈ 0,Lq−1 d . The following proposition gives an efficient O(L d log L) implementation of the multiplication by A(ϕ) or A(ϕ) * . Proposition 8. Given z ∈ C |Ωc| and y ∈ C |G| , one has where D(ϕ) def.
Proof. For t ∈ Γ, we write t = n L , with 0 n L − 1. Then Lq k, qn (Pad Lq (Dz)) k = (Lq) (e c ⊙ iDFT(Pad Lq (Dz))) qn which yields the first equality. The second equality on the other hand is obtained by passing to the adjoint.

Complexity
Using the FFT implementations described in the two previous sections, we are able to decrease the computational cost of the two elementary operations in FFW: evaluating f ′ (UU * )w and evaluating F ′ (U ), for U ∈ C (m+1)×r and w ∈ C m+1 (and m = (2ℓ + 1) d , ℓ f c ). Table 1 summarizes the costs of both these operations in the three settings we consider in this paper, i.e. convolution (for which the approximation matrix A(ϕ) is diagonal, see Section 2.2), subsampled convolution (for which A(ϕ) is implemented using Prop. 8) and foveation (for which A(ϕ) is a dense matrix).

Numerics
We study in this section the behavior of FFW with respect to parameters such as sparsity, minimal separation distance, λ, ρ or the number of BFGS iterations.
Power Iteration step The tolerance for the power iteration step is set to 10 −8 , with a maximum of 2000 iterations. Iterations are stopped when the angle between the eigenvectors returned by two consecutive steps goes below the tolerance.

BFGS step
We use Mark Schmidt's code for the BFGS solver [43]. The tolerance for this step is set to 10 −11 (in terms of functions or parameters changes), with typically 500 as the maximum number of iterations. This step is crucial to ensure the finite convergence of the algorithm. When ρ tends to zero, plain Frank-Wolfe steps become significantly insufficient, and the number of BFGS iterations necessary to converge at each step increases, see Fig. 8.
Stopping criterion As explained in Section 5.2, the linear minimization oracle in Frank-Wolfe gives access to a bound on the current duality gap, which can then be used to decide when to stop the algorithm. However, it is difficult to define a stopping criterion based on this property that remains stable from one kernel to another. Therefore, in our implementation, we rather use the objective decrease as stopping criterion, and stop the iterations when |f (R t+1 ) − f (R t )| goes below some tolerance ε (where f is the normalized objective). In the tests presented in this section, ε is set to 10 −8 .

Support extraction
In the tests presented here, the extraction step (see Section 3.4) is perfomed following the approach of [27], using the implementation of Cédric Josz, that he kindly let us use.

Tests on synthetic data
To generate our measurements, we follow (1). The ground-truth measures µ 0 are randomly generated: specifically, we draw random positions uniformly over T d , and random amplitudes uniformly over [−1, 1]. Figure 6 gives instances of reconstruction in each setting described in Section 2.2, with reconstruction errors with respect to µ 0 .
Finite convergence Remarkably, FFW converges in very few steps, usually as many as the number of spikes composing the solutions. Fig. 7 shows the number of FFW iterations with respect to the sparsity of the initial measure, averaged over 200 random trials, i.e. random positions and random amplitudes. The red curve correspond to a subset of these trials for which the minimal separation distance is greater than 1/f c . We see that in these simpler cases,  FFW converges exactly in r-steps, r being the number of spikes composing the solution.
Performance Several metrics can be used to measure the recovery performance of FFW. We use two of them in our tests: the Jaccard index [25], and the flat norm [18], also called dual bounded Lipschitz norm. The Jaccard index measures the similarity between the initial (finite) support S 0 ⊂ T d and the (finite) reconstructed support S r ⊂ T d . It is defined as The value |S 0 ∩S r | is determined with respect to some tolerance δ: given x 0 ∈ S 0 , we consider that x r ∈ S r belongs to S 0 ∩ S r if ||x 0 − x r || δ (x r is called a true positive), and x 0 cannot be associated to more than one point in S r . In our tests using the Jaccard index (see Section 6.2), we set δ = 10 −2 . The flat metric on the other hand is an optimal transport based metric, which we use to measure how close the reconstructed signal is from the source signal (with respect to both positions and amplitudes). It can be computed using linear porgramming. The first plot in Fig. 8 shows the flat distance between the measure µ λ,ρ reconstructed by FFW, and a solution µ λ of (P λ (y)) computed using MOSEK. The results are averaged over 680 trials (random positions, random amplitudes and random sparsities in 2, 8 ), and sorted with respect to the minimal separation distance, either lower than 1/f c (dashed line, 417 cases) or greater (solid line, 263 cases). As expected, the quality of the reconstruction decreases as the relaxation parameter ρ increases. On the other hand, the computational cost, represented in the second figure in terms of total number of FFT performed, decreases as ρ increases. For this experiment, the maximum number of BFGS iterations was set to 1000. This cost comes essentially from the BFGS iterations. In all our 1D tests, setting ρ between 1 and 10 gave good performance. The sweet spot seems to strongly depend on the dimension d: for d = 2, better

Tests on SMLM data
We present some results of FFW applied to data taken from the SMLM challenge [22]. Fig. 9 shows an example of reconstruction for one image of the challenge. On these data, the performance is measured by the Jaccard index, since the challenge gives information only about the locations (and not the amplitudes) of the Dirac masses.
In FFW, the most costly step is the BFGS step. Fig. 10 shows the impact of diminishing the maximum number of BFGS iterations on the quality of the reconstruction (measured in terms of Jaccard index). The red solid line represents the time taken by FFW, in seconds, and the dashed line the time spent in the BFGS iterations. Results are averaged over 20 random images taken from the challenge. We see that a low bound on the number of BFGS iterations strongly deteriorates the performance. On the other hand, we do not gain much  Figure 11: We measure the performance (in terms of Jaccard index) of FFW with respect to parameters λ and ρ. Each pixel is obtained by averaging over 20 images. by setting this bound higher than 250. Finally, Figure 11 shows the Jaccard index with respect to parameters ρ and λ 0 . Each pixel is obtain by averaging over 20 random images taken from the challenge. This gives an idea on the range of choices for ρ and λ 0 in which FFW performs well. Although the choices for λ does not change much following the different settings (kernel, dimension), the best values for ρ seems to depend on d, as mentioned above.

Conclusion
In this paper, we introduce a numerical solver for the sparse super-resolution problem in dimension greater than one and in various settings, including deconvolution. Our approach is based on the one hand on a spectral approximation of the forward operator, and on the other hand on a hierarchy of semidefinite relaxations of the initial problem, inspired by the Lasserre hierarchy [33].
Low-rank solutions of the resulting SDP are then extracted using a FFT-based Frank-Wolfe method, which is scalable with the dimension. An interesting line of future works could be to extend the FFW algorithm to other problems that fit into the semidefinite hierarchies framework, such as optimal transport.

Acknowledgements
This work was supported by grants from Région Ile-de-France and the ERC project NORIA.

A Proof of Theorem 1
In this section, we prove Theorem 1, that is, we explain the connection between flat extensions (in the sense of Definition 5) and the existence of a representing measure ν. The proof essentially relies on the ideas of Curto and Fialkow [7,9,8] for the truncated K-moment problem (see also the variant [34] by Laurent) though our problem does not exactly fit into the framework of the aforementioned articles. While experts in this topic should have no difficulty in filling he gap, we present here some detail which might help the non-specialists.
In the monograph [8], the authors consider moments matrices with multivariate polynomials in (z, z). Seeing the torus T as the complex unit circle makes it possible to reformulate our problem to their setting, except that their moment matrices are indexed by monomials with total degree less than ℓ, = j 1,1 + j 2,1 + . . . + j 1,d + j 2,d ℓ, (26) whereas the indices in our framework are naturally selected by their maximum degree, On the contrary, [36] handles moment matrices indexed with more general sets of indices, but their analysis is given in the real case.
In the following, in order to derive Theorem 1, we combine the ideas of [36] and [9]. We consider a positive Borel measure ν defined on the torus, and we see it alternatively as a measure in the complex plane or in the real plane, going through the following different moment matrices.
Trigonometric moment matrices The moment matrices considered in this paper are mainly trigonometric moment matrices. Given a positive measure on T d , its (trigonometric) moment matrix M T has entries for i, j ∈ Z d , |i| ∞ , |j| ∞ ℓ. Such matrices are generalized Toeplitz in the sense that for all admissible multi-indices, For fixed j, as the column (M T ) ·,j contains the moments of the measure e 2iπ j, t dν T (t), we say that it corresponds to the monomial Z j . The flatnesss property (Definition 5) is equivalent to the fact that for all j ∈ −ℓ, ℓ d \ −(ℓ−1), ℓ−1 d the column Z j is a linear combination of the set of columns Z j ′ ; j ′ ∈ −(ℓ − 1), ℓ − 1 d .
Moment matrices in the complex plane Given a measure ν C defined on C d , one may consider its moments against the variable Z and its conjugate Z, namely for i 1 , i 2 , j 1 , j 2 ∈ N d such that max(i 1 + i 2 ) ℓ and max(j 1 + j 2 ) ℓ. Such matrices have a structure property recalling that of Hankel matrices, Similarly as above, we note that the columns (M C ) ·,(j1,j2) corresponds to the monomial Z j1 Z j2 . We say that M C is flat if the columns corresponding to Moment matrices in the real plane Given a measure ν R 2 defined on (R 2 ) d , we consider its moments against the variables X and Y, namely ℓ and max(j 1 + j 2 ) ℓ. Such moment matrices have the generalized Hankel property, that is For fixed j 1 , j 2 ∈ N d , the column (M R 2 ) ·,(j1,j2) corresponds to the monomial X j1 Y j2 . As above, we say that M R 2 is flat if the columns corresponding to X j1 Y j2 , where max(j 1 + j 2 ) = ℓ, are linear combinations of the columns A.1 From the torus to the complex plane = ν T on T d , we may see it as a measure in C d by considering its image measure by T , ν C def.
The resulting measure has support in (S 1 ) d , where S 1 def. = {z ∈ C ; |z| = 1}. We recall that the image measure ν C = T # ν T is characterized by ν C (B) = ν T (T −1 (B)) for all Borel sets, so that for all ν C -summable function ψ, Obviously, the moment matrices M T and M C have different sizes. However, the following is a first step in relating them. Lemma 1. Let ℓ 2, and let M C be a moment matrix representing some positive Borel measure ν C on C d . Then Supp ν C ⊂ (S 1 ) d if and only if the columns corresponding to Z j1 Z j2 and Z j ′ 1 Z j ′ 2 are equal for all multi-indices such that and the two columns are equal. Conversely, if the above-mentioned columns are equal, let k ∈ 1, d , and j ∈ N d such that j k ′ = 1 for k ′ = k, 0 otherwise. Then, using the equality between the columns 1 and Z j Z j and the corresponding relations between their entries. Since the integrand is nonnegative, we deduce that ν charges only points where z j z j = 1, i.e. z k z k = 1. As a result, Using Lemma 1, we see that for all j ∈ Z d such that |j| ∞ ℓ, all the columns (M C ) ·,(j1,j2) such that j 2 − j 1 = j (and max(j 1 + j 2 ) ℓ are equal. In fact, from (35) we see that those columns are obtained by "repeating" the column j of M T at all indices such that j 2 − j 1 = j (and similarly for the rows).
More precisely, given a maximal degree ℓ, let us denote the set of Laurent polynomials by the set of polynomials by and let J denote the matrix of the operator where Then, from (35), the following relation holds, We immediately deduce: A.2 From the complex plane to the real plane Now, given a positive Borel measure ν C on C d , we see it as a measure on (R 2 ) d by considering its image measure by S, ν R 2 def.
(42) We consider the following subspace of polynomials in the variables X 1 , Y 1 , . . . , X d , Y d , Obviously, C ℓ [Z, Z] and C ℓ [X, Y] are isomorphic as vector spaces. We are interested in the relations between the moment matrices M R 2 and M C when changing variables with S, that is = X k − iY k , or conversely, (44) ∀k ∈ {1, . . . , d}, X k def.
We note that for all k ∈ {1, . . . , d}, given some indices i k , j k , where c r1,r2 ∈ C and the sum is over all the indices r 1 , r 2 ∈ N such that r 1 +r 2 = j 1,k + j 2,k . As a result, given j 1 , j 2 ∈ N d , where the sum is over all the multi-indices r 1 , r 2 ∈ N d such that, for all k, r 1,k + r 2,k = j 1,k + j 2,k . As a result, the change of variable (44) induces a linear map L : C ℓ [X, Y] → C ℓ [Z, Z] which admits a block decomposition, mapping surjectively (hence bijectively), for each t ∈ N d with max(t) ℓ, the space Confronting (30) and (32), in view of the change of variable formula ψ(x, y)dν R 2 (x, y) = C d ψ(S(z))dν C (z), we deduce that where L is invertible.
Lemma 3. The relation ν R 2 = S # ν C defines a one-to-one correspondence between (positive Borel) measures ν C on C d which represent M C ad measures ν R 2 on (R d ) 2 which represent M R 2 . That correspondence preserves the cardinality of the support and S(Supp ν C ) = Supp ν R 2 . Conversely, let M C and M R 2 be matrices (indexed by C ℓ [Z, Z] and C ℓ [X, Y] respectively) such that (48)

A.3 Existence of a representing measure
The rest of the proof of Theorem 1 relies on the results of [36]. The matrix R is a positive semi-definite matrix indexed by the set of monomials Z j ; j ∈ Z d , |j| ∞ ℓ which satisfies (29).
We note that C ℓ−1 is closed under taking divisors, hence connected to 1 (see [36,Sec. 1.3] for the definition). Moreover, its closure,  Proof. Let us write A first step is to show that B = AQ for some matrix Q which only depends on N | C + ℓ−1 , the restriction of N to C + ℓ−1 × C + ℓ−1 . Then, since N 0 is flat, we deduce by [8,Prop. 2.2] that C is uniquely determined from A and B, as C = Q * AQ.
In a second step, we consider the moment matrix M R 2 of ν R 2 on C ℓ . As it is flat, positive semi-definite and generalized Hankel, its block satisfy a similar property as those of N , involving some matrixQ which depends on M R 2 | C + ℓ−1 .
The key point is that since M R 2 and N coincide in C + ℓ−1 × C + ℓ−1 , the matrices Q andQ are equal, hence the whole matrices N and M R 2 are equal.
Let j 1 , j 2 ∈ N d with max(j 1 + j 2 ) ℓ − 1. Since N | C + ℓ−1 is a flat extension of N | C ℓ−1 , any column of N | C + ℓ−1 of the formX j+e (k 1 ) is a linear combination of the columns in C ℓ−1 . Let q j,e (k 1 ) (resp. q j,e (k 1 ) (X, Y)) denote the coefficients of that combination (resp. the corresponding polynomial 2 ). In other words, the polynomial f (1) def.
Hence, proceeding as in [35,Lem. 5.7], we get (N f (n) ) i = (N f (1) ) i+e (k 2 ) +...+e (kn ) = 0, since f (1) ∈ ker N . From (53) and the definition of f (n) , we deduce that the column (X) j+e (k 1 ) +...+e (kn ) of B is a linear combination of columns of the form (X) j ′ +e (k ′ 1 ) +...+e (k ′ n−1 ) with max(j ′ ) ℓ−1. By an easy induction we obtain that the column (X) j+e (k 1 ) +...+e (kn ) of N is thus a linear combination of columns of the form (X) j ′ +e (k ′ 1 ) . By flatness of C + ℓ−1 , it is thus a combination of columns corresponding to C ℓ−1 . As a result, there exists a matrix Q such that B = AQ, and Q is uniquely determined from th e polynomials q j,e (k 1 ) for j ∈ C ℓ−1 , 1 k 1 d. Since the same polynomials q j,e (k 1 ) can be used for M R 2 , we obtain thatQ = Q hence M R 2 = N . Now, we may go back to the torus. We observe that for 1 k d, the polynomial X 2 k + Y 2 k − 1 is in the kernel of N = M R 2 , hence ν R 2 is supported in d k=1 (x, y) ∈ (R d ) 2 ; x 2 k + y 2 k = 1 . Applying Lemma 3 and 2 above, we obtain the existence of a measure ν T such that R is the moment matrix of ν T on T d , which is the claimed result.