Finite-Memory Strategies in POMDPs with Long-Run Average Objectives

Partially observable Markov decision processes (POMDPs) are standard models for dynamic systems with probabilistic and nondeterministic behaviour in uncertain environments. We prove that in POMDPs with long-run average objective, the decision maker has approximately optimal strategies with finite memory. This implies, notably, that approximating the long-run value is a recursively enumerable problem, as well as a weak continuity property of the value with respect to the transition function.


Introduction
In a Partially Observable Markov Decision Process (POMDP), at each stage, the decision-maker chooses an action that determines, together with the current state, a stage reward and the distribution over the next state. The state dynamic is imperfectly observed by the decision-maker, who receives a stage signal on the current state before playing. Thus, POMDPs generalize the Markov Decision Process (MDP) model of Bellman [3].
POMDPs are widely used in prominent applications such as computational biology [10], software verification [8], and reinforcement learning [16], to name a few. Even special cases of POMDPs, namely probabilistic automata or blind MDPs, where there is only one signal, are standard models in several applications [20,19,7].
In many of these applications, the duration of the problem is huge. Thus, considerable attention has been devoted to the study of POMDPs with long duration. A standard approach is to consider the long-run objective criterion, where the total reward is the expectation of the inferior limit of the average reward (see [1] for a survey). The value for this problem is known to coincide with several classical definitions of long-run values (asymptotic value, uniform value, general uniform value, long-run average value, uncertain-duration process value [23,21,22,18,27]) and has been characterized in [22]. In this paper, we will simply call this common object the value. Thus, strong results are available concerning the existence and characterization of the value. This is in sharp contrast with the study of long-run optimal strategies. Indeed, before our work, little was known about the sophistication of strategies that approximate the value. It has been shown that: (i) stationary strategies approximate the value in MDPs [4]; and (ii) belief-stationary strategies approximate the value in blind MDPs [23] and POMDPs with an ergodic structure [6].
Our main contributions are: • Strategy complexity. We show that for every POMDP with long-run average objective and every ε > 0, there is a finite-memory strategy (i.e. one generated by a finite state automaton) that achieves an expected reward within ε of the optimal value. In the case of blind MDPs, finite memory is equivalent to finite recall (i.e. decisions depend only on a bounded number of the most recent actions), but finite recall cannot achieve ε-approximations in general POMDPs.
• Computational complexity. An important consequence of the above result is that the decision version of the approximation problem for POMDPs with long-run average objectives (see Definition 3.1) is recursively enumerable (r.e.) but not decidable. Our results on strategy complexity imply the recursively enumerable upper bound, and the lower bound (undecidability) is a consequence of [17].
• Value property. The long-run reward of a finite-memory strategy is robust to small perturbations of the transition function, where the notion of perturbation of the transition function is defined as in Solan [25] and Solan and Vieille [26]. This implies lower semicontinuity of the value function under such small perturbations. This result is tight in the sense that there is an example with a discontinuous value function (see Example 4.4).
A natural question would be to ask for an upper bound on the size of the memory needed to generate ε-optimal strategies, in terms of the data of the POMDP. In fact, a previous undecidability result [17] shows that such an upper bound can not exist (see Subsection 3.1). Thus, the existence of ε-optimal strategies with finite memory is, in some sense, the best possible result one can have in terms of strategy complexity.
2 Model and statement of results

Model
Throughout the paper we mostly use the following notation: (i) sets are denoted by calligraphic letters, e.g. A, H, K, S; (ii) elements of these sets are denoted by lowercase letters, e.g. a, h, k, s; and (iii) random elements with values in these sets are denoted by uppercase letters, e.g. A, H, K, S. For a set C, denote by ∆(C) the set of probability distributions over C, and by δ_c the Dirac measure at an element c ∈ C. We will slightly abuse notation by not making a distinction between a probability measure (which can be evaluated on events) and its corresponding probability density (which can be evaluated on elements). Consider a POMDP Γ = (K, A, S, q, g), with finite state space K, finite action set A, finite signal set S, transition function q : K × A → ∆(K × S) and reward function g : K × A → [0, 1].
Given p 1 ∈ ∆(K), called initial belief, the POMDP starting from p 1 is denoted by Γ(p 1 ) and proceeds as follows: • An initial state K 1 is drawn from p 1 . The decision-maker knows p 1 but does not know K 1 .
• At each stage m ≥ 1, the decision-maker takes some action A m ∈ A. This action determines a stage reward G m := g(K m , A m ), where K m is the (random) state at stage m. Then, the pair (K m+1 , S m ) is drawn from q(K m , A m ). The next state is K m+1 and the decision-maker is informed of the signal S m , but neither of the reward G m nor of the state K m+1 .
At stage m, the decision-maker remembers all the past actions and signals, which constitute the history before stage m. Let H_m := (A × S)^{m−1} be the set of histories before stage m, with the convention (A × S)^0 := {∅}. A strategy is a mapping σ : ∪_{m≥1} H_m → A. The set of strategies is denoted by Σ. Because of the randomness introduced by the transition function q, a history h_m ∈ H_m can occur under many sequences of states (k_1, k_2, . . . , k_{m−1}). The infinite sequence (k_1, a_1, s_1, k_2, a_2, s_2, . . .) is called a play, and the set of all plays is denoted by Ω. For p_1 ∈ ∆(K) and σ ∈ Σ, define P^{p_1}_σ the law induced by σ and the initial belief p_1 on the set of plays of the game Ω = (K × A × S)^N, and E^{p_1}_σ the expectation with respect to this law. For simplicity, identify K with the set of extremal points of ∆(K). Let
$$\gamma^{p_1}_\infty(\sigma) := \mathbb{E}^{p_1}_\sigma\Big[\liminf_{n\to\infty} \frac{1}{n}\sum_{m=1}^{n} G_m\Big] \qquad \text{and} \qquad v_\infty(p_1) := \sup_{\sigma \in \Sigma}\, \gamma^{p_1}_\infty(\sigma).$$
The term γ^{p_1}_∞(σ) is the long-term reward given by strategy σ, and v_∞(p_1) is the optimal long-term reward, called the value, defined as the supremum of the long-term reward over all strategies.
Remark 2.1. It has been shown that v_∞ coincides with the limit of the values of the n-stage problems and of the λ-discounted problems, as well as with the uniform value and the weighted uniform value (see [23,21,22,27]). In particular, v_∞(p_1) = lim_{n→∞} v_n(p_1) = lim_{λ→0} v_λ(p_1) for every initial belief p_1.

Remark 2.2. In the literature, the concept of strategy that we defined is often called a pure strategy, by contrast with behavior strategies, which use randomness by allowing strategies of the form σ : ∪_{m≥1} H_m → ∆(A). By Kuhn's theorem, enlarging the set of pure strategies to behavior strategies does not change v_∞ (see [27,11]), and thus does not change our results.
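As a concrete illustration of the dynamics just described, here is a minimal Python sketch (ours, not from the paper; all function and variable names are assumptions) that simulates one play of Γ(p_1) under a strategy and approximates the long-term reward γ^{p_1}_∞(σ) by a long finite-horizon average.

```python
import random

def simulate(p1, q, g, strategy, horizon=10_000, seed=0):
    """Run one play of Gamma(p1) under `strategy` and return the average reward.

    `strategy` is a function from the history (a tuple of (action, signal) pairs,
    i.e. an element of H_m) to an action, matching the definition of a strategy above.
    q[(k, a)] is a distribution over (next_state, signal) pairs; g[(k, a)] is a reward in [0, 1].
    """
    rng = random.Random(seed)
    states, weights = zip(*p1.items())
    k = rng.choices(states, weights=weights)[0]      # draw K_1 from the initial belief
    history, total = (), 0.0
    for _ in range(horizon):
        a = strategy(history)                        # A_m depends only on past actions and signals
        total += g[(k, a)]                           # stage reward G_m = g(K_m, A_m)
        pairs, probs = zip(*q[(k, a)].items())
        k, s = rng.choices(pairs, weights=probs)[0]  # (K_{m+1}, S_m) drawn from q(K_m, A_m)
        history = history + ((a, s),)                # the decision-maker observes only (A_m, S_m)
    return total / horizon

# Toy blind MDP with one signal '*' where action 'a' is optimal.
q = {('k', 'a'): {('k', '*'): 1.0}, ('k', 'b'): {('k', '*'): 1.0}}
g = {('k', 'a'): 1.0, ('k', 'b'): 0.0}
print(simulate({'k': 1.0}, q, g, strategy=lambda h: 'a'))   # ~1.0
```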

Definition 2.3 (Blind MDP). A POMDP is called blind MDP if the signal set is a singleton.
Note that in a blind MDP, signals do not convey any relevant information. Therefore, a strategy is simply an infinite sequence of actions (a 1 , a 2 , . . . ) ∈ A N .

Contribution
We start by defining several classes of strategies. Recall that Γ(p_1) is the POMDP Γ starting from p_1, which is known to the player. We consider in particular finite-memory strategies (generated by a finite state automaton), finite-recall strategies (whose decisions depend only on a bounded number of the most recent actions and signals), eventually periodic strategies and belief-stationary strategies. In the more general framework of [24], none of these strategies is enough to approximate the value, and a long-standing open problem is whether finite-memory strategies with a clock are good enough (see [14,13] for more details on this topic).
Our main result is the following theorem.

Theorem 2.9. For every POMDP Γ, every initial belief p_1 ∈ ∆(K) and every ε > 0, there exists an ε-optimal finite-memory strategy in Γ(p_1).
Remark 2.10. A previous complexity result [17] shows that the size of the memory can not be bounded from above in terms of the data of the POMDP (see Subsection 3.1).
Corollary 2.11. For every blind MDP Γ, initial belief p 1 and ε > 0, there exists an ε-optimal finite-memory strategy in Γ(p 1 ), and thus the strategy is eventually periodic and has finite recall.
Lastly, finite recall is not enough to ensure ε-optimality in general POMDPs.

Proposition 2.12. There exist a POMDP Γ, an initial belief p_1 and ε > 0 such that no finite-recall strategy is ε-optimal in Γ(p_1).
The rest of the paper is organized as follows. Section 3 explains the consequences of our result in terms of complexity and model robustness. Section 4 introduces examples used to prove negative results and to illustrate our techniques. Section 5 introduces two key lemmata and shows that they imply Theorem 2.9. Section 6 proves one of the two lemmata and develops what we call super-support based strategies in detail. Missing proofs are in the appendices.

Complexity
Decidability. A decision problem consists in deciding between two options (accepting or rejecting) given an input, and its complexity is characterized by Turing machines. A Turing machine takes an input and, if it halts, it either accepts or rejects it. If it halts on every possible input in a finite number of steps, then the Turing machine is called an algorithm. An algorithm solves a decision problem if it takes the correct decision for all inputs. The class of decision problems that are solvable by an algorithm is called decidable. Two natural generalizations of decidable problems are: recursively enumerable (r.e.) and co-recursively enumerable (co-r.e.). The decision problems in r.e. (resp., co-r.e.) are those for which there is a Turing machine that accepts (resp., rejects) every input that should be accepted (resp., rejected) according to the problem, but, on other inputs, it need not halt.
Notice that the class of decidable problems is the intersection of r.e. and co-r.e. In this work, the algorithmic problem of interest is the following.

Definition 3.1 (Decision version of the approximation problem). Given a POMDP Γ, an initial belief p_1, a rational number x and a rational ε > 0 such that either v_∞(p_1) > x + ε or v_∞(p_1) < x − ε, the problem consists in deciding which one is the case: to accept means to prove that v_∞(p_1) > x + ε holds, while to reject means to prove the opposite.
Previous results and implication of our result. It is known that the decision version of the approximation problem is not decidable [17] (even for blind MDPs). However, the complexity characterization has been open. Thanks to Theorem 2.9, we can design a Turing machine that accepts every input that should be accepted for this problem.
Consider playing a finite-memory strategy σ. Then, the dynamics of the game can be described by a finite Markov chain. Therefore, the reward obtained by playing σ (i.e. γ^{p_1}_∞(σ)) can be deduced from its stationary measure, which can be computed in polynomial time by solving a linear programming problem [12, Section 2.9, page 70]. Our protocol enumerates the finite-memory strategies and checks the reward of each of them in order to approximate the value of the game v_∞(p_1). By Theorem 2.9, if v_∞(p_1) > x + ε holds, a finite-memory strategy that achieves a reward strictly greater than x + ε will eventually be found and our protocol will accept the input. On the other hand, if v_∞(p_1) < x − ε, the protocol will never find out that this is the case because there are infinitely many finite-memory strategies, so it will not halt. Thus, our result establishes that the approximation version of the problem is in r.e., and the previously known results imply that the problem is not decidable. Formally, we have the following result.
Corollary 3.2. The decision version of approximating the value is r.e. but not decidable.

Remark 3.3. The preceding argument shows that no upper bound on the size of the memory used by ε-optimal strategies can be proved. Indeed, if such a bound existed, one could modify the previous procedure in the following way: reject the input once every finite-memory strategy of size below the bound has been enumerated without finding one whose reward exceeds x + ε. This would imply that the decision version of approximating the value is decidable, which is a contradiction.
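The enumeration-and-evaluation protocol described above can be sketched as follows. This is an illustrative Python sketch, not the paper's construction: instead of computing the stationary measure exactly by linear programming, it approximates the long-run reward of each finite-memory strategy by iterating the distribution of the product chain on (state, memory) pairs, and all names are ours.

```python
from itertools import count, product

def average_reward(p1, q, g, act, upd, horizon=5_000):
    """Approximate expected average reward of the finite-memory strategy (act, upd).

    act[m] is the action played in memory state m; upd[(m, s)] is the next memory
    state after observing signal s.  `dist` is the distribution over pairs
    (POMDP state, memory state), memory state 0 being the initial one.
    """
    dist = {(k, 0): w for k, w in p1.items()}
    total = 0.0
    for _ in range(horizon):
        total += sum(w * g[(k, act[m])] for (k, m), w in dist.items())
        new = {}
        for (k, m), w in dist.items():
            a = act[m]
            for (k2, s), p in q[(k, a)].items():
                key = (k2, upd[(m, s)])
                new[key] = new.get(key, 0.0) + w * p
        dist = new
    return total / horizon

def accepts(p1, q, g, actions, signals, x, eps):
    """Halts and returns True if some finite-memory strategy earns more than x + eps.
    If v_inf(p1) < x - eps, no such strategy exists and the loop never halts."""
    for n in count(1):                                     # enumerate memory sizes 1, 2, ...
        for act in product(actions, repeat=n):             # one action per memory state
            for upd_vals in product(range(n), repeat=n * len(signals)):
                upd = {(m, s): upd_vals[m * len(signals) + i]
                       for m in range(n) for i, s in enumerate(signals)}
                if average_reward(p1, q, g, act, upd) > x + eps:
                    return True
```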

Objective comparison
In this section, we contrast our results with other natural objectives.
Recall that the value of Γ(p_1) is defined as
$$v_\infty(p_1) = \sup_{\sigma \in \Sigma}\, \mathbb{E}^{p_1}_\sigma\Big[\liminf_{n\to\infty} \frac{1}{n}\sum_{m=1}^{n} G_m\Big].$$
We say this is a liminf-average objective. Consider replacing $\liminf_{n\to\infty} \frac{1}{n}\sum_{m=1}^{n} G_m$ by: (i) $\limsup_{n\to\infty} \frac{1}{n}\sum_{m=1}^{n} G_m$, which we call the limsup-average objective; (ii) $\limsup_{n\to\infty} G_n$, which we call the limsup objective.

Proposition 3.4. For both the limsup-average and the limsup objective, there exist a POMDP and ε > 0 with no ε-optimal finite-memory strategy.
This negative result, proved in Section 4.1.2, does not imply any computational complexity characterization for the limsup-average or limsup objective, and whether the approximation of the value problem for limsup-average objectives is recursively enumerable remains open. However, it shows that any approach based on finite-memory strategies cannot establish recursively enumerable bounds for the approximation problem.
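To see how the three criteria can disagree on the same play, here is a small numerical illustration (ours, not from the paper): rewards arrive in alternating blocks of 1s and 0s whose lengths double, so the running average keeps oscillating and, on the infinite continuation of this pattern, the liminf-average and limsup-average differ while the limsup of the rewards is 1.

```python
def running_averages(n_blocks=12):
    """Rewards in alternating 1-blocks and 0-blocks of doubling lengths."""
    rewards, block_len, value = [], 1, 1
    for _ in range(n_blocks):
        rewards += [value] * block_len
        block_len *= 2
        value = 1 - value
    avgs, total = [], 0.0
    for i, r in enumerate(rewards, start=1):
        total += r
        avgs.append(total / i)
    return avgs

avgs = running_averages()
# The tail keeps oscillating, so liminf-average < limsup-average on this play.
print(min(avgs[1000:]), max(avgs[1000:]))
```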
Let us focus on the limsup objective. The limsup objective is arguably simpler than the liminf-average objective and, to formalize this statement, we can compare the complexity of the objectives themselves irrespective of any particular context or model (such as POMDPs). The Borel hierarchy describes the complexity of an objective by the number of quantifier alternations needed to describe it. Its construction is similar to that of the Borel σ-algebra, or σ-field, and is defined as follows.
For example, the limsup objective can be described as a countable intersection of countable unions of events: given a family of sets (C_n)_{n≥1}, $\limsup_{n\to\infty} C_n = \bigcap_{n\geq 1} \bigcup_{m\geq n} C_m$. The formal result is the following (see [9]). While the notion of Borel hierarchy characterizes the topological complexity of objectives, a similar notion of Arithmetic hierarchy characterizes the computational complexity of decision problems.
Definition 3.7 (Arithmetic hierarchy). Denote by Σ^0_1 the class of r.e. problems and by Π^0_1 the class of co-r.e. problems. For i > 1, define Σ^0_i as the class of problems that are recursively enumerable by Turing machines with access to an oracle for a Π^0_{i−1} problem.

By Corollary 3.2, for the liminf-average objective the decision version of approximating the value is in Σ^0_1 but not decidable. On the other hand, it was shown in [2,5] that, for POMDPs with the limsup objective and Boolean rewards, the corresponding decision problem is Σ^0_2-complete. We conclude this section with a summary chart contrasting the liminf-average and limsup objectives. The surprising result is the complexity switch: the limsup objective has lower Borel hierarchy complexity but higher Arithmetic hierarchy complexity in the context of POMDPs.

Objective comparison in POMDPs

Objective          Borel hierarchy                            Arithmetic hierarchy
limsup             level 2 (countable ∩ of countable ∪)       Σ^0_2-complete
liminf-average     above level 2                              Σ^0_1 (r.e., not decidable)

Robust ε-optimal strategies
Consider a POMDP Γ = (K, A, S, q, g). It is well known that the value function is continuous with respect to perturbations of the reward function g and the initial belief p 1 . Now, we show a robustness result concerning the transition function q.
In applications, just as in any stochastic model, the structure of the model is decided first, and then the specific probabilities are either estimated or fixed. The values of the transition probabilities are approximations: an ε-perturbation of these probabilities is not expected to have an impact on the modelling. In our setting, the transitions are encoded in the function q : K × A → ∆(K × S) and we would expect some robustness against perturbations of the values it takes.
The notion of perturbation of q is measured as in Solan [25] and Solan and Vieille [26], where perturbations of each transition probability are measured as relative differences, not additive differences. Formally, this defines a semimetric d on transition functions (see [25,26] for the precise definition). Under this notion, when q and q′ are close to each other, we can prove the existence of strategies that are approximately optimal for the POMDP corresponding to q and perform almost as well when they are applied to the POMDP corresponding to q′. To formally state this notion of robustness, let us give the following definition.
Definition 3.8 (Robust strategies). Given a POMDP Γ with transition function q and an initial belief p_1 ∈ ∆(K), we say that σ is a robust strategy for Γ(p_1) if, for every ε > 0, there exists δ > 0 such that every transition function q′ with d(q, q′) ≤ δ satisfies γ′^{p_1}_∞(σ) ≥ γ^{p_1}_∞(σ) − ε, where γ_∞ is the long-term reward in Γ and γ′_∞ is the long-term reward in Γ′ = (K, A, S, q′, g).
Lemma 3.9. Any finite-memory strategy is robust. Thus, in any POMDP and for any ε > 0, there exists a robust ε-optimal finite-memory strategy.
Corollary 3.10. Let K, A, S be finite sets, g : K × A → R a reward function, and p_1 ∈ ∆(K) an initial belief. The mapping from (∆(K × S)^{K×A}, d) to R that maps each transition function q to the value at p_1 of the POMDP (K, A, S, q, g) is lower semi-continuous.
Lower semi-continuity of the value function is the best result one can achieve in the following sense.
Proposition 3.11. There is a POMDP such that the mapping from (∆(K × S)^{K×A}, d) to R that maps each transition function q to the value at p_1 of the POMDP (K, A, S, q, g) is discontinuous.

Examples
In this section, we introduce examples to prove negative results (Propositions 2.12, 3.4 and 3.11) and to illustrate our techniques later on.

Negative results
Let us prove Propositions 2.12 and 3.4 by presenting an example for each statement.

Proof of Proposition 2.12
We will prove that there exists a POMDP and ε > 0 with no ε-optimal finite-recall strategy by an explicit construction. Recall that a strategy has finite recall if it uses only a finite number of the last stages in the current history to decide the next action (see Definition 2.6). Therefore, our construction should have the property that, for any finite-recall strategy, there is a pair of finite histories such that: 1. The last stages are identical, i.e., the player did the same actions and received the same signals in the last part of both histories (but the starting point was different).
2. Taking the same decision in both histories leads to losing some reward that can not be compensated in the long-run.
3. The previous loss does not decrease to zero by increasing the amount of memory.
Example 4.1. Consider the POMDP Γ = (K, A, S, q, g) with five states: k_0, k_1, . . . , k_4. The initial state is k_0 and the decision-maker knows it (formally, the initial belief is δ_{k_0}). The state k_4 is an absorbing state from which it is impossible to get out and where rewards are zero. The states k_1 and k_2 form a sub-game where the optimal strategy is trivial, and the same holds for the state k_3. From k_0, a random initial signal is given, indicating which sub-game the state moved to. The key idea is that there is an arbitrarily long sequence of actions and signals which can occur in both sub-games, but the optimal strategy behaves differently in each of them. Therefore, forgetting the initial signal of the POMDP leads to at most half of the optimal value. Figure 2 is a representation of Γ: first under action a and then under action b. Each state is followed by the corresponding reward, and the arrows include the probability of the corresponding transition along with the signal obtained. The sub-game of k_1 and k_2 has a unique optimal strategy: play action a until receiving signal s_2, then play action b once and repeat. The value of this sub-game is 1 and deviating from the prescribed strategy would lead to a long-run reward of 0. Similarly, the sub-game of k_3 has a unique optimal strategy: always play action a. Again, the value of this sub-game is 1 and playing any other strategy leads to a long-run reward of 0.
By the previous discussion, the value of this game starting from k_0 is 1. On the other hand, the maximum value obtained by strategies with finite recall is only 1/2, achieved for example by always playing action a. Finite-recall strategies achieve at most 1/2 because, no matter how long the recall is, during the play the decision-maker faces a history of having always played action a and always received signal s_1, except for the last signal, which is s_2. Then, if action b is played, the second sub-game is lost; if action a is played, the first sub-game is lost. That is why, for any 0 < ε < 1/2, there is no ε-optimal finite-recall strategy for this POMDP.

Proof of Proposition 3.4
We will show that for the limsup-average and limsup objectives there is a blind MDP with no ε-optimal finite-memory strategy. For both cases, the example is constructed with the following idea in mind. To achieve the optimal value, the decision-maker needs to play an action a_1 for some period, then play another action a_2, and repeat the process. The key is to require that the length of the period gets longer as the game progresses. This kind of behaviour cannot be implemented by a finite-memory strategy.
For the limsup-average objective, the blind MDP example is due to Venel and Ziliotto [28] and is presented below.
Example 4.2. Consider two states k_0 and k_1; the player receives a reward only when the state is k_1. By playing action change, the decision-maker moves between the two states; by playing action wait, the state does not change. Figure 3 is a representation of the game. Consider the initial belief p_1 = 1/2 · δ_{k_0} + 1/2 · δ_{k_1}, the uniform distribution. It is easy to see that finite-memory strategies (or equivalently finite-recall strategies) cannot achieve more than 1/2. On the other hand, the value of this game with the limsup-average objective is 1, and is guaranteed by the following strategy: alternate between long blocks of wait and a single change, with block lengths growing fast enough that each block dwarfs the whole past history. Whatever the initial state, the running average reward at the end of each rewarding block then tends to 1, so the limsup-average reward equals 1.

Example 4.3. Consider four states (k_0, k_1, k_2 and k_3) and two actions: wait (w) and change (c). The initial state is k_1 and the decision-maker knows it. In k_1, if w is played, then the state moves to k_2 with probability 1/2 and stays with probability 1/2; if c is played, then the absorbing state k_0 is reached. From state k_2, if we play w, we stay in the same state; if we play c, we move to state k_3. From k_3, the only state with a positive reward, playing any action returns the state to k_1. Figure 4 is a representation of the game. In this blind MDP (see [5,2]), for the limsup objective and any ε > 0, there is an infinite-memory strategy that guarantees 1 − ε, so the value of the game is 1. On the other hand, applying any finite-memory strategy (or equivalently any finite-recall strategy) yields a limsup reward of 0. Hence, finite-memory strategies do not guarantee any approximation for POMDPs with limsup objective.
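A minimal sketch of Example 4.2 under the block strategy just described, assuming only the transitions given in the text (wait keeps the state, change swaps it, and the reward is 1 exactly when the state is k_1); the variable names are ours.

```python
def run(initial_state, n_blocks=8):
    """Alternate ever-longer `wait` blocks with a single `change`; return the best running average."""
    state, rewards = initial_state, []
    for i in range(n_blocks):
        block_len = (i + 1) * len(rewards) + 1               # each block dwarfs the whole past
        rewards += [1 if state == 'k1' else 0] * block_len   # `wait`: state unchanged
        rewards.append(1 if state == 'k1' else 0)            # reward of the `change` stage
        state = 'k1' if state == 'k0' else 'k0'              # `change` swaps the state
    best, total = 0.0, 0.0
    for m, r in enumerate(rewards, start=1):
        total += r
        best = max(best, total / m)
    return best

# Whatever the initial state, the running average at the end of the rewarding
# blocks approaches 1 as the blocks grow (already ~0.9 after 8 blocks), so the
# limsup-average reward of this strategy is 1; scheduling these ever-longer
# blocks is exactly what a finite-memory strategy cannot do.
print(run('k0'), run('k1'))
```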

Proof of Proposition 3.11
We will show that there is a POMDP whose value is discontinuous with respect to the transition function. The idea is to have two possible scenarios where the signals are slightly different. By analyzing a long sequence of signals, the player is able to identify which scenario the current state belongs to and thus play a better strategy. The following example considers a transition function parameterized by ε ≥ 0.
Example 4.4. Consider two states (k_u, k_d) and three actions: up (a_u), down (a_d) and wait (a_w). Signals are relevant only for action a_w: under a non-symmetric transition function, they inform about the underlying state. More concretely, there are two signals, s_u and s_d. Playing action a_u or a_d gives signal s_u or s_d respectively, adding no information. Playing action a_w leads to signals s_u and s_d with slightly different probabilities depending on whether the state is k_u or k_d. In terms of actions and rewards, a_u leads to a positive reward only if the state is k_u; similarly, a_d leads to a positive reward only if the state is k_d. Finally, a_w leads to a null reward in both states. Figure 5 is a representation of this POMDP where the transitions are specified. Consider an initial belief p_1 = 1/2 · δ_{k_u} + 1/2 · δ_{k_d}. If ε = 0, the value is 1/2, achieved for example by the constant strategy σ ≡ a_u. Note that playing action a_w provides no information since the random signal the player receives is independent of the underlying state. In contrast, if ε > 0, playing action a_w reveals information about the underlying state. If action a_w is played sufficiently many times, the player can estimate the state by comparing the number of signals s_d against s_u: if s_d appears more often than s_u, then it is more probable that the underlying state is k_d. Therefore, by playing a_w, the player can estimate the state with a probability of error that vanishes as the number of plays grows. That is why the value for ε > 0 is 1. This proves that the value of this POMDP is discontinuous with respect to the transition function, since it jumps from 1/2 at ε = 0 to 1 for every ε > 0.
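A sketch of the state-estimation step of Example 4.4. The exact signal probabilities are specified in Figure 5 and not reproduced here; for illustration we assume that playing a_w yields signal s_u with probability 1/2 + ε in state k_u and 1/2 − ε in state k_d.

```python
import random

def estimate_state(true_state, eps, n_plays, rng):
    """Play a_w n_plays times and guess the state by comparing signal counts."""
    p_su = 0.5 + eps if true_state == 'k_u' else 0.5 - eps
    count_su = sum(rng.random() < p_su for _ in range(n_plays))
    return 'k_u' if 2 * count_su > n_plays else 'k_d'        # majority vote

rng = random.Random(0)
for n in (10, 100, 1000, 10_000):
    correct = sum(estimate_state('k_d', eps=0.05, n_plays=n, rng=rng) == 'k_d'
                  for _ in range(200))
    # Accuracy tends to 1 for any eps > 0; for eps = 0 the signals carry no information.
    print(n, correct / 200)
```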

Illustrative examples
We show an example of POMDP that will be analyzed in Section 6.2 in light of our technique. This example comes in two variants differing in sophistication.

Simple version
Let us explain the easiest version.
Example 4.5. Consider two states (k u , k d ) and two actions: up (a u ) and down (a d ). All transitions are possible (including loops) and they do not depend on the action. Signals inform the player when the state changes. In terms of actions and rewards, by playing a u the player obtains a reward of 1 only if the current state is k u . Similarly, by playing a d the player obtains a reward of 1 only if the state is k d . Figure 6 is a representation of the game with specific transition probabilities.
Consider an initial belief p 1 = 1/4 · δ ku + 3/4 · δ k d . During a play, the decision-maker can have two beliefs, 1/4 · δ ku + 3/4 · δ k d or 3/4 · δ ku + 1/4 · δ k d , because the signals notify when there has been a change. The value of this game is 3/4. An optimal strategy is to play action a d until getting a signal s c , then playing action a u until getting a signal s c , and repeat.
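This optimal strategy can be implemented by an automaton with two memory states; here is a minimal sketch (class and variable names are ours).

```python
class SwitchOnChange:
    """Two-memory-state automaton for Example 4.5: switch actions on signal s_c."""
    def __init__(self, first_action='a_d'):
        self.action = first_action          # the initial belief favours k_d, so start with a_d

    def act(self, last_signal):
        if last_signal == 's_c':            # the signal notifies that the state has changed
            self.action = 'a_u' if self.action == 'a_d' else 'a_d'
        return self.action

ctrl = SwitchOnChange()
print([ctrl.act(s) for s in (None, None, 's_c', None, 's_c')])
# ['a_d', 'a_d', 'a_u', 'a_u', 'a_d']
```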

Involved version
Let us go to the more complex version.

Example 4.6. Consider six states and four actions: up (a_u), down (a_d), left (a_l) and right (a_r). Now the transition between the two extremes includes more states, instead of being a direct jump. States can be separated into two groups: extremes (k_u and k_d) and transitional (k_{l1}, k_{r1}, k_{l2} and k_{r2}). Furthermore, transitional states can be divided into two groups: left states (k_{l1} and k_{l2}) and right states (k_{r1} and k_{r2}). Transitions go from extreme states to transitional states and from transitional states to extremes. More precisely, excluding loops, only the following transitions are possible: from k_u to either k_{l1} or k_{r1}, then from these two to k_d, from k_d to either k_{l2} or k_{r2}, and then back to k_u. Signals are such that the player knows: (i) that the state changed to an extreme state, or (ii) that the state changed to a transitional state and whether the new state is with higher probability a left state or a right state. In terms of actions and rewards, each action has an associated set of states in which the reward is 1 and elsewhere it is 0: by playing a_u the reward is 1 only if the current state is k_u, playing a_d rewards only state k_d, a_l rewards states k_{l1} and k_{l2}, and a_r rewards states k_{r1} and k_{r2}. Figure 7 is a representation of this game with specific transition probabilities.
Consider an initial belief p 1 = 1/4 · δ ku + 3/4 · δ k d . The value of the game is 21/32. An optimal strategy is given by playing action a d until getting a signal s l or s r . If the decision-maker got signal s l , then play action a l , otherwise, play action a r . Repeat action a l or a r until getting the signal s c . Then, play a u until getting a signal s l or s r . When this happens, play a l or a r accordingly until getting signal s c . And so, repeat the cycle.
The belief dynamics under this optimal strategy are the following. The initial belief is p_1, supported on the extreme states. By getting a signal s_w, the belief does not change. By getting signal s_l, the weight on k_u distributes between states k_{l1} and k_{r1} in a proportion 3 : 1 and the weight on k_d distributes between k_{l2} and k_{r2} in the same way. By getting signal s_r, the distribution is similar, but the roles of the left states and the right states are interchanged. Once the belief is on the transitional states, by playing the respective action (either a_l or a_r), the belief does not change while receiving signal s_w. Upon receiving the signal s_c, the new belief is 3/4 · δ_{k_u} + 1/4 · δ_{k_d}. By symmetry of the POMDP, the dynamics are then similar until getting signal s_c for a second time. At that time, the belief is equal to the initial distribution, namely 1/4 · δ_{k_u} + 3/4 · δ_{k_d}.
Remark 4.7. For the decision-maker to have a finite-memory strategy, some quantity with finitely many possible values must be updated over time. A tentative idea is to compute the posterior belief, but it can take infinitely many values. In this example, using a belief partition is enough to encode an optimal strategy. In general, it is an open question whether a belief partition is sufficient to achieve ε-optimal strategies.
(Figure 7: transition probabilities and signals of Example 4.6, one panel per action; panel (a) corresponds to action a_u.)

5 Key lemmata and proof of Theorem 2.9

In this section, we introduce two key lemmas and derive from them the proof of Theorem 2.9. We first define the history at stage m, which is all the information the decision-maker has at stage m.
Definition 5.1 (m-stage history). Given a strategy σ ∈ Σ and an initial belief p_1, denote the (random) history at stage m by H_m := (A_1, S_1, . . . , A_{m−1}, S_{m−1}). Recall that we denote the state at stage m by K_m, which takes values in K; the signal at stage m by S_m, which takes values in S; and the action at stage m by A_m, which takes values in A. Note that the history at stage m does not contain direct information about the states K_1, . . . , K_m.
The belief of the player at stage m plays a key role in the study of POMDPs, and we formally define it as follows.
Definition 5.2 (m-stage belief). Given a strategy σ ∈ Σ and an initial belief p_1, denote the belief at stage m by P_m, which is given by, for all k ∈ K,
$$P_m(k) := \mathbb{P}^{p_1}_\sigma(K_m = k \mid H_m).$$
For fixed σ and p_1, one can use Bayes' rule to compute P_m. To avoid heavy notation, we omit the dependence of P_m on σ and p_1. For p ∈ ∆(K), denote the support of p by supp(p), which is the set of k ∈ K such that p(k) > 0.
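For concreteness, here is a minimal sketch of the Bayes update producing P_{m+1} from P_m, the action played and the signal observed (ours, with q encoded as in the earlier sketch; we assume the observed signal has positive probability under the current belief and action).

```python
def belief_update(belief, a, s, q):
    """Return P_{m+1} given the belief P_m, the action a played and the signal s observed."""
    new = {}
    for k, pk in belief.items():
        for (k2, s2), p in q[(k, a)].items():
            if s2 == s:
                new[k2] = new.get(k2, 0.0) + pk * p    # joint weight of (K_{m+1} = k2, S_m = s)
    total = sum(new.values())                           # probability of observing s
    return {k2: w / total for k2, w in new.items()}     # condition on the observed signal

# On the toy blind MDP from the first sketch, the belief never changes:
q = {('k', 'a'): {('k', '*'): 1.0}}
print(belief_update({'k': 1.0}, 'a', '*', q))           # {'k': 1.0}
```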
The first ingredient of the proof of Theorem 2.9 is the following lemma.
Lemma 5.3. For any initial belief p_1 and ε > 0, there exist m_ε ≥ 1, σ_ε ∈ Σ and a (random) belief P* ∈ ∆(K) (which depends on the history before stage m_ε) such that:
1. With P^{p_1}_{σ_ε}-probability at least 1 − ε, the belief P_{m_ε} is ε-close to P*; moreover, E^{p_1}_{σ_ε}[v_∞(P*)] ≥ v_∞(p_1) − ε.
2. There exists σ ∈ Σ, which depends on P*, such that σ is optimal in Γ(P*) and, for all k ∈ supp(P*), the average rewards (1/n) Σ_{m=1}^{n} g(K_m, A_m) converge P^k_σ-almost surely to γ^k_∞(σ).
This result is a consequence of Venel and Ziliotto [27, Lemma 33]. This previous work states the existence of elements µ* ∈ ∆(∆(K)) and σ* ∈ ∆(Σ) with properties similar to those of P* ∈ ∆(K) and σ ∈ Σ. In this sense, the present lemma can be seen as a deterministic version of this previous result. To focus on the new tools we introduce in this paper to prove Theorem 2.9, we relegate the proof and the explanation of the differences between the two lemmata to Appendix A.
Remark 5.4. The first property of Lemma 5.3 follows immediately from [27,Lemma 33] by the type of convergence in this previous result. On the other hand, the second property requires the introduction of a certain Markov chain on K × A × ∆(K). This Markov chain is already present in the work [27] but was used for other purposes. Therefore, the proof consists mainly of recalling previous results and constructions.
Remark 5.5. Note that P^k_σ represents the law on plays induced by the strategy σ, conditional on the fact that the initial state is k. This does not mean that we consider the decision-maker to know k. In the same fashion, γ^k_∞(σ) is the reward given by the strategy σ, conditional on the fact that the initial state is k. Even though σ is optimal in Γ(P*), this does not imply that σ is optimal in Γ(δ_k): we may have γ^k_∞(σ) < v_∞(δ_k). The importance of Lemma 5.3 comes from the fact that the average rewards converge almost surely to a limit that only depends on the initial state k. Intuitively, this result means that for any initial belief p_1, after a finite number of stages, we can get ε-close to a belief P* such that the optimal reward from P* is, in expectation, almost the same as from p_1, and moreover from P* there exists an optimal strategy that induces a strong ergodic behavior on the state dynamics. Thus, there is a natural way to build a 3ε-optimal strategy σ̃ in Γ(p_1): first, apply the strategy σ_ε for m_ε stages, then apply σ. Since after m_ε steps the current belief P_{m_ε} is ε-close to P* with probability higher than 1 − ε, the reward from playing σ̃ is at least the expectation of γ^{P*}_∞(σ) − 2ε, which is greater than v_∞(p_1) − 3ε. Therefore, this procedure yields a 3ε-optimal strategy. Nonetheless, σ may not have finite memory, and thus σ̃ may not either. The main difficulty of the proof is to transform σ into a finite-memory strategy. We formalize this discussion below.
Definition 5.6 (Ergodic strategy). Let p* ∈ ∆(K). We say that a strategy σ is ergodic for p* if, for all k ∈ supp(p*),
$$\frac{1}{n}\sum_{m=1}^{n} g(K_m, A_m) \xrightarrow[n\to\infty]{} \gamma^{k}_\infty(\sigma) \qquad \mathbb{P}^{k}_\sigma\text{-almost surely}.$$
From the previous discussion, we aim at proving the following result.
Lemma 5.7. Let p* ∈ ∆(K) and let σ be an ergodic strategy for p*. For all ε > 0, there exists a finite-memory strategy σ′ such that
$$\gamma^{p^*}_\infty(\sigma') \geq \gamma^{p^*}_\infty(\sigma) - \varepsilon.$$
This is our key lemma and the main technical contribution. The next section is devoted to explaining the technique used and proving it.
Proof of Theorem 2.9 assuming Lemmas 5.7 and 5.3. Let p_1 be an initial belief and ε > 0. Let m_ε, σ_ε, P* and σ be given by Lemma 5.3. Define the strategy σ_0 by: play σ_ε until stage m_ε, then switch to the strategy σ′ given by Lemma 5.7 for σ and p* = P*. Note that σ_0 has finite memory. We have
$$\gamma^{p_1}_\infty(\sigma_0) \geq \mathbb{E}^{p_1}_{\sigma_\varepsilon}\big[\gamma^{P^*}_\infty(\sigma')\big] - 2\varepsilon \geq \mathbb{E}^{p_1}_{\sigma_\varepsilon}\big[\gamma^{P^*}_\infty(\sigma)\big] - 3\varepsilon \geq v_\infty(p_1) - 4\varepsilon,$$
and the theorem is proved (up to replacing ε by ε/4).
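The switching construction can be sketched as follows (ours; in the paper σ′ itself depends on the realized belief P*, which the sketch does not model — it only illustrates the concatenation of the two strategies).

```python
def concatenate(sigma_eps, m_eps, sigma_prime):
    """Play sigma_eps for the first m_eps stages, then switch to sigma_prime.

    Strategies are, as in the earlier sketches, functions from histories
    (tuples of (action, signal) pairs) to actions.
    """
    def sigma_0(history):
        if len(history) < m_eps:
            return sigma_eps(history)
        return sigma_prime(history[m_eps:])   # sigma_prime only sees the post-switch history
    return sigma_0
```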
6 Super-support and proof of Lemma 5.7

In this entire section, fix p* ∈ ∆(K), which will be used as an initial belief, and σ an ergodic strategy for p*. Since σ is ergodic for p*, the map k ↦ γ^k_∞(σ) takes finitely many values on supp(p*); denote by (K_1, . . . , K_I) the partition of supp(p*) into its level sets, which we call the value partition associated with p* and σ. Given m ≥ 1 and a history h_m ∈ H_m, the super-support at stage m is the tuple B_m(h_m) = (B^1_m(h_m), . . . , B^I_m(h_m)), where B^i_m(h_m) := ∪_{k′∈K_i} supp(P^{k′}_σ(K_m = · | H_m = h_m)) is the set of states that can occur at stage m under history h_m when the initial state belongs to K_i; we write B_m := B_m(H_m) for the (random) super-support at stage m. Therefore, the support of P_m can be deduced from the super-support B_m. On the other hand, B_m cannot be deduced from P_m, and thus cannot be deduced from the support of P_m. This justifies the vocabulary.

Notation
We will build a finite-memory ε-optimal strategy that plays by blocks. Each block has a fixed finite length and, within each block, the strategy depends only on the history in the block and on the super-support at the beginning of the block. At the end of the block, the automaton computes the new super-support according to the block history and the previous super-support. Thus, the only difference with a bounded-recall strategy is that our strategy keeps track of the super-support. The super-support is a type of origin information: it is related to the value partition, and therefore to where the current mass distribution comes from. Given a history h_m ∈ H_m, define the shift strategy σ[h_m] by σ[h_m](h) := σ(h_m, h) for every finite history h. We denote by σ_m := σ[H_m] the corresponding random shift at stage m.
In other words, σ[h m ] corresponds to the continuation of the strategy σ conditional on the fact that the history of the first m stages was h m .

Illustration
The super-support captures specific information related to the beginning of the game: the origin of the current mass distribution (given by P m ) in terms of the initial value partition (K i ) i∈[1 .. I] .
There are finitely many possible super-supports and it is possible to keep track of the current super-support using Bayesian updating. Therefore, it is a good variable to be used in finitememory strategies.
Let us recall our simple example of a POMDP, Example 4.5.
Example 4.5. Consider two states (k u , k d ) and two actions: up (a u ) and down (a d ). All transitions are possible (including loops) and they do not depend on the action. Signals inform the player when the state changes. In terms of actions and rewards, by playing a u the player obtains a reward of 1 only if the current state is k u . Similarly, by playing a d the player obtains a reward of 1 only if the state is k d . Figure 6 is a representation of the game with specific transition probabilities.
Finite-recall is enough to approximate the value of this POMDP: the decision-maker can recall the last action. Then, upon seeing the signal s c , the player has to change actions. Recall that p 1 = 1/4 · δ ku + 3/4 · δ k d . Therefore, an optimal strategy is given by playing a d until getting signal s c , then playing a u until getting signal s c and repeat. This strategy is ergodic for p 1 and the corresponding value partition is given by (K 1 = {k u }, K 2 = {k d }), because, if K 1 = k u , the long-run reward is 0 and, if K 1 = k d , the long-run reward is 1. In this case, the super-support describes completely the belief P m since it keeps track of which state has the highest (or lowest) probability.
Although the example is simple, we can already see the difference between support strategies and super-support strategies. In this case, all strategies based on the current support (the support of P m ) are constant and therefore can achieve a long-run reward of at most 1/2. On the other hand, super-support strategies can be optimal and achieve a long-run reward of 3/4. This example also shows that playing by blocks and defining the behaviour in each block by the current support (instead of the super-support) is not enough.
Let us analyze now our more complex POMDP example, Example 4.6.
Example 4.6. Consider six states and four actions: up (a u ), down (a d ), left (a l ) and right (a r ). States can be separated into two groups: extremes (k u and k d ) and transitional (k l 1 , k r 1 , k l 2 and k r 2 ). Furthermore, transitional states can be divided into two groups: left states (k l 1 and k l 2 ) and right states (k r 1 and k r 2 ). Transitions are from extreme states to transitional states and from transitional to extremes. More precisely, excluding loops, only the following transitions are possible: from k u to either k l 1 or k r 1 , then from these two to k d , from k d to either k l 2 or k r 2 and then back to k u . Signals are such that the player knows: (i) the state changed to an extreme state, or (ii) the state changed to a transitional state and the new state is with higher probability a left state or a right state. In terms of actions and rewards, each action has an associated set of states in which the reward is 1 and the rest is 0: by playing a u the reward is 1 only if the current state is k u , playing a d rewards only state k d , a l rewards states k l 1 and k l 2 , and a r rewards states k r 1 and k r 2 . Figure 7 is a representation of this game with specific transition probabilities.
Recall that p_1 = 1/4 · δ_{k_u} + 3/4 · δ_{k_d} and that an optimal strategy is given by playing action a_d until getting a signal s_l or s_r. If the decision-maker gets signal s_l, then play action a_l; otherwise, play action a_r. Repeat action a_l or a_r until getting the signal s_c. Then, play a_u until getting a signal s_l or s_r. When this happens, play a_l or a_r accordingly until getting signal s_c. And so, repeat the cycle.
This optimal strategy is ergodic for p_1 and the corresponding value partition is given by (K_1 = {k_u}, K_2 = {k_d}) because, if K_1 = k_u, the long-run reward is 0 and, if K_1 = k_d, the long-run reward is 7/8. Contrary to the previous example, the super-support does not describe completely the belief P_m. Indeed, consider the initial belief p_1, which is supported on the extreme states, and suppose that the decision-maker gets either signal s_l or s_r. Then, the new belief is supported on all the transitional states and the super-support is the same under either of these two histories, equal to ({k_{l1}, k_{r1}}, {k_{l2}, k_{r2}}). Based on this super-support one cannot reconstruct the current belief, but one knows more than only the support: we can differentiate the origin (k_u or k_d) of the current belief distribution.
Notice that using the super-support alone is not enough to get ε-optimal strategies. Indeed, in transitional states, the decision-maker needs to know whether the state is more likely to be a left state or a right state in order to play well, and the super-support does not contain such information. That is why, in the proof of Lemma 5.7, we consider a more sophisticated class of strategies, which combine super-support and bounded recall. For the moment, let us describe such a strategy for this example (a code sketch of the resulting controller is given after this list). Choose n_0 very large and, for each ℓ ≥ 1, play the following strategy in the time block [ℓn_0 + 1 .. (ℓ + 1)n_0]: -Case 1: the super-support at stage ℓn_0 + 1 is ({k_u}, {k_d}). Play the previous 0-optimal strategy, that is: play action a_d until getting a signal s_l or s_r. If the decision-maker gets signal s_l, then play action a_l; otherwise, play action a_r. Repeat action a_l or a_r until getting the signal s_c. Then, play a_u until getting a signal s_l or s_r. When this happens, play a_l or a_r accordingly until getting signal s_c. And so, repeat the cycle.
-Case 2: the super-support at stage ℓn 0 + 1 is ({k d }, {k u }). Play the same strategy as in Case 1, except that the roles of a u and a d are switched.
-Case 3: the super-support at stage ℓn 0 + 1 is ({k l 1 , k r 1 }, {k l 2 , k r 2 }). Play a l (or a r ) until getting the signal s c . At this point, the super-support is ({k d }, {k u }). Then, play as in Case 2.
-Case 4: the super-support at stage ℓn 0 + 1 is ({k l 2 , k r 2 }, {k l 1 , k r 1 }). Play a l (or a r ) until getting the signal s c . At this point, the super-support is ({k u }, {k d }). Then, play as in Case 1.
This strategy is sub-optimal during the first phase of Case 3 and Case 4, until the decision-maker receives signal s c . As n 0 grows larger and larger, this part becomes negligible. Thus, for any ε > 0, there exists n 0 such that this strategy is ε-optimal (but not optimal).
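Here is a sketch of the within-block behaviour for the four cases above (class names and the encoding of the cases are ours); in the full strategy, a fresh controller is instantiated at each block boundary according to the super-support computed from the previous super-support and the last n_0 signals.

```python
class MirrorController:
    """Follows the cycle of Case 1 (or Case 2 when started with first_target='a_u')."""
    def __init__(self, first_target='a_d'):
        self.target = first_target     # action played while the state is believed to be extreme
        self.phase = 'extreme'         # 'extreme' (play target) or 'transitional' (play side)
        self.side = None               # a_l or a_r, chosen from the signal s_l / s_r

    def act(self, last_signal):
        if self.phase == 'extreme' and last_signal in ('s_l', 's_r'):
            self.phase = 'transitional'                       # moved to a transitional state
            self.side = 'a_l' if last_signal == 's_l' else 'a_r'
        elif self.phase == 'transitional' and last_signal == 's_c':
            self.phase = 'extreme'                            # back to an extreme state: switch side
            self.target = 'a_u' if self.target == 'a_d' else 'a_d'
        return self.target if self.phase == 'extreme' else self.side

def controller_for_block(case):
    """Within-block behaviour for the super-support cases 1-4 described above."""
    if case in (1, 2):
        return MirrorController('a_d' if case == 1 else 'a_u')
    c = MirrorController('a_d' if case == 3 else 'a_u')   # flips on s_c: case 3 -> a_u, case 4 -> a_d
    c.phase, c.side = 'transitional', 'a_l'                # Cases 3/4: first play a_l until signal s_c
    return c
```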

Properties
Now we can state properties of super-supports when the strategy σ is ergodic for p*, and explain how rich the structure of the random sequence of beliefs (P_m)_{m≥1} is. By definition of ergodic strategies, the map k → γ^k_∞(σ) is constant on each K_i, and we denote its value by γ^i_∞. Considering the law given by P^{p*}_σ, fix a realization K_m = k ∈ B^i_m. By definition of the super-support, there exists k′ ∈ K_i ⊆ supp(p*) such that k can be reached from k′ in m steps. Recall that, since σ is ergodic for p*,
$$\frac{1}{n}\sum_{m'=1}^{n} g(K_{m'}, A_{m'}) \xrightarrow[n\to\infty]{} \gamma^{k'}_\infty(\sigma) = \gamma^{i}_\infty \qquad \mathbb{P}^{k'}_\sigma\text{-almost surely}.$$
In particular, the convergence holds on the event K_m = k. Then, the average rewards along the continuation of the play from k, under the shift strategy σ_m, also converge to γ^i_∞.

Another property of the super-support is concerned with consecutive conditioning and is fairly intuitive. We formally state it in the following lemma and show the proof for completeness.
In other words, the super-support that arises at stage m + m ′ , B m+m ′ , coincides with the super-support that would arise from a two-step procedure: first, advancing m stages; and then, applying the continuation of the strategy, σ m , for m ′ more stages.
Proof. Fix a realization H_{m+m′} = h = (h_m, h_{m′}) and let k ∈ B^i_m(h_m). Recall that, by definition of the super-support, B^i_m(h_m) = ∪_{k′∈K_i} supp(P^{k′}_σ(K_m = · | H_m = h_m)). Therefore, there exists k̃_1 ∈ K_i such that k ∈ supp(P^{k̃_1}_σ(K_m = · | H_m = h_m)). In particular, any state that can be reached from k in m′ additional stages under the shift σ_m and the history h_{m′} can also be reached from k̃_1 in m + m′ stages under σ and the history h. By a semi-group property of the conditional laws, the converse inclusion also holds, and thus the lemma is proved.
Remark 6.7. This property does not depend on the fact that σ is ergodic for p * .

Proof of Lemma 5.7
Fix p* ∈ ∆(K) such that σ is ergodic for p*. Note that, for all m ≥ 1, the super-support B_m can take only finitely many values, since each of its components is a subset of K. Denote by D_1, D_2, . . . , D_J the super-supports that occur with positive probability under P^{p*}_σ, and write D^j_i for the i-th component of D_j. Moreover, since D_j corresponds to a super-support that occurs at some stage and under some history, there exist h_j and m_j such that h_j ∈ H_{m_j} and D_j = B_{m_j}(h_j). In other words, D_j is the realization of the super-support at stage m_j under history h_j, and D_1, D_2, . . . , D_J contains all super-supports that can occur.
Definition of the strategy σ′. Let ε > 0. By Lemma 6.5, there exists n_0 ∈ N* such that, for all i ∈ [1 .. I], j ∈ [1 .. J] and k ∈ D^j_i, the expected average reward over n_0 stages, starting from state k and playing the shift σ[h_j], is within ε of γ^i_∞. Define the strategy σ′ by blocks, and characterize each block by induction. For each ℓ ≥ 0, block number ℓ consists of the stages m such that ℓn_0 + 1 ≤ m ≤ (ℓ + 1)n_0. We characterize the behavior in block ℓ by a variable J_ℓ ∈ [1 .. J] in the following way. For a stage m inside block ℓ, the strategy σ′ plays according to J_ℓ and the history between stages ℓn_0 + 1 and m. Each block is characterized by induction because the variable J_ℓ is computed at stage ℓn_0 + 1 from J_{ℓ−1} and the history in the last n_0 stages. Thus, σ′ can be seen as a mapping from ∪_{m=1}^{n_0} H_m × [1 .. J] to A.
Consider ℓ = 0, i.e. the first block. The strategy σ′ is defined on the first n_0 stages as follows. Consider the value partition {K_1, . . . , K_I} given by p* and σ. By definition of D_1, D_2, . . . , D_J, there exists j ∈ [1 .. J] such that D_j = (K_1, . . . , K_I), the super-support at stage 1. Set J_0 := j and, for all m ≤ n_0 and h ∈ H_m, let σ′(h, J_0) := σ(h_{J_0}, h).
Let us proceed to the induction step. Consider ℓ ≥ 1 and assume that we have defined J_{ℓ−1} and σ′ up to stage ℓn_0. Denote the history between stages (ℓ − 1)n_0 + 1 and ℓn_0 + 1 by h ∈ H_{n_0+1}, and define J_ℓ such that, for all i ∈ [1 .. I], the i-th component of D_{J_ℓ} is the super-support obtained by updating the i-th component of D_{J_{ℓ−1}} along the block history h (such an index exists by the consecutive-conditioning property above). Then, extend σ′ for n_0 additional stages as before: σ′(h, J_ℓ) := σ(h_{J_ℓ}, h) for all h ∈ H_m and m ≤ n_0. Thus, we have defined J_ℓ and extended σ′ up to stage (ℓ + 1)n_0. To summarize our construction in words, during stages ℓn_0 + 1, ℓn_0 + 2, . . . , (ℓ + 1)n_0, the decision-maker plays as if he were playing σ from history h_{J_ℓ}. Notice that the indexes J_0, J_1, . . . depend on the history, and therefore are random. Now we will connect the strategy σ′ with the super-support given by p* and σ.

Lemma 6.8. For all i ∈ [1 .. I], k ∈ K_i and ℓ ≥ 0, the family (D^{J_ℓ}_1, . . . , D^{J_ℓ}_I) is a partition of the support of P_{ℓn_0+1}. Moreover, under P^k_{σ′}, the state K_{ℓn_0+1} belongs to D^{J_ℓ}_i almost surely.

Proof. Fix i ∈ [1 .. I].
To finish the proof of Lemma 5.7, we must show that the finite-memory strategy σ′ guarantees the reward obtained by σ up to ε. The idea is that in each block we are playing some shift of σ for n_0 stages. The shift is chosen so that the information about the initial belief is correctly updated, while the number n_0 is chosen so that the expected average reward of the whole block is close to the expected limit average reward. Then, since all blocks have the same approximation error, the average over all blocks yields approximately γ^{p*}_∞(σ). This is the intuition behind the following lemma.

Lemma 6.9. Let L ∈ N*. The following inequality holds:
$$\mathbb{E}^{p^*}_{\sigma'}\Big[\frac{1}{L n_0}\sum_{m=1}^{L n_0} G_m\Big] \geq \gamma^{p^*}_\infty(\sigma) - \varepsilon.$$
Proof. For every ℓ ≥ 0, by Lemma 6.8 and the choice of n_0, the expected average reward over block ℓ is at least γ^{p*}_∞(σ) − ε. It follows that the expected average reward over the first L blocks is at least γ^{p*}_∞(σ) − ε, which is the claim.

To conclude, since σ′ has finite memory, the induced Markov chain on the pairs (state, memory state) is finite, so the expected average rewards converge and γ^{p*}_∞(σ′) = lim_{L→∞} E^{p*}_{σ′}[(1/(Ln_0)) Σ_{m=1}^{Ln_0} G_m]. The above lemma then implies that γ^{p*}_∞(σ′) ≥ γ^{p*}_∞(σ) − ε, which proves Lemma 5.7: for each ergodic strategy σ and ε > 0, one can construct a finite-memory strategy σ′ that guarantees the reward obtained by σ up to ε.
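To summarize the construction, here is a schematic rendering of the finite-memory strategy σ′ (ours, not the paper's notation): the memory consists of the current index J_ℓ together with the history inside the current block, and the ingredients sigma, shift_histories (one representative history h_j per super-support D_j) and next_index (the super-support update at block boundaries) are assumed given.

```python
class SuperSupportBlockStrategy:
    """Finite-memory controller: a super-support index plus the current block history."""
    def __init__(self, sigma, shift_histories, next_index, n0, j0):
        self.sigma = sigma                    # the ergodic strategy, as a function of a history
        self.shift_histories = shift_histories
        self.next_index = next_index          # maps (current index, block history) -> new index
        self.n0 = n0
        self.j = j0                           # index of the super-support at the start of the block
        self.block_history = ()               # (action, signal) pairs observed inside the block

    def observe(self, action, signal):
        self.block_history += ((action, signal),)
        if len(self.block_history) == self.n0:                     # block boundary
            self.j = self.next_index(self.j, self.block_history)   # update the super-support index
            self.block_history = ()                                 # forget everything else

    def act(self):
        # play sigma as if the past were h_j followed by the current block history;
        # the memory is finite: finitely many indices and block histories of length < n0.
        return self.sigma(self.shift_histories[self.j] + self.block_history)
```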
A Proof of Lemma 5.3

A.1 Notation
Recall that Lemma 5.3 is a consequence of [27,Lemma 33]. Thus, we start by introducing some of the terms used in [27], namely: n-stage game, invariant measure, occupation measure and the Kantorovich-Rubinstein distance.
Remark A.2. As the notation suggests, it was proven in [27] that for any finite POMDP, (v_n)_{n≥1} converges uniformly to v_∞ as n → ∞. The fact that (v_n)_{n≥1} converges was proven in [23].
The above definition can be intuitively understood in the following way: if the initial belief is distributed according to µ, and the decision-maker plays the stationary strategy σ at stage 1, then the belief at stage 2 is distributed according to µ too.
Remark A.4. Since v ∞ : ∆(K) → [0, 1] is a continuous function, one can replace f by v ∞ in the previous definition. Moreover, interpreting σ as a (mixed) stationary strategy, we would have that the sequence (E µ σ [v ∞ (P m )]) m≥1 is constant.
For sake of notation, we identify ∆(K) with the extreme points of ∆(∆(K)).
Remark A.7. The set ∆(∆(K)) equipped with distance d KR is a compact metric space.