Strong Uniform Value in Gambling Houses and Partially Observable Markov Decision Processes

In several standard models of dynamic programming (gambling houses, MDPs, POMDPs), we prove the existence of a robust notion of value for the infinitely repeated problem, namely the strong uniform value. This solves two open problems. First, this shows that for any ε > 0, the decision-maker has a pure strategy σ which is ε-optimal in any n-stage problem, provided that n is big enough (this result was only known for behavior strategies, that is, strategies which use randomization). Second, for any ε > 0, the decision-maker can guarantee the limit of the n-stage value minus ε in the infinite problem where the payoff is the expectation of the inferior limit of the time average payoff.


Introduction
The standard model of Markov Decision Process (or Controlled Markov chain) was introduced by Bellman [4] and has been extensively studied since then. In this model, at the beginning of every stage, a decision-maker perfectly observes the current state, and chooses an action accordingly, possibly randomly. The current state and the selected action determine a stage payoff and the law of the next state. There are two standard ways to aggregate the stream of payoffs. Given a strictly positive integer n, in the n-stage MDP, the total payoff is the Cesàro mean $n^{-1} \sum_{m=1}^{n} g_m$, where $g_m$ is the payoff at stage m. Given λ ∈ (0, 1], in the λ-discounted MDP, the total payoff is the λ-discounted sum $\lambda \sum_{m \geq 1} (1-\lambda)^{m-1} g_m$. The maximum expected payoff that the decision-maker can obtain in the n-stage problem (resp. λ-discounted problem) is denoted by $v_n$ (resp. $v_\lambda$).
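To make these two evaluations concrete, the sketch below computes $v_n$ and $v_\lambda$ by backward induction and value iteration in a small finite MDP. The two-state, two-action payoff matrix g and kernel q are hypothetical, chosen only for illustration:

```python
import numpy as np

# Hypothetical two-state, two-action MDP (illustration only).
# g[k, i]    : stage payoff in state k under action i, in [0, 1]
# q[k, i, :] : law of the next state given (k, i)
g = np.array([[0.0, 1.0],
              [0.5, 0.2]])
q = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [1.0, 0.0]]])

def n_stage_value(n):
    """v_n(k): maximal expected Cesàro mean of the first n payoffs."""
    V = np.zeros(2)                      # optimal total payoff, by backward induction
    for _ in range(n):
        V = np.max(g + q @ V, axis=1)    # (q @ V)[k, i] = E[V(next state) | k, i]
    return V / n                         # Cesàro normalization

def discounted_value(lam, tol=1e-12):
    """v_lam(k): maximal expected discounted sum lam * sum (1-lam)^(m-1) g_m."""
    V = np.zeros(2)
    while True:
        W = np.max(lam * g + (1 - lam) * (q @ V), axis=1)
        if np.max(np.abs(W - V)) < tol:
            return W
        V = W
```

For instance, `n_stage_value(1)` returns the best one-shot payoffs `[1.0, 0.5]`.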
A huge part of the literature investigates long-term MDPs, that is, MDPs which are repeated a large number of times. In the n-stage problem (resp. λ-discounted problem), this corresponds to n being large (resp. λ being small). A first approach is to determine whether (v n ) and (v λ ) converge when n goes to infinity and λ goes to 0, and whether the two limits coincide. When this is the case, the MDP is said to have an asymptotic value. The asymptotic value represents the long-term payoff outcome.
A second approach is to define the payoff in the infinite problem as the inferior limit of the expectation of $n^{-1} \sum_{m=1}^{n} g_m$. In the literature, this is referred to as the long-run average payoff criterion (see Arapostathis et al. [3] for a review of the subject). When the asymptotic value exists and coincides with the value in behavior (resp. pure) strategies of the infinite problem, the MDP is said to have a uniform value in behavior (resp. pure) strategies.
A third approach is to define the payoff in the infinite problem as being the expectation of $\liminf_{n \to +\infty} n^{-1} \sum_{m=1}^{n} g_m$, as studied in Gillette [13]. Denote by $w_\infty$ the value of this problem. When the asymptotic value exists, then $w_\infty \leq \lim_{n \to +\infty} v_n$. A natural question is whether the equality holds. When this is the case, the decision problem is said to have a strong uniform value. As we shall see, it is straightforward that the existence of the strong uniform value implies the existence of the uniform value in pure strategies.
When the state space and action sets are finite, Blackwell [6] has proved the existence of a pure strategy that is optimal for every discount factor close to 0, and one can deduce that the strong uniform value exists.
In many situations, the decision-maker may not be perfectly informed of the current state variable. For instance, if the state variable represents a resource stock (like the amount of oil in an oil field), the quantity left, which represents the state, can be evaluated, but is not exactly known. This motivates the introduction of the more general model of Partially Observable Markov Decision Process (POMDP). In this model, at each stage, the decision-maker does not observe the current state, but instead receives a signal which is correlated to it.
Rosenberg, Solan and Vieille [20] have proved that any POMDP has a uniform value in behavior strategies, when the state space, the action set and the signal set are finite. In the proof, the authors highlight the necessity that the decision-maker resort to behavior strategies, and ask whether the uniform value exists in pure strategies. They also raise the question of the long-term properties of the time average payoff. Renault [17] and Renault and Venel [18] have provided two alternative proofs of the existence of the uniform value in behavior strategies in POMDPs, and also ask whether the uniform value exists in pure strategies.
The main contribution of this paper is to show that any finite POMDP has a strong uniform value, and consequently has a uniform value in pure strategies. In fact, we prove this result in a much more general framework, as we shall see now.
The result of Rosenberg, Solan and Vieille [20] (existence of the uniform value in behavior strategies in POMDPs) has been generalized to several dynamic programming models with infinite state space and action set. The first such model is the gambling house. Introduced by Dubins and Savage [10], a gambling house is defined by a correspondence from a metric space X to the set of probabilities on X. At every stage, the decision-maker chooses a probability on X which is compatible with the correspondence and the current state. A new state is drawn from this probability, and this new state determines the stage payoff. When the state space is compact, the correspondence is 1-Lipschitz, and the payoff function is continuous (for suitable metrics), the existence of the uniform value in behavior strategies stems from the main theorem in [17]. One can deduce from this result the existence of the uniform value in behavior strategies in MDPs and POMDPs, for a finite state space and any action and signal sets. Renault and Venel [18] have extended the results of [17] to more general payoff evaluations.
The proofs in Renault [17] and Renault and Venel [18] are quite different from the one of Rosenberg, Solan and Vieille [20]. Still, they heavily rely on the use of behavior strategies for the decision-maker, and they do not provide any results concerning the link between the asymptotic value and w ∞ .
In this paper, we consider a gambling house with compact state space, closed graph correspondence and continuous payoff function. We show that if the family $\{v_n, n \geq 1\}$ is equicontinuous and $w_\infty$ is continuous, then the gambling house has a strong uniform value. This result applies in particular to 1-Lipschitz gambling houses. We deduce the same result for compact MDPs with 1-Lipschitz transition, and for POMDPs with finite state space, compact action set and finite signal set.
Note that under an ergodic assumption on the transition function, like assuming that from any state, the decision-maker can make the state go back to the initial state (see Altman [2]), or assuming that the law of the state variable converges to an invariant measure (see Borkar [7,8]), these results were already known. One remarkable feature of our proof is that we are able to use ergodic theory without any ergodic assumptions.
The paper is organized as follows. The first part presents the model of gambling house and recalls usual notions of value. The second part states our results, that is, the existence of the strong uniform value in gambling houses, MDPs and POMDPs. The last three parts are dedicated to the proof of these results.
A gambling house Γ = (X, F, r) is defined by the following elements:
• X is the state space, which is assumed to be compact metric for some distance d.
• F : (X, d) ⇒ (∆(X), d_KR) is a correspondence with a closed graph and nonempty values.
• r : X → [0, 1] is the payoff function, which is assumed to be continuous.
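Here $d_{KR}$ denotes the Kantorovich–Rubinstein metric on ∆(X). For probability measures on the real line it has a convenient closed form, namely the L1 distance between cumulative distribution functions; the sketch below computes it for finitely supported measures (the paper's X is a general compact metric space, so this is only a special case, for intuition):

```python
import numpy as np

def kr_distance_1d(xs, p, q):
    """d_KR between distributions p and q on sorted real points xs.

    On the real line, d_KR(p, q) equals the integral of |F_p - F_q|,
    where F_p and F_q are the cumulative distribution functions.
    """
    cdf_gap = np.cumsum(np.asarray(p, float) - np.asarray(q, float))[:-1]
    return float(np.sum(np.abs(cdf_gap) * np.diff(xs)))
```

For example, moving a unit mass from 0 to 1 costs 1: `kr_distance_1d([0.0, 1.0], [1.0, 0.0], [0.0, 1.0])` returns `1.0`.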

Remark 1.
Because the state space is compact, F is a closed graph correspondence if and only if it is an upper hemicontinuous correspondence with closed values.
Let x_0 ∈ X be an initial state. The gambling house starting from x_0 proceeds as follows. At each stage m ≥ 1, the decision-maker chooses z_m ∈ F(x_{m−1}). A new state x_m is drawn from the probability distribution z_m, and the decision-maker gets the payoff r(x_m).
For the definition of strategies, we follow Maitra and Sudderth [15, Chapter 2]. First, we need the following definition (see [9, Chapter 11, Section 1.8]): given M a closed subset of ∆(X), we denote by Sco M the strong convex hull of the set M, that is, $\mathrm{Sco}\, M := \{\mathrm{Bar}(\nu) \mid \nu \in \Delta(M)\}$, where Bar(ν) denotes the barycenter of ν. Equivalently, Sco M is the closure of the convex hull of M.

For every m ≥ 1, we denote by H_m := X^m the set of possible histories before stage m, which is compact for the product topology.

Definition 2.
A behavior (resp. pure) strategy σ is a sequence of mappings σ := (σ_m)_{m≥1} such that for every m ≥ 1, σ_m : H_m → ∆(X) is measurable and satisfies σ_m(x_0, …, x_{m−1}) ∈ Sco F(x_{m−1}) (resp. σ_m(x_0, …, x_{m−1}) ∈ F(x_{m−1})) for every history (x_0, …, x_{m−1}) ∈ H_m. We denote by Σ (resp. Σ_p) the set of behavior (resp. pure) strategies.
Note that Σ_p ⊂ Σ. The following proposition ensures that Σ_p is nonempty. It is a special case of the Kuratowski–Ryll-Nardzewski theorem (see [1, Theorem 18.13, p. 600]).

Proposition 1. Let K_1 and K_2 be two compact metric spaces, and Φ : K_1 ⇒ K_2 be a closed graph correspondence with nonempty values. Then Φ admits a measurable selector, that is, there exists a measurable mapping ϕ : K_1 → K_2 such that ϕ(x) ∈ Φ(x) for all x ∈ K_1.

Proof. In [1], the theorem is stated for weakly measurable correspondences. By [1, Theorem 18.10, p. 598] and [1, Theorem 18.20, p. 606], any correspondence satisfying the assumptions of the proposition is weakly measurable, thus the proposition holds.
A strategy σ is stationary if there exists a measurable mapping f : X → ∆(X), with f(x) ∈ Sco F(x) for all x ∈ X, such that for every m ≥ 1 and every history (x_0, …, x_{m−1}) ∈ H_m, σ_m(x_0, …, x_{m−1}) = f(x_{m−1}). When this is the case, we identify σ with f.
Let H_∞ := X^N be the set of all possible plays in the gambling house Γ. By the Kolmogorov extension theorem, an initial state x_0 ∈ X and a behavior strategy σ determine a unique probability measure over H_∞, denoted by $P^\sigma_{x_0}$. Let x_0 ∈ X and n ≥ 1. The payoff in the n-stage problem starting from x_0 is defined for σ ∈ Σ by
$$\gamma_n(x_0, \sigma) := \mathbb{E}^{\sigma}_{x_0}\left( \frac{1}{n} \sum_{m=1}^{n} r(x_m) \right),$$
and the n-stage value is
$$v_n(x_0) := \sup_{\sigma \in \Sigma} \gamma_n(x_0, \sigma). \qquad (1)$$
The fact that the supremum can be taken over pure strategies is a consequence of Feinberg [11, Theorem 5.2].
Remark 2. For µ ∈ ∆(X), one can also define the gambling house with initial distribution µ, where the initial state is drawn from µ and announced to the decision-maker. The definitions of strategies and values are the same, and for all n ∈ N*, the value of the n-stage gambling house starting from µ is equal to $\hat v_n(\mu) := \int_X v_n(x)\, \mu(dx)$.
The gambling house Γ(x_0) is said to have a uniform value v(x_0) in behavior strategies if $(v_n(x_0))_n$ converges to v(x_0) and
$$\sup_{\sigma \in \Sigma} \liminf_{n \to +\infty} \gamma_n(x_0, \sigma) = v(x_0).$$
Moreover, if the above equality also holds when the supremum on the left-hand side is taken over pure strategies, Γ(x_0) is said to have a uniform value in pure strategies.
Remark 3. Contrary to Equation (1), one can not directly replace the supremum over behavior strategies by a supremum over pure strategies. It is a recurring open problem in the literature. This question appears in Renault [17] for gambling houses, and in Rosenberg, Solan and Vieille [20] for POMDPs.
To study long-term dynamic programming problems, an alternative to the uniform approach is to associate a payoff to each infinite history. Given an initial state x_0 ∈ X, the infinitely repeated gambling house Γ_∞(x_0) is the problem with strategy set Σ, and payoff function γ_∞ defined for all σ ∈ Σ by
$$\gamma_\infty(x_0, \sigma) := \mathbb{E}^{\sigma}_{x_0}\left( \liminf_{n \to +\infty} \frac{1}{n} \sum_{m=1}^{n} r(x_m) \right).$$
This type of payoff has been introduced by Gillette [13] for zero-sum stochastic games. The value of Γ_∞(x_0) is denoted by $w_\infty(x_0) := \sup_{\sigma \in \Sigma} \gamma_\infty(x_0, \sigma)$. The supremum can be taken over pure strategies as a direct consequence of Theorem 5.2 in Feinberg [11].
Definition 6. Let x_0 ∈ X. The gambling house Γ(x_0) has a strong uniform value v_∞(x_0) ∈ [0, 1] if it has an asymptotic value v_∞(x_0) and $w_\infty(x_0) = v_\infty(x_0)$.
This notion is similar to the notion of value defined in Mertens and Neyman [16] for zero-sum stochastic games. When the strong uniform value exists, then the uniform value exists in pure strategies, as the following proposition shows.

Proposition 2. For every σ ∈ Σ,
$$\gamma_\infty(x_0, \sigma) \leq \liminf_{n \to +\infty} \gamma_n(x_0, \sigma), \qquad (4)$$
and consequently
$$w_\infty(x_0) \leq \sup_{\sigma \in \Sigma_p} \liminf_{n \to +\infty} \gamma_n(x_0, \sigma) \leq \sup_{\sigma \in \Sigma} \liminf_{n \to +\infty} \gamma_n(x_0, \sigma) \leq \liminf_{n \to +\infty} v_n(x_0).$$
Consequently, if Γ(x_0) has a strong uniform value, the above inequalities are equalities, and Γ(x_0) has a uniform value in pure strategies.
Proof. Let σ ∈ Σ and n ∈ N*. By definition of v_n(x_0), we have $\gamma_n(x_0, \sigma) \leq v_n(x_0)$, and taking the liminf we get the right-hand side inequality. Moreover, by Fatou's lemma, $\gamma_\infty(x_0, \sigma) \leq \liminf_{n \to +\infty} \gamma_n(x_0, \sigma)$, which yields the left-hand side inequality, since the supremum defining $w_\infty(x_0)$ can be taken over pure strategies. The inequality in the middle holds because Σ_p ⊂ Σ.
The following example shows that inequality (4) may be strict.
Example 1. There are two states, x and x*, and F(x) = F(x*) = {δ_x, δ_{x*}}. Moreover, r(x) = 0 and r(x*) = 1. Thus, at each stage, the decision-maker has to choose between having a payoff 0 and having a payoff 1. Obviously, this problem has a uniform value equal to 1. Let ε > 0. Define the following strategy σ, which plays by blocks. For every n ∈ N, denote by B_n the set of stages between stage $2^{2^n}$ and stage $2^{2^{n+1}} - 1$. With probability ε/2, the decision-maker chooses to play δ_x during the whole block B_n, and with probability 1 − ε/2, he chooses to play δ_{x*} during the whole block B_n. Hence, the decision-maker chooses the same state during $2^{2^{n+1}} - 2^{2^n}$ stages. The strategy σ is uniformly ε-optimal: there exists n_0 ∈ N* such that for all n ≥ n_0, $\gamma_n(x, \sigma) \geq 1 - \varepsilon$. Nonetheless, by the law of large numbers, for any n_0 ∈ N*, there exists a random time T such that $P^\sigma_x$-almost surely, T ≥ n_0 and $\frac{1}{T} \sum_{m=1}^{T} r_m \leq \varepsilon$.
Therefore, the strategy σ does not guarantee more than ε in Γ_∞(x).
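The doubly exponential block lengths are what make the time average collapse: a single block on which the payoff is 0 outweighs the entire past. A deterministic numeric sketch (one bad block only, for illustration):

```python
# Blocks of Example 1: B_n covers stages 2^(2^n) .. 2^(2^(n+1)) - 1.
# Suppose the payoff is 1 at every stage except on the single block B_2,
# where it is 0, and look at the time average at the end of B_2.
n = 2
start = 2 ** (2 ** n)            # first stage of B_2: 16
end = 2 ** (2 ** (n + 1)) - 1    # last stage of B_2: 255
paid = start - 1                 # only stages 1 .. 15 paid 1
average = paid / end             # 15/255, below 0.06
```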

Gambling houses
We can now state our main theorem concerning gambling houses.

Theorem 1. Let Γ be a gambling house such that $\{v_n, n \geq 1\}$ is uniformly equicontinuous and $w_\infty$ is continuous. Then Γ has a strong uniform value.
Corollary 1. Let Γ be a gambling house such that $\{v_n, n \geq 1\}$ is uniformly equicontinuous and $w_\infty$ is continuous. Then Γ has a uniform value in pure strategies.
The existence of the uniform value was shown in Renault and Venel [18] in any 1-Lipschitz gambling house. We obtain the following stronger result.
Theorem 2. Let Γ be a 1-Lipschitz gambling house. Then Γ has a strong uniform value. In particular, Γ has a uniform value in pure strategies.
In the next two subsections, we present similar results for MDPs and POMDPs. (In fact, the model of gambling house in [18] is slightly different: they do not assume that F is closed-valued, but instead assume that F takes values in the set of probability measures on X with finite support.)

MDPs
A Markov Decision Process (MDP) is a 4-tuple Γ = (K, I, g, q), where (K, d_K) is a compact metric state space, (I, d_I) is a compact metric action set, g : K × I → [0, 1] is a continuous payoff function, and q : K × I → ∆(K) is a continuous transition function. As usual, the set ∆(K) is equipped with the KR metric, and we assume that for all i ∈ I, q(·, i) is 1-Lipschitz. Given an initial state k_1 ∈ K known by the decision-maker, the MDP Γ(k_1) proceeds as follows. At each stage m ≥ 1, the decision-maker chooses i_m ∈ I, and gets the payoff g_m := g(k_m, i_m). A new state k_{m+1} is drawn from q(k_m, i_m), and is announced to the decision-maker. Then, Γ(k_1) moves on to stage m + 1. A behavior (resp. pure) strategy is a measurable map σ : ∪_{m≥1} K × (I × K)^{m−1} → ∆(I) (resp. σ : ∪_{m≥1} K × (I × K)^{m−1} → I). An initial state k_1 and a strategy σ induce a probability measure $P^\sigma_{k_1}$ on the set of plays $H_\infty = (K \times I)^{\mathbb{N}^*}$. The notion of uniform value is defined in the same way as in gambling houses. We prove the following theorem.

Theorem 3. The MDP Γ has a strong uniform value, that is, for all k_1 ∈ K, the two following statements hold:
• The sequence (v_n(k_1)) converges when n goes to infinity to some real number v_∞(k_1).
• The value w_∞(k_1) of the infinite problem coincides with v_∞(k_1).
Consequently, the MDP Γ has a uniform value in pure strategies.

POMDPs
A Partially Observable Markov Decision Process (POMDP) is a 5-tuple Γ = (K, I, S, g, q), where K is a finite state space, I is a compact metric action set, S is a finite signal set, g : K × I → [0, 1] is a continuous payoff function, and q : K × I → ∆(K × S) is a continuous transition function. Given an initial distribution p_1 ∈ ∆(K), the POMDP Γ(p_1) proceeds as follows. An initial state k_1 is drawn from p_1, and the decision-maker is not informed about it. At each stage m ≥ 1, the decision-maker chooses i_m ∈ I, and gets the (unobserved) payoff g(k_m, i_m). A pair (k_{m+1}, s_m) is drawn from q(k_m, i_m), and the decision-maker receives the signal s_m. Then the POMDP proceeds to stage m + 1. A behavior strategy (resp. pure strategy) is a measurable map σ : ∪_{m≥1} (I × S)^{m−1} → ∆(I) (resp. σ : ∪_{m≥1} (I × S)^{m−1} → I). An initial distribution p_1 ∈ ∆(K) and a strategy σ induce a probability measure $P^\sigma_{p_1}$ on the set of plays $H_\infty := (K \times I \times S)^{\mathbb{N}^*}$. The notion of uniform value is defined in the same way as in gambling houses. We prove the following theorem.

Theorem 4. The POMDP Γ has a strong uniform value, that is, for all p_1 ∈ ∆(K), the two following statements hold:
• The sequence (v_n(p_1)) converges when n goes to infinity to some real number v_∞(p_1).
• The value w_∞(p_1) of the infinite problem coincides with v_∞(p_1).
Consequently, the POMDP Γ has a uniform value in pure strategies.
In particular, this theorem answers positively the open question raised in [20], [17] and [18]: finite POMDPs have a uniform value in pure strategies.
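A standard way to handle the lack of state observation is to track the posterior belief $p_m \in \Delta(K)$ on the current state given past actions and signals; the belief evolves by a Bayesian update. A minimal sketch with a hypothetical two-state, one-action, two-signal kernel (the numbers are made up):

```python
import numpy as np

# q[k, i, k2, s] = P(next state k2, signal s | current state k, action i)
K, I, S = 2, 1, 2
q = np.zeros((K, I, K, S))
q[0, 0] = [[0.6, 0.1], [0.2, 0.1]]   # from state 0 under the single action
q[1, 0] = [[0.1, 0.2], [0.1, 0.6]]   # from state 1

def update(p, i, s):
    """Posterior belief on the next state after playing i and observing s."""
    joint = np.einsum('k,kl->l', p, q[:, i, :, s])  # P(next = l, signal = s)
    return joint / joint.sum()                      # condition on the signal
```

Starting from the uniform belief, `update(np.array([0.5, 0.5]), 0, 0)` returns the posterior `[0.7, 0.3]`.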

Proof of Theorem 1
Let Γ = (X, F, r) be a gambling house such that $\{v_n, n \geq 1\} \cup \{w_\infty\}$ is uniformly equicontinuous.
Let x_0 ∈ X be an initial state. Let us prove that for all ε > 0, there exists a behavior strategy σ such that $\gamma_\infty(x_0, \sigma) \geq \limsup_{n \to +\infty} v_n(x_0) - \varepsilon$; together with Proposition 2, this yields the existence of the strong uniform value.
Let us first give the structure and the intuition of the proof. It builds on three main ideas, each of them corresponding to a lemma. First, Lemma 1 associates to x_0 a probability measure µ* ∈ ∆(X), such that:
• Going from x_0, for all ε > 0 and n_0 ∈ N*, there exists a strategy σ_0 and n ≥ n_0 such that the occupation measure $\frac{1}{n} \sum_{m=1}^{n} z_m \in \Delta(X)$ is close to µ* up to ε (for the KR distance).
• If the initial state is drawn according to µ * , the decision-maker has a behavior stationary strategy σ * such that for all m 1, z m is distributed according to µ * (µ * is an invariant measure for the gambling house).
Let x be in the support of µ*. Building on a pathwise ergodic theorem, Lemma 2 shows that, under a suitable stationary strategy, the average payoffs starting from x converge almost surely to v(x). Lemma 3 shows that, if y ∈ X is close to x, then there exists a behavior strategy σ such that γ_∞(y, σ) is close to v(y).
These lemmas are put together in the following way. Lemma 1 implies that, going from x_0, the decision-maker has a strategy σ_0 and a (deterministic) stage m ≥ 1 such that with high probability, the state x_m is close to the support of µ*, and such that the expectation of v(x_m) is close to v(x_0). Let x be an element of the support of µ* such that x_m is close to x. By Lemma 3, going from x_m, the decision-maker has a strategy σ̃ such that γ_∞(x_m, σ̃) is close to v(x_m). Let σ be the strategy that plays σ_0 until stage m, then switches to σ̃. Then γ_∞(x_0, σ) is close to v(x_0), which concludes the proof of Theorem 1.

Preliminary results
Let Γ = (X, F, r) be a gambling house. We define a relaxed version of the gambling house, in order to obtain a deterministic convex gambling house H : ∆(X) ⇒ ∆(X). The interpretation of H(z) is the following: if the initial state is drawn according to z, H(z) is the set of all possible measures on the next state that the decision-maker can generate by using behavior strategies.
First, we define G : X ⇒ ∆(X) by G(x) := Sco F(x) for every x ∈ X. By [1, Theorem 17.35, p. 573], the correspondence G has a closed graph, which is denoted by Graph G. Note that a behavior strategy in the gambling house Γ corresponds to a pure strategy in the gambling house (X, G, r). For every z ∈ ∆(X), we define H(z) as the set of measures $\int_X \sigma(x)\, z(dx)$, where σ : X → ∆(X) ranges over the measurable selectors of G. Note that replacing "∀x ∈ X, σ(x) ∈ G(x)" by "σ(x) ∈ G(x) z-a.s." does not change the above definition (throughout the paper, "a.s." stands for "almost surely").
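When X is finite, H has a concrete matrix form: a measurable selector of G is a row-stochastic matrix whose x-th row lies in G(x), and the induced law of the next state is the mixture $\int_X \sigma(x)\, z(dx)$, i.e. a vector–matrix product. A toy sketch (the numbers are arbitrary):

```python
import numpy as np

# Finite-X version of H (illustration): a selector assigns to each state x
# a probability row selector[x] in G(x); from the law z, the next-state law
# is the mixture sum_x z[x] * selector[x], i.e. the product z @ selector.
z = np.array([0.25, 0.75])
selector = np.array([[0.5, 0.5],
                     [0.0, 1.0]])
next_law = z @ selector          # [0.125, 0.875]
```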
By Proposition 1, H has nonempty values. We now check that the correspondence H has a closed graph.
Proposition 3. The correspondence H has a closed graph.
Proof. Let $(z_n, z'_n)_{n \in \mathbb{N}}$ be a sequence of Graph H converging to some (z, z′). By definition of H, for every n ∈ N, there exists a measurable selector σ_n : X → ∆(X) of G such that for every f ∈ C(X, [0, 1]),
$$\int_X f\, dz'_n = \int_X \left( \int_X f(y)\, \sigma_n(x)(dy) \right) z_n(dx).$$
Let π_n ∈ ∆(Graph G) be such that the first marginal of π_n is z_n, and the conditional distribution of π_n knowing x ∈ X is $\delta_{\sigma_n(x)} \in \Delta(\Delta(X))$. By definition, for every f ∈ C(X, [0, 1]), we have
$$\int_X f\, dz'_n = \int_{\mathrm{Graph}\, G} \left( \int_X f(y)\, u(dy) \right) \pi_n(dx, du).$$
The set ∆(Graph G) is compact, thus there exists π a limit point of the sequence (π_n)_{n∈N}. By definition of the weak* topology on ∆(X) and on ∆(Graph G), the previous equation yields
$$\int_X f\, dz' = \int_{\mathrm{Graph}\, G} \left( \int_X f(y)\, u(dy) \right) \pi(dx, du).$$
To conclude, let us disintegrate π. Let z be the first marginal of π. The sets X and ∆(X) are compact metric spaces, thus there exists a probability kernel K : X × B(∆(X)) → [0, 1] such that
• for every x ∈ X, K(x, ·) ∈ ∆(∆(X)),
• for every B ∈ B(∆(X)), K(·, B) is measurable,
• for every h ∈ C(X × ∆(X), [0, 1]), $\int h\, d\pi = \int_X \int_{\Delta(X)} h(x, u)\, K(x, du)\, z(dx)$.

Invariant measure
The first lemma associates a fixed point of the correspondence H to each initial state.

Lemma 1. Let x_0 ∈ X. There exists a distribution µ* ∈ ∆(X) such that
• µ* ∈ H(µ*), that is, µ* is H-invariant;
• for every ε > 0 and N ≥ 1, there exists a (pure) strategy σ_0 and n ≥ N such that σ_0 is ε-optimal in the n-stage problem and $d_{KR}\!\left( \frac{1}{n} \sum_{m=1}^{n} z_m(x_0, \sigma_0),\, \mu^* \right) \leq \varepsilon$, where z_m(x_0, σ_0) ∈ ∆(X) is the distribution of x_m, the state at stage m, given the initial state x_0 and the strategy σ_0;
• $v(x_0) \leq \hat v(\mu^*)$.
Moreover, we have where diam(X) is the diameter of X.
The set ∆(X) is compact. Up to taking a subsequence, there exists µ* ∈ ∆(X) such that (v_n(x_0)) converges to v(x_0) and the occupation measures converge to µ*. By inequality (7), the shifted occupation measures also converge to µ*. Because H has a closed graph, we have µ* ∈ H(µ*), and µ* is H-invariant. By construction, the second property is immediate.
Finally, the third property follows from the fact that $\hat v$ is decreasing in expectation along trajectories: the sequence $(\hat v(z_m(x_0, \sigma_0)))_{m \geq 1}$ is decreasing, thus for every n ≥ 1, $v(x_0) \leq \hat v\!\left( \frac{1}{n} \sum_{m=1}^{n} z_m(x_0, \sigma_0) \right) + \varepsilon$. Taking n to infinity, by continuity of $\hat v$, we obtain that $v(x_0) \leq \hat v(\mu^*) + \varepsilon$, and since ε is arbitrary, $v(x_0) \leq \hat v(\mu^*)$.
In the next section, we prove that in Γ(µ * ), under the strategy σ * , the average payoffs converge almost surely to v(x), where x is the initial (random) state.
Theorem 5 (pathwise ergodic theorem). Let (X, B) be a measurable space, and ξ be a Markov chain on (X, B), with transition probability function P. Let µ be an invariant probability measure for P. For every function f integrable with respect to µ, there exist a set B_f ∈ B and a function f* integrable with respect to µ, such that µ(B_f) = 1, and for all x ∈ B_f,
$$\frac{1}{n} \sum_{m=1}^{n} f(\xi_m) \xrightarrow[n \to +\infty]{} f^*(x) \quad P_x\text{-almost surely}.$$
Moreover, $\int_X f^*\, d\mu = \int_X f\, d\mu$.

Lemma 2. Let x_0 ∈ X and µ* ∈ ∆(X) be the corresponding invariant measure (see Lemma 1). There exist a measurable set B ⊂ X such that µ*(B) = 1 and a stationary strategy σ* : X → ∆(X) such that for all x ∈ B,
$$\frac{1}{n} \sum_{m=1}^{n} r(x_m) \xrightarrow[n \to +\infty]{} v(x) \quad P^{\sigma^*}_x\text{-almost surely}.$$

Proof. Because µ* is a fixed point of H, there exists σ* : X → ∆(X) a measurable selector of G (thus, a behavior stationary strategy in Γ) such that for all f ∈ C(X, [0, 1]),
$$\int_X f\, d\mu^* = \int_X \left( \int_X f(y)\, \sigma^*(x)(dy) \right) \mu^*(dx).$$
Consider the gambling house Γ(µ*). Under σ*, the sequence of states (x_m)_{m∈N} is a Markov chain with invariant measure µ*. From Theorem 5, there exist a measurable set B_0 ⊂ X such that µ*(B_0) = 1, and a measurable map w : X → [0, 1] such that for all x ∈ B_0, we have
$$\frac{1}{n} \sum_{m=1}^{n} r(x_m) \xrightarrow[n \to +\infty]{} w(x) \quad P^{\sigma^*}_x\text{-almost surely},$$
and $\hat w(\mu^*) = \hat r(\mu^*)$.
This implies that w = v, $P^{\sigma^*}_{\mu^*}$-a.s., and the lemma is proved.
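For a finite-state chain the invariant measure of Theorem 5 can be computed directly: µ solves µP = µ with total mass 1, and the ergodic average of r then has long-run mean µ · r. A sketch with a hypothetical two-state kernel:

```python
import numpy as np

# Hypothetical two-state transition kernel and payoff (illustration only).
P = np.array([[0.8, 0.2],
              [0.4, 0.6]])
r = np.array([0.0, 1.0])

# Solve mu P = mu together with sum(mu) = 1 as a least-squares system.
A = np.vstack([P.T - np.eye(2), np.ones(2)])
b = np.array([0.0, 0.0, 1.0])
mu, *_ = np.linalg.lstsq(A, b, rcond=None)
# mu = [2/3, 1/3]; long-run mean payoff mu @ r = 1/3
```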
Recall that the family $\{v_n, n \geq 1\} \cup \{w_\infty\}$ is uniformly equicontinuous; let η denote a common modulus of continuity. Then v is also uniformly continuous with the same modulus of continuity.
Lemma 3. Let ε > 0 and x, y ∈ X, and let σ* be a strategy such that $\gamma_\infty(x, \sigma^*) = v(x)$. Then there exists a strategy σ such that
$$\gamma_\infty(y, \sigma) \geq v(y) - 2\eta(d(x, y)) - \varepsilon.$$

Proof. By assumption, we have $w_\infty(x) \geq \gamma_\infty(x, \sigma^*) = v(x)$. By definition of w_∞(y), there exists a strategy σ such that $\gamma_\infty(y, \sigma) \geq w_\infty(y) - \varepsilon$. Because $w_\infty$ and v are uniformly continuous with modulus η,
$$\gamma_\infty(y, \sigma) \geq w_\infty(x) - \eta(d(x, y)) - \varepsilon \geq v(x) - \eta(d(x, y)) - \varepsilon \geq v(y) - 2\eta(d(x, y)) - \varepsilon.$$
We can now finish the proof of Theorem 1.

Conclusion of the proof
Proof of Theorem 1. We can now put Lemmas 1, 2 and 3 together to finish the proof of Theorem 1. Fix an initial state x_0 ∈ X and ε > 0. We will define a strategy σ as follows: start by following a strategy σ_0 until some stage n_3, then switch to another strategy depending on the state x_{n_3}. We first define the stage n_3, then build the strategy σ, and finally check that this strategy indeed guarantees a payoff close to v(x_0) in the infinitely repeated gambling house Γ_∞.
By assumption, the family $(v_n)_{n \geq 1}$ is uniformly equicontinuous. Consequently, there exists n_0 ∈ N* such that for all n ≥ n_0 and for all x ∈ X, $v_n(x) \leq v(x) + \varepsilon$.
We first apply Lemma 1 to x_0, with ε³ in place of ε and N = 2n_0. There exist an invariant measure µ*, a (pure) strategy σ_0 and n_1 ≥ 2n_0 satisfying the conclusion of Lemma 1; in particular,
$$v_{n_1}(x_0) \geq v(x_0) - \varepsilon \quad \text{and} \quad d_{KR}\!\left( \frac{1}{n_1} \sum_{m=1}^{n_1} z_m(x_0, \sigma_0),\, \mu^* \right) \leq \varepsilon^3. \qquad (8)$$
Let B be given by Lemma 2. In general, there is no hope to prove the existence of a stage m such that z_m(x_0, σ_0) is close to µ*. Instead, we prove the existence of a stage n_3 such that under the strategy σ_0, x_{n_3} is with high probability close to B, and $\hat v(z_{n_3}(x_0, \sigma_0))$ is close to v(x_0).
Let $n_2 := \lfloor \varepsilon n_1 \rfloor + 1$, $A := \{x \in X \mid d(x, B) \leq \varepsilon\}$ and $A^c := \{x \in X \mid d(x, B) > \varepsilon\}$. We denote $\mu_{n_1} := \frac{1}{n_1} \sum_{m=1}^{n_1} z_m(x_0, \sigma_0)$. By a standard property of the KR distance, there exists a coupling γ ∈ ∆(X × X) such that the first marginal of γ is $\mu_{n_1}$, the second marginal is µ*, and $\int_{X \times X} d(s, t)\, \gamma(ds, dt) \leq \varepsilon^3$. Because µ*(B) = 1, we deduce that $\mu_{n_1}(A^c) \leq \varepsilon^2$. Because the n_2 first stages have a weight of order ε in $\mu_{n_1}$, we deduce the existence of a stage $n_3 \in \{n_2, \dots, n_1\}$ such that $z_{n_3}(A^c) \leq \varepsilon$. Moreover, $\hat v(z_{n_3}(x_0, \sigma_0))$ is greater than $v(x_0)$ up to a margin of order ε; this follows from Equation (8) and the last inequality. We have defined both the initial strategy σ_0 and the switching stage n_3. To conclude, we use Lemma 3 in order to define the strategy from stage n_3 on. Note that in Lemma 3, we did not prove that the strategy σ could be selected in a measurable way with respect to the state. Thus, we need to use a finite approximation. The set X is a compact metric space, thus there exists a partition {P_1, ..., P_L} of X such that for every l ∈ {1, ..., L}, P_l is measurable and diam(P_l) ≤ ε. It follows that there exists a finite subset {x_1, ..., x_L} of B such that for every l ∈ {1, ..., L} and every x ∈ A ∩ P_l, $d(x, x_l) \leq 3\varepsilon$. We denote by ψ the application which associates to every x ∈ A ∩ P_l the state x_l.
We define the strategy σ as follows:
• Play σ_0 until stage n_3.
• If x_{n_3} ∈ A, then there exists l ∈ {1, ..., L} such that x_{n_3} ∈ P_l; play the strategy given by Lemma 3, with x = x_l and y = x_{n_3}. If x_{n_3} ∉ A, play any strategy.
Let us check that the strategy σ guarantees a good payoff with respect to the long-run average payoff criterion. By definition of σ and by Lemma 3, $\gamma_\infty(x_0, \sigma)$ is greater than $v(x_0)$ up to an error term expressed in terms of ε and η(ε). Because η(0) = 0 and η is continuous at 0, the gambling house Γ(x_0) has a strong uniform value, and Theorem 1 is proved.

Proofs of Theorem 2, Theorem 3 and Theorem 4
This section is dedicated to the proofs of Theorems 2, 3 and 4. Theorems 2 and 3 stem from Theorem 1. Theorem 4 is not a corollary of Theorem 1: applying Theorem 1 to the framework of POMDPs would only yield the existence of the uniform value in pure strategies, and not the existence of the strong uniform value.

Proof of Theorem 2
Let Γ := (X, F, r) be a gambling house such that F is 1-Lipschitz. Without loss of generality, we can assume that r is 1-Lipschitz. Indeed, any continuous payoff function can be uniformly approximated by Lipschitz payoff functions, and dividing the payoff function by a constant does not change the decision problem.
In order to prove Theorem 2, it is sufficient to prove that for all n ≥ 1, v_n is 1-Lipschitz, and that w_∞ is 1-Lipschitz. Indeed, this implies that the family $\{v_n, n \geq 1\}$ is uniformly equicontinuous and w_∞ is continuous. Theorem 2 then stems from Theorem 1.
Recall that G : X ⇒ ∆(X) is defined by G(x) := Sco F(x).

Lemma 4. If F is 1-Lipschitz, then G is 1-Lipschitz.

Proof. Let x and x′ be two states in X. Fix µ ∈ G(x). Let us show that there exists µ′ ∈ G(x′) such that $d_{KR}(\mu, \mu') \leq d(x, x')$. By definition of G(x), there exists ν ∈ ∆(F(x)) such that µ = Bar(ν), that is, for all g ∈ C(X, [0, 1]),
$$\int_X g\, d\mu = \int_{\Delta(X)} \left( \int_X g\, du \right) \nu(du).$$
Consider the correspondence Φ : F(x) ⇒ ∆(X) defined by $\Phi(u) := \{w \in F(x') \mid d_{KR}(u, w) \leq d(x, x')\}$. Because F is 1-Lipschitz, Φ has nonempty values. Moreover, Φ is the intersection of two correspondences with a closed graph, therefore it is a correspondence with a closed graph. Applying Proposition 1, we deduce that Φ has a measurable selector ϕ : F(x) → ∆(X). Let ν′ ∈ ∆(∆(X)) be the image measure of ν by ϕ; throughout the paper, we denote by ϕ#ν the image measure of ν by ϕ. By construction, ν′(F(x′)) = 1 and for all h ∈ C(∆(X), [0, 1]),
$$\int_{\Delta(X)} h\, d\nu' = \int_{\Delta(X)} h(\varphi(u))\, \nu(du).$$
Let µ′ := Bar(ν′), and let f belong to E_1, the set of 1-Lipschitz functions from X to R. The function $\tilde f : u \mapsto \int_X f\, du$ is 1-Lipschitz for $d_{KR}$, and
$$\left| \int_X f\, d\mu - \int_X f\, d\mu' \right| = \left| \int_{\Delta(X)} (\tilde f(u) - \tilde f(\varphi(u)))\, \nu(du) \right| \leq \int_{\Delta(X)} d_{KR}(u, \varphi(u))\, \nu(du) \leq d(x, x'),$$
thus $d_{KR}(\mu, \mu') \leq d(x, x')$.

Because G is 1-Lipschitz, given (x, u) ∈ Graph G and y ∈ X, there exists w ∈ G(y) such that $d_{KR}(u, w) \leq d(x, y)$. For our purpose, we need that the optimal coupling between u and w can be selected in a measurable way. This is the aim of the following lemma.

Lemma 5. There exists a measurable mapping ψ : Graph G × X → ∆(X × X) such that for all (x, u) ∈ Graph G and all y ∈ X,
• the first marginal of ψ(x, u, y) is u,
• the second marginal of ψ(x, u, y) is in G(y),
• $\int_{X \times X} d(s, t)\, \psi(x, u, y)(ds, dt) \leq d(x, y)$.
Proposition 5. Let x, y ∈ X and σ be a strategy. Then there exist a probability measure $P^\sigma_{x,y}$ on H_∞ × H_∞, and a strategy τ, such that:
• $P^\sigma_{x,y}$ has first marginal $P^\sigma_x$,
• $P^\sigma_{x,y}$ has second marginal $P^\tau_y$,
• the following inequality holds: for every n ≥ 1,
$$\mathbb{E}^\sigma_{x,y}\left( \left| \frac{1}{n} \sum_{m=1}^{n} r(X_m) - \frac{1}{n} \sum_{m=1}^{n} r(Y_m) \right| \right) \leq d(x, y),$$
where X_m (resp. Y_m) is the m-th coordinate of the first (resp. second) infinite history.
Proof. Define the stochastic process (X_m, Y_m)_{m≥0} on (X × X)^N such that the conditional distribution of (X_m, Y_m) knowing $(X_l, Y_l)_{0 \leq l \leq m-1}$ is $\psi(X_{m-1}, \sigma_m(X_0, \dots, X_{m-1}), Y_{m-1})$, with ψ defined as in Lemma 5. Let $P^\sigma_{x,y}$ be the law on H_∞ × H_∞ induced by this stochastic process and the initial distribution $\delta_{(x,y)}$. By construction, the first marginal of $P^\sigma_{x,y}$ is $P^\sigma_x$.
For m ∈ N* and (y_0, ..., y_{m−1}) ∈ X^m, define τ_m(y_0, ..., y_{m−1}) ∈ ∆(X) as being the law of Y_m, conditional on Y_0 = y_0, ..., Y_{m−1} = y_{m−1}. By convexity of G, this defines a (behavior) strategy τ in the gambling house Γ. Moreover, the probability measure $P^\tau_y$ is equal to the second marginal of $P^\sigma_{x,y}$.
For all m ∈ N*, we have $P^\sigma_{x,y}$-almost surely
$$\mathbb{E}^\sigma_{x,y}\left( d(X_m, Y_m) \mid (X_l, Y_l)_{0 \leq l \leq m-1} \right) \leq d(X_{m-1}, Y_{m-1}).$$
The random process $(d(X_m, Y_m))_{m \geq 0}$ is a positive supermartingale. Therefore, we have $\mathbb{E}^\sigma_{x,y}(d(X_m, Y_m)) \leq d(x, y)$ for every m ≥ 0. Moreover, the random process $(d(X_m, Y_m))_{m \geq 0}$ converges $P^\sigma_{x,y}$-almost surely to a random variable D, such that $\mathbb{E}^\sigma_{x,y}(D) \leq d(x, y)$. Because r is 1-Lipschitz, for every m ≥ 1 we have $|r(X_m) - r(Y_m)| \leq d(X_m, Y_m)$, $P^\sigma_{x,y}$-a.s.
Integrating the last inequality yields the proposition.
Proposition 5 implies that for all n ≥ 1, v_n is 1-Lipschitz, and that w_∞ is 1-Lipschitz. Thus, Theorem 2 holds.
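The 1-Lipschitz hypothesis on a transition kernel can be checked numerically in a toy finite case: a kernel is 1-Lipschitz when $d_{KR}(q(x), q(y)) \leq d(x, y)$ for all pairs of states; on integer points, $d_{KR}$ is again the L1 distance between CDFs. A sketch with a made-up kernel on the states {0, 1, 2} with d(x, y) = |x − y|:

```python
import numpy as np

# Hypothetical kernel: row Q[x] is the law of the next state from x.
Q = np.array([[1.0, 0.0, 0.0],
              [0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5]])

def w1(p, q):
    """d_KR between two distributions on the integer points 0, 1, 2."""
    return float(np.sum(np.abs(np.cumsum(p - q)[:-1])))

# 1-Lipschitz check: d_KR(Q[x], Q[y]) <= |x - y| for every pair (x, y).
ok = all(w1(Q[x], Q[y]) <= abs(x - y) + 1e-12
         for x in range(3) for y in range(3))
```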

Proof of Theorem 3 for MDPs
In this subsection, we consider an MDP Γ = (K, I, g, q), as described in Subsection 2.2: the state space (K, d_K) and the action set (I, d_I) are compact metric, and the transition function q and the payoff function g are continuous. As in the previous section, without loss of generality, we assume that the payoff function g is 1-Lipschitz.
In the model of gambling house, there is no explicit set of actions. In order to apply Theorem 1 to Γ, we incorporate the action played into the state variable. Indeed, we consider an auxiliary gambling house with state space K × I × K. At each stage m ≥ 1, the state x_m in the gambling house corresponds to the state (k_m, i_m, k_{m+1}) in the MDP. Formally, the auxiliary gambling house is defined as follows:
• The state space is X := K × I × K, equipped with the product distance d defined by $d((k, i, k'), (l, j, l')) := \max(d_K(k, l), d_I(i, j), d_K(k', l'))$.
• The payoff function r : X → [0, 1] is defined by: for all (k, i, k′) ∈ X, r(k, i, k′) := g(k, i).
• The correspondence F : X ⇒ ∆(X) is defined by
$$F(k, i, k') := \left\{ \delta_{k', i'} \otimes q(k', i') \mid i' \in I \right\},$$
where $\delta_{k', i'}$ is the Dirac measure at (k′, i′), and the symbol ⊗ stands for product measure.
Fix some arbitrary state k_0 ∈ K and some arbitrary action i_0 ∈ I. Given an initial state k_1 in the MDP Γ, the corresponding initial state x_0 in the auxiliary gambling house is (k_0, i_0, k_1). By construction, the payoff at stage m in the gambling house starting from x_0 corresponds to the payoff at stage m in Γ(k_1). Now let us check the assumptions of Theorem 1. The state space X is compact metric. Because g is continuous, r is continuous, and the following lemma holds.

Lemma 6. The correspondence F has a closed graph.
Proof. Let $(x_n, u_n)_{n \in \mathbb{N}} \in (\mathrm{Graph}\, F)^{\mathbb{N}}$ be a convergent sequence. By definition of F, for every n ≥ 1, there exist (k_n, i_n, k′_n) ∈ K × I × K and i′_n ∈ I such that x_n = (k_n, i_n, k′_n) and $u_n = \delta_{k'_n, i'_n} \otimes q(k'_n, i'_n)$.
Up to extracting a subsequence, the sequence (k_n, i_n, k′_n, i′_n)_{n≥1} converges to some (k, i, k′, i′) ∈ K × I × K × I. Because the transition q is jointly continuous, we obtain that (u_n) converges to $\delta_{k', i'} \otimes q(k', i')$, which is indeed in F(k, i, k′).
We now prove that for all n ∈ N*, v_n is 1-Lipschitz, and that w_∞ is 1-Lipschitz. It is more convenient to prove this result in the MDP Γ, rather than in the auxiliary gambling house. Thus, in the next proposition, $H_\infty = (K \times I)^{\mathbb{N}^*}$ is the set of infinite histories in Γ, a strategy σ is a map from $\cup_{m \geq 1} K \times (I \times K)^{m-1}$ to ∆(I), and $P^\sigma_{k_1}$ denotes the probability over H_∞ generated by the pair (k_1, σ). The following proposition is similar to Proposition 5.
• The following inequalities hold: for every n ≥ 1, where K_m, I_m (resp. K'_m, I'_m) is the m-th coordinate of the first (resp. second) infinite history.
• Under P^σ_{k_1, k'_1}, for all m ≥ 1, I_m = I'_m.
Proof. Exactly as in Lemma 5, one can construct a measurable mapping ψ : K × K × I → ∆(K × K) such that for every (k, k', i) ∈ K × K × I, ψ(k, k', i) is an optimal coupling between q(k, i) and q(k', i) for the KR distance.
We define a stochastic process on I × K × I × K in the following way: given an arbitrary action i_0, we set I_0 = I'_0 = i_0, K_1 = k_1, K'_1 = k'_1. Then, for all m ≥ 1, given (I_{m−1}, K_m, I'_{m−1}, K'_m), we construct (I_m, K_{m+1}, I'_m, K'_{m+1}) as follows: • I_m is drawn from σ(K_1, I_1, ..., K_m), • we set I'_m := I_m, • the pair (K_{m+1}, K'_{m+1}) is drawn from ψ(K_m, K'_m, I_m).
By construction, P^σ_{k_1, k'_1} has first marginal P^σ_{k_1}. For m ≥ 1 and h_m = (k_1, i_1, ..., k_m) ∈ H_m, define τ(h_m) ∈ ∆(I) as the law of I'_m, conditional on K'_1 = k_1, I'_1 = i_1, ..., K'_m = k_m. This defines a strategy. Moreover, since ψ selects optimal couplings, for all m ≥ 1 we have E^σ_{k_1, k'_1}[ d_K(K_{m+1}, K'_{m+1}) | K_m, K'_m, I_m ] ≤ d_K(K_m, K'_m). The process (d_K(K_m, K'_m))_{m≥1} is a nonnegative supermartingale, thus it converges almost surely. We conclude exactly as in the proof of Proposition 5.
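The optimal-coupling step can be illustrated in a simple special case: on a finite subset of the real line, the monotone (inverse-CDF) coupling is optimal for the KR (Wasserstein-1) distance. The sketch below is a hypothetical toy instance of one value ψ(k, k', i), not the paper's measurable selection; distributions are lists of (point, mass) pairs sorted by point.

```python
def monotone_coupling(mu, nu):
    """Greedy inverse-CDF coupling of two finitely supported distributions
    on the real line, each given as (point, mass) pairs sorted by point."""
    mu, nu = list(mu), list(nu)
    coupling = []
    i = j = 0
    mi, mj = mu[0][1], nu[0][1]
    while i < len(mu) and j < len(nu):
        m = min(mi, mj)                       # transport the common mass
        coupling.append(((mu[i][0], nu[j][0]), m))
        mi -= m
        mj -= m
        if mi <= 1e-15:
            i += 1
            mi = mu[i][1] if i < len(mu) else 0.0
        if mj <= 1e-15:
            j += 1
            mj = nu[j][1] if j < len(nu) else 0.0
    return coupling

def w1_cost(coupling):
    """Expected distance between the coupled coordinates."""
    return sum(m * abs(x - y) for (x, y), m in coupling)

# Toy distributions: for these, the KR distance equals 1.0, and the
# monotone coupling attains it.
mu = [(0, 0.5), (2, 0.5)]
nu = [(1, 0.5), (3, 0.5)]
pi = monotone_coupling(mu, nu)
assert abs(w1_cost(pi) - 1.0) < 1e-12
```

On general compact metric spaces an optimal coupling still exists but has no such closed form; the proof only needs a measurable selection ψ of optimal couplings.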
The previous proposition implies that the value functions v_n and w_∞ are 1-Lipschitz. Therefore, the family {v_n, n ≥ 1} is equicontinuous, and w_∞ is continuous. By Theorem 1, the gambling house Γ̃ has a strong uniform value. It follows that the MDP Γ has a strong uniform value, and Theorem 3 holds. In [18], the auxiliary gambling house associated to an MDP is defined slightly differently: instead of taking K × I × K as the auxiliary state space, they take [0, 1] × K, where the first component represents the stage payoff. In our framework, applying this method would lead to a measurability problem when trying to transform a strategy in the auxiliary gambling house into a strategy in the MDP.

Proof of Theorem 4 for POMDPs
In this subsection, we consider a POMDP Γ = (K, I, S, g, q), as described in Subsection 2.3: the state space K and the signal space S are finite, the action set (I, d I ) is compact metric, and the transition function q and the payoff function g are continuous.
A standard way to analyze Γ is to consider the belief p_m ∈ ∆(K) about the state at stage m as a new state variable, and thus to consider an auxiliary problem in which the state is perfectly observed and lies in ∆(K) (see [19], [21], [22]). The function g is linearly extended to ∆(K) × ∆(I) in the following way: for all (p, u) ∈ ∆(K) × ∆(I), g(p, u) := Σ_{k∈K} p(k) ∫_I g(k, i) u(di).
Let q̃ : ∆(K) × I → ∆(∆(K)) be the transition on the beliefs about the state induced by q: if at some stage of the POMDP the belief of the decision-maker is p, and he plays the action i, then his belief about the next state is distributed according to q̃(p, i). We extend the transition q̃ linearly to ∆(K) × ∆(I), in the following way: for all f ∈ C(∆(K), [0, 1]), ∫_{∆(K)} f dq̃(p, u) := ∫_I ( ∫_{∆(K)} f dq̃(p, i) ) u(di). We can also define an auxiliary gambling house Γ̃, with state space [0, 1] × I × ∆(K): at stage m, the auxiliary state x_m corresponds to the triple (g(p_m, i_m), i_m, p_{m+1}). Formally, the gambling house Γ̃ is defined as follows: • The state space is X := [0, 1] × I × ∆(K); the set ∆(K) is equipped with the norm ‖·‖_1, and the distance d on X is d := max(|·|, d_I, ‖·‖_1). • The payoff function r : X → [0, 1] is defined by: for all (a, i, p) ∈ X, r(a, i, p) := a. • The correspondence F : X → ∆(X) is defined by: for all (a, i, p) ∈ X, F(a, i, p) := { δ_{g(p, i')} ⊗ δ_{i'} ⊗ q̃(p, i') : i' ∈ I }.
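The belief dynamics behind q̃ can be sketched in code: after playing i and observing signal s, the belief over K is updated by Bayes' rule, and q̃(p, i) is the resulting law of the posterior. The kernel `q`, the helper names, and the two-state example below are all hypothetical, for illustration only.

```python
def belief_update(p, i, s, q):
    """Posterior over K given prior p, action i, observed signal s.
    q[(k, i)] maps pairs (next state, signal) to probabilities."""
    unnorm = {}
    for k, pk in p.items():
        for (k_next, sig), pr in q[(k, i)].items():
            if sig == s:
                unnorm[k_next] = unnorm.get(k_next, 0.0) + pk * pr
    total = sum(unnorm.values())          # probability of signal s under (p, i)
    if total == 0.0:
        return {}, 0.0
    return {k: v / total for k, v in unnorm.items()}, total

def belief_transition(p, i, q):
    """The induced law over posteriors: a list of (posterior, probability)."""
    signals = {s for key in q for (_, s) in q[key]}
    out = []
    for s in sorted(signals):
        post, prob = belief_update(p, i, s, q)
        if prob > 0:
            out.append((post, prob))
    return out

# Toy kernel: two states, one action; the signal reveals the state
# with probability 0.8 and the state is absorbing.
q = {
    ("H", 0): {("H", "h"): 0.8, ("H", "t"): 0.2},
    ("T", 0): {("T", "t"): 0.8, ("T", "h"): 0.2},
}
p = {"H": 0.5, "T": 0.5}
post_h, prob_h = belief_update(p, 0, "h", q)
assert abs(post_h["H"] - 0.8) < 1e-12     # Bayes: 0.4 / (0.4 + 0.1)
assert abs(prob_h - 0.5) < 1e-12
```

Here `belief_transition(p, 0, q)` returns the two possible posteriors with their probabilities, i.e. a finitely supported version of q̃(p, i).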
By construction, the payoff at stage m in the auxiliary gambling house Γ̃(x_0(p_1)) corresponds to the payoff g(p_m, i_m) in the POMDP Γ(p_1). In particular, for all n ∈ N*, the value of the n-stage gambling house Γ̃(x_0(p_1)) coincides with the value of the n-stage POMDP Γ(p_1), which is denoted by v_n(p_1).
One could check that Γ̃ satisfies the assumptions of Theorem 1, and therefore has a strong uniform value. This would imply in particular that Γ̃ has a uniform value in pure strategies, and it would prove that Γ has a uniform value in pure strategies. Indeed, let p_1 ∈ ∆(K), let σ̃ be a strategy in Γ̃(x_0(p_1)), and let σ be the associated strategy in the POMDP Γ(p_1). For all n ≥ 1, the expected n-stage payoff of σ̃ in Γ̃(x_0(p_1)) equals the expected n-stage payoff of σ in Γ(p_1). Consequently, the fact that Γ̃(x_0(p_1)) has a uniform value in pure strategies implies that Γ(p_1) also has a uniform value in pure strategies. Unfortunately, this approach does not prove Theorem 4, that is, the existence of the strong uniform value in Γ, due to the following problem: the expectation of the inferior limit of the average of the auxiliary payoffs (r(x_m)) may differ from the expectation of the inferior limit of the average of the original payoffs (g(k_m, i_m)). Indeed, r(x_m) is not equal to g(k_m, i_m): it is the expectation of g(k_m, i_m) with respect to p_m. Consequently, the fact that σ̃ is an ε-optimal strategy in Γ̃_∞(x_0(p_1)) does not imply that σ is an ε-optimal strategy in Γ_∞(p_1).
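This obstruction can be made concrete with a hypothetical two-state example of ours (not taken from the paper): the state is H or T with probability 1/2 each and is never revealed, and the stage payoff is a_m in state H and 1 − a_m in state T, for a deterministic 0/1 sequence (a_m) made of blocks of doubling length. Then r(x_m) = 1/2 at every stage, so the inferior limit of its averages is exactly 1/2, while along each state the realized averages oscillate and their inferior limit is close to 1/3.

```python
def payoff_sequence(n_blocks):
    """0/1 sequence in which block j has length 2**j and constant value j % 2:
    long runs of 0s alternate with long runs of 1s."""
    seq = []
    for j in range(n_blocks):
        seq.extend([j % 2] * 2 ** j)
    return seq

a = payoff_sequence(15)        # payoffs in state H; state T receives 1 - a_m

# Running averages of the realized payoffs in state H.
running_avgs = []
s = 0
for m, x in enumerate(a, start=1):
    s += x
    running_avgs.append(s / m)

# The liminf of the realized averages, approximated by the minimum
# over the second half of the horizon: it dips well below 1/2.
tail_min = min(running_avgs[len(running_avgs) // 2:])
assert tail_min < 0.4

# In contrast, the belief-expected payoff r(x_m) = (a_m + (1 - a_m)) / 2
# is constantly 1/2, so the liminf of its averages is 1/2.
assert set((x + (1 - x)) / 2 for x in a) == {0.5}
```

Averaging over the two states, the expectation of the inferior limit of the realized averages is strictly below 1/2, even though the averages of r(x_m) converge to 1/2.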
To prove Theorem 4, we adapt the proof of Theorem 1 to the framework of POMDPs. Recall that the proof of Theorem 1 was decomposed into three lemmas (Lemmas 1, 2 and 3) and a conclusion (Subsection 3.5). We adapt the three lemmas, and the conclusion is similar.
In order to obtain the first lemma, we check that F has a closed graph.
Proposition 7. The correspondence F has a closed graph.
Proof. Let (x_n, u_n)_{n∈N} ∈ (Graph F)^N be a sequence that converges to (x, u) ∈ X × ∆(X). By definition of F, for every n ≥ 1 there exists (a_n, i_n, p_n, i'_n) ∈ [0, 1] × I × ∆(K) × I such that x_n = (a_n, i_n, p_n) and u_n = δ_{g(p_n, i'_n)} ⊗ δ_{i'_n} ⊗ q̃(p_n, i'_n).
Since I is compact, up to extracting a subsequence, the sequence (a_n, i_n, p_n, i'_n)_{n≥1} converges to some (a, i, p, i') ∈ [0, 1] × I × ∆(K) × I, and x = (a, i, p). Because g and q̃ are continuous, (u_n) converges to δ_{g(p, i')} ⊗ δ_{i'} ⊗ q̃(p, i'), which is indeed in F(a, i, p).
We can now state a lemma about pathwise convergence in Γ̃, which replaces Lemma 2. It is proved in Rosenberg, Solan and Vieille [20, Proposition 1]. In their framework, I is finite, but the proof carries over unchanged when I is compact.
Last, we establish the junction lemma, which replaces Lemma 3.
Lemma 9. Let p, p' ∈ ∆(K) and σ be a strategy such that Proof. Let k ∈ K and p_1 ∈ ∆(K). Denote by P^σ_{p_1}(h_∞ | k) the law of the infinite history h_∞ ∈ (K × I × S)^{N*} in the POMDP Γ(p_1), under the strategy σ, conditional on k_1 = k. Then P^σ_p(h_∞ | k) = P^σ_{p'}(h_∞ | k), and For every n ≥ 1, v_n is 1-Lipschitz, thus the function v is also 1-Lipschitz, and the lemma is proved.
The conclusion of the proof is similar to Subsection 3.5. Note that apart from the three main lemmas, the only additional property used in Subsection 3.5 was that the family (v_n)_{n≥1} is uniformly equicontinuous. For every n ≥ 1, v_n is 1-Lipschitz, thus the family (v_n)_{n≥1} is indeed uniformly equicontinuous.

Possible extensions
We discuss here several possible extensions of our results to more general models.
Let us first focus on gambling houses. In this paper, we have considered gambling houses with a compact metric state space. Renault [17] does not assume compactness of the state space. Instead, he assumes that some set of value functions is precompact, and proves the existence of the uniform value. For this purpose, he endows the state space with the topology induced by this set of functions. It is unclear whether we could follow this approach and drop the compactness assumption. As a matter of fact, Renault [17] focuses on probabilities with finite support and mainly uses topological arguments. This is in sharp contrast with our proof, which involves several probabilistic arguments, such as Birkhoff's ergodic theorem.
An intermediate step would be to consider a precompact state space. An important issue is that the invariant measure of Subsection 3.2 may fail to exist. One way to avoid this problem could be to extend the correspondence to the closure of X, and to apply the main theorem to the auxiliary gambling house associated with the extended correspondence. Nonetheless, it is not obvious that a strategy in the auxiliary gambling house can be approximated in a proper way by a strategy in the original gambling house.
In the same spirit, it is natural to ask whether the assumption that F has closed values can be dropped. The same problem arises for the existence of the invariant measure. Moreover, our selection theorem no longer holds. Lastly, the boundedness assumption on the payoff function is necessary to build the invariant measure.
Generalizing our results to precompact gambling houses would make it possible to consider more general MDP and POMDP models. Indeed, one may wonder whether our results extend to MDPs with a precompact state space, or with a noncompact action set. Our approach does not cover these two cases, because the auxiliary gambling house defined in Subsection 4.2 would only be defined on a precompact state space, and may not have closed values.
Another extension could be to allow for state-dependent action sets. Following our proof, the state space of the auxiliary gambling house in Subsection 4.2 would then be K × (∪_{k∈K} I(k)) × K. It is not obvious which kind of assumptions have to be made on the transition and the payoff function to turn this auxiliary state space into a compact metric space such that the auxiliary gambling house satisfies the assumptions of Theorem 1.
As far as POMDPs are concerned, we assumed in Theorem 4 that the state space and the signal set are finite. If we assume instead that the state space is compact metric, it is not clear which kind of assumption should be made on the transition function for Lemma 9 to hold. Moreover, if we assume that the signal set is compact metric, then the correspondence of the auxiliary gambling house may fail to be upper hemicontinuous, as Example 4.1 in Feinberg [12] shows.