Optimal trading with online parameters revisions

The aim of this paper is to explain how parameter adjustments can be integrated in the design or the control of automated trading strategies. Typically, we are interested in the online estimation of the market impact generated by robots or single orders, and in how the controller should react optimally to the information generated by the observation of the realized impacts. This can be formulated as an optimal impulse control problem with unknown parameters, on which a prior is given. We explain how a combination of the classical Bayesian updating rule and of optimal control techniques allows one to derive the dynamic programming equation satisfied by the corresponding value function, from which the optimal policy can be inferred. We provide an example of a convergent finite difference scheme and consider typical examples of applications.

surveys as well as [10] for recent references. The knowledge of this impact is crucial. However, unlike other market parameters, it can be observed only when the algorithm is actually running, so that offline estimation is not possible. As a matter of fact, the controller faces a classical dilemma: should he impulse the system (send orders) to gain more information immediately, at the price of possible immediate losses, or rather try to maximise his current expected reward, with the risk of not learning for the future? In any case, the optimal trading policy has to incorporate the fact that the knowledge of the impact parameters will evolve over time, as the effects of the robot on the system are revealed, and that the uncertainty on their value is a source of risk. One way to analyse this situation is to use the multi-armed bandit recursive learning approach of [19,20]. By sending successive impulses to the system, one increases one's knowledge of the true distribution of the response function. It provides asymptotically optimal policies. Its main advantage is that it is model free. On the other hand, it requires (weak) ergodicity-type conditions, which have little chance of being satisfied if the price/order book is actually impacted. Moreover, the global flow of orders is not optimal; it only becomes optimal in the long range.
In this paper, we propose to use the classical Bayesian approach, see e.g. [15] for general references. A similar idea has already been suggested in [1] in the context of optimal trading, in a different and very particular framework (trend estimation). In this approach, one fixes an a priori distribution on the true value of the parameters. When a new piece of post-trade information arrives, this distribution is updated according to the Bayesian rule to provide an a posteriori distribution, which is used as a new prior for the next (bunch of) trade(s). This can, in theory, address quite general model-free optimal control problems, in the sense that the unknown parameter can be the whole distribution of the response function. Still, the a posteriori distribution is always dominated by the prior: if the true distribution is not, the sequence of estimators cannot converge to it. Also, in practice, computational time constraints will force one to restrict to a class of parameterized distributions, so as to reduce to a space of small dimension. Hence, this requires choosing a set of possible models. Our procedure will only reveal which parameter values are the most likely, given this prescribed framework. In practice, classes of models are already used by practitioners, but their parameters are difficult to estimate at a large scale in an automatic way, and they evolve with time and market conditions. From this perspective, we believe that our approach allows one to address these estimation issues in an efficient way. There are two ways to consider this updating mechanism. One is that it allows one to estimate the true value of the parameters while acting in an optimal manner. This is the point of view developed in [15], which provides conditions, for a discrete-time infinite-horizon model, ensuring the convergence of the a posteriori law to the Dirac mass at the true parameter value.
Obviously, this requires identification conditions, since we can observe the effect on the system only along the optimal policy, as well as some ergodicity conditions. Another point of view is that the updating mechanism actually falls out automatically from dynamic programming considerations when solving the optimal control problem associated to the initial prior.
We focus here on the second point of view: given a prior on the true parameter, how should we act in an optimal way? After all, we want to be optimal given a prior, and not act on the system just to refine the prior if this is not optimal. Obviously, if the algorithm has to be used repeatedly on the same market, and if market conditions are stable, the a posteriori distribution obtained after running the algorithm can be used as a new, more precise, prior for the next time it is launched. The question of convergence to a Dirac distribution is left for future research; it is, in any case, more of academic interest.
We consider here the general abstract formalism introduced in [4], which aims at covering most models used in practice. In particular, we can work at the meta-order level (control of smart routines) or at the level of single orders (design of smart routines, or scheduling). In the first case, we are interested in the optimal control of robots that are already given; this is similar to [8]. We only observe the global impact of the robot, after it stops. The second case concerns the design of the robots themselves, as in e.g. [2,16,19,20]. A typical question is whether an order should be passive or aggressive. When an aggressive order is posted, we possibly observe an impact immediately (unless it is of small size). As for a (bunch of) limit order(s), we infer the speed at which they have been executed, if they are executed before being cancelled. The choice of the trading platform can be addressed similarly, etc.
The rest of this paper is organized as follows. After presenting the general framework of [4], we provide their main characterization: the value function solves a quasi-variational partial differential equation from which one can infer the optimal trading policy. We then explain how this equation can be solved numerically. This is illustrated by toy models inspired by the literature, on which simulation-based optimal strategies are provided.

Abstract framework
Before considering typical examples of application, we present the abstract framework proposed in the companion paper [4]. We believe that it is flexible enough to cover most problems encountered in practice.
We model the driving noise by a d-dimensional Brownian motion W (defined on the canonical space C([0, T], R^d) endowed with the Wiener measure P). One could also consider jump-type processes, such as compound Poisson processes; the same analysis would apply. The unknown parameter υ is supported by a (Polish) space (U, B(U)), and our initial prior on υ is assumed to belong to a locally compact subset M of the set of Borel probability measures on U (endowed with the topology of weak convergence). In the applications of Section 4, the collection of possible priors can be identified with a subset of a finite-dimensional space (e.g. the parameters of a Gaussian distribution, the weights of a law with finite support, etc.). Then, M can simply be viewed as a finite-dimensional space. To allow for additional randomness in the measurement of the effects of trades on the system, we consider another (Polish) space E on which is defined a family (ε_i)_{i≥0} of i.i.d. random variables with common law P_ε on E. On the product space Ω := C([0, T], R^d) × U × E^N, we consider the family of measures {P × m × P_ε^{⊗N} : m ∈ M} and denote by P_m an element of this family whenever m ∈ M is fixed. The operator E_m is the expectation associated to P_m. Note that W, υ and (ε_i)_{i≥0} are independent under each P_m. For m ∈ M given, we let F^m = (F^m_t)_{t≥0} denote the P_m-augmentation of the filtration F = (F_t)_{t≥0} defined by F_t = σ((W_s)_{s≤t}, υ, (ε_i)_{i≥0}) for t ≥ 0. Hereafter, all the random variables are considered with respect to the probability space (Ω, F^m_T), with m ∈ M given by the context, and where T is a fixed time horizon.

The controlled system
Let A ⊂ [0, T] × R^d be a (non-empty) compact set. It is the set in which our controls (trading policies) take their values. Given N ∈ N and m ∈ M, we denote by Φ^{◦,m}_N the collection of sequences of random variables φ = (τ_i, α_i)_{1≤i≤N}, which will be our impulse controls, and we write each α_i in the form α_i = (ℓ_i, β_i). More precisely, the τ_i's are the times at which an impulse is made on the system (e.g. a trading robot is launched), β_i models the nature of the order sent at time τ_i (e.g. the parameters used for the trading robot), and ℓ_i stands for the maximal time length during which no new intervention on the system can be made (e.g. the time prescribed to the robot to send orders to the market).
From now on, we shall always use the notation φ = (τ^φ_i, ℓ^φ_i, β^φ_i)_{1≤i≤N} for such controls. We allow for not observing, nor being able to act on, the system before a random time ϑ^φ_i, which is determined by the controlled state process X^φ (stock prices, market volumes, wealth, etc.) described below. In the case where the actions consist in launching a trading robot at τ^φ_i during a certain time ℓ^φ_i, we can naturally take ϑ^φ_i = τ^φ_i + ℓ^φ_i. If the action consists in placing a limit order during a maximal duration ℓ^φ_i, then ϑ^φ_i is the time at which the limit order is executed, if this happens before τ^φ_i + ℓ^φ_i, and τ^φ_i + ℓ^φ_i otherwise. Let us now describe our controlled state process. Given some initial data z = (t, x) ∈ Z, the process X^φ follows the dynamics (2.4) below. Depending on the choice of the model, the different components of X can be the cumulative gains of the algorithm, the number of shares held, the mid-price, the size and positions of the first bid and ask queues, the position of the trader in terms of limit orders in the different queues, a factor driving the system, a flow of external information provided by experts, the current market volume, etc. This can also be the time itself, if one wants to have time-dependent dynamics. This is quite flexible, and we will exemplify it in Section 4.
In the above, the map (μ, σ) is continuous, and Lipschitz with linear growth in its second argument, uniformly in the first one.

X^φ_s = x + ∫_t^s 1_{N^φ}(r) μ(r, X^φ_r) dr + ∫_t^s 1_{N^φ}(r) σ(r, X^φ_r) dW_r + Σ_{i≥1} 1_{{ϑ^φ_i ≤ s}} [F(X^φ_{τ^φ_i}, α^φ_i, υ, ε_i) − X^φ_{τ^φ_i}].   (2.4)
These dynamics mean the following. When no action is currently made on the system, i.e. on the intervals in N^φ, the system evolves according to a stochastic differential equation driven by the Brownian motion W. When an order is sent at τ^φ_i, we freeze the dynamics up to the end of the action (of the robot, of the execution/cancellation of the order) at time ϑ^φ_i. This amounts to saying that we do not observe the current evolution up to ϑ^φ_i, or equivalently that no corrective action will be taken before the end of the already launched operation at ϑ^φ_i. At the end of the action, the state process takes a new value X^φ_{ϑ^φ_i} = F(X^φ_{τ^φ_i}, α^φ_i, υ, ε_i). The fact that F depends on the unknown parameter υ and the additional noise ε_i models the fact that the correct model is not known with certainty, and that the exact value of υ can (possibly) not be measured precisely just by observing the impacted state. In order to simplify the notations, we shall write Z[z, a, u, ε] for the time-space value of the state at the end of an action a ∈ A started at z ∈ Z, when the unknown parameter equals u and the additional noise is ε (equation (2.5)). From now on, we denote by F^{z,m,φ} = (F^{z,m,φ}_s)_{t≤s≤2T} the P_m-augmentation of the filtration generated by the observable quantities. Hereafter, an admissible control will be an element of Φ^{z,m}.
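To fix ideas, the freeze-and-jump mechanism described above can be sketched in a few lines. Everything here is illustrative: the drift, volatility, impact map F, impulse times and noise sizes are hypothetical placeholders, not the paper's specification.

```python
import numpy as np

# Sketch of the controlled dynamics: diffuse between impulses, freeze the
# state during each latency period [tau_i, theta_i], then apply the jump map
# F(x, beta, upsilon, eps) when the impact is revealed at theta_i.
rng = np.random.default_rng(1)
dt, T = 0.01, 1.0
mu, sigma = 0.0, 0.2
upsilon = 0.05                                  # unknown in the actual problem
impulses = [(0.2, 0.1, 2.0), (0.6, 0.1, 1.0)]   # (tau_i, ell_i, beta_i)

def F(x, beta, u, eps):
    # effect of an order of size beta: linear impact plus measurement noise
    return x + u * beta + eps

x, path = 100.0, []
pending = None                                  # (end_of_latency, beta) if an order is running
for t in np.arange(0.0, T, dt):
    if pending is not None:
        if t >= pending[0]:                     # latency over: impact revealed
            x = F(x, pending[1], upsilon, rng.normal(0.0, 1e-3))
            pending = None
        # otherwise the state stays frozen
    else:
        x += mu * dt + sigma * np.sqrt(dt) * rng.normal()  # Euler step
        for tau, ell, beta in impulses:
            if abs(t - tau) < dt / 2:
                pending = (tau + ell, beta)     # freeze until tau + ell
    path.append(x)
```

This is only a discretized caricature of (2.4); in the paper the impulse times are chosen by the controller, not fixed in advance.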

Bayesian updates
As already mentioned, acting on the system reveals some information on the true parameter value: the prior distribution evolves over time. It should therefore be considered as a state variable, in order to remain time consistent and be able to derive a dynamic programming equation. Note also that its evolution can be of interest in itself. One can, for instance, be interested in the precision of our (updated) prior at the end of the control period, as it can serve as a new prior for another control problem.
In this section, we describe how it is updated with time, according to the usual Bayesian procedure. Given z = (t, x) ∈ Z, u ∈ U and a ∈ A, we write the law of Z[z, a, u, ε_1], recall (2.5), in the form q(·|z, a, u)dQ(·|z, a), in which q(·|·) is a Borel measurable map and Q(·|z, a) is a dominating measure on Z, for each (z, a) ∈ Z × A. These quantities can be inferred from the knowledge of z and the law of ε_1. As no new information is revealed between the end of an order and the start of the next one, the prior remains constant on these time intervals, with the conventions ϑ^φ_0 = 0 and M^{z,m,φ}_0 = m. But M^{z,m,φ} jumps at each time ϑ^φ_i at which the effect of the last sent order is revealed, thus bringing new information on the unknown parameter υ. This prior update follows the classical Bayes rule:

dM^{z,m,φ}_{ϑ^φ_i}(u) = [ q(Z^{z,φ}_{ϑ^φ_i} | Z^{z,φ}_{τ^φ_i}, α^φ_i, u) / ∫_U q(Z^{z,φ}_{ϑ^φ_i} | Z^{z,φ}_{τ^φ_i}, α^φ_i, u') dM^{z,m,φ}_{τ^φ_i}(u') ] dM^{z,m,φ}_{τ^φ_i}(u).   (2.9)

We refer to [4] for a formal proof of this intuitive fact.
Remark 2.1. Again, the parameter is unknown, but we can have an idea of which values are more likely to be correct. As can be seen from (2.9), M^{z,m,φ} remains absolutely continuous with respect to m over time. The prior distribution should therefore have a support large enough to include the true value, otherwise it will not be seen by the a posteriori distributions either. In practice, one can simply fix a uniform distribution on a rectangle to which we are certain that the true parameter belongs. If one is only interested in a crude approximation, but wants to minimize the computation time, one can simply specify several low/medium/high values and concentrate the support of m_0 on these (i.e. start with a combination of Dirac masses).
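For a prior concentrated on a few Dirac masses, as suggested in the remark, the Bayesian update reduces to reweighting finitely many atoms. A minimal sketch, with a hypothetical Gaussian observation likelihood:

```python
import numpy as np

def bayes_update(support, weights, likelihood, obs):
    """One Bayesian update of a prior with finite support (Dirac masses).

    support    : candidate parameter values u_1, ..., u_k
    weights    : prior probabilities, summing to one
    likelihood : (obs, u) -> density q(obs | u) of the post-trade observation
    """
    # Posterior is proportional to likelihood times prior; note that it
    # remains absolutely continuous with respect to the initial prior.
    post = np.array([likelihood(obs, u) for u in support]) * weights
    total = post.sum()
    return weights if total == 0.0 else post / total

# Toy usage: low/medium/high candidate impact levels, Gaussian observation noise.
support = np.array([0.01, 0.05, 0.10])
weights = np.full(3, 1 / 3)
noise_std = 0.02
lik = lambda y, u: np.exp(-0.5 * ((y - u) / noise_std) ** 2)
posterior = bayes_update(support, weights, lik, obs=0.06)
```

An observation near 0.05 shifts the mass towards the medium atom; atoms outside the initial support can never be recovered, which is exactly the point of the remark.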
Remark 2.2. For later use, note that the above provides the joint conditional distribution of (Z^{z,φ}_{ϑ^φ_i}, M^{z,m,φ}_{ϑ^φ_i}) given the information available at τ^φ_i.

Gain function
Given z = (t, x) ∈ Z and m ∈ M, the aim of the controller is to maximize the expected value of the gain functional g(Z^{z,φ}_{T[φ]}, M^{z,m,φ}_{T[φ]}, υ), where T[φ] is the end of the last action when it finishes after T, and T otherwise. Note that we do not look at the value of Z^{z,φ} at T but rather at T[φ], which is either T or the end of the last trade sent before T. This is motivated by the use of robots: we do not want to stop one at T if it is running; we rather prefer to wait till the end of the algorithm. This is compensated by the fact that a penalty can be imposed when T[φ] is strictly larger than T, through the objective function g. Also note that the terminal reward depends on the parameter υ. This is motivated by applications to optimal liquidation, in which a final large order may be sent at the end, to liquidate the remaining shares immediately. As suggested earlier, the gain may depend not only on the value of the original time-space state process Z^{z,φ}_{T[φ]} but also on M^{z,m,φ}_{T[φ]}, to model the fact that we are also interested in the precision of the estimation of υ at the final time. Also note that one could add a running cost without additional difficulty; it can actually be incorporated into the state process Z^{z,φ}.
Given φ ∈ Φ^{z,m}, the expected reward is J(z, m; φ) := E_m[g(Z^{z,φ}_{T[φ]}, M^{z,m,φ}_{T[φ]}, υ)], and the corresponding value function is v(z, m) := sup_{φ ∈ Φ^{z,m}} J(z, m; φ). Note that v depends on m through the set of admissible controls Φ^{z,m} and the expectation operator E_m, even if g does not depend on M^{z,m,φ}.

Value function characterization and numerical approximation

The dynamic programming quasi-variational equation
The aim of this section is to explain how one can derive a PDE characterization of the optimal expected gain. As usual, it should be related to a dynamic programming principle, which in our setting reads as (3.1); recall the definition of N^φ in (2.2). Let us comment on this. First, one should restrict to stopping times θ^φ such that θ^φ ∈ N^φ. The reason is that no new impulse can be made outside of N^φ: each interval [τ^φ_i, ϑ^φ_i) is a latency period. Second, the terminal gain is evaluated at T[φ], which in general is different from T; hence the fact that θ^φ is only bounded by T[φ].
We continue our discussion, assuming that (3.1) holds and that v is sufficiently smooth. Let us denote by Z^{z,◦} the dynamics of the state process when no order is sent. Then, the above implies in particular that v ≥ Kv, which corresponds to the sub-optimality of sending an order immediately. As for the time-T boundary condition, the same reasoning as above implies v(T, ·) ≥ K_T g and v(T, ·) ≥ Kv(T, ·). By optimality, v should therefore solve the quasi-variational equations (3.4)-(3.5). To ensure that the above operator is continuous, we assume that, on R_+ × R^d × M, K_T g is continuous, and Kϕ is upper- (resp. lower-) semicontinuous, for all upper- (resp. lower-) semicontinuous bounded functions ϕ. Finally, we assume that comparison holds for (3.4)-(3.5). A sufficient condition is provided in [4].
Namely, if U is an upper-semicontinuous subsolution and V a lower-semicontinuous supersolution of (3.4)-(3.5), then U ≤ V on Z × M.
We can now state the main result of [4].

An example of numerical scheme
When the comparison result of Assumption 3.1 holds, one can easily derive a convergent finite difference scheme for (3.4)-(3.5).
We consider here a simple explicit scheme based on [11,12]. We let h_0 be a time-discretization step such that T/h_0 is an integer, and set T_{h_0} := {t^{h_0}_j := jh_0, j ≤ T/h_0}. The space R^d is discretized with a space step h_1 on a rectangle [−c, c]^d, containing N^x_{h_1} points in each direction. The corresponding finite set is denoted by X^{h_1}_c. The first-order derivatives ∂_t ϕ and (∂ϕ/∂x_i)_{i≤d} are approximated by the standard upwind differences, in which e_i is the i-th unit vector of R^d. As for the second-order term, we use the fact that each point x ∈ R^d can be written as a weighted combination x = Σ_{x' ∈ C_{h_1}(x)} x' ω(x'|x) of the points x' lying on the corners C_{h_1}(x) of the cube of the partition of R^d to which it belongs. Then, given another small parameter h_2, the second-order derivative in the direction of σ_i, the i-th column of σ, is approximated through a piecewise linear approximation of ϕ. In the case where only the first row σ_{1·} of σ is not identically equal to 0, one can use the usual simpler approximation. Similarly, we approximate Kϕ by a discrete counterpart. Letting h := (h_0, h_1, h_2), our numerical scheme consists in solving the resulting discrete equations on T_{h_0} × X^{h_1}_c. We use here a specific boundary condition on ∂X^{h_1}_c, but any other (bounded) boundary condition could be used. Finally, we extend v^c_h from the grid to the whole space. This scheme is convergent as (h_2, h_1/h_2, h_0/h_1) → 0 and c → ∞.

Remark 3.1. We did not discuss in the above the problem of the discrete approximation of M. Applications will typically be based on a parameterized family M = {m_θ, θ ∈ Θ}, for a subset Θ of a finite-dimensional space. We can then further approximate Θ by a sequence of finite sets to build up a numerical scheme. Similarly, the set of control values A needs to be approximated in practice. If the corresponding sequences of approximations are dense, then the convergence of the numerical scheme still holds.
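As a rough illustration of this kind of explicit scheme, here is a one-dimensional toy version with an upwind first-order difference, a standard second-order difference (in place of the piecewise-linear projection needed in higher dimensions), and an obstacle step enforcing v ≥ Kv. All coefficients, the action set and the costs are made up for the example; this is a sketch, not the paper's scheme.

```python
import numpy as np

# One-dimensional toy version of an explicit scheme for
#   max( d_t v + mu d_x v + 0.5 sigma^2 d_xx v , K v - v ) = 0,  v(T, .) = g,
# with (K v)(x) = max_b [ v(x + b) - cost(b) ] an intervention operator.
T, c = 1.0, 2.0
h0, h1 = 0.0005, 0.05          # explicit scheme: h0 must be small w.r.t. h1^2
x = np.arange(-c, c + h1 / 2, h1)
mu, sigma = 0.1, 0.3
actions = [0.2, 0.4]           # admissible jump sizes b
cost = lambda b: 0.05 + 0.1 * b

v = -np.abs(x)                 # terminal condition v(T, .) = g

def intervene(v_row):
    """(K v)(x_j): shift the grid by b (nearest-node projection) minus the cost."""
    best = np.full_like(v_row, -np.inf)
    for b in actions:
        k = int(round(b / h1))
        shifted = np.concatenate([v_row[k:], np.full(k, v_row[-1])])
        best = np.maximum(best, shifted - cost(b))
    return best

for _ in range(int(T / h0)):   # march backward in time
    dxp = (np.roll(v, -1) - v) / h1             # forward difference
    dxm = (v - np.roll(v, 1)) / h1              # backward difference
    dxx = (np.roll(v, -1) - 2 * v + np.roll(v, 1)) / h1 ** 2
    drift = mu * dxp if mu > 0 else mu * dxm    # upwinding on the first-order term
    v_new = v + h0 * (drift + 0.5 * sigma ** 2 * dxx)
    v_new[0], v_new[-1] = v[0], v[-1]           # frozen (bounded) boundary values
    v = np.maximum(v_new, intervene(v_new))     # obstacle step: v >= K v
```

The pointwise maximum in the last line is what implements the quasi-variational inequality; convergence relies on the CFL-type conditions on (h_0, h_1) mentioned in the text.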
We conclude this section with the technical lemmas that were used in the above proof.
Proof. Since M is assumed to be locally compact, it suffices to repeat the arguments of the proof of [5, Lemma 6.1, p. 80].

Construction of ε-optimal controls
It remains to explain how to deduce the optimal policy. At each point (t, x) of the time-space grid and for each prior m, one computes the maximal value that can be obtained by acting immediately, together with a maximizing action (ℓ̂(t, x, m), b̂(t, x, m)). If v^c_h(t, x, m) is equal to this maximum, then we play the control (ℓ̂(t, x, m), b̂(t, x, m)); otherwise, we wait for the next time step. This is the usual philosophy: we act on the system only if this increases the expected gain. As already argued, here the gain should not only be considered as an improvement of the current expected reward; it can also be a gain in the precision of our prior, which will then lead to better future rewards.
This produces a Markovian control which is optimal for the discrete time problem associated to our numerical scheme, and asymptotically optimal for the original control problem. We shall use this algorithm for the toy examples presented in the next section.
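The decision rule just described can be sketched at a single grid node as follows; the grid, the jump map and the cost function are hypothetical stand-ins for the quantities computed by the scheme.

```python
import numpy as np

def optimal_action(v_next, x_index, actions, jump_index, cost, tol=1e-8):
    """Decision at one grid node: act only if it beats waiting.

    v_next     : value function on the space grid at the next time step
    x_index    : index of the current node
    actions    : candidate actions b (illustrative labels)
    jump_index : (x_index, b) -> node index reached after playing b
    cost       : b -> immediate cost of playing b
    Returns the best action, or None when waiting is (weakly) better.
    """
    best_b, best_val = None, v_next[x_index]   # waiting keeps the current node
    for b in actions:
        val = v_next[jump_index(x_index, b)] - cost(b)
        if val > best_val + tol:               # act only on a strict improvement
            best_b, best_val = b, val
    return best_b

# Toy usage on a 4-node grid.
v_grid = np.array([0.0, 1.0, 5.0, 2.0])
move = lambda i, b: i + b                      # action b jumps b nodes to the right
fee = lambda b: 0.5
a0 = optimal_action(v_grid, 0, [1, 2], move, fee)   # acting pays: jump to node 2
a2 = optimal_action(v_grid, 2, [1], move, fee)      # already at the best node: wait
```

Sweeping this rule over the whole grid, backward in time, produces the Markovian policy mentioned above.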

Applications to optimal trading
This section is devoted to the study of two examples of application. Each of them corresponds to an idealized model; the aim here is not to come up with a good model, but rather to show the flexibility of our approach and to illustrate numerically the behavior of our backward algorithm.

Immediate impact of aggressive orders
We consider first a model in which the impact of each single order sent to the market is taken into account. It means that α_i represents the number of shares bought exactly at time τ_i, so that ℓ_i = 0 for each i. This corresponds to A = {0} × B, in which B ⊂ R_+ is a compact set of admissible order sizes. Therefore, one can identify A with B in the following, and we will simply write b for a = (0, b) ∈ A and β_i for α_i = (ℓ_i, β_i).
Our model can be viewed as a scheduling model or as a model for an illiquid market. The first component of X represents the stock price. We consider a simple linear impact: when a trade of size β_i occurs at τ_i, the stock price jumps by a quantity that is linear in υβ_i, up to a noise term ε_i, in which υ ∈ R is the unknown linear impact parameter and (ε_i)_{i≥1} is a sequence of independent noises following a centered Gaussian distribution with standard deviation σ_ε. The coefficient 1/2 in the dynamics of X^1 stands for a 50% proportion of immediate resilience. Between two trades, X^1 evolves according to a Brownian diffusion with a residual resilience effect, where σ, ρ > 0 and X^1_0 ∈ R are constants. The process X^4 represents the drift of X^1 due to the non-immediate resilience, with X^4_0 = 0; it jumps when a trade occurs. We call it the spread hereafter; it is part of the deviation from the un-impacted dynamics. The component X^2 describes the total cost of the executed trades, while X^3 keeps track of the cumulative number of shares bought. We are interested in the cost of buying N shares, and maximize a criterion in which η > 0 is a risk-aversion parameter, C > 0, and the terminal term represents the total cost after setting the total number of shares bought to N at T. If the prior law m on υ is a Gaussian distribution, then q(·|t, x, b, u) is a Gaussian density, and the transition map in (2.9) maps Gaussian distributions into Gaussian distributions, which, in practice, enables us to restrict M to the set of Gaussian distributions. More precisely, if (m_υ(τ_i−), σ_υ(τ_i−)) are the mean and standard deviation of M_{τ_i−}, then the mean and standard deviation of the posterior distribution M_{ϑ_i} follow from the usual Gaussian conjugate formulas. Compared to the general results of the previous section, we add a boundary condition v(t, x^1, x^2, N, x^4) = 1 and restrict the domain of X^3 to {0, …, N}. Since this component x^3 is discrete, this does not change the nature of our general results.
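The Gaussian-to-Gaussian update mentioned above is the standard conjugate formula. As a sketch, assume (this is our simplification, not necessarily the paper's exact observation equation) that each trade of size beta reveals a noisy linear observation of the slope υ:

```python
import numpy as np

def gaussian_impact_update(m, s, beta, observed, noise_std):
    """Conjugate update of a Gaussian prior N(m, s^2) on the impact slope upsilon.

    Assumed observation model (a sketch, not the paper's exact equation):
        observed = upsilon * beta + eps,   eps ~ N(0, noise_std^2).
    """
    post_prec = 1.0 / s ** 2 + beta ** 2 / noise_std ** 2   # precisions add up
    s_post = 1.0 / np.sqrt(post_prec)
    m_post = s_post ** 2 * (m / s ** 2 + beta * observed / noise_std ** 2)
    return m_post, s_post

# Repeated trades with a true slope of 0.05: the posterior concentrates on it.
rng = np.random.default_rng(0)
m, s = 0.0, 0.1                  # prior mean and standard deviation
true_upsilon, noise_std = 0.05, 0.01
for _ in range(50):
    beta = 3.0                   # trade size
    observed = true_upsilon * beta + rng.normal(0.0, noise_std)
    m, s = gaussian_impact_update(m, s, beta, observed, noise_std)
```

The closed-form update is what allows M to be parameterized by the pair (m_υ, σ_υ), i.e. a two-dimensional state, in the numerical scheme.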
Note also that the relevant map actually satisfies the conditions provided in [4] ensuring that Assumption 3.1 holds.
We now discuss a numerical illustration. We consider 30 seconds of trading and N = 25 shares to buy. We take η = 1, x_0 = 100 and σ = 0.4x_0, which corresponds to a volatility of 40% in annual terms. The trading period is divided into intervals of 1 second. The size of an order β_i ranges in {1, 2, 3, 4, 5}. We take σ_ε = 10^{-4} and ρ such that the spread X^4 is divided by 3 every second if no new order is sent. We start with a prior given by a Gaussian distribution with mean m_υ(0) and standard deviation σ_υ(0). Finally, we take C = 10^{200}, which makes this threshold parameter essentially ineffective while still ensuring that the terminal condition is bounded.
In Figure 1, we plot the optimal strategy for σ_υ(0) = 5 × 10^{-4} and m_υ(0) = 5 × 10^{-2} in terms of (X^2, X^3). Clearly, the level of the spread X^4 has a significant impact: when it is large, it is better to wait for it to decrease before sending a new order. This can also be observed in Figure 2, which provides a simulated path corresponding to an initial prior (m_υ(0) = 2 × 10^{-2}, σ_υ(0) = 10^{-3}): after 15 seconds, the algorithm alternates between sending an order and doing nothing, i.e. waiting for the spread to be reduced at the next time step. On the top right graph, we can also observe that the low mean of the initial prior, combined with a zero initial resilience, leads to sending an order of size 3 at first; then the mean of the prior is quickly adjusted to a higher level and the algorithm slows down immediately. Let us now consider the case ρ = 0, i.e. without dynamic resilience, with a trading period of 60 seconds and N = 50. In Figure 3, we provide the optimal policy (number of traded shares) in terms of the number X^3 of already traded shares and the prior's mean parameter m_υ, for different times. Not surprisingly, the algorithm is more aggressive as the prior's mean decreases and the remaining number of shares to buy increases. It is rather stable in time (compare t = 0s with t = 30s) up to the end, where it is forced to accelerate to avoid a large final impact cost. It is also much more aggressive compared to the case ρ > 0 presented above: we can no longer profit from the decrease of the resilience term X^4, and there is no reason to wait. In Figure 4, we provide a simulated path of (X, α, m_υ, σ_υ) that shows how the prior on the unknown coefficient υ can adapt to changing market conditions. The red dashed lines and circles correspond to the same path of the Brownian motion and the same realized noises (ε_i)_{i≥1} as the black solid lines and crosses, but the true parameter is changed from 5 × 10^{-2} to 5 × 10^{-4} after 5 seconds.
The algorithm becomes more aggressive quite quickly after the shock, as the prior adapts to the new, smaller level of impact. Note that the total number of shares is bought slightly before 30s, so that the prior does not change anymore after this date.
In Figure 5, we plot the log of the value function minus the cost 5 × 10^3 of buying the total number of shares without impact (similar to the implementation shortfall), in terms of the different quantities of interest.

Random execution times: application to strategies using limit orders

In this section, we consider a limit-order trading model. X^1 now represents a (reference) mid-price and, between two trades, evolves according to a Brownian dynamics. An order is of the form (ℓ, β), in which ℓ is the maximal time we are ready to wait before being executed, while β is the price at which the limit order is sent. For simplicity, each order corresponds to buying one share. We assume that the time θ it takes to be executed follows an exponential distribution with parameter ρ(υ, X^1_τ − β), given the information at time τ. One can send a new order only after ϑ := (τ + ℓ) ∧ θ. Hence, given a flow of orders φ = (τ_i, ℓ_i, β_i)_{i≥1}, the number X^3 of shares bought increases by one at each execution time. As in the previous model, X^3 is restricted to {0, …, N}, and X^2 denotes the total cost of buying the shares. We maximize an expected terminal gain in which 1.02 is the best ask (kept constant) and 5 × 10^2 is an impact coefficient; the corresponding penalty represents the cost of liquidating instantaneously the remaining shares (N − x^3)_+ at T. This model is a version of [2], [16], [18]; see also [17].
Direct computations show that the prior process M evolves according to the Bayesian updating rule described in Section 2. In the case where M is the convex hull of a finite number of Dirac masses, the weights associated to M can be computed explicitly.
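For the two-point prior used below, the explicit reweighting can be sketched as follows; we distinguish the executed case (exponential density) from the cancelled case (survival probability), with illustrative rates.

```python
import numpy as np

def update_two_point_prior(p_high, rho_high, rho_low, exec_time, max_wait):
    """Reweight P[upsilon = high] after observing one limit order's fate.

    The execution time is exponential with rate rho(upsilon); the order is
    cancelled after max_wait if unfilled (a censored observation).
    """
    if exec_time is not None and exec_time <= max_wait:
        # executed: the likelihood is the exponential density at exec_time
        l_high = rho_high * np.exp(-rho_high * exec_time)
        l_low = rho_low * np.exp(-rho_low * exec_time)
    else:
        # cancelled: the likelihood is the survival probability beyond max_wait
        l_high = np.exp(-rho_high * max_wait)
        l_low = np.exp(-rho_low * max_wait)
    num = p_high * l_high
    return num / (num + (1.0 - p_high) * l_low)

# Rates chosen so that execution within one minute has probability 0.8 / 0.3,
# mimicking the two Dirac masses u = 0.8 and u = 0.3 of the example below.
rho_high = -np.log(1 - 0.8)      # per minute
rho_low = -np.log(1 - 0.3)
p = 0.5
p_fast = update_two_point_prior(p, rho_high, rho_low, exec_time=0.2, max_wait=1.0)
p_slow = update_two_point_prior(p, rho_high, rho_low, exec_time=None, max_wait=1.0)
```

A quick fill pushes the weight towards the high-intensity atom, while a cancellation pushes it the other way, which is the mechanism behind the adaptive behavior observed in the simulations.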
Here again, the map Ψ(t, x, m) = N − x 3 satisfies the conditions provided in [4] to ensure that Assumption 3.1 holds.
We now consider a numerical illustration. We take C = 10^{200}. The time horizon is T = 15 minutes. To simplify, we fix the reference mid-price to X^1 ≡ 1 (i.e. σ = 0) and restrict to ℓ = 1 minute, i.e. an order is sent each minute. We take N = 10. One can send limit buy orders in the range B := {0.90, 0.92, 0.94, 0.96, 0.98}. As for the intensity of the execution time, we use an exponential form as in [16]: ρ(u, x^1 − b) = λ(u)e^{−20(0.98−b)}, in which λ(u) = −ln(1 − u). This means that the probability of being executed at the price 0.98 within one minute is u. Orders are sent each minute, but we use a finer time grid in order to take into account that an order can be executed before this maximal time length. The original prior is supported by two Dirac masses, at u = 0.3 and u = 0.8. The corresponding probabilities of being executed within one minute are plotted in Figure 6. Our time step corresponds to 15 seconds, so that every 15 seconds the controller can launch a new order if the previous one has been executed before the maximal 1-minute time length. In Figure 7, we plot the difference, in logarithms, between the value functions obtained in the latter case and for a time step of 1 minute (in which case a new order cannot be launched before one minute). Clearly, the possibility of launching new orders in advance is an advantage. In Figure 8, we plot the optimal policy at times t = 0 and t = 7.5 minutes. As expected, the algorithm is more aggressive when the probability of having υ = 0.8 is higher. In Figure 9, we plot a simulated path. The red and black lines and points correspond to the same realization of the random variables at hand, but for different values of the true υ. Black corresponds to the most favorable case υ = 0.8, while red corresponds to υ = 0.8 for the first 7.5 minutes and υ = 0.3 for the remaining time. The initial prior is P[υ = 0.8] = 9%. Again, the algorithm adapts pretty well to this shock on the true parameter.
We also see that it is more aggressive when the prior probability of being in the favorable case is high.