Finding the right balance between exploitation and exploration
Making decisions under uncertainty is a common challenge faced by professionals in various fields, including data science and asset management. Asset managers face this problem when selecting among multiple execution algorithms to carry out their trades. The allocation of orders among algorithms resembles the multi-armed bandit problem that gamblers face when deciding which slot machines to play, as they must determine the number of times to play each machine, the order in which to play them, and whether to continue with the current machine or switch to another. In this article, we describe how an asset manager can best distribute orders among available algorithms based on realized execution cost.
Dummy example
For each order, we take an action a by allocating the order to one of K algorithms.
The value of action a is the expected execution cost of the corresponding algorithm.
Suppose that K = 3 and that the expected execution costs of the three algorithms are:
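In standard bandit notation, with c_t the realized execution cost of order t and A_t the algorithm chosen for it, the action values can be sketched as follows (the concrete cost levels are hypothetical placeholders used purely for illustration):

$$ q(a) = \mathbb{E}\left[\, c_t \mid A_t = a \,\right], \qquad \text{e.g. } q(1) = 10,\; q(2) = 12,\; q(3) = 15 \text{ bps}. $$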
If you knew the action values a priori, the problem would be trivial: you would always select the algorithm with the lowest expected execution cost. Suppose now that we start allocating orders among the three algorithms, as shown in Figure 1.
We still do not know the action values with certainty, but we do have estimates after some time t:
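One natural estimate is the sample average of the costs realized so far for each algorithm (a sketch, with N_t(a) denoting the number of orders allocated to algorithm a before time t):

$$ Q_t(a) = \frac{1}{N_t(a)} \sum_{s < t,\; A_s = a} c_s. $$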
We can for instance construct the empirical distribution of the execution cost¹ for each algorithm, as shown in Figure 2.
Allocating all orders to the algorithm with the lowest expected execution cost may appear to be the best approach. However, doing so would prevent us from gathering information on the performance of the other algorithms. This illustrates the classical multi-armed bandit dilemma:
- Exploit the information that has already been learned
- Explore to learn which actions give the best outcomes
The objective is to minimize the average execution cost after allocating N orders:
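One way to write this objective (a sketch, with c_t again denoting the realized execution cost of the t-th order):

$$ \min \; \mathbb{E}\left[ \frac{1}{N} \sum_{t=1}^{N} c_t \right]. $$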
Solving the problem using policies
To solve the problem, we need an action selection policy that tells us how to allocate each order based on current information S. We can define a policy as a map from S to a:
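In symbols, with S_t the information available before order t is allocated (standard notation, assumed here):

$$ A_t = \pi(S_t). $$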
We discuss the best-known policies² for the multi-armed bandit problem, which can be grouped into the following categories:
- Semi-uniform strategies: Greedy & ε-greedy
- Optimistic strategies: Upper-Confidence-Bound
- Probability matching strategies: Thompson sampling
Greedy
The greedy approach allocates all orders to the action with the lowest estimated value. This policy always exploits current knowledge to minimize the immediate expected execution cost:
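A minimal sketch in Python, assuming we track Q_t(a) as the average realized cost per algorithm (names and values are illustrative):

```python
import numpy as np

def greedy_action(cost_estimates: np.ndarray) -> int:
    """Greedy policy: pick the algorithm with the lowest estimated execution cost.

    cost_estimates -- Q_t(a), the average realized cost observed for each algorithm.
    """
    return int(np.argmin(cost_estimates))

# Example: with estimates of 10, 12 and 15 bps, the greedy policy always picks algorithm 0.
print(greedy_action(np.array([10.0, 12.0, 15.0])))  # -> 0
```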
ε-Greedy
The ε-greedy approach behaves greedily most of the time but, with probability ε, selects randomly among the non-greedy actions:
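A sketch of this rule, following the description above (exploring only among the non-greedy actions); the value of ε is an arbitrary illustration:

```python
import numpy as np

rng = np.random.default_rng()

def epsilon_greedy_action(cost_estimates: np.ndarray, epsilon: float = 0.1) -> int:
    """With probability 1 - epsilon exploit the lowest-cost estimate; otherwise
    explore uniformly among the remaining (non-greedy) algorithms."""
    best = int(np.argmin(cost_estimates))
    if rng.random() < epsilon and len(cost_estimates) > 1:
        others = [a for a in range(len(cost_estimates)) if a != best]
        return int(rng.choice(others))
    return best
```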
An advantage of this policy is that every action keeps being explored, so the value estimates converge and the greedy choice identifies the optimal action in the limit.
Upper-Confidence-Bound
The Upper-Confidence-Bound (UCB) approach selects the action with the lowest estimated cost minus an exploration term that shrinks as the number of times the trading algorithm has been used, N_t(a), grows. The approach thus selects among the non-greedy actions according to their potential for actually being optimal, taking into account the uncertainty in their estimates:
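Because we are minimizing cost, the usual upper bound on the reward becomes a lower confidence bound on the cost estimate. A sketch with the common sqrt(ln t / N_t(a)) exploration term (the constant c is a tunable assumption):

```python
import numpy as np

def ucb_action(cost_estimates: np.ndarray, counts: np.ndarray,
               t: int, c: float = 2.0) -> int:
    """Select the algorithm with the lowest cost estimate minus an exploration bonus
    that shrinks as the algorithm is used more often (counts = N_t(a))."""
    # Try every algorithm at least once before applying the confidence bound.
    if np.any(counts == 0):
        return int(np.argmin(counts))
    exploration = c * np.sqrt(np.log(t) / counts)
    return int(np.argmin(cost_estimates - exploration))
```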
Thompson Sampling
The Thompson Sampling approach, as proposed by Thompson (1933), assumes a known initial distribution over the action values and updates the distribution after each order allocation³. The approach selects actions according to their posterior probability of being the best action:
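A sketch with Gaussian posteriors over each algorithm's mean cost, which is one simple modelling choice rather than the only one: for each order, draw one sample per algorithm from its posterior and allocate to the lowest draw.

```python
import numpy as np

rng = np.random.default_rng()

def thompson_action(means: np.ndarray, counts: np.ndarray, sigma: float = 1.0) -> int:
    """Sample a plausible mean cost for each algorithm from N(Q_t(a), sigma^2 / N_t(a))
    and allocate the order to the algorithm with the lowest sample.

    means  -- Q_t(a), average realized execution cost per algorithm
    counts -- N_t(a), number of orders already allocated to each algorithm (all >= 1)
    sigma  -- assumed noise scale of a single order's execution cost
    """
    samples = rng.normal(means, sigma / np.sqrt(counts))
    return int(np.argmin(samples))
```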
Evaluating policies
In practice, policies are commonly evaluated on regret, which measures the deviation from the optimal solution, where μ* denotes the minimal mean execution cost across the algorithms:
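In symbols (standard bandit regret, adapted to cost minimization):

$$ \rho_N = \sum_{t=1}^{N} \left( \mathbb{E}\left[ c_t \right] - \mu^* \right), \qquad \mu^* = \min_{a} q(a). $$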
Actions are a direct consequence of the policy, and we can therefore also define regret as a function of the chosen policy:
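A sketch, writing A_t^π for the action chosen by policy π at step t:

$$ \rho_N(\pi) = \sum_{t=1}^{N} \left( \mathbb{E}\left[ q\!\left(A_t^{\pi}\right) \right] - \mu^* \right). $$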
In Figure 3, we simulate the regret for the aforementioned policies in the dummy example. We observe that the Upper-Confidence-Bound approach and Thompson sampling approach perform best.
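A sketch of such a simulation (the cost levels, noise scale, and hyperparameters below are illustrative assumptions, not the settings behind Figure 3):

```python
import numpy as np

rng = np.random.default_rng(0)
q_true = np.array([10.0, 12.0, 15.0])        # hypothetical true mean costs (bps)
sigma, n_orders, eps, c_ucb = 2.0, 2000, 0.1, 2.0

def run(policy: str) -> np.ndarray:
    """Allocate n_orders one by one under the given policy; return cumulative regret."""
    K = len(q_true)
    counts, means = np.zeros(K), np.zeros(K)
    regret = np.zeros(n_orders)
    for t in range(n_orders):
        if np.any(counts == 0):                          # try each algorithm once first
            a = int(np.argmin(counts))
        elif policy == "greedy":
            a = int(np.argmin(means))
        elif policy == "eps-greedy":
            best = int(np.argmin(means))
            explore = rng.random() < eps
            a = int(rng.choice([i for i in range(K) if i != best])) if explore else best
        elif policy == "ucb":
            a = int(np.argmin(means - c_ucb * np.sqrt(np.log(t + 1) / counts)))
        else:                                            # Thompson sampling
            a = int(np.argmin(rng.normal(means, sigma / np.sqrt(counts))))
        cost = rng.normal(q_true[a], sigma)              # realized execution cost
        counts[a] += 1
        means[a] += (cost - means[a]) / counts[a]        # incremental mean update
        regret[t] = q_true[a] - q_true.min()             # per-order expected regret
    return np.cumsum(regret)

for p in ["greedy", "eps-greedy", "ucb", "thompson"]:
    print(p, round(run(p)[-1], 1))                       # final cumulative regret
```

With settings like these, the greedy policy tends to lock in on whichever algorithm happened to look cheapest early on, while UCB and Thompson sampling keep probing the alternatives and typically end up with lower cumulative regret, consistent with the observation above.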
Allocating orders? Embrace uncertainty!
The simulation results for the dummy example strongly indicate that relying solely on a greedy approach may not yield optimal outcomes. It is therefore crucial to incorporate and measure the uncertainty in the execution cost estimates when developing an order allocation strategy.
Footnotes
¹ To ensure that the empirical distributions of execution cost are comparable across algorithms, we need to either allocate similar orders to each algorithm or evaluate them on order-agnostic cost metrics.
² In situations where an algorithm’s execution costs depend on the order characteristics, contextual bandits are a more suitable option. For an introduction to this approach, we recommend Section 2.9 of Sutton & Barto (2018).
³ We strongly suggest Russo et al. (2018) as an outstanding resource to learn about Thompson sampling.
Additional resources
I personally found the following tutorials and lectures very helpful for understanding multi-armed bandit problems.
Industry
Academia
References
[1] Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction. MIT press.
[2] Russo, D. J., Van Roy, B., Kazerouni, A., Osband, I., & Wen, Z. (2018). A tutorial on Thompson sampling. Foundations and Trends® in Machine Learning, 11(1), 1–96.
[3] Thompson, W. R. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4), 285–294.
[4] Thompson, W. R. (1935). On the theory of apportionment. American Journal of Mathematics, 57(2), 450–456.
[5] Eckles, D., & Kaptein, M. (2014). Thompson sampling with the online bootstrap. arXiv preprint arXiv:1410.4009.