Using Reinforcement Learning for Algorithmic Trading

EASON
7 min read · Jan 20, 2021

Context: For the longest time, I’ve avoided offers in Quantitative Finance because I believe it does not address the heart of people’s problems; however, I do think there is value in treating Quantitative Finance as a problem on which to develop better tools that do. Working on algorithmic trading can help us better harness machine creativity in coming up with strategies. Programs like AlphaGo and AlphaGo Zero are truly creative, not only because they can invent new moves, but also because they can be modeled as generative models, which lets us see strategizing as a creative task. Here, from a non-finance person’s vantage point, I’d like to articulate the merits of Deep Reinforcement Learning for algorithmic trading.

A Theory About Quantitative Trading:

Assume that people don’t cash out of the financial market except in abnormal situations like a market crash; what they do instead is move money from one asset to another, hoping for its value to increase. Based on this theory, all trading techniques are ultimately about computing a correlation matrix over all assets, because the money withdrawn from one asset flows into another, and that flow is reflected in their prices. With a portfolio of n assets and just one feature per asset, we already face a giant n-by-n matrix that is hard to swallow, which is why we may want to use an expressive function such as a neural network to compute the correlation matrix.

“All trading techniques ultimately are about computing a correlation matrix for all assets.”
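To make the n-by-n point concrete, here is a minimal sketch of that correlation matrix, computed from simulated stand-in prices; no real data source is assumed, and the asset names and numbers are placeholders.

```python
# A minimal sketch of the n-by-n correlation matrix over assets.
# The prices below are simulated placeholders, not real market data.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
T, n = 500, 5  # T days of prices for n assets

prices = pd.DataFrame(
    100 * np.exp(np.cumsum(rng.normal(0, 0.01, size=(T, n)), axis=0)),
    columns=[f"asset_{i}" for i in range(n)],
)

# Daily log returns, then the n-by-n correlation matrix across assets.
returns = np.log(prices).diff().dropna()
corr = returns.corr()  # grows quadratically with the number of assets
print(corr.round(2))
```

Once n and the number of features per asset grow, this matrix is exactly the object an expressive function approximator would be asked to capture implicitly.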

Why Deep Reinforcement Learning:

Speed is an essential requirement in algorithmic trading. Quantum Finance offers a fresh computing paradigm, but we also need the logic that makes the trade decisions. Ideally, we want a framework that removes as much of the decision-making bottleneck as possible. Self-supervised learning currently has a bright future in AI: it combines the superb performance of supervised methods with the “no labels, no problem” flexibility of unsupervised methods. Self-supervised learning methods all have one thing in common: they generate the learning targets they are trained on. Their learning process goes like this: the agent (1) makes a prediction, (2) observes what actually happens, (3) checks the prediction for error, and (4) learns from that error. And yes, I have just described the essence of Reinforcement Learning. Knowing what Reinforcement Learning is, let’s now see why it works for algorithmic trading.
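As a minimal sketch of that four-step loop, here is tabular TD(0) value learning on a toy environment; the states and rewards are placeholders, not a trading simulator.

```python
# A minimal sketch of the predict -> observe -> error -> learn loop,
# written as tabular TD(0) value learning on a toy chain environment.
import random

states = range(5)
V = {s: 0.0 for s in states}        # current value predictions
alpha, gamma = 0.1, 0.99

def step(s):
    """Toy transition: move to a random neighbouring state, reward = s."""
    s_next = max(0, min(4, s + random.choice([-1, 1])))
    return s_next, float(s)

s = 0
for _ in range(10_000):
    prediction = V[s]                                    # (1) make a prediction
    s_next, reward = step(s)                             # (2) observe what happens
    error = reward + gamma * V[s_next] - prediction      # (3) check the error
    V[s] = prediction + alpha * error                    # (4) learn from the error
    s = s_next

print(V)
```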

1. Efficient exploration and exploitation.

For most domain-specific problems with well-defined objective functions, rather than trying hard to develop models, we should be concerned with developing a better representation of the environment to train the model on. But even with an adequate representation, using it to derive a trading strategy is a rather difficult job for humans, for there are simply too many possibilities. Over time, human traders tend to rely on only a few signals as their primary tools. Luckily, unlike other frameworks, reinforcement learning deals with both prediction (e.g., stock prices) and control (e.g., portfolio allocation). With offline reinforcement learning, we can learn trading strategies offline, venturing into uncharted territory. This massively reduces the execution latency needed for high-frequency trading.

In contrast to online reinforcement learning, offline reinforcement learning trains the agent on a fixed dataset alone, without any further incoming data. This makes learning more efficient and allows for policy “completeness”. In online learning, the current policy depends on a stochastic future and is therefore incomplete. Though the complete strategy of today may not apply to the future, a complete strategy tends to be more robust.

In offline reinforcement learning, the agent is trained on a fixed dataset of experiences “offline” rather than on experiences gathered on the go (online). Normally, offline reinforcement learning faces an existential problem: given a dataset with fixed targets, if the agent strays too far off the track, we don’t know the targets of its actions. But in algorithmic trading we can always compute the targets of alternative actions from the prices, which saves us massive trouble in control.
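Here is a small sketch of why off-dataset actions still have well-defined targets in trading: given the price series, the reward of any allocation, taken or not, can be computed after the fact. Transaction costs and market impact are ignored, which is my simplifying assumption.

```python
# Off-dataset actions still have well-defined targets: given prices,
# the reward of *any* allocation can be computed after the fact.
import numpy as np

rng = np.random.default_rng(1)
prices = 100 * np.exp(np.cumsum(rng.normal(0, 0.01, size=(251, 3)), axis=0))
returns = prices[1:] / prices[:-1] - 1.0          # (T, n) simple returns

def reward(weights: np.ndarray, t: int) -> float:
    """Portfolio return at step t for an arbitrary allocation `weights`."""
    return float(weights @ returns[t])

logged_action = np.array([0.5, 0.3, 0.2])          # action seen in the dataset
counterfactual = np.array([0.0, 0.0, 1.0])         # action never taken

t = 42
print(reward(logged_action, t), reward(counterfactual, t))
```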

“…even with an adequate representation, using it to derive a trading strategy is a rather difficult job for humans, for there are simply too many possibilities.”

2. Trade on the future, not the past.

Although past data is all we have, it is simply not an adequate predictor of the future for most non-stationary problems, particularly in financial markets, where conditions are ever-changing. Model-based reinforcement learning has a unique philosophy: rather than trying to predict the future from the past, the agent can envision the possibilities of the future and plan ahead. For example, if you don’t want to do your homework, you could either (1) gamble on the possibility that your teacher will not bring it up, or (2) prepare a response for each of the two possibilities, since your teacher may or may not bring it up. With the latter option, you are in control of the situation regardless of what comes. When there are many possibilities, preparing for the future (rather than predicting it) helps lift the burden of uncertainty. By focusing on preparing for the major possibilities, we are dealing with a problem of capacity rather than accuracy, which is much easier to command. In fact, as I’d like to believe, RL’s reliance on the Markov Decision Process emphasizes this take on continuous predictive control: we use only the current state as the latent space from which to vary our response to the future, while discarding the past.
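As a sketch of “preparing for the possibilities rather than predicting one of them”, the snippet below samples several candidate futures from a stand-in market model and picks the allocation with the best worst-case outcome; the model, the candidate allocations, and the robust criterion are all illustrative assumptions, not a prescribed method.

```python
# Preparing for many possible futures: sample scenarios from a stand-in
# dynamics model and choose the allocation with the best worst case.
import numpy as np

rng = np.random.default_rng(2)
n_assets, n_scenarios = 3, 200

def sample_future_returns() -> np.ndarray:
    """Stand-in for a learned dynamics model: sampled next-step returns."""
    return rng.normal(loc=[0.001, 0.0005, 0.0], scale=[0.02, 0.01, 0.005],
                      size=(n_scenarios, n_assets))

candidate_allocations = [
    np.array([1.0, 0.0, 0.0]),
    np.array([0.0, 1.0, 0.0]),
    np.array([1/3, 1/3, 1/3]),
]

scenarios = sample_future_returns()                 # (scenarios, assets)
# Robust choice: maximise the worst-case scenario return.
worst_case = [np.min(scenarios @ w) for w in candidate_allocations]
best = candidate_allocations[int(np.argmax(worst_case))]
print("chosen allocation:", best)
```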

A Markovian reinforcement learning model can be viewed as a GAN-like generative model, where a generator (our policy) acts as a Bayes model estimating a posterior (the action) from sample states drawn from a prior d(s). In a single-agent setting, the discriminator is fully trained to recognize d(s), so the model plays a reward-maximization game. In a multi-agent setting, the discriminator (the market) is affected by other players, so d(s) becomes non-stationary and the agent becomes an empirical Bayes estimator caught in a minimax game that converges at a Nash equilibrium.
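For reference, these are the two standard objectives the analogy puts side by side, as I read it: single-agent RL maximizes expected return, while the GAN setting is a minimax game between generator and discriminator.

```latex
\begin{align}
  \text{RL (single agent):}\quad
    & \max_{\pi}\; \mathbb{E}_{s \sim d(s),\, a \sim \pi(\cdot \mid s)}
      \Big[ \textstyle\sum_{t} \gamma^{t} r_{t} \Big] \\
  \text{GAN (minimax game):}\quad
    & \min_{G}\,\max_{D}\;
      \mathbb{E}_{x \sim p_{\text{data}}}\!\big[\log D(x)\big]
      + \mathbb{E}_{z \sim p(z)}\!\big[\log\big(1 - D(G(z))\big)\big]
\end{align}
```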

Since the financial market is inherently non-stationary, a plug-in framework like model-based reinforcement learning should be considered as a way to realize such an empirical Bayes estimator. Regardless, we have to keep in mind that the state-action value prediction will be wrong from time to time, so a decoupling of prediction and decision must be realized.

“Although the past data is all we have, it is simply not an adequate predictor of the future for most non-stationary problems.”

3. Separation of prediction and decision.

There is a long-established rhetoric in inferential statistics to separate prediction and decision. The reasoning is that “making a correct prediction at a given time doesn’t guarantee that the decision based on that prediction will be correct.” There are many possible reasons for this; to give one example, consider a scenario where we have two targets following two different normal distributions. Though both the left and the right have the same expectation, the standard deviation is much wider on the right, hence more uncertainty, in which case our actions should differ (e.g., seeking more diversification on the RHS).

The linear model on the LHS is more robust than the one on the RHS because it has lower variance. On the RHS, we may need extra information at every step to reduce the uncertainty.
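To illustrate how the same point prediction can lead to different decisions once the spread matters, here is a toy volatility-scaled sizing rule; the rule itself is just one illustrative heuristic of mine, not something the argument above prescribes.

```python
# Same expected return, different spread -> different decisions.
# Position size shrinks as the predicted distribution gets wider.
expected_return = 0.01          # same expectation on the LHS and the RHS
sigma_lhs, sigma_rhs = 0.02, 0.10
target_vol = 0.02               # risk budget per position

def position_size(mu: float, sigma: float) -> float:
    """Smaller position when the predicted distribution is wider."""
    if mu <= 0:
        return 0.0
    return min(1.0, target_vol / sigma)

print(position_size(expected_return, sigma_lhs))   # 1.0 (full size)
print(position_size(expected_return, sigma_rhs))   # 0.2 (scaled down)
```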

The rhetoric for separating prediction and decision should be especially strong in algorithmic trading, because stochastic predictors (e.g., a stock-price predictor) usually come down to computing the expectation of a target distribution. Reinforcement learning commonly employs so-called actor-critic methods, where the model comprises a Critic (predicting the state-action value) and an Actor (making decisions based on, but not restricted by, the Critic’s prediction). Since both functions are trained concurrently and end-to-end, there is little need to worry about their alignment. However, the value function (the Critic) should be regularized to account for the fact that the state distribution is non-stationary.
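Below is a minimal actor-critic sketch in PyTorch, in which the Critic predicts value and the Actor makes the decision, trained concurrently on the same transitions; the state features, reward, and discrete “hold one asset” action space are simplifying placeholders rather than a full trading agent.

```python
# A minimal actor-critic sketch: the critic predicts the state value, the
# actor decides which asset to hold, and both are updated concurrently.
import torch
import torch.nn as nn
from torch.distributions import Categorical

n_features, n_assets = 16, 5

actor = nn.Sequential(                 # decision: which asset to hold
    nn.Linear(n_features, 64), nn.ReLU(), nn.Linear(64, n_assets),
)
critic = nn.Sequential(                # prediction: state value
    nn.Linear(n_features, 64), nn.ReLU(), nn.Linear(64, 1),
)
opt = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=1e-3)

def update(state, reward, next_state, gamma=0.99):
    """One concurrent actor-critic update on a single transition."""
    dist = Categorical(logits=actor(state))
    action = dist.sample()

    value = critic(state)
    next_value = critic(next_state).detach()
    td_error = reward + gamma * next_value - value        # critic's error

    actor_loss = -(dist.log_prob(action) * td_error.detach()).mean()
    critic_loss = td_error.pow(2).mean()

    opt.zero_grad()
    (actor_loss + critic_loss).backward()
    opt.step()
    return action

# Example call with random placeholder data:
s, s_next = torch.randn(1, n_features), torch.randn(1, n_features)
update(s, reward=torch.tensor([[0.01]]), next_state=s_next)
```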

“… making a correct prediction at a given time doesn’t guarantee that the decision based on that prediction will be correct.”

Concluding remarks: The perfect game for the one thing that is good at games.

For average people, trading is but a game of the “Greater Fool”: your goal is to find people more foolish than yourself so that you can sell them something of lesser value. If everyone knows an asset is going to fail, no one will be willing to make the trade; hence, the financial market is one of the few human inventions that spawns riches out of chaos, and if winning of any kind is about reducing the uncertainty of the environment, there is no inherent win-win scenario in trading. In other words, trading financial assets is different from trading commodities: it is an adversarial game, and your only goal is to beat a stochastic opponent (the market, or another trading bot) who plays under the same rules (hopefully) but with imperfect information. Better yet, this stochastic opponent is not one actor but many, whom you sometimes even team up with: riding the trend is a way of teaming up to benefit everyone, but only temporarily. And that’s all there is to it. It so happens that Reinforcement Learning is exceptionally good at games with simple rules but enormous numbers of possibilities. Take the game of Go: one can learn its rules in under 60 minutes, yet the game has more board configurations than there are atoms in the universe. If Reinforcement Learning can beat the game of Go, it can beat the market.
