Using random forest to model limit order book dynamics
This article is originally published at https://www.thertrader.com
In this article I use the random forest algorithm to forecast mid-price dynamics over short time horizons, i.e. a few seconds ahead. This is of particular interest to market makers, who can skew their bid/ask spread in the direction of the most favorable outcome. Most if not all of the literature on the topic (see references below) focuses on applying out-of-the-box algorithms to create a forecast at any point in time. The problem in a real-life environment is different. A market maker can quote a standard bid/ask spread most of the time, and only when she/he has a statistical edge should she/he skew the spread in the direction given by the model. This is what I try to do here: create a forecast only when a statistical edge exists.
I used Python scikit-learn library and my laptop only to run the analysis (ASUS Zenbook with 8GB of RAM running Linux Lubuntu 19.04, and Python 3.7). I created a dedicated GitHub repository here. It contains the code, some sample data and the associated explanations. I’m happy for anyone to re-use my work as long as proper reference to it is made.
Before proceeding, I wanted to thank LOBSTER for providing the dataset used in this study.
Data
The Limit Order Book (LOB) is the record of collective interest to buy or sell certain quantities of an asset at a certain price. I used LOBSTER tick data for TSLA (Tesla) from 1st Jan 2015 to 30th Jan 2015: a total of 20 full trading days. The LOBSTER data structure is made of 2 files per trading day: a ‘message’ file and an ‘orderbook’ file. The keen reader can have a look here for a detailed explanation of the data structure.
There are between 200,000 and 1 million observations per trading day and per stock, with an order book depth of 10 (the 10 best ask and 10 best bid levels at each tick). This is roughly doubled, as there is one message file and one order book file every day. So this is a fairly big data set that forced me to do some code optimisation.
Price predictors
Several categories of features were selected based on what is reported to be most significant in the literature. I also added some based on experience. In order to avoid overfitting and make results more “readable”, all features have been expressed as deciles relative to their recent history (the last 5 sec.). The added benefit of such an approach is to include the past in the forecasting exercise. Only the first 5 levels of the order book have been kept. There are 67 features overall.
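As an illustration, a raw feature can be mapped to its decile within a trailing 5-second window along these lines (a minimal sketch on random data; `rolling_decile` is a hypothetical helper, not code from the repository):

```python
import numpy as np
import pandas as pd

# Synthetic tick-level feature with a DatetimeIndex, 10 observations/sec.
rng = np.random.default_rng(0)
idx = pd.date_range("2015-01-02 09:45", periods=100, freq="100ms")
feature = pd.Series(rng.normal(size=100), index=idx)

def rolling_decile(s: pd.Series, window: str = "5s") -> pd.Series:
    # Percentile rank of the latest value within the trailing time window,
    # bucketed into deciles 0..9.
    pct = s.rolling(window).apply(lambda w: (w < w.iloc[-1]).mean(), raw=False)
    return np.minimum((pct * 10).astype(int), 9)

deciles = rolling_decile(feature)
print(deciles.min(), deciles.max())
```

Expressing every feature this way puts all predictors on the same bounded 0–9 scale, which is what makes the decile encoding attractive for tree-based models.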
Order Imbalance (x 10): An order imbalance (Imbalance) occurs when there are not enough buy or sell orders in the market to meet the demand from the opposite side. In the context of this study I define the Imbalance as an oscillator: Imbalance(L, t) = (V_bid(L, t) - V_ask(L, t)) / (V_bid(L, t) + V_ask(L, t)), where V = volume, L = level in the order book and t = time. Both the imbalance level and its decile relative to the past 5 sec. have been included.
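The imbalance oscillator at one order book level can be sketched in a few lines (a minimal illustration; the normalised bid/ask volume difference is my reading of the oscillator definition, bounded in [-1, +1]):

```python
import numpy as np

def imbalance(bid_volume: np.ndarray, ask_volume: np.ndarray) -> np.ndarray:
    # (V_bid - V_ask) / (V_bid + V_ask): +1 = all-bid book, -1 = all-ask book.
    return (bid_volume - ask_volume) / (bid_volume + ask_volume)

bid = np.array([300.0, 100.0, 250.0])
ask = np.array([100.0, 300.0, 250.0])
print(imbalance(bid, ask))  # → [ 0.5 -0.5  0. ]
```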
Mid price (x 5), B/A price spread (x 5), B/A size spread (x 5): Last observation. Decile ranking compared to the last 5 sec.
Bid price differences across levels (x 5), Ask price differences across levels (x 5): Differences between order book levels (1 to 5). Decile ranking compared to the last 5 sec.
Accumulated price difference (x 1) and size difference (x 1): Sum of all bids minus sum of all asks across levels. Same for volumes. Decile ranking compared to the last 5 sec.
Bid price (x 5), Ask price (x 5), Bid size (x 5) and Ask size (x 5): Variation relative to the last second. Decile ranking compared to the last 5 sec.
Bid price mean (x 1), Ask price mean (x 1), Bid size mean (x 1) and Ask size mean (x 1): Mean prices and volumes across order book levels. Decile ranking compared to the last 5 sec.
Trade intensity (x 6): Volume of buy and sell orders created and cancelled within a short time period at levels 1 to 5. I defined the following variables:
- Count of event types 1 and 3 (creation & total deletion) over the last second.
- The above metric expressed as decile compared to the past 5 sec.
- Relative events count (10 sec. compared to 900 sec.). Decile ranking compared to the last 5 sec.
Methodology
Mid price forecasting: I focus on forecasting the mid price, defined as the average of the best bid and best ask at any point in time: mid(t) = (best_bid(t) + best_ask(t)) / 2. I want to forecast the direction of the market over the next few seconds, and I’m not interested in the magnitude of the move. Therefore the dependent variable (i.e. the label) has been defined as follows over 2 different time horizons (10 seconds and 20 seconds):
- -1: Negative return
- 0: Stable
- +1: Positive return
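The label construction can be sketched as follows (a simplified illustration using a fixed tick horizon; the study uses a 10- or 20-second forward window on a time-indexed series):

```python
import numpy as np

def make_labels(mid: np.ndarray, horizon: int) -> np.ndarray:
    # Sign of the mid-price move over the forward horizon:
    # -1 = negative return, 0 = stable, +1 = positive return.
    future = mid[horizon:]
    now = mid[:-horizon]
    return np.sign(future - now).astype(int)

mid = np.array([100.0, 100.1, 100.1, 100.0, 100.2])
print(make_labels(mid, horizon=2))  # → [ 1 -1  1]
```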
Walk forward analysis: In order to match what is done in real life, I adopted a walk-forward approach to estimate the model parameters. Walk forward optimises the model on a training set (calibration), tests it on a period following the training set (test / out-of-sample / validation), then rolls both windows forward and repeats the process. There are multiple test periods and we’ll look at these results combined to estimate the overall model performance.
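The rolling scheme can be sketched as follows (a minimal illustration with a handful of hypothetical day labels; the actual study uses 20 trading days):

```python
# Walk-forward split generator: train on `train_size` consecutive days,
# test on the next `test_size` day(s), then roll everything forward by one day.
def walk_forward(days, train_size=4, test_size=1):
    splits = []
    for start in range(len(days) - train_size - test_size + 1):
        train = days[start:start + train_size]
        test = days[start + train_size:start + train_size + test_size]
        splits.append((train, test))
    return splits

days = [f"2015-01-{d:02d}" for d in (2, 5, 6, 7, 8, 9)]
for train, test in walk_forward(days):
    print(train, "->", test)
```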
Standard random forest algorithm
Random forest (from Wikipedia) is an ensemble learning method for classification, regression and other tasks that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees. Random forests correct for decision trees’ habit of overfitting their training set.
In layman’s terms, a random forest is a bunch of decision trees bundled together. Each individual tree in the random forest spits out a class prediction, and the class with the most votes becomes the model’s prediction. Random forest is a bagging algorithm.
Advantages of random forest: Decision trees are robust to outliers and do not assume any prior distribution of the data (purely non-parametric), which, in the case of tick data, is a significant advantage. However, standard decision trees tend to overfit the data. Using random forest addresses this problem: the voting scheme embedded in the algorithm removes (at least partially) the overfitting bias of individual trees. In addition, it’s considered one of the most successful tools in the machine learning arsenal for modelling the LOB.
Drawbacks of random forest: Random forest suffers from the ‘black box’ problem. Letting the algorithm choose the tree depth often results in extremely complicated trees that are impossible to interpret. Features are available as continuous values, meaning that trees can split the variables at any level, and this level will vary greatly from one tree to another. This inevitably creates an overfitting bias at the individual tree level. To overcome this issue I expressed most features as deciles relative to their recent past.
Modified random forest: The standard scikit-learn implementation of random forest creates a classification that is the mode of the classifications produced by the individual trees. The class (up, down or stable) with the highest vote in a tree is the forecast for that tree. There is only one forecast per tree (this is another issue that I will not address here), regardless of the probability attached to it. Let’s take an example: whether the model forecasts a 51% chance of the market going up or a 90% chance of the market going up, from the algorithm’s perspective both forecasts are equal. This is obviously a major drawback. I amended the algorithm to produce a forecast only when the probability was significantly higher than 50%. This theoretically means fewer but more accurate forecasts, which is exactly what I’m trying to achieve.
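A minimal sketch of this idea, using scikit-learn’s `predict_proba` on synthetic data (the names, the threshold and the abstention code `-99` are illustrative assumptions, not the exact implementation used in the study):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary problem: the label depends mostly on the first feature.
rng = np.random.default_rng(42)
X_train = rng.normal(size=(500, 5))
y_train = (X_train[:, 0] + 0.1 * rng.normal(size=500) > 0).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

def thresholded_forecast(model, X, threshold=0.6):
    # Only emit a prediction when the highest class probability clears
    # the threshold; otherwise abstain (encoded here as -99).
    proba = model.predict_proba(X)        # shape (n_samples, n_classes)
    best = proba.argmax(axis=1)
    confident = proba.max(axis=1) >= threshold
    return np.where(confident, model.classes_[best], -99)

X_test = rng.normal(size=(10, 5))
preds = thresholded_forecast(model, X_test, threshold=0.6)
print(preds)
```

Raising `threshold` trades forecast frequency for forecast quality, which is exactly the skew-only-with-an-edge behaviour described above.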
Results
I used the standard implementation of scikit-learn random forest but, as mentioned above, I changed the significance level (forecasting threshold) to evaluate the impact on the quality of the forecast. The criterion for success is the hit ratio: the number of correct forecasts divided by the total number of forecasts, per class of the label and per trading day. The model is trained over 4 days and the resulting random forest is applied to the following day. Then the training set and the testing set are rolled forward by one day and the hit ratios are recalculated. The process is repeated until the end of the sample period is reached. Results presented below are those using a 10 sec. forward return (i.e. the mid-price return between now and 10 sec. from now). Very similar results are obtained using a 20 sec. forward return window. The tables below present the hit ratios per class of the label and per day for 3 different forecasting thresholds (50%, 55% and 60%).
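For reference, the per-class hit ratio can be computed along these lines (a hypothetical sketch; in the thresholded setting, abstentions would simply be excluded from `y_pred` beforehand):

```python
import numpy as np

def hit_ratio_per_class(y_true, y_pred, classes=(-1, 1)):
    # For each forecast class: correct forecasts / total forecasts of that class.
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    ratios = {}
    for c in classes:
        mask = y_pred == c
        ratios[c] = (y_true[mask] == c).mean() if mask.any() else np.nan
    return ratios

y_true = [1, 1, -1, 1, -1, -1]
y_pred = [1, 1, -1, -1, 1, -1]
print(hit_ratio_per_class(y_true, y_pred))
```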
The hit ratio is higher than 50% in the standard case and increases as I move the prediction threshold up, which confirms the assumption made above: as I tighten the rules, the predictions get better.
There is a clear loss of predictive power between the calibration and the test period. This is expected but given the size of the difference between the two periods, it can potentially be the sign of some overfitting despite the voting scheme embedded in the algorithm. Something to investigate.
The model doesn’t make any forecast for the class ‘stable’, which suggests that there is no 10-second period over which the mid price didn’t move. I did some random checks on the data and this seems to be the case.
I also tested the impact of calibrating the model on 4 days and testing it on one day, but 4 days later. For example, on the first line of table 4 below, the random forest has been calibrated using data from 2015-01-02, 2015-01-05, 2015-01-06 and 2015-01-07, and tested with data from 2015-01-15. I only present results for a prediction threshold of 60% but the conclusion applies to other thresholds.
A hit ratio higher than 61% overall is still very good and in line with previous results. So creating a lag between the calibration period and the test period doesn’t seem to affect the results. This indicates that the rules generated with this methodology are very generic (i.e. robust) which opens the door to further investigation.
Conclusion
This study suggests that random forest is a good tool for forecasting the direction of the mid price a few seconds ahead. This is encouraging but it’s only a preliminary work that is subject to a number of limitations.
Assumptions: I made some strong assumptions that probably need to be relaxed. I got rid of the first and last 15 min of every trading day to avoid the model being polluted by the opening and closing auction/cross effects. I’m not sure this is a good thing though. I didn’t perform any feature selection; instead I arbitrarily got rid of all order book levels above 5, which is very likely sub-optimal. Similarly, the 4-day calibration and one-day test periods have been chosen at random. This could easily be improved.
Implementation considerations: What is presented here is by no means a trading strategy. This is instead a promising research avenue that needs to be refined and tailored to individual needs. In particular, I don’t answer the questions of how profitable this could be and how to implement it. As it stands, re-evaluating all the trees at each tick to get the probability of an up, stable or down market is hardly doable in practice, as it would create unacceptable extra latency. A smart way to implement this approach is required.
Further research avenues: It might be very interesting to investigate combining forecasts over several time horizons rather than 10 seconds alone. In addition, it could be more efficient to consider not whole trees but individual decision paths as the basis for calculating the up, stable and down market probabilities. Last point: applying to TSLA a random forest estimated on another stock’s LOB could be interesting, especially as we found that the rules generated with the methodology presented here are rather robust.
As usual any comments welcome!
References
“Machine learning techniques for price change forecast using the limit order book data” – James Han, Johnny Hong, Nicholas Sutardja, Sio Fong Wong – Dec 2015
“Machine Learning for Forecasting Mid Price Movement using Limit Order Book Data” – Paraskevi Nousi, Avraam Tsantekidis, Nikolaos Passalis, Adamantios Ntakaris, Juho Kanniainen, Anastasios Tefas, Moncef Gabbouj, Alexandros Iosifidis – Sep 2018
“Modeling high-frequency limit order book dynamics with support vector machines” – Alec N. Kercheval, Yuan Zhang – Oct 2013
“Price jump prediction in a limit order book” – Ban Zheng, Eric Moulines, Frédéric Abergel – May 2013
“Investigating Limit Order Book Characteristics for Short Term Price Prediction: a Machine Learning Approach” – Faisal Qureshi – Dec 2018
“The Evaluation and Optimization of Trading Strategies” – Robert Pardo