Speeding up your Python code
This article is originally published at https://www.thertrader.com
I know this topic is addressed on a very regular basis on the web, but I’m pretty sure sharing my experience will help some finance people. I’m currently working on limit order book modeling, which means dealing with fairly big data sets: around 1 million observations per stock per day. So modeling the behavior of the order book over just 10 days is already a decent big-data exercise. This has a significant impact on how to write the code, the type of objects to use and, more generally, how to approach the problem.
All tests in this post have been run on my laptop: an ASUS Zenbook with 8 GB of RAM and a Core i5 processor, running Lubuntu 19.04 (Disco) and Python 3.7.
Data
I used a sample data set provided by Lobster. For each trading day there are 2 csv files per stock:
- The ‘message’ file: contains indicators for the type of event causing an update of the limit order book in the requested price range. All events are time-stamped to seconds after midnight, with decimal precision of at least milliseconds and up to nanoseconds, depending on the requested period.
- The ‘order book’ file: contains the evolution of the limit order book up to the requested number of levels.
A complete description of the Lobster data structure can be found here.
Experiment
I want to calculate some metrics that will be used as inputs for a machine learning model. Some of the metrics I need are based on regularly spaced data. Bear in mind that I’m dealing here with tick data, which is completely irregularly spaced in time. So, should I want to calculate the 1-second return for example, at each point in time I first need to find the right time stamp looking backward, then do the calculation itself. The first step is by far the most time-consuming task. What follows is a basic prototyping exercise designed to explain how I approached the problem.
- Using Pandas
```python
import numpy as np
import pandas as pd
import os
import time

os.chdir(r'/home/arno/work/research/lobster/data/INTC/')
f = 'INTC_2015-01-02_34200000_57600000_orderbook_10.csv'
theMessageFile = f[0:15] + '_34200000_57600000_message_10.csv'
mf = pd.read_csv(theMessageFile,
                 names=['timeStamp', 'EventType', 'Order ID', 'Size', 'Price', 'Direction'],
                 float_precision='round_trip')

#----- Method #1: Standard method
timeStamp = mf['timeStamp'].to_frame()   # one-column Pandas dataframe
timeStamp = timeStamp[:100000]

start_time = time.time()
# For each observation, find the timestamp closest to "1 second ago"
stop = [timeStamp.timeStamp[abs(timeStamp.timeStamp - (timeStamp.timeStamp[i] - 1)).idxmin()]
        for i in range(len(timeStamp) - 1)]
print("--- %s seconds ---" % (time.time() - start_time))
```
In the above code, data gets imported from a csv file (the message file for INTC for a single day) as a Pandas dataframe. I select the first 100,000 rows in order to speed up the computation; the original data set contains around 900,000 records. The float_precision option keeps the same number of decimals as in the original csv file. Then I select the time stamp column, convert it to a Pandas dataframe and, as described in the previous section, I look for the index 1 second back in the past. The whole operation takes around 110 seconds, a rather poor performance given the modest size of the data.
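As an aside, once the look-back position is known, the return calculation itself is cheap. Here is a minimal sketch of that second step, assuming integer positions (rather than timestamp values) are collected, and using the Price column from the message file; the variable names are my own, not part of the original code:

```python
# Sketch: the 1-second return once the look-back index is known.
# Assumes `mf` has been loaded as above; collects integer positions
# instead of timestamp values, then takes the price ratio.
ts = mf['timeStamp'].values[:100000]
price = mf['Price'].values[:100000]

idx = [np.abs(ts[:i + 1] - (ts[i] - 1)).argmin() for i in range(len(ts))]
oneSecondReturn = price / price[idx] - 1  # simple return over ~1 second
```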
- Using Numpy
```python
#----- Method #2: Numpy array
timeStamp = mf['timeStamp'].to_frame().values  # .values converts to a Numpy array automatically. Very handy!
timeStamp = timeStamp[:100000]

start_time = time.time()
# Same search as before, but on a Numpy array
stop = [timeStamp[np.abs(timeStamp - (timeStamp[i] - 1)).argmin()]
        for i in np.arange(len(timeStamp))]
print("--- %s seconds ---" % (time.time() - start_time))
```
Now, instead of using a Pandas dataframe, I use a Numpy array. The .values attribute automatically converts the Pandas dataframe to a Numpy array. The rest of the code is very similar to the previous case. Now, to find the right index back in the past for the first 100,000 rows, it takes about 27 seconds. It’s a significant improvement, but I can certainly do better.
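Before optimising further, it is worth checking that the two methods agree. A quick sanity check along these lines would do (stopPandas and stopNumpy are hypothetical names for the two stop lists computed above):

```python
# Hypothetical check that Method #1 and Method #2 find the same timestamps.
# Method #2 returns one-element arrays, hence np.ravel; Method #1's list is
# one element shorter because of its range(len(timeStamp) - 1).
assert np.allclose(stopPandas[:10], np.ravel(stopNumpy[:10]))
```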
- Using Numpy + optimal experiment design
```python
#----- Method #3: Numpy + optimal experiment design
timeStamp = mf['timeStamp']
timeStamp = timeStamp[:100000]

# Maximum number of ticks in any one second (plus a safety margin):
# the observation one second back can never be further away than this.
timeStampInSeconds = timeStamp.round(0)
lookBack = max(timeStampInSeconds.value_counts()) + 10

timeStamp = timeStamp.to_frame().values
myPos = []
start_time = time.time()
for i in range(len(timeStamp)):
    if i == 0:
        pos = timeStamp[0]
    elif i < lookBack:
        # Not enough history yet: search everything before i
        pos = timeStamp[abs(timeStamp[:i, 0] - (timeStamp[i, 0] - 1)).argmin()]
    else:
        # Only search the last lookBack observations
        a = i - lookBack
        bb = timeStamp[a:i, 0]
        pos = bb[abs(bb - (timeStamp[i, 0] - 1)).argmin()]
    myPos.append(pos)
print("--- %s seconds ---" % (time.time() - start_time))
```
Thinking a bit more about the problem, I realised two things. First, at each point in time I don’t need to search the entire set of indexes, only the indexes located before i, so the candidate is of the form i - n with n between 0 and i - 1. Second, if I find the maximum number of ticks per second in the entire data set, this is the furthest I will ever have to look back in the past to find the right index for the 1-second return. This is what is done in the first few lines of the code above, which compute lookBack. The rest of the code is a loop going through all observations.
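To make the windowing idea concrete, here is a toy illustration with made-up timestamps (not Lobster data): with at most lookBack ticks per second, the timestamp one second in the past is guaranteed to sit inside the last lookBack observations, so each search scans a fixed-size window instead of a growing one.

```python
# Toy example of the bounded look-back window (made-up timestamps).
tsToy = np.array([0.10, 0.40, 0.90, 1.05, 1.30, 2.20])  # seconds after midnight
lookBack = 3   # assume at most 3 ticks per second
i = 5          # current observation, tsToy[i] = 2.20

window = tsToy[max(i - lookBack, 0):i]                   # only the last 3 ticks
pos = window[np.abs(window - (tsToy[i] - 1)).argmin()]
print(pos)  # 1.30, the closest tick to 2.20 - 1 = 1.20
```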
There is a massive improvement: it takes only 0.65 seconds to go through 100,000 observations. The table below summarises the results presented above, plus the results of running the same experiment on larger data sets.
| # Observations | Pandas | Numpy | Numpy + optimal experiment design |
|----------------|--------|-------|-----------------------------------|
| 100,000 | 110 sec. | 27 sec. | 0.65 sec. |
| 200,000 | 431 sec. | 155 sec. | 1.27 sec. |
| 300,000 | 856 sec. | 381 sec. | 1.91 sec. |
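For reference, the figures in the last column can be reproduced with a small harness along the following lines; method3 is a hypothetical wrapper around the windowed loop from Method #3, and the exact timings will of course depend on the machine:

```python
# Hypothetical harness to reproduce the table: wrap the windowed search
# from Method #3 in a function and time it for increasing sample sizes.
import time

def method3(timeStamp, lookBack):
    myPos = []
    for i in range(len(timeStamp)):
        if i == 0:
            pos = timeStamp[0]
        elif i < lookBack:
            pos = timeStamp[abs(timeStamp[:i, 0] - (timeStamp[i, 0] - 1)).argmin()]
        else:
            bb = timeStamp[i - lookBack:i, 0]
            pos = bb[abs(bb - (timeStamp[i, 0] - 1)).argmin()]
        myPos.append(pos)
    return myPos

for n in (100000, 200000, 300000):
    sample = mf['timeStamp'][:n]
    lookBack = max(sample.round(0).value_counts()) + 10  # max ticks/second + margin
    start = time.time()
    method3(sample.to_frame().values, lookBack)
    print("%s observations --- %s seconds" % (n, time.time() - start))
```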
Takeaways
- Use the right tool for the job: In the example above using Numpy instead of Pandas is clearly the right choice.
- Think twice about the problem: Reshaping the problem in a more appropriate format allowed a massive performance gain.
- Loops are not necessarily a bad choice when used deliberately.
- Most of the improvement comes from defining the experiment differently
- When the problem is poorly designed, the computing time doesn’t rise linearly with the size of the data set: in the table above, tripling the data multiplies the Pandas time by almost 8, while the redesigned version scales roughly linearly. That makes optimisation even more important.
In a nutshell, code optimisation is not only about programming; it’s also about properly shaping the problem.
As usual, all comments welcome.