Speeding up your Python code
This article is originally published at https://www.thertrader.com
I know this topic is addressed on a very regular basis on the web, but I’m pretty sure sharing my experience will help some finance people. I’m currently working on limit order book modeling, which means dealing with fairly big data sets: around 1 million observations per stock per day. So modeling the behavior of the order book over just 10 days is already a decent big-data exercise. This has a significant impact on how to write the code, the type of objects to use and, more generally, how to approach the problem.
All tests in this post have been run on my laptop: an ASUS Zenbook with 8 GB of RAM and a Core i5 processor, running Lubuntu 19.04 (Disco) and Python 3.7.
Data
I used a sample data set provided by Lobster. For each trading day there are 2 csv files per stock:
- The ‘message’ file: contains indicators for the type of event causing an update of the limit order book in the requested price range. All events are time-stamped to seconds after midnight, with decimal precision of at least milliseconds and up to nanoseconds, depending on the requested period.
- The ‘order book’ file: contains the evolution of the limit order book up to the requested number of levels.
A complete description of the Lobster data structure can be found here.
Experiment
I want to calculate some metrics that will be used as inputs for a machine learning model. Some of the metrics I need are based on regularly spaced data. Bear in mind that I’m dealing here with tick data, which is completely irregularly spaced in time. So, should I want to calculate the 1-second return for example, at each point in time I first need to find the right time stamp looking backward, then do the calculation itself. The first step is by far the most time-consuming task. What follows is a basic prototyping exercise designed to explain how I approached the problem.
- Using Pandas
```python
import numpy as np
import pandas as pd
import os
import time

os.chdir(r'/home/arno/work/research/lobster/data/INTC/')
f = 'INTC_2015-01-02_34200000_57600000_orderbook_10.csv'
theMessageFile = f[0:15] + '_34200000_57600000_message_10.csv'
mf = pd.read_csv(theMessageFile,
                 names=['timeStamp', 'EventType', 'Order ID', 'Size', 'Price', 'Direction'],
                 float_precision='round_trip')

#----- Method #1: Standard method
timeStamp = mf['timeStamp'].to_frame()   # one-column Pandas dataframe
timeStamp = timeStamp[:100000]

start_time = time.time()
# For each observation, find the timestamp closest to "1 second ago"
stop = [timeStamp.timeStamp[abs(timeStamp.timeStamp - (timeStamp.timeStamp[i] - 1)).idxmin()]
        for i in range(len(timeStamp) - 1)]
print("--- %s seconds ---" % (time.time() - start_time))
```
In the above code, data gets imported from a csv file (the message file for INTC for a single day) as a Pandas dataframe. I select the first 100,000 rows in order to speed up the computation; the original data set contains around 900,000 records. The float_precision option keeps the same number of decimals as in the original csv file. Then I select the time stamp column, convert it to a Pandas dataframe and, as described in the previous section, I look for the index 1 second back in the past. The whole operation takes around 110 seconds, a rather poor performance given the modest size of the data.
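As an aside, once the look-back position is known, the return calculation itself is cheap. Here is a minimal sketch of that second step, assuming integer positions (rather than timestamp values) are collected, and using the Price column from the message file; the variable names are my own, not part of the original code:

```python
# Sketch: the 1-second return once the look-back index is known.
# Assumes `mf` has been loaded as above; collects integer positions
# instead of timestamp values, then takes the price ratio.
ts = mf['timeStamp'].values[:100000]
price = mf['Price'].values[:100000]

idx = [np.abs(ts[:i + 1] - (ts[i] - 1)).argmin() for i in range(len(ts))]
oneSecondReturn = price / price[idx] - 1  # simple return over ~1 second
```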
- Using Numpy
```python
#----- Method #2: Numpy array
timeStamp = mf['timeStamp'].to_frame().values  # .values converts to a Numpy array automatically. Very handy!
timeStamp = timeStamp[:100000]

start_time = time.time()
# Same search as before, but on a Numpy array
stop = [timeStamp[np.abs(timeStamp - (timeStamp[i] - 1)).argmin()]
        for i in np.arange(len(timeStamp))]
print("--- %s seconds ---" % (time.time() - start_time))
```
Now, instead of using a Pandas dataframe, I use a Numpy array. The .values attribute automatically converts the Pandas dataframe to a Numpy array. The rest of the code is very similar to the previous case. Now, to find the right index back in the past for the first 100,000 rows, it takes about 27 seconds. It’s a significant improvement, but I can certainly do better.
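Before optimising further, it is worth checking that the two methods agree. A quick sanity check along these lines would do (stopPandas and stopNumpy are hypothetical names for the two stop lists computed above):

```python
# Hypothetical check that Method #1 and Method #2 find the same timestamps.
# Method #2 returns one-element arrays, hence np.ravel; Method #1's list is
# one element shorter because of its range(len(timeStamp) - 1).
assert np.allclose(stopPandas[:10], np.ravel(stopNumpy[:10]))
```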
- Using Numpy + optimal experiment design
```python
#----- Method #3: Numpy + optimal experiment design
timeStamp = mf['timeStamp']
timeStamp = timeStamp[:100000]

# Maximum number of ticks in any one second (plus a safety margin):
# the observation one second back can never be further away than this.
timeStampInSeconds = timeStamp.round(0)
lookBack = max(timeStampInSeconds.value_counts()) + 10

timeStamp = timeStamp.to_frame().values
myPos = []
start_time = time.time()
for i in range(len(timeStamp)):
    if i == 0:
        pos = timeStamp[0]
    elif i < lookBack:
        # Not enough history yet: search everything before i
        pos = timeStamp[abs(timeStamp[:i, 0] - (timeStamp[i, 0] - 1)).argmin()]
    else:
        # Only search the last lookBack observations
        a = i - lookBack
        bb = timeStamp[a:i, 0]
        pos = bb[abs(bb - (timeStamp[i, 0] - 1)).argmin()]
    myPos.append(pos)
print("--- %s seconds ---" % (time.time() - start_time))
```
Thinking a bit more about the problem, I realised two things. First, at each point in time I don’t need to search the entire set of indexes, only the indexes located before i, so the candidate is of the form i - n with n between 0 and i - 1. Second, if I find the maximum number of ticks per second in the entire data set, this is the furthest I will ever have to look back in the past to find the right index for the 1-second return. This is what is done in the first few lines of the code above, which compute lookBack. The rest of the code is a loop going through all observations.
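To make the windowing idea concrete, here is a toy illustration with made-up timestamps (not Lobster data): with at most lookBack ticks per second, the timestamp one second in the past is guaranteed to sit inside the last lookBack observations, so each search scans a fixed-size window instead of a growing one.

```python
# Toy example of the bounded look-back window (made-up timestamps).
tsToy = np.array([0.10, 0.40, 0.90, 1.05, 1.30, 2.20])  # seconds after midnight
lookBack = 3   # assume at most 3 ticks per second
i = 5          # current observation, tsToy[i] = 2.20

window = tsToy[max(i - lookBack, 0):i]                   # only the last 3 ticks
pos = window[np.abs(window - (tsToy[i] - 1)).argmin()]
print(pos)  # 1.30, the closest tick to 2.20 - 1 = 1.20
```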
There is a massive improvement: it takes only 0.65 seconds to go through 100,000 observations. The table below summarises the results presented above, plus the results of running the same experiment on larger data sets.
| # Observations | Pandas | Numpy | Numpy + optimal experiment design |
|----------------|--------|-------|-----------------------------------|
| 100,000 | 110 sec. | 27 sec. | 0.65 sec. |
| 200,000 | 431 sec. | 155 sec. | 1.27 sec. |
| 300,000 | 856 sec. | 381 sec. | 1.91 sec. |
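For reference, the figures in the last column can be reproduced with a small harness along the following lines; method3 is a hypothetical wrapper around the windowed loop from Method #3, and the exact timings will of course depend on the machine:

```python
# Hypothetical harness to reproduce the table: wrap the windowed search
# from Method #3 in a function and time it for increasing sample sizes.
import time

def method3(timeStamp, lookBack):
    myPos = []
    for i in range(len(timeStamp)):
        if i == 0:
            pos = timeStamp[0]
        elif i < lookBack:
            pos = timeStamp[abs(timeStamp[:i, 0] - (timeStamp[i, 0] - 1)).argmin()]
        else:
            bb = timeStamp[i - lookBack:i, 0]
            pos = bb[abs(bb - (timeStamp[i, 0] - 1)).argmin()]
        myPos.append(pos)
    return myPos

for n in (100000, 200000, 300000):
    sample = mf['timeStamp'][:n]
    lookBack = max(sample.round(0).value_counts()) + 10  # max ticks/second + margin
    start = time.time()
    method3(sample.to_frame().values, lookBack)
    print("%s observations --- %s seconds" % (n, time.time() - start))
```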
Takeaways
- Use the right tool for the job: In the example above using Numpy instead of Pandas is clearly the right choice.
- Think twice about the problem: Reshaping the problem in a more appropriate format allowed a massive performance gain.
- Loops are not necessarily a bad choice when used deliberately.
- Most of the improvement comes from defining the experiment differently
- When the problem is poorly designed, the computing time doesn’t rise linearly with the size of the data set: in the table above, tripling the data multiplies the Pandas time by almost 8, while the redesigned version scales roughly linearly. That makes optimisation even more important.
In a nutshell, code optimisation is not only about programming; it’s also about properly shaping the problem.
As usual, all comments welcome.