Flavors of SQL on Pandas DataFrame
This article is originally published at https://statcompute.wordpress.com
In R, sqldf() provides a convenient interface for running SQL statements on data frames. Python offers several comparable ways to query Pandas DataFrames with SQL, all built on the lightweight SQLite engine. While pandasql (https://github.com/yhat/pandasql) works much like sqldf() in R, pysqldf (https://github.com/airtoxin/pysqldf) is more powerful. In the experiments shown below, pysqldf has two advantages over pandasql. First, pysqldf is 2 to 3 times faster than pandasql. Second, pysqldf supports user-defined functions, which are not available in pandasql. It is worth mentioning, however, that the generic Python interface to an in-memory SQLite database can be more efficient and flexible than both pysqldf and pandasql, as demonstrated below, provided that we can load the DataFrame into SQLite and keep it in memory.
from sqlite3 import connect
from pandas import read_sql_query
import pandasql
import pysqldf
import numpy

# CREATE AN IN-MEMORY SQLITE DB
con = connect(":memory:")
cur = con.cursor()
cur.execute("attach 'my.db' as filedb")
cur.execute("create table df as select * from filedb.hflights")
cur.execute("detach filedb")

# IMPORT SQLITE TABLE INTO PANDAS DF
df = read_sql_query("select * from df", con)

# WRITE QUERIES
sql01 = "select * from df where DayofWeek = 1 and Dest = 'CVG';"
sql02 = "select DayofWeek, AVG(ArrTime) from df group by DayofWeek;"
sql03 = "select DayofWeek, median(ArrTime) from df group by DayofWeek;"

# SELECTION:
# 1. PANDASQL
%time t11 = pandasql.sqldf(sql01, globals())
# 2. PYSQLDF
%time t12 = pysqldf.SQLDF(globals()).execute(sql01)
# 3. GENERIC SQLITE CONNECTION
%time t13 = read_sql_query(sql01, con)

# AGGREGATION:
# 1. PANDASQL
%time t21 = pandasql.sqldf(sql02, globals())
# 2. PYSQLDF
%time t22 = pysqldf.SQLDF(globals()).execute(sql02)
# 3. GENERIC SQLITE CONNECTION
%time t23 = read_sql_query(sql02, con)

# DEFINING A NEW FUNCTION:
# DEFINE A FUNCTION NOT SUPPORTED IN SQLITE
class median(object):
    def __init__(self):
        self.a = []
    def step(self, x):
        self.a.append(x)
    def finalize(self):
        return numpy.median(self.a)

# 1. PYSQLDF
udafs = {"median": median}
%time t31 = pysqldf.SQLDF(globals(), udafs = udafs).execute(sql03)
# 2. GENERIC SQLITE CONNECTION
con.create_aggregate("median", 1, median)
%time t32 = read_sql_query(sql03, con)
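For readers without my.db or the extra packages at hand, the generic SQLite approach can be sketched with the standard library alone. This is only a minimal illustration under stated assumptions: the rows are made-up toy values and not the hflights data, the Median class is an illustrative stand-in that uses statistics.median in place of numpy.median, and it skips NULL values, which the class above does not.

```python
import sqlite3
import statistics


class Median:
    """Aggregate for SQLite: collects non-NULL values and returns their median."""

    def __init__(self):
        self.values = []

    def step(self, value):
        # SQLite calls step() once per row in the group; NULLs are skipped.
        if value is not None:
            self.values.append(value)

    def finalize(self):
        # Called once per group; an empty group yields NULL.
        return statistics.median(self.values) if self.values else None


# Toy rows (DayofWeek, ArrTime): hypothetical values for illustration only.
rows = [(1, 900), (1, 1000), (1, 1500), (2, 800), (2, 1200)]

# Build the table in an in-memory database and register the aggregate.
con = sqlite3.connect(":memory:")
con.execute("create table df (DayofWeek integer, ArrTime integer)")
con.executemany("insert into df values (?, ?)", rows)
con.create_aggregate("median", 1, Median)

result = con.execute(
    "select DayofWeek, median(ArrTime) from df group by DayofWeek order by DayofWeek"
).fetchall()
print(result)
```

With an actual DataFrame, the table could presumably be populated through pandas' DataFrame.to_sql("df", con) instead of executemany, after which the same connection serves every later query without rebuilding the database.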