GPT Writes Terrible Pandas Code
This article is originally published at https://www.sharpsightlabs.com
Jesus.
I just spent about an hour playing with GPT, asking it to write some Pandas code for me, and I want to set my computer on fire.
Do you know the frustration and mild contempt you feel when something will make your life harder, but also offend your aesthetic sensibilities?
Like going to a rental car vendor and getting a 2005 Pontiac Aztek that also has broken cupholders.
I mean, it will probably be frustrating to use, but it’s also an affront to my good taste.
This is my reaction to GPT’s Pandas code.
To be fair: it didn’t do absolutely everything wrong.
But at least 80% of the time, it wrote code like a beginner who doesn’t know what the f*ck he’s doing, and has learned everything from other beginners posting snippets on Stack Overflow (I’m being a bit of an asshole … Stack Overflow occasionally has great content).
I’m Going to Show You GPT’s Bad Pandas Code
In this blog post, I’m going to show you the bad Pandas code that GPT generated for me.
I’m going to show you a small but enlightening range of examples, which should be enough to show the types of mistakes that GPT makes.
For clarification, these examples were made with GPT-3.5, which is the one that I can use most consistently without limits.
A Quick Caveat, For Beginners
Ok.
I’m about to show you some code that will probably look “fine” to you if you’re a beginner. I might even offend you by telling you that it’s bad, because there’s a good chance that you’ve written code like this or used code like this in the past.
I understand.
While I will criticize this code for being bad, it’s okay if you’ve written or used something similar. We were all beginners once, including me.
BUT, there is a better way. I have a strong perspective on Pandas code, and code in general.
If you write code like the GPT code I’m about to show you, just know that my jabs are mostly in good fun, and that I want to help you improve.
Note on Datasets
In the first few examples, I gave it a custom dataframe with comma delimited data as follows:
import pandas as pd

sales_data = pd.DataFrame({"name":["William","Emma","Sofia","Markus","Edward","Thomas","Ethan","Olivia","Arun","Anika","Paulo"]
  ,"region":["East","North","East","South","West","West","South","West","West","East","South"]
  ,"sales":[50000,52000,90000,34000,42000,72000,49000,55000,67000,65000,67000]
  ,"expenses":[42000,43000,50000,44000,38000,39000,42000,60000,39000,44000,45000]})
I’ll also use the titanic dataframe from Seaborn in one or two of the later examples.
Subset Rows on a Single Numeric Condition
First, I asked GPT to write code to subset the rows of sales_data based on a value of a numeric variable.
Specifically, I asked it to retrieve rows where sales is greater than 60000.
Here was the conversation:
And here is the code.
sales_data[sales_data['sales'] > 60000]
Let me just say that I hate this style of Pandas code.
I call this “Bracket notation.”
You’ll see here that you need to reference the name of the dataframe multiple times: once to specify the dataframe that we’re subsetting, and again to retrieve the sales variable.
This is bad Pandas code. It’s messy, and harder to read than it should be. If you don’t understand why it’s bad yet, then keep reading the other examples.
The better way
Here’s how I would do it.
Use the Pandas query method instead:
sales_data.query('sales > 60000')
The code is simpler, but also more intuitive.
Here, we’re taking a dataframe and performing a “query” operation on it to retrieve a specific subset. It’s more human-readable. And the advantage of this will become more obvious as we move on to more complex examples.
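If you want to check this yourself, here's a minimal sketch, assuming the sales_data dataframe from the note on datasets. It suggests that both styles return the same rows, so the difference is about readability, not results. The names bracket_result and query_result are mine, purely for illustration.

```python
import pandas as pd

# Same sales_data as in the dataset note above.
sales_data = pd.DataFrame({
    "name": ["William", "Emma", "Sofia", "Markus", "Edward", "Thomas",
             "Ethan", "Olivia", "Arun", "Anika", "Paulo"],
    "region": ["East", "North", "East", "South", "West", "West",
               "South", "West", "West", "East", "South"],
    "sales": [50000, 52000, 90000, 34000, 42000, 72000,
              49000, 55000, 67000, 65000, 67000],
    "expenses": [42000, 43000, 50000, 44000, 38000, 39000,
                 42000, 60000, 39000, 44000, 45000],
})

# Bracket notation: the dataframe name appears twice.
bracket_result = sales_data[sales_data["sales"] > 60000]

# query(): the dataframe name appears once.
query_result = sales_data.query("sales > 60000")

# Compare the two results; equals() checks values, index, and columns.
print(query_result.equals(bracket_result))
```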
Subset Rows on a Single Categorical Condition
Next, we’ll do a similar example, but with a categorical variable used to subset the data instead of a numeric variable:
Here’s the actual code produced by GPT:
sales_data[sales_data['region'] == 'East']
You’ll notice that it’s the same issue as the previous example: GPT is using that terrible “bracket notation”.
This code is difficult to read, and unintuitive.
And as I said previously, this problem will compound as the examples get more complex.
The Better Way
Here’s how I would do it.
sales_data.query('region == "East"')
Like the previous example, you should be using the ‘query’ method.
It’s easier to read, easier to understand, and easier to modify.
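One thing that helps with the "easier to modify" part: query can read a Python variable if you prefix it with @, so the value doesn't have to be hard-coded inside the string. Here's a small sketch, assuming the same sales_data as before. The variable name target_region is hypothetical; I made it up for this example.

```python
import pandas as pd

# Same sales_data as in the dataset note above.
sales_data = pd.DataFrame({
    "name": ["William", "Emma", "Sofia", "Markus", "Edward", "Thomas",
             "Ethan", "Olivia", "Arun", "Anika", "Paulo"],
    "region": ["East", "North", "East", "South", "West", "West",
               "South", "West", "West", "East", "South"],
    "sales": [50000, 52000, 90000, 34000, 42000, 72000,
              49000, 55000, 67000, 65000, 67000],
    "expenses": [42000, 43000, 50000, 44000, 38000, 39000,
                 42000, 60000, 39000, 44000, 45000],
})

# Hypothetical variable: change this one value to filter a different region.
target_region = "East"

# The @ prefix tells query() to look up a Python variable, not a column.
east_rows = sales_data.query("region == @target_region")

print(east_rows)
```

To look at a different region, you change one variable and leave the query string alone.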
Subset on Two Conditions
Ok. Now, let’s move on to a more complex example.
Here, we’ll subset on two conditions. These will be similar to the conditions in the two previous examples, but we’re going to ask GPT to do them both in one piece of code.
Here’s the interaction with GPT:
And here’s the code that it created:
sales_data[(sales_data['sales'] > 50000) & (sales_data['region'].isin(['East', 'West']))]
The code just gets uglier.
Notice how many times it uses the name of the data, sales_data.
And notice how many nested brackets and parentheses are in here.
This code is hard to read, and it will be harder to modify.
The Better Way
One better way to write this is with a single query statement, and an “and” operator.
sales_data.query('(sales > 50000) and (region in ["East", "West"])')
Notice that you only need to reference the dataframe once. The code is also a little easier to read.
But, you can also do this with Pandas method chaining.
(sales_data
    .query('sales > 50000')
    .query('region in ["East", "West"]')
)
Here, we’re actually filtering the rows twice … once based on sales, and once based on region.
Because we’re doing this in two steps, this code will be easier to modify (e.g., it’s easier to comment out one of the steps, if we need to).
This code is more intuitive, easier to read, and easier to use.
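To make the "easier to modify" point concrete, here's a rough sketch, again assuming the sales_data dataframe from earlier. It compares the single query with the chained version, then comments out one step of the chain. The variable names (one_query, chained, sales_only) are mine.

```python
import pandas as pd

# Same sales_data as in the dataset note above.
sales_data = pd.DataFrame({
    "name": ["William", "Emma", "Sofia", "Markus", "Edward", "Thomas",
             "Ethan", "Olivia", "Arun", "Anika", "Paulo"],
    "region": ["East", "North", "East", "South", "West", "West",
               "South", "West", "West", "East", "South"],
    "sales": [50000, 52000, 90000, 34000, 42000, 72000,
              49000, 55000, 67000, 65000, 67000],
    "expenses": [42000, 43000, 50000, 44000, 38000, 39000,
                 42000, 60000, 39000, 44000, 45000],
})

# Version 1: one query with an "and" operator.
one_query = sales_data.query('(sales > 50000) and (region in ["East", "West"])')

# Version 2: two chained queries, one condition per line.
chained = (sales_data
    .query('sales > 50000')
    .query('region in ["East", "West"]')
)

# Version 3: the same chain with the region step commented out.
# Comments are allowed inside the parentheses, so the chain still runs.
sales_only = (sales_data
    .query('sales > 50000')
    # .query('region in ["East", "West"]')
)

print(chained.equals(one_query))
print(len(chained), len(sales_only))
```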
Sort Data in Descending Order
Data sorting is one of the few things that it did well.
Here, I’ve asked it to sort the data by sales, in descending order.
This is the code output:
sales_data.sort_values(by='sales', ascending=False)
This is fine. It’s exactly how I would have done it.
Perform a Multistep Operation to Query, Filter, Sort
Now, I’ll show you one more example.
This should demonstrate how bad the GPT code is, and how much better it could be if you wrote the code the right way.
In this example, I asked GPT to operate on the titanic dataframe.
I’ll leave out the conversation that I had with GPT, but it actually took me 2 or 3 tries to get it to retrieve the titanic dataframe properly. This is because the titanic data exists in multiple packages. GPT named the loaded data differently than I wanted, so I needed to clarify the exact name to give it.
Setting that aside, before we look at the Pandas data wrangling, here is the code to load the data:
import seaborn as sns
titanic = sns.load_dataset('titanic')
Ok.
Once we got the data, I asked GPT to do a multistep data wrangling operation to do 3 things:
- subset the rows based on a categorical variable (return only rows with a particular value)
- subset the columns to retrieve only 3 specific columns
- sort the data in descending order by one of the variables
Here’s the conversation I had with GPT:
And here’s the code it produced.
titanic_subset = titanic[titanic['embark_town'] == 'Southampton'][['sex', 'age', 'survived']].sort_values(by='age', ascending=False)
To put it bluntly, I hate this code.
How many f*cking brackets do you need to do some simple data wrangling?
A lot.
And for 2 out of 3 of the steps, there are no human-readable commands that clearly say what the code is doing. We can see “sort_values” sorting the data, but is it immediately and totally clear where there’s a row-subset and a column-subset? Only if you’re well versed in this type of bad Pandas code. But any relative beginner will be confused.
And what if I want to temporarily remove one of these steps? It will be difficult to do, because the code is all on one line.
This is bad code. If you write code like this: there’s a better way.
The better way
The right way to do this is using Pandas chaining, with every operation on a separate line.
This will make the code easier to read, easier to modify, and it will make it aesthetically cleaner all around.
Let’s take a look.
(titanic
    .query('embark_town == "Southampton"')
    .filter(['sex', 'age', 'survived'])
    .sort_values(['age'], ascending = False)
)
Every operation has its own line.
The commands are relatively human readable.
If you want to remove any of these steps temporarily, you can simply comment the line out with a ‘#’ character at the beginning of the line.
This code is much better.
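If you want to convince yourself that the two versions do the same thing, here's a minimal sketch. To keep it runnable without downloading anything, it uses titanic_sample, a tiny made-up stand-in that I wrote by hand with a few of the titanic column names. The rows are hypothetical, not the real data.

```python
import pandas as pd

# Hypothetical stand-in for the Seaborn titanic data: made-up rows,
# using a handful of the real column names.
titanic_sample = pd.DataFrame({
    "embark_town": ["Southampton", "Cherbourg", "Southampton",
                    "Queenstown", "Southampton"],
    "sex": ["male", "female", "female", "male", "male"],
    "age": [22.0, 38.0, 26.0, 35.0, 54.0],
    "survived": [0, 1, 1, 0, 0],
    "fare": [7.25, 71.28, 7.92, 8.05, 51.86],
})

# Bracket-style one-liner, in the style GPT produced.
one_liner = titanic_sample[titanic_sample["embark_town"] == "Southampton"][["sex", "age", "survived"]].sort_values(by="age", ascending=False)

# Chained version, one operation per line.
chained = (titanic_sample
    .query('embark_town == "Southampton"')   # keep only Southampton rows
    .filter(["sex", "age", "survived"])      # keep only three columns
    .sort_values(["age"], ascending=False)   # sort by age, descending
)

print(chained.equals(one_liner))
```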
Use Pandas Methods, and Use Pandas Chains, Ignore GPT
The first lesson here should be clear: GPT writes bad Pandas code.
Will it improve over time? Maybe.
The issue is that most humans write bad Pandas code, and GPT was probably trained on that code.
The second lesson is that there is a better way.
If you’re using Pandas, you should be using the Pandas methods, like query, filter, sort_values, assign, etc.
They are much better.
And when you do any multi-step data manipulation with Pandas, you should be using Pandas chains.
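I haven't shown assign yet, so here's a quick sketch of how it might fit into a chain, assuming the sales_data dataframe from the start of the post. The profit column and the 20000 cutoff are hypothetical, chosen just for this example.

```python
import pandas as pd

# Same sales_data as in the dataset note above.
sales_data = pd.DataFrame({
    "name": ["William", "Emma", "Sofia", "Markus", "Edward", "Thomas",
             "Ethan", "Olivia", "Arun", "Anika", "Paulo"],
    "region": ["East", "North", "East", "South", "West", "West",
               "South", "West", "West", "East", "South"],
    "sales": [50000, 52000, 90000, 34000, 42000, 72000,
              49000, 55000, 67000, 65000, 67000],
    "expenses": [42000, 43000, 50000, 44000, 38000, 39000,
                 42000, 60000, 39000, 44000, 45000],
})

top_profit = (sales_data
    # Add a new column (hypothetical name "profit") computed from two others.
    .assign(profit=lambda df: df["sales"] - df["expenses"])
    # Keep rows above an arbitrary cutoff.
    .query("profit > 20000")
    # Keep only the columns of interest.
    .filter(["name", "region", "profit"])
    # Sort by profit, descending.
    .sort_values("profit", ascending=False)
)

print(top_profit)
```

The lambda lets assign work on the dataframe as it exists at that point in the chain, so the new column is available to the steps that follow.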
Still not convinced?
About 3 years ago, I did an analysis of Covid-19 data. In that analysis, I needed to do several complex data manipulations with multiple steps.
For example:
(covid_data
    .query("date == datetime.date(2020, 3, 29)")
    .filter(['country','confirmed','dead','recovered'])
    .groupby('country')
    .agg('sum')
    .sort_values('confirmed', ascending = False)
    .reset_index()
)
To be fair: this code is still somewhat complex, but it would be dramatically more complicated if it was written by GPT.
Can you imagine all the brackets! Having all of this on a single line, as if that’s how humans read it?
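You don't have to imagine it. Here's a rough sketch that puts the two styles side by side on a toy dataframe. To be clear about the assumptions: the covid_data here is a few made-up rows (not real Covid-19 figures), the bracket-style line is something I wrote by hand (not actual GPT output), and I filter the date with a hypothetical target_date variable and the @ prefix rather than the exact query string from my original analysis.

```python
import pandas as pd

# Hypothetical toy data: made-up numbers, NOT real Covid-19 figures.
covid_data = pd.DataFrame({
    "country": ["A", "A", "B", "B", "B"],
    "date": pd.to_datetime(["2020-03-29", "2020-03-29", "2020-03-29",
                            "2020-03-29", "2020-03-28"]),
    "confirmed": [10, 5, 30, 20, 99],
    "dead": [1, 0, 2, 1, 9],
    "recovered": [3, 1, 4, 2, 9],
})

# Hypothetical variable holding the date to filter on.
target_date = pd.Timestamp("2020-03-29")

# Chained version: six steps, one per line.
chained = (covid_data
    .query("date == @target_date")
    .filter(["country", "confirmed", "dead", "recovered"])
    .groupby("country")
    .agg("sum")
    .sort_values("confirmed", ascending=False)
    .reset_index()
)

# Hand-written bracket-style version of the same six steps, on one line.
bracketed = covid_data[covid_data["date"] == target_date][["country", "confirmed", "dead", "recovered"]].groupby("country").agg("sum").sort_values("confirmed", ascending=False).reset_index()

print(chained.equals(bracketed))
```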
And let’s not forget, you still need to tell GPT all 6 steps in English, precisely enough to get it to do the exact, multi-step operation that you need.
GTFO.
For the time being, there’s no substitute for knowing what you’re doing.
If you want to be an effective data wrangler, you need to know Pandas. And you need to know how to use it the right way.
Tell me what you think
Do you agree with me that GPT writes bad Pandas code?
Or are you a masochist who absolutely loves brackets and ugly Python code?
I want to hear from you!
Leave your comments in the comments section below.