More Terrible Pandas code from GPT
This article was originally published at https://www.sharpsightlabs.com
A few days ago, I was writing some code for a new machine learning course that I’m building, and I had a somewhat tricky problem to solve.
As I’ve said in the past, Pandas code (and data manipulation more generally) is very important for machine learning. Not just for wrangling your data into the proper shape, but also for analyzing the results after you build your models.
In this particular case, I wrote some code to compare multiple machine learning models, and I was trying to analyze the results contained within the output.
To do this, I wanted to subset the output down to the results for a specific model type, but also subset down to records where the training time was an exact number. (It’s unimportant why, but I was unsure that the procedure was working properly, so I wanted to subset the results data to verify that it was correct.)
I had a few ideas for how to do this, but the problem is tricky in a subtle way, so I wanted to see how GPT would solve it.
I asked GPT the following question:
To put it lightly, I was displeased with the results.
GPT Wrote Terrible Pandas Code
I’ll show you how I solved the problem later in this blog post, but first, we’ll take a look at the many bad solutions GPT came up with, and how it completely ignored my preferences about how to solve the problem.
Let’s take a look at how GPT solved this problem.
Can you read this code, quickly and easily?
And good god.
So. many. brackets.
Anyone who’s seen me write Pandas code, or seen me critique it, knows that this code is problematic.
“I hate your solution, GPT”
I quickly let GPT know that I didn’t like the solution.
(I had had a long day, and was a bit of an a$$hole about it):
GPT then updated its response by using the dreaded “bracket notation” again, while also giving me an even more complex solution.
You’ll notice in this code that GPT is creating a new variable, but is using bracket notation to do it. It should be using the Pandas assign method.
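To make the difference concrete, here is a minimal sketch. GPT's actual output isn't reproduced here, so the tiny DataFrame, its values, and the fit_time_rounded column name are all made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the model results; the values are invented.
results = pd.DataFrame({
    "model": ["LogReg", "LogReg", "Tree"],
    "fit_time": [0.6786651611328125, 0.7012345678, 0.9],
})

# Bracket notation: a separate statement that modifies the DataFrame in place.
bracket_version = results.copy()
bracket_version["fit_time_rounded"] = bracket_version["fit_time"].round(6)

# assign(): returns a new DataFrame and leaves the original untouched,
# so it can sit inside a method chain.
assign_version = results.assign(fit_time_rounded=results.fit_time.round(6))
```

Both versions produce the same columns. The assign version just does it without a side effect on the original data.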
It does use the Pandas query method, as I requested, but instead of using two separate query steps like I initially did in my code, it collapsed the two queries into one.
This is bad because it makes it harder to remove one query or the other when you need to change the operation or debug the code. Again, you should separate “and” queries into individual steps whenever possible, because it will make your code easier to work with.
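As a rough sketch of what I mean, compare one combined query with two chained queries. The data and the 0.7 threshold are hypothetical, chosen only to show the pattern.

```python
import pandas as pd

# Hypothetical stand-in for the model results; the values are invented.
results = pd.DataFrame({
    "model": ["LogReg", "LogReg", "Tree"],
    "fit_time": [0.6786651611328125, 0.7012345678, 0.9],
})

# One combined query: both conditions are tangled together in a single string.
combined = results.query('model == "LogReg" and fit_time > 0.7')

# Two chained queries: each condition is its own step,
# so you can comment out or delete one line while debugging.
chained = (results
    .query('model == "LogReg"')
    .query('fit_time > 0.7')
)
```

The two approaches return the same rows, but the chained version lets you inspect the result after each step.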
GPT Doubles Down, Again
I politely let GPT know that I was displeased with the output, implying that I wanted it to correct the issues.
It turned out badly.
You can see next that GPT attempted to re-do the code, but just made the solution more complex.
I pointed out that the lambda function just made the code hard to read.
(Again, I was a bit of an a**hole about it, but it’s a computer program.)
After I pointed out the problems, GPT attempted to re-do the code yet again:
Is this any simpler? Easier to use? Easy to read?
Yet again, another over-complicated solution.
The right way to solve this
Now that you’ve seen the many complicated ways that GPT used to try to solve this, let’s look at a good solution.
Before I show you how to do this, we’ll need to get the data (so you can run the code yourself).
import pandas as pd
bootstrap_df = pd.read_csv('http://sharpsightlabs.com/datasets/model_results.csv')
Solve the Problem with 2 Query Steps
Now that you have the data, I’ll show you a better way to solve the problem.
Here, we’ll use 2 calls to the Pandas query method.
The first query will subset to the rows for logistic regression, and the second query will subset to the rows where fit_time is one particular value.
(bootstrap_df
    .query('model == "LogReg"')
    .query('fit_time.round(6) == .678665')
)
Notice how simple this is.
The real secret to doing this is to use the round() method, instead of all the complicated code suggested by GPT.
It’s very simple. GPT got it wrong, in the sense that it made it way, way too complicated.
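If you want to see why the rounding step matters, here is a small self-contained sketch. It uses a made-up DataFrame instead of the real results file, and it assumes the stored fit_time carries more decimal places than the six you see when you print the data.

```python
import pandas as pd

# Hypothetical stand-in for the model results; the values are invented.
# The first fit_time has more digits than the 0.678665 you would see printed.
results = pd.DataFrame({
    "model": ["LogReg", "LogReg", "Tree"],
    "fit_time": [0.6786651611328125, 0.7012345678, 0.9],
})

# Exact comparison against the displayed value finds nothing,
# because the stored float is not exactly 0.678665.
exact_match = (results
    .query('model == "LogReg"')
    .query('fit_time == 0.678665')
)

# Rounding to six decimal places first makes the comparison line up.
rounded_match = (results
    .query('model == "LogReg"')
    .query('fit_time.round(6) == 0.678665')
)

print(len(exact_match))    # 0
print(len(rounded_match))  # 1
```

Six is simply the number of decimal places in the value I was matching. If your target value has a different precision, round to that instead.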
You Need to Master Pandas
The point here is that you need to master Pandas on your own.
GPT can be a useful pair-programmer, and it can help you solve some types of problems.
But unfortunately, GPT writes terrible Pandas code.
Relying on GPT will only cause problems for you.
If you want to be a great data scientist, as I’ve said many times in the past …
… you need to master Pandas.
Tell me What You Think
Do you agree with me that GPT writes bad Pandas code?
Or do you prefer your code to be complex and virtually unreadable?
I want to hear from you!
Leave your comments in the comments section at the bottom of the page.