How to Use the Pandas Astype Function in Python
This article was originally published at https://www.sharpsightlabs.com
In this tutorial, I’ll explain how to use the Pandas astype function to modify the datatype of Pandas dataframe columns and Pandas objects.
I’ll explain what the technique does, explain the syntax, and show you step-by-step examples.
Let’s start with a quick introduction to what the astype technique does.
A Quick Introduction to Pandas Astype
The Pandas astype method modifies the datatype of Pandas objects.
Frequently, we use this tool to modify the datatype of the columns of a dataframe.
But there are some other ways to use it, which I’ll cover in the examples section.
Most commonly, we use this technique when we clean a dataset in Python, although there are potentially other uses for this tool during other parts of the data science workflow.
You can use the Pandas astype technique in a few different ways.
We can use this tool to change the datatype of:
- a Pandas Series
- a single column of a Pandas dataframe
- multiple columns of a dataframe
I’ll show you examples of each of these in the examples section.
But first, let’s take a look at the syntax.
The Syntax of Pandas Astype
Here, we’ll look at the syntax of the Pandas astype() technique.
Since you can use the astype technique on both dataframes and Series objects, we’ll look at the syntax for each.
A quick note
Before we look at the syntax, I just want to remind you that everything else in this syntax section assumes that you’ve already imported Pandas, and that you already have a Pandas dataframe or Series.
You can import Pandas like this:
import pandas as pd
And there are a variety of ways to create or import dataframes. To learn more about dataframes, you can read our tutorial on Python dataframes.
syntax: use astype on a Series
First, let’s look at how to use astype on a Pandas Series.
To call the method for a Series, just type the name of the series, and then use “dot syntax” to call the astype() method.
Inside the parentheses, you provide the name of the data type. The name of the data type should be enclosed in quotation marks. For example, astype('int16').
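As a minimal sketch of the Series syntax described above (the Series name scores and its values are hypothetical, made up for illustration):

```python
import pandas as pd

# Hypothetical example data, not part of the tutorial's dataset
scores = pd.Series([10, 20, 30])

# Call astype with the datatype name as a quoted string
scores_int16 = scores.astype('int16')

print(scores_int16.dtype)
```

The values stay the same; only the datatype used to store them changes.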
syntax: use astype on dataframe column(s)
Next, let’s look at how to use astype on a Pandas dataframe.
First, you’ll type the name of the dataframe, and then use “dot syntax” to call the astype() method.
But inside the parentheses, instead of providing a single value as the argument, we’ll provide a dictionary.
Inside the dictionary, you provide the column name, and the datatype you want to use for the column.
You can use this to change the datatype of a single dataframe column.
But you can also use this to change the datatype of multiple columns. If you want to operate on multiple columns, simply provide several column/datatype pairs, separated by commas, like this:
your_dataframe.astype({'column1':'datatype', 'column2':'datatype'})
This syntax enables you to operate on several dataframe columns at the same time.
This is very useful for data cleaning when you have several variables that need to be modified.
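Here is a small sketch of the dictionary syntax in action. The dataframe df and its column names are hypothetical placeholders, chosen only to mirror the template above:

```python
import pandas as pd

# Hypothetical dataframe for illustration
df = pd.DataFrame({
    'column1': [1, 2, 3],
    'column2': [1.0, 2.0, 3.0],
    'column3': ['a', 'b', 'c'],
})

# Each key is a column name, each value is the target datatype
converted = df.astype({'column1': 'float64', 'column2': 'int64'})

print(converted.dtypes)
```

Columns that are not named in the dictionary (column3 here) keep their existing datatype.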
The parameters of Pandas astype
Now, let’s look at some of the parameters of the astype function:
- dtype
- copy
- errors
dtype (required)
The dtype parameter enables you to specify the datatype.
If you provide a single datatype, it will try to apply the datatype to the whole object. If you’re operating on a Series, it will convert that series to the particular datatype. But if you’re operating on a dataframe, it will try to convert every variable to that datatype (which can cause errors).
Alternatively, if you’re operating on a dataframe, you can provide a dictionary as the argument to this parameter. As shown in the syntax section, if you use this style of syntax, you can provide a dictionary of column/datatype pairs.
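The difference between a single datatype and a dictionary can be sketched as follows. The dataframes numbers and mixed are hypothetical examples, and the sketch assumes the text column is stored with the object datatype:

```python
import pandas as pd

# Hypothetical all-numeric dataframe: a single dtype converts every column
numbers = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
all_float = numbers.astype('float64')

# Hypothetical mixed dataframe: the 'name' column cannot become an integer
mixed = pd.DataFrame({'name': ['Ann', 'Bo'], 'sales': [100, 200]}, dtype='object')

try:
    mixed.astype('int64')          # tries to convert 'name' as well
    whole_frame_failed = False
except ValueError:
    whole_frame_failed = True

# A dictionary limits the conversion to the columns you name
only_sales = mixed.astype({'sales': 'int64'})

print(whole_frame_failed)
print(only_sales.dtypes)
```

This is why the dictionary form is usually the safer choice for dataframes that mix text and numeric columns.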
copy
The copy parameter specifies whether the astype method returns a copy of the underlying data.
By default, this parameter is set to copy = True. What that means is that by default, the method will return a new object (i.e., a new dataframe) with its own copy of the data, and leave the original object unchanged.
Be careful if you set copy = False. The method still returns a new object rather than overwriting your dataset, but that object may share data with the original, so changes to the values in one may propagate to the other. If you do this, you need to make sure that your code works exactly as expected.
errors
The errors parameter specifies whether the technique will raise an exception if you try to apply an inappropriate or invalid datatype to the object.
The two possible arguments to this are:
- raise (which will raise an exception if there is an issue)
- ignore (which will suppress exceptions if there is an issue. If there is an error, the original object will be returned)
By default this is set to errors = 'raise'.
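To illustrate the two settings, here is a rough sketch. The Series raw is a hypothetical example that contains one value ('x') that cannot be converted to an integer:

```python
import pandas as pd

# Hypothetical Series: the value 'x' cannot be parsed as an integer
raw = pd.Series(['1', '2', 'x'], dtype='object')

# errors='raise' is the default, so an invalid conversion raises an exception
try:
    raw.astype('int64')
    conversion_failed = False
except ValueError:
    conversion_failed = True

# errors='ignore' suppresses the exception and returns the data unconverted
ignored = raw.astype('int64', errors='ignore')

print(conversion_failed)
print(ignored.dtype)
```

Note that with errors='ignore' a failed conversion is silent, so it is worth checking the datatype of the result afterwards.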
The output of Pandas astype
By default, the astype method will return a new object.
If you’re operating on a Series, it will return a Series (with the modified datatype).
If you’re operating on a dataframe, it will return a dataframe (with the modified datatype(s)).
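A quick way to confirm the type of the output is sketched below, using a hypothetical one-column dataframe named df:

```python
import pandas as pd

# Hypothetical dataframe for illustration
df = pd.DataFrame({'x': [1, 2, 3]})

series_result = df['x'].astype('float64')   # Series in, Series out
frame_result = df.astype('float64')         # dataframe in, dataframe out

print(type(series_result).__name__)
print(type(frame_result).__name__)
```

In both cases the result is a separate object from the input.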
Examples: How to change the datatype in a Python dataframe
Now that we’ve looked at the syntax for the Pandas astype function, let’s look at some examples.
Examples:
- Change datatype of a Pandas Series
- Change the datatype of one column in a dataframe
- Change the datatypes of multiple variables in a dataframe
Run this code first
Before you run the examples, you’ll need to:
- import necessary packages
- get the example dataframe
Let’s do each of those.
Import necessary packages
First, we need to import Pandas.
You can do that with the following code:
import pandas as pd
Create example dataframe
Next, we need to create the example dataset that we’re going to work with.
We’ll build the dataset with the Pandas DataFrame function, by providing it with a dictionary of column names and lists of the data that we want the variables to contain:
sales_data = pd.DataFrame({"name":["William","Emma","Sofia","Markus","Edward","Thomas","Ethan","Olivia","Arun","Anika","Paulo"]
                           ,"region":["East","North","East","South","West","West","South","West","West","East","South"]
                           ,"sales":[50000,52000,90000,34000,42000,72000,49000,55000,67000,65000,67000]
                           ,"expenses":[42000,43000,50000,44000,38000,39000,42000,60000,39000,44000,45000]
                           }
                          ,dtype = 'object'
                          )
We’ll also create a Pandas Series object that contains only the expenses variable from the dataframe:
expenses_variable = sales_data.expenses
Let’s print out the dataframe so we can see the contents:
print(sales_data)
OUT:
       name region  sales expenses
0   William   East  50000    42000
1      Emma  North  52000    43000
2     Sofia   East  90000    50000
3    Markus  South  34000    44000
4    Edward   West  42000    38000
5    Thomas   West  72000    39000
6     Ethan  South  49000    42000
7    Olivia   West  55000    60000
8      Arun   West  67000    39000
9     Anika   East  65000    44000
10    Paulo  South  67000    45000
You can see 4 variables:
- name
- region
- sales
- expenses
So the dataframe contains dummy sales and expense data for several salespeople, in four different regions.
Examine data types
Let’s quickly look at the data types:
sales_data.dtypes
OUT:
name        object
region      object
sales       object
expenses    object
dtype: object
Right now, all of the variables have the object datatype.
We’ll want to change most of these.
EXAMPLE 1: Change Datatype of a Pandas Series
Let’s start ultra simple.
Here, we’re going to use astype() on a Pandas series.
In the section where we created our dataframe, we also separated out one of the variables into a Series called expenses_variable.
Check initial datatype
First, let’s check the original datatype:
expenses_variable.dtype
OUT:
dtype('O')
This is telling us that the Series has the ‘object’ datatype.
Change datatype
Now, we’ll use Pandas astype to change the datatype to int32.
expenses_variable.astype('int32')
OUT:
0     42000
1     43000
2     50000
3     44000
4     38000
5     39000
6     42000
7     60000
8     39000
9     44000
10    45000
Name: expenses, dtype: int32
Explanation
If you look carefully at the bottom of the output, you’ll see that the datatype of the output is dtype: int32.
Keep in mind: this has not changed the expenses_variable directly. The expenses_variable Series itself still has the ‘object’ datatype, because the output of astype() was sent to the console.
If we wanted to directly change the data, permanently, we’d need to reassign the output back to the original variable name with the code:
expenses_variable = expenses_variable.astype('int32')
EXAMPLE 2: Change the datatype of one column in a dataframe
Now, let’s operate on a column inside of a dataframe.
This will be a little bit different than example 1, where we operated on a Pandas Series.
Here, we’re going to operate on a dataframe, and to do that, the syntax will be slightly different.
Check initial datatype
First, let’s check the original datatypes:
sales_data.dtypes
OUT:
name        object
region      object
sales       object
expenses    object
dtype: object
You’ll notice that all of the columns have the object datatype.
Change datatype
Here, we’re going to use Pandas astype to change the data type of the sales column.
(We’re going to also use the .dtypes attribute to look at the data types of the output.)
sales_data.astype({'sales':'int32'}).dtypes
OUT:
name        object
region      object
sales        int32
expenses    object
dtype: object
Explanation
Notice that in the output, the datatype of sales has been changed to int32.
To do this, we called astype, and provided a dictionary as the argument.
The dictionary contains the name of the column on the left side and the new datatype on the right side.
When we use astype to operate on a full dataframe, we can use this syntax style, where we provide a dictionary with pairs for the column and datatype in the form {'column':'data-type'}.
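As an aside, a commonly used alternative (not the approach shown in this example) is to convert the column as a Series and assign it back to the same column. The sketch below assumes a hypothetical two-row stand-in for sales_data:

```python
import pandas as pd

# Hypothetical two-row stand-in for the tutorial's sales_data
sales_data = pd.DataFrame({'name': ['William', 'Emma'],
                           'sales': [50000, 52000]},
                          dtype='object')

# Convert the column as a Series, then assign it back to the same column
sales_data['sales'] = sales_data['sales'].astype('int32')

print(sales_data.dtypes)
```

Unlike the dictionary form, this assignment changes the column in the existing dataframe, so use it only when you intend to overwrite the original data.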
EXAMPLE 3: Change the datatypes of multiple variables in a dataframe
Now, let’s modify the datatype of multiple columns in a dataframe.
The approach is very similar to how we modified one column in example 2 (we’ll use another dictionary).
Check initial datatype
Again, before we perform the operation, let’s check the original datatypes:
sales_data.dtypes
OUT:
name        object
region      object
sales       object
expenses    object
dtype: object
Once again, notice that all of the columns have the object datatype.
Change datatype
Now, we’ll change the datatype of multiple columns.
To do this, we’ll use a dictionary with the name of the variable and the datatype as the 'key':'value' pairs of our dictionary.
(We’ll also call the .dtypes attribute after the astype operation, so we can see the new datatypes.)
sales_data.astype({'region':'category'
                   ,'sales':'int32'
                   ,'expenses':'int32'
                   }
                  ).dtypes
OUT:
name          object
region      category
sales          int32
expenses       int32
dtype: object
Explanation
Notice in the output that the datatypes of 3 columns have been changed:
- region was changed to category
- sales was changed to int32
- expenses was also changed to int32
How did we do it?
We called the astype method and inside the parenthesis, we provided a dictionary.
The dictionary contained several key/value pairs of the form 'column':'datatype'.
That’s really it … you just need to provide a dictionary with the column names and the new data types.
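If you want to check what the category conversion produced, one option is to look at the distinct categories that Pandas stored. This is a small sketch on a hypothetical dataframe named regions that holds only a few of the region values:

```python
import pandas as pd

# Hypothetical subset of the region values from the example data
regions = pd.DataFrame({'region': ['East', 'North', 'East', 'South', 'West']},
                       dtype='object')

converted = regions.astype({'region': 'category'})

# The .cat accessor exposes the distinct categories of a categorical column
category_list = converted['region'].cat.categories.tolist()

print(converted['region'].dtype)
print(category_list)
```

The individual values in the column are unchanged; the column simply stores them as one of a fixed set of categories.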
Frequently asked questions about Pandas astype
Now that we’ve looked at some examples, let’s look at some common questions about the astype() technique.
Frequently asked questions:
Question 1: I used astype, but my dataframe is unchanged. Why?
If you use the astype method, you might notice that your original dataframe remains unchanged after you call the method.
For example, in example 2, we used the following code:
sales_data.astype({'sales':'int32'})
But if you check the datatypes of sales_data after you run the code, you’ll realize that the datatype of sales is unchanged.
That’s because when we run the astype() method, it produces a new dataframe, and leaves the original dataframe unchanged.
This is how most Pandas methods work.
By default, the output of the method is sent to the console. We can see the output in the console, but to save it, we need to store it with a name.
For example, you could store the output like this:
sales_data_updated = sales_data.astype({'sales':'int32'})
You can name the new output whatever you want. You could even name it with the original name sales_data.
Just be careful. If you reassign the output of astype to the original variable name, it will overwrite your original dataset. Make sure that you check your code so it works properly before you overwrite an input dataframe.
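The behavior described in this answer can be checked with a short sketch. It uses a hypothetical two-row stand-in for sales_data rather than the full example dataframe:

```python
import pandas as pd

# Hypothetical two-row stand-in for the tutorial's sales_data
sales_data = pd.DataFrame({'sales': [50000, 52000]}, dtype='object')

# The result is discarded here, so sales_data itself is not changed
sales_data.astype({'sales': 'int32'})
dtype_before = str(sales_data['sales'].dtype)

# Reassigning the output to the original name makes the change stick
sales_data = sales_data.astype({'sales': 'int32'})
dtype_after = str(sales_data['sales'].dtype)

print(dtype_before)
print(dtype_after)
```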
Leave your other questions in the comments below
Do you have any other questions about the Pandas astype method?
Is there something else that you need to know that I haven’t covered here?
If so, leave your question in the comments section below.
Discover how to become ‘fluent’ in Pandas
This tutorial showed you how to use the Pandas astype method, but if you want to master data wrangling with Pandas, there’s a lot more to learn.
So if you want to master data wrangling in Python, and become ‘fluent’ in Pandas, then you should join our course, Pandas Mastery.
Pandas Mastery is our online course that will teach you these critical data manipulation tools, show you how to memorize the syntax, and show you how to put it all together.
You can find out more here:
Learn More About Pandas Mastery