How to Use the Pandas Assign Method to Add New Variables

by Joshua Ebner · October 30, 2020

This article is originally published at https://www.sharpsightlabs.com

In this tutorial, I’ll explain how to use the Pandas assign method to add new variables to a Pandas dataframe.

I’ll explain what the assign method does, walk through the syntax, and show you step-by-step examples of how to use it.

If you really want to understand Pandas assign, I recommend that you read the whole article.

A quick introduction to Pandas Assign

So what does the assign method do?

Put simply, the assign method adds new variables to Pandas dataframes.

Quickly, I’ll explain that in a little more depth.

Pandas is a toolkit for working with data in Python

You’re probably aware of this, but just to clarify: Pandas is a toolkit for working with data in the Python programming language.

In Pandas, we typically work with a data structure called a dataframe.

A dataframe is a collection of data stored in a row-and-column format.

Pandas gives us a toolkit for creating these dataframes, and it also provides tools for modifying dataframes.

Pandas has tools for sorting dataframes, aggregating dataframes, reshaping dataframes, and a lot more.

And one of the most important things we need to be able to do is add new columns to a dataframe.

Pandas Assign Adds New Columns to a Dataframe

The Pandas assign method enables us to add new columns to a dataframe.

We provide the input dataframe, tell assign how to calculate the new column, and it creates a new dataframe with the additional new column.

It’s fairly straightforward, but as the saying goes, the devil is in the details.

So with that said, let’s take a look at the syntax so we can see how the assign method works.

The syntax of the assign method

The syntax for the assign method is fairly simple.

You type the name of your dataframe, then a “dot”, and then type assign().

Remember, the assign method is a Python method that’s associated with dataframe objects, so we can use so-called “dot syntax” to call the method.

Next, inside the parentheses, we need to provide a “name-value pair.”

What does that mean?

We simply provide the name of the new variable and the value that we want to assign to that variable. The value that we assign can be simple (like an integer constant), but it can also be a complicated value that we calculate.
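As a rough sketch of the name-value pair syntax, the snippet below uses a small, hypothetical dataframe called df (it is not the dataset used later in this tutorial). The column names flag and double_x are also made up for illustration.

```python
import pandas as pd

# Hypothetical toy dataframe, used only to illustrate the syntax
df = pd.DataFrame({"x": [1, 2, 3]})

# name = value: the keyword becomes the new column name,
# and the value is what gets stored in that column
df_constant = df.assign(flag = 0)             # a simple constant value
df_computed = df.assign(double_x = df.x * 2)  # a value calculated from an existing column

print(df_constant)
print(df_computed)
```

Notice that the new variable name is written without quotation marks, because it is passed to assign as a keyword argument.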

I’ll show you examples of exactly how we use it in the examples section of this tutorial.

Syntax to add multiple variables to a dataframe

One quick note on the syntax:

If you want to add multiple variables, you can do this with a single call to the assign method.

Just type the name of your dataframe, call the method, and then provide the name-value pairs for each new variable, separated by commas.

Honestly, adding multiple variables to a Pandas dataframe is really easy. I’ll show you how in the examples section.
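Before we get to the full examples, here is a minimal sketch of the multiple-variable syntax. It assumes a hypothetical toy dataframe named df, and the new column names x_squared and source are invented for illustration.

```python
import pandas as pd

# Hypothetical toy dataframe, used only to illustrate the syntax
df = pd.DataFrame({"x": [1, 2, 3]})

# Two name-value pairs in a single call, separated by a comma
df_new = df.assign(x_squared = df.x ** 2
                  ,source = "example"
                  )

print(df_new)
```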

The Output of the Assign Method

Before we look at the examples, let’s quickly talk about the output of the assign method.

This is really important, so you need to pay attention …

The output of the assign method is a new dataframe.

Read that again. It’s really important.

The output of the assign method is a new dataframe.

So if you use the assign method, you need to save the output in some way, or else the output will simply be printed to the console (if you’re working in an IDE) and then discarded.

The implication of this is that if you just run the method, your original dataframe will be left unchanged unless you store the output to the original name.

(You can obviously also store the output to a new name. This is safer, unless you’re positive that you want to overwrite your original data.)
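To make this concrete, here is a small sketch that checks the behavior described above. The dataframe df and the names result and y are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Hypothetical toy dataframe, used only to illustrate the behavior
df = pd.DataFrame({"x": [1, 2, 3]})

# Store the output of assign under a new name
result = df.assign(y = 10)

print(df.columns.tolist())      # ['x']       -> the original is unchanged
print(result.columns.tolist())  # ['x', 'y']  -> the output has the new column
print(result is df)             # False       -> assign returned a different object
```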

Examples: how to add a column to a dataframe in Pandas

Ok. Now that I’ve explained how the syntax works, let’s take a look at some examples of how to use assign to add new variables to a dataframe.


Run this code first

Before you run any of these examples, you need to do two things:

import pandas
create the dataframe we’ll use

Import Pandas

You can run this code to import Pandas:

import pandas as pd

Create DataFrame

Next, let’s create our dataframe.

sales_data = pd.DataFrame({
"name":["William","Emma","Sofia","Markus","Edward","Thomas","Ethan","Olivia","Arun","Anika","Paulo"]
,"region":["East","North","East","South","West","West","South","West","West","East","South"]
,"sales":[50000,52000,90000,34000,42000,72000,49000,55000,67000,65000,67000]
,"expenses":[42000,43000,50000,44000,38000,39000,42000,60000,39000,44000,45000]})

We’ve called this DataFrame sales_data.

This dataframe contains mock sales data for 11 people and it has variables for both sales and expenses.

From here, we can use the assign() method to add some new variables.

EXAMPLE 1: Create a new variable and assign a constant

In this first example, we’re going to add a new variable to the dataframe and assign a constant value for every row.

Let’s think about something specific.

Say that you’re working with this dataset, and all of these people work for the same company. You might have some other dataframes that have records for salespeople who work for different companies, but everyone in sales_data works for the same company.

What if we want to create a variable that contains the company name for the people in this dataframe?

We can do that with assign as follows:

sales_data.assign(company = "Vandelay Industries")

OUT:

       name region  sales  expenses              company
0   William   East  50000     42000  Vandelay Industries
1      Emma  North  52000     43000  Vandelay Industries
2     Sofia   East  90000     50000  Vandelay Industries
3    Markus  South  34000     44000  Vandelay Industries
4    Edward   West  42000     38000  Vandelay Industries
5    Thomas   West  72000     39000  Vandelay Industries
6     Ethan  South  49000     42000  Vandelay Industries
7    Olivia   West  55000     60000  Vandelay Industries
8      Arun   West  67000     39000  Vandelay Industries
9     Anika   East  65000     44000  Vandelay Industries
10    Paulo  South  67000     45000  Vandelay Industries

Explanation

So what did we do in this example?

Here, we created a new variable called company.

For every row in the data, the value for the company variable is the same. The value is “Vandelay Industries.”

In technical terms, the value is a constant for every row. More specifically, it’s a string value.

Having said that, when we create variables with constant values, we can add string values like this example, but we can also assign a new variable with a constant numeric value. For example, try the code sales_data.assign(newvar = 1).
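If you want to try the numeric version, the following sketch recreates the sales_data dataframe from above and assigns the constant 1 to every row. The name with_newvar is only an illustrative name for the output.

```python
import pandas as pd

# Recreate the tutorial's dataframe so this snippet runs on its own
sales_data = pd.DataFrame({
    "name": ["William", "Emma", "Sofia", "Markus", "Edward", "Thomas",
             "Ethan", "Olivia", "Arun", "Anika", "Paulo"],
    "region": ["East", "North", "East", "South", "West", "West",
               "South", "West", "West", "East", "South"],
    "sales": [50000, 52000, 90000, 34000, 42000, 72000,
              49000, 55000, 67000, 65000, 67000],
    "expenses": [42000, 43000, 50000, 44000, 38000, 39000,
                 42000, 60000, 39000, 44000, 45000],
})

# Assign a constant numeric value to a new column called newvar
with_newvar = sales_data.assign(newvar = 1)

print(with_newvar)
```

Every row of the output should show 1 in the newvar column, just as every row showed the same company name in the string example.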

EXAMPLE 2: Add a variable that’s a computed value

Here, we’re going to assign a new variable that’s a computed value.

Specifically, we’re going to create a new variable called profit that equals sales minus expenses. (Finance and accounting geeks will know that this is not a precise way to compute profit, but we’ll use this simplified calculation for purposes of example.)

Let’s run the code, and I’ll explain below.

sales_data.assign(profit = sales_data.sales - sales_data.expenses)

OUT:

       name region  sales  expenses  profit
0   William   East  50000     42000    8000
1      Emma  North  52000     43000    9000
2     Sofia   East  90000     50000   40000
3    Markus  South  34000     44000  -10000
4    Edward   West  42000     38000    4000
5    Thomas   West  72000     39000   33000
6     Ethan  South  49000     42000    7000
7    Olivia   West  55000     60000   -5000
8      Arun   West  67000     39000   28000
9     Anika   East  65000     44000   21000
10    Paulo  South  67000     45000   22000

Explanation

Here, we created a new computed column called profit.

As you can see, profit is simply sales minus expenses.

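Notice though, that when we reference the sales and expenses variables inside of assign(), we refer to them through the dataframe, as sales_data.sales and sales_data.expenses. A bare name like sales would not work, because Python has no variable called sales outside of the dataframe.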

Alternatively, we could call them as sales_data['sales'] and sales_data['expenses'].

I prefer the former because they’re much easier to read, but you can choose.
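As a quick check that the two styles really are interchangeable here, the sketch below computes profit both ways and compares the results. The names dot_version and bracket_version are illustrative only.

```python
import pandas as pd

# Recreate the tutorial's dataframe so this snippet runs on its own
sales_data = pd.DataFrame({
    "name": ["William", "Emma", "Sofia", "Markus", "Edward", "Thomas",
             "Ethan", "Olivia", "Arun", "Anika", "Paulo"],
    "region": ["East", "North", "East", "South", "West", "West",
               "South", "West", "West", "East", "South"],
    "sales": [50000, 52000, 90000, 34000, 42000, 72000,
              49000, 55000, 67000, 65000, 67000],
    "expenses": [42000, 43000, 50000, 44000, 38000, 39000,
                 42000, 60000, 39000, 44000, 45000],
})

# Dot syntax
dot_version = sales_data.assign(profit = sales_data.sales - sales_data.expenses)

# Bracket syntax
bracket_version = sales_data.assign(profit = sales_data["sales"] - sales_data["expenses"])

# DataFrame.equals checks that both outputs have the same shape and values
print(dot_version.equals(bracket_version))  # True
```

One practical note: the bracket style is the one to use when a column name contains a space or clashes with the name of a dataframe method, because dot syntax cannot reach those columns.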

EXAMPLE 3: Add multiple variables to your dataframe

In the previous two examples, we were adding only one new variable at a time.

Here in this example, we’ll add two variables at the same time.

We’re going to add the profit variable and the company variable.

Let’s take a look.

sales_data.assign(profit = sales_data.sales - sales_data.expenses
                 ,company = "Vandelay Industries"
                 )

OUT:

       name region  sales  expenses  profit              company
0   William   East  50000     42000    8000  Vandelay Industries
1      Emma  North  52000     43000    9000  Vandelay Industries
2     Sofia   East  90000     50000   40000  Vandelay Industries
3    Markus  South  34000     44000  -10000  Vandelay Industries
4    Edward   West  42000     38000    4000  Vandelay Industries
5    Thomas   West  72000     39000   33000  Vandelay Industries
6     Ethan  South  49000     42000    7000  Vandelay Industries
7    Olivia   West  55000     60000   -5000  Vandelay Industries
8      Arun   West  67000     39000   28000  Vandelay Industries
9     Anika   East  65000     44000   21000  Vandelay Industries
10    Paulo  South  67000     45000   22000  Vandelay Industries

Explanation

Here in this example, we added two variables at the same time: profit and company.

Notice that syntactically, I actually put the second variable on a new line of code. This is mostly for readability. If you want, you can keep all of your code on the same line, but I don’t necessarily recommend it. I personally think that your code is much easier to read and debug if each different variable assignment is on a separate line.

That said, the two new variable assignments must be separated by a comma. Here, the comma that separates the two variable assignments comes before the assignment of the company variable. This is important, so don’t forget the comma.

EXAMPLE 4: Store the output of assign to a new name

Finally, let’s do one more example.

Here, we’re going to store the output to a new name.

Notice that in the previous examples, the code did not modify the original dataframe.

When we use assign, it produces a new dataframe as an output and leaves your original dataframe unchanged. This is very important to remember! Many beginner data science students get frustrated when they first use this technique, because they can’t figure out why their dataframe stays the same, even after they run assign(). Always remember: assign produces a new dataframe.

Having said that, we can store the new output dataframe under a name of our choosing.

If we want, we can store it to a new name, like sales_data_revised.

Or, we can store it to the original dataframe name, sales_data, and overwrite the original!

So it is possible to replace your original dataframe with the updated one, but you need to use the equal sign to store the output of the assign method under the original name.

Ok, with all that said, let’s look at an example.

Here, we’ll take the output of assign and store it to a new name called sales_data_revised.

sales_data_revised =  sales_data.assign(profit = sales_data.sales - sales_data.expenses
                                        ,company = "Vandelay Industries"
                                        )

Now, the new dataframe is stored in sales_data_revised.

Let’s print it out.

print(sales_data_revised)

OUT:

       name region  sales  expenses  profit              company
0   William   East  50000     42000    8000  Vandelay Industries
1      Emma  North  52000     43000    9000  Vandelay Industries
2     Sofia   East  90000     50000   40000  Vandelay Industries
3    Markus  South  34000     44000  -10000  Vandelay Industries
4    Edward   West  42000     38000    4000  Vandelay Industries
5    Thomas   West  72000     39000   33000  Vandelay Industries
6     Ethan  South  49000     42000    7000  Vandelay Industries
7    Olivia   West  55000     60000   -5000  Vandelay Industries
8      Arun   West  67000     39000   28000  Vandelay Industries
9     Anika   East  65000     44000   21000  Vandelay Industries
10    Paulo  South  67000     45000   22000  Vandelay Industries

Explanation

When we run the code in this example, assign() is creating a new dataframe with the newly assigned variables, profit and company.

But instead of letting that new output be passed to the console, we’re storing it with a new name so we can access it later.

Remember: assign produces a new dataframe as an output and leaves the original unchanged. If you want to store the output, you need to use the equal sign to pass the output to a new name.

How to Overwrite your Original Data

One last comment on this.

You can actually overwrite your original data directly. To do this, just run the assign method and pass the output to the original dataframe name, sales_data.

sales_data = sales_data.assign(profit = sales_data.sales - sales_data.expenses
                               ,company = "Vandelay Industries"
                               )

This is totally appropriate to do in some circumstances. Sometimes, you really do want to overwrite your data.

But be careful!

Test your code before you do this, otherwise you might overwrite your data with incorrect values!
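One possible way to follow that advice is sketched below: build the result under a temporary name, run a few checks, and only then overwrite. The name candidate and the specific checks are assumptions made for illustration, so adapt them to whatever “correct” means for your data.

```python
import pandas as pd

# Recreate the tutorial's dataframe so this snippet runs on its own
sales_data = pd.DataFrame({
    "name": ["William", "Emma", "Sofia", "Markus", "Edward", "Thomas",
             "Ethan", "Olivia", "Arun", "Anika", "Paulo"],
    "region": ["East", "North", "East", "South", "West", "West",
               "South", "West", "West", "East", "South"],
    "sales": [50000, 52000, 90000, 34000, 42000, 72000,
              49000, 55000, 67000, 65000, 67000],
    "expenses": [42000, 43000, 50000, 44000, 38000, 39000,
                 42000, 60000, 39000, 44000, 45000],
})

# Step 1: store the output under a temporary (hypothetical) name
candidate = sales_data.assign(profit = sales_data.sales - sales_data.expenses
                             ,company = "Vandelay Industries"
                             )

# Step 2: simple sanity checks, chosen only as examples
assert len(candidate) == len(sales_data)                          # no rows gained or lost
assert (candidate.profit == candidate.sales - candidate.expenses).all()

# Step 3: overwrite the original name only after the checks pass
sales_data = candidate

print(sales_data.head())
```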

Frequently Asked Questions about the Pandas Assign Method

Let’s very quickly address one common question about the Pandas assign method.

Question 1: Why is my dataframe unchanged, after using assign?

This is a very common question, and the answer is very straightforward.

As I mentioned several times in this tutorial, the assign method returns a new dataframe that contains the newly assigned variables, and it leaves your input dataframe unchanged.

If you want to overwrite your dataframe, and add the new variables, you need to take the output and use the equal sign to re-store the output into the original name.

So you need to set sales_data = sales_data.assign(...), like this:

sales_data =  sales_data.assign(profit = sales_data.sales - sales_data.expenses
                                ,company = "Vandelay Industries"
                                )

Keep in mind that this will overwrite your data! So you need to be very careful when you do this. Test your code and make sure that it’s working exactly as expected before you do this. If you don’t, you might overwrite your original data with an incorrect dataset, and you’ll have to restart your data retrieval and data wrangling from scratch. This is sometimes a huge pain in the a**, so be careful.

Alternatively, you can store the output of assign with a new name, like this:

sales_data_revised =  sales_data.assign(profit = sales_data.sales - sales_data.expenses
                                        ,company = "Vandelay Industries"
                                        )

Storing the output with a new name, like sales_data_revised, is safer because it doesn’t overwrite the original.

You may actually want to overwrite the original. Just make sure that your code works before you do.
