How to Use Pandas Append to Combine Rows of Data in Python
This article is originally published at https://www.sharpsightlabs.com
In this tutorial, I’ll explain how to use the Pandas append technique to append new rows to a Pandas dataframe or Series.
I’ll explain exactly what the append technique does and how the syntax works, and I’ll show you step-by-step examples.
Let’s start with a quick explanation of what the append method does.
A quick introduction to Pandas append
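The Pandas append technique appends new rows to a Pandas object. This is a very common technique that we use for data cleaning and data wrangling in Python.
Keep in mind that the append method was deprecated in Pandas 1.4 and removed in Pandas 2.0. The code in this tutorial runs on earlier versions; on Pandas 2.0 and later, the pd.concat function does the same job.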
This technique is somewhat flexible, in the sense that we can use it on a couple of different Pandas objects. We can use this technique on:
- dataframes
- Series
When we use append on dataframes, the dataframes often have the same columns. But if the input dataframes have different columns, then the output dataframe will have the columns of both inputs.
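To make the different-columns case concrete, here is a minimal sketch. The two small dataframes are made up for illustration, and I'm using pd.concat instead of append so that the code also runs on Pandas 2.0 and later, where append no longer exists. The stacking behavior is the same.

```python
import pandas as pd

# Two made-up dataframes that share only the "name" column.
left = pd.DataFrame({"name": ["Ann", "Bo"], "sales": [10, 20]})
right = pd.DataFrame({"name": ["Cy"], "region": ["East"]})

# Stack the rows of `right` under the rows of `left`.
# On Pandas < 2.0 this is what left.append(right) did.
combined = pd.concat([left, right])

# The output has the columns of both inputs; cells that have
# no source value are filled with NaN.
print(combined.columns.tolist())
print(combined)
```

Notice that the row from the second dataframe has a missing value for sales, and the rows from the first have missing values for region.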
Having said all of that, what this technique does depends on how we use the syntax.
That being the case, let’s look at the syntax and the optional parameters.
The syntax of Pandas append
Here, I’ll explain the syntax for the Pandas append method.
I’ll explain the syntax for both Pandas dataframes, and Pandas Series objects.
A quick note
Before we look at the syntax, keep in mind a few things:
First, these syntax explanations assume that you’ve already imported the Pandas package. You can do that with the following code:
import pandas as pd
Second, these syntax explanations also assume that you already have two Pandas dataframes or other objects that you want to combine.
For a refresher on dataframes, you can read our blog post on Pandas dataframes.
Dataframe append syntax
Using the append method on a dataframe is very simple.
You type the name of the first dataframe, and then .append() to call the method.
Then inside the parentheses, you type the name of the second dataframe, which you want to append to the end of the first.
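Here is a short sketch of that call pattern. The names first_df and second_df are placeholders I made up. The append call is shown as a comment because it only works on Pandas versions before 2.0; the pd.concat line underneath is the equivalent that works on any version.

```python
import pandas as pd

# Placeholder dataframes, made up for this sketch.
first_df = pd.DataFrame({"x": [1, 2]})
second_df = pd.DataFrame({"x": [3]})

# Pandas < 2.0 only:
#   stacked = first_df.append(second_df)

# Equivalent call that also works on Pandas 2.0 and later:
stacked = pd.concat([first_df, second_df])

print(stacked)
```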
There are also some optional parameters that you can use, which I’ll discuss in the parameters section.
Series append syntax
The syntax for using append on a Series is very similar to the dataframe syntax.
You type the name of the first Series, and then .append() to call the method.
Then inside the parentheses, you type the name of the second Series, which you want to append to the end of the first.
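The same pattern for Series can be sketched like this. The Series names and values are made up, and again I'm using pd.concat so the code is not tied to a Pandas version that still has append.

```python
import pandas as pd

# Placeholder Series, made up for this sketch.
first_series = pd.Series([1, 2, 3])
second_series = pd.Series([4, 5])

# Pandas < 2.0 only:
#   combined_series = first_series.append(second_series)

# Equivalent call that also works on Pandas 2.0 and later:
combined_series = pd.concat([first_series, second_series])

# The result is still a Series, with the second stacked under the first.
print(type(combined_series).__name__)
print(combined_series.tolist())
```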
And once again, there are also some optional parameters that you can use which will slightly change how the method works.
Let’s take a look at those parameters.
The parameters of append
The Pandas append method has three optional parameters that you can use:
ignore_index
verify_integrity
sort
Let’s look at each of them.
ignore_index
(optional)
The ignore_index parameter enables you to control the index of the new output Pandas object.
By default, this is set to ignore_index = False. In this case, Pandas keeps the original index values from the two different input dataframes. Keep in mind that this can produce duplicate index values, which can cause problems.
If you set this parameter to ignore_index = True, Pandas will ignore the index values in the inputs, and will generate a new index for the output. The index values will be labeled 0, 1, … n - 1.
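A quick sketch of the two settings, using made-up dataframes. I'm calling pd.concat here, which accepts the same ignore_index parameter and also runs on Pandas versions where append has been removed.

```python
import pandas as pd

# Made-up inputs; both get the default index starting at 0.
a = pd.DataFrame({"v": [1, 2]})
b = pd.DataFrame({"v": [3, 4, 5]})

# Default: the original index labels are kept, so some repeat.
kept = pd.concat([a, b])
print(kept.index.tolist())   # [0, 1, 0, 1, 2]

# ignore_index=True: the labels are discarded and renumbered 0 .. n - 1.
fresh = pd.concat([a, b], ignore_index=True)
print(fresh.index.tolist())  # [0, 1, 2, 3, 4]
```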
verify_integrity
(optional)
The verify_integrity parameter checks the “integrity” of the new index. If the index has duplicates, and you set verify_integrity = True, Pandas will raise an error.
By default, this parameter is set to verify_integrity = False. In this case, Pandas will allow duplicates in the index.
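Here is a small sketch of that check, with made-up dataframes. It uses pd.concat, which has the same verify_integrity parameter, so it is not tied to a Pandas version that still has append.

```python
import pandas as pd

first = pd.DataFrame({"v": [1, 2]})                      # index labels 0, 1
overlapping = pd.DataFrame({"v": [3, 4]})                # index labels 0, 1 again
distinct = pd.DataFrame({"v": [3, 4]}, index=[10, 11])   # index labels 10, 11

# Overlapping labels plus verify_integrity=True raises a ValueError.
try:
    pd.concat([first, overlapping], verify_integrity=True)
    raised = False
except ValueError as err:
    raised = True
    print("Rejected:", err)

# With non-overlapping labels, the same call succeeds.
ok = pd.concat([first, distinct], verify_integrity=True)
print(ok.index.tolist())
```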
sort
(optional)
The sort parameter controls the sort order of the columns, if the two input dataframes have different columns.
By default, this parameter is set to sort = False. In this case, the columns are not re-sorted when they are appended together.
If you set sort = True, Pandas will re-sort the columns in the output.
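To see the difference, here is a minimal sketch with two made-up dataframes whose columns only partly overlap. I'm using pd.concat, which takes the same sort parameter and works on current Pandas versions.

```python
import pandas as pd

# Made-up inputs with different, deliberately unsorted columns.
a = pd.DataFrame({"b": [1], "a": [2]})
b = pd.DataFrame({"c": [3], "a": [4]})

# sort=False: columns stay in the order they are first seen.
unsorted_cols = pd.concat([a, b], sort=False).columns.tolist()
print(unsorted_cols)

# sort=True: columns are put in sorted order.
sorted_cols = pd.concat([a, b], sort=True).columns.tolist()
print(sorted_cols)
```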
The output of Pandas append
The output of append depends on the input.
Generally, the output will be a new Pandas object, with the rows of the second object appended to the bottom of the first object.
More specifically, if the inputs are dataframes, the output will be a dataframe. And if the inputs are Series, then the output will be a Series.
Also note: the append() method produces a new object and leaves the two original input objects unchanged. This can be very confusing for beginners, so remember that the method produces a new object.
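You can check this for yourself with a sketch like the following. The dataframes are made up, and I'm using pd.concat, which behaves the same way in this respect and runs on any recent Pandas version.

```python
import pandas as pd

# Made-up inputs for this sketch.
original = pd.DataFrame({"v": [1, 2]})
extra = pd.DataFrame({"v": [3]})

# The combined rows live in a new object ...
result = pd.concat([original, extra])

# ... while both inputs still have their original rows.
print(len(original), len(extra), len(result))
print(result is original)
```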
Examples: how to append new rows to a Pandas object
Ok. Now that you’ve seen the syntax, let’s look at a few examples of how to use append to add new rows to a Pandas object.
Examples:
- Append new rows onto a dataframe
- Ignore and reset the index, when you append new rows
- Verify the integrity of the index, when you append new rows
Run this code first
Before you run any of the examples, you need to do two things:
- import Pandas
- create the dataframes we’ll work with
Let’s do those one at a time.
Import Pandas
First, let’s import Pandas.
You can do that with the following code:
import pandas as pd
This will enable us to call Pandas functions with the prefix pd, which is the common convention.
Create dataframes
Next, let’s create two dataframes.
Here, we’ll create dataframes that contain mock sales data.
You can create them with the following code:
import numpy as np  # needed for the np.nan missing values below

sales_data_1 = pd.DataFrame({
    "name": ["William", "Emma", "Sofia", "Markus", "Edward"],
    "region": ["East", np.nan, "East", "South", "West"],
    "sales": [50000, 52000, 90000, np.nan, 42000],
    "expenses": [42000, 43000, np.nan, 44000, 38000],
})

sales_data_2 = pd.DataFrame({
    "name": ["Thomas", "Ethan", "Olivia", "Arun", "Anika", "Paulo"],
    "region": ["West", "South", "West", "West", "East", "South"],
    "sales": [72000, 49000, np.nan, 67000, 65000, 67000],
    "expenses": [39000, 42000, np.nan, 39000, 44000, 45000],
})
And let’s print them out, so you can see roughly what’s in them:
print(sales_data_1)
print(sales_data_2)
OUT:
      name region    sales  expenses
0  William   East  50000.0   42000.0
1     Emma    NaN  52000.0   43000.0
2    Sofia   East  90000.0       NaN
3   Markus  South      NaN   44000.0
4   Edward   West  42000.0   38000.0
     name region    sales  expenses
0  Thomas   West  72000.0   39000.0
1   Ethan  South  49000.0   42000.0
2  Olivia   West      NaN       NaN
3    Arun   West  67000.0   39000.0
4   Anika   East  65000.0   44000.0
5   Paulo  South  67000.0   45000.0
As you can see, these dataframes contain sales information, including name, region, total sales, and expenses.
Notice as well that although the dataframes have the same columns, they have different rows. We’ll use the append() method to append the rows in sales_data_2 on to sales_data_1.
EXAMPLE 1: Append new rows onto a dataframe
First, let’s start simple.
Here, we’ll simply append the rows in sales_data_2 to the end (i.e., the bottom) of sales_data_1.
Let’s run the code, and then I’ll explain:
sales_data_1.append(sales_data_2)
OUT:
      name region    sales  expenses
0  William   East  50000.0   42000.0
1     Emma    NaN  52000.0   43000.0
2    Sofia   East  90000.0       NaN
3   Markus  South      NaN   44000.0
4   Edward   West  42000.0   38000.0
0   Thomas   West  72000.0   39000.0
1    Ethan  South  49000.0   42000.0
2   Olivia   West      NaN       NaN
3     Arun   West  67000.0   39000.0
4    Anika   East  65000.0   44000.0
5    Paulo  South  67000.0   45000.0
Explanation
This is fairly simple.
To call the method, we type the name of the first dataframe, sales_data_1, and then we type .append() to call the method.
Inside the parentheses, we have the name of the second dataframe, sales_data_2.
The output dataframe contains the rows of both, stacked on top of each other.
Notice one thing though: in the numeric index on the left, there are duplicate values. That’s because both input dataframes used the same default index labels (i.e., the index for both started at 0 and incremented by 1 for each row), and append keeps those labels as they are.
These duplicates in the index could be problematic.
We’ll fix it in the next example.
EXAMPLE 2: Ignore and reset the index, when you append new rows
Here, we’ll combine the rows of the two dataframes, but we’ll reset the index for the output dataframe. This will create a new numeric index starting at 0.
To do this, we need to set ignore_index = True. Effectively, this will cause Pandas to “ignore” the index in the input dataframes, and it will create a new index for the output:
sales_data_1.append(sales_data_2, ignore_index = True)
OUT:
       name region    sales  expenses
0   William   East  50000.0   42000.0
1      Emma    NaN  52000.0   43000.0
2     Sofia   East  90000.0       NaN
3    Markus  South      NaN   44000.0
4    Edward   West  42000.0   38000.0
5    Thomas   West  72000.0   39000.0
6     Ethan  South  49000.0   42000.0
7    Olivia   West      NaN       NaN
8      Arun   West  67000.0   39000.0
9     Anika   East  65000.0   44000.0
10    Paulo  South  67000.0   45000.0
Explanation
Notice in the output that the index starts at 0, increments by 1 for each row, and stops at 10.
This is a new index for the output, and it effectively removes any duplicate index labels that were in the input dataframes.
EXAMPLE 3: Verify the integrity of the index, when you append new rows
Now, instead of resetting the index, let’s verify the index.
To do this, we’ll set verify_integrity = True.
This will check the index labels of the inputs for duplicates. If there are duplicate index labels, Pandas will produce an error.
Let’s take a look:
sales_data_1.append(sales_data_2, verify_integrity = True)
OUT:
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
....
ValueError: Indexes have overlapping values: Int64Index([0, 1, 2, 3, 4], dtype='int64')
Explanation
Here, we set verify_integrity = True. This checked the input dataframes for duplicate index labels.
As you can see, running this code produced a ValueError.
The reason is that there were duplicate index labels in the two input dataframes. They both had rows with the labels 0, 1, 2, 3, and 4.
When you encounter an error like this, you may need to do some data cleaning on your input data to remove duplicate rows. Or, you may simply want to ignore the index, as we did in example 2. How you handle this really depends on context.
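One way to combine the two approaches is sketched below: try the strict check first, and fall back to a fresh index only if the labels overlap. The helper stack_rows is a hypothetical name I made up for this sketch, not part of Pandas, and the dataframes are cut-down stand-ins for the sales data. It uses pd.concat so that it also runs on Pandas versions where append has been removed. Whether it’s acceptable to throw the old labels away depends on your data.

```python
import pandas as pd

# Cut-down stand-ins for the sales dataframes; both use labels 0, 1.
first = pd.DataFrame({"name": ["William", "Emma"]})
second = pd.DataFrame({"name": ["Thomas", "Ethan"]})


def stack_rows(top, bottom):
    """Hypothetical helper: stack rows, renumbering only if labels clash."""
    try:
        # Strict attempt: keep the labels, but refuse duplicates.
        return pd.concat([top, bottom], verify_integrity=True)
    except ValueError:
        # Labels overlap, so discard them and build a new 0 .. n - 1 index.
        return pd.concat([top, bottom], ignore_index=True)


combined = stack_rows(first, second)
print(combined.index.tolist())
```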
Frequently asked questions about Pandas append
Now that we’ve looked at some examples, let’s look at some common questions about the append() technique.
Question 1: I used append, but my dataframe is unchanged. Why?
If you use the append method, you might notice that your original dataframe remains unchanged.
For example, in example 1, we ran the following code:
sales_data_1.append(sales_data_2)
If you print out sales_data_1 after you run that code, you’ll realize that sales_data_1 is unchanged.
That’s because the append() method produces a new dataframe, and leaves both original dataframes unchanged.
By default, this output is sent to the console. We can see it in the console, but to save it, we need to store it with a name.
For example, you could store the output like this:
sales_data_combined = sales_data_1.append(sales_data_2)
You can name the output whatever you want. You could even name it sales_data_1. But be careful: if you do that, it will overwrite your original dataset. Make sure that you check your code so it works properly before you overwrite an input dataframe.