Pandas Mean, Explained
This article is originally published at https://www.sharpsightlabs.com
In this tutorial, I’ll show you how to use the Pandas mean technique. The mean() technique calculates the mean of the numeric values in a Pandas dataframe or Pandas series.
So in the tutorial, I’ll explain how we use the technique, how the syntax works, and I’ll show you step-by-step examples.
Let’s start with an introduction to Pandas mean.
A quick introduction to Pandas Mean
The Pandas mean technique is a tool for data exploration and data analysis in Python.
We use the mean() technique to compute the mean of the values in a Pandas dataframe or Series.
It’s most common to use this tool on a single dataframe column, but the Pandas mean technique will work on:
- entire Pandas dataframes
- Pandas Series objects
- individual dataframe columns
Again, the Pandas mean technique is most commonly used for data exploration and analysis. When we analyze data, it’s very common to examine summary statistics like mean, median, minimum, maximum, etc.
We sometimes do this for a whole variable, but there are also instances when we first group our data by a categorical variable, and then compute the mean by category. This is extremely common in data analysis, and extremely useful. I’ll show an example of a grouped mean in the examples section.
But before we look at examples of the technique, we first need to understand the syntax.
That being the case, let’s look at the syntax of the Pandas mean technique.
The syntax of Pandas mean
The syntax for the pandas mean technique depends on what type of object you’re using it on.
We can use mean() on:
- dataframes
- Series
- individual dataframe columns
That being the case, we’ll look separately at the dataframe syntax, the Series syntax, and the syntax for using mean on a single dataframe column.
A quick note
For all of the following syntax explanations, I’ll assume that you already have Pandas installed, and that you have a dataframe that you can work with.
You can import Pandas with the following code:
import pandas as pd
If you need a quick review of Pandas dataframes, please read our introduction to dataframes in Python.
Dataframe Syntax
Let’s start with the syntax for how to use mean() on a dataframe.
When you use mean() on an entire dataframe, you simply type the name of the dataframe and then .mean() to call the method.
When you use mean() on a whole dataframe, it will attempt to operate on all of the columns by default. In older versions of Pandas, the output silently included only the means of numeric variables (int, float, and bool). In Pandas 2.0 and later, non-numeric columns cause an error unless you set numeric_only = True.
There are also some optional parameters that you can use, which will modify the output slightly. I’ll explain those in the parameters section.
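As a minimal sketch of the dataframe syntax, here is a small made-up dataframe (the column names and values are hypothetical, chosen only for illustration). Because it contains a text column, the call passes numeric_only = True so that it also runs on recent Pandas versions.

```python
import pandas as pd

# Hypothetical toy dataframe, not part of the titanic data used later
df = pd.DataFrame({
    'height': [150.0, 160.0, 170.0],
    'weight': [50, 60, 70],
    'name': ['Ann', 'Bob', 'Cy'],
})

# numeric_only = True restricts the calculation to the numeric columns,
# so the text column 'name' is left out of the result
col_means = df.mean(numeric_only = True)
print(col_means)
```

The result is a Pandas Series with one entry per numeric column.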
Series Syntax
You can also use the mean() technique on an independent Pandas Series.
The syntax to use the mean technique on a Series is very similar to the syntax for a dataframe.
To use mean() on a Series, simply type the name of the series, and then .mean() to call the method.
Just like for dataframes, when you use mean() on a Series, there are some additional parameters that you can use to modify the output. I’ll explain those in the parameters section.
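Here is a short sketch of the Series syntax, using a hypothetical Series of four numbers created just for this illustration.

```python
import pandas as pd

# Hypothetical standalone Series
s = pd.Series([2, 4, 6, 8])

# Calling .mean() on a Series returns a single number
series_mean = s.mean()
print(series_mean)  # 5.0
```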
Dataframe Column Syntax
Finally, let’s look at the syntax for using mean() on a single dataframe column.
Dataframe columns are actually Pandas Series objects, so the syntax for using Pandas mean on a column is a two-step process:
- retrieve the column using dot syntax
- call the mean() method
So for example, if you have a dataframe named your_dataframe, and the column you want to operate on is named column, you’ll use the code your_dataframe.column.mean(). That will compute the mean of that single column.
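The sketch below shows the column syntax on a hypothetical dataframe. It also shows bracket syntax, which is an alternative to dot syntax that this tutorial does not otherwise use; brackets are required when a column name contains a space or clashes with an existing dataframe attribute.

```python
import pandas as pd

# Hypothetical dataframe; the column names are made up for illustration
df = pd.DataFrame({
    'score': [10, 20, 30],
    'total score': [1, 2, 3],
})

# Dot syntax: retrieve the column, then call mean()
dot_mean = df.score.mean()

# Bracket syntax retrieves the same column
bracket_mean = df['score'].mean()

# A column name with a space can only be retrieved with brackets
spaced_mean = df['total score'].mean()

print(dot_mean, bracket_mean, spaced_mean)
```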
But again, there are some other optional parameters that you can use that will modify the output.
Let’s take a look at those parameters.
The parameters of Pandas mean
The mean technique has several parameters that you can use that will change how it operates.
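Having said that, the parameter I think you’re most likely to want to use is skipna.
The other parameters, like axis and numeric_only, are needed less often, although in Pandas 2.0 and later you’ll need numeric_only = True when a dataframe contains non-numeric columns. (The level parameter was removed in Pandas 2.0.)
That being the case, I’m only going to discuss skipna here.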
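For completeness, here is a brief sketch of what the axis parameter does, using a hypothetical dataframe of test scores. This is only an illustration; the rest of the tutorial does not rely on it.

```python
import pandas as pd

# Hypothetical all-numeric dataframe
scores = pd.DataFrame({
    'test_1': [80, 90],
    'test_2': [100, 70],
})

# Default (axis = 0): one mean per column
column_means = scores.mean()

# axis = 1: one mean per row
row_means = scores.mean(axis = 1)

print(column_means)
print(row_means)
```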
skipna
The skipna parameter enables you to “skip” the missing values when the mean is calculated.
By default, this is set to skipna = True, which causes the mean() method to exclude missing values or NaN values.
If you set skipna = False, the method will attempt to include the missing values. Beware though: if you do this, the resulting output may be NaN itself.
I’ll show you how to use skipna in example 3.
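To make the effect of skipna concrete, here is a small sketch on a hypothetical Series that contains one missing value.

```python
import pandas as pd
import numpy as np

# Hypothetical Series with one missing value
s = pd.Series([1.0, 2.0, np.nan, 3.0])

# Default behavior (skipna = True): the NaN is ignored,
# so the mean is (1 + 2 + 3) / 3
mean_skipping = s.mean()

# skipna = False: the NaN takes part in the calculation,
# so the result is NaN
mean_including = s.mean(skipna = False)

print(mean_skipping)   # 2.0
print(mean_including)  # nan
```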
Examples: how to calculate the mean on a Pandas dataframe or Pandas series
Now that we’ve looked at the syntax and parameters, let’s look at some examples of the Pandas mean method.
Examples:
- Calculate mean of a single dataframe column
- Use the mean method on an entire dataframe
- Include missing values
- Compute means, grouped by a categorical variable
Run this code first
Before you run the examples, you’ll need to run some preliminary code.
Specifically, you’ll need to:
- Import necessary packages
- Create a dataframe
Let’s do those.
Import Packages
First, we’ll import some packages.
import pandas as pd
import seaborn as sns
We’re importing Pandas, since the mean method is part of the Pandas package.
Additionally, we need to import Seaborn, because we’ll be working with a dataframe that’s contained in the Seaborn package.
Get the titanic dataframe
Next, let’s retrieve the titanic dataframe.
titanic = sns.load_dataset('titanic')
And let’s print it out:
print(titanic)
OUT:
     survived  pclass     sex   age  sibsp  parch     fare  embarked   class    who  adult_male  deck  embark_town  alive  alone
0           0       3    male  22.0      1      0   7.2500         S   Third    man        True   NaN  Southampton     no  False
1           1       1  female  38.0      1      0  71.2833         C   First  woman       False     C    Cherbourg    yes  False
2           1       3  female  26.0      0      0   7.9250         S   Third  woman       False   NaN  Southampton    yes   True
3           1       1  female  35.0      1      0  53.1000         S   First  woman       False     C  Southampton    yes  False
4           0       3    male  35.0      0      0   8.0500         S   Third    man        True   NaN  Southampton     no   True
..        ...     ...     ...   ...    ...    ...      ...       ...     ...    ...         ...   ...          ...    ...    ...
886         0       2    male  27.0      0      0  13.0000         S  Second    man        True   NaN  Southampton     no   True
887         1       1  female  19.0      0      0  30.0000         S   First  woman       False     B  Southampton    yes   True
888         0       3  female   NaN      1      2  23.4500         S   Third  woman       False   NaN  Southampton     no  False
889         1       1    male  26.0      0      0  30.0000         C   First    man        True     C    Cherbourg    yes   True
890         0       3    male  32.0      0      0   7.7500         Q   Third    man        True   NaN   Queenstown     no   True

[891 rows x 15 columns]
This dataframe has quite a few numeric columns for which we’ll be able to calculate the mean.
So now that we have our data, let’s look at some examples.
EXAMPLE 1: Calculate mean of a single dataframe column
First, let’s start by calculating the mean of a single dataframe column.
Here, we’ll calculate the mean of the age variable.
titanic.age.mean()
OUT:
29.69911764705882
Explanation
This is fairly simple, but let me explain.
Here, we’re using “dot syntax” to retrieve the age variable. We’re doing that with the code titanic.age.
But directly after that, we’re using .mean() to compute the mean.
Effectively, this retrieves the age variable from the titanic dataframe, and computes the mean on only that variable.
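If you want to check what mean() is doing under the hood, the following sketch reproduces the calculation by hand on a hypothetical handful of ages (not the full titanic data). With the default settings, the mean is the sum of the non-missing values divided by the number of non-missing values.

```python
import pandas as pd
import numpy as np

# Hypothetical ages, including one missing value
ages = pd.Series([22.0, 38.0, np.nan, 26.0])

# sum() skips NaN by default, and count() counts only non-missing values
manual_mean = ages.sum() / ages.count()

# This matches the result of mean() with its default skipna = True
print(manual_mean)
print(ages.mean())
```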
EXAMPLE 2: Use the mean method on an entire dataframe
Next, let’s use the mean technique on a whole dataframe.
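# In Pandas 2.0 and later, numeric_only = True is needed here,
# because titanic also contains text and categorical columns.
titanic.mean(numeric_only = True)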
OUT:
survived       0.383838
pclass         2.308642
age           29.699118
sibsp          0.523008
parch          0.381594
fare          32.204208
adult_male     0.602694
alone          0.602694
dtype: float64
Explanation
Calling mean() on the entire dataframe caused the method to compute the mean of every numeric variable, including boolean variables.
So for example, it calculated the mean of age (a floating point number).
It calculated the mean of survived (a 0/1 integer).
And it also calculated the mean of alone, which is a bool variable. When it operates on a boolean variable, it treats True as a 1 and False as a 0, then computes the mean.
Also notice that in the case of boolean data or 0/1 integers, the mean actually represents a proportion.
So for example, the mean of survived is 0.383838. According to the data, that’s the proportion of people who survived the sinking of the Titanic!
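Here is a tiny sketch of the “mean as a proportion” idea, using a hypothetical boolean Series rather than the titanic data.

```python
import pandas as pd

# Hypothetical boolean Series: 3 of the 4 values are True
alone = pd.Series([True, False, True, True])

# True counts as 1 and False as 0, so the mean is the share of True values
proportion_true = alone.mean()

print(proportion_true)  # 0.75
```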
EXAMPLE 3: Include missing values
Now, let’s include missing values.
By default, when we use mean(), the skipna parameter is set to skipna = True. This causes Pandas mean to ignore missing values.
We can turn that off by setting skipna = False.
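# numeric_only = True is needed in Pandas 2.0 and later,
# because titanic also contains text and categorical columns.
titanic.mean(skipna = False, numeric_only = True)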
OUT:
survived       0.383838
pclass         2.308642
age                 NaN
sibsp          0.523008
parch          0.381594
fare          32.204208
adult_male     0.602694
alone          0.602694
dtype: float64
Explanation
Notice in the output that the mean() technique has successfully calculated the mean for most of the variables.
But the mean of age is now NaN.
This is because the age variable contains missing values, which have now been included in the calculation.
When a variable has missing values like this, you may want to ignore them with skipna = True (the default). Or, you may want to fill in the missing values using the Pandas fillna technique.
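The two options are not equivalent, so here is a hedged sketch on a hypothetical Series of ages. The fill values below (0 and the column mean) are only illustrative choices, not recommendations; the point is that the value you fill with changes the resulting mean.

```python
import pandas as pd
import numpy as np

# Hypothetical ages with one missing value
ages = pd.Series([20.0, np.nan, 40.0])

# Option 1: ignore the missing value (default skipna = True)
mean_ignoring = ages.mean()

# Option 2a: fill the missing value with 0, which pulls the mean down
mean_zero_filled = ages.fillna(0).mean()

# Option 2b: fill with the mean of the observed values,
# which leaves the mean unchanged
mean_mean_filled = ages.fillna(ages.mean()).mean()

print(mean_ignoring)     # 30.0
print(mean_zero_filled)  # 20.0
print(mean_mean_filled)  # 30.0
```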
EXAMPLE 4: Compute means, grouped by a categorical variable
Finally, let’s compute grouped means.
Here, we’re going to calculate the “mean” of the survived variable by class.
(titanic
 .groupby(['class'])
 .survived
 .mean()
)
OUT:
class
First     0.629630
Second    0.472826
Third     0.242363
Name: survived, dtype: float64
Explanation
Here, we calculated the mean of survived, by class.
Doing this required multiple steps:
- group the data by class using groupby()
- retrieve the survived variable
- call the mean() method
Additionally, notice that we wrote this code on multiple separate lines. This makes the code much easier to read and debug. To do this, we enclosed the entire expression inside parentheses. This is an uncommon syntax style, but it’s extremely powerful for performing multi-step processing or analysis with Pandas.
In terms of output, this actually produced the survival rate by class. This shows a simple example of how we can use Pandas to do data analysis.
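If you don’t have Seaborn available, the same pattern can be sketched on a small hypothetical dataframe. The column name ticket_class and the values below are made up for illustration; only the group, retrieve, mean sequence mirrors the example above.

```python
import pandas as pd

# Hypothetical passenger data (not the real titanic dataset)
passengers = pd.DataFrame({
    'ticket_class': ['First', 'First', 'Third', 'Third', 'Third'],
    'survived': [1, 1, 0, 1, 0],
})

# Group by the categorical variable, retrieve the column, compute the mean.
# The surrounding parentheses let the chain span several lines.
survival_rate = (passengers
    .groupby('ticket_class')
    .survived
    .mean()
)

print(survival_rate)
```

Because survived is a 0/1 variable, each group mean is again a proportion: here, 2 of 2 for First and 1 of 3 for Third.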
Pandas is of course a great toolkit for data wrangling. But if you know how to use it properly, it’s an extremely powerful toolkit for analyzing data and “finding insights” in data.