Python / R News

Pandas Mean, Explained

by Joshua Ebner · August 24, 2021

This article is originally published at https://www.sharpsightlabs.com

In this tutorial, I’ll show you how to use the Pandas mean technique. The mean() technique calculates the mean of the the numeric values in a Pandas dataframe or Pandas series.

So in the tutorial, I’ll explain how we use the technique, how the syntax works, and I’ll show you step-by-step examples.

If you need something specific, just click on any of the following links.

Table of Contents:

Let’s start with an introduction to Pandas mean.

A quick introduction to Pandas Mean

The Pandas mean technique is a tool for data exploration and data analysis in Python.

We use the mean() technique to compute the mean of the values in a Pandas dataframe or Series.

It’s most common to use this tool on a single dataframe column, but the Pandas mean technique will work on:

entire Pandas dataframes
Pandas Series objects
individual dataframe columns

Again, the Pandas mean technique is most commonly used for data exploration and analysis. When we analyze data, it’s very common to examine summary statistics like mean, median, minimum, maximum, etc.

We sometimes do this for a whole variable, but there are also instances when we first group our data by a categorical variable, and then compute the mean by category. This is extremely common in data analysis, and extremely useful. I’ll show an example of a grouped mean in the examples section.

But before we look at examples of the technique, we first need to understand the syntax.

That being the case, let’s look at the syntax of the Pandas mean technique.

The syntax of Pandas mean

The syntax for the pandas mean technique depends on what type of object you’re using it on.

We can use mean() on:

dataframes
Series
individual dataframe columns

That being the case, we’ll look separately at the dataframe syntax, the Series syntax, and the syntax for using mean on a single dataframe column.

A quick note

For all of the following syntax explanations, I’ll assume that you already have Pandas installed, and that you have a dataframe that you can work with.

You can import Pandas with the following code:

import pandas as pd

If you need a quick review of Pandas dataframes, please read our introduction to dataframes in Python.

Dataframe Syntax

Let’s start with the syntax for how to use mean() on a dataframe.

When you use mean() on an entire dataframe, you simply type the name of the dataframe and then .mean() to call the method.

When you use mean() on a whole dataframe, it will attempt to operate on all of the columns by default. In practice though, the output typically includes only the means of numeric variables (int, float, and bool).

There are also some optional parameters that you can use, which will modify the output slightly. I’ll explain those in the parameters section.

Series Syntax

You can also use the mean() technique on an independent Pandas Series.

The syntax to use the mean technique on a Series is very similar to the syntax for a dataframe.

To use mean() on a Series, simply type the name of the series, and then .mean() to call the method.

Just like for dataframes, when you use mean() on a Series, there are some additional parameters that you can use to modify the output. I’ll explain those in the parameters section.

Dataframe Column Syntax

Finally, let’s look at the syntax for using mean() on a single dataframe column.

Dataframe columns are actually Pandas Series objects, so the syntax for using Pandas mean on a column is a two step process:

retrieve the column using dot syntax
call the mean() method

So for example, if you have a dataframe named your_dataframe, and the column you want to operate on is named column, you’ll use the code your_dataframe.column.mean(). That will compute the mean of that single column.

But again, there are some other optional parameters that you can use that will modify the output.

Let’s take a look at those parameters.

The parameters of Pandas mean

The mean technique has several parameters that you can use that will change how it operates.

Having said that, the only parameter I think you might want to use is skipna.

The other parameters, like axis, level, and numeric_only aren’t particularly useful for the mean() method.

That being the case, I’m only going to discuss skipna here.

`skipna`

The skipna parameter enables you to “skip” the missing values when the mean is calculated.

By default, this is set to skipna = True, which causes the mean() method to exclude missing values or NaN values.

If you set skipna = False, the method will attempt to include the missing values. Beware though: if you do this, the resulting output may be NaN itself.

I’ll show you how to use skipna in example 3.

Examples: how to calculate the mean on a Pandas dataframe or Pandas series

Now that we’ve looked at the syntax and parameters, let’s look at some examples of the Pandas mean method.

Examples:

Run this code first

Before you run the examples, you’ll need to run some preliminary code.

Specifically, you’ll need to:

Import necessary packages
Create a dataframe

Let’s do those.

Import Packages

First, we’ll import some packages.

import pandas as pd
import seaborn as sns

We’re importing Pandas, since the mean method is part of the Pandas package.

Additionally, we need to import Seaborn, because we’ll be working with a dataframe that’s contained in the Seaborn package.

Get the `titanic` dataframe

Next, let’s retrieve the titanic dataframe.

titanic = sns.load_dataset('titanic')

And let’s print it out:

print(titanic)

OUT:

     survived  pclass     sex   age  sibsp  parch     fare embarked   class       who  adult_male deck  embark_town alive  alone  
0           0       3    male  22.0      1      0   7.2500        S   Third      man        True  NaN  Southampton    no  False    
1           1       1  female  38.0      1      0  71.2833        C   First    woman       False    C    Cherbourg   yes  False     
2           1       3  female  26.0      0      0   7.9250        S   Third    woman       False  NaN  Southampton   yes   True  
3           1       1  female  35.0      1      0  53.1000        S   First    woman       False    C  Southampton   yes  False  
4           0       3    male  35.0      0      0   8.0500        S   Third      man        True  NaN  Southampton    no   True  
..        ...     ...     ...   ...    ...    ...      ...      ...     ...      ...         ...  ...          ...   ...    ...  
886         0       2    male  27.0      0      0  13.0000        S  Second      man        True  NaN  Southampton    no   True     
887         1       1  female  19.0      0      0  30.0000        S   First    woman       False    B  Southampton   yes   True  
888         0       3  female   NaN      1      2  23.4500        S   Third    woman       False  NaN  Southampton    no  False  
889         1       1    male  26.0      0      0  30.0000        C   First      man        True    C    Cherbourg   yes   True  
890         0       3    male  32.0      0      0   7.7500        Q   Third      man        True  NaN   Queenstown    no   True   

[891 rows x 15 columns]

This dataframe has quite a few numeric columns for which we’ll be able to calculate the mean.

So now that we have our data, let’s look at some examples.

EXAMPLE 1: Calculate mean of a single dataframe column

First, let’s start by calculating the mean of a single dataframe column.

Here, we’ll calculate the mean of the age variable.

titanic.age.mean()

OUT:

29.69911764705882

Explanation

This is fairly simple, but let me explain.

Here, we’re using “dot syntax” to retrieve the age variable. We’re doing that with the code titanic.age.

But directly after that, we’re using .mean() to compute the mean.

Effectively, this retrieves the age variable from the titanic dataframe, and computes the mean on only that variable.

EXAMPLE 2: Use the mean method on an entire dataframe

Next, let’s use the mean technique on a whole dataframe.

titanic.mean()

OUT:

survived       0.383838
pclass         2.308642
age           29.699118
sibsp          0.523008
parch          0.381594
fare          32.204208
adult_male     0.602694
alone          0.602694
dtype: float64

Explanation

Calling mean() on the entire dataframe caused the method to compute the mean of every numeric variable, including boolean variables.

So for example, it calculated the mean of age (a floating point number).

It calculated the mean of survived (a 0/1 integer).

And it also calculated the mean of alone, which is a bool variable. When it operates on a boolean variable, it treats True as a 1 and False as a 0, then computes the mean.

Also notice that in the case of boolean data or 0/1 integers, the mean actually represents a proportion.

So for example, the mean of survived is 0.383838. According to the data, that’s the proportion of people who survived the sinking of the Titanic!

EXAMPLE 3: Include missing values

Now, let’s include missing values.

By default, when we use mean(), the skipna parameter is set to skipna = True. This causes Pandas mean to ignore missing values.

We can turn that off by setting skipna = False.

titanic.mean(skipna = False)

OUT:

survived       0.383838
pclass         2.308642
age                 NaN
sibsp          0.523008
parch          0.381594
fare          32.204208
adult_male     0.602694
alone          0.602694
dtype: float64

Explanation

Notice in the output that the mean() technique has successfully calculated the mean for most of the variables.

But the mean of age is now NaN.

This is because the age variable contains missing values, which have now been included in the calculation.

When a variable has missing values like this, you may want to ignore them with skipna = False. Or, you may want to fill in the missing values using the Pandas fillna technique.

EXAMPLE 4: Compute means, grouped by a categorical variable

Finally, let’s compute grouped means.

Here, we’re going to calculate the “mean” of the survived variable by class.

(titanic
 .groupby(['class'])
 .survived
 .mean()
)

OUT:

class
First     0.629630
Second    0.472826
Third     0.242363
Name: survived, dtype: float64

Explanation

Here, we calculated the mean of survived, by class.

Doing this required multiple steps:

group the data by class using groupby()
retrieve the survived variable
call the mean() method

Additionally, notice that we wrote this code on multiple separate lines. This makes the code much easier to read and debug. To do this, we enclosed the entire expression inside of parenthesis. This is an uncommon syntax style, but it’s extremely powerful for performing multi-step processing or analysis with Pandas.

In terms of output, this actually produced the survival rate by class. This shows a simple example of how we can use Pandas to do data analysis.

Pandas is of course a great toolkit for data wrangling. But if you know how to use it properly, it’s an extremely powerful toolkit for analyzing data and “finding insights” in data.

Leave your other questions in the comments below

Do you have other questions about the Pandas mean technique?

Is there something that I haven’t covered here, that you’re still confused about?

If so, leave your question in the comments section below.

To learn more about Pandas, sign up for our email list

This tutorial should have helped you understand the Pandas mean technique, and how it works.

But to master data cleaning, data exploration, and data analysis with Pandas, there’s a lot more to learn.

And if you want to be great at data science, there’s even more to learn beyond Pandas.

That said, if you’re ready to learn more about Pandas and data science in Python, then sign up for our email list.

When you sign up, you’ll get free tutorials on:

Base Python
Pandas
NumPy
Machine learning
Deep learning
Scikit learn
… and more.

We publish free data science tutorials every week. When you sign up for our email list, we’ll deliver those tutorials directly to your inbox.

Thanks for visiting r-craft.org
This article is originally published at https://www.sharpsightlabs.com
Please visit source website for post related comments.

Cookie	Duration	Description
cookielawinfo-checkbox-analytics	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
cookielawinfo-checkbox-functional	11 months	The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
cookielawinfo-checkbox-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-others	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
cookielawinfo-checkbox-performance	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
viewed_cookie_policy	11 months	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.

Pandas Mean, Explained

Sign up for FREE data science tutorials

You may also like...

Categories

Pandas Mean, Explained

A quick introduction to Pandas Mean

The syntax of Pandas mean

A quick note

Dataframe Syntax

Series Syntax

Dataframe Column Syntax

The parameters of Pandas mean

skipna

Examples: how to calculate the mean on a Pandas dataframe or Pandas series

Run this code first

Import Packages

Get the titanic dataframe

EXAMPLE 1: Calculate mean of a single dataframe column

Explanation

EXAMPLE 2: Use the mean method on an entire dataframe

Explanation

EXAMPLE 3: Include missing values

Explanation

EXAMPLE 4: Compute means, grouped by a categorical variable

Explanation

Leave your other questions in the comments below

To learn more about Pandas, sign up for our email list

Sign up for FREE data science tutorials

Check your email inbox to confirm your subscription ...

You may also like...

More Fun with Choropleth Maps

Local randomness in R

Data Analysis for the Life Sciences with R

Categories

`skipna`

Get the `titanic` dataframe