Python / R News

Pandas Count, Explained

by Joshua Ebner · August 10, 2021

This article is originally published at https://www.sharpsightlabs.com

In this tutorial, I’ll show you how to use the Pandas count technique to count the records in a Pandas dataframe.

I’ll explain exactly what the technique does, how the syntax works, and I’ll show you step-by-step examples so you can see Pandas count in action.

If you need something specific, just click on any of the following links.

Table of Contents:

Ok. Let’s start with a quick introduction.

A quick introduction to Pandas Count

The Pandas count function is pretty simple. The count() technique counts the number of non-missing records in a Pandas object.

This method works on:

Pandas dataframes
Pandas Series objects
individual dataframe columns

Frequently, we use the count() technique for data exploration.

But it’s also very useful for data cleaning and data analysis. For example, there are situations where missing values are bad, so we sometimes need to identify variables that contain non-missing values. The Pandas count technique is one way to identify columns that contain a large number of missing values.

Having said all of that, how exactly the method works depends on the syntax.

With that in mind, let’s look at the syntax of the Pandas count method.

The syntax of Pandas count

The syntax of the count method is fairly simple, but there are a few ways to use it and a few parameters that can modify its functionality.

So in this section, I’ll cover how to use the count method on dataframes and dataframe columns. I’ll also explain the most useful parameters.

A quick note

Everything that I’m about to explain assumes that you’ve imported Pandas and that you already have a dataframe that you’re working with.

You can import pandas with the following code:

import pandas as pd

And if you need a refresher on dataframes, you can read our introduction to Pandas dataframes.

Dataframe Syntax

Let’s start with how to use the count technique on dataframes.

To call the count method with a dataframe, you simply type the name of the dataframe, and then .count().

So if your dataframe is named your_dataframe, you can use the code your_dataframe.count() to count the number of non-missing values in each of the columns.

There are also some additional parameters that you can use inside the parenthesis, which we’ll get to in a moment.

Series Syntax

Next, let’s look at the syntax for how to use Pandas count on a Series.

The syntax for a Series is very similar to the syntax for a dataframe.

Simply type the name of your series, then .count() to call the method.

Again, there are some additional parameters that you can call that will modify the technique.

Dataframe Column Syntax

Finally, we have the syntax for a dataframe column.

Remember that a dataframe column is actually a Pandas series. Additionally, we can retrieve a column from a dataframe using so-called “dot syntax.”

So the syntax for using count() on a column is a two step process:

retrieve the column using dot syntax
call the count method

So if you have a dataframe named your_dataframe, and a column named column, you’ll use the code your_dataframe.column.count() to use the count technique on that one column.

I’ll show you an example of this in example 4.

Again though, there are some optional parameters that control how the technique works. Let’s look at those parameters.

The parameters of Pandas count

There are two parameters you should know for the count method:

axis
numeric_only

Let’s talk about each of these.

`axis` (optional)

The axis parameter controls whether count() operates on the rows or the columns.

By default, this parameter is set to axis = 0, so count() counts the number of non-missing values in the axis-0 direction. This is effectively the column count.

You can change this and set the parameter to axis = 1. This will compute the non-missing values in the axis-1 direction (the row counts).

You can also use an alternative notation, with axis = "columns" or axis = "rows". I strongly discourage you from using this notation, because it’s highly confusing. I explain why in the FAQ section.

One final comment on the axis parameter: to understand this parameter, you really need to understand axes. For an explanation of how axes work, you should read our tutorial on Numpy axes (Numpy axes are very similar to dataframe axes).

`numeric_only` (optional)

The numeric_only parameter enables you to force the count method to only return counts for numeric variables.

By default, this is set to numeric_only = False, so the count method returns the counts for all of the variables.

But if you set numeric_only = True, the count method will return the counts for the numeric variables only (integers, floats, etc).

I show an example of this in example 3.

Examples: how to count records in a Pandas dataframe or Pandas series

Now that we’ve looked at the syntax, let’s look at some examples of how to use the Pandas count technique.

We’ll look at examples of how to count the records in a dataframe, how to count the records in a single column, and a few other uses.

Examples:

Run this code first

Before you run these examples, you’ll need to run some preliminary code.

In particular, you need to:

Import necessary packages
Load a dataframe

Let’s do those one at a time.

Import Packages

First, let’s just import a couple packages.

We need to import Pandas (because the count() method is part of Pandas).

import pandas as pd
import seaborn as sns

We also need to import Seaborn, because we’ll be working with the titanic dataframe, which is included in the Seaborn package.

Let’s load that dataset next.

Load dataframe

Here, we’ll load the titanic dataframe.

titanic = sns.load_dataset('titanic')

Let’s also print it out, so we can see the contents:

print(titanic)

OUT:

     survived  pclass     sex   age  sibsp  parch     fare embarked   class       who  adult_male deck  embark_town alive  alone  
0           0       3    male  22.0      1      0   7.2500        S   Third      man        True  NaN  Southampton    no  False    
1           1       1  female  38.0      1      0  71.2833        C   First    woman       False    C    Cherbourg   yes  False     
2           1       3  female  26.0      0      0   7.9250        S   Third    woman       False  NaN  Southampton   yes   True  
3           1       1  female  35.0      1      0  53.1000        S   First    woman       False    C  Southampton   yes  False  
4           0       3    male  35.0      0      0   8.0500        S   Third      man        True  NaN  Southampton    no   True  
..        ...     ...     ...   ...    ...    ...      ...      ...     ...      ...         ...  ...          ...   ...    ...  
886         0       2    male  27.0      0      0  13.0000        S  Second      man        True  NaN  Southampton    no   True     
887         1       1  female  19.0      0      0  30.0000        S   First    woman       False    B  Southampton   yes   True  
888         0       3  female   NaN      1      2  23.4500        S   Third    woman       False  NaN  Southampton    no  False  
889         1       1    male  26.0      0      0  30.0000        C   First      man        True    C    Cherbourg   yes   True  
890         0       3    male  32.0      0      0   7.7500        Q   Third      man        True  NaN   Queenstown    no   True   

[891 rows x 15 columns]

This dataframe has 15 columns. If you look carefully at the output above, you’ll actually see some NaN values. We’ll be able to count the non-NaN values with count().

EXAMPLE 1: Count the records in all columns of a dataframe

First, we’ll count the number of non-null records for every column in our dataframe.

This is the simplest way to use the count method on a dataframe.

Let’s take a look, and then I’ll explain:

titanic.count()

OUT:

survived       891
pclass         891
sex            891
age            714
sibsp          891
parch          891
fare           891
embarked       889
class          891
who            891
adult_male     891
deck           203
embark_town    889
alive          891
alone          891
dtype: int64

Explanation

This is really simple.

Here, we’re using count() on the entire titanic dataframe.

To do this, we simply typed the name of the dataframe, titanic, and then .count().

In the output, we can see the number of non-missing records for every column.

We know that the dataframe has 891 total rows (we saw this when we printed out the data).

And here, we can see that many of the variables – like survived, pclass, and class – have 891 values. These variables are fully populated.

But we also see that some values have less than 891. For example, deck has only 203 non-missing records. age has 714.

This could be useful information during data cleaning. It could also be useful if you’re building a machine learning model, since some model types will not tolerate missing values.

EXAMPLE 2: Count the number of non-missing values in the rows

Next, let’s count the number of non-missing values in each of the rows.

Typically, I use the count() technique to count the non-missing values for the columns. But there might be times when you need to examine the rows instead.

To do this, we need to use the axis parameter.

Let’s take a look:

titanic.count(axis = 1)

OUT:

0      14
1      15
2      14
3      15
4      14
       ..
886    14
887    15
888    13
889    15
890    14
Length: 891, dtype: int64

Explanation

Here, we can see the number of non-missing values for the rows.

Now remember: we know from our earlier data examination that the dataframe has 15 columns. So a fully populated row should have 15 non-missing values.

But we can see that several of the rows displayed have 13 or 14 non-missing values. In fact, the first row has only 14 values. That means that some of these rows have missing values. That might be okay, but maybe not, depending on what you’re doing.

In terms of syntax, notice that we needed to set axis = 1 to count the number of missing values in the rows. Understanding Pandas “axes” is difficult, but it would definitely help if you reviewed Numpy axes. Pandas axes are essentially the same as axes for a 2D Numpy array.

Note that you can also do the same thing if you set axis = 'columns'. axis = 'columns' is the same as axis = 1. Having said that, I strongly discourage this notation, because it’s extremely confusing. Setting axis = 'columns' actually gives you the number of non-missing values for the rows. There
s a reason why the Pandas developers named it this way, but it only makes sense if you really understand what axes are. Again, I strongly suggest you avoid this alternate notation, and simply use axis = 1. I explain this more in the FAQ section.

EXAMPLE 3: Get counts for numeric variables only

Next, let’s get the counts for only the numeric variables.

To do this, we can set numeric_only = True.

titanic.count(numeric_only = True)

OUT:

survived      891
pclass        891
age           714
sibsp         891
parch         891
fare          891
adult_male    891
alone         891
dtype: int64

Explanation

Here, by setting numeric_only = True, the count() technique is computing the number of non-missing values for the numeric columns only.

EXAMPLE 4: Count the non-missing values in a specific column

Here, let’s count the non-missing values of a specific column.

In particular, we’ll count the number of non-missing values in the deck variable.

titanic.deck.count()

OUT:

Explanation

To get the number of non-missing values in a single column, we need to use a two step process:

use “dot syntax” to retrieve a specific column
call the .count() method

So the syntax titanic.deck retrieves the deck variable from the titanic dataframe.

Then, by using .count(), Python will count the non-missing records for only that column.

EXAMPLE 5: Count records in a subset of columns

Finally, let’ me show you a “special” technique.

Here, we’ll count the non-missing records in a subset of a few columns.

To do this, we need to use some syntax that you’re unlikely to see elsewhere.

Let’s take a look, and then I’ll explain.

(titanic
 .filter(['survived','age','embark_town'])
 .count()
)

OUT:

survived       891
age            714
embark_town    889
dtype: int64

Explanation

Here, we counted the non-missing records in three variables: survived, age, and embark_town.

To do this, we actually needed to call two different Pandas methods.

First, we called the filter method, to retrieve a subset of dataframe columns.

After that, we called the count() method.

And this computes the non-missing records for only the three chosen variables in our subset.

Also, notice that all of the functions are on separate lines. To make this work, we need to enclose the whole expression inside of parenthesis.

This style of Pandas coding is atypical, but it can be very useful when you’re doing data cleaning, data exploration, or data analysis.

It’s actually a very powerful Pandas technique that you should learn more about (leave your questions about it in the comments section at the bottom of the page).

Frequently asked questions about KEYWORD

Now that you’ve seen some examples of the Pandas count technique, let’s look at a common question.

Frequently asked questions:

Why does axis = 'columns' count the non-missing values in the rows?

Question 1: Why does `axis = 'columns'` count the non-missing values in the rows?

If you set axis = 'columns', you’ll notice that it actually counts the non-missing values for the rows.

WTF?

Ok. To understand this, you need to understand axes, and how most people think about axes.

Let’s quickly review.

Pandas dataframes (like Numpy arrays) have axes. Axes are like directions along the dataframe.

An analogy here is the 3-dimensional Cartesian coordinate system. There’s an x-axis, y-axis, and z-axis. Those are directions in 3D space.

Dataframes also have axes. Axes are like directions along the dataframe.

For a dataframe, axis-0 points downward, and axis-1 points horizontally.

I won’t explain the reasons here about why the axes are numbered like this. It’s just something you need to memorize: axis-0 points downward, and axis-1 points horizontally.

Having said that, people commonly think of axis-1 as the “columns” axis. Why? Because when we visualize it like in the image above, we typically show an arrow pointing horizontally across the top of the columns. So people think of axis-1 as the “columns” axis.

But that’s really foolish, because when you use axis-1 for an operation, such as .count(axis = 1), it actually computes the row counts.

I honestly think this is a misunderstanding of how people think about axes, and using terminology in a counter-intuitive way.

All of that being the case, I strongly suggest that you avoid the notation count(axis = "columns") or count(axis = "rows").

Instead, use axis = 1 or axis = 0.

For more information about axes, read our tutorial on Numpy axes. The details about 2D Numpy arrays apply to Pandas dataframes.

Leave your other questions in the comments below

Do you have other questions about the Pandas count technique?

Is there something you’re still confused about, that I haven’t covered here?

If so, leave your question in the comments section below.

To learn more about Pandas, sign up for our email list

This tutorial should have helped you understand the Pandas count technique, and how it works.

But if you want to master data wrangling and data exploration with Pandas, there’s a lot more to learn.

There’s even more to learn if you want to learn data science in Python.

That said, if you’re ready to learn more about Pandas and data science in Python, then sign up for our email list.

When you sign up, you’ll get free tutorials on:

NumPy
Pandas
Base Python
Scikit learn
Machine learning
Deep learning
… and more.

We publish free data science tutorials every week. When you sign up for our email list, we’ll deliver these free tutorials directly to your inbox.

Thanks for visiting r-craft.org
This article is originally published at https://www.sharpsightlabs.com
Please visit source website for post related comments.

Cookie	Duration	Description
cookielawinfo-checkbox-analytics	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
cookielawinfo-checkbox-functional	11 months	The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
cookielawinfo-checkbox-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-others	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
cookielawinfo-checkbox-performance	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
viewed_cookie_policy	11 months	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.

Pandas Count, Explained

Sign up for FREE data science tutorials

You may also like...

Categories

Pandas Count, Explained

A quick introduction to Pandas Count

The syntax of Pandas count

A quick note

Dataframe Syntax

Series Syntax

Dataframe Column Syntax

The parameters of Pandas count

axis (optional)

numeric_only (optional)

Examples: how to count records in a Pandas dataframe or Pandas series

Run this code first

Import Packages

Load dataframe

EXAMPLE 1: Count the records in all columns of a dataframe

Explanation

EXAMPLE 2: Count the number of non-missing values in the rows

Explanation

EXAMPLE 3: Get counts for numeric variables only

Explanation

EXAMPLE 4: Count the non-missing values in a specific column

Explanation

EXAMPLE 5: Count records in a subset of columns

Explanation

Frequently asked questions about KEYWORD

Question 1: Why does axis = 'columns' count the non-missing values in the rows?

Leave your other questions in the comments below

To learn more about Pandas, sign up for our email list

Sign up for FREE data science tutorials

Check your email inbox to confirm your subscription ...

You may also like...

9th MilanoR Meeting – Dataviz with R | Photos and presentations!

Installing our R development environment on Ubuntu 20.04

GPT is Not a Shortcut to Learning Data Science

Categories

`axis` (optional)

`numeric_only` (optional)

Question 1: Why does `axis = 'columns'` count the non-missing values in the rows?