
How to make a Seaborn histogram with the distplot function

by Sharp Sight · February 4, 2020

This article was originally published at https://www.sharpsightlabs.com

This tutorial will show you how to make Seaborn histograms and density plots using the distplot function.

It will explain the syntax and also show you clear, step-by-step examples of how to use sns.distplot.

The tutorial is divided into several sections: an introduction to histograms and density plots, the syntax of sns.distplot, worked examples, and frequently asked questions. You can skip ahead to the section you need.

That said, if you’re new to data visualization in Python or new to using Seaborn, I recommend that you read the entire tutorial.

A quick introduction to histograms and distplots

When we’re doing data science, one of the most common tasks is visualizing data distributions.

Frequently, we want to understand how our data are distributed as part of exploratory data analysis.

Sometimes we explore data to find out how it’s structured (i.e., when we first get a dataset).

Other times, we need to explore data distributions to answer a question or validate some hypothesis about the data.

Examining data distributions is also very common in machine learning, since many machine learning techniques assume that the data are distributed in particular ways.

There are two primary ways to examine data distributions: the histogram and the density plot.

Histograms

Histograms are arguably the most common tool for examining data distributions.

In a typical histogram, we map a numeric variable to the x axis.

The x axis is then divided up into a number of “bins” … for example, there might be a bin from 10 to 20, the next bin from 20 to 30, the next from 30 to 40, and so on.

When we create a histogram (or use software to create a histogram) we count the number of observations in each bin.

Then we plot a bar for each bin. The length of the bar corresponds to the number of records that are within that bin on the x-axis.

Ultimately, a histogram contains a group of bars that show the “height” of the data (i.e., the count of the data) for different values of our numeric variable.

The histogram shows us how a variable is distributed.
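To make the binning step concrete, here’s a minimal sketch that counts the observations in each bin with NumPy’s np.histogram function. The values and bin edges are made up for illustration; they are not the dataset we use later in the tutorial.

```python
import numpy as np

# Hypothetical observations of a numeric variable
values = np.array([12, 15, 21, 22, 28, 33, 37, 38, 39, 45])

# Bins from 10 to 20, 20 to 30, 30 to 40, and 40 to 50
bin_edges = np.array([10, 20, 30, 40, 50])

# Count how many observations fall into each bin
counts, edges = np.histogram(values, bins = bin_edges)

# Print one line per bin: the bin range and the number of records in it
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left}-{right}: {count}")
```

Each count is the height of the bar that a histogram would draw for that bin.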

Density plots

The other primary tool for evaluating data distributions is the density plot.

There are a variety of methods for creating density plots, but one of the most common is called “kernel density estimation.” The plot that we generate when we use kernel density estimation is called a “kernel density estimation plot.” These are also known as “KDE plots” for short.

KDE plots (i.e., density plots) are very similar to histograms in terms of how we use them. We use density plots to evaluate how a numeric variable is distributed.

The main differences are that KDE plots use a smooth line to show distribution, whereas histograms use bars. So KDE plots show density, whereas histograms show count.
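If you’re curious what kernel density estimation does under the hood, here’s a rough sketch in plain NumPy. It places a small Gaussian “bump” on every observation and averages the bumps to get a smooth curve. The function name gaussian_kde_sketch and the bandwidth of 1.0 are my own choices for illustration; Seaborn picks its bandwidth differently, so don’t expect this to match its output exactly.

```python
import numpy as np

def gaussian_kde_sketch(samples, grid, bandwidth):
    """Evaluate a simple Gaussian kernel density estimate on a grid of x values."""
    samples = np.asarray(samples, dtype = float)
    grid = np.asarray(grid, dtype = float)
    # Standardized distance between every grid point and every sample
    z = (grid[:, None] - samples[None, :]) / bandwidth
    # Standard normal kernel evaluated at each distance
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    # Average the kernels and rescale by the bandwidth to get a density
    return kernels.sum(axis = 1) / (len(samples) * bandwidth)

rng = np.random.default_rng(0)
samples = rng.normal(loc = 85, scale = 3, size = 300)  # made-up data
grid = np.linspace(70, 100, 301)

density = gaussian_kde_sketch(samples, grid, bandwidth = 1.0)

# A density curve should enclose an area of roughly 1
area = np.sum(density) * (grid[1] - grid[0])
print(round(area, 3))
```

Notice that the y values here are densities, not counts, which is why the area under the curve is about 1.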

Now that I’ve explained histograms and KDE plots generally, let’s talk about them in the context of Seaborn.

Histograms and density plots in Seaborn

Seaborn has two different functions for visualizing univariate data distributions – seaborn.kdeplot() and seaborn.distplot().

In this tutorial, we’re really going to talk about the distplot function.

The Seaborn distplot function creates histograms and KDE plots

Technically, at the time of writing, Seaborn does not have its own dedicated function to create histograms.

Instead, it has the seaborn.distplot() function.

The distplot function creates a combined plot that contains both a KDE plot and a histogram.

At least, that’s the default behavior.

You can use the distplot function to create a chart with only a histogram or only a KDE plot.

I’ll show you how to do both in the examples section, but first you need to understand the syntax.

That being the case, let’s take a look at the syntax of the seaborn.distplot function.

The syntax of sns.distplot

The technical name of the function is seaborn.distplot, but it’s a very common convention to call the function with the code sns.distplot. That’s the convention we’ll use going forward: we’re going to call the function as sns.distplot().

(Remember, to use the sns. prefix, you need to import Seaborn with the code import seaborn as sns.)

A simple example of Seaborn distplot syntax

In the simplest version of the syntax, you just call the function sns.distplot() and provide the name of your data (for example, a pandas Series, a 1-D NumPy array, or a list) inside the parentheses.

This will create a simple combined histogram/KDE plot.

However, the function can be used in more complex ways, if you use some extra parameters.

Let’s take a look at a few important parameters of the sns.distplot function.

The parameters of sns.distplot

The sns.distplot function has about a dozen parameters that you can use. However, you won’t need most of them.

That being the case, we’re going to focus on a few of the most common parameters for sns.distplot:

color
kde
hist
bins

Let’s take a closer look at each of them.

`color`

The color parameter does what it sounds like: it changes the color of the KDE plot and the histogram.

You can use a “named” color that matplotlib recognizes, like red, green, blue, darkred, etc.

You can also use hexadecimal colors. Hex colors are beyond the scope of this post. They’re fairly easy once you get the hang of them, but in the interest of simplicity I’m not going to explain them here.
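If you just want to see what a hex color looks like, here’s a tiny sketch. It assumes matplotlib is installed (Seaborn is built on top of it) and uses the to_hex function from matplotlib.colors to look up the hex string for a named color.

```python
from matplotlib.colors import to_hex

# Look up the hex string that corresponds to the named color 'navy'
navy_hex = to_hex('navy')
print(navy_hex)  # prints #000080
```

A string like this can be passed to the color parameter in place of a named color, as in color = '#000080'.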

`kde`

The kde parameter enables you to turn the KDE plot on and off in the output.

This parameter accepts a boolean value as an argument (i.e., True or False).

By default, the kde parameter is set to kde = True. That means that by default, the sns.distplot function will include a kernel density estimate of your input variable.

If you manually set kde = False, then the function will remove the KDE plot.

`hist`

The hist parameter controls whether or not a histogram will appear in the output.

This parameter accepts a boolean input.

By default, it is set to hist = True, which means that by default, the output plot will include a histogram of the input variable.

If you set hist = False, the function will remove the histogram from the output.

`bins`

The bins parameter enables you to control the number of bins in the output histogram.

If you do not set a value for the bins parameter, the function will automatically compute an appropriate number of bins.
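This tutorial doesn’t go into which rule is used to pick the number of bins automatically. To give you a feel for what an automatic rule looks like, the sketch below uses the Freedman-Diaconis rule as implemented by NumPy’s np.histogram_bin_edges function. Treat this as an illustration of one common rule, not as a description of Seaborn’s internals. The data are made up with the same settings we use later in the tutorial.

```python
import numpy as np

np.random.seed(42)
data = np.random.normal(size = 300, loc = 85, scale = 3)

# Let a reference rule (Freedman-Diaconis) choose the bin edges
fd_edges = np.histogram_bin_edges(data, bins = 'fd')
fd_bin_count = len(fd_edges) - 1

# Alternatively, ask for an explicit number of equal-width bins
fixed_edges = np.histogram_bin_edges(data, bins = 20)
fixed_bin_count = len(fixed_edges) - 1

print(fd_bin_count, fixed_bin_count)
```

Either way, the bins span the full range of the data, from the smallest value to the largest.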

Now that you’ve learned about the syntax and parameters of sns.distplot, let’s take a look at some concrete examples.

Examples: how to visualize distributions with seaborn

Here, we’re going to take a look at several examples of the distplot function.

That will include creating a combination histogram/KDE, as well as individual histograms or KDE plots (without the other).


Run this code first

Before you run any of the code for these examples, you’ll need to run some preliminary code.

Specifically, you’ll need to import a few packages, set the plot background formatting, and create a dataset.

Import packages

First, you need to import two packages, NumPy and Seaborn.

We’ll use NumPy to create a normally distributed dataset that we can plot, and we’ll obviously need Seaborn in order to use the distplot function.

import numpy as np
import seaborn as sns

Set formatting

We’ll also set the chart formatting using the sns.set_style() function.

Depending on your Python settings, Seaborn charts can have the same format as matplotlib charts. Frankly, the matplotlib formatting is a little ugly. Seaborn gives us some better options.

The two options I like best are darkgrid and dark. I frequently use darkgrid for other Seaborn charts, but I prefer dark when I use distplot. That’s because the lines and histogram bars from distplot are a little transparent, and the gridlines from darkgrid tend to distract from the plot.

#sns.set_style('darkgrid')
sns.set_style('dark')

Create dataset

Here, we’re going to create a simple, normally distributed NumPy array.

We’ll create this array by using the np.random.normal function.

np.random.seed(42)
normal_data = np.random.normal(size = 300, loc = 85, scale = 3)

Using the loc parameter and scale parameter, we’ve created this data to have a mean of 85, and a standard deviation of 3.

We’ll be able to see some of these details when we plot it with the sns.distplot() function.
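Before plotting, you can sanity-check the data numerically. This short sketch recreates the array with the same seed and prints the sample mean and sample standard deviation. With only 300 random draws, expect values close to 85 and 3, but not exactly equal to them.

```python
import numpy as np

np.random.seed(42)
normal_data = np.random.normal(size = 300, loc = 85, scale = 3)

# Summary statistics of the simulated data
sample_mean = normal_data.mean()
sample_sd = normal_data.std(ddof = 1)  # ddof = 1 gives the sample standard deviation

print(round(sample_mean, 2), round(sample_sd, 2))
```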

EXAMPLE 1: How to create a Seaborn distplot

First, we’re going to create a distplot with Seaborn.

Remember that by default, the sns.distplot function includes both a histogram and a KDE plot.

Let’s just run the code and take a look at the output.

Here’s the code:

sns.distplot(normal_data)

And here’s the output.

Overall, the distplot shows us how the data are distributed. Remember that when we created the data, we created it to have a mean of 85 and a standard deviation of 3.

Although the standard deviation is a little difficult to see precisely from the plot, the plot certainly shows that the mean of the data is roughly around 85.

The histogram part of the plot gives us a slightly granular view of how the data are distributed. We can roughly see the relative counts within each “bin” of the x axis.

The KDE line (the smooth line) smooths over some of the rough details and provides a smooth distribution line that we can examine.

I don’t want to get too deep into the weeds concerning how we can use this plot for data analysis; that’s beyond the scope of the post.

The ultimate point is that this is fairly easy to create. We simply call the function and provide the name of the variable that we want to plot inside of the parentheses.

EXAMPLE 2: Change the color

Next, we’re going to change the color of the plot.

By default, the color is a sort of medium blue color.

Here, we’re going to change the color to “navy.” To do this, we’ll set the color parameter to color = 'navy'.

sns.distplot(normal_data, color = 'navy')

OUT:

Notice in this chart that the color has been changed to a darker shade of blue.

Also notice, however, that although the KDE line is a dark navy color, the histogram is still a little light.

That’s because the histogram is set to be slightly transparent. Technically, the histogram is colored navy, but it’s just a little transparent.

EXAMPLE 3: How to create a Seaborn histogram

Now, let’s create a Seaborn histogram.

To do this, we’re going to call the distplot function and we’re going to remove the KDE line by setting the kde parameter to kde = False.

sns.distplot(normal_data, kde = False)

Here’s the output:

This is pretty straightforward. By setting kde = False, we’re telling the sns.distplot function to remove the KDE line. This leaves only the histogram in its place.

At this point, I think I should comment. I think that it’s debatable whether or not you should create a pure Seaborn histogram without the KDE line.

When I first started using the distplot function, I wanted to create histograms in Seaborn (without the KDE line).

After using it for a while, I actually prefer the distplot that contains both the histogram and the KDE line.

Try them out and see which you prefer.

EXAMPLE 4: Change the number of bins in a Seaborn histogram

Let’s quickly change the number of bins in the histogram.

Here, we’re still going to remove the KDE line in the plot, and we’ll create the underlying histogram with 50 bins.

sns.distplot(normal_data, kde = False, bins = 50)

OUT:

Here, we’ve simply created a Seaborn histogram with 50 bins.

The increased number of bins shows more granularity in the data distribution.

Seeing an increased number of bins can actually help when there’s a lot of variation at small scales or when we’re looking for unusual features in the data distribution (like a spike in a particular location).

Having said that, as an analyst or data scientist, you need to learn when to use a large number of bins, and when to use a small number.

There’s a bit of an art to choosing the right number of bins, and it takes practice.
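One way to build intuition is to look at the raw bin counts at two different resolutions. Here’s a small sketch that uses NumPy’s np.histogram (not Seaborn) to bin the same data into 10 bins and into 50 bins. The choice of 10 for the coarse version is arbitrary.

```python
import numpy as np

np.random.seed(42)
normal_data = np.random.normal(size = 300, loc = 85, scale = 3)

# Bin the same data at a coarse and a fine resolution
coarse_counts, _ = np.histogram(normal_data, bins = 10)
fine_counts, _ = np.histogram(normal_data, bins = 50)

# Every observation lands in exactly one bin, so both totals equal 300
print(coarse_counts.sum(), fine_counts.sum())

# With more bins, each bin holds fewer observations, so the counts get noisier
print(coarse_counts.max(), fine_counts.max())
```

With 50 bins and only 300 observations, many bins contain just a handful of records, which is part of why fine-grained histograms of small datasets can look jagged.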

EXAMPLE 5: Plot a kernel density estimate with Seaborn

Finally, let’s just plot a KDE line without the underlying histogram.

We can do this by calling the distplot function and setting the hist parameter to hist = False.

sns.distplot(normal_data, hist = False)

OUT:

Although I think it can be useful to have the combined KDE/histogram plot, I also like the lone KDE line, as seen here.

I think that this would be particularly useful if you had a large number of variables that you needed to plot (perhaps inside of a small multiple chart).

If you needed to plot a dozen or more distributions, for example, it might be better just to see the KDE line. If you’re plotting a large number of variables, a pure KDE line might be less distracting and easier to read at a glance.

That said, I think there’s an element of preference here as well. Play around with these and see which options you like best.

Frequently asked questions about Seaborn histograms and distplots

Now that you’ve learned about Seaborn histograms and distplots and seen some examples, let’s review some frequently asked questions.


Question 1: How can I make the histogram more opaque?

You’ve probably noticed that by default, the histogram in the distplot is a little transparent.

That’s the default setting.

How do you make it more opaque?

You actually need to use a parameter from matplotlib (the alpha parameter). Moreover, you need to call this in a special way. You need to use the hist_kws parameter from sns.distplot to access the underlying matplotlib parameter.

Here’s some code that shows how:

sns.distplot(normal_data, kde = False, hist_kws = {"alpha": 1})

OUT:

Here, the code hist_kws = {"alpha": 1} is accessing the alpha parameter from matplotlib, and setting alpha equal to 1.

Notice that the output histogram is fully opaque.

Question 2: What’s the difference between distplot and kdeplot?

Seaborn actually has two functions to plot the distribution of a variable: sns.distplot and sns.kdeplot.

What’s the difference?

They are almost the same.

The KDE line in a distplot is exactly the same as the KDE line from sns.kdeplot. The only difference is that sns.distplot includes a histogram.

If you call sns.distplot(my_var, hist = False), then the output will be identical to sns.kdeplot(my_var).

Leave your other questions in the comments below

Do you have other questions about using the sns.distplot function to create a Seaborn histogram, or a visualization of a distribution?

Leave your question in the comments section at the bottom of the page.

Join our course to learn more about Seaborn

The examples you’ve seen in this tutorial should be enough to get you started, but if you’re serious about learning Seaborn, you should enroll in our premium course called Seaborn Mastery.

There’s a lot more to learn about Seaborn, and Seaborn Mastery will teach you everything, including:

How to create essential data visualizations in Python
How to add titles and axis labels
Techniques for formatting your charts
How to create multi-variate visualizations
How to think about data visualization in Python
and more …

Moreover, it will help you completely master the syntax within a few weeks. You’ll discover how to become “fluent” in writing Seaborn code.

Find out more here:

Learn More About Seaborn Mastery

The post How to make a Seaborn histogram with the distplot function appeared first on Sharp Sight.

