How to make a Seaborn histogram with the distplot function
This article is originally published at https://www.sharpsightlabs.com
This tutorial will show you how to make a Seaborn histogram and density plots using the distplot function.
It will explain the syntax and also show you clear, step-by-step examples of how to use sns.distplot.
The tutorial is divided up into several different sections. You can click on one of the following links to go to the appropriate section.
Table of Contents:
- A quick introduction to histograms and distplots
- A review of histograms and density plots in Seaborn
- The syntax of sns.distplot()
- Examples of how to use sns.distplot
- Frequently asked questions about Seaborn histograms and Seaborn distplots
That said, if you’re new to data visualization in Python or new to using Seaborn, I recommend that you read the entire tutorial.
A quick introduction to histograms and distplots
When we’re doing data science, one of the most common tasks is visualizing data distributions.
Frequently, we want to understand how our data are distributed as part of exploratory data analysis.
Sometimes we explore data to find out how it’s structured (i.e., when we first get a dataset).
Other times, we need to explore data distributions to answer a question or validate some hypothesis about the data.
Examining data distributions is also very common in machine learning, since many machine learning techniques assume that the data are distributed in particular ways.
There are two primary ways to examine data distributions: the histogram and the density plot.
Histograms
Histograms are arguably the most common tool for examining data distributions.
In a typical histogram, we map a numeric variable to the x axis.
The x axis is then divided up into a number of “bins.” For example, there might be a bin from 10 to 20, the next bin from 20 to 30, the next from 30 to 40, and so on.
When we create a histogram (or use software to create a histogram) we count the number of observations in each bin.
Then we plot a bar for each bin. The length of the bar corresponds to the number of records that are within that bin on the x-axis.
Ultimately, a histogram contains a group of bars that show the “height” of the data (i.e., the count of the data) for different values of our numeric variable.
The histogram shows us how a variable is distributed.
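To make the binning idea concrete, here is a minimal sketch in plain Python. The sample values and bin edges are made up for illustration; they are not part of the tutorial’s dataset.

```python
# Hypothetical sample values, chosen only to illustrate binning.
values = [12, 15, 21, 27, 29, 33, 38, 38, 41, 47]

# Bin edges for the bins 10-20, 20-30, 30-40, and 40-50.
bin_edges = [10, 20, 30, 40, 50]

# Count how many observations fall inside each bin.
counts = [0] * (len(bin_edges) - 1)
for value in values:
    for i in range(len(counts)):
        # Each bin includes its left edge and excludes its right edge.
        if bin_edges[i] <= value < bin_edges[i + 1]:
            counts[i] += 1
            break

# Print a crude text histogram: one bar per bin, bar length = count.
for i, count in enumerate(counts):
    print(f"{bin_edges[i]}-{bin_edges[i + 1]}: {'#' * count} ({count})")
```

Plotting tools do essentially this counting step for you before they draw the bars.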
Density plots
The other primary tool for evaluating data distributions is the density plot.
There are a variety of methods for creating density plots, but one of the most common is called “kernel density estimation.” The plot that we generate when we use kernel density estimation is called a “kernel density estimation plot.” These are also known as “KDE plots” for short.
KDE plots (i.e., density plots) are very similar to histograms in terms of how we use them. We use density plots to evaluate how a numeric variable is distributed.
The main differences are that KDE plots use a smooth line to show distribution, whereas histograms use bars. So KDE plots show density, whereas histograms show count.
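If you’re wondering where the smooth line comes from, here is a rough sketch of kernel density estimation with a Gaussian kernel. This is not Seaborn’s implementation. The function name gaussian_kde_sketch, the sample data, and the bandwidth of 1.0 are all my own assumptions for illustration.

```python
import numpy as np

def gaussian_kde_sketch(data, grid, bandwidth):
    """Average a Gaussian bump centered on each observation."""
    data = np.asarray(data, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Standardized distance between every grid point and every observation.
    z = (grid[:, None] - data[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    return bumps.mean(axis=1) / bandwidth

# Hypothetical data: 300 draws from a normal distribution.
rng = np.random.default_rng(0)
sample = rng.normal(loc=85, scale=3, size=300)

# Evaluate the density estimate on an evenly spaced grid.
grid = np.linspace(70, 100, 301)
density = gaussian_kde_sketch(sample, grid, bandwidth=1.0)

# A density curve should enclose an area of approximately 1.
step = grid[1] - grid[0]
area = density.sum() * step
print(round(area, 3))
```

The key point is that the y axis of a KDE plot measures density (the area under the curve is about 1), not a count of observations.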
Now that I’ve explained histograms and KDE plots generally, let’s talk about them in the context of Seaborn.
Histograms and density plots in Seaborn
Seaborn has two different functions for visualizing univariate data distributions: seaborn.kdeplot() and seaborn.distplot().
In this tutorial, we’re really going to talk about the distplot function.
The Seaborn distplot function creates histograms and KDE plots
Technically, Seaborn (at least in versions before 0.11) does not have its own dedicated function to create histograms.
Instead, it has the seaborn.distplot() function.
The distplot function creates a combined plot that contains both a KDE plot and a histogram.
At least, that’s the default behavior.
You can use the distplot function to create a chart with only a histogram or only a KDE plot.
I’ll show you how to do both in the examples section, but to understand how, you first need to understand the syntax.
That being the case, let’s take a look at the syntax of the seaborn.distplot function.
The syntax of sns.distplot
The technical name of the function is seaborn.distplot, but it’s a very common convention to call the function with the code sns.distplot. That’s the convention we’ll be using going forward: we’re going to call the function as sns.distplot().
(Remember, to use the sns. prefix, you need to import Seaborn with the code import seaborn as sns.)
A simple example of Seaborn distplot syntax
In the simplest version of the syntax, you just call the function sns.distplot() and provide the name of a DataFrame column, array, or list inside the parentheses.
This will create a simple combined histogram/KDE plot.
However, the function can be used in more complex ways, if you use some extra parameters.
Let’s take a look at a few important parameters of the sns.distplot function.
The parameters of sns.distplot
The sns.distplot function has about a dozen parameters that you can use. However, you won’t need most of them.
That being the case, we’re going to focus on a few of the most common parameters for sns.distplot:
color
kde
hist
bins
Let’s take a closer look at each of them.
color
The color parameter does what it sounds like: it changes the color of the KDE plot and the histogram. It is optional; if you leave it out, Seaborn uses its default color.
You can use a “named” color that matplotlib recognizes, like red, green, blue, darkred, etc.
You can also use hexadecimal colors. Hex colors are beyond the scope of this post. They’re fairly easy once you get the hang of them, but in the interest of simplicity I’m not going to explain them here.
kde
The kde parameter enables you to turn the KDE plot on and off in the output.
This parameter accepts a boolean value as an argument (i.e., True or False).
By default, the kde parameter is set to kde = True. That means that by default, the sns.distplot function will include a kernel density estimate of your input variable.
If you manually set kde = False, then the function will remove the KDE plot.
hist
The hist parameter controls whether or not a histogram will appear in the output.
This parameter accepts a boolean input.
By default, it is set to hist = True, which means that by default, the output plot will include a histogram of the input variable.
If you set hist = False, the function will remove the histogram from the output.
bins
The bins parameter enables you to control the number of bins in the output histogram.
If you do not set a value for the bins parameter, the function will automatically compute an appropriate number of bins.
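To give a feel for what “automatically compute” can mean, here is a sketch of one common rule of thumb, the Freedman-Diaconis rule. As far as I know, older versions of distplot used this rule and capped the result at 50 bins, but treat that as an assumption rather than a guarantee about your Seaborn version. The data here is the same kind of normally distributed array that we create later in the tutorial.

```python
import numpy as np

np.random.seed(42)
normal_data = np.random.normal(size=300, loc=85, scale=3)

# Freedman-Diaconis rule: bin width = 2 * IQR / n^(1/3)
q75, q25 = np.percentile(normal_data, [75, 25])
bin_width = 2 * (q75 - q25) / len(normal_data) ** (1 / 3)

# Number of bins needed to cover the full range of the data.
n_bins = int(np.ceil((normal_data.max() - normal_data.min()) / bin_width))
print(n_bins)

# NumPy ships the same rule under the name "fd", which works as a cross-check.
fd_edges = np.histogram_bin_edges(normal_data, bins="fd")
print(len(fd_edges) - 1)
```

The rule widens the bins when the data are spread out and narrows them as the number of observations grows.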
Now that you’ve learned about the syntax and parameters of sns.distplot, let’s take a look at some concrete examples.
Examples: how to visualize distributions with seaborn
Here, we’re going to take a look at several examples of the distplot function.
That will include creating a combination histogram/KDE, as well as individual histograms or KDE plots (without the other).
Examples:
- How to create a Seaborn distplot
- Change the color
- Create a Seaborn histogram
- Change the number of bins in the Seaborn histogram
- Create a density plot with Seaborn
Run this code first
Before you run any of the code for these examples, you’ll need to run some preliminary code.
Specifically, you’ll need to import a few packages, set the plot background formatting, and create a DataFrame.
Import packages
First, you need to import two packages, Numpy and Seaborn.
We’ll use Numpy to create a normally distributed dataset that we can plot, and we’ll obviously need Seaborn in order to use the distplot function.
import numpy as np
import seaborn as sns
Set formatting
We’ll also set the chart formatting using the sns.set_style() function.
Depending on your Python settings, Seaborn charts can have the same format as matplotlib charts. Frankly, the matplotlib formatting is a little ugly. Seaborn gives us some better options.
The two options I like best are darkgrid and dark. I frequently use darkgrid for other Seaborn charts, but I prefer dark when I use distplot. That’s because the lines and histogram bars from distplot are a little transparent, and the gridlines from darkgrid tend to distract from the plot.
# sns.set_style('darkgrid')
sns.set_style('dark')
Create dataset
Here, we’re going to create a simple, normally distributed Numpy array.
We’ll create this array by using the np.random.normal function.
np.random.seed(42)
normal_data = np.random.normal(size = 300, loc = 85, scale = 3)
Using the loc parameter and scale parameter, we’ve created this data to have a mean of 85, and a standard deviation of 3.
We’ll be able to see some of these details when we plot it with the sns.distplot() function.
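If you’d like to confirm those numbers before plotting, a quick check like the following should do it. This is just a sanity check that I’m adding as a sketch; the variable names sample_mean and sample_sd are my own.

```python
import numpy as np

np.random.seed(42)
normal_data = np.random.normal(size=300, loc=85, scale=3)

# The sample statistics should land near loc and scale, not exactly on them,
# because this is a random sample of only 300 values.
sample_mean = normal_data.mean()
sample_sd = normal_data.std(ddof=1)  # ddof=1 gives the sample standard deviation
print(round(sample_mean, 2), round(sample_sd, 2))
```

Expect values close to 85 and 3, with a little random variation.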
EXAMPLE 1: How to create a Seaborn distplot
First, we’re going to create a distplot with Seaborn.
Remember that by default, the sns.distplot function includes both a histogram and a KDE plot.
Let’s just run the code and take a look at the output.
Here’s the code:
sns.distplot(normal_data)
And here’s the output.
Overall, the distplot shows us how the data are distributed. Remember that when we created the data, we created it to have a mean of 85 and a standard deviation of 3.
Although the standard deviation is a little difficult to see precisely from the plot, the plot certainly shows that the mean of the data is roughly around 85.
The histogram part of the plot gives us a slightly granular view of how the data are distributed. We can roughly see the relative counts within each “bin” of the x axis.
The KDE line (the smooth line) smooths over some of the rough details and provides a smooth distribution line that we can examine.
I don’t want to get too deep into the weeds concerning how we can use this plot for data analysis; that’s beyond the scope of the post.
The ultimate point is that this is fairly easy to create. We simply call the function and provide the name of the variable that we want to plot inside the parentheses.
EXAMPLE 2: Change the color
Next, we’re going to change the color of the plot.
By default, the color is a sort of medium blue color.
Here, we’re going to change the color to “navy.” To do this, we’ll set the color parameter to color = 'navy'.
sns.distplot(normal_data, color = 'navy')
OUT:
Notice in this chart that the color has been changed to a darker shade of blue.
Also notice, however, that although the KDE line is a dark navy color, the histogram is still a little light.
That’s because the histogram is set to be slightly transparent. Technically, the histogram is colored navy, but it’s just a little transparent.
EXAMPLE 3: How to create a Seaborn histogram
Now, let’s create a Seaborn histogram.
To do this, we’re going to call the distplot function and we’re going to remove the KDE line by setting the kde parameter to kde = False.
sns.distplot(normal_data, kde = False)
Here’s the output:
This is pretty straightforward. By setting kde = False, we’re telling the sns.distplot function to remove the KDE line. This leaves only the histogram in its place.
At this point, I think I should comment. I think that it’s debatable whether or not you should create a pure Seaborn histogram without the KDE line.
When I first started using the distplot function, I wanted to create histograms in Seaborn (without the KDE line).
After using it for a while, I actually prefer the distplot that contains both the histogram and the KDE line.
Try them out and see which you prefer.
EXAMPLE 4: Change the number of bins in a Seaborn histogram
Let’s quickly change the number of bins in the histogram.
Here, we’re still going to remove the KDE line in the plot, and we’ll create the underlying histogram with 50 bins.
sns.distplot(normal_data, kde = False, bins = 50)
OUT:
Here, we’ve simply created a Seaborn histogram with 50 bins.
The increased number of bins shows more granularity in the data distribution.
Seeing an increased number of bins can actually help when there’s a lot of variation at small scales or when we’re looking for unusual features in the data distribution (like a spike in a particular location).
Having said that, as an analyst or data scientist, you need to learn when to use a large number of bins, and when to use a small number.
There’s a bit of an art to choosing the right number of bins, and it takes practice.
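One way to build that intuition without looking at a chart is to compare the raw bin counts for a small and a large number of bins. This sketch uses np.histogram, which is NumPy’s counting function rather than anything from Seaborn, so treat it as an aside.

```python
import numpy as np

np.random.seed(42)
normal_data = np.random.normal(size=300, loc=85, scale=3)

# Compare a coarse histogram (10 bins) with a fine one (50 bins).
for n_bins in (10, 50):
    counts, edges = np.histogram(normal_data, bins=n_bins)
    print(f"{n_bins} bins: width={edges[1] - edges[0]:.2f}, "
          f"tallest bar={counts.max()}, empty bins={(counts == 0).sum()}")
```

With more bins, each bar covers a narrower range and holds fewer observations, so the plot shows more detail but also more noise.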
EXAMPLE 5: Plot a kernel density estimate with Seaborn
Finally, let’s just plot a KDE line without the underlying histogram.
We can do this by calling the distplot function and setting the hist parameter to hist = False.
sns.distplot(normal_data, hist = False)
OUT:
Although I think it can be useful to have the combined KDE/histogram plot, I also like the lone KDE line, as seen here.
I think that this would be particularly useful if you had a large number of variables that you needed to plot (perhaps inside of a small multiple chart).
If you needed to plot a dozen or more distributions, for example, it might be better just to see the KDE line. If you’re plotting a large number of variables, a pure KDE line might be less distracting and easier to read at a glance.
That said, I think there’s an element of preference here as well. Play around with these and see which options you like best.
Frequently asked questions about Seaborn histograms and distplots
Now that you’ve learned about Seaborn histograms and distplots and seen some examples, let’s review some frequently asked questions.
Frequently asked questions:
Question 1: How can I make the histogram more opaque?
You’ve probably noticed that by default, the histogram in the distplot is a little transparent.
That’s the default setting.
How do you make it more opaque?
You actually need to use a parameter from matplotlib (the alpha parameter). Moreover, you need to call this in a special way. You need to use the hist_kws parameter from sns.distplot to access the underlying matplotlib parameter.
Here’s some code that shows how:
sns.distplot(normal_data, kde = False, hist_kws = {"alpha": 1})
OUT:
Here, the code hist_kws = {"alpha": 1} is accessing the alpha parameter from matplotlib, and setting alpha equal to 1.
Notice that the output histogram is fully opaque.
Question 2: What’s the difference between distplot and kdeplot?
Seaborn actually has two functions to plot the distribution of a variable: sns.distplot and sns.kdeplot.
What’s the difference?
They are almost the same.
The KDE line in a distplot is exactly the same as the KDE line from sns.kdeplot. The only difference is that sns.distplot includes a histogram.
If you call sns.distplot(my_var, hist = False), then the output will be identical to sns.kdeplot(my_var).
Leave your other questions in the comments below
Do you have other questions about using the sns.distplot function to create a Seaborn histogram, or a visualization of a distribution?
Leave your question in the comments section at the bottom of the page.
Join our course to learn more about Seaborn
The examples you’ve seen in this tutorial should be enough to get you started, but if you’re serious about learning Seaborn, you should enroll in our premium course called Seaborn Mastery.
There’s a lot more to learn about Seaborn, and Seaborn Mastery will teach you everything, including:
- How to create essential data visualizations in Python
- How to add titles and axis labels
- Techniques for formatting your charts
- How to create multi-variate visualizations
- How to think about data visualization in Python
- and more …
Moreover, it will help you completely master the syntax within a few weeks. You’ll discover how to become “fluent” in writing Seaborn code.
Find out more here:
Learn More About Seaborn Mastery
The post How to make a Seaborn histogram with the distplot function appeared first on Sharp Sight.