How to do Simple EDA for Machine Learning
This article is originally published at https://www.sharpsightlabs.com
In this tutorial, I’ll show you how to do some simple exploratory data analysis (EDA) for a machine learning project.
We’ll look at the Titanic dataset, which is commonly used in machine learning tutorials and has previously been used as a Kaggle dataset.
This tutorial will really only scratch the surface. There’s a lot of analysis that we could do, but in the interest of brevity, I’ll show you a few things.
Project Setup (prior to EDA)
First, we need to set a few things up before we do our EDA.
Specifically, we need to import some packages and get our data.
Import Packages
First, we need to import some packages.
import pandas as pd
import seaborn as sns
import seaborn.objects as so
We’re importing Seaborn, which enables us to create many of the visualizations we’ll use to analyze our data. We’ll also use the relatively new Seaborn Objects interface, which ships with Seaborn as the seaborn.objects module in version 0.12 and later (so you might need to upgrade Seaborn).
And finally, we’ve imported Pandas, which will give us some tools to wrangle or subset our data.
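If you are not sure whether your installed Seaborn is recent enough for the Objects interface, a quick check like the following may help. This is only a sketch: the helper name seaborn_objects_available is hypothetical, and it relies on the assumption that version 0.12 is the first release that includes seaborn.objects.

```python
from importlib import metadata


def seaborn_objects_available(min_version=(0, 12)):
    """Return True if an installed seaborn is at least min_version.

    Hypothetical helper: it compares only the major and minor parts
    of the installed version string.
    """
    try:
        installed = metadata.version("seaborn")
    except metadata.PackageNotFoundError:
        # seaborn is not installed at all
        return False
    major, minor = (int(part) for part in installed.split(".")[:2])
    return (major, minor) >= min_version


print(seaborn_objects_available())
```

If this prints False, upgrading Seaborn with your usual package manager should make the import of seaborn.objects work.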
Load Dataset
We also need to load the titanic dataset, which we’ll be analyzing.
titanic = sns.load_dataset('titanic')
We’ll eventually need to do a little data manipulation on this dataset, but before we do that, we’ll actually need to get a sense of what’s in here. In turn, that will help us decide on how to modify the data going forward.
Basic Data Inspection
Now that we have a dataframe, we’ll do some basic data inspection.
Specifically, we will:
- Print some records
- List the data types
- Count the missing records by column
Print records
Here, we’ll print out a few of the records using the Pandas head method.
# PRINT RECORDS
titanic.head()
OUT:
   survived  pclass     sex   age  ...  deck  embark_town  alive  alone
0         0       3    male  22.0  ...   NaN  Southampton     no  False
1         1       1  female  38.0  ...     C    Cherbourg    yes  False
2         1       3  female  26.0  ...   NaN  Southampton    yes   True
3         1       1  female  35.0  ...     C  Southampton    yes  False
4         0       3    male  35.0  ...   NaN  Southampton     no   True

[5 rows x 15 columns]
List Data Types
Next, we’ll list the data types in the dataframe.
To do this, we’ll call the dtypes property of the dataframe.
titanic.dtypes
OUT:
survived          int64
pclass            int64
sex              object
age             float64
sibsp             int64
parch             int64
fare            float64
embarked         object
class          category
who              object
adult_male         bool
deck           category
embark_town      object
alive            object
alone              bool
dtype: object
Here, you can see that we have a mix of integers, floats, categories, and “objects” (which are commonly strings).
We will need to modify or change a few of these, but it will become more obvious how we need to change this as we move forward.
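As a rough illustration of how you might split columns by type before deciding what to recode, here is a small sketch. The sample dataframe is hypothetical: its column names mirror the Titanic data, but the values are made up so the example runs on its own.

```python
import pandas as pd

# Hypothetical miniature frame; values are invented for illustration
sample = pd.DataFrame({
    "survived": [0, 1, 1],
    "fare": [7.25, 71.28, 7.92],
    "sex": ["male", "female", "female"],
    "alone": [False, False, True],
})

# Integer and float columns
numeric_cols = sample.select_dtypes(include="number").columns.tolist()

# Everything else (strings, booleans, categories)
non_numeric_cols = sample.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(non_numeric_cols)
```

Note that a column landing in the numeric group does not mean it is truly numeric. A 0/1 column like survived is stored as an integer but behaves like a category.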
Get Count of Missing Values by Column
Now, before we move on to some visualizations, we’ll get a count of missing values.
# GET COUNT OF MISSING
(titanic
 .isnull()
 .sum()
)
OUT:
survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64
Most of these columns have zero or very few missing values. But two of the columns (age and deck) have a substantial number.
We may want to avoid these variables in a model, particularly deck.
But we may still take a look at them and see what they contain.
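Raw counts are easier to judge as a share of all rows. One way to get that, sketched here on a hypothetical four-row frame with made-up values, is to take the mean of the missing-value flags instead of the sum.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; column names mirror the Titanic data, values are invented
sample = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0],
    "deck": [np.nan, "C", np.nan, np.nan],
    "fare": [7.25, 71.28, 7.92, 53.10],
})

# The mean of a True/False column is the fraction of True values,
# so this gives the share of missing values per column
missing_share = (sample
                 .isnull()
                 .mean()
                 .sort_values(ascending=False)
                 )

print(missing_share)
```

If the dataset has the usual 891 rows (an assumption, since the row count is not printed above), the counts of 177 and 688 work out to roughly 20% missing for age and 77% missing for deck.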
High-Level Visualizations
And now, we’ll do some high-level visualizations.
Specifically, we’ll create:
- Histograms of the numeric variables
- A plot of the target variable
- Plots of the target variable, broken out by other variables
We’ll also do a little data manipulation along the way.
Create Histograms of Numeric Variables
Next, we’re going to plot histograms of the numeric variables.
To do this, we’re going to use the Seaborn FacetGrid technique, which creates a small multiple chart (AKA, trellis chart).
But first, we need to use the Pandas select_dtypes method to retrieve the numeric variables, and then pass them to the Pandas melt function. The melt function will restructure the data from wide format (one column per variable) to long format (one row per variable-value pair). Said differently, Pandas melt will put the dataset into a format that allows us to create the small multiple chart.
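To make the reshaping step more concrete, here is a minimal sketch of select_dtypes followed by melt on a hypothetical two-row frame (the values and the name column are made up for illustration).

```python
import pandas as pd

# Hypothetical wide frame: two numeric columns and one text column
wide = pd.DataFrame({
    "age": [22.0, 38.0],
    "fare": [7.25, 71.28],
    "name": ["a", "b"],
})

# Keep only the numeric columns, then stack them into
# a 'variable' column and a 'value' column
long = pd.melt(wide.select_dtypes(include="number"))

print(long)
```

The result has one row per original cell, with the former column name in variable and the number in value. That variable column is what FacetGrid can use to split the data into one panel per variable.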
Notice that we’re calling sns.histplot to actually create the histograms.
# CREATE HISTOGRAMS OF NUMERIC VARIABLES
# include = 'number' selects integer and float columns without needing a NumPy import
hist_grid = sns.FacetGrid(data = pd.melt(titanic.select_dtypes(include = 'number'))
                          ,col = 'variable'
                          ,col_wrap = 3
                          ,sharex = False
                          )
hist_grid.map(sns.histplot, 'value', bins = 10)
OUT: [Figure: histograms of the numeric variables, one panel per variable]
I’m not going to analyze these plots in detail, but a few things stand out.
First, the age variable seems not-quite-exactly normal, but it does roughly have a bell shape. This is somewhat more typical of what we might want for a numeric variable.
But several of the other “numeric” variables are definitely not normal.
In particular, both survived and pclass seem to be categorical variables. They aren’t distributed across many values, but instead have distinct peaks.
Given that, we’ll quickly recode those variables to behave more like categoricals.
Recode Variables to Categoricals
Here, we’re going to recode survived and pclass to categorical variables.
To do this, we’ll need to use multiple Pandas tools in combination.
Most importantly, we need to use the Pandas pd.Categorical function to create categorical variables. Notice that we’re specifying the categorical values, and using the ordered parameter to specify that these categories have a specific order.
We’re using the Pandas astype method to convert the values to strings, so that the variables are no longer treated as numeric.
And we’re using the Pandas assign method to add the newly constructed variables to our dataframe.
titanic_new = (titanic
    .assign(pclass = pd.Categorical(titanic.pclass.astype(str)
                                    ,categories = ['1', '2', '3']
                                    ,ordered = True))
    .assign(survived = pd.Categorical(titanic.survived.astype(str)
                                      ,categories = ['0','1']
                                      ,ordered = True))
    )
Notice that the whole expression is enclosed in parentheses. We’re calling multiple functions and methods, and using multiple “assign” operations in series. This is ultimately an example of Pandas method chaining, which you really need to know if you want to do complex data manipulations.
I’ll leave you to inspect the new data with a few data inspection methods.
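If you want a starting point for that inspection, here is a small sketch of the kinds of checks you might run on a recoded column. It uses a hypothetical four-value series rather than the real data, so the names pclass_raw and pclass_cat are only for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the pclass column
pclass_raw = pd.Series([3, 1, 2, 3])

# Same recoding pattern as above: cast to string, then build an ordered categorical
pclass_cat = pd.Series(pd.Categorical(pclass_raw.astype(str)
                                      ,categories = ['1', '2', '3']
                                      ,ordered = True))

# Check the dtype and the category labels
print(pclass_cat.dtype)
print(pclass_cat.cat.categories.tolist())

# Count records per category, in category order
print(pclass_cat.value_counts(sort = False))

# Because the categories are ordered, comparisons against a category work
print((pclass_cat < '3').tolist())
```

The same checks should apply to titanic_new.pclass and titanic_new.survived.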
Visualize the Target Variable (Survived)
Now that we’ve done a little data cleaning, we’ll visualize the survived variable.
Here, we’re going to create a countplot with Seaborn of the survived variable.
sns.countplot(data = titanic_new
              ,x = 'survived'
              )
OUT: [Figure: count plot of the survived variable]
You’ll notice that more people died (0) than survived (1). In fact, only about 38% survived.
We’re going to analyze the survived variable a little further with some additional visualizations.
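A chart like this can be backed up with a quick numeric check. The sketch below uses a hypothetical five-value series with made-up values. The same value_counts call should work on the survived column of the real data.

```python
import pandas as pd

# Hypothetical target column: 0 = died, 1 = survived
survived = pd.Series([0, 0, 1, 0, 1], name = 'survived')

# normalize = True returns shares instead of raw counts
shares = survived.value_counts(normalize = True)

print(shares)
```

Knowing the share of each class is useful later on, because an imbalanced target affects how you should read a metric like accuracy.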
Visualize Survived by Sex
Here, we’re going to create a new bar chart of the survived variable, but this time we’re going to break out the data by sex.
To do this, we’re going to use the relatively new Seaborn Objects interface (the seaborn.objects module) and create a count plot that puts sex on the x-axis, and breaks out that data by creating additional bars for survived and “dodging” them to the side.
In other words, we’re creating a dodged bar chart.
(so.Plot(data = titanic_new
         ,x = 'sex'
         ,color = 'survived'
         )
   .add(so.Bar(), so.Count(), so.Dodge())
)
OUT: [Figure: dodged bar chart of passenger counts by sex, colored by survived]
Sorry guys … if you were a male on the Titanic, you were pretty likely to die.
Ladies. A bit more “lucky.” You would have been more likely to live than die.
Importantly, it looks like the sex variable is fairly predictive of survival. This would be important in a classification model.
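One way to put numbers on that relationship is a normalized cross-tabulation. This is a sketch on a hypothetical six-row frame with made-up values, not the real Titanic data.

```python
import pandas as pd

# Hypothetical sample; values are invented for illustration
sample = pd.DataFrame({
    'sex': ['male', 'male', 'male', 'female', 'female', 'male'],
    'survived': [0, 0, 1, 1, 1, 0],
})

# normalize = 'index' makes each row sum to 1,
# so each cell is the share of that sex with that outcome
rates = pd.crosstab(sample['sex'], sample['survived'], normalize = 'index')

print(rates)
```

Each row reads as "of the passengers of this sex, what share died and what share survived," which is the same comparison the dodged bar chart shows visually.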
Visualize Survived by Pclass
Next, let’s visualize survived by pclass.
This variable encodes the “passenger class,” and has values 1, 2, and 3. You can think of this like first class vs coach for modern plane flights.
Here, we’re going to visualize this as a bar chart, but I’m actually going to calculate the percent of people who survived.
I’m also going to facet this chart out by sex, in order to create a small multiple chart.
To do this, I’m using the Seaborn Objects interface with so.Bar to create a bar chart and so.Agg to compute the mean. Because we’re also using the astype method to treat the survived variable as a binary, 0/1 integer, computing the mean of this variable will give us the proportion of passengers who survived.
Note that I’m also faceting this plot with the Seaborn Objects facet method to create a small multiple chart.
Ok, here’s the code:
(so.Plot(data = titanic_new.assign(survived = titanic_new.survived.astype(int))
         ,x = 'pclass'
         ,y = 'survived'
         )
   .add(so.Bar(), so.Agg('mean'))
   .facet(col = 'sex')
)
And here’s the output: [Figure: bar charts of mean survival by pclass, one panel per sex]
Fascinating.
If you were female, you were more likely to survive than a male in any passenger class. But females in 1st and 2nd class were more likely to survive than females in 3rd class.
If you were male, you were still pretty likely to die no matter what, but much more likely to die if you were in 2nd or 3rd class. The men in 1st class actually had almost a 40% chance of survival, vs under 20% in 2nd and 3rd.
Clearly, passenger class, like sex, had a strong relationship with survival.
Again, this is useful to us when building a machine learning model.
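The bar heights in a chart like this are just group means, so you can compute them as a table too. Here is a sketch on a hypothetical six-row frame (made-up values), using groupby and the mean of a 0/1 column.

```python
import pandas as pd

# Hypothetical sample; values are invented for illustration
sample = pd.DataFrame({
    'sex': ['female', 'female', 'male', 'male', 'male', 'male'],
    'pclass': [1, 3, 1, 1, 3, 3],
    'survived': [1, 0, 1, 0, 0, 0],
})

# The mean of a 0/1 column is the share of 1s, so this is
# the survival rate for each sex and passenger class
rates = (sample
         .groupby(['sex', 'pclass'])['survived']
         .mean()
         .unstack('pclass')
         )

print(rates)
```

A table like this makes it easier to quote exact rates, while the faceted chart makes the pattern easier to see at a glance.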
There’s probably more that we could do
I’m going to stop there for now.
We’ve identified a few variables that appear to be related to the target (both sex and pclass).
We also identified a few variables that we needed to recode.
Having said that, there’s probably more that we could do.
Tell me what you want to know
What else do you want to see for machine learning EDA?
Do you have any suggestions?
What did I miss?
I want to hear from you …
Give me your feedback by leaving a comment in the comments section below.