How to do Simple EDA for Machine Learning

by Joshua Ebner · May 11, 2023

This article is originally published at https://www.sharpsightlabs.com

In this tutorial, I’ll show you how to do some simple exploratory data analysis (EDA) for a machine learning project.

We’ll look at the Titanic dataset, which is commonly used in machine learning tutorials and has previously been used as a Kaggle dataset.

This tutorial will really only scratch the surface. There’s a lot of analysis that we could do, but in the interest of brevity, I’ll show you a few things.

Project Setup (prior to EDA)

First, we need to set a few things up before we do our EDA.

Specifically, we need to import some packages and get our data.

IMPORT PACKAGES

First, we need to import some packages.

import numpy as np
import pandas as pd
import seaborn as sns
import seaborn.objects as so

We’re importing Seaborn, which enables us to create many of the visualizations we’ll use to analyze our data. We’ll also use the relatively new Seaborn Objects interface, which ships as part of Seaborn itself starting with version 0.12 (so you might need to upgrade Seaborn).

And finally, we’ve imported Pandas, which will give us some tools to wrangle or subset our data, and NumPy, which we’ll use to select the numeric columns.

Load Dataset

We also need to load the titanic dataset, which we’ll be analyzing.

titanic = sns.load_dataset('titanic')

We’ll eventually need to do a little data manipulation on this dataset, but before we do that, we’ll actually need to get a sense of what’s in here. In turn, that will help us decide on how to modify the data going forward.

Basic Data Inspection

Now that we have a dataframe, we’ll do some basic data inspection.

Specifically, we will:

Print some records
List the data types
Count the missing records by column

Print records

Here, we’ll print out a few of the records using the Pandas head method.

# PRINT RECORDS
titanic.head()

OUT:

   survived  pclass     sex   age  ...  deck  embark_town  alive  alone
0         0       3    male  22.0  ...   NaN  Southampton     no  False
1         1       1  female  38.0  ...     C    Cherbourg    yes  False
2         1       3  female  26.0  ...   NaN  Southampton    yes   True
3         1       1  female  35.0  ...     C  Southampton    yes  False
4         0       3    male  35.0  ...   NaN  Southampton     no   True

[5 rows x 15 columns]

List Data Types

Next, we’ll list the data types in the dataframe.

To do this, we’ll call the dtypes property of the dataframe.

titanic.dtypes

OUT:

survived          int64
pclass            int64
sex              object
age             float64
sibsp             int64
parch             int64
fare            float64
embarked         object
class          category
who              object
adult_male         bool
deck           category
embark_town      object
alive            object
alone              bool
dtype: object

Here, you can see that we have a mix of integers, floats, booleans, categories, and “objects” (which are commonly strings).

We will need to modify or change a few of these, but it will become more obvious how we need to change this as we move forward.

Get Count of Missing Values by Column

Now, before we move on to some visualizations, we’ll get a count of missing values.

# GET COUNT OF MISSING
(titanic
 .isnull()
 .sum()
 )

OUT:

survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64

Most of these columns have zero or very few missing values. But two of the columns (age and deck) have a substantial number.

We may want to avoid these variables in a model, particularly deck.

But we may still take a look at them and see what they contain.
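Raw counts are easier to judge as a share of all rows. Here’s a minimal sketch of that calculation on a tiny made-up dataframe (the toy dataframe and its values are assumptions for illustration, not the Titanic data):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic dataframe (values are made up for illustration)
toy = pd.DataFrame({
    'age':  [22.0, np.nan, 26.0, 35.0],
    'deck': [np.nan, 'C', np.nan, np.nan],
    'fare': [7.25, 71.28, 7.92, 53.10],
})

# isnull() gives True/False per cell; the mean of booleans is the share of True
pct_missing = (toy
               .isnull()
               .mean()
               .mul(100)
               .sort_values(ascending = False)
               )

print(pct_missing)
```

Running the same chain on the titanic dataframe should show that the 177 missing age values and 688 missing deck values work out to roughly 20% and 77% of the 891 rows.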

High-Level Visualizations

And now, we’ll do some high-level visualizations.

Specifically, we’ll create:

Histograms of the numeric variables
A visualization of the target variable
Visualizations of the target variable, broken out by other variables

We’ll also do a little data manipulation along the way.

Create Histograms of Numeric Variables

Next, we’re going to plot histograms of the numeric variables.

To do this, we’re going to use the Seaborn FacetGrid technique, which creates a small multiple chart (AKA, trellis chart).

But we also need to use select_dtypes to retrieve the numeric variables, and then pass them to the Pandas melt function. The melt function will restructure the data from wide format (one column per variable) to long format (one row per variable-value pair). Said differently, Pandas melt will put the dataset into a format that allows us to create the small multiple chart.

Notice that we’re calling sns.histplot to actually create the histograms.
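To make the reshaping step concrete, here’s a minimal sketch on a tiny made-up dataframe. The toy dataframe and its values are assumptions for illustration only; they are not the Titanic data.

```python
import numpy as np
import pandas as pd

# Toy dataframe (hypothetical values) with two numeric columns and one string column
toy = pd.DataFrame({
    'age':  [22.0, 38.0],
    'fare': [7.25, 71.28],
    'sex':  ['male', 'female'],   # non-numeric, so select_dtypes drops it
})

# Keep only the numeric columns
numeric_only = toy.select_dtypes(include = np.number)

# Reshape from wide to long: one row per (variable, value) pair
long_format = pd.melt(numeric_only)

print(long_format)
```

Each numeric column becomes a block of rows labeled in the variable column, which is what lets FacetGrid draw one panel per variable.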

# CREATE HISTOGRAMS OF NUMERIC VARIABLES
hist_grid = sns.FacetGrid(data = pd.melt(titanic.select_dtypes(include = np.number))
                       ,col='variable'
                       ,col_wrap = 3
                       ,sharex=False
                       )
hist_grid.map(sns.histplot, 'value', bins=10)

OUT:

I’m not going to analyze these plots in detail, but a few things stand out.

First, the age variable seems not-quite-exactly normal, but it does roughly have a bell shape. This is somewhat more typical of what we might want for a numeric variable.

But several of the other “numeric” variables are definitely not normal.

In particular, both survived and pclass seem to be categorical variables. They aren’t distributed across many values, but instead have distinct peaks.
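One quick way to back up that visual impression is to count the distinct values in each numeric column. This is a minimal sketch on a made-up dataframe (the toy values are assumptions, not the real data); columns with only a handful of distinct values are candidates for recoding.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the numeric Titanic columns (values are made up)
toy = pd.DataFrame({
    'survived': [0, 1, 1, 0, 1, 0],
    'pclass':   [3, 1, 3, 2, 1, 3],
    'age':      [22.0, 38.0, 26.0, 35.0, 54.0, 2.0],
})

# Count distinct values per numeric column, smallest first
distinct_counts = (toy
                   .select_dtypes(include = np.number)
                   .nunique()
                   .sort_values()
                   )

print(distinct_counts)
```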

That said, we’ll quickly recode those variables to behave more like categoricals.

Recode Variables to Categoricals

Here, we’re going to recode survived and pclass to categorical variables.

To do this, we’ll need to use multiple Pandas tools in combination.

Most importantly, we need to use the Pandas pd.Categorical function to create categorical variables. Notice that we’re specifying the categorical values, and using the ordered parameter to specify that these categories have a specific order.

We’re using the Pandas astype method to convert the numeric values to strings, so that the categories are labels rather than numbers.

And we’re using the Pandas assign method to add the newly constructed variables to our dataframe.

titanic_new = (titanic
                  .assign(pclass = pd.Categorical(titanic.pclass.astype(str)
                                                    ,categories = ['1', '2', '3']
                                                    ,ordered = True))
                  .assign(survived = pd.Categorical(titanic.survived.astype(str)
                                                    ,categories = ['0','1']
                                                    ,ordered = True))
                  )

Notice that the whole expression is enclosed in parentheses. We’re calling multiple functions and methods, and using multiple “assign” operations in series. This is ultimately an example of Pandas method chaining, which you really need to know if you want to do complex data manipulations.

I’ll leave you to inspect the new data with a few data inspection methods.
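If you want a starting point for that inspection, here’s a minimal sketch. It repeats the same recoding pattern on a tiny made-up dataframe (toy and toy_new are hypothetical names, and the values are not the Titanic data), then checks the result.

```python
import pandas as pd

# Toy dataframe with a numeric class column (values are made up)
toy = pd.DataFrame({'pclass': [3, 1, 3, 2]})

# Same recoding pattern as above: numbers -> strings -> ordered categorical
toy_new = toy.assign(pclass = pd.Categorical(toy.pclass.astype(str)
                                             ,categories = ['1', '2', '3']
                                             ,ordered = True))

# The dtype should now be 'category' instead of an integer type
print(toy_new.dtypes)

# The .cat accessor exposes the category labels and whether they are ordered
print(toy_new.pclass.cat.categories)
print(toy_new.pclass.cat.ordered)

# Ordered categoricals support min and max
print(toy_new.pclass.max())
```

The same checks should work on titanic_new for both pclass and survived.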

Visualize the Target Variable (Survived)

Now that we’ve done a little data cleaning, we’ll visualize the survived variable.

Here, we’re going to create a countplot with Seaborn of the survived variable.

sns.countplot(data = titanic_new
              ,x = 'survived'
              )

OUT:

You’ll notice that more people died (0) than survived (1). In fact, only about 38% survived (342 of the 891 passengers in this dataset).
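The share in each group can also be computed directly instead of read off the chart. Here’s a minimal sketch using value_counts on a made-up column (the toy values are assumptions for illustration, not the real data):

```python
import pandas as pd

# Toy stand-in for the recoded survived column (values are made up)
toy = pd.DataFrame({'survived': ['0', '0', '0', '1', '1']})

# normalize = True returns proportions instead of raw counts
share = toy['survived'].value_counts(normalize = True)

print(share)
```

Calling the same method on titanic_new.survived should give the actual proportions for the dataset.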

We’re going to analyze the survived variable a little further with some additional visualizations.

Visualize Survived by Sex

Here, we’re going to create a new bar chart of the survived variable, but this time we’re going to break out the data by sex.

To do this, we’re going to use the relatively new Seaborn Objects interface (a newer visualization API within Seaborn) and create a bar chart of counts that plots sex on the x-axis, and creates a breakout of that data by creating additional bars for survived, and “dodging” them to the side.

In other words, we’re creating a dodged bar chart.

(so.Plot(data = titanic_new
        ,x = 'sex'
        ,color = 'survived'
        )
  .add(so.Bar(), so.Count(), so.Dodge())
 )

OUT:

Sorry guys … if you were a male on the Titanic, you were pretty likely to die.

Ladies. A bit more “lucky.” You would have been more likely to live than die.

Importantly, it looks like the sex variable is fairly predictive of survival. This would be important in a classification model.
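To put numbers on a chart like this, a cross-tabulation is one option. Here’s a minimal sketch using pd.crosstab on a made-up dataframe (the toy values are assumptions, not the Titanic data); normalize = 'index' makes each row sum to 1, so each cell is the share within that sex.

```python
import pandas as pd

# Toy stand-in with sex and a 0/1 survived column (values are made up)
toy = pd.DataFrame({
    'sex':      ['male', 'male', 'male', 'female', 'female', 'female'],
    'survived': [0, 0, 1, 1, 1, 0],
})

# Rows are sex, columns are survived; each row is normalized to sum to 1
rates = pd.crosstab(toy['sex'], toy['survived'], normalize = 'index')

print(rates)
```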

Visualize Survived by Pclass

Next, let’s visualize survived by pclass.

This variable encodes the “passenger class,” and has values 1, 2, and 3. You can think of this like first class vs coach for modern plane flights.

Here, we’re going to visualize this as a bar chart, but I’m actually going to calculate the percent of people who survived.

I’m also going to facet this chart out by sex, in order to create a small multiple chart.

To do this, I’m using the Seaborn Objects interface with so.Bar to create a bar chart and so.Agg to compute the mean. Because we’re also using the astype method to treat the survived variable as a binary, 0/1 integer, computing the mean of this variable will give us a percent.

Note that I’m also faceting this plot with the Seaborn Objects facet method to create a small multiple chart.

Ok, here’s the code:

(so.Plot(data = titanic_new.assign(survived = titanic_new.survived.astype(int))
        ,x = 'pclass'
        ,y = 'survived'
        )
  .add(so.Bar(), so.Agg('mean'))
  .facet(col = 'sex')
 )

And here’s the output:

Fascinating.

If you were female, you were more likely to survive than a male in any passenger class. But females in 1st and 2nd class were more likely to survive than females in 3rd class.

If you were male, you were still pretty likely to die no matter what, but much more likely to die if you were in 2nd or 3rd class. The men in 1st class actually had almost a 40% chance of survival, vs under 20% in 2nd and 3rd.

Clearly, passenger class, like sex, had a strong relationship with survival.

Again, this is useful to us when building a machine learning model.
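The rates behind a faceted chart like the one above can also be produced as a table. Here’s a minimal sketch with groupby and mean on a made-up dataframe (the toy values are assumptions for illustration, not the real data). Because survived is coded 0/1, the mean of each group is its survival rate.

```python
import pandas as pd

# Toy stand-in with sex, passenger class, and a 0/1 survived column (made up)
toy = pd.DataFrame({
    'sex':      ['female', 'female', 'male', 'male', 'male', 'male'],
    'pclass':   [1, 3, 1, 1, 3, 3],
    'survived': [1, 0, 1, 0, 0, 0],
})

# Mean of a 0/1 column within each group is the share of 1s in that group
survival_rate = (toy
                 .groupby(['sex', 'pclass'])['survived']
                 .mean()
                 .unstack('pclass')   # one row per sex, one column per class
                 )

print(survival_rate)
```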

There’s probably more that we could do

I’m going to stop there for now.

We’ve identified a few variables that appear to be related to the target (both sex and pclass).

We also identified a few variables that we needed to recode.

Having said that, there’s probably more that we could do.

Tell me what you want to know

What else do you want to see for machine learning EDA?

Do you have any suggestions?

What did I miss?

I want to hear from you …

Give me your feedback by leaving a comment in the comments section below.
