ETL – Building a Data Pipeline With Python – Introduction – Part 1 of N

by Scott Stoltzman · March 3, 2019

This article was originally published at https://www.stoltzmaniac.com

ETL (Extract, Transform, Load) is not always the favorite part of a data scientist’s job, but it’s an absolute necessity in the real world. If you don’t understand this process yet, you will have a basic grasp of it by the time you’re done with these lessons. I will be covering:

Data exploration
- Understanding your data
- Looking for red flags
- Utilizing both statistics and data visualization
Checking your data for issues
- Identifying things outside of the “normal” range
- Deciding what to do with NaN or missing values
- Discovering data with the wrong data type
How to clean and transform your data
- Utilizing the pandas library
- Utilizing pyjanitor
- Getting data into tidy format
Dealing with your database
- Determining whether or not you actually need a database
- Choosing the right database
  - Deciding between relational and NoSQL
- Basic schema design and normalization
- Using an ORM – SQLAlchemy to insert data
Building a data pipeline
- Separating your ETL into parts
- Utilizing luigi to keep you on track
- Error monitoring

The best way to go through this series of lessons is to read the blog post and then walk through the Jupyter Notebook located within my GitHub repository.

Here is how to think of the ETL within your data pipeline process:

E – You read in your data from a source. This could be any number of things: a CSV file, a REST API, an Excel file, etc.

T – Check to ensure the data is accurate and is what you would expect it to look like. Run tests on this data.

L – Insert the data into your database (if you need one).
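The three steps above can be sketched as three small functions. This is only a rough illustration, not the code from the series: it uses the standard library (`csv` and `sqlite3`) so it runs without extra installs, whereas the lessons themselves use pandas and SQLAlchemy. The sample data, the column names (`order_id`, `region`, `amount`) and the `sales` table are all made up for the example.

```python
import csv
import io
import sqlite3

# Hypothetical sample standing in for a real CSV source.
RAW_CSV = """order_id,region,amount
1,West,100.50
2,East,
3,West,not_a_number
4,North,42.00
"""


def extract(source):
    """E: read rows from a CSV source into a list of dicts."""
    return list(csv.DictReader(source))


def transform(rows):
    """T: keep rows whose amount parses as a number, set the rest aside."""
    clean, rejected = [], []
    for row in rows:
        try:
            amount = float(row["amount"])
        except (TypeError, ValueError):
            # Missing or non-numeric amounts are rejected, not silently dropped.
            rejected.append(row)
            continue
        clean.append((int(row["order_id"]), row["region"], amount))
    return clean, rejected


def load(conn, records):
    """L: insert the cleaned records into a (hypothetical) sales table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales "
        "(order_id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", records)
    conn.commit()


rows = extract(io.StringIO(RAW_CSV))
clean, rejected = transform(rows)

conn = sqlite3.connect(":memory:")  # in-memory database, just for the demo
load(conn, clean)
loaded_count = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
print(f"loaded {loaded_count} rows, rejected {len(rejected)}")
```

Keeping the rejected rows around, rather than discarding them, makes it possible to report exactly what was wrong with an incoming file.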

While ETL is relatively straightforward, there are a LOT of ways to handle it and even MORE ways to build a data pipeline. I will walk you through the simplest way to handle it. This assumes you have “small data” that is suitable for batch processing.

Here is the problem we will be solving: sales data comes in to us every month as a CSV file. It is relatively clean, but we want to make sure there are no errors in it so that business stakeholders are not making decisions based on bad data. We need to ingest the data, clean it, and store it in a relational database. This will require designing the schema and building an automated pipeline that is triggered when the CSV arrives.
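To make "triggered when the CSV arrives" a little more concrete, here is one possible sketch: a function that looks in an inbox folder and hands any file it has not seen before to a handler. It is a stand-in for illustration only. The series plans to use luigi for this, and the folder layout, the file names and the `run_once` helper are assumptions invented for the example.

```python
import tempfile
from pathlib import Path


def find_new_files(inbox, processed):
    """Return CSV files in `inbox` whose names are not in `processed`, sorted by name."""
    return sorted(p for p in inbox.glob("*.csv") if p.name not in processed)


def run_once(inbox, processed, handler):
    """Run `handler` on each new file and remember it so it is not handled twice."""
    handled = []
    for path in find_new_files(inbox, processed):
        handler(path)
        processed.add(path.name)
        handled.append(path.name)
    return handled


# Demo with a temporary folder and two hypothetical monthly files.
with tempfile.TemporaryDirectory() as tmp:
    inbox = Path(tmp)
    (inbox / "sales_2019_01.csv").write_text("order_id,amount\n1,10.0\n")
    (inbox / "sales_2019_02.csv").write_text("order_id,amount\n2,20.0\n")

    processed = {"sales_2019_01.csv"}  # January was handled in an earlier run
    seen = []

    first_run = run_once(inbox, processed, lambda p: seen.append(p.name))
    second_run = run_once(inbox, processed, lambda p: seen.append(p.name))
```

In practice the set of processed file names would have to live somewhere durable, such as a table in the database, and `run_once` would be called on a schedule. Running it twice is harmless because the second run finds nothing new.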

Lessons will come out every 1-2 weeks. Get ready for some fun!
