Building a Data Pipeline in Python – Part 4 of N – Basic Reporting
This article is originally published at https://www.stoltzmaniac.com
Building a report that passes tests
At this point, we have seen what our data looks like, how it is stored, and what some basic tests might look like. In this post, we look at how to turn this into a report that aids the ETL process.
Many companies find themselves in positions where a CSV (or something similar) is delivered from outside of their organization. In this example, we assume it is placed into a folder called “new_data”. Our code picks up the file and compares it to what we expect in order to decide whether or not to move forward with the ETL process.
This Jupyter notebook could be run each time the file is updated and sent to stakeholders before the data is processed. It contains a very basic level of testing and visualization, but the idea should get you started. When it runs, tests confirm whether the data fits within certain constraints and passes some integrity tests. The data is then plotted, and a final output at the bottom shows which tests have passed or failed.
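The constraint checks and the pass/fail summary at the bottom of the notebook could be sketched as follows. The specific checks (`non_empty`, `sales_non_negative`) and column names are assumptions; the real tests would encode whatever constraints your stakeholders care about.

```python
# Hypothetical integrity checks on rows read from the CSV.
# Each check returns True (pass) or False (fail).
def run_checks(rows):
    """Run simple constraint checks and return a {check_name: passed} dict."""
    return {
        "non_empty": len(rows) > 0,
        "sales_non_negative": all(float(r["sales"]) >= 0 for r in rows),
    }

def print_summary(results):
    """Print the final pass/fail report shown at the bottom of the notebook."""
    for name, passed in results.items():
        print(f"{name}: {'PASS' if passed else 'FAIL'}")

rows = [
    {"date": "2021-01-01", "region": "west", "sales": "100"},
    {"date": "2021-01-02", "region": "east", "sales": "250"},
]
print_summary(run_checks(rows))
```

Keeping each check as a named boolean makes the final report trivial to render, and a single `all(results.values())` can then gate whether the ETL process continues.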
You’re always welcome to check out the repository on my GitHub.