Monitoring the Data Lake: Detecting Data Anomalies in ETL Pipelines
Why this training
The data in your data warehouse is mission critical. Reports built on it are used to make crucial company decisions, and your reputation depends on the accuracy of the data being reported. You need confidence in your data pipeline execution.
Your data pipelines are complex and change often. Minor changes to DAGs and tasks can introduce regressions. These may have unintended impacts on tables and data flow that are not discovered until much later, after the data has already been used in reports. Failures and DAG outages also occur. You need confidence that when failures are fixed, data is ‘flowing’ again and things are back to normal. Finally, complex ETL demands accuracy when data is mission critical. How do you know that your ETL is copying all rows accurately? That your joins are not dropping any data?
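As a rough illustration of the kind of check those last two questions call for, here is a minimal sketch that compares row counts between a source and a target table and counts rows an inner join would silently drop. It is not the course's own material: the table names (`src_orders`, `dst_orders`, `customers`), the key column `customer_id`, and the helper functions `row_count` and `unmatched_keys` are all hypothetical, and an in-memory SQLite database stands in for a real warehouse.

```python
import sqlite3


def row_count(conn, table):
    """Return the number of rows in a table.

    Table names cannot be bound as SQL parameters, so pass trusted names only.
    """
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


def unmatched_keys(conn, child, parent, key):
    """Count rows in `child` whose key has no match in `parent`.

    An inner join between the two tables would drop exactly these rows.
    """
    sql = (
        f"SELECT COUNT(*) FROM {child} c "
        f"LEFT JOIN {parent} p ON c.{key} = p.{key} "
        f"WHERE p.{key} IS NULL"
    )
    return conn.execute(sql).fetchone()[0]


# Hypothetical demo data: one order refers to a customer that does not exist.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (customer_id INTEGER);
    CREATE TABLE src_orders (order_id INTEGER, customer_id INTEGER);
    CREATE TABLE dst_orders (order_id INTEGER, customer_id INTEGER);
    INSERT INTO customers VALUES (1), (2);
    INSERT INTO src_orders VALUES (10, 1), (11, 2), (12, 3);
    -- The "ETL step": an inner join that loses the order for customer 3.
    INSERT INTO dst_orders
        SELECT o.order_id, o.customer_id
        FROM src_orders o
        JOIN customers cu ON o.customer_id = cu.customer_id;
    """
)

rows_lost = row_count(conn, "src_orders") - row_count(conn, "dst_orders")
orphans = unmatched_keys(conn, "src_orders", "customers", "customer_id")

if rows_lost or orphans:
    print(f"ALERT: {rows_lost} row(s) lost in copy, {orphans} unmatched key(s)")
```

In a real pipeline a check like this would typically run as its own task after the load step, so that a mismatch fails the run before the data reaches any report.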
This training gives you practical, real-world examples of tests that can be added to any data pipeline to provide you with the confidence that things are working as expected.
What will I learn?
The class will include hands-on activities and provide pseudo-code examples of tests that can be run against your tables and data models. You will learn about the different classes of tests, how to set them up, and the important metrics to monitor.
How will this help my company and me?
Accurate business decisions, confidence in your data quality, and no more guesswork or surprises. Sounds great, right? Once you’re freed from worrying and fighting fires, you can refocus on being creative with your data and have the confidence to make changes.