#013: How to Automate & Improve Data Quality

Oct 01, 2022

“If code runs and manual checks pass, then we’re good!”

For most of my 8 years on data teams, this was the extent of Data Quality.

But this leads to sloppy errors that kill user confidence.

 

Today, I want to share 3 ways you can automate & improve Data Quality by using:

  1. Continuous Integration (CI) workflows

  2. Linters

  3. Task Automations

     

CI workflows help you automate deployments, testing & docs.

The biggest confidence killer is broken logic or missing data.

And it can take months to regain that trust.

Instead, build workflows to validate changes before they reach production.

Soon stakeholders will be focused on new features, not bug fixes.

 

Example: Use GitHub Actions to deploy & test all Pull Request changes.
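To make this concrete, here is a minimal sketch of the kind of check a CI job could run against every Pull Request. It is an illustration, not a full workflow: the function name `find_quality_issues` and the columns `order_id` and `amount` are assumptions, and a real setup would read rows from your warehouse or a test fixture.

```python
# Hypothetical data quality check that a CI job could run on each Pull Request.
# The column names ("order_id", "amount") are assumptions for illustration.

def find_quality_issues(rows, key="order_id", required=("order_id", "amount")):
    """Return a list of human-readable problems found in rows (a list of dicts)."""
    issues = []
    seen_keys = set()
    for index, row in enumerate(rows):
        # Missing data: every required column must be present and not None.
        for column in required:
            if row.get(column) is None:
                issues.append(f"row {index}: missing {column}")
        # Broken logic often shows up as duplicated keys after a bad join.
        value = row.get(key)
        if value is not None:
            if value in seen_keys:
                issues.append(f"row {index}: duplicate {key}={value}")
            seen_keys.add(value)
    return issues


def exit_code(rows):
    """Return 1 if any issue is found, else 0, so a CI step can fail the build."""
    return 1 if find_quality_issues(rows) else 0
```

A CI step could pass the result of `exit_code(...)` to `sys.exit`, so a non-zero value blocks the merge until the data issue is fixed.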

 

Linters establish clear style rules for your code.

Everyone has their own take on the “right” way to code.

But this leads to petty arguments and wasted time.

Linters hard-code styling rules and auto-check that they’re being followed.

The result is more consistent and maintainable code.

 

Example: SQLFluff or Pylint
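If you have never looked inside a linter, the idea is simple: encode a rule once, then check every file against it. The toy sketch below is not how SQLFluff or Pylint work internally. It assumes two made-up house rules (upper case SQL keywords, no trailing whitespace) and ignores string literals and comments, which a real linter handles properly.

```python
import re

# Assumed house rules for illustration: a short list of keywords to upper-case.
KEYWORDS = ("select", "from", "where", "join", "group by", "order by")
_KEYWORD_RE = re.compile(r"\b(" + "|".join(KEYWORDS) + r")\b", re.IGNORECASE)


def lint_sql(sql):
    """Return a list of (line_number, message) tuples for each rule violation."""
    violations = []
    for line_number, line in enumerate(sql.splitlines(), start=1):
        # Rule 1: no trailing whitespace.
        if line != line.rstrip():
            violations.append((line_number, "trailing whitespace"))
        # Rule 2: SQL keywords must be upper case.
        for match in _KEYWORD_RE.finditer(line):
            word = match.group(0)
            if word != word.upper():
                violations.append(
                    (line_number, f"keyword '{word}' should be upper case")
                )
    return violations
```

Because the rules live in code, nobody has to argue about them in review. The check either passes or it lists the exact lines to fix.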

 

Task automations push, pull and update data on your behalf.

Fair or not, you’re expected to be aware of the full data platform.

But without the right systems, this is an impossible task.

Push notifications and task orchestrators are perfect for this.

Once in place, you’ll feel more in control and can quickly address issues.

 

Example: Slack notifications from Airflow
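Here is a rough sketch of what that could look like, assuming a Slack incoming webhook and Airflow's `on_failure_callback`, which calls your function with a context dictionary. The helper names `build_failure_message` and `notify_slack` are hypothetical, the webhook URL is something you would supply yourself, and the context keys used here (`task_instance`, `exception`) should be checked against your Airflow version.

```python
import json
import urllib.request


def build_failure_message(context):
    """Build a Slack webhook payload from an Airflow-style failure context."""
    task_instance = context["task_instance"]
    text = f":red_circle: Task failed: {task_instance.dag_id}.{task_instance.task_id}"
    error = context.get("exception")
    if error is not None:
        text += f"\nError: {error}"
    return {"text": text}


def notify_slack(context, webhook_url):
    """POST the failure message to a Slack incoming webhook (URL is supplied by you)."""
    body = json.dumps(build_failure_message(context)).encode("utf-8")
    request = urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

Since the callback receives only the context, one option is to wrap `notify_slack` with `functools.partial` to bind your webhook URL before passing it as `on_failure_callback`.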

 

In summary:

Better data quality = Happy stakeholders.

Happy stakeholders = Happy engineers.
