Testing Data Pipelines with dbt-expectations: A Beginner's Guide
How many times have you run into an error in your data pipeline, confused about where to even begin debugging it? You probably start by checking the core model, then the source, or maybe an intermediate model that sits between the source and the core model. In order to get to the root cause, you need to debug each individual step to home in on where things are going wrong.
Wouldn’t the debugging process be easier if we knew the exact transformation step where a problem manifested? Luckily, if you are a dbt user, there is an easy solution.
What is dbt-expectations?
dbt-expectations is a free open-source testing library for dbt that allows you to assert “expectations” about your data. If your data doesn’t meet these expectations, the test will raise an error. This package is essentially the dbt version of Great Expectations, a popular Python testing library.
If you aren’t familiar with Great Expectations, it is an open-source tool that focuses on validating, documenting, and profiling your data. It creates data documentation and data quality reports for your codebase, simplifying the lives of data engineers and software engineers. Because it is Python-based, you need to be actively using Python in your analytics environment.
dbt-expectations enables analytics engineers and data analysts who use SQL more than Python to take advantage of the same data quality testing as data engineers and software engineers. Rather than using Python, these tests are written in SQL and integrate directly into your dbt project. This way, you don’t have to refactor your entire codebase into a different language just to take advantage of a powerful data quality tool.
Why should you use dbt-expectations?
These tests allow you to assert powerful indicators of data quality without the huge lift of writing custom tests yourself. With dbt-expectations you can test sources, models, and seeds. This way you can check data quality at every layer of your data model rather than just checking the final product. This ensures your data stays consistent across different models as well as different runs of your data pipeline.
With dbt-expectations, you can test the following aspects of your data without needing to write any custom SQL:
- Row count
- Column count
- String patterns (regex)
- Anomalies
- Datatypes
You can also test for null and unique values with more complexity than the generic tests offered in every dbt project.
Adding dbt-expectations to your dbt project
To add dbt-expectations to your dbt project, add the following code to your packages.yml file:
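(The version range below is illustrative; check the package’s page on hub.getdbt.com for the latest release.)

```yaml
packages:
  - package: calogica/dbt_expectations
    # example version range; pin to whatever is current on the dbt package hub
    version: [">=0.10.0", "<0.11.0"]
```

Then run dbt deps to install the package.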
Make sure you also have the following variable defined in your dbt_project.yml file:
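```yaml
vars:
  # example timezone taken from the package's documentation
  'dbt_date:time_zone': 'America/Los_Angeles'
```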
Be sure to change this timezone to the timezone you are working in as this is used in a few of the date-related tests within the package.
Row Count Testing
There are a few different tests within dbt-expectations that look at the row counts of a table. One of my favorites is the expect_table_row_count_to_be_between test. This is good for checking if the correct volume of rows has been ingested into a table. I recommend using this alongside a freshness test to ensure data is not only up-to-date, but the amount of data being ingested also meets your expectations.
Before I implemented tests like this one, I often ran into scenarios where my data tables were considered “fresh”, as in data had been ingested within the last day. However, only one or two records were being ingested rather than the thousands that were expected. This test can help you catch situations like that, which would otherwise go unnoticed.
All dbt tests are added to a column or table in that source/model’s corresponding YAML file. You simply specify the name of the dbt-expectations test under a tests block. Make sure to check the documentation for whether a test should be added at the model or column level. expect_table_row_count_to_be_between is added at the model level since it looks at the number of rows in the entire table.
I would add this to a web sessions model like so:
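(The thresholds below are placeholders for illustration; tune them to your own table’s volume.)

```yaml
models:
  - name: web_sessions
    tests:
      - dbt_expectations.expect_table_row_count_to_be_between:
          # placeholder thresholds; set these based on your table's typical volume
          min_value: 1000
          max_value: 100000
```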
You may have to experiment with the minimum and maximum values for this test so that you don’t get too many or too few alerts. I recommend looking at the average volume in your table over its lifetime, as well as the maximum and minimum, and using those numbers to set the values within your test. Too many alerts can be just as bad as too few!
Column Count Testing
Similar to row count, dbt-expectations also has a test for checking the column count of a table. I like to use these column count tests to track changes in tables, alerting me when engineering teams may have added or removed a column from a key data source.
expect_table_column_count_to_equal is a test you can add to your table to track any changes in the number of columns. If a column is removed or added, you will be alerted because the column count no longer matches the value you set in your test.
If we look at the same web sessions model, we can add this test to ensure our column count stays at 15 columns.
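In the model’s YAML, that looks something like this:

```yaml
models:
  - name: web_sessions
    tests:
      - dbt_expectations.expect_table_column_count_to_equal:
          value: 15
```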
With the expect_column_to_exist test, you can also add a test to individual columns within a table to track the status of just that column. This way, you are alerted if that specific column is deleted. This is extremely helpful when a column is used downstream in another data model. If something in the downstream model fails, you can turn to this test to see if any referenced columns were deleted.
I would add this test like so to the session_id column in the web sessions model, which is a unique identifier of the model.
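In YAML, that looks like this:

```yaml
models:
  - name: web_sessions
    columns:
      - name: session_id
        tests:
          - dbt_expectations.expect_column_to_exist
```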
Now, I have another test I can use to my advantage when diagnosing any errors in my data pipeline. If this test passes, I can rule that column out and skip any investigation into it.
String Pattern (REGEX) Testing
dbt-expectations’ ability to test for string patterns helps maintain high data quality across many different fields. These tests support strong data governance and help ensure values entered by customers are accurate. I recommend implementing them on fields like email addresses, phone numbers, and social security numbers.
expect_column_values_to_match_regex allows you to assert a regex pattern that dbt then checks against the values of the column you define the test on. For example, if you want to validate that the values in a column are all proper email addresses, you can specify a regex pattern like this:
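(This is a deliberately simple pattern for illustration; real-world email regexes are usually stricter.)

```yaml
columns:
  - name: email
    tests:
      - dbt_expectations.expect_column_values_to_match_regex:
          # simplified illustrative pattern: at least one character
          # on each side of the @ sign
          regex: "^[^@]+@[^@]+$"
```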
This pattern ensures there is a string of characters before an @ sign as well as after.
If a value in the email column does not follow this regex pattern, dbt-expectations will raise an error.
You can also apply a similar test, expect_column_value_lengths_to_equal, to check the length of strings within a column. This is important for values like phone numbers, social security numbers, and credit card numbers that always have a set number of characters. Again, this can be used to ensure the values in the column follow the correct format.
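Applied to a hypothetical social_security_number column, it might look like this:

```yaml
columns:
  - name: social_security_number
    tests:
      - dbt_expectations.expect_column_value_lengths_to_equal:
          value: 9
```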
Here, dbt-expectations will raise an error whenever a value in this social security column does not have 9 characters.
Anomaly Detection Testing
One of the best features of dbt-expectations is the ability to check for anomalies in your data. This means that you can look for abnormal patterns that may hint at a greater underlying issue in your data. Anomaly tests are great for checking overall data volume or the volume of certain characteristics in a data column.
The two anomaly tests that I like to use most often are called expect_column_median_to_be_between and expect_column_mean_to_be_between. One of these calculates the median of the numerical values in a column and the other calculates the mean. Which one you use depends on how your data is distributed: the mean works well for normally distributed data, while the median is more robust for skewed data since it isn’t pulled by outliers. Doing a basic analysis of your data will help you determine which of these tests to use.
These tests are particularly helpful because you can apply aggregations to them. Unlike the other tests, there is an option to group your data by another column present in the dataset.
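Here’s a sketch on an order_size column (the surrounding model is hypothetical):

```yaml
columns:
  - name: order_size
    tests:
      - dbt_expectations.expect_column_mean_to_be_between:
          min_value: 0
          max_value: 5
          # compute the mean separately for each customer
          group_by: [customer_id]
```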
Here, you can see that we group by customer_id. This test will then group the dataset by customer_id and apply the test to the mean value of the order_size column. If this mean is below 0 or above 5, the test will fail.
Datatype Testing
Datatype testing helps ensure your column values are the datatype that you expect them to be. This is an ideal test to apply to source data so you know if a datatype change is likely to affect downstream data.
For example, if you expect a column to contain timestamp values and apply a timestamp-specific function to it downstream, the function will error out if the column is ingested as a date instead. Datatype testing allows you to catch this error early in your data pipeline, preventing downstream failures.
expect_column_values_to_be_of_type asserts your expectation for the datatype of a column value, erroring out if a value is NOT that same data type. You can apply this to any column like so:
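```yaml
columns:
  - name: order_size
    tests:
      - dbt_expectations.expect_column_values_to_be_of_type:
          # type names vary by warehouse (e.g. integer, INT64); use your adapter's
          column_type: integer
```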
If this order_size column weren’t an integer, then even applying a mean test like the one above would fail. You can’t calculate a mean on a column that isn’t numeric!
Next Steps in Using dbt-expectations
Now that you know the different things you can test for using dbt-expectations, it’s important that you assess what really needs to be tested. Look at the errors that have popped up the most in your data in the last few months. How can you add one of the tests discussed to a source or model in order to prevent an issue downstream?
Working backward like this will allow you to prioritize the tests that give you the most value for your time. It will help you be more proactive in the debugging and error handling process, catching data quality issues as close to the source as possible using dbt.