How to Write a High-Quality Data Model From Start to Finish Using dbt
Data modeling is a topic that only continues to gain more and more traction, especially in the world of analytics engineering. Data modeling and the popular data transformation tool dbt go hand in hand. While data modeling is a concept that’s been around for a very long time, dbt gave it a fresh update and redefined what a data model is.
I like to think of a data model as a set of transformations that take data from its raw form and turn it into something usable by business teams. Writing data models has become a primary responsibility of analytics engineers because of their ability to understand both technical concepts and business processes. dbt has only made the lives of analytics engineers easier by allowing them to write data models that are modular and whose code can be reused across different functions.
In this article, we will discuss the different layers of a dbt data model, best practices of writing data models, and how to properly test them using dbt.
How to organize source data
Every data model starts with source data, or data that is ingested directly from an external source into your data warehouse. Rather than use these raw data sources directly within your models, dbt helps you to create an extra layer between sources and models. This way a copy of your raw data always exists untouched within your data warehouse. However, before you can use a source in your models, you need to properly define it.
Defining a source
Sources in dbt are defined in YAML files sitting at the root of a source’s directory. It is a best practice to create a directory for each source within a “staging” directory that sits in the “models” directory. This way the documentation lives next to the corresponding staging models for that source.
Within a YAML file, you define your source using its database name and schema name, and then assign it a name to use downstream. This is what a source would look like in a source.yml file:
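A minimal sketch; the database and schema names here are placeholders, while google_sheets and customer_orders are the source and table names used throughout this article:

```yaml
# models/staging/google_sheets/src_google_sheets.yml
version: 2

sources:
  - name: google_sheets        # the name you will use downstream
    database: raw              # placeholder database name
    schema: google_sheets      # placeholder schema name
    tables:
      - name: customer_orders
```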
Every raw data source that you want to use in a dbt model needs to be defined in a YAML file like this. If it’s not defined, dbt won’t understand where to look when you reference that database and schema downstream.
Referencing a source
When writing your staging models in dbt, every staging model will reference a source like the one we defined above. Keep in mind that sources should only be referenced in staging models.
To select from a source, you reference it with the {{ source() }} function:
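For example, selecting everything from customer_orders:

```sql
select *
from {{ source('google_sheets', 'customer_orders') }}
```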
A source reference requires two pieces of information: a source name and a table name. If you look at the YAML file we defined, you can see that “google_sheets” refers to the source name, and “customer_orders” refers to the table name.
Documenting a source
Lastly, it’s important that you document all of the sources that you are using in your dbt models. This is also done in the YAML files where you define your sources, like in the one we wrote above. You can add a description to not only the source name but each table name and column name as well. I recommend doing this from the very beginning so that you don’t have to worry about coming back to it later, or even worse, forgetting about it entirely.
Thorough documentation will help other people on your team understand the code you are writing. It will allow them to contribute to the project as well while keeping questions and knowledge transfer minimal. In my experience, documentation is always worth the extra time upfront for the clarity it provides later down the line.
You can add documentation to your sources like so:
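A sketch building on the definition above; the column names and descriptions here are illustrative:

```yaml
version: 2

sources:
  - name: google_sheets
    description: Raw data loaded from spreadsheets maintained by the operations team.
    database: raw
    schema: google_sheets
    tables:
      - name: customer_orders
        description: One row per order placed by a customer.
        columns:
          - name: order_id
            description: Unique identifier for the order.
          - name: ordered_at
            description: Timestamp when the order was placed, stored in UTC.
```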
It’s important to document any quirks in your data in the table descriptions as well as the column descriptions. I like to think about the questions people will have ahead of time and write my documentation based on that.
Creating staging models
dbt uses something called staging models to preserve the integrity of your raw data sources. Staging models read from the raw data sources we defined and are then referenced in downstream data models. They exist not only to protect your raw data but also to help standardize it: the staging layer is where you apply company standards consistently across your various data sources.
For example, you may cast all of your date columns to a “date” datatype and name them to end in “_date” so that the person querying the data knows what to expect from that column.
Simple transformations like datatype casting and column renaming are the most common here. Your staging model should be almost identical to the source data it references, which means you don’t want to use any aggregates or window functions in these models. Keep it as simple as possible while still implementing company standards.
A staging model typically looks like this:
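A sketch, assuming a few illustrative column names in customer_orders:

```sql
-- models/staging/google_sheets/stg_google_sheets__customer_orders.sql

with source as (

    select * from {{ source('google_sheets', 'customer_orders') }}

),

renamed as (

    select
        order_id,
        customer_id,
        cast(ordered_at as date) as order_date,            -- enforce the "_date" standard
        cast(order_total as numeric(10, 2)) as order_total_amount
    from source

)

select * from renamed
```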
Note that it is reading from the source that we defined and only includes basic SQL functions that keep the integrity of the data.
dbt data modeling best practices
Once you’ve defined your sources and have written staging models for each of them, you are ready to build out your data models further. The next step in building a dbt model is to focus on the core model, or the model that produces a dataset that will be directly queried by data analysts.
These models contain more advanced SQL transformations like joins, aggregations, and window functions. Unlike in the staging layer, you can pretty much do whatever you want to produce the results that you need. However, there are some best practices you should be following in order to take advantage of the full power of dbt.
Models should always read from staging models.
We mentioned this in the first section, but it is a point worth repeating. All of your models that aren’t staging models should be reading data from staging models. They should never reference sources directly or select from hard-coded table names outside of dbt’s syntax. This helps maintain the integrity of your raw data and allows the relationships between different models in dbt to be tracked through a DAG.
If your model isn’t using a {{ source() }} or {{ ref() }} function then you are doing something wrong!
Models should always be written modularly.
The reason dbt is so great is that it allows you to write modular code. Instead of writing a model once and being done with it, you can refactor code you’ve already written to use in other data models. Any piece of code that you repeat across multiple data models should ideally become its own model. This way, it can be run once and referenced in multiple locations, saving computing resources in the form of time and money!
Intermediate models are the models that contain repeatable code. These are never directly queried by analysts or business users but rather help in producing modular core data models.
Models used by data analysts should be stored in a core or mart directory within the models directory.
All models in dbt are organized by source or by business purpose. Models that aren’t staging or intermediate models are called core (or mart) models. You should create a folder for these models under the “models” directory and create a subdirectory there for each business function. This will help you organize your models by the purpose they serve within the business.
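Putting this together, a project tree might look something like this (the directory and file names are illustrative):

```
models/
├── staging/
│   └── google_sheets/
│       ├── src_google_sheets.yml
│       └── stg_google_sheets__customer_orders.sql
├── intermediate/
│   └── int_orders_joined_customers.sql
└── marts/
    ├── finance/
    │   └── fct_orders.sql
    └── marketing/
        └── dim_customers.sql
```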
Remember that core models are the end models that are directly used by data analysts and business users. They are the datasets referenced within reports and dashboards whereas staging models and intermediate models are more so for the analytics engineer to use.
SQL best practices
dbt utilizes SQL to write its data models. It’s imperative that you follow SQL best practices when writing dbt models. This will keep your data models clean and concise, making them easy to follow and reuse.
Use CTEs or joins instead of subqueries
It is a dbt best practice to use CTEs or joins instead of subqueries. CTEs and joins make code more readable and allow you to easily follow the flow of data that is being produced. Subqueries can often be messy and hard to understand. dbt SQL best practices aim to simplify code and make it as easy to read as possible.
Replace a subquery like this one (the customers and orders tables in these examples are purely illustrative)…
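```sql
-- A hypothetical example: counting each customer's orders with a correlated subquery
select
    customers.customer_name,
    (
        select count(*)
        from orders
        where orders.customer_id = customers.customer_id
    ) as order_count
from customers
```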
…with CTEs and an explicit join, like this:
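```sql
-- The same hypothetical logic rewritten with CTEs and a join
-- (the staging model names passed to ref() are assumptions)
with customers as (

    select * from {{ ref('stg_customers') }}

),

orders as (

    select * from {{ ref('stg_orders') }}

),

customers_joined_orders as (

    select
        customers.customer_name,
        count(orders.order_id) as order_count
    from customers
    left join orders
        on customers.customer_id = orders.customer_id
    group by customers.customer_name

)

select * from customers_joined_orders
```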
Isn’t the join easier to read? Using SQL patterns that prioritize readability is key to writing high-quality dbt data models.
Use descriptive names for CTEs
Continuing with the idea of CTEs, it’s vital that you give them descriptive names. It makes it easier to follow when you know the exact purpose behind a CTE. I like to use verbs to describe mine. For example, if two tables are being joined in a CTE, I name the CTE <table_1>_joined_<table_2>. This gives the reader an idea of why that CTE was written in the first place.
This is especially important for more complex transformations where the purpose may not be entirely obvious. Descriptive names along with in-line comments will help communicate the purpose of that piece of code without requiring whoever is reading it to come right to you.
Use full name aliases rather than abbreviations
When joining two tables, it’s important to make it clear which columns come from which tables. The best way to do this is to spell out the full table name, or at least an alias close to the original table name, whenever you reference a column in a join. Is there anything more frustrating than reading through code where someone has assigned tables random single-letter aliases? It often takes more brain work to map the aliases back to the original tables than it would have taken to write out the full names.
Notice how I did this in the join example above: instead of using aliases, I spell out the full names of the tables.
Which would you prefer to read: that version, or the same hypothetical join rewritten with single-letter aliases?
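```sql
-- The same hypothetical join, but with single-letter aliases
select
    c.customer_name,
    count(o.order_id) as order_count
from customers c
left join orders o
    on c.customer_id = o.customer_id
group by c.customer_name
```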
The full table names are much easier to read and eliminate any mistakes that can be made when interpreting the code!
Properly testing your sources and data models
One of the benefits of writing your models in dbt is the built-in testing that comes along with it. dbt makes it easy to test your sources and models from the same YAML files where you document them. It includes built-in generic tests, which check basics like null and unique values, as well as the ability to write custom tests.
Testing your sources
When using dbt to test, it’s important that you always write tests for your sources first. This way you are testing your data to ensure it meets your expectations before it is used in any downstream data models. By testing at the source, you eliminate steps in the debugging process if something goes wrong.
Think about it: if you only test your models, you may assume the logic in the model is what is making them fail. However, if bad data is being fed into the model, the issue can be traced back much further. I’ve had many moments of frustration debugging at the model level when the issue was really with the source data.
With source data, you typically want to use dbt generic tests. These include:
- unique
- not_null
- accepted_values
- relationships
I always add the first two generic tests on all of my source columns so I can be alerted to unexpected null values or duplicates. If we revisit our source documentation, we can add the tests under the columns that we’ve already documented, like so:
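Continuing the hypothetical source definition from earlier:

```yaml
version: 2

sources:
  - name: google_sheets
    database: raw
    schema: google_sheets
    tables:
      - name: customer_orders
        columns:
          - name: order_id
            description: Unique identifier for the order.
            tests:
              - unique
              - not_null
          - name: ordered_at
            description: Timestamp when the order was placed, stored in UTC.
            tests:
              - not_null
```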
Make sure you run the right queries on your data to confirm that your expectations about the data are correct. You don’t want to add these tests just for the sake of adding them; they need to be accurate! Otherwise, you will get a ton of inaccurate alerts that will eventually lead to alert fatigue.
Testing your model
Just as with your sources, it is imperative to test your models within dbt. Adding tests to your models will help you understand when your transformations are failing because of the code itself. Failures could be due to incorrect logic or changes in upstream data models. The important thing is that you catch them for each individual model so that you don’t need to debug an entire DAG.
While I add the not_null and unique tests to my models, I tend to also utilize the accepted_values and relationships generic tests at this layer. These tests help make sure nothing unexpected slips through the cracks in your logic.
The accepted_values test allows you to assert a list of values that are expected in a column. If dbt sees a value not specified in this test, the test will fail.
We can add this test under an individual column like so:
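A sketch, assuming the model being tested is named customer_orders:

```yaml
version: 2

models:
  - name: customer_orders       # placeholder model name
    columns:
      - name: order_type_id
        tests:
          - accepted_values:
              values: [1, 2, 3]
              quote: false      # compare against numbers rather than strings
```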
Now, if there is a value other than 1, 2, or 3 for order_type_id, the test will fail and dbt will throw an error.
The relationships test allows you to define relationships between different models, or between a model and a source. This is great for ensuring your joins are working as expected and that no orphaned values slip in. It essentially checks that all of the values in the column you are testing also exist in the corresponding column of the model or source you are relating it to.
Like the others, you add this under a column and then specify the model or source the column is related to.
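For example, relating the order_id column of our hypothetical customer_orders model to the orders model:

```yaml
version: 2

models:
  - name: customer_orders       # placeholder model name
    columns:
      - name: order_id
        tests:
          - relationships:
              to: ref('orders')
              field: order_id
```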
Here, dbt will check to ensure every value in the order_id column in this model is also found in the order_id column in the orders model. If it finds a value in this column that is not in the orders table, the test will fail.
Conclusion
The power of using dbt to model your data lies in the ability to write modular code and have documentation as well as testing live next to the models. dbt allows you to define sources, write intermediate and core models using those sources, and then thoroughly document and test each piece. When building dbt data models, it is key that you follow best practices in not only how you build your project, but how you write your SQL code.
Following the best practices outlined in this article will help you minimize future tech debt, allowing your models to last for years to come. These practices will also ensure your models are easily understood by others on your team and outlive the person who built them. dbt makes collaboration between team members easy by encouraging good habits like documentation and comments in your code.
Remember to keep all of these tips in mind as you build out your first dbt model! Take it one step at a time, doing all of your documentation and testing as you go.