How to Write a High-Quality Data Model From Start to Finish Using dbt
Data modeling is a topic that only continues to gain more and more traction, especially in the world of analytics engineering. Data modeling and the popular data transformation tool dbt go hand in hand. While data modeling is a concept that’s been around for a very long time, dbt gave it a fresh update and redefined what a data model is.
I like to think of a data model as a set of transformations that take data from its raw form and turn it into something usable by business teams. Writing data models has become a primary responsibility of analytics engineers because of their ability to understand both technical concepts and business processes. dbt has only made the lives of analytics engineers easier by allowing them to write data models that are modular and whose code can be reused across different functions.
In this article, we will discuss the different layers of a dbt data model, best practices of writing data models, and how to properly test them using dbt.
How to organize source data
Every data model starts with source data, or data that is ingested directly from an external source into your data warehouse. Rather than use these raw data sources directly within your models, dbt helps you to create an extra layer between sources and models. This way a copy of your raw data always exists untouched within your data warehouse. However, before you can use a source in your models, you need to properly define it.
Defining a source
Sources in dbt are defined in YAML files sitting at the root of a source’s directory. It is a best practice to create a directory for each source within a “staging” directory that sits in the “models” directory. This way the documentation lives next to the corresponding staging models for that source.
Within a YAML file, you define your source using its database name and schema name, and then assign it a name to use downstream. This is what a source would look like in a source.yml file:
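A minimal sketch; the database and schema names here are placeholders, while google_sheets and customer_orders are the source and table names used throughout this article:

```yaml
# models/staging/google_sheets/src_google_sheets.yml
version: 2

sources:
  - name: google_sheets        # the name you will use downstream
    database: raw              # placeholder database name
    schema: google_sheets      # placeholder schema name
    tables:
      - name: customer_orders
```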
Every raw data source that you want to use in a dbt model needs to be defined in a YAML file like this. If it’s not defined, dbt won’t understand where to look when you reference that database and schema downstream.
Referencing a source
When writing your staging models in dbt, every staging model will reference a source like the one we defined above. Keep in mind that sources should only be referenced in staging models.
To select from a source, you reference it with the {{ source() }} function:
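For example, selecting everything from customer_orders:

```sql
select *
from {{ source('google_sheets', 'customer_orders') }}
```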
A source reference requires two pieces of information: a source name and a table name. If you look at the YAML file we defined, you can see that “google_sheets” refers to the source name, and “customer_orders” refers to the table name.
Documenting a source
Lastly, it’s important that you document all of the sources that you are using in your dbt models. This is also done in the YAML files where you define your sources, like in the one we wrote above. You can add a description to not only the source name but each table name and column name as well. I recommend doing this from the very beginning so that you don’t have to worry about coming back to it later, or even worse, forgetting about it entirely.
Thorough documentation will help other people on your team understand the code you are writing. It will allow them to contribute to the project as well while keeping questions and knowledge transfer minimal. In my experience, documentation is always worth the extra time upfront for the clarity it provides later down the line.
You can add documentation to your sources like so:
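A sketch building on the definition above; the column names and descriptions here are illustrative:

```yaml
version: 2

sources:
  - name: google_sheets
    description: Raw data loaded from spreadsheets maintained by the operations team.
    database: raw
    schema: google_sheets
    tables:
      - name: customer_orders
        description: One row per order placed by a customer.
        columns:
          - name: order_id
            description: Unique identifier for the order.
          - name: ordered_at
            description: Timestamp when the order was placed, stored in UTC.
```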
It’s important to document any quirks in your data in the table descriptions as well as the column descriptions. I like to think about the questions people will have ahead of time and write my documentation based on that.
Creating staging models
dbt uses something called staging models to preserve the integrity of your raw data sources. Staging models read from the raw data sources we defined and are then referenced in downstream data models. They exist not only to protect your raw data but also to help standardize it: the staging layer is where you apply company standards consistently across your various data sources.
For example, you may cast all of your date columns to a “date” datatype and name them to end in “_date” so that the person querying the data knows what to expect from that column.
Simple transformations like datatype casting and column renaming are the most common here. Your staging model should be almost identical to the source data it references, which means you don’t want to use any aggregates or window functions in these models. Keep it as simple as possible while still implementing company standards.
A staging model typically looks like this:
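A sketch, assuming a few illustrative column names in customer_orders:

```sql
-- models/staging/google_sheets/stg_google_sheets__customer_orders.sql

with source as (

    select * from {{ source('google_sheets', 'customer_orders') }}

),

renamed as (

    select
        order_id,
        customer_id,
        cast(ordered_at as date) as order_date,            -- enforce the "_date" standard
        cast(order_total as numeric(10, 2)) as order_total_amount
    from source

)

select * from renamed
```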
Note that it is reading from the source that we defined and only includes basic SQL functions that keep the integrity of the data.
dbt data modeling best practices
Once you’ve defined your sources and have written staging models for each of them, you are ready to build out your data models further. The next step in building a dbt model is to focus on the core model, or the model that produces a dataset that will be directly queried by data analysts.
These models contain more advanced SQL transformations like joins, aggregations, and window functions. Unlike in the staging layer, you can pretty much do whatever you want to produce the results that you need. However, there are some best practices you should be following in order to take advantage of the full power of dbt.
Models should always read from staging models.
We mentioned this in the first section, but it is a point worth repeating. All of your models that aren’t staging models should be reading data from staging models. They should never reference sources directly or select from hard-coded table names outside of dbt’s syntax. This helps maintain the integrity of your raw data and allows the relationships between different models in dbt to be tracked through a DAG.
If your model isn’t using a {{ source() }} or {{ ref() }} function then you are doing something wrong!
Models should always be written modularly.
The reason dbt is so great is that it allows you to write modular code. Instead of writing a model once and being done with it, you can refactor code you’ve already written to use in other data models. Any piece of code that you repeat across multiple data models should ideally become its own model. This way, it can be run once and referenced in multiple locations, saving computing resources in the form of time and money!
Intermediate models are the models that contain repeatable code. These are never directly queried by analysts or business users but rather help in producing modular core data models.
Models used by data analysts should be stored in a core or mart directory within the models directory.
All models in dbt are organized by source or by business purpose. Models that aren’t staging or intermediate models are called core (or mart) models. You should create a folder for these models under the “models” directory and create a subdirectory there for each business function. This will help you organize your models by the purpose they serve within the business.
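Putting this together, a project tree might look something like this (the directory and file names are illustrative):

```
models/
├── staging/
│   └── google_sheets/
│       ├── src_google_sheets.yml
│       └── stg_google_sheets__customer_orders.sql
├── intermediate/
│   └── int_orders_joined_customers.sql
└── marts/
    ├── finance/
    │   └── fct_orders.sql
    └── marketing/
        └── dim_customers.sql
```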
Remember that core models are the end models that are directly used by data analysts and business users. They are the datasets referenced within reports and dashboards whereas staging models and intermediate models are more so for the analytics engineer to use.
SQL best practices
dbt utilizes SQL to write its data models. It’s imperative that you follow SQL best practices when writing dbt models. This will keep your data models clean and concise, making them easy to follow and reuse.
Use CTEs or joins instead of subqueries
It is a dbt best practice to use CTEs or joins instead of subqueries. CTEs and joins make code more readable and allow you to easily follow the flow of data that is being produced. Subqueries can often be messy and hard to understand. dbt SQL best practices aim to simplify code and make it as easy to read as possible.
Replace a subquery like this one (the customers and orders tables in these examples are purely illustrative)…
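```sql
-- A hypothetical example: counting each customer's orders with a correlated subquery
select
    customers.customer_name,
    (
        select count(*)
        from orders
        where orders.customer_id = customers.customer_id
    ) as order_count
from customers
```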
…with CTEs and an explicit join, like this:
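```sql
-- The same hypothetical logic rewritten with CTEs and a join
-- (the staging model names passed to ref() are assumptions)
with customers as (

    select * from {{ ref('stg_customers') }}

),

orders as (

    select * from {{ ref('stg_orders') }}

),

customers_joined_orders as (

    select
        customers.customer_name,
        count(orders.order_id) as order_count
    from customers
    left join orders
        on customers.customer_id = orders.customer_id
    group by customers.customer_name

)

select * from customers_joined_orders
```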
Isn’t the join easier to read? Using SQL patterns that prioritize readability is key to writing high-quality dbt data models.
Use descriptive names for CTEs
Continuing with the idea of CTEs, it’s vital that you give them descriptive names. It makes it easier to follow when you know the exact purpose behind a CTE. I like to use verbs to describe mine. For example, if two tables are being joined in a CTE, I name the CTE <table_1>_joined_<table_2>. This gives the reader an idea of why that CTE was written in the first place.
This is especially important for more complex transformations where the purpose may not be entirely obvious. Descriptive names along with in-line comments will help communicate the purpose of that piece of code without requiring whoever is reading it to come right to you.
Use full name aliases rather than abbreviations
When joining two tables, it’s important to make it clear which columns come from which tables. The best way to do this is to spell out the full table name, or at least an alias close to the original table name, whenever you reference a column in a join. Is there anything more frustrating than reading through code where someone has assigned tables random single-letter aliases? It often takes more brain work to map the aliases back to the original tables than it would have taken to write out the full names.
Notice how I did this in the join example above: instead of using aliases, I spell out the full names of the tables.
Which would you prefer to read: that version, or the same hypothetical join rewritten with single-letter aliases?
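```sql
-- The same hypothetical join, but with single-letter aliases
select
    c.customer_name,
    count(o.order_id) as order_count
from customers c
left join orders o
    on c.customer_id = o.customer_id
group by c.customer_name
```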
The full table names are much easier to read and eliminate any mistakes that can be made when interpreting the code!
Properly testing your sources and data models
One of the benefits of writing your models in dbt is the built-in testing that comes along with it. dbt makes it easy to test your sources and models from the same YAML files where you document them. It includes built-in generic tests, which check basics like null and unique values, as well as the ability to write custom tests.
Testing your sources
When using dbt to test, it’s important that you always write tests for your sources first. This way you are testing your data to ensure it meets your expectations before it is used in any downstream data models. By testing at the source, you eliminate steps in the debugging process if something goes wrong.
Think about it: if you only test your models, you may assume the logic in the model is what is making them fail. However, if bad data is being fed into the model, the issue can be traced back much further. I’ve had many moments of frustration debugging at the model level when the issue was really with the source data.
With source data, you typically want to use dbt generic tests. These include:
- unique
- not_null
- accepted_values
- relationships
I always add the first two generic tests on all of my source columns so I can be alerted to unexpected null values or duplicates. If we revisit our source documentation, we can add the tests under the columns that we’ve already documented, like so:
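Continuing the hypothetical source definition from earlier:

```yaml
version: 2

sources:
  - name: google_sheets
    database: raw
    schema: google_sheets
    tables:
      - name: customer_orders
        columns:
          - name: order_id
            description: Unique identifier for the order.
            tests:
              - unique
              - not_null
          - name: ordered_at
            description: Timestamp when the order was placed, stored in UTC.
            tests:
              - not_null
```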
Make sure you run the right queries on your data to confirm that your expectations about the data are correct. You don’t want to add these tests just for the sake of adding them; they need to be accurate! Otherwise, you will get a ton of inaccurate alerts that will eventually lead to alert fatigue.
Testing your model
Just as with your sources, it is imperative to test your models within dbt. Adding tests to your models will help you understand when your transformations are failing because of the code itself. Failures could be due to incorrect logic or changes in upstream data models. The important thing is that you catch them for each individual model so that you don’t need to debug an entire DAG.
While I add the not_null and unique tests to my models, I tend to also utilize the accepted_values and relationships generic tests at this layer. These tests help make sure nothing unexpected slips through the cracks in your logic.
The accepted_values test allows you to assert a list of values that are expected in a column. If dbt sees a value not specified in this test, the test will fail.
We can add this test under an individual column like so:
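A sketch, assuming the model being tested is named customer_orders:

```yaml
version: 2

models:
  - name: customer_orders       # placeholder model name
    columns:
      - name: order_type_id
        tests:
          - accepted_values:
              values: [1, 2, 3]
              quote: false      # compare against numbers rather than strings
```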
Now, if there is a value other than 1, 2, or 3 for order_type_id, the test will fail and dbt will throw an error.
The relationships test allows you to define relationships between different models, or between a model and a source. This is great for ensuring your joins are working as expected and that no orphaned values slip in. It essentially checks that all of the values in the column you are testing also exist in the corresponding column of the model or source you are relating it to.
Like the others, you add this under a column and then specify the model or source the column is related to.
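For example, relating the order_id column of our hypothetical customer_orders model to the orders model:

```yaml
version: 2

models:
  - name: customer_orders       # placeholder model name
    columns:
      - name: order_id
        tests:
          - relationships:
              to: ref('orders')
              field: order_id
```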
Here, dbt will check to ensure every value in the order_id column in this model is also found in the order_id column in the orders model. If it finds a value in this column that is not in the orders table, the test will fail.
Conclusion
The power of using dbt to model your data lies in the ability to write modular code and have documentation as well as testing live next to the models. dbt allows you to define sources, write intermediate and core models using those sources, and then thoroughly document and test each piece. When building dbt data models, it is key that you follow best practices in not only how you build your project, but how you write your SQL code.
Following the best practices outlined in this article will help you minimize future tech debt, allowing your models to last for years to come. These practices will also ensure your models are easily understood by others on your team and outlive the person who built them. dbt makes collaboration between team members easy by encouraging good habits like documentation and comments in your code.
Remember to keep all of these tips in mind as you build out your first dbt model! Take it one step at a time, doing all of your documentation and testing as you go.