Best practices for data modeling with SQL and dbt

Imagine being tasked with rewriting a bunch of core SQL data models only to find that each model was taking over 24 hours to update, had no comments in the code, used incorrect joins, contained duplicated data, and wasn’t modular, making it impossible to debug. This is every analytics engineer’s worst nightmare (although also quite a fun puzzle). And it was the problem I was tasked with solving when I first became an analytics engineer.

Data modeling is the process of organizing your SQL code to make the data in your databases and warehouses usable. You shouldn’t have to spend hours staring at SQL code, hoping it will all make sense. Luckily, dbt is a data transformation tool that helps you write better SQL data models. It allows you to write modular code, ensuring SQL code isn’t unnecessarily repeated across multiple models. This in turn helps your models run faster and allows you to debug one piece of code at a time rather than the whole model. dbt also offers built-in functionality like sources, macros, and packages that makes it easy to follow best practices and automate code.

Although dbt makes it way easier to write efficient data models, you still need to write good SQL code. Writing clear, concise SQL is a foundational skill of any analytics engineer. No tool will take that away. In this article, I will address the top SQL mistakes to avoid and how to use dbt to improve your data model functionality. 

Pairing well-written SQL code with the use of dbt for data modeling in SQL will set you up for success. Your data models are guaranteed to be fast, efficient, readable, and easy to debug. Here are some things to keep in mind.

Always create a base model to reference your raw data

This is a best practice I first learned about in dbt’s documentation. dbt describes the use of three different types of SQL data models to help keep your environment clean: base, intermediate (or staging), and core data models. A base model is a view that sits directly on top of a raw data table. It references the raw table and applies only light transformations, such as casting dates and renaming columns.

Using base models prevents you from ever directly touching or manipulating your raw data. In the case that something goes wrong, you always want to have a copy of your raw data to restore your broken data models from.

Base models read raw data using the {{ source() }} function within dbt. The {{ source() }} function is only used when you are referencing a raw table that already exists in your warehouse, while {{ ref() }} is reserved for referencing other dbt models. In my case, Airbyte ingests data into the RAW database within my data warehouse. I then use a {{ source() }} function in my base models to read from these ingested tables. Base models are the only models that should touch these raw data sources, and therefore the only ones using the source function.

Keep in mind that your source and table names need to be defined in a src.yml file within your dbt project. These yml files are your direct connection to the data in your warehouse. The source name you reference should match the source defined in the corresponding yml file, and your table name should match the exact name of the table in your raw database.
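As a minimal sketch, the src.yml for this example might look something like the following (the raw database name comes from my setup; adjust it to match wherever your ingestion tool lands the data):


version: 2

sources:
  - name: facebook
    database: raw
    tables:
      - name: basic_ad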

Looking at this yml file, we can see the source is facebook and the table name is basic_ad. In order to properly reference this raw data table, our source function would look like this:
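
{{ source('facebook', 'basic_ad') }}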

Now, let’s see what this would look like when we combine all of our base model code into the appropriate SQL file.


SELECT
  ad_id AS facebook_ad_id,
  account_id,
  ad_name AS ad_name_1,
  adset_name,
  month(date) AS month_created_at,
  date::timestamp_ntz AS created_at,
  spend
FROM {{ source('facebook', 'basic_ad') }}

Remember, you are only selecting the columns you want to include in your base model, along with any casting or renaming. Be sure to explicitly write out the column names rather than doing a select * in order to avoid issues when the underlying schema changes. I had a few data models that used select *, and they eventually broke and had to be fixed.

To learn more about the do’s and don’ts of base models, check out best practices for your dbt style guide.

Use the correct join and minimize duplicates at the source

This was a tip I learned from one of my previous managers. He noticed that I was using select distinct at the end of my CTEs in order to eliminate duplicate values. He told me that if you have to do that, there’s probably something wrong in your code. Of course, there are situations where distinct is necessary, but when joining two tables, duplicates most likely mean you used the wrong join.

Using the correct join in different scenarios is key to making your code run faster. When duplicates aren’t created, your code is computing fewer values, so the join is more efficient. Distinct also adds compute overhead because of the deduplication logic that has to run under the hood. Let’s review the different joins so you know which one to use in various situations.

Left join: This is the most common join to use because it follows best practices in most cases. When you use a left join, you are selecting all of the rows from the first table and only the rows from the second table that match the first.

Right join: I don’t recommend using a right join. It is always best practice to use a left join instead. They essentially work the same, except now you are selecting all of the rows from the second table and only the matching rows from the first. Reconfigure your code to use a left join instead (I’ll show a quick sketch of this rewrite after these join descriptions).

Inner join: Inner joins only select rows that are found in both the first and second tables. You can think of them as the middle overlapping section of a Venn diagram.

Full outer join: This type of join returns everything from both tables. Chances are this will result in a lot of null values wherever a row in one table has no match in the other. I rarely see this used.
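Here is that right-join-to-left-join rewrite as a quick sketch, using hypothetical users and subscriptions tables (the table and column names are only for illustration):


-- right join: keeps every row from subscriptions, only the matching rows from users
SELECT
  users.user_id,
  subscriptions.subscription_id
FROM users
RIGHT JOIN subscriptions
  ON users.user_id = subscriptions.user_id

-- equivalent left join: put the table you want to keep everything from first
SELECT
  users.user_id,
  subscriptions.subscription_id
FROM subscriptions
LEFT JOIN users
  ON users.user_id = subscriptions.user_id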

Removing duplicates with SQL

Before using any of these joins, it’s important to consider duplicates already present in each table. If there are duplicates already in the tables you are joining, you can expect many more duplicates. Now, this could be what you want, but if it’s not, be sure to get rid of the duplicates before joining the two tables. Depending on your data, you can remove duplicates a few different ways. This could involve using a window function to rank rows or filtering out rows with certain values. 

I’ve personally used both, depending on the data that I’m dealing with. I typically use a window function when I want the most recent value for a certain primary key. But, if I can eliminate a duplicate using a basic filter by some column value, that is usually preferred. 

For example, if I am creating a users data model in SQL and want a column for a user’s subscription_id, it’s possible that a user has multiple subscriptions, ones that have been canceled and ones that are active. First, I would want to filter out all of the canceled subscriptions, then I would find the most recent subscription by sorting them by date and partitioning them by user_id. The code would look something like this:


SELECT
  user_id,
  subscription_id,
  ROW_NUMBER() OVER(PARTITION BY user_id ORDER BY date_created DESC) AS subscription_number
FROM user_subscriptions

Then, I would simply filter the query by subscription_number = 1 when using it in the next query. This is a great way to rank values so that you get the most recent one for the primary key of your data model.
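Putting both steps together, the next query might look something like this. It’s just a sketch; the status column I use to filter out canceled subscriptions is hypothetical, so swap in whatever field marks cancellations in your data:


WITH ranked_subscriptions AS (
  SELECT
    user_id,
    subscription_id,
    ROW_NUMBER() OVER(PARTITION BY user_id ORDER BY date_created DESC) AS subscription_number
  FROM user_subscriptions
  -- filter out canceled subscriptions before ranking (hypothetical status column)
  WHERE status != 'canceled'
)

SELECT
  user_id,
  subscription_id
FROM ranked_subscriptions
-- keep only the most recent subscription for each user
WHERE subscription_number = 1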

Use CTEs instead of subqueries

It’s a common misconception that subqueries are superior to CTEs, especially in technical interviews. Interviewees often think subqueries show deeper SQL knowledge and that the complexity is preferred. However, this is far from the truth. When you work on a team, you always want to ensure your code is as simple and readable as possible. CTEs make code easier to read and break down into smaller steps. Subqueries add more of a headache when it comes to debugging, revising code, or reviewing a team member’s data models.

CTE stands for common table expression. A CTE is a named query that you can chain with other queries, each one reading from the previous one, to build up your final result set. A CTE looks like this:


WITH fb_spend_unioned_google_spend AS (
  SELECT spend_date, spend, 'facebook' AS ad_platform FROM {{ ref('stg_facebook_ads') }}
  UNION ALL
  SELECT spend_date, spend, 'google' AS ad_platform FROM {{ ref('stg_google_ads') }}
),

spend_summed AS (
  SELECT
    month(spend_date) AS spend_month,
    year(spend_date) AS spend_year,
    spend_date,
    ad_platform,
    sum(spend) AS spend
  FROM fb_spend_unioned_google_spend
  WHERE spend != 0
  GROUP BY
    month(spend_date),
    year(spend_date),
    spend_date,
    ad_platform
)

SELECT * FROM spend_summed

When you write a subquery, your code is still executed as one query, but it reads from nested select statements rather than named queries. A subquery looks like this:


SELECT
  month(spend_date) AS spend_month,
  year(spend_date) AS spend_year,
  spend_date,
  ad_platform,
  sum(spend) AS spend
FROM (
  SELECT spend_date, spend, 'facebook' AS ad_platform FROM {{ ref('stg_facebook_ads') }}
  UNION ALL
  SELECT spend_date, spend, 'google' AS ad_platform FROM {{ ref('stg_google_ads') }}
) AS ad_spend -- alias for the derived table
WHERE spend != 0
GROUP BY
  month(spend_date),
  year(spend_date),
  spend_date,
  ad_platform

Which is easier to read? The first takes only a quick glance to understand what it’s doing. The second makes you stretch your brain a bit to complete the puzzle. Data models written in SQL are meant to be simple and readable. Complex code is not superior; easy-to-understand code is. Why make something more complex just for the sake of it?

dbt helps you create modular code and actually makes using CTEs easier. When you are writing all of your code within the same SQL file, it can be tempting to write it in as few queries as possible. However, since dbt allows you to split one model’s code across multiple SQL files, you can follow best practices by writing CTEs and still keep your code modular. The separate files allow you to distinguish the important pieces of a model while still using CTEs instead of subqueries.
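For instance, the union from the CTE example above could live in its own staging model and then be referenced downstream with {{ ref() }}. Here is a rough sketch, using a hypothetical file name stg_ad_spend_unioned.sql:


-- stg_ad_spend_unioned.sql (hypothetical staging model)
SELECT spend_date, spend, 'facebook' AS ad_platform FROM {{ ref('stg_facebook_ads') }}
UNION ALL
SELECT spend_date, spend, 'google' AS ad_platform FROM {{ ref('stg_google_ads') }}

-- downstream model: reference the staging model instead of repeating the union
SELECT
  month(spend_date) AS spend_month,
  year(spend_date) AS spend_year,
  spend_date,
  ad_platform,
  sum(spend) AS spend
FROM {{ ref('stg_ad_spend_unioned') }}
WHERE spend != 0
GROUP BY
  month(spend_date),
  year(spend_date),
  spend_date,
  ad_platform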

Create dbt macros for repeatable SQL logic

Lastly, dbt’s macros make it easy to write repeatable, complex SQL logic without cluttering up your data models. dbt macros act as “functions”, using SQL and Jinja to write pieces of code that can then be applied in multiple data models. I’ve used macros to specify schema destinations, clean column values, and drop old tables. They make it so you don’t need to do these complex operations in every data model. Instead, you just call the macro. 

For example, I recently wrote a macro to clean up messy column values, especially ones with capital letters and special characters. The code is stored in its own file within the macros folder of my dbt project. It looks like this:


{% macro slugify(column_name) %}
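  {# lowercase the value, replace spaces and hyphens with underscores, then strip any remaining special characters #}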

  REGEXP_REPLACE(REGEXP_REPLACE(LOWER({{ column_name }}), '[ -]+', '_'), '[^a-z0-9_]+', '') AS {{ column_name }}

{% endmacro %}

Whenever I want to use it in a data model I call the macro and pass in the correct parameters. Here’s an example of how I call it to format the question and label columns in my typeform model.


SELECT
  response_id,
  form_name,
  form_id,
  {{ slugify('question') }},
  question_response,
  {{ slugify('label') }}
FROM form_questions

dbt macros make my SQL data models a lot cleaner and allow me to reuse code I wrote once across multiple models.

Conclusion

When your SQL data models are written using best practices, you don’t have to worry about them building up technical debt. Incorrectly written SQL data models will have to be rewritten over and over again, only to run into the same problems. When you write clean, concise SQL code in combination with dbt, you set yourself up for fast, dependable, easy-to-read data models. To review, keep the following things in mind when writing your data models with SQL and dbt:

  • Always create base models to reference your raw data tables.
  • Use the correct join and minimize duplicates at the source. 
  • Choose CTEs over subqueries. 
  • When you have repeatable code that can act as a function, turn it into a dbt macro.

Remember to follow these practices from the very beginning. “Quick and scrappy” data models often end in way more work down the line. What is quick and easy now will cost you a ton of money and resources later. Take your time to build your data models right and your business will be able to scale successfully.
