
Best practices for data modeling with SQL and dbt

Madison Schott
August 23, 2022
8 min read

Use the correct join and minimize duplicates at the source

This was a tip I learned from one of my previous managers. He noticed that I was using select distinct at the end of my CTEs in order to eliminate duplicate values. He told me that if you have to do that, there’s probably something wrong in your code. Of course, there are situations where distinct is necessary, but when you see duplicates after joining two tables, it’s most likely that you used the wrong join.

Using the correct join in different scenarios is key to making your code run faster. When duplicates aren’t created, your code computes fewer values and the join is more efficient. Distinct also slows queries down because of the deduplication logic that needs to be applied under the hood. Let’s review the different joins so you know which one to use in various situations.

Left join: This is the most common join to use because it follows best practices in most cases. When you use a left join, you select all of the rows from the first table and only the rows from the second table that match them. 

Right join: I don’t recommend using a right join. It is always best practice to use a left join instead. They essentially work the same except now you are selecting all of the values from the second table and only the matching values from the first. Reconfigure your code to use a left join instead. 

Inner join: Inner joins only select values that are found in both the first and second tables. You can think of them as the middle overlapping section of a Venn diagram. 

Full outer join: This type of join returns everything from both tables. Chances are this will result in a lot of null values wherever one table’s rows have no match in the other. I rarely see this used. 
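To make the right-join advice concrete, here is a quick sketch (the users and orders tables and their columns are hypothetical) showing how any right join can be rewritten as a left join by swapping the table order:

```sql
-- Instead of a right join...
SELECT orders.order_id, users.email
FROM orders
RIGHT JOIN users ON users.user_id = orders.user_id;

-- ...swap the table order and use a left join:
SELECT orders.order_id, users.email
FROM users
LEFT JOIN orders ON orders.user_id = users.user_id;
```

Both queries return every user along with their matching orders (and nulls for users with no orders); the left join version simply reads top-down, which is why it’s the best practice.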

Removing duplicates with SQL

Before using any of these joins, it’s important to consider duplicates already present in each table. If there are duplicates already in the tables you are joining, you can expect many more duplicates after the join. Now, this could be what you want, but if it’s not, be sure to get rid of the duplicates before joining the two tables. Depending on your data, you can remove duplicates in a few different ways. This could involve using a window function to rank rows or filtering out rows with certain values. 

I’ve personally used both, depending on the data that I’m dealing with. I typically use a window function when I want the most recent value for a certain primary key. But, if I can eliminate a duplicate using a basic filter by some column value, that is usually preferred. 

For example, if I am creating a users data model in SQL and want a column for a user’s subscription_id, it’s possible that a user has multiple subscriptions, ones that have been canceled and ones that are active. First, I would want to filter out all of the canceled subscriptions, then I would find the most recent subscription by sorting them by date and partitioning them by user_id. The code would look something like this:


SELECT 
	user_id,
	subscription_id,
	ROW_NUMBER() OVER(PARTITION BY user_id ORDER BY date_created DESC) AS subscription_number 
FROM user_subscriptions

Then, I would simply filter on subscription_number = 1 when using it in the next query. This is a great way to rank values so that you get the most recent one for the primary key of your data model. 
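Putting those two steps together, the full pattern could look something like this (a sketch that assumes the same user_subscriptions table as above, plus a hypothetical status column for filtering out canceled subscriptions):

```sql
WITH ranked_subscriptions AS (
  SELECT
    user_id,
    subscription_id,
    -- rank each user's subscriptions, most recent first
    ROW_NUMBER() OVER(PARTITION BY user_id ORDER BY date_created DESC) AS subscription_number
  FROM user_subscriptions
  WHERE status != 'canceled'  -- hypothetical column: drop canceled subscriptions first
)

-- keep only the most recent active subscription per user
SELECT user_id, subscription_id
FROM ranked_subscriptions
WHERE subscription_number = 1
```

The result has exactly one row per user_id, so it can safely be left-joined onto a users model without creating duplicates.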


Use CTEs instead of subqueries

It’s a common misconception that subqueries are superior to CTEs, especially in technical interviews. Interviewees typically think subqueries show deeper SQL knowledge and that the complexity is preferred. However, this is far from the truth. When you work on a team, you always want to ensure your code is as simple and readable as possible. CTEs make code easier to read and break down into smaller steps. Subqueries add more of a headache when it comes to debugging, revising code, or reviewing a team member’s data models. 

CTE stands for common table expression. It is when you string together multiple queries, each reading from the previous one, in order to produce a final data set. A CTE looks like this:


WITH fb_spend_unioned_google_spend AS (
  SELECT spend_date, spend, 'facebook' AS ad_platform FROM {{ ref('stg_facebook_ads') }}
  UNION ALL
  SELECT spend_date, spend, 'google' AS ad_platform FROM {{ ref('stg_google_ads') }}
),

spend_summed AS (
  SELECT
    month(spend_date) AS spend_month,
    year(spend_date) AS spend_year,
    spend_date,
    ad_platform,
    sum(spend) AS spend
  FROM fb_spend_unioned_google_spend 
  WHERE spend != 0
  GROUP BY
    month(spend_date),
    year(spend_date),
    spend_date,
    ad_platform
)

SELECT * FROM spend_summed

When you write a subquery, your code is still executed as one query, but reads from select statements rather than other table names. A subquery looks like this:


SELECT
    month(spend_date) AS spend_month,
    year(spend_date) AS spend_year,
    spend_date,
    ad_platform,
    sum(spend) AS spend
  FROM (
    SELECT spend_date, spend, 'facebook' AS ad_platform FROM {{ ref('stg_facebook_ads') }}
    UNION ALL
    SELECT spend_date, spend, 'google' AS ad_platform FROM {{ ref('stg_google_ads') }}
  ) AS spend_unioned
  WHERE spend != 0
  GROUP BY
    month(spend_date),
    year(spend_date),
    spend_date,
    ad_platform

Which is easier to read? The first probably takes a short glance to know what it’s doing. The second makes you stretch your brain a bit to complete the puzzle. Data models written in SQL are meant to be simple and readable. Complex code is not superior, but easy-to-understand code is. Why make something more complex just for the sake of it?

dbt helps to create modular code and actually makes the use of CTEs easier. When you are writing your code within the same SQL file, it can be tempting to write it in as few queries as possible. However, since dbt allows you to split one model’s code into multiple SQL files, you can follow best practices by writing CTEs and still have modular code. The separate files allow you to distinguish the important pieces of a model while still using CTEs and not subqueries. 
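As a sketch of what that modularity can look like (the file path and model name here are hypothetical), the union step from the CTE example could live in its own file, say models/intermediate/int_ad_spend_unioned.sql:

```sql
-- models/intermediate/int_ad_spend_unioned.sql (hypothetical path)
-- combine ad spend from both platforms into one relation
SELECT spend_date, spend, 'facebook' AS ad_platform FROM {{ ref('stg_facebook_ads') }}
UNION ALL
SELECT spend_date, spend, 'google' AS ad_platform FROM {{ ref('stg_google_ads') }}
```

Downstream models then read it with {{ ref('int_ad_spend_unioned') }}, so the union logic is written once and every model that needs combined ad spend stays a short, readable CTE-based query.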

Create dbt macros for repeatable SQL logic

Lastly, dbt’s macros make it easy to write repeatable, complex SQL logic without cluttering up your data models. dbt macros act as “functions”, using SQL and Jinja to write pieces of code that can then be applied in multiple data models. I’ve used macros to specify schema destinations, clean column values, and drop old tables. They make it so you don’t need to do these complex operations in every data model. Instead, you just call the macro. 

For example, I recently wrote a macro to reformat the naming of my columns, especially those with capital cases and weird characters. The code is stored in its own file within the macros folder in my dbt project. The code looks like this:


{% macro slugify(column_name) %}

  REGEXP_REPLACE(REGEXP_REPLACE(LOWER({{ column_name }}), '[ -]+', '_'), '[^a-z0-9_]+', '') AS {{ column_name }}

{% endmacro %}

Whenever I want to use it in a data model I call the macro and pass in the correct parameters. Here’s an example of how I call it to format the question and label columns in my typeform model.


SELECT
    response_id,
    form_name,
    form_id,
    {{ slugify('question') }},
    question_response,
    {{ slugify('label') }}
  FROM form_questions

dbt macros make my SQL data models a lot cleaner and allow me to effectively reuse code I wrote once across multiple models. 

Conclusion

When your SQL data models are written using best practices, you don’t have to worry about models building up technical debt. Incorrectly written SQL data models will have to be rewritten over and over again, only to run into the same problems. When you write clean, concise SQL code in combination with dbt, you are guaranteed to have fast, dependable, easy-to-read data models. To review, make sure you keep the following things in mind when writing your data models:

  • Always create base models to reference your raw data tables.
  • Use the correct join and minimize duplicates at the source. 
  • Choose CTEs over subqueries. 
  • When you have repeatable code that can act as a function, turn it into a dbt macro. 

Remember to follow these practices from the very beginning. “Quick and scrappy” data models often end in way more work down the line. What is quick and easy now will cost you a ton of money and resources later. Take your time to build your data models right and your business will be able to successfully scale.
