Best Practices for your dbt Style Guide
dbt is the core of analytics engineering and the tool behind the “T” in ELT. It is a powerful way to transform your raw data into usable data models. It is the tool that I, and most analytics engineers, use to transform raw data into base models, which every other data model then builds on.
Base models are models that reference the raw data you ingest directly into your data warehouse. These are the only models that should read from the raw data source. Once your base models are written, every other model references them instead. Base models are relatively simple, limited to casting data types and renaming columns. You never want to include advanced SQL like aggregate or window functions, and you don’t want to filter any data here either.
Base models are now more frequently referred to as staging models in dbt’s documentation. dbt changed the naming because it didn’t want to be limited by the name “base”, which implies only the first step in building a data model.
If you look at their example DAG, you can see they have both base and staging models. Base models act as the first layer when there are also staging models involved. dbt views staging as the model before the final product and base models as an extra “helper” stage when needed.
Personally, I like using the term “base” because it is more straightforward. “Staging” models can be easily confused with “intermediate” models, another term used in dbt organization. But, just know, if you’re reading dbt’s updated documentation, these same base models are referred to as staging models.
Let’s go into further detail on why base models are so important.
The importance of base models
It is considered a dbt best practice to have a copy of your raw data in your data warehouse at all times. You never know what can go wrong and always want to prepare for the worst. Having your raw data always available to you allows you to pinpoint data discrepancies faster and always rerun changed data models without having to ingest all of your data over again. This is why ELT tools like Airbyte that load all data to the destination before applying any transformation are becoming the norm.
Ingesting all of your historical data is costly and you only want to do it once, when you first set up your data ingestion tool. Having your raw data untouched allows you the flexibility that you wouldn’t have otherwise.
Base models make it so that you never have to reference your raw data sources within your data models. They are essentially a cleaner version of your raw data. While there are not many changes in the data, they make the lives of analysts easier.
Creating a dbt style guide
A dbt style guide should include naming conventions for your models and define the structure of your dbt project. Raw data from different sources often has different naming conventions or repeated column names. Determining a style guide for your data is key. Your data team needs to agree on the best practices to organize your dbt projects to make collaboration easier. You want your data to be uniform across all data sources, and you do this by following your team’s best practices for defining your base models in dbt.
Here are a few things to consider:
- Column naming style (snake case or camel case?)
- Column naming convention (<table name>_<column name>)
- Date types (timestamp_ntz or date?)
- Decimal places
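As an illustration, here is a sketch of how these choices might play out in a base model. The raw column names, the `raw.mailchimp.campaigns` location, and the precision are all made up for the example:

```sql
select
    -- snake_case, prefixed with the table name: <table name>_<column name>
    cast("CreatedAt" as timestamp_ntz) as campaign_created_at,
    -- consistent decimal precision across all base models
    round("Revenue", 2) as campaign_revenue
from raw.mailchimp.campaigns
```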
The most important thing here is uniformity. With whatever styling you choose, you need to make sure it is consistent across all base models, no matter the data source. Analysts and analytics engineers want to be able to code using these data sources without constantly checking the column names in the database tables. The styles you choose should be consistent and intuitive so that they pick it up in no time.
I’ll share with you the best practices I use in my base models when we get to the examples of building them out using dbt.
Ingesting raw data
Before we set up our dbt source files, we need to make sure we have raw data in our data warehouse to reference. I’ll be using Airbyte to ingest data from Mailchimp into Snowflake. You can read more about setting up sources and destinations in Airbyte’s documentation. I also wrote a detailed article about how to ingest Mailchimp data.
Setting up dbt source files
dbt uses YAML files to organize projects. The inner workings of these files are what make dbt special. They include important information like database credentials, table descriptions, model descriptions, and schema details. YAML files contain all the behind-the-scenes connections while SQL files contain all the transformation code.
Source files are just one of the YAML files utilized by dbt to help ensure everything runs smoothly. They essentially give dbt access to your raw data. They include the database and schema of the data source, the tables it contains, and descriptions of the models built from that raw data.
The first thing you need to do is set up YAML files for your sources. Depending on how many data sources you have, you can set up one or multiple files. Personally, I like to set up one per data source or even multiple per data source depending on how many tables that source contains. This makes it so that the files stay readable and organized. It can be overwhelming to have 50+ sources in one .yml file.
Let’s look at the tables we loaded in from Mailchimp.
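The exact table names depend on your connector version, but a Mailchimp sync will typically land tables along these lines:

```
campaigns
automations
email_activity
reports
unsubscribes
lists
list_members
segments
```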
You can see there are mainly campaign and list related tables. Instead of having these all in one .yml file, I consider it a good practice to create two different folders under my Mailchimp folder, one for campaigns and the other for lists. These folders would then have their own .yml files that declare the data sources and their related tables.
Here’s what the file structure would look like:
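This is a hypothetical layout; the folder and file names are illustrative:

```
models
└── base
    └── mailchimp
        ├── campaigns
        │   ├── src_mailchimp_campaigns.yml
        │   └── base_mailchimp_campaigns.sql
        └── lists
            ├── src_mailchimp_lists.yml
            └── base_mailchimp_lists.sql
```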
Now let’s look at what one of these individual .yml files looks like. We will be writing the source file for the campaign-related data in Mailchimp. I would call this file src_mailchimp_campaigns.yml. I always follow this naming structure for my source files:
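Spelled out as a pattern, where the grouping is whatever you named the folder:

```
src_<data source>_<table grouping>.yml
```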
Choose whatever makes sense for you. But, no matter what you decide on, make sure it is consistent across all of your source files. I prefix all of my source files with src_ to help identify the source files from my models within this same folder. It is easier on the eyes when you have a bunch of base models under one folder.
Within each source file you need to start by listing your sources. Under the “source” section you can list the name, description, and tables that make up your source. I like to only have one of these per file, but, again, this depends on how you’re organizing your sources. If you have one file for all of your sources then you will have multiple listed here.
Here in my src_mailchimp_campaigns.yml file I listed mailchimp_campaigns as my source name, added a description saying this was for all Mailchimp tables related to campaigns, and listed the tables that fall under this grouping.
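A sketch of that file is below. The `database`, `schema`, and table names are placeholders for whatever your connector actually loaded:

```yaml
version: 2

sources:
  - name: mailchimp_campaigns
    description: "All Mailchimp tables related to campaigns"
    database: raw
    schema: mailchimp
    tables:
      - name: campaigns
      - name: campaigns_social_card
      - name: reports
```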
It can be easy to skip adding descriptions to these tables, but I highly recommend it. As an analytics engineer, your data models are only as good as the documentation describing them. If people don’t understand their purpose or how they work, they aren’t going to be useful. More information is always better than less. Assume nobody knows what you are talking about and make it so that they get all their questions answered just by reading your documentation.
Here are some descriptions I included for the first few tables in this src_mailchimp_campaigns.yml file:
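The wording here is paraphrased, but the structure is what dbt expects under the source’s `tables` block:

```yaml
    tables:
      - name: campaigns
        description: "One row per Mailchimp campaign, with send settings and status"
      - name: campaigns_social_card
        description: "Social card title, description, and image for each campaign"
```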
Connecting base models to source files
Connecting your base models to their corresponding source files is pretty simple if you've followed all my advice thus far. You just need to follow some simple syntax. However, if you haven’t been organized about following proper naming conventions, you could run into some trouble.
Base models are the models that reference these source files that you created. You do so by selecting from them using the following syntax:
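This is dbt’s built-in `source()` function:

```sql
select * from {{ source('mailchimp_campaigns', 'campaigns') }}
```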
Source name is the name at the very top of your source file listed under “sources”.
The table name is one of the name blocks you defined under the “tables” section in this file.
These data sources and table names can get confusing to remember if you aren’t following proper naming conventions. That’s why I always recommend creating your own style guide to follow. It’s always nice to have a document to reference and hold yourself accountable to the standards you’ve set.
Another thing to note: if you have sources or tables with the same names, dbt will throw an error when you try to run your model. For example, you can’t have two different campaigns tables under two different sources. You know they are different, but dbt does not. This is why I recommend prefixing your source and table names with the data source itself.
In terms of the content in a base model, like I mentioned earlier, you only want to include basic transformations. This is where you can cast dates or timestamps, rename columns, and round any decimals. Base models are not the place for aggregate functions and any type of filtering.
Here’s what my Mailchimp campaigns social card base model looks like:
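Here is a sketch of what that model could look like. The raw column names are assumptions based on Mailchimp’s social card fields, not the exact columns from my project:

```sql
with source as (

    select * from {{ source('mailchimp_campaigns', 'campaigns_social_card') }}

)

select
    -- prefix generic column names so future joins stay unambiguous
    title as social_card_title,
    description as social_card_description,
    image_url as social_card_image_url
from source
```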
Notice how I rename the columns to follow the guidelines I outlined in my style guide. I included a “social_card” prefix here in case I ever join these columns to another table. While it is rare this table will be joined, since it lacks a primary key, prefixing is a good habit, especially for reserved keywords like “title” and “name”.
Defining your models in your source files
After writing your base models, you want to revisit your source files. In addition to the “sources” block within this file, you can add a “models” block. This is a best practice to keep track of which models reference the sources you have defined.
In the corresponding source file, you will want to list the names of the base models that reference the source. You follow the same syntax you did for sources and tables. It should look something like this:
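Continuing the Mailchimp example, with a model name matching whatever you called your base model file:

```yaml
models:
  - name: base_mailchimp_campaigns
    description: "Cleaned version of the raw Mailchimp campaigns table: renamed columns and cast dates"
```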
Again, I recommend adding as many details as possible. Always include a description so whoever is looking at your data models knows exactly why that model exists. This is also important in case you ever deploy your documentation.
While I don’t always do this myself, it is also a great habit to add the names of the base model columns here. This is a great thing to do once all of your models have been fully built out and you have some extra time on your hands. It can definitely be a hard thing to prioritize when you have lots of impact to make within your organization. But, the further you move from the start of this process, the more important this documentation will become.
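When you do document columns, the block grows like this; the columns shown are illustrative:

```yaml
models:
  - name: base_mailchimp_campaigns
    description: "Cleaned version of the raw Mailchimp campaigns table"
    columns:
      - name: campaign_id
        description: "Unique identifier of the Mailchimp campaign"
      - name: campaign_created_at
        description: "Campaign creation time, cast to timestamp_ntz"
```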
Improving your data pipelines with a style guide
Utilizing source files helps preserve the integrity of your raw data. By always selecting from the base models instead of the sources, you ensure your raw data is never touched and always remains the same as when it was ingested. Having a copy of your raw data sources at all times will help prevent any data losses in the case that something goes wrong.
In addition, base models help keep consistency across all data models. Raw data sources all ingest data using different standards and naming conventions. However, by utilizing base models you are the one to set the style of your data models. You put in place the correct column names and data types before building your more complex models.
dbt also has a feature called dbt test which allows you to test your base models to ensure they are following the style guide you have put in place. We will get into these more in another article.
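As a quick preview, tests live in the same YAML files as your model definitions; the column name here is illustrative:

```yaml
models:
  - name: base_mailchimp_campaigns
    columns:
      - name: campaign_id
        tests:
          - unique
          - not_null
```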
Organization is key when it comes to writing your data models using dbt. And where does organization come from? Your style guide that applies the best practices to organize your dbt projects. It is imperative that you create standards for your dbt environment before starting any coding. Naming conventions and data type casting shouldn’t be left on the back burner as something you think about briefly when writing your models.
The ease of use of your models all comes down to how well everything is documented and how readily available it is to everyone on your team. Make sure your style guide is available to all members of your team to reference whenever they need it. Personally, I push my dbt style guide to a shared Github repo along with all of our SQL and source files. This way it is always there for whoever wants to read it.
dbt is an essential part of your ELT process when you combine it with a data ingestion tool like Airbyte. It has amazing documentation features, upholds the integrity of your data, and makes data models easy to debug when something goes wrong. Take full advantage of the power it has to offer your team. Set it up right from the beginning and your analytics workflow will be smooth sailing.