Snowflake Data Warehouse Architecture: How to Organize Databases, Schemas and Tables

Your data warehouse is the hub of your modern data stack. Data flows into it through data ingestion tools like Airbyte, which make sure your raw data is available. Data is transformed within it using SQL and modern data transformation tools like dbt. Data then flows out of it to business users and data visualization platforms.

All data exists within your warehousing solution. 

It’s a powerful tool. And one that you need to get right. The integrity of your data depends on the analytics engineers, data engineers, and data analysts who set up this solution. It is imperative that it is done correctly, with factors like development and production environments, security, and business use cases all considered.

In this article, I explain how I organize my Snowflake data warehouse so that as little as possible goes wrong. I cover deciding on your data warehouse solution, organizing your databases and schemas, and choosing between the different types of tables and views.

Why choose Snowflake as your data warehouse? 

I’ll be specifically referencing Snowflake as my data warehouse solution throughout this article. Snowflake is what I believe to be the best data warehouse on the market. It is built specifically for the cloud, meaning you and your team can access it wherever, whenever. It is fast and allows you to easily scale your compute power up or down based on your budget. It also makes it easy to share data among many users.

Before we get into all of the decisions that you need to make about your data warehouse architecture, let's discuss the components of a Snowflake data warehouse. A Snowflake data warehouse architecture is composed of different databases, each serving its own purpose. Snowflake databases contain schemas to further categorize the data within each database. Lastly, the most granular level consists of tables and views. Snowflake tables and views contain the columns and rows of a typical database table that you are familiar with.
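To make this concrete, every object in Snowflake can be referenced by its fully qualified database.schema.table name. A quick, illustrative query (all three names here are placeholders, not ones from my setup):

select *
from MY_DATABASE.MY_SCHEMA.MY_TABLE
limit 10;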

Organizing your Snowflake databases

There are three main components to your data warehouse: the location where you ingest your raw data, the location where data transformations occur (usually with dbt), and the location where you store data for reporting and experimentation.

Storing raw data

Let’s start by talking about the first location. Where will you be ingesting your raw data into Snowflake? I recommend creating a Snowflake database to ingest all of your raw data. This should always be the first location any piece of data lands. 

It is important to always have a copy of the rawest form of your data stored in case something goes wrong. Having a raw copy will allow you to re-run your data models in case you find an error in one of them.

It is imperative that the only system with full access to this database is your ingestion tool. In my case, that is Airbyte, an open-source data ingestion tool that you can use to load data into Snowflake from any available source connector. Airbyte dumps all of the raw data into the “RAW” database that you created specifically for ingestion. Nobody else, unless they are using another ingestion tool, should be dumping data into this location.
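One way to enforce this in Snowflake is to give write access on the raw database only to the role your ingestion tool uses. A rough sketch, not a complete permissions setup (the role and user names here are just examples):

create role if not exists LOADER;
grant usage on database RAW to role LOADER;
grant create schema on database RAW to role LOADER;
grant role LOADER to user AIRBYTE_USER;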

Storing transformed data

Now, there are a few types of transformations that happen within the data warehouse. First, and the simplest, are base models. These are basic transformations applied to the raw data to make it ready for analysts and analytics engineers to use in data models. I wrote another article about creating a dbt style guide for these base models.

It is a best practice when designing your Snowflake architecture to avoid having your transformations read from your raw database. Your data models should always read from another Snowflake database that contains these base models. This is essentially a database similar to your “RAW” database but with basic transformations such as data type casting and field name changes. I call mine “BASE” for simplicity's sake. dbt commonly refers to this as “STAGING”.

“BASE” data models are also views rather than tables. This saves costs within your Snowflake warehouse because you aren’t storing a full copy of the underlying data; rather, you are creating a layer that lives on top of it. Because this data is the same whether you are in development or production, there is no need to create separate environments. Views don’t need to be automated and deployed each day because they simply read from your raw data.
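As an illustration, a base model in dbt might look something like the following, assuming a Facebook ads source has been defined in your dbt project (the source, column, and file names are made up for the example):

-- models/base/facebook/fb_ads.sql
{{ config(materialized='view') }}

select
    id as ad_id,
    name as ad_name,
    cast(created_at as timestamp) as created_at,
    cast(spend as decimal(12, 2)) as spend
from {{ source('facebook', 'ads') }}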

As for your other transformations, these will be the more complex models you build using dbt. These require both a development and a production environment. You don’t want one database to contain both development and production models, so it is best to create a separate database for each. I call mine “DATA_MART_DEV” and “DATA_MART_PROD”.

Both of these databases read from the views in the “BASE” database, but one is for testing the creation of your data models and the other is validated, orchestrated, and depended on by the business. Your “DATA_MART_PROD” tables should be rebuilt daily, or more often, using an orchestration solution.

Keep in mind that tables created in Snowflake using dbt are transient tables by default. These are similar to permanent tables except they don’t have a Fail-safe period, so Snowflake doesn’t retain a full history of them. This helps save on storage costs but is another reason why it’s important to always have a copy of all your raw data.
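For reference, this is roughly the kind of statement dbt issues under the hood when it builds a table on Snowflake (the target model name is a placeholder):

create or replace transient table DATA_MART_PROD.CORE.MY_MODEL as (
    select * from BASE.FACEBOOK.FB_ADS
);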

Storing data for reporting and experimentation

Often analysts will want to write longer one-off queries for reporting and experimentation purposes. Those queries need a home so that their results can be accessed by business users and visualization tools. Because these are usually only written once and don’t need to be automated to run every day, you do not want to store them in development or production. We do most of our reporting within our BI tool, using the data models built in production, rather than within Snowflake. However, if you are automating your reporting in Snowflake, creating those reports in “DATA_MART_PROD” is probably a better idea.

In my Snowflake environment, I created a database called “RDA” for this data. Here, analysts can create tables from their queries to then be used by others. You may be wondering, if these are one-off analyses, why do I need to store them in the first place? If these queries are quite complicated, you aren’t going to want to use them directly in your visualization tool. You are going to want to harness the speed and power of Snowflake to run them. Also, this way, you’ll be able to easily reference previous reporting code for future queries. 

Here you can see what the Snowflake architecture of my databases looks like. Notice there are also “UTIL_DB”, “SNOWFLAKE”, and “DEMO_DB”. These are created by default with any Snowflake account. “SNOWFLAKE” is mainly used for internal Snowflake metadata involving privileges, account usage, and account history. “UTIL_DB” and “DEMO_DB” can be deleted.

Once you’ve determined the databases you wish to create, you can run the following command in your Snowflake worksheet to do so:

create database [database_name];
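For the databases described above, that could look like this (adjust the names to whatever convention you settle on):

create database RAW;
create database BASE;
create database DATA_MART_DEV;
create database DATA_MART_PROD;
create database RDA;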

Organizing your Snowflake schemas

Now that you’ve created your databases, you can determine the schemas you need to build within each one. Snowflake schemas act as a more granular way to organize tables and views within your database. They should help you write your queries faster, knowing exactly where to find certain data.

Raw data and base models

Your “RAW” and “BASE” databases should be made of the exact same schemas. “BASE” is essentially just a copy of your raw database but with basic transformations applied. In both of these databases, I recommend creating a new schema for every data source. 

For example, if you ingest data from Google Ads, Facebook, Bing Ads, and Mailchimp, you would create a different schema for each of these. Your database would look like this:
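As a rough sketch, with one schema per source (repeat the same schemas in “BASE”):

create schema RAW.GOOGLE_ADS;
create schema RAW.FACEBOOK;
create schema RAW.BING_ADS;
create schema RAW.MAILCHIMP;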

Each Snowflake database contains the default schemas “PUBLIC” and “INFORMATION_SCHEMA”. While I recommend deleting “PUBLIC”, “INFORMATION_SCHEMA” contains read-only views describing the objects in the database. This can come in handy, so there’s no harm in keeping it.

As you add new data sources, you can easily create a new schema for each one. Now, when accessing raw data or base models you know exactly where to look for every piece of data you need. 

Development and production models

Similar to “RAW” and “BASE”, “DATA_MART_DEV” and “DATA_MART_PROD” should have the same schemas. They are exact copies of one another except one is for development purposes and the other is validated and used by business teams. Within these, I only use two schemas: “INTERMEDIATE” and “CORE”.

These terms are commonly used in dbt documentation. Intermediate models are those that come between the base models and the final product, or the core data model. They are the output tables of the SQL files that don’t necessarily get used for analysis but are an important step in building the final model. The only person who really needs access to these is the analytics engineer, or whoever coded them.

Core models are the final product of a data model. They are the tables that result from the very last SQL file in a sequence of code. These are the ones that data analysts and business users will need to access in your production environment. All analyses, dashboards, and reports will be built from these data models.
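Setting these schemas up in a worksheet could look something like this:

create schema DATA_MART_DEV.INTERMEDIATE;
create schema DATA_MART_DEV.CORE;
create schema DATA_MART_PROD.INTERMEDIATE;
create schema DATA_MART_PROD.CORE;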

Organizing your Snowflake tables and views

Before deciding how you want to name your tables and views, you need to decide when to use each. If you aren’t familiar, a view acts like a table but isn’t a physical copy of your data. It is a SQL query that sits on top of the underlying table, making it a great option if you are looking for extra security or a way to save money on storage.

Personally, I use views for all of my base and intermediate models in Snowflake. Because base models only contain basic transformations on the raw data, there isn’t really a need to store another complete copy. As for intermediate models, because analysts don’t directly query them, there is no need to waste storage space on them.

However, my raw data and core data models are always tables. Your raw data needs to be stored as a physical copy in the form of a table. The same goes for your core data models, since these are typically rebuilt each day and frequently accessed; you want a physical, historical record to exist.
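In dbt, this split is just a materialization choice per model. A minimal sketch, assuming an intermediate model and a core model built on the fb_ads base model (the file names and columns are illustrative):

-- models/intermediate/int_fb_performance.sql (built as a view)
{{ config(materialized='view') }}
select ad_id, sum(spend) as total_spend
from {{ ref('fb_ads') }}
group by ad_id

-- models/core/fb_performance.sql (built as a table)
{{ config(materialized='table') }}
select * from {{ ref('int_fb_performance') }}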

Naming conventions for tables and views

The names of your Snowflake tables and views follow the names of your dbt SQL files. dbt automatically generates tables and views with the same names as the files where the code lives. Because of this, it is imperative that you create a consistent, strong dbt style guide, as I shared in a previous article.

If you look at the names of my SQL files within dbt, you will see those exact file names generated as objects within my Snowflake schemas. For example, I have fb_ads.sql, fb_campaigns.sql, and fb_ad_actions.sql as SQL files that contain code for my base models. If you look at my “BASE” database and “FACEBOOK” schema, you can see views with these exact names.
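In other words, the mapping from dbt files to Snowflake objects looks roughly like this (the directory layout is illustrative):

models/base/facebook/fb_ads.sql        ->  BASE.FACEBOOK.FB_ADS
models/base/facebook/fb_campaigns.sql  ->  BASE.FACEBOOK.FB_CAMPAIGNS
models/base/facebook/fb_ad_actions.sql ->  BASE.FACEBOOK.FB_AD_ACTIONS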

Conclusion

Be sure to start the process of building your Snowflake architecture by planning and designing the databases, schemas, and tables/views. You don’t want to create anything until you know what your entire data ecosystem will look like. I recommend documenting each piece of your Snowflake data warehouse architecture and the purpose behind it. Review this document multiple times with your team before deciding on an architecture that works best for you. 

Remember, this process isn’t a one-and-done deal. You will be adding new databases, schemas, and tables as new business use cases pop up. And, if you decide something isn’t working for you, don’t be afraid to change it. There are many different ways to organize your Snowflake data warehouse architecture. I’ve seen teams who separate environments by schema within databases, teams who automate reporting in their production environment, and teams who have multiple development and production databases. It’s all a matter of what makes sense for your use cases.

My team consists of only one analytics engineer, which is me, so I don’t know how this architecture would work with multiple analytics engineers writing data models within dbt. For a single person writing in dbt, it works great and has created a very streamlined way of doing things. Look out for part two on how to set up users, roles, and permissions within your Snowflake data warehouse.