Snowflake Data Warehouse Architecture: How to Organize Databases, Schemas and Tables

Your data warehouse is the hub of your modern data stack. Data flows into it through data ingestion tools like Airbyte, which make sure your raw data is available. Data is transformed within it using SQL and modern data transformation tools like dbt. Data then flows out of it to business users and data visualization platforms.

All data exists within your warehousing solution. 

It’s a powerful tool. And one that you need to get right. The integrity of your data depends on the analytics engineers, data engineers, and data analysts who set up this solution. It is imperative that it is done correctly, with factors like development and production environments, security, and business use cases all considered.

In this article, I explain how I organize my Snowflake data warehouse so that as little as possible goes wrong. I cover deciding on your data warehouse solution, organizing your databases and schemas, and choosing between the different types of tables and views.

Why choose Snowflake as your data warehouse? 

I’ll be specifically referencing Snowflake as my data warehouse solution throughout this article. Snowflake is what I believe to be the best data warehouse on the market. It is built specifically for the cloud, meaning you and your team can access it wherever, whenever. It is fast and allows you to easily scale your compute power up or down based on your budget. It also makes it easy to share data among many users.

Before we get into all of the decisions that you need to make about your data warehouse architecture, let's discuss the components of a Snowflake data warehouse. A Snowflake data warehouse architecture is composed of different databases, each serving its own purpose. Snowflake databases contain schemas to further categorize the data within each database. Lastly, the most granular level consists of tables and views. Snowflake tables and views contain the columns and rows of a typical database table that you are familiar with.
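To make this concrete, every object in Snowflake can be referenced by its fully qualified database.schema.table name. A quick, illustrative query (all three names here are placeholders, not ones from my setup):

select *
from MY_DATABASE.MY_SCHEMA.MY_TABLE
limit 10;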

Organizing your Snowflake databases

There are three main components to your data warehouse: the location where you ingest your raw data, the location where data transformations occur (usually with dbt), and the location where you store data for reporting and experimentation.

Storing raw data

Let’s start by talking about the first location. Where will you be ingesting your raw data into Snowflake? I recommend creating a Snowflake database to ingest all of your raw data. This should always be the first location any piece of data lands. 

It is important to always have a copy of the rawest form of your data stored in case something goes wrong. Having a raw copy will allow you to re-run your data models in case you find an error in one of them.

It is imperative that the only system with full access to this database is your ingestion tool. In my case, that is Airbyte, an open-source data ingestion tool that you can use to load data into Snowflake from any available source connector. Airbyte dumps all of the raw data into the “RAW” database that you created specifically for ingestion. Nobody else, unless they are using another ingestion tool, should be dumping data into this location.
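One way to enforce this in Snowflake is to give write access on the raw database only to the role your ingestion tool uses. A rough sketch, not a complete permissions setup (the role and user names here are just examples):

create role if not exists LOADER;
grant usage on database RAW to role LOADER;
grant create schema on database RAW to role LOADER;
grant role LOADER to user AIRBYTE_USER;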

Storing transformed data

Now, there are a few types of transformations that happen within the data warehouse. First, and the simplest, are base models. These are basic transformations applied to the raw data to make it ready for analysts and analytics engineers to use in data models. I wrote another article about creating a dbt style guide for these base models.

It is a best practice when designing your Snowflake architecture to avoid having your transformations read from your raw database. Your data models should always read from another Snowflake database that contains these base models. This is essentially a database similar to your “RAW” database but with basic transformations such as data type casting and field name changes. I call mine “BASE” for simplicity's sake. dbt commonly refers to this as “STAGING”.

“BASE” data models are also views rather than tables. This saves costs within your Snowflake warehouse because you aren’t storing a full copy of the underlying data; rather, you are creating a layer that lives on top of it. Because this data is the same whether you are in development or production, there is no need to create separate environments. Views don’t need to be automated and deployed each day because they simply read from your raw data.
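As an illustration, a base model in dbt might look something like the following, assuming a Facebook ads source has been defined in your dbt project (the source, column, and file names are made up for the example):

-- models/base/facebook/fb_ads.sql
{{ config(materialized='view') }}

select
    id as ad_id,
    name as ad_name,
    cast(created_at as timestamp) as created_at,
    cast(spend as decimal(12, 2)) as spend
from {{ source('facebook', 'ads') }}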

As for your other transformations, these will be the more complex models you build using dbt. These require both a development and a production environment. You don’t want one database to contain both development and production models, so it is best to create a separate database for each. I call mine “DATA_MART_DEV” and “DATA_MART_PROD”.

Both of these databases read from the views in the “BASE” database, but one is for testing the creation of your data models and the other is validated, orchestrated, and depended on by the business. Your “DATA_MART_PROD” tables should be rebuilt daily, or more often, using an orchestration solution.

Keep in mind that tables created in Snowflake using dbt are transient tables by default. These are similar to permanent tables except they don’t have a Fail-safe period, so Snowflake doesn’t retain a full history of them. This helps save on storage costs but is another reason why it’s important to always have a copy of all your raw data.
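For reference, this is roughly the kind of statement dbt issues under the hood when it builds a table on Snowflake (the target model name is a placeholder):

create or replace transient table DATA_MART_PROD.CORE.MY_MODEL as (
    select * from BASE.FACEBOOK.FB_ADS
);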

Storing data for reporting and experimentation

Often analysts will want to write longer one-off queries for reporting and experimentation purposes. Those queries need a home so that their results can be accessed by business users and visualization tools. Because these are usually only written once and don’t need to be automated to run every day, you do not want to store them in development or production. We do most of our reporting within our BI tool, using the data models built in production, rather than within Snowflake. However, if you are automating your reporting in Snowflake, creating those reports in “DATA_MART_PROD” is probably a better idea.

In my Snowflake environment, I created a database called “RDA” for this data. Here, analysts can create tables from their queries to then be used by others. You may be wondering, if these are one-off analyses, why do I need to store them in the first place? If these queries are quite complicated, you aren’t going to want to use them directly in your visualization tool. You are going to want to harness the speed and power of Snowflake to run them. Also, this way, you’ll be able to easily reference previous reporting code for future queries. 

Here you can see what the Snowflake architecture of my databases looks like. Notice there are also “UTIL_DB”, “SNOWFLAKE”, and “DEMO_DB”. These are created by default with any Snowflake account. “SNOWFLAKE” is mainly used for internal Snowflake metadata involving privileges, account usage, and account history. “UTIL_DB” and “DEMO_DB” can be deleted.

Once you’ve determined the databases you wish to create, you can run the following command in your Snowflake worksheet to do so:

create database [database_name];
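For the databases described above, that could look like this (adjust the names to whatever convention you settle on):

create database RAW;
create database BASE;
create database DATA_MART_DEV;
create database DATA_MART_PROD;
create database RDA;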

Organizing your Snowflake schemas

Now that you’ve created your databases, you can determine the schemas you need to build within each one. Snowflake schemas act as a more granular way to organize tables and views within your database. They should help you write your queries faster, knowing exactly where to find certain data.

Raw data and base models

Your “RAW” and “BASE” databases should be made of the exact same schemas. “BASE” is essentially just a copy of your raw database but with basic transformations applied. In both of these databases, I recommend creating a new schema for every data source. 

For example, if you ingest data from Google Ads, Facebook, Bing Ads, and Mailchimp, you would create a different schema for each of these. Your database would look like this:
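As a rough sketch, with one schema per source (repeat the same schemas in “BASE”):

create schema RAW.GOOGLE_ADS;
create schema RAW.FACEBOOK;
create schema RAW.BING_ADS;
create schema RAW.MAILCHIMP;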

Each Snowflake database contains the default schemas “PUBLIC” and “INFORMATION_SCHEMA”. While I recommend deleting “PUBLIC”, “INFORMATION_SCHEMA” contains read-only views describing the objects in the database. This can come in handy, so there’s no harm in keeping it.

As you add new data sources, you can easily create a new schema for each one. Now, when accessing raw data or base models you know exactly where to look for every piece of data you need. 

Development and production models

Similar to “RAW” and “BASE”, “DATA_MART_DEV” and “DATA_MART_PROD” should have the same schemas. They are exact copies of one another except one is for development purposes and the other is validated and used by business teams. Within these, I only use two schemas: “INTERMEDIATE” and “CORE”.

These terms are commonly used in dbt documentation. Intermediate models are those that come between the base models and the final product, or the core data model. They are the output tables of the SQL files that don’t necessarily get used for analysis but are an important step in building the final model. The only person who really needs access to these is the analytics engineer, or whoever coded them.

Core models are the final product of a data model. They are the tables that result from the very last SQL file in a sequence of code. These are the ones that data analysts and business users will need to access in your production environment. All analyses, dashboards, and reports will be built from these data models.
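Setting these schemas up in a worksheet could look something like this:

create schema DATA_MART_DEV.INTERMEDIATE;
create schema DATA_MART_DEV.CORE;
create schema DATA_MART_PROD.INTERMEDIATE;
create schema DATA_MART_PROD.CORE;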

Organizing your Snowflake tables and views

Before deciding how you want to name your tables and views, you need to decide when to use each. If you aren’t familiar, a view acts like a table but isn’t a physical copy of your data. It is a SQL query that sits on top of the underlying table, making it a great option if you are looking for extra security or a way to save money on storage.

Personally, I use views for all of my base and intermediate models in Snowflake. Because base models only contain basic transformations on the raw data, there isn’t really a need to store another complete copy. As for intermediate models, because analysts don’t directly query them, there is no need to waste storage space on them.

However, my raw data and core data models are always tables. Your raw data needs to be stored as a physical copy in the form of a table. The same goes for your core data models, since these are typically rebuilt each day and frequently accessed; you want a physical, historical record to exist.
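In dbt, this split is just a materialization choice per model. A minimal sketch, assuming an intermediate model and a core model built on the fb_ads base model (the file names and columns are illustrative):

-- models/intermediate/int_fb_performance.sql (built as a view)
{{ config(materialized='view') }}
select ad_id, sum(spend) as total_spend
from {{ ref('fb_ads') }}
group by ad_id

-- models/core/fb_performance.sql (built as a table)
{{ config(materialized='table') }}
select * from {{ ref('int_fb_performance') }}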

Naming conventions for tables and views

The names of your Snowflake tables and views follow the names of your dbt SQL files. dbt automatically generates tables and views with the same names as the files where the code lives. Because of this, it is imperative that you create a consistent, strong dbt style guide, as I shared in a previous article.

If you look at the names of my SQL files within dbt, you will see those exact file names generated as objects within my Snowflake schemas. For example, I have fb_ads.sql, fb_campaigns.sql, and fb_ad_actions.sql as SQL files that contain code for my base models. If you look at my “BASE” database and “FACEBOOK” schema, you can see views with these exact names.
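In other words, the mapping from dbt files to Snowflake objects looks roughly like this (the directory layout is illustrative):

models/base/facebook/fb_ads.sql        ->  BASE.FACEBOOK.FB_ADS
models/base/facebook/fb_campaigns.sql  ->  BASE.FACEBOOK.FB_CAMPAIGNS
models/base/facebook/fb_ad_actions.sql ->  BASE.FACEBOOK.FB_AD_ACTIONS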

Conclusion

Be sure to start the process of building your Snowflake architecture by planning and designing the databases, schemas, and tables/views. You don’t want to create anything until you know what your entire data ecosystem will look like. I recommend documenting each piece of your Snowflake data warehouse architecture and the purpose behind it. Review this document multiple times with your team before deciding on an architecture that works best for you. 

Remember, this process isn’t a one-and-done deal. You will be adding new databases, schemas, and tables as new business use cases pop up. And, if you decide something isn’t working for you, don’t be afraid to change it. There are many different ways to organize your Snowflake data warehouse architecture. I’ve seen teams who separate environments by schema within databases, teams who automate reporting in their production environment, and teams who have multiple development and production databases. It’s all a matter of what makes sense for your use cases.

My team consists of only one analytics engineer, which is me, so I don’t know how this architecture would work with multiple analytics engineers writing data models within dbt. For a single person writing in dbt, it works great and has created a very streamlined way of doing things. Look out for part two on how to set up users, roles, and permissions within your Snowflake data warehouse.