Snowflake CDC: A Comprehensive Guide to Seamless Data Sync

January 29, 2024

In today’s operational business systems, data evolves rapidly, making it necessary to capture and act on changes as they occur. This is critical if your organization relies on evidence-based, real-time decision-making. By extracting valuable insights from up-to-date data, you can make smarter, better-informed decisions that drive business growth.

Snowflake, the popular cloud data platform, supports Change Data Capture (CDC), allowing businesses to track and replicate data modifications efficiently. If you want to manage your data better, consider using Snowflake CDC. Let’s look into the details of CDC in Snowflake.

An Overview of Snowflake

Founded in 2012, Snowflake is a fully managed, cloud-based data warehousing platform delivered as a service for storing, analyzing, and managing vast amounts of data. The Snowflake service runs on Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP) infrastructure, providing near-unlimited capacity for data storage and retrieval.

The Snowflake architecture consists of three key layers:

  • Database Storage: Data loaded into Snowflake is reorganized into its internal, optimized, compressed columnar format. Snowflake then stores this optimized data in cloud storage files.
  • Query Processing: Snowflake processes queries in its processing layer using virtual warehouses; each virtual warehouse is a Massively Parallel Processing (MPP) compute cluster composed of multiple compute nodes.
  • Cloud Services: A collection of services that coordinates activities across Snowflake. The cloud services layer also runs on compute instances provisioned by Snowflake from the cloud provider.

Some of the services managed in this layer include authentication, access control, infrastructure management, and metadata management. Let’s look at some of the features that make Snowflake a popular data warehousing solution in the market.

  • Scalability: Snowflake separates its storage and compute units. This allows you to scale each independently and cost-effectively, paying only for the services you use.
  • Semi-Structured Data Support: Snowflake can handle semi-structured and unstructured data. Supported semi-structured formats include JSON, Avro, and Parquet, among others.
  • Standard and Extended SQL Support: Snowflake supports most standard SQL, including DDL and DML commands for defining and querying data. Apart from the standard insert, delete, and update statements, it also supports transactions, lateral views, stored procedures, and statistical aggregate functions.
  • Security: Snowflake offers several security features for data security. This includes multi-factor authentication, OAuth, SSO via federated authentication, and role-based access control.

What is Change Data Capture?

Change Data Capture (CDC) is the process of capturing changes made to data in a database, such as SQL Server, and delivering those changes to a downstream system or process in real time. It ensures that any inserts, updates, and deletes in a database are automatically detected and processed. Rather than processing the entire dataset, CDC systems capture the specific data that has been altered, making it an efficient data synchronization approach.

Change Data Capture encompasses a set of software design patterns used to detect and track changes in a database. Each detected change raises an event tied to the data, which triggers a specific action for every CDC occurrence.

The primary purpose of Change Data Capture (CDC) is to enable real-time data integration and maintain up-to-date data across systems. This facilitates timely data synchronization and real-time analytics. You can keep team members updated in near real-time and send the latest data updates to BI tools.

The four CDC approaches are as follows:

  • Log-based CDC: Considered the most efficient way to implement CDC, log-based CDC reads the transaction log of the source database to capture new transactions. This CDC type has minimal impact on the source system.
  • Query-based CDC: This requires you to run queries to detect changes in the database. Query-based CDC retrieves only the rows that have changed since the last extraction; a minimal sketch follows this list. It is more invasive to the source systems and is resource-intensive.
  • Trigger-based CDC: This method relies on database triggers that detect changes, creating a change log in shadow tables, which provide a detailed log of all transactions. It requires multiple writes for each row’s data modification.
  • Polling-based CDC: The source database is polled at regular intervals in polling-based CDC; changes are detected and stored in a separate table. The changes are then propagated to the target system. This approach may cause increased latency and possibly miss some changes if the polling interval is too long.
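
Here is a minimal sketch of the query-based approach, assuming a hypothetical source table named customer_source with a last_modified timestamp column:

-- Timestamp recorded at the end of the previous extraction run.
set last_extracted_at = '2024-01-01 00:00:00';

-- Query-based CDC: pull only the rows changed since that point.
select *
from customer_source
where last_modified > $last_extracted_at::timestamp_ntz;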

Here are some key benefits of Change Data Capture (CDC):

  • It facilitates zero-downtime database migrations and supports real-time analytics, fraud protection, and data synchronization across geographically distributed systems.
  • Change Data Capture (CDC) ensures the data in multiple systems stays in sync. This is especially important for making time-sensitive decisions in a high-velocity data environment.
  • Change Data Capture can efficiently move data across a wide area network, making it suitable for the cloud.
  • It eliminates the need for bulk data updating and inconvenient batch windows.

What is Snowflake CDC?

Snowflake Change Data Capture (CDC) is a technique used to track and capture changes made to data within a Snowflake data warehouse and transmit them to other systems, databases, or applications for real-time synchronization. It is a form of log-based CDC implemented using change streams.

A stream takes a logical snapshot of a source object, which can be a table, an external table, or the underlying tables of a view, and records information about DML changes made to that object. The change stream, which acts as a log of data modifications in Snowflake, is continuously updated as new changes occur.

The data capture process begins with creating a CDC-enabled table in Snowflake. This table is configured to capture data changes through Snowflake CDC and contains all the changed data that must be replicated. Any changes made to the table are automatically captured and sent to the target database in near real time.

After a stream is created for the table, queries against the stream return the source table’s columns along with additional metadata columns that track the data changes. Any changed data that a stream tracks must be consumed or moved to permanent storage within the retention period. Otherwise, the stream becomes stale, the changes are no longer accessible, and a new stream must be created to track further data changes from that point.
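
Before a scheduled pipeline consumes a stream, it can be useful to check whether the stream still holds unconsumed changes. A minimal check, assuming a stream named member_stream like the one created in the walkthrough below:

-- Returns TRUE if the stream still has unconsumed change records.
select system$stream_has_data('member_stream');

-- SHOW STREAMS also reports whether a stream has gone stale.
show streams like 'member_stream';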

What is a Snowflake Stream?

A Snowflake stream, also known as a table stream, is an object that enables Snowflake Change Data Capture by tracking DML changes made to a source table in Snowflake. The metadata of each change is stored and later used to retrieve the changed data when you query the stream.

When a stream is created, it captures an initial logical snapshot of every row in the source object and initializes an offset at the object’s current transactional version. After this snapshot is taken, every time you modify data in the source table, the stream enables Change Data Capture by recording the changes relative to that offset.

The stream’s change tracking system will record DML-change information, reflecting the state of a row before and after the change. Such information typically includes the source object’s column structure and additional metadata columns that describe each change event. The additional columns returned for any query to a stream include:

  • METADATA$ACTION: Indicates the kind of DML statement (INSERT or DELETE) recorded. An UPDATE statement is represented by a pair of row entries: a DELETE and an INSERT.
  • METADATA$ISUPDATE: Indicates whether the row entry was part of an UPDATE. The value is TRUE if the entries were part of an update; else, it’s FALSE.
  • METADATA$ROW_ID: The unique and immutable row ID used to track changes on a particular row over time.
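
As an illustration, here is how these columns can be queried, using the member_stream stream and the members_source columns from the walkthrough below:

select id, first_name, last_name,
       metadata$action, metadata$isupdate, metadata$row_id
from member_stream;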

How to Set Up Snowflake CDC with Streams?

To get started with Snowflake CDC, first log in to the Snowflake web UI or connect through SnowSQL. Then, follow these steps:

Step 1: Create the CDC_STREAM Database

Run the following commands to create the database and set it as the current database:

create or replace database CDC_STREAM;
use CDC_STREAM;

Step 2: Create Source Table

To create a source table, run the following command:

create or replace table members_source (
  id int,
  first_name varchar(200),
  last_name varchar(200)
);

Step 3: Create Destination Table

Similarly, to create a destination table, run the following command:

create or replace table members_destination (
  id int,
  first_name varchar(200),
  last_name varchar(200)
);

Step 4: Create Stream to Track Changes

After the source and destination tables have been created, you must create a stream on top of the source table to track any changes made in the table.

create or replace stream member_stream on table members_source;

Step 5: Populate the Source Table with Some Data

For demonstration purposes, let’s add some records to the members_source table. Run the following command to do this:

insert into members_source values (1, 'Wayne', 'Bell');
insert into members_source values (2, 'Anthony', 'Allen');
insert into members_source values (3, 'Eric', 'Henderson');
insert into members_source values (4, 'Jimmy', 'Smith');
insert into members_source values (5, 'Diana', 'Wheeler');
insert into members_source values (6, 'Karen', 'Hall');
insert into members_source values (7, 'Philip', 'Rodriguez');
insert into members_source values (8, 'Ashley', 'Bryant');
insert into members_source values (9, 'Norma', 'Grant');
insert into members_source values (10, 'Helen', 'Lewis');
insert into members_source values (11, 'Larry', 'Mccoy');
insert into members_source values (12, 'Emily', 'Wood');
insert into members_source values (13, 'Patrick', 'Alvarez');

Step 6: View the Change Log in the Stream

Before proceeding with the next step, view the change log in the stream with the following command:

select * from member_stream;

Step 7: View Results

Since this is an initial load, every change is recorded as an insert: each of the 13 rows appears in the stream with METADATA$ACTION set to INSERT and METADATA$ISUPDATE set to FALSE.

The destination table will not have any records yet; the captured rows exist only in the stream.

Step 8: Use MERGE Statement to Move Records

Next, merge everything the stream has captured from the source table into the destination table.
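
A minimal sketch of such a MERGE, using the stream and tables created above, is shown below. The subquery filters out the DELETE half of each update pair so that every captured change maps to exactly one action on the destination (the exact statement can vary with your table structure):

merge into members_destination d
using (
    -- Drop the DELETE half of UPDATE pairs; keep true inserts, true deletes,
    -- and the INSERT half of updates.
    select * from member_stream
    where not (metadata$action = 'DELETE' and metadata$isupdate)
) s
on d.id = s.id
when matched and s.metadata$action = 'DELETE' then
    delete
when matched and s.metadata$action = 'INSERT' then
    update set d.first_name = s.first_name, d.last_name = s.last_name
when not matched and s.metadata$action = 'INSERT' then
    insert (id, first_name, last_name) values (s.id, s.first_name, s.last_name);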

This moves the data to the destination and leaves nothing in the stream: consuming a stream’s contents in a DML statement advances its offset, so the stream object becomes empty after the MERGE command is run. This cycle continues, with the stream continuously recording the changes that happen in the members_source table.

Step 9: View Destination Table

To view the destination table for the updated records, run the following command:

select * from members_destination;

The destination table now contains the same 13 records as the source table.

Step 10: Make Changes in Source Table

To demonstrate Change Data Capture in Snowflake, make a change in the source table by updating a record, then observe the stream. Here’s how you can update the last name of the record with id = 1 from Bell to Wright:

update members_source set last_name = 'Wright' where id = 1;

Now, when you observe the stream, you’ll find two records for id = 1: a DELETE entry carrying the original last name (Bell) and an INSERT entry carrying the new one (Wright), both with METADATA$ISUPDATE set to TRUE. This is how the stream represents an update: the old version of the row is deleted and the new version is inserted.

Step 11: Use MERGE Command to Update Destination Table

Next, update the destination table by re-running the same MERGE statement used in Step 8. The stream now contains only the update pair for id = 1, so only that row is applied to the destination.

Step 12: View Results

Finally, run the following command to view the destination table:

select * from members_destination;

You can observe that the changes made to the source table have been applied to the destination table. This completes the Snowflake Change Data Capture cycle, which repeats every time there is a change in the source table.

Use Cases For Snowflake CDC

1. Real-Time Data Warehousing

Real-time data warehousing ensures your business intelligence tools and dashboards have the most current view of your data. It provides immediate analysis and reporting—essential for changing and fast-moving sectors like finance, retail, and healthcare. For instance, an e-commerce company can use Snowflake CDC to trace real-time sales transactions, inventory levels, and customer behavior. This will help the company dynamically change pricing strategies, manage stock levels, and provide customers with personalized recommendations.

2. ETL/ELT Processes

Snowflake CDC captures and processes only changed data to optimize ETL/ELT workflows, thus reducing time and resource costs for the maintenance and update of data warehouses for users. For instance, a financial services company can use Snowflake CDC to optimize ETL processes. By capturing real-time changes in customer transactions and market data, it can instantly transform and load this information into Snowflake for analysis, minimizing latency and maximizing operational efficiency.

3. Data Replication

You can use Snowflake CDC when you want to replicate data across environments for backup, disaster recovery, or regional availability, or when consistent data across different systems is required. Snowflake CDC helps guarantee data consistency and reliability across replicated environments. It supports continuous data replication, reducing the risk of data loss or inconsistency.

4. Data Integration for Analytics

A marketing agency, for example, can use Snowflake CDC to integrate data from social media platforms, customer databases, and sales systems. As the data is synchronized in real time, the agency can easily generate campaign performance reports and deliver actionable insights back to its clients.

Advanced Snowflake CDC Techniques

Handling Large Volumes of Changes

Handling huge volumes of changes with high performance is a major Snowflake CDC technique. For massive datasets, it is important to batch changes into manageable chunks for processing; this reduces the load on your system and minimizes latency. Partitioning data and using Snowflake’s auto-scaling can deliver significant performance gains, and parallel processing that balances the workload can further accelerate data synchronization, allowing vast volumes of changes to be processed quickly and accurately.
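
One common way to batch stream consumption is a scheduled task that applies the changes only when the stream has data. A minimal sketch, assuming the objects from the walkthrough above and a hypothetical warehouse named cdc_wh:

create or replace task consume_member_stream
  warehouse = cdc_wh                             -- hypothetical warehouse for CDC processing
  schedule = '5 MINUTE'                          -- consolidate changes into 5-minute batches
  when system$stream_has_data('member_stream')   -- skip runs when there is nothing to process
as
  merge into members_destination d
  using (select * from member_stream
         where not (metadata$action = 'DELETE' and metadata$isupdate)) s
    on d.id = s.id
  when matched and s.metadata$action = 'DELETE' then delete
  when matched and s.metadata$action = 'INSERT' then
    update set d.first_name = s.first_name, d.last_name = s.last_name
  when not matched and s.metadata$action = 'INSERT' then
    insert (id, first_name, last_name) values (s.id, s.first_name, s.last_name);

-- Tasks are created suspended and must be resumed before they start running.
alter task consume_member_stream resume;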

Implementation of Incremental CDC

Incremental CDC is a highly effective method for capturing only the changes made since the last update. It significantly reduces the volume of data that needs to be processed and transmitted. With incremental updates, new or modified records can be identified through timestamps or versioning columns, improving performance while preserving data consistency and accuracy. This approach allows businesses to update their data warehouses with only the changed data, avoiding the overhead of full data reloads and making the process efficient and cost-effective.
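
A minimal sketch of the timestamp-based variant, assuming a hypothetical source table named orders with an updated_at column and a watermark table that records how far each extraction has progressed:

-- Hypothetical watermark table storing the high-water mark per source table.
create table if not exists etl_watermark (
  table_name varchar,
  loaded_through timestamp_ntz
);

-- Pull only the rows changed since the last recorded extraction point.
select *
from orders
where updated_at > (
    select coalesce(max(loaded_through), '1970-01-01'::timestamp_ntz)
    from etl_watermark
    where table_name = 'orders'
);

-- After a successful load, advance the watermark.
insert into etl_watermark values ('orders', current_timestamp()::timestamp_ntz);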

Handling Schema Changes

Schema changes are inevitable in dynamic data environments, so a Snowflake CDC pipeline must handle them gracefully without affecting data integrity. One of the most effective ways to manage this is schema evolution, which allows the pipeline to adapt to source schema changes without interrupting data capture. Operations such as adding new columns, changing data types, and dropping columns should be automated for smooth transitions. Snowflake’s flexible architecture supports dynamic schema changes, ensuring seamless integration with fewer disruptions, and robust schema management practices help data pipelines recover quickly from failures and adapt to change.
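
For example, if the source table from the walkthrough gains a new column, mirroring it on the destination before the next MERGE keeps the pipeline intact. A minimal sketch, using email as a hypothetical new column:

-- The source table gains a hypothetical email column.
alter table members_source add column email varchar(320);

-- Mirror the change on the destination so the column can be propagated.
alter table members_destination add column email varchar(320);

-- The MERGE statement (and any task wrapping it) must also be updated to map the new column.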

Do's and Don'ts of Snowflake CDC

Do's

  1. Enable CDC only on tables that need it
  2. Handle errors with a retry mechanism
  3. Monitor and tune the CDC performance regularly
  4. Test schema changes extensively
  5. Document your CDC processes

Don'ts

  1. Do not ignore the effect of CDC on query performance
  2. Do not skip CDC where real-time data synchronization is needed
  3. Do not neglect data security and compliance
  4. Do not rely on CDC alone as a means of data backup
  5. Do not let CDC configurations go unreviewed; revisit them periodically

Summing It Up

Snowflake CDC is an efficient solution for synchronizing and managing data in real-time. Its ability to track and replicate data modifications with minimal impact on system performance makes Snowflake a suitable choice for Change Data Capture.

If your organization handles millions of transactions daily but only needs to propagate the modified records, consider leveraging CDC in the Snowflake data warehouse. This saves time and resources compared with performing a full load every time. From real-time analytics and database migration to ensuring data consistency across distributed systems, Snowflake Change Data Capture (CDC) is a robust approach.

To sync data from Snowflake into a target database, data warehouse, or data lake, consider using Airbyte’s Snowflake source connector. It supports both full refresh and incremental syncs, so whether you want to copy all rows in the tables or only the data that changed since the last sync, the connector will perform the task efficiently.
