Snowflake CDC: A Comprehensive Guide to a seamless Data Sync

January 29, 2024
10 min read

In today’s business operational systems, rapidly evolving data makes it necessary to capture and act on data changes as they occur. This is critical if your organization relies on evidence-based, real-time decision-making. By extracting valuable insights from up-to-date data, you can make smarter, informed decisions to drive business growth.

Snowflake, the popular cloud data platform, supports Change Data Capture (CDC), allowing businesses to track and replicate data modifications efficiently. If you want to manage your data better, consider using Snowflake CDC. Let’s look into the details of CDC in Snowflake.

An Overview of Snowflake

Developed in 2012, Snowflake is a fully managed cloud-based data warehousing platform that provides Platform-as-a-Service to store, analyze, and manage vast amounts of data. Snowflake service is available across Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP) infrastructure, providing an unbounded platform for data storage and retrieval.

The  Snowflake architecture consists of three key layers:

  • Database Storage: Data that is loaded into Snowflake will be reorganized into its internal, optimized, and compressed columnar format. Snowflake then stores this optimized data in cloud storage.
  • Query Processing: Snowflake processes queries in its processing layer using virtual warehouses; each virtual warehouse is a Massively Parallel Processing (MPP) compute cluster composed of multiple compute nodes.
  • Cloud Services: A collection of services that coordinates activities across Snowflake. The cloud services layer also runs on compute instances provisioned by Snowflake from the cloud provider.

Some of the services managed in this layer include authentication, access control, infrastructure management, and metadata management. Let’s look at some of the features that make Snowflake a popular data warehousing solution in the market.

  • Scalability: Snowflake separates its storage and compute units. This allows you to scale each independently and cost-effectively, paying only for the services you use.
  • Semi-Structured Data Support: Snowflake has the ability to handle unstructured and semi-structured data. The variety of supported data types includes JSON, Avro, Parquet, etc.
  • Standard and Extended SQL Support: Snowflake supports most of the SQL DDL and DML commands for querying data. Apart from the standard insert, delete, and update statements, it also supports transactions, lateral views, stored procedures, and statistical aggregate functions.
  • Security: Snowflake offers several security features for data security. This includes multi-factor authentication, OAuth, SSO via federated authentication, and role-based access control.

What is Change Data Capture?

Change Data Capture (CDC) is the process of capturing changes made to data in a database and delivering those changes to a downstream system or process in real time. It ensures that any inserts, updates, and deletes in a database are automatically detected and processed. Rather than processing the entire dataset, CDC systems capture the specific data that has been altered, making it an efficient data synchronization approach.

Change Data Capture encompasses a set of software design patterns used to detect and track changes in a database. It triggers an event connected with the data, causing the execution of a specific action for each CDC occurrence.

The primary purpose of Change Data Capture (CDC) is to enable real-time data integration and maintain up-to-date data across systems. This facilitates timely data synchronization and real-time analytics. You can keep team members updated in near real-time and send the latest data updates to BI tools.

The four CDC approaches are as follows:

  • Log-based CDC: Considered the most efficient way to implement CDC, log-based CDC reads the transaction log of the source database to capture new transactions. This CDC type has minimal impact on the source system.
  • Query-based CDC: This requires you to run queries to detect changes in the database. Query-based CDC retrieves only the rows that have changed since the last extraction. It is more invasive to the source systems and is resource-intensive.
  • Trigger-based CDC: This method relies on database triggers that detect changes, creating a change log in shadow tables, which provide a detailed log of all transactions. It requires multiple writes for each row’s data modification.
  • Polling-based CDC: The source database is polled at regular intervals in polling-based CDC; changes are detected and stored in a separate table. The changes are then propagated to the target system. This approach may cause increased latency and possibly miss some changes if the polling interval is too long.

Here are some key benefits of Change Data Capture (CDC):

  • It facilitates zero-downtime database migrations and supports real-time analytics, fraud protection, and data synchronization across geographically distributed systems.
  • Change Data Capture (CDC) ensures the data in multiple systems stays in sync. This is especially important for making time-sensitive decisions in a high-velocity data environment.
  • Change Data Capture can efficiently move data across a wide area network, making it suitable for the cloud.
  • It eliminates the need for bulk data updating and inconvenient batch windows.

What is Snowflake CDC?

Snowflake Change Data Capture (CDC) is a technique used to track and capture changes made to the data within Snowflake data warehouse and transmit them to other systems, databases, or applications for real-time synchronization. It is a log-based CDC that is executed using change streams.

A stream takes logical snapshots of external tables, tables, or views of a source object. It can also record the information of DML changes to source objects in Snowflake. The change stream, which is a log of all data modifications in Snowflake, is continuously updated as new changes occur.

The data capture process begins with a CDC-enabled table creation in Snowflake. This table is configured to capture data changes through Snowflake CDC and contains all the changed data that must be replicated. Any changes made to the table are automatically captured and sent to the target database in real time.

After a stream is created for the table, additional columns are added to the source table. The new columns store metadata tracking data changes in the source table. Any changed data that a stream tracks must be consumed or moved to permanent storage within the retention period. Otherwise, the changes won’t be accessible, and a new stream will be created to track further data changes from that point.

What is a Snowflake Stream?

A Snowflake stream, also known as a table stream, is an object that enables Snowflake Change Data Capture by tracking DML changes made to a source table in Snowflake. The metadata of each change is stored and later used to retrieve the changed data when you query the stream.

Snowflake streams capture an initial snapshot of each row of the tables in the source object. Then, it initializes an offset as the object’s current transactional version. After this snapshot is taken, every time you modify data in your source table, streams will enable Change Data Capture.

The stream’s change tracking system will record DML-change information, reflecting the state of a row before and after the change. Such information typically includes the source object’s column structure and additional metadata columns that describe each change event. The additional columns returned for any query to a stream include:

  • METADATA$ACTION: Column value indicates the kind of DML operation (INSERT, DELETE) recorded. An UPDATE statement is represented by two-row entries for INSERT and DELETE.
  • METADATA$ISUPDATE: Indicates whether the row entry was part of an UPDATE. The value is TRUE if the entries were part of an update; else, it’s FALSE.
  • METADATA$ROW_ID: The unique and immutable row ID used to track changes on a particular row over time.

How to set up Snowflake CDC with Streams?

To start with CDC Snowflake, first log in to Snowflake Web-Based UI or Snow SQL. Then, follow these steps:

Step 1: Create Database CDC Stream

Run the following command:

create or replace database CDC_STREAM;
use CDC_STREAM;

Step 2: Create Source Table

To create a source table, run the following command:

create or replace table members_source (id int,first_name varchar (200),last_name varchar (200) ); 

Step 3: Create Destination Table

Similarly, to create a destination table, run the following command:

create or replace table members_destination (id int,first_name varchar (200),last_name varchar (200) );

Step 4: Create Stream to Track Changes

After the source and destination tables have been created, you must create a stream on top of the source table to track any changes made in the table.

create or replace stream member_stream on table members_source;

Step 5: Populate the Source Table with Some Data

For demonstration purposes, let’s add some records to the members_source table. Run the following command to do this:

insert into members_source values (1,’Wayne’,’Bell’);
insert into members_source values (2,’Anthony’,’Allen’);
insert into members_source values (3,’Eric’,’Henderson’);
insert into members_source values (4,’Jimmy’,’Smith’);
insert into members_source values (5,’Diana’,’Wheeler’);
insert into members_source values (6,’Karen’,’Hall’);
insert into members_source values (7,’Philip’,’Rodriguez’);
insert into members_source values (8,’Ashley’,’Bryant’);
insert into members_source values (9,’Norma’,’Grant’);
insert into members_source values (10,’Helen’,’Lewis’);
insert into members_source values (11,’Larry’,’Mccoy’);
insert into members_source values (12,’Emily’,’Wood’);
insert into members_source values (13,’Patrick’,’Alvarez’);

Step 6: View the Change Log in the Stream

Before proceeding with the next step, view the change log in the stream with the following command:

select * from member_stream;

Step 7: View Results

Since this is an initial load, everything gets recorded as inserts. The records and metadata fields will be as follows:

The destination table will not have any records yet; these records will be in the stream.

Step 8: Use MERGE Statement to Move Records

Here’s the command to merge into the destination everything that came from the source table from the stream.

This will move the data to the destination, and there will be nothing in the stream. The stream object becomes empty after the MERGE command is run. This cycle continues, with the stream continuously recording the changes that happen in the members_source table.

Step 9: View Destination Table

To view the destination table for the updated records, run the following command:

select * from members_destination;

The destination table view will look like this:

Step 10: Make Changes in Source Table

To demonstrate Change data capture in Snowflake, make a change in the source table by updating a record. Then, observe the stream. Here’s how you can update the first record from Bell to Wright or ID equals to 1.

update members_sourceset last_name=’Wright’where id=1;

Now, when you observe the stream, you’ll find two records—INSERT and DELETE. This is because the original last name was deleted for id=1, and the new last name was updated. Here’s what the stream will look like:

Step 11: Use MERGE Command to Update Destination Table

Next, update the destination table with the MERGE command:

Step 12: View Results

Finally, run the following command to view the destination table:

select * from members_destination;

You can observe that the changes made to the source table have been updated in the destination table. This completes the Snowflake Change Data Capture process, which repeats every time there is a change in the source table.

Summing It Up

Snowflake CDC is an efficient solution for synchronizing and managing data in real-time. Its ability to track and replicate data modifications with minimal impact on system performance makes Snowflake a suitable choice for Change Data Capture.

If your organization handles millions of transactions on a daily basis, but you want to update only the modified ones, consider leveraging CDC in Snowflake data warehouse. This will save you time and resources instead of having to perform a full load. From real-time analytics and database migration to ensuring data consistency across distributed systems, Snowflake Change Data Capture (CDC) is a robust approach.

To sync data from Snowflake into a target database, data warehouse or data lake, etc. consider using Airbyte’s Snowflake source connector. It supports full refresh and incremental syncs. Whether you want to copy all rows in the tables or only the changed data every time a sync is run, this connector will perform the task efficiently.

Limitless data movement with free Alpha and Beta connectors
Introducing: our Free Connector Program
The data movement infrastructure for the modern data teams.
Try a 14-day free trial