Snowflake CDC: A Comprehensive Guide to Seamless Data Sync
In today’s always-on economy, real-time data synchronization is critical for business success. Organizations require data synchronization across analytics platforms, operational dashboards, and AI models. Change Data Capture (CDC)—specifically Snowflake CDC—provides a low-latency, cost-effective solution to keep every environment synchronized while processing only the changed data.
This guide will demystify Snowflake’s stream-based change tracking system, walk through an end-to-end setup, and explore advanced techniques, troubleshooting tactics, and best practices for maintaining data integrity at scale.
Quick Definition: A Snowflake stream object takes logical snapshots of underlying tables, records Data Manipulation Language (DML) events (INSERT, UPDATE, DELETE), and exposes those changes so your data pipelines can propagate them downstream, often within seconds.
An Overview of Snowflake
Founded in 2012, Snowflake is a fully managed cloud data warehouse designed for massively parallel processing (MPP) and elastic scaling. It runs on AWS, Azure, and Google Cloud, abstracting away infrastructure so data teams can focus on analytics instead of management.
Snowflake’s architecture is made up of three layers:
- Database Storage – columnar, compressed data kept in cloud storage locations.
- Query Processing – independent virtual warehouses that auto-scale compute.
- Cloud Services – security, metadata, and task orchestration.
Why it Matters for CDC:
- Independent scaling of storage vs. compute lets you isolate heavy CDC merges in a dedicated virtual warehouse.
- Extended SQL support simplifies complex merge logic, including statistical aggregate functions.
- Native semi-structured data support allows change capture for formats like JSON or Parquet, making it easier to work with data lakes and external tables.
What is Change Data Capture (CDC)?
Change Data Capture (CDC) refers to the practice of identifying and propagating row-level changes from a source database to a target system in near real time. Instead of copying all the data, CDC transfers only the modified rows, reducing network load, compute requirements, and time-to-insight.
There are four common techniques for CDC:
- Log-based CDC – reads database redo/transaction logs with minimal source impact.
- Query-based CDC – polls for differences using SQL; higher overhead.
- Trigger-based CDC – uses database triggers to populate shadow tables.
- Polling-based CDC – scheduled extracts; simple but less real-time.
Key Benefits:
- Zero-downtime migrations.
- Real-time data synchronization.
- Guaranteed data consistency across distributed systems.
What is Snowflake CDC?
Snowflake CDC leverages Snowflake’s native, log-based change tracking, powered by streams—lightweight metadata objects that reference an underlying table, view, or external table. A Snowflake stream:
- Creates an initial snapshot of the source object’s structure and current transactional version.
- Records DML changes as they happen (changed data).
- Exposes additional metadata columns such as METADATA$ACTION, METADATA$ISUPDATE, and METADATA$ROW_ID to facilitate downstream processing.
Streams are efficient because they capture logical snapshots rather than copying table data, keeping overhead minimal. A stream’s position (offset) advances when its contents are consumed in a DML transaction, akin to Kafka’s offset mechanism. Retention Tip: consume streams within their retention period (typically 1–14 days); after that, a stream goes stale and must be recreated unless you extend retention.
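If consumers may lag, you can lengthen how long change data stays available on the source table. A minimal sketch (the table name members and the 14-day values are illustrative):

```sql
-- DATA_RETENTION_TIME_IN_DAYS governs Time Travel on the table;
-- MAX_DATA_EXTENSION_TIME_IN_DAYS caps how far Snowflake may extend
-- retention to keep dependent streams from going stale
ALTER TABLE members SET DATA_RETENTION_TIME_IN_DAYS = 14;
ALTER TABLE members SET MAX_DATA_EXTENSION_TIME_IN_DAYS = 14;
```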
What is a Snowflake Stream?
A table stream (or standard stream) is defined directly on a source object with a CREATE STREAM statement.
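A minimal sketch, assuming an existing table named my_table (names are illustrative):

```sql
-- Standard stream: records inserts, updates, and deletes on the table.
-- Setting APPEND_ONLY = TRUE instead captures only inserts.
CREATE OR REPLACE STREAM my_stream ON TABLE my_table;
```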
Streams record:
- METADATA$ACTION – INSERT / DELETE (UPDATE shows as DELETE + INSERT).
- METADATA$ISUPDATE – Boolean flag for updates.
- METADATA$ROW_ID – A unique row-level identifier (immutable row id).
Streams can also work with external tables and directory tables for staged files, enabling more versatile data capture across various data sources.
How to Set Up Snowflake CDC with Streams
Here’s a simple setup for capturing database events from a source table and applying them to a destination table. All object names below (database cdc_demo, source table members, destination table members_target, stream members_stream) are illustrative.
Step 1 – Create a Database
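A minimal sketch:

```sql
-- Create a database for the demo and use its default PUBLIC schema
CREATE OR REPLACE DATABASE cdc_demo;
USE SCHEMA cdc_demo.public;
```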
Step 2 – Create a Source Table
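A possible source table for the walkthrough:

```sql
-- Source table that upstream applications write to
CREATE OR REPLACE TABLE members (
    id   INT PRIMARY KEY,
    name STRING,
    fee  NUMBER(10,2)
);
```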
Step 3 – Create a Destination Table
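The destination mirrors the source schema; CREATE TABLE ... LIKE copies the column definitions without any data:

```sql
-- Destination table that the CDC pipeline keeps in sync
CREATE OR REPLACE TABLE members_target LIKE members;
```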
Step 4 – Create a Stream to Track Changes
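A standard stream on the source table:

```sql
-- Records every INSERT, UPDATE, and DELETE on members from this point on
CREATE OR REPLACE STREAM members_stream ON TABLE members;
```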
Step 5 – Populate the Source Table
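Sample rows (values are illustrative):

```sql
INSERT INTO members (id, name, fee) VALUES
    (1, 'Amy Rose',   0),
    (2, 'Brian Chu',  10),
    (3, 'Carla Diaz', 20);
```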
Step 6 – View the Change Log
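Querying the stream returns the changed rows plus the metadata columns described earlier:

```sql
SELECT id, name, fee,
       METADATA$ACTION, METADATA$ISUPDATE, METADATA$ROW_ID
FROM members_stream;
```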
Step 7 – Merge Into the Destination Table
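One common pattern, adapted from Snowflake’s stream-consumption examples: filter out the DELETE half of each UPDATE pair so every source row matches the target at most once, then route true deletes, updates, and inserts:

```sql
MERGE INTO members_target t
USING (
    -- Skip the DELETE row of each UPDATE's DELETE + INSERT pair
    SELECT * FROM members_stream
    WHERE NOT (METADATA$ACTION = 'DELETE' AND METADATA$ISUPDATE)
) s
ON t.id = s.id
WHEN MATCHED AND s.METADATA$ACTION = 'DELETE' THEN
    DELETE
WHEN MATCHED AND s.METADATA$ACTION = 'INSERT' THEN
    UPDATE SET t.name = s.name, t.fee = s.fee
WHEN NOT MATCHED AND s.METADATA$ACTION = 'INSERT' THEN
    INSERT (id, name, fee) VALUES (s.id, s.name, s.fee);
```

Consuming the stream in this DML statement advances its offset, so querying members_stream again returns no rows until new changes arrive.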
Step 8 – Update the Source Table
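For example:

```sql
-- Modify one row; the stream records it as a DELETE + INSERT pair
UPDATE members SET fee = 40 WHERE id = 1;
```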
Step 9 – Merge Again
Run the same merge logic; only the modified row is processed.
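If you want to confirm there are unconsumed changes first, Snowflake’s SYSTEM$STREAM_HAS_DATA function can gate the merge:

```sql
-- Returns TRUE while the stream holds unconsumed change records
SELECT SYSTEM$STREAM_HAS_DATA('MEMBERS_STREAM');
-- Then re-run the MERGE statement from Step 7
```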
Step 10 – Verify the Destination
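A simple check:

```sql
-- The updated fee for id = 1 should now appear in the destination
SELECT * FROM members_target ORDER BY id;
```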
Managing Stream Offsets and Data Consistency
When implementing Snowflake CDC, maintaining data consistency across all data sources and tables is central to a robust system, and that depends on understanding the stream offset and the source object's current transactional version. The stream offset tracks which changes have already been processed and when, preserving the integrity of the data pipeline by preventing missed or duplicated changes.
Stream Offset and Transactional Version
Each stream object captures a logical snapshot of the underlying table's current transactional version. This allows CDC processes to track precisely which data changes need to be propagated downstream. When a merge command is executed, it references the stream offset to determine which changes have been processed, ensuring that the pipeline doesn't reprocess already applied data or miss any updates.
By managing the current offset properly, Snowflake CDC ensures that all data is accurately propagated to the destination, preserving data consistency across environments. This is vital in preventing discrepancies, especially when multiple streams are in play or when working with insert-only streams that do not track deletions but instead allow appending new records.
Best Practices for Stream Offsets and Updates
- Always keep track of the stream offset to ensure you’re processing the most recent changes without duplicating data.
- Utilize update statements within CDC processes to modify only the records that have changed since the last snapshot.
- For insert-only streams, ensure that the logic doesn’t rely on deletes but instead propagates only the inserted data.
- When applying the merge command, ensure that all data is matched correctly between the source and destination tables, updating or inserting as necessary while maintaining data integrity (see the transactional sketch below).
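One way to guarantee neither loss nor duplication is to consume the stream inside an explicit transaction: the offset advances only if the transaction commits. A minimal sketch using the walkthrough's objects (the update-pair filtering from Step 7 is omitted for brevity):

```sql
BEGIN;

-- If anything fails before COMMIT, the transaction rolls back and the
-- stream offset does NOT advance, so no change data is lost
MERGE INTO members_target t
USING members_stream s
    ON t.id = s.id
WHEN MATCHED THEN UPDATE SET t.name = s.name, t.fee = s.fee
WHEN NOT MATCHED THEN INSERT (id, name, fee) VALUES (s.id, s.name, s.fee);

COMMIT;
```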
In short, disciplined handling of offsets, transactional versions, update statements, and insert-only streams is what keeps a Snowflake CDC pipeline consistent: every change is propagated downstream once, and only once.
Prerequisites & Environmental Setup
Before enabling change tracking, ensure the following:
- Roles & Privileges: CREATE STREAM and SELECT on the source table; INSERT, UPDATE, and DELETE on the destination (a MERGE requires all three plus SELECT); USAGE on the virtual warehouse, database, and schema.
- Warehouse Sizing: Assign a dedicated compute cluster for CDC merges, or use Snowflake’s multi-cluster warehouses to avoid contention.
- Primary Keys: Ensure the source object has a reliable primary key.
- Retention Settings: Configure DATA_RETENTION_TIME_IN_DAYS and MAX_DATA_EXTENSION_TIME_IN_DAYS on the source table so streams do not go stale before they are consumed.
- Stored Procedures & Tasks: Automate merges using tasks, or encapsulate the logic in stored procedures for reusable data pipelines (a task sketch follows this list).
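A sketch of such a task, assuming the walkthrough's objects and a warehouse named cdc_wh:

```sql
-- Run every five minutes, but only when the stream actually has data
CREATE OR REPLACE TASK apply_member_changes
    WAREHOUSE = cdc_wh
    SCHEDULE  = '5 MINUTE'
WHEN SYSTEM$STREAM_HAS_DATA('MEMBERS_STREAM')
AS
    MERGE INTO members_target t
    USING members_stream s ON t.id = s.id
    WHEN MATCHED THEN UPDATE SET t.name = s.name, t.fee = s.fee
    WHEN NOT MATCHED THEN INSERT (id, name, fee) VALUES (s.id, s.name, s.fee);

-- Tasks are created suspended; resume to start the schedule
ALTER TASK apply_member_changes RESUME;
```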
Use Cases for Snowflake CDC
1. Real-Time Data Warehousing
Monitor inventory or fraud signals instantaneously by combining multiple streams into fact tables.
2. ETL/ELT Optimization
Capture only the changed data instead of running bulk extracts, cutting run-times and costs.
3. Cross-Region Replication
Use CDC to keep Snowflake accounts synchronized across regions or clouds.
4. Micro-Batch Machine Learning
Feed historical data and latest changes to feature stores without full reloads.
Advanced Snowflake CDC Techniques
Common Challenges & Best-Practice Solutions
Troubleshooting Cheat-Sheet
1. Check Stream Health:
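For example:

```sql
-- SHOW STREAMS includes a STALE column; a stale stream must be recreated
SHOW STREAMS LIKE 'members_stream';

-- TRUE means the stream still holds unconsumed change records
SELECT SYSTEM$STREAM_HAS_DATA('MEMBERS_STREAM');
```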
2. Investigate Task Failures:
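The INFORMATION_SCHEMA.TASK_HISTORY table function shows recent runs and their errors (the task name is illustrative):

```sql
SELECT name, state, error_message, scheduled_time
FROM TABLE(INFORMATION_SCHEMA.TASK_HISTORY())
WHERE name = 'APPLY_MEMBER_CHANGES'
ORDER BY scheduled_time DESC;
```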
3. Validate Row Counts:
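A quick comparison once the stream has been fully consumed:

```sql
SELECT
    (SELECT COUNT(*) FROM members)        AS source_rows,
    (SELECT COUNT(*) FROM members_target) AS target_rows;
```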
4. Rebuild Stream (last resort):
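Note that rebuilding discards anything the old stream had not yet delivered:

```sql
-- Recreating the stream resets its offset to the table's current version;
-- any unconsumed changes recorded by the old stream are lost
CREATE OR REPLACE STREAM members_stream ON TABLE members;
```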
5. Monitor Costs & Performance:
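For example, warehouse credit consumption over the past week (the ACCOUNT_USAGE views lag real time by up to a few hours):

```sql
SELECT warehouse_name, SUM(credits_used) AS credits
FROM snowflake.account_usage.warehouse_metering_history
WHERE start_time >= DATEADD('day', -7, CURRENT_TIMESTAMP())
GROUP BY warehouse_name
ORDER BY credits DESC;
```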
Do’s and Don’ts of Snowflake CDC
Do’s
- Implement retry logic for merges in stored procedures (a sketch follows this list).
- Periodically vacuum old partitions and archives in cloud storage.
- Test schema changes in a staging environment.
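A hedged sketch of the retry pattern from the first bullet, written in Snowflake Scripting (object names are illustrative, and the update-pair filtering from Step 7 is omitted):

```sql
CREATE OR REPLACE PROCEDURE merge_member_changes_with_retry()
RETURNS STRING
LANGUAGE SQL
AS
$$
DECLARE
    attempts INTEGER DEFAULT 0;
BEGIN
    WHILE (attempts < 3) DO
        BEGIN
            MERGE INTO members_target t
            USING members_stream s ON t.id = s.id
            WHEN MATCHED THEN UPDATE SET t.name = s.name, t.fee = s.fee
            WHEN NOT MATCHED THEN INSERT (id, name, fee)
                VALUES (s.id, s.name, s.fee);
            RETURN 'merged on attempt ' || (attempts + 1);
        EXCEPTION
            WHEN OTHER THEN
                attempts := attempts + 1;  -- swallow the error and retry
        END;
    END WHILE;
    RETURN 'failed after 3 attempts';
END;
$$;
```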
Don’ts
- Don’t rely on CDC as your only backup—keep point-in-time snapshots.
- Don’t ignore monitoring; streams don’t alert you by themselves.
Summing It Up
Snowflake CDC provides a robust, scalable solution for real-time data synchronization. By leveraging streams, tasks, dynamic tables, and careful privilege management, data teams can build fault-tolerant pipelines that scale effortlessly. Ready to move data beyond Snowflake? Check out Airbyte’s Snowflake source connector to replicate CDC data into any lakehouse, database, or analytics platform—and keep your entire ecosystem in sync.