How Do I Load Only Changed Data (CDC) From a Source System?
Your nightly ETL job processes 10 million rows when only a few hundred customers actually changed. Every run taxes your production database, burns compute credits, and stretches backup windows. You still can't guarantee that deletes or late updates are captured.
Most pipelines rely on full-table reloads or coarse time-window queries. Those tactics move massive volumes of unchanged data and inflate cloud bills. A single full refresh consumes gigabytes of network traffic and locks transactional tables, turning maintenance windows into operational fire drills.
Change Data Capture (CDC) solves this problem. By streaming only the inserts, updates, and deletes recorded in your database logs, CDC moves just changed data without locking source tables while preserving a precise history of every change. The result is leaner pipelines, faster syncs, and an audit trail that satisfies compliance requirements.
What Are the Main Approaches to CDC?
When you decide to stop moving entire tables and start capturing only what changes, four CDC patterns usually come into play. Each one detects modifications differently, and that choice shapes performance, maintenance, and how quickly downstream systems see fresh data.
1. Log-Based CDC
A log-based pipeline tails the database's transaction log (the same file the engine writes for durability) to stream every INSERT, UPDATE, and DELETE in order. Because it never queries production tables, you avoid locking or extra read load, and you get a complete, chronological history of changes.
This approach delivers minimal source impact since reading the log is effectively invisible to applications. It also handles high throughput, processing millions of events per minute.
Downsides:
- Setup is more complex: you need elevated permissions and careful log-retention tuning
- Parsing logic becomes tied to each database's log structure, creating vendor lock-in
- Some SaaS and managed databases don't expose their logs at all, ruling out this approach entirely
Pro Tip: Use log-based CDC when you care about sub-second latency or need to replicate high-volume OLTP systems with zero production slowdown.
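To get a feel for what a log-based consumer actually reads, the PostgreSQL sketch below peeks at a logical replication slot. The slot name and the built-in test_decoding output plugin are assumptions for the demo; production tools like Debezium manage this plumbing for you.
-- Requires wal_level = 'logical'; create a throwaway slot with the test_decoding plugin
SELECT pg_create_logical_replication_slot('cdc_demo_slot', 'test_decoding');
-- Peek at pending changes without consuming them; each row is one INSERT, UPDATE, or DELETE
SELECT lsn, xid, data
FROM pg_logical_slot_peek_changes('cdc_demo_slot', NULL, NULL);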
2. Trigger-Based CDC
Here you create database triggers that fire on every data change and write a record into an audit table. The pipeline then reads that audit stream.
This method works anywhere since nearly every relational engine supports triggers. You can embed custom logic directly in the trigger body, enriching events with user IDs or masking sensitive fields. The shadow tables also double as an internal audit trail for compliance purposes.
Downsides:
- Each transaction executes extra writes, adding a measurable performance hit on busy workloads
- Every schema change requires updating triggers and audit schema, increasing maintenance overhead
- A failing trigger can roll back the production transaction that fired it
Pro Tip: Choose triggers when log access is off-limits but you still need precise, row-level history.
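Here is a minimal PostgreSQL sketch of the pattern, assuming an existing customers table; the shadow table, function, and trigger names are illustrative.
-- Shadow (audit) table that receives one row per change
CREATE TABLE customers_audit (
    change_id   bigserial PRIMARY KEY,
    operation   text NOT NULL,                     -- 'INSERT', 'UPDATE', or 'DELETE'
    changed_at  timestamptz NOT NULL DEFAULT now(),
    row_data    jsonb                              -- full row image at the time of the change
);
-- Trigger function that copies the affected row into the shadow table
CREATE OR REPLACE FUNCTION capture_customer_change() RETURNS trigger AS $$
BEGIN
    IF (TG_OP = 'DELETE') THEN
        INSERT INTO customers_audit (operation, row_data) VALUES (TG_OP, to_jsonb(OLD));
    ELSE
        INSERT INTO customers_audit (operation, row_data) VALUES (TG_OP, to_jsonb(NEW));
    END IF;
    RETURN NULL;  -- AFTER row-level triggers ignore the return value
END;
$$ LANGUAGE plpgsql;
-- Fire after every data change on the source table (assumed to exist)
CREATE TRIGGER customers_cdc
AFTER INSERT OR UPDATE OR DELETE ON customers
FOR EACH ROW EXECUTE FUNCTION capture_customer_change();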
3. Timestamp Columns (Query-Based CDC)
In this pattern you add a last_modified column and periodically query for rows newer than the last sync. It's simple SQL and no special privileges are required.
The benefits are immediate: add one column and a scheduled query and you're capturing changes. No extra infrastructure is needed since everything runs on plain database connections.
Downsides:
- Frequent scans hammer large tables, creating a heavy read load
- Deletes are tricky since vanished rows have no timestamp, so you’ll miss them unless you add soft-delete flags
- Adding a timestamp column may require schema changes and potential downtime in legacy applications
Pro Tip: Poll-based approaches fit smaller datasets or interim projects where real-time speed isn't critical.
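A minimal polling sketch, assuming an orders table with a last_modified column and a :last_sync_time bind parameter saved from the previous run:
-- Pull only rows changed since the saved high-water mark
SELECT *
FROM orders
WHERE last_modified > :last_sync_time
ORDER BY last_modified;
-- Index the timestamp so each poll avoids a full-table scan
CREATE INDEX idx_orders_last_modified ON orders (last_modified);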
4. Diffing Approach
Diffing captures two snapshots, such as yesterday's table and today's, and computes the difference outside the database. It needs no native database features, which is both its strength and its weakness.
This database-agnostic method works even on flat files or systems without triggers or logs. For tiny tables with a few thousand rows, nightly diffs are straightforward and painless.
Downsides:
- Brute-force full scans and comparisons become costly on large datasets
- Latency is high, limited by the snapshot interval
- Changes lose event context — you know only that a row changed, not how or when
Pro Tip: Reserve diffing for niche cases like reference tables or ad-hoc reconciliation where the other three options aren't feasible.
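A minimal diff using standard SQL EXCEPT, assuming two snapshot tables captured on consecutive days (the table names are placeholders):
-- Rows present today but not yesterday: inserts plus the new versions of updated rows
SELECT * FROM products_snapshot_today
EXCEPT
SELECT * FROM products_snapshot_yesterday;
-- Rows present yesterday but not today: deletes plus the old versions of updated rows
SELECT * FROM products_snapshot_yesterday
EXCEPT
SELECT * FROM products_snapshot_today;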
How Do You Implement CDC in Practice?
Your implementation succeeds or fails based on five sequential steps: map your source's change streams, confirm your destination handles merges properly, enable capture, test schema changes, then monitor like any production system.
Skip steps or jump around, and you'll spend weeks debugging issues that proper sequencing prevents.
1. Identify Source Capabilities
Map every source database to its native change stream first. PostgreSQL gives you logical decoding and replication slots, MySQL serves up the binary log, SQL Server has built-in CDC, and Oracle offers LogMiner or GoldenGate. Reading transaction logs keeps you out of hot execution paths and captures every insert, update, and delete with minimal overhead.
Log-based approaches have become the standard for production systems. If log access is locked down, fall back to triggers or timestamp columns, but know the performance costs upfront. Tools like Debezium, Striim, and AutoMQ's log readers connect to these native feeds, so you rarely write parsing code from scratch.
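If you're unsure what a given source already supports, a quick check of its replication settings tells you whether log-based capture is even an option. For example:
-- PostgreSQL: logical decoding requires wal_level = 'logical'
SHOW wal_level;
-- MySQL: log-based CDC needs binary logging enabled in ROW format
SHOW VARIABLES LIKE 'log_bin';
SHOW VARIABLES LIKE 'binlog_format';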
2. Plan Your Destination
Your warehouse or lake must handle incoming changes without full reloads. Snowflake combines Streams for storing change sets with Tasks for automated merges, BigQuery pairs Datastream with native MERGE statements, and Databricks uses Delta Lake Change Data Feed.
Before enabling anything, verify your destination sustains the write rate and that upsert/delete operations fit within maintenance windows. Poor merge performance turns near-real-time pipelines into hourly batch jobs faster than anything else.
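As a rough sketch of what that merge step looks like, the statement below upserts a batch of captured changes into a dimension table. The table names, columns, and is_deleted flag are assumptions for illustration; MERGE of this shape is supported by Snowflake, BigQuery, and Databricks.
-- Hypothetical merge of a CDC change batch into a warehouse table
MERGE INTO dim_customers AS target
USING customer_changes AS source
    ON target.customer_id = source.customer_id
WHEN MATCHED AND source.is_deleted THEN
    DELETE
WHEN MATCHED THEN
    UPDATE SET email = source.email, updated_at = source.updated_at
WHEN NOT MATCHED THEN
    INSERT (customer_id, email, updated_at)
    VALUES (source.customer_id, source.email, source.updated_at);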
3. Configure CDC Capture
With source and target ready, enable the capture mechanism:
-- PostgreSQL: logical replication
ALTER SYSTEM SET wal_level = logical;  -- takes effect only after a server restart
SELECT pg_create_logical_replication_slot('airbyte_slot', 'pgoutput');  -- run after the restart
-- MySQL: row-formatted binary log
SET GLOBAL binlog_format = 'ROW';
After restarting the instance (if required), point your tool at the slot or binlog and record the initial log position to avoid replaying old events. For trigger-based approaches, create "after insert/update/delete" triggers that write to shadow tables; for timestamp polling, add an updated_at column and index it properly to prevent full-table scans.
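One way to record that starting point, assuming PostgreSQL and MySQL sources like those above:
-- PostgreSQL: the slot's confirmed position marks where your pipeline should begin
SELECT slot_name, confirmed_flush_lsn FROM pg_replication_slots;
-- MySQL: note the current binlog file and position before the first sync
SHOW MASTER STATUS;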
4. Validate Schema Changes
Pipelines break during schema evolution more than at any other time. Before going live, simulate common changes (adding columns, changing data types, dropping unused fields) and confirm both the capture and merge steps still work.
Log-based tools usually handle these automatically because transaction logs carry column metadata, while trigger-based or timestamp methods need manual updates to trigger logic or polling queries.
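A staging rehearsal can be as simple as running the kinds of DDL your application teams actually ship; the table and column names below are placeholders (PostgreSQL syntax):
-- Rehearse typical schema evolution against a staging copy before it hits production
ALTER TABLE orders ADD COLUMN coupon_code text;
ALTER TABLE orders ALTER COLUMN amount TYPE numeric(12, 2);
ALTER TABLE orders DROP COLUMN legacy_flag;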
5. Monitor and Maintain
Treat CDC like any critical service: instrument it, alert on failures, and budget maintenance time. Track three key metrics:
- Latency from source commit to destination commit
- Error rate by stream
- Daily volume processed
Watch log retention policies; if the database purges logs faster than you consume them, you get irreversible gaps. Audit replay latency regularly during peak load, clean up idle replication slots so retained logs don't pile up, and restrict log access to your pipeline only.
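On PostgreSQL, for example, one query shows how much log data each replication slot is holding back, which is an early-warning signal for both consumer lag and disk bloat:
-- How much WAL each slot retains; a large, growing value means the consumer is falling behind
SELECT slot_name,
       active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;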
What Are the Best Practices for CDC?
You unlock real value when the pipeline runs as reliably as the data it moves. These practices come from teams that run CDC at scale in production environments.
Start Small With a Pilot
Begin with a single pilot table to prove out latency, merge performance, and rollback procedures before expanding to your entire database. This reduces risk and lets you validate your setup without affecting critical systems. Modern platforms let you flag individual streams for CDC while keeping other tables in full refresh mode, giving you precise rollout control.
Configure Log Retention Carefully
Log retention directly impacts reliability. Pin your database log retention to match your catch-up window. If your consumer falls behind and the database purges log segments it hasn't read yet, you face unrecoverable data loss. Extend retention periods so processes can recover from temporary outages without missing changes.
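The exact knob depends on your database and version. As an illustration, these settings extend retention on PostgreSQL 13+ and MySQL 8.0; the values are examples, not recommendations.
-- PostgreSQL 13+: keep extra WAL on disk so a lagging consumer can still catch up
ALTER SYSTEM SET wal_keep_size = '4GB';
-- MySQL 8.0: keep binary logs for three days; size this to your longest expected outage
SET GLOBAL binlog_expire_logs_seconds = 259200;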
Test Schema Evolution in Staging
Schema changes require careful testing before production. Add columns, modify data types, and simulate deletes in your staging environment first. Even automated schema propagation tools need dry runs to ensure downstream merge logic works correctly. Store configuration files, lineage diagrams, and checkpoint locations in version control so auditors can replay the exact state that produced any report.
Build Robust Error Handling
Small issues can cascade into major incidents without safeguards. Automate retries and implement dead-letter queues for batch failures. Malformed events should flow to a separate queue without halting the stream. Track metrics like lag, change volume, and error rates, with alerts based on latency thresholds rather than only job failures.
Prioritize Security and Recovery
Logs often contain sensitive data. Encrypt streams at rest and restrict access with least-privilege roles on replication slots. Plan for disaster recovery by maintaining archived logs and synced checkpoints. This ensures you can rebuild targets after corruption or regional outages.
Document Troubleshooting Procedures
On-call engineers need clear runbooks for common issues like schema drift, permission changes, and log overflow. Well-documented playbooks allow teams to restore service quickly during incidents.
Following these practices positions you for reliable, compliant replication. Modern platforms with 600+ connectors include the built-in monitoring needed to enforce these practices in day-to-day operations.
How Do You Get Started?
Change Data Capture (CDC) reduces costs and speeds up pipelines by syncing only inserts, updates, and deletes instead of reloading unchanged rows. It also provides a complete audit trail, helping you meet GDPR or SOX compliance requirements.
Start with a single low-risk table to validate latency, merge performance, and rollback procedures. Enable log-based replication in your connector. Platforms like Airbyte use Debezium to capture database logs and add _ab_cdc_ metadata for easy replay or audits. Monitor latency, errors, and change volume to confirm accuracy before expanding.
Once the pilot runs smoothly, scale out to more tables and tighter sync schedules. The workflow is consistent whether you run cloud or self-managed. Get started with CDC using Airbyte and keep your data moving reliably at scale.
Frequently Asked Questions
What is the main benefit of using CDC over full table reloads?
CDC moves only inserts, updates, and deletes, which reduces network traffic, compute usage, and database load. This makes pipelines faster, cheaper, and more reliable compared to full-table reloads that process large volumes of unchanged data.
Which CDC method should I choose for my system?
The right method depends on your environment. Log-based CDC is best for high-volume, low-latency replication. Triggers work when log access is blocked but row-level precision is still required. Timestamps are useful for smaller datasets where latency is less critical, while diffing is a last resort for small reference tables or systems without log or trigger support.
How do I handle deletes when using timestamp-based CDC?
Deletes are not naturally captured by timestamps since removed rows have no updated value. You can address this by adding a soft-delete flag column or combining timestamps with another mechanism like audit tables to ensure deletes are properly tracked.
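A minimal sketch of the soft-delete pattern, assuming a customers table with is_deleted and last_modified columns:
-- Flag the row and bump the timestamp instead of physically deleting it
UPDATE customers
SET is_deleted = TRUE,
    last_modified = now()
WHERE customer_id = 42;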
What are common pitfalls to avoid with CDC?
Teams often overlook log retention policies, schema evolution, and error handling. If logs are purged before the pipeline consumes them, data loss occurs. Schema changes can break capture if not tested in staging. Without robust monitoring and retries, small issues can cascade into full pipeline failures.