Article

Unexpected Schema Changes? How Airbyte Schema Propagation Feature Can Help

•

November 9, 2023

•

8 min read

Schema migration is a popular term in both software and data engineering. It refers to controlled changes made to relational databases to meet a desired outcome. Schema migrations are often done to update an outdated schema and move to a new schema. This happens often in software engineering, impacting data engineers and how they work downstream.

I’ve been on the data engineering side way too often, finding out one of my data models has broken because of a change made to a source table. This can be quite frustrating, making it so you need to react fast in order to avoid a chain of broken models, dashboards, and reports. Ideally, schema changes wouldn’t happen at all, or would happen with plenty of notice so you can act accordingly.

Luckily, if using Airbyte to ingest source data, you can use its schema propagation feature. This allows you to control how you want to handle schema changes, ensuring there are never any surprises, even if the software engineers on your team push a change without notifying you.

Airbyte schema propagation

Schema propagation refers to the handling of changes in your source data. Airbyte allows you to choose how to handle source changes, ensuring there are no surprises in your data when a change occurs.

Airbyte does this automatically based on your connection settings so that you don’t need to manually select this each time. You can choose to handle all changes or just column changes, depending on your use case.

Airbyte has a process called DiscoverSchema which runs in the background, comparing the old schema to the new schema. This is actually the same process that runs when you initially create a connection! DiscoverSchema makes it possible to detect any changes at the source and then automatically handle them.

When you configure a connection, you'll see the following options for "Detect and propagate schema changes":

Propagate field changes only

“Propagate field changes only” refers to the fields or columns that are added or removed when the schema changes. In this case, Airbyte will automatically handle these for you without you needing to step in. This may be ideal for small data teams that may not have time to manually accept changes.

However, you also need to be careful in case a field you are using downstream gets removed. I recommend adding testing on the source if you choose to use this setting. A dbt_expectations expect_column_to_exist test would be great here.

Example of columns being added to a source:

Propagate all field and stream changes

“Propagate all field and stream changes” includes the column changes above as well as additions or deletions of streams to your schema. A stream refers to a specific data table in a relational database or a route in an API call. For example, Hubspot has different streams for companies, contacts, and deals- 3 of their main objects.

With this setting, Airbyte will automatically handle streams. When a stream is removed, it will do so seamlessly. As mentioned with columns, it is ideal to have a test in place on these tables that notifies you if a stream is removed at the source.

If a stream is added, Airbyte will automatically ingest this data. While you can’t really test for new streams of data, if you do have this setting enabled, I recommend keeping a close eye on data warehouse costs. You want to ensure you aren’t ingesting and paying for the storage of data that you don’t truly need.

Detect changes and manually approve, or pause connection

The “detect changes and manually approve” option tells Airbyte to detect any changes made at the source and continue on as usual until those changes are manually approved, while “pause connection” tells Airbyte to pause the connector, preventing any data from being ingested, until you resolve the changes. This ensures your data isn’t changing unless you first review the changes being made.

This may be the best solution to handling schema changes if you are looking to withhold a data contract between software engineering and data engineering. This ensures you have full control over your data, no matter how it is changing at the source.

Notification for when a schema change is detected:

Airbyte will then prompt you to review and accept these changes before making them.

‍

Pausing the connection is particularly good because it evaluates any breaking changes, pausing if a primary key or cursor is removed. This goes back to the question- would you rather have no data or bad data? Again, adding tests at the source is key to answering this and knowing what is coming downstream.

Conclusion

Airbyte’s ability to handle schema changes via schema propagation is extremely powerful. This will lead to fewer breaking changes, less worry about validating the data as you go, and stronger data contracts.

Airbyte offers a clear UI that keeps you in control of how you want to handle your data and harness these different features. In the end, more control over your data means higher quality data. When you have insight into how changes are being handled at the source, you have tight control over what happens downstream as well.

Limitless data movement with free Alpha and Beta connectors

Introducing: our Free Connector Program ->

‍

The data movement infrastructure for the modern data teams.

Try a 14-day free trial

About the Author

Madison Schott is an Analytics Engineer, author of The ABCs of Analytics Engineering book. Madison blogs about the modern data stack on Medium and writes a newsletter on Substack.