Schema migration is a popular term in both software and data engineering. It refers to controlled changes made to relational databases to meet a desired outcome. Schema migrations are often done to update an outdated schema and move to a new schema. This happens often in software engineering, impacting data engineers and how they work downstream.
I’ve been on the data engineering side way too often, finding out one of my data models has broken because of a change made to a source table. This can be quite frustrating, making it so you need to react fast in order to avoid a chain of broken models, dashboards, and reports. Ideally, schema changes wouldn’t happen at all, or would happen with plenty of notice so you can act accordingly.
Luckily, if using Airbyte to ingest source data, you can use its schema propagation feature. This allows you to control how you want to handle schema changes, ensuring there are never any surprises, even if the software engineers on your team push a change without notifying you.
Airbyte schema propagation
Schema propagation refers to the handling of changes in your source data. Airbyte allows you to choose how to handle source changes, ensuring there are no surprises in your data when a change occurs.
Airbyte does this automatically based on your connection settings so that you don’t need to manually select this each time. You can choose to handle all changes or just column changes, depending on your use case.
Airbyte has a process called DiscoverSchema which runs in the background, comparing the old schema to the new schema. This is actually the same process that runs when you initially create a connection! DiscoverSchema makes it possible to detect any changes at the source and then automatically handle them.
When you configure a connection, it will give you the following options:
Propagate column changes only
“Propagate column changes” refers to the columns that are added or removed when the schema changes. In this case, Airbyte will automatically handle these for you without you needing to step in. This may be ideal for small data teams that may not have time to manually accept changes.
However, you also need to be careful in case a field you are using downstream gets removed. I recommend adding testing on the source if you choose to use this setting. A dbt_expectations expect_column_to_exist test would be great here.
Example of columns being added to a source:
Propagate all changes
“Propagate all changes” includes the column changes above as well as additions or deletions of streams to your schema. A stream refers to a specific data table in a relational database or a route in an API call. For example, Hubspot has different streams for companies, contacts, and deals- 3 of their main objects.
With this setting, Airbyte will automatically handle streams. When a stream is removed, it will do so seamlessly. As mentioned with columns, it is ideal to have a test in place on these tables that notifies you if a stream is removed at the source.
If a stream is added, Airbyte will automatically ingest this data. While you can’t really test for new streams of data, if you do have this setting enabled, I recommend keeping a close eye on data warehouse costs. You want to ensure you aren’t ingesting and paying for the storage of data that you don’t truly need.
Ignore and pause connection
“Ignore” tells Airbyte to ignore any changes made at the source and continue on as usual while “pause” tells Airbyte to pause the connector, preventing any data from being ingested, until you resolve the changes. This ensures your data isn’t changing unless you first review the changes being made.
This may be the best solution to handling schema changes if you are looking to withhold a data contract between software engineering and data engineering. This ensures you have full control over your data, no matter how it is changing at the source.
Notification for when a schema change is detected:
Airbyte will then prompt you to review and accept these changes before making them.
“Pause” is particularly good because it evaluates any breaking changes, pausing if a primary key or cursor is removed. This goes back to the question- would you rather have no data or bad data? Again, adding tests at the source is key to answering this and knowing what is coming downstream.
Airbyte’s ability to handle schema changes via schema propagation is extremely powerful. This will lead to fewer breaking changes, less worry about validating the data as you go, and stronger data contracts.
Airbyte offers a clear UI that keeps you in control of how you want to handle your data and harness these different features. In the end, more control over your data means higher quality data. When you have insight into how changes are being handled at the source, you have tight control over what happens downstream as well.