As data practitioners, we all know the value and importance of dimensional data modeling . The advent of high scale analytical MPP databases like Snowflake and BigQuery, coupled with SQL transformation tools like dbt , have led to a heavy focus on dimensional modeling as a means of preparing data for BI and ML use cases.
While these tools make sophisticated and performant data transformation layers possible, there are still operational and scaling challenges around such pipelines. One common pain point is managing change in models, i.e. deploying modifications to the model logic itself, or simply re-materializing models after source data refreshes.
This post will be a deep dive into managing change in your production data models. It will answer questions like “How do I safely deploy changes to models during the business day without impacting users?”, or “How can I minimize the ‘blast radius’ of a production model or test failure?” Stay tuned for a breakdown of a few different approaches to tackling these challenges, and some related considerations for your data pipeline.
Prerequisites To follow along with this post, you’ll need prerequisite knowledge in a few areas. First, you should be familiar with dimensional data modeling . You should also be aware of modern tooling used to build such data models, such as Snowflake, BigQuery, dbt, or SQLMesh. Next, you should have a basic understanding of Continuous Integration and Continuous Deployment (CI/CD). Finally, you should be familiar with data orchestration as a concept, and how tools like Prefect or Airflow can be used in an ELT pipeline context to automate data processing.
For the purposes of this article, let’s not assume usage of any particular transformation tool or framework (like dbt or SQLMesh ) for building data models. Instead, let’s focus conceptually on the orchestration and CI/CD concepts relevant to deploying such models. and how to handle change management for dimensional data models.
Deploying Changes to your Production Data Models Context & Sample Architecture “Change” can be thought of in two contexts: (1) modifying model logic (CI/CD), or (2) re-materializing models after source data refreshes (scheduled data transformations orchestrated by a workflow automation tool). In either context, having a way to safely deploy models to production is important. Let’s consider the following simple pipeline architecture to be what we’re working with.
Suppose we have a single RAW database where we land all raw data in our warehouse. Some data transformation framework (dbt, SQLMesh, etc) is used to transform that data and materialize it in a production analytics database, which serves consumer services like BI tools. Data transformations can be invoked via a CI/CD pipeline, or via an orchestration tool like Prefect or Airflow, depending on whether model definitions are being modified, or models are just being refreshed with new source data.
Important Considerations for Deploying Models Two important aspects of model deployments are consistency and transparency .
Consistency From the perspective of a data consumer, it is important that related data models which use shared source data are consistent with each other. This means they should all represent a complete picture of the shared source data available at the time of transformation.
To share one example of an undesirable scenario where this is not the case, suppose we have two models, A and B, which reference the same source. A data transformation job is invoked, and A succeeds but B fails. Suppose after this job, model A contains data up through midnight this morning, but B is lagging behind and only contains data up through midnight yesterday. At this point, the two models are inconsistent with each other.
Consuming users will likely be confused about the fact that there is missing data in exposures utilizing model B, while that data is available in exposures utilizing model A. Or, even worse, maybe the consumer doesn’t even realize there is an inconsistency but makes decisions based using the incomplete / incorrect data anyway.
Transparency Transparency for data consumers means knowing what data is available at a given time, and whether that data meets consistency and freshness SLAs.
In the scenario described above, we can still achieve transparency for data consumers by explicitly communicating which exposures (e.g. charts in a dashboard) have partially degraded data quality due to the upstream data transformation failure. However, this can become quite challenging in practice when dealing with complex DAGs. It is easy to overwhelm and confuse consumers with attempts to explain the “state” of the pipeline.
Regardless of how you deploy your transformation jobs, you will likely face tradeoffs between the amount of consistency and transparency you can offer data consumers, as well as the overall complexity and maintainability of your project. To illustrate these tradeoffs, let’s look at a few different methods for deploying changes to models in production.
Running Transforms Directly Against Production Analytics Database A “naive” approach to running your models might be to run them directly against the production analytics database / data marts. This means models will get overwritten or updated with the results of the latest transformation job when each node invocation succeeds.
Many organizations employ this approach, and it probably makes a lot of sense for smaller, less mature data platforms. The prime advantage to this approach is simplicity. All transformations take place in a single database. If a model succeeds, it should never need to be re-built until more data arrives or more model definition changes are made. Another advantage is flexibility - with this approach, it is possible to successfully deploy just some models (i.e. a subgraph) even if not all the models submitted by your transformation job succeed.
Speaking of which, what exactly happens if a data transformation job fails, and what are the implications? Well, that depends greatly on which tools you’re using to orchestrate those transformations. T
ools like dbt and SQLMesh construct a DAG for you just based on the query definitions and references in your models. These tools also have built-in node selector and testing capabilities. This means they offer a lot of sophistication and flexibility in which models get built for a given data transformation job, and which get potentially skipped in the event of a failure.
If your data transformation DAG is more ‘home rolled’, using just templated queries submitted by Airflow jobs for instance, then advanced node selection and model skipping features may not be available. With this said, most modern transformation frameworks do support these constructs, so let’s assume as much for this discussion.
Suppose our transformation framework supports data quality tests, and those tests are implemented as queries which behave like any other node in the DAG. After a tests’ upstream nodes build, the test executes, and if it fails, any downstream nodes are skipped. This is demonstrated in the below diagram.
If data transformations are running directly against the production analytics database and something goes wrong, it can be difficult to know which models were updated and which weren’t. In order to achieve transparency, a sophisticated metadata persistence layer may be necessary.
Another concern with this approach is that it doesn’t provide strong consistency guarantees. If only some models get updated from a given data transformation job due to a model or test failure, then consumers may be working with models containing inconsistent sets of source data.
Blue-Green Deployments One solution to the challenges described above is “blue-green” deployments. “Blue-green” deployments are a software deployment approach which can be used in many different contexts, from provisioning of cloud infrastructure to deploying new versions of an application. AWS gives a good explainer here . In the context of data transformation deployments, we can consider the current version of our production analytics database to be “blue”.
When we need to re-deploy models, we can create a clone of that database (“green”) where transformations can run. If the transformation job succeeds and all tests pass, then we can atomically swap the blue and green databases, thus safely deploying all model changes.
This deployment approach solves both the consistency and transparency issues with the “naive” approach. If the database swap occurs, we know that transformations succeeded and all models are consistent with respect to their source data and each other. We also have clear transparency into the state of the database, because we’ve simplified the deployment to be “all or nothing”.
The major problem with this approach is that deployments will be “monolithic”, in the sense that there can be no partial DAG deployments. Either all models succeed and are deployed, or none of them do. This can be annoying if a build fails for something trivial, like a single duplicate record in a non-critical data model. Maybe that model wasn’t even related to your changeset, but it still held up the whole deployment. To address this rigidness, the blue-green deployment model can be extended further to accommodate a more flexible deployment pattern.
Best of Both Worlds: Making Blue-Green Deployments “Robust” Blue-green data transformation deployments can be made more “robust” by supporting partial subgraph deployments. This refers to a scenario where a model or associated test fails, and its subgraph gets excluded from the set of models that get swapped (“deployed”) into the production database. All other unaffected models will still be deployed. This is a happy medium between the “naive” approach and the “all or nothing” approach. This can be achieved by “rolling back” affected models in the cloned database before swapping.
Using the same example as in the naive deployment section, suppose a uniqueness test fails on a model. Assuming the “skip models downstream of failures” mechanism is in place, we already know those are safe to swap into production (there will be no change).
For the upstream models, it gets a little tricky. If a test fails on a model, we don’t necessarily know if the issue was introduced directly in that model, or somewhere else upstream. Maybe a uniqueness test is failing because a bad many-to-many join was introduced, causing a record fanout. Or maybe duplicates were just added in the source data.
Since the true source of the failure usually isn’t immediately clear, the safest approach is to just rollback the entire upstream subgraph, as well as their children. This guarantees that no bad data gets deployed to production (at least, nothing we are testing for). If all tests were placed optimally 100% of the time, we wouldn’t have to take this conservative rollback approach, but that is a big if.
Before rolling back any models, we must first identify the affected subgraph for a given node invocation failure. This can be challenging and the implementation depends largely on your data transformation framework. Frameworks like dbt can help us out a lot here. For those of you using something else, I’ll leave this as something for you to figure out. For dbt users, here is a practical way to identify the affected subgraph.
For dbt Users: Identifying Affected Subgraphs for Node Failures We can use dbt’s “result” node selector method to do this. For example, to find all nodes which either failed to build, or were associated with a failing test, we can run dbt ls --select “result:error” "1+result:fail" --state path/to/artifacts . This node selector can be extended further using graph operators to get the upstream parents of these nodes, and their children.
Once we have a list of affected models, there are a variety of methods to actually achieve the rollbacks. Here are two potential options:
Loop through every affected model and overwrite it with a clone from the existing production database. This clone will contain the model “as it was” before it was successfully built with potentially bad data from the latest failed data transformation job Re-clone and run the data transformation job again, but this time exclude the affected models I prefer approach #1, because we don’t need to re-clone and re-run any models that already succeeded, thus saving on compute costs. However this approach does come with some more complexity in your CI pipeline and orchestration DAG.
Conclusion Data transformation pipelines can be a moving sea of change and complexity. Between DAGs that look like spaghetti, off-schedule source data refreshes, and unexpected data quality issues, it can be easy to get lost. Hopefully this post provides some guiding principles for how to navigate it all, and ensure change is smooth and seamless for your users.