Data pipelines are complex pieces of software with many moving parts. Given the size, richness, and complexity of modern datasets and software, you are all but guaranteed to encounter numerous errors and data quality issues in your pipelines, no matter your industry or experience level. These errors and quality issues can be very damaging to your business or team, so it's a good idea to get ahead of these challenges by testing and monitoring pipeline behavior.
This post will define various concepts related to data pipeline testing and monitoring, describe their benefits, and review some effective strategies for achieving these benefits.
Defining Testing, Monitoring, Alerting, and Observability

Let's start by defining some basic concepts in data pipeline management. These concepts are interrelated and overlap in their goals, benefits, and implementation.
Data pipeline testing can be defined as the process of evaluating the flow of data from source to destination, and ensuring that the extraction, loading, and transformation processes occur without error and as expected. This definition focuses on the flow of data and the successful integration of various pipeline components. Another important aspect is data quality testing, which Monte Carlo defines as “the process of validating that key characteristics of a dataset match what’s expected prior to consumption by downstream users”. This definition focuses on the data itself, which can of course be impacted by the various pipeline processes, but isn’t always in data teams’ control. Sometimes (often) data arrives with quality issues before it has even flowed through any pipeline.
Data monitoring and alerting fall under the umbrella of data observability. Bigeye defines data observability as “the ability of an organization to see and understand the state of their data at all times.” A data pipeline is observable if its creators, maintainers, and consumers have transparent visibility into the flow, quality, and state of the data it consumes and produces at all times. This means knowing the status of the pipeline (e.g. stable, degraded, or failed), as well as the characteristics and quality of its data.
In order for a pipeline to be observable, it is important that it produces metadata describing its own internal state and status. Data monitoring refers to the continuous observation and tracking of such metadata to understand pipeline status, as well as observation of the core data the pipeline is operating on. Monitoring enables proactive identification of pipeline errors and anomalies, and allows data teams to respond to and resolve such issues. Alerting refers to automated notification systems for surfacing patterns identified by monitoring tools via channels such as email or messaging applications like Slack.
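For example, a minimal alerting hook might post pipeline status messages to Slack via an incoming webhook. Here is a rough Python sketch; the webhook URL, pipeline name, and message wording are all placeholders, not a prescribed format:

```python
import requests

# Placeholder Slack incoming-webhook URL (substitute your own).
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"

def send_alert(pipeline: str, status: str, detail: str) -> None:
    """Post a simple pipeline alert to a Slack channel via an incoming webhook."""
    message = f":rotating_light: Pipeline `{pipeline}` is {status}: {detail}"
    response = requests.post(SLACK_WEBHOOK_URL, json={"text": message}, timeout=10)
    response.raise_for_status()

# Example: alert when a monitoring check flags a degraded pipeline.
# send_alert("orders_ingest", "degraded", "last run took 11m (expected ~2m)")
```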
Types of Data Pipeline Monitoring & Testing

Pipeline Health, Uptime, and Performance

Pipeline health or uptime monitoring involves continuously checking on pipeline status to ensure correct functioning without interruptions or performance degradation.
For example, suppose a pipeline is expected to ingest and transform new data every 10 minutes, and each invocation should normally complete in 2 minutes. This pipeline is considered “healthy” if it is indeed successfully ingesting and transforming new data on schedule (every 10 minutes), on a consistent basis. It is considered “degraded” if it is intermittently failing, or experiencing performance issues (for instance, runtime is taking much longer than the expected 2 minutes).
Performance and runtime are important elements of pipeline health. If a pipeline is taking longer than expected to complete, costing more than expected, or otherwise in breach of any Service Level Agreements (SLAs), it is not fully healthy. This may or may not affect pipeline uptime, which is defined as the percentage of time a pipeline is operational and available for data processing.
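To make this concrete, here is a rough Python sketch of a health check that classifies a pipeline as healthy, degraded, or failed from its most recent run metadata. The thresholds simply mirror the 10-minute / 2-minute example above and would vary by pipeline:

```python
from datetime import datetime, timedelta, timezone

# Illustrative thresholds matching the example above: runs every 10 minutes,
# each expected to finish in about 2 minutes.
EXPECTED_INTERVAL = timedelta(minutes=10)
EXPECTED_RUNTIME = timedelta(minutes=2)

def classify_health(last_success: datetime, last_runtime: timedelta) -> str:
    """Classify pipeline health from its most recent run metadata."""
    now = datetime.now(timezone.utc)
    if now - last_success > 3 * EXPECTED_INTERVAL:
        return "failed"      # several scheduled runs have been missed
    if now - last_success > EXPECTED_INTERVAL or last_runtime > 2 * EXPECTED_RUNTIME:
        return "degraded"    # running late or much slower than expected
    return "healthy"

print(classify_health(
    last_success=datetime.now(timezone.utc) - timedelta(minutes=8),
    last_runtime=timedelta(minutes=5),
))  # -> "degraded"
```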
Data Quality

Data quality testing and monitoring focuses on making assertions about the data being processed, rather than the flow of data or the integrations between pipeline components. With that said, both of those things can also affect data quality. Data quality is most often considered during the transformation stage of a typical ELT pipeline. This is because the goal of data transformation is to produce datasets that are usable for analytics and reporting use cases, and such efforts typically require reliable, trustworthy, error-free data.
Some common examples of data quality metrics and tests are missing values (null counts), uniqueness/duplicates, and referential integrity. These are often baseline “table stakes” metrics to measure, because they are easy to implement and can tell you a lot about the quality of your data.
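As a rough illustration, here is what these baseline checks might look like in Python with pandas; the orders and customers tables are hypothetical, and real pipelines usually run equivalent checks in SQL or a testing framework:

```python
import pandas as pd

# Hypothetical tables used only for illustration.
orders = pd.DataFrame({"order_id": [1, 2, 2, 4], "customer_id": [10, 11, 11, 99]})
customers = pd.DataFrame({"customer_id": [10, 11, 12]})

# Missing values: count nulls per column.
null_counts = orders.isnull().sum()

# Uniqueness: the primary key should have no duplicates.
duplicate_keys = orders["order_id"].duplicated().sum()

# Referential integrity: every order should reference an existing customer.
orphaned = ~orders["customer_id"].isin(customers["customer_id"])

print("null counts:\n", null_counts)
print("duplicate primary keys:", duplicate_keys)
print("orders with unknown customer_id:", int(orphaned.sum()))
```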
Data practitioners who are more involved with the upstream components of a pipeline (extract and load) might be more interested in data freshness and loaded rowcount anomalies. Data freshness metrics measure the time since data was last updated, which can be an indicator of pipeline health and uptime. Data freshness is also important from the perspective of business intelligence and analytics use cases, as stale data might not be appropriate for decision making. Rowcount anomalies occur when the actual volume of records loaded or transformed in a pipeline deviates significantly from the expectation based on historical volume. Freshness and rowcount anomaly tests are valuable both from a pipeline health perspective and from a data transformation and reporting quality perspective.
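Here is a simplified Python sketch of both checks. The one-hour freshness threshold and the z-score cutoff are illustrative assumptions; production monitors often use more robust anomaly detection methods:

```python
from datetime import datetime, timedelta, timezone
import statistics

# Freshness: flag the table as stale if nothing has been loaded recently.
last_loaded_at = datetime(2024, 1, 1, 6, 0, tzinfo=timezone.utc)  # e.g. max(loaded_at)
is_stale = datetime.now(timezone.utc) - last_loaded_at > timedelta(hours=1)

# Rowcount anomaly: compare today's loaded volume against recent history
# using a simple z-score.
historical_rowcounts = [10_250, 9_980, 10_400, 10_100, 9_875, 10_300, 10_050]
todays_rowcount = 4_200

mean = statistics.mean(historical_rowcounts)
stdev = statistics.stdev(historical_rowcounts)
z_score = (todays_rowcount - mean) / stdev
is_anomalous = abs(z_score) > 3

print(f"stale: {is_stale}, rowcount z-score: {z_score:.1f}, anomalous: {is_anomalous}")
```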
dbt is a great tool for implementing standard data quality tests. The tests described above, plus many more, are available natively in dbt or via packages like dbt_utils.
Data Change Management & Resource Drift Detection

Another category of data pipeline testing and monitoring is change management and resource drift detection.
Change management in this context refers to tracking and handling change in data assets, such as schema changes or row-level changes in data models resulting from logic changes. Various strategies for managing such change, specifically in dimensional data models, are described in my other post here. That post focuses mostly on anticipated sources of change.
Explicit testing and monitoring can help us deal with unexpected sources of change, such as column-level schema changes in a transactional database owned by another team. The gold standard for handling this type of change would be some sort of data contract implementation. With that said, it can be valuable (and easier) to implement alerts for such schema changes in an upstream data source.
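As a sketch of what such an alert might look like, the snippet below compares an upstream Postgres table's live columns (from information_schema) against an expected column set. The table, columns, and connection details are hypothetical, and send_alert refers to the hypothetical Slack helper sketched earlier:

```python
import psycopg2  # assumes the upstream source is Postgres

# Columns we currently depend on; in practice this "expected schema" might be
# stored in version control or derived from a data contract.
EXPECTED_COLUMNS = {"id", "email", "created_at", "status"}

def detect_schema_drift(conn, schema: str, table: str) -> dict:
    """Compare the live column set of an upstream table against expectations."""
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT column_name
            FROM information_schema.columns
            WHERE table_schema = %s AND table_name = %s
            """,
            (schema, table),
        )
        actual = {row[0] for row in cur.fetchall()}
    return {
        "missing": EXPECTED_COLUMNS - actual,    # removed or renamed upstream
        "unexpected": actual - EXPECTED_COLUMNS, # newly added columns
    }

# Example usage (connection details are placeholders):
# conn = psycopg2.connect("dbname=app host=... user=...")
# drift = detect_schema_drift(conn, "public", "users")
# if drift["missing"]:
#     send_alert("users_source", "schema change", f"columns removed: {drift['missing']}")
```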
Another type of data change management is data diffing. This is the process of comparing a dataset before and after some data processing job, often transformation, and understanding and quantifying what changed. For example, if a metric definition on an existing data model is changed, or a join expression is changed, a data diff can describe which values changed or the change in rowcount of the result set. This is useful in CI/CD contexts. Datafold is a popular provider for data diff tooling.
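A very simplified version of this idea can be sketched in Python with pandas; dedicated tools like Datafold go much further (column-level statistics, CI integration), and the tables below are hypothetical:

```python
import pandas as pd

def simple_data_diff(before: pd.DataFrame, after: pd.DataFrame, key: str) -> dict:
    """A rough data diff: rowcount delta, added/removed keys, and changed values."""
    added = after.loc[~after[key].isin(before[key])]
    removed = before.loc[~before[key].isin(after[key])]

    # Compare values for keys present in both versions.
    common = before.merge(after, on=key, suffixes=("_before", "_after"))
    value_cols = [c for c in before.columns if c != key]
    changed = sum(
        (common[f"{c}_before"] != common[f"{c}_after"]).sum() for c in value_cols
    )
    return {
        "rowcount_delta": len(after) - len(before),
        "rows_added": len(added),
        "rows_removed": len(removed),
        "changed_values": int(changed),
    }

before = pd.DataFrame({"id": [1, 2, 3], "revenue": [100, 200, 300]})
after = pd.DataFrame({"id": [1, 2, 4], "revenue": [100, 250, 400]})
print(simple_data_diff(before, after, key="id"))
```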
Goals and Benefits of Testing and Monitoring

The primary goals of data pipeline testing and monitoring are to enable transparency and awareness for data teams, and to ensure data reliability, accuracy, and trust for consumers. Effective testing and monitoring also provide benefits such as early defect detection, faster resolution, easier troubleshooting, and more stable, longer-lived code.
A reliable data pipeline builds trust with stakeholders by providing consistent, accurate results. When teams can confidently rely on the data for decision-making, it strengthens the credibility of the data engineering function and reduces friction between teams. Pipeline monitoring can also reveal bottlenecks or inefficiencies in pipeline design and lead to performance optimizations and cost savings.
Risks of Not Testing and Monitoring

A lack of pipeline testing and monitoring can expose data teams to several risks. First, it can result in data inconsistencies that propagate through the pipeline, causing inaccurate analytics and reports. These errors are often hard to detect and correct once they reach downstream systems.
Delayed detection is another risk: issues might go unnoticed until they cause significant disruptions, making them more expensive to fix. Such incidents erode trust in the data, which undermines stakeholder confidence in the data team and the decisions based on their work.
In some industries, poor data quality can result in regulatory non-compliance, leading to legal penalties or reputational damage. Additionally, without monitoring, maintenance costs increase significantly as teams spend more time troubleshooting problems reactively instead of preventing them proactively.
Strategies for Effective Testing and Monitoring

Test Placement

Concentrate tests where data crosses “system boundaries”. For example, freshness tests can be effectively placed in the raw data layer of a data warehouse, where data initially lands after an extraction process. If these tests fail or trigger an alert, it is easy to see that there is probably an issue with the extraction and loading pipeline. This is preferable to placing such tests further downstream, after a transformation process, where it is hard to see whether the data is actually stale because it has not been delivered or loaded, or if it just “appears” to be stale because it “dropped out” as a result of some data transformation like an inner join or filter.

Shift tests left. The best placement for data quality tests is often the earliest, furthest upstream location where we might expect them to fail. For example, uniqueness tests should be placed at every location where a data processing step might introduce duplicates. This might be the raw data layer, where duplicates can be introduced in extracted source data, and on any data models containing joins (accidental fan-outs from many-to-many joins might introduce duplicates). An example of a location where it doesn’t make sense to test for duplicates is a data model where only filtering is performed. If duplicates are detected there, the issue must be traced upstream to some other location, which is tedious. This issue is discussed briefly in my other post about change management for dimensional data models (it touches on additional pipeline complexity that might result from improper test placement). The earlier data quality issues are detected, the less likely they are to propagate downstream through the pipeline and cause other issues.

Place schema change tests and other tests for breaking changes on terminal data models exposed to downstream reporting systems and APIs. Downstream consumers of terminal data models expect their schemas to remain the same unless otherwise notified. If schemas are changed without notice (for example, a column is removed), downstream exposures are likely to break (for example, because a query will request a column that no longer exists). As I mentioned before, one robust approach to ensuring such handshakes between data producers and consumers is to use data contracts; however, in the absence of a data contract implementation, breaking schema change testing is an effective way to ensure exposure uptime.

Test Types and Strategies

Bare minimum tests. It is worthwhile to define some set of “bare minimum” tests that should be applied to most or all resources in your data pipeline. For many teams, this is testing model primary keys for not null and uniqueness. These characteristics will be enforced by default in relational databases for primary key columns, but won’t be in most data warehouse implementations. Tools like dbt make it very easy to apply these tests to many resources with minimal effort. Despite their simplicity, they are surprisingly effective at catching a wide class of errors. Depending on your data and use cases, your “bare minimum” tests might vary.

Test generalization. Applying a generic test definition to many resources in your pipeline is a good way of increasing test coverage with minimal effort. Again, dbt provides various generic data tests out of the box, and also supports defining your own via macros.

Test Driven Development

Test driven development is a popular strategy for developing robust, well tested software.
The general idea is to develop tests before your actual code, then ensure that code passes all the developed test cases. It is particularly useful when dealing with data pipelines, because as new data flows into your pipeline, there is always a chance for new unhandled edge cases.
When combined with existing tests and alerting, test driven development can be used effectively in data pipelines as follows: whenever something breaks, add a test that reproduces the failure, then make (and commit) the change that causes the test to pass. Keep that test around to check for future regressions, which might result from new data or code changes. A minimal sketch of such a regression test follows below.
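For instance, suppose a many-to-many join once fanned out and introduced duplicate keys in a production model. A small pytest-style regression test, with hypothetical table, column, and function names, might look like this:

```python
import pandas as pd

def build_orders_enriched(orders: pd.DataFrame, payments: pd.DataFrame) -> pd.DataFrame:
    """The fix: deduplicate payments per order before joining."""
    latest_payments = payments.sort_values("paid_at").drop_duplicates("order_id", keep="last")
    return orders.merge(latest_payments, on="order_id", how="left")

def test_orders_enriched_has_unique_order_ids():
    orders = pd.DataFrame({"order_id": [1, 2]})
    payments = pd.DataFrame({
        "order_id": [1, 1, 2],  # the edge case that originally caused the fan-out
        "paid_at": ["2024-01-01", "2024-01-02", "2024-01-03"],
    })
    result = build_orders_enriched(orders, payments)
    assert not result["order_id"].duplicated().any()
```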
Responding to Test Failures and Alerts

What should happen when a data test fails? Usually that depends on the type of test, mode of failure, criticality of the data, and magnitude of the failure. A decision must be made about whether to stop the flow of data entirely, or just generate some alert and allow an engineer to investigate further.
Data flow should stop (i.e. new data should not be deployed to a production environment, consumable by stakeholders) if the particular data which triggered the test failure will lead to poor decision quality, erosion of trust, or significant performance degradation or cost increases.
For example, at my company, we perform real estate market supply and demand analytics, and make recommendations to our customers about where to transact based on those analyses. If listing counts are highly skewed due to a data quality issue, we should stop the flow of that data entirely rather than allow stakeholders to use it to make recommendations to customers. Making recommendations based on incorrect data would lead to poor decisions and erode trust in our platform and data.
On the other hand, sometimes it is appropriate to simply generate alerts regarding data quality issues, but still allow data to flow. For example, maybe a particular column contains an anomalous value that is significantly larger than the average historical value, but there is a high likelihood that it is a true outlier and won’t negatively affect reporting. Such alerts are appropriate for “low risk” data issues.
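A simplified sketch of this routing logic is shown below. In practice the behavior is usually configured per test in your orchestration or transformation tooling rather than hand-rolled; the severity levels and helper functions here are illustrative (send_alert stands in for the hypothetical Slack helper from earlier):

```python
from enum import Enum

class Severity(Enum):
    LOW = "low"            # alert only, data keeps flowing
    CRITICAL = "critical"  # block deployment of the affected data

def send_alert(pipeline: str, status: str, detail: str) -> None:
    print(f"[ALERT] {pipeline} {status}: {detail}")  # stand-in for a real notifier

def handle_test_failure(test_name: str, severity: Severity) -> None:
    if severity is Severity.CRITICAL:
        # Stop the flow: e.g. fail the job so downstream models are not built.
        raise RuntimeError(f"Critical data test failed: {test_name}")
    # Low-risk issue: notify and let the pipeline continue.
    send_alert("warehouse", "data quality warning", f"test failed: {test_name}")

handle_test_failure("listing_counts_within_expected_range", Severity.LOW)
```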
One key risk to be mindful of is alerting fatigue. This can occur when a large volume of alerts is generated, but many of them are insignificant, trivial, or can't be resolved. Be sure to only create alerts for the most critical, valuable pipeline components, where you can be reasonably sure that an alert will be worth the effort to investigate and resolve.
Persisting Test Failure Metadata

In any large data pipeline with many continuously updating data sources, tests are guaranteed to fail repeatedly and for different reasons. Another effective testing and monitoring strategy is to persist the results of these test failures and analyze them over time. This can provide insight into the typical failure modes of your data pipeline, which can inform future design decisions to mitigate such failures and their associated risks.
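As a minimal sketch of the idea, the snippet below appends test failure records to a table; SQLite stands in for a warehouse here, and the schema is an assumption:

```python
import sqlite3
from datetime import datetime, timezone

# SQLite stands in for a data warehouse table here; the schema is illustrative.
conn = sqlite3.connect("test_results.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS test_failures (
        tested_at TEXT,
        test_name TEXT,
        model_name TEXT,
        failing_rows INTEGER
    )
""")

def record_failure(test_name: str, model_name: str, failing_rows: int) -> None:
    """Persist one test failure so failure modes can be analyzed over time."""
    conn.execute(
        "INSERT INTO test_failures VALUES (?, ?, ?, ?)",
        (datetime.now(timezone.utc).isoformat(), test_name, model_name, failing_rows),
    )
    conn.commit()

record_failure("unique_order_id", "stg_orders", failing_rows=42)

# Later: which tests fail most often?
for row in conn.execute(
    "SELECT test_name, COUNT(*) FROM test_failures GROUP BY test_name ORDER BY 2 DESC"
):
    print(row)
```

dbt users can get similar behavior out of the box with the store_failures test config, which persists the rows that caused a test to fail to a table in the warehouse.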
Using Airbyte for Alerting & Metadata Persistence

We've reviewed a variety of concepts about data pipeline testing and monitoring, as well as strategies for implementing them effectively. Now let's conclude this post with a couple of concrete examples of how Airbyte can be used for alerting and monitoring.
Webhook notifications for replication failures or schema changes. Airbyte natively supports webhook notifications for various events, including replication failures and schema changes. This is an easy way to stay on top of your data replication health and ensure your syncs are running smoothly.

Using Airbyte to replicate the backend databases of self-hosted applications in your data stack. Most applications store state and various metadata in a relational database such as Postgres. These application databases often contain a lot of rich data about the usage patterns and functioning of the tool. If you self-host an application and have access to this database, it can be highly valuable to inspect this data, and even combine it with other metadata. My team uses Apache Superset for dashboarding, reporting, and visualization. It stores state in a Postgres backend, which contains a ton of rich metadata about dataset definitions, dashboard usage, role-based access control policies, etc. We replicate this database to our data warehouse using Airbyte, and create reporting on dashboard usage and access for data governance use cases. We also analyze dataset definitions (column schemas) and create alerting for breaking schema changes in our dimensional data models.

Analyze the Airbyte backend database itself to understand patterns in replication failures. This can provide insight into which replications are failing most frequently, and under what failure modes. It can also provide programmatic access to scheduling data, as well as connection sources and destinations. A hedged sketch of such a query is shown below.
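For the last point, a query along the following lines could surface the connections with the most failed syncs. The table and column names are based on my understanding of Airbyte's internal Postgres schema and may differ across Airbyte versions, so treat this as a starting point rather than a definitive reference:

```python
import psycopg2

# NOTE: the table and column names below (jobs.scope, jobs.config_type, jobs.status,
# jobs.created_at) reflect Airbyte's internal Postgres schema as I understand it;
# verify them against your own deployment, since internal schemas can change.
FAILED_SYNCS_QUERY = """
    SELECT scope AS connection_id, COUNT(*) AS failed_jobs
    FROM jobs
    WHERE config_type = 'sync'
      AND status = 'failed'
      AND created_at > now() - interval '30 days'
    GROUP BY scope
    ORDER BY failed_jobs DESC
    LIMIT 10
"""

def report_failing_connections(dsn: str) -> None:
    """Print the connections with the most failed sync jobs over the last 30 days."""
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(FAILED_SYNCS_QUERY)
            for connection_id, failed_jobs in cur.fetchall():
                print(connection_id, failed_jobs)

# Example usage (the DSN is a placeholder for your replicated copy of the Airbyte db):
# report_failing_connections("dbname=airbyte host=... user=... password=...")
```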