What Is a Slowly Changing Dimension? A Comprehensive Guide
Data warehouses are crucial elements for organizations, storing extensive historical data for effective business analysis. This historical data can help you generate meaningful insights for impactful decision-making.
From customer addresses and employee roles to product specs, there are attributes or dimensions in your data store that may change over time. Such changing dimensions, referred to as slowly changing dimensions (SCDs), are essential to maintaining historical records in data management. With SCDs, you can be certain that historical and current data are considered for your insights generation.
This article comprehensively discusses the concept of a slowly changing dimension, including its types, implementation, and maintenance practices.
What Is a Slowly Changing Dimension?
A Slowly Changing Dimension is a framework that defines how data in your data warehouse changes over time. SCDs enable you to store and manage current and historical data, which is essential for analyzing trends and making informed business decisions.
Changes in dimension tables occur slowly and unpredictably without a fixed schedule or pattern. Whether to store current data only or also include historical data depends on specific business and analytical needs.
For example, consider a scenario where the manager of a franchise branch changes periodically. When a new manager takes over, the record for the manager’s name, associated with a unique ID, is updated to reflect the latest information.
In this scenario, analyzing historical sales data under previous managers is difficult since the manager’s name is overwritten. To overcome this limitation, you can rely on slowly changing dimensions. Implementing SCDs allows you to maintain historical records for each manager along with the current data.
Types of SCDs
Although there are multiple types of slowly changing dimensions, three types are commonly encountered in real-world applications.
Type 1 SCDs - Overwriting
Type 1 SCD, also known as overwriting, usually occurs when new data replaces the existing data within a table. In this approach, historical data is not retained, making it difficult to track changes over time. As a result, this method is suitable when historical changes aren’t crucial to your analysis.
When implementing Type 1 slowly changing dimension, you must ensure that the trends in the data columns are not a critical component of your workflow. For instance, in a customer database, you can overwrite the column containing customer addresses when a customer relocates.
If John Doe moves to a new location, you can update his Address record in the database to reflect this change.
Maintaining only the updated addresses allows you to ensure the delivery of products to your customer's current location. This results in efficient logistics and customer service.
Type 2 SCDs - Creating Another Dimension Record
In contrast to SCD type 1, Type 2 SCDs maintain both historical and current records. This type involves adding a new row to the table for each update without overwriting or eliminating the previous record. The new row shares the same natural key but has a different primary key, allowing you to track changes across different versions.
To handle type 2 SCDs, you can use two different methods:
- Flag Column: One method requires defining a flag column that indicates the currently active record. When a new record is added, the flag is updated across relevant rows to reflect the current active state.
- Timestamp Column: Alternatively, you can use a timestamp column to outline the record creation time. The most recent timestamp indicates when the current record was created or made active.
For example, when updating an address, a new row with the updated address is added and marked as active (True). The previous address’s row is updated to reflect inactive (False) status. With this, you can maintain historical addresses for reference, and also ensure the most current address is considered for operational use.
Type 3 SCDs - Creating a Current Value Field
In Type 3 SCDs, the updates are tracked by adding a new column instead of a row for changes. For this SCD type, the primary key remains the same, maintaining record uniqueness. However, this type preserves the most recent history.
A new column is added to capture changes, allowing you to track only one historical change per record. Due to this, Type 3 SCD is suitable for data elements that are not expected to change frequently.
For example, here’s a table representing the changed address using SCD Type 3.
The resulting table highlights both the current address and the previous address without maintaining a full history of past addresses. Updates to the address involve shifting the current address to the Previous Address and adding the new address in the Current Address column.
How to Implement Slowly Changing Dimensions in a Data Warehouse?
To implement slowly dimension changing principles, you must take action while designing and creating a table in a data warehouse. Redesigning the entire data infrastructure can be a difficult task.
When working with SCDs in an existing table, there are steps you can follow that can lead to a least resistant path.
- The first step requires you to assess the existing tables to identify the type of SCDs applicable.
- Analyzing the business requirements is crucial to ensuring that the implementation of SCDs supports the data analysis and reporting goals.
- After the analysis, you can choose the type of modifications needed for managing historical data effectively. This includes choosing whether you want to add a flag or a timestamp column to the dataset.
For implementing Type 1 SCD, you can use the MERGE INTO command to upsert data into an existing table. This SQL command enables conditional updates based on a match key, ensuring the data update occurs only when specified conditions are met.
An instance of this is when updating customer addresses. The MERGE command checks if an existing address matches the new data; if it does, the address is updated.
Implementing Type 2 slowly changing dimensions is a bit more challenging than operating on Type 1. In Type 2, a staging table will store the new and old data. You can implement flags or timestamps to mark the current data value.
For example, if a franchise manager changes, the timestamps and flag columns change. The end date of the previous manager’s record is updated. At the same time, the end date is set as null for the new manager’s record using the MERGE INTO command with specific conditions.
To implement SCD Type 3, you will use a staging table to prepare data updates. When updating, the previous and current values are stored in the same row but in different columns.
How to Maintain Slowly Changing Dimensions?
Maintaining slowly changing dimensions requires you to understand data management processes.
When setting up a table, you must account for the methods of tracking historical records. Ask questions like: “How are new records added to a table?” and “Does the current method of ingesting data consider historical data tracking?” These questions can help you identify and update your current data-handling process.
Here are two ways to maintain SCDs:
ETL
ETL—extract, transform, and load—is the process of extracting data from multiple sources, transforming it, and storing it in your preferred destination. It can be crucial in adding new data rows to your data store and enhancing data lineage tracking from source to destination. This can create a robust audit trail for improved data integrity and compliance.
However, manually creating an ETL pipeline can be technically demanding and resource-intensive. To overcome this complication, you can leverage SaaS-based tools like Airbyte.
Airbyte is an efficient data integration tool that enables you to migrate data from multiple sources to a destination of your preference. It offers 400+ pre-built connectors, allowing you to manage as well as move structured, semi-structured, and unstructured data across different platforms. If the connector you seek is unavailable, you can create a custom connector using Airbyte’s Connector Builder or Connector Development Kits (CDKs).
Along with these features, Airbyte also has a Python library—PyAirbyte—that enables you to leverage Airbyte connectors in a developer environment. By utilizing this library, you can effortlessly build ETL pipelines.
Here are a few impressive features offered by Airbyte:
- AI-Powered Connector Builder: The Connector Builder comes with an AI-assist functionality that reads through your connector’s API documentation and auto-fills most configuration fields in the UI. This simplifies your connector development journey.
- Vector Databases Support: Airbyte supports popular vector databases, including Milvus, Weaviate, and Pinecone. You can use Airbyte to store vector embeddings in these databases, which can then be used to train LLMs.
- Streamline GenAI Workflows: By supporting RAG techniques, such as chunking, embedding, and indexing, Airbyte allows you to transform raw data into vector embeddings. You can store these vector embeddings in vector databases to streamline LLM responses.
CDC
Another method to maintain slowly changing dimensions is to use the change data capture (CDC) tools. CDC allows you to track historical data changes, ensuring synchronization between data sources and the destinations.
Airbyte offers built-in CDC functionality, which lets you identify incremental data changes made at the source. With this feature, you can maintain data consistency by automatically replicating these changes to the destination data store of your choice.
For example, consider a situation where you need to replicate data from PostgreSQL to Snowflake. You can configure CDC while setting up PostgreSQL as a source in Airbyte Cloud by following these steps:
- Login to your Airbyte account.
- Click on the Sources tab on the left panel of the dashboard.
- In the Search Airbyte Connectors box, enter Postgres and click on the available option.
- On the New source page, enter all the necessary information. Under the Advanced section, enable the CDC feature in the Update Method.
- After filling out all the information, click the Set up source button.
- Click on the Destinations tab on the left panel of the page and follow the same instructions to configure Snowflake as a destination.
- Finally, click on the Connections tab on the left pane to set up the Postgres to Snowflake integration by mentioning the replication requirements.
Key Takeaways
A slowly changing dimension is a crucial component of well-managed data workflows. Tracking data updates and changes is beneficial, especially for scenarios that require historical data analysis. It is often considered a good practice to include SCDs while designing tables or databases, as it becomes difficult to manage changes later on.
Depending on your business use case, you must ensure that historical data is preserved during updates. Implementing different SCD types enables you to maintain data records efficiently. Along with this, utilizing processes like ETL and CDC in your data storage is essential to maintaining SCDs for the long run.