3 Critical Differences Between DISTKEY and SORTKEY in Redshift

February 13, 2024
20 min read

As cloud data warehouses and Big Data platforms made inroads into the data landscape, many businesses chose to store their large datasets here. However, simply storing your data in one of the leading cloud data warehouses is not enough. You must know how to navigate the warehouse and use the correct commands to enhance the query performance.

Amazon Redshift is known to be a preferred choice among several cloud data warehouses. If you are planning to migrate your data to this warehouse, you must know how to use some of its crucial elements. This article will provide you an understanding of DISTKEY and SORTKEY in Redshift, and how they differ from each other.

Overview of Amazon Redshift

Redshift is a cloud-based data warehouse service under the expansive AWS umbrella. It is designed to utilize a columnar storage method, making the warehouse ideal for handling large volumes of data and processing quick analytical queries.

At its core, Redshift operates on a massively parallel processing (MPP) architecture. The infrastructure allows the cloud data warehouse to distribute and parallelize queries across multiple nodes. This scalability and flexibility make Redshift easy to set up and deploy for organizations of varying sizes.

To organize your data into tables and optimize Amazon Redshift’s overall performance, you can utilize the DISTKEY and SORTKEY. To comprehend the functionalities of both keys, take a look at Redshift’s architecture in a little more detail.

What is DISTKEY

DISTKEY is a column designation that determines how data is distributed across compute nodes in a Redshift cluster. It helps optimize query performance by minimizing data movement between nodes during joins and aggregations.

DISTKEY example

CREATE TABLE sales (

    sale_id INT,

    customer_id INT DISTKEY,

    product_id INT,

    sale_date DATE,

    amount DECIMAL(10,2)

);

In this example, customer_id is set as the DISTKEY. This might be useful if you frequently join the sales table with a customer table, as it would ensure that all data for a given customer is stored on the same node.

What is SORTKEY

SORTKEY is a column designation that determines the order in which data is stored on disk within each compute node. It improves query performance by allowing Redshift to skip reading unnecessary data blocks during table scans and range-restricted queries.

SORTKEY example

CREATE TABLE orders (

    order_id INT,

    order_date DATE SORTKEY,

    customer_id INT,

    total_amount DECIMAL(10,2)

);

Here, order_date is set as the SORTKEY. This would be beneficial if you often run queries that filter or sort by date, as it allows Redshift to quickly locate and retrieve relevant data blocks.

Are DISTKEY and SORTKEY the same?

No, DISTKEY and SORTKEY are not the same. They serve different purposes in Redshift:

  • DISTKEY determines how data is distributed across nodes in the cluster.
  • SORTKEY determines how data is ordered within each node.

While both aim to improve Redshift’s query performance, they do so in different ways and at different levels of the data storage hierarchy.

Why is the SORTKEY Required?

In Amazon Redshift, sorting data plays a vital role, primarily when the infrastructure handles data with range-restricted predicates. The platform stores columnar data in 1 MB disk blocks, with metadata including the minimum and maximum values for each block. Since the metadata is sorted during storage using the SORTKEY, Redshift’s query processor can swiftly skip over numerous blocks while scanning tables. Thus, you enjoy data retrieval at greater speeds.

Why is the Distribution (DIST) Key Required?

An Amazon Redshift cluster comprises multiple nodes, each with its dedicated operating system, memory, and disk storage. Among these nodes, there is a leader node, which is responsible for managing data distribution and query processing tasks across compute nodes. The compute nodes are responsible for providing resources to execute your query. The disk storage of each compute node is divided into several slices.

When you load data into a table, Redshift distributes rows of the dataset to node slices based on the table distribution style (DISTKEY) you configured. Subsequently, the query optimizer identifies the location for data blocks to execute your query efficiently. Thus, defining a DISTKEY is important as it aligns your data with the compute resources effectively.

3 Differences Between DISTKEY and SORTKEY in Redshift

DISTKEY and SORTKEY play crucial roles, but their functions differ significantly. Let’s understand some of these differences.

1. When and Where to Use?

SORTKEY

When you are creating a table in Redshift, you have the option to designate one or more columns as the SORTKEYs. Defining the key at the very beginning helps in later processes. The query planner in Redshift receives information about the SORTKEY columns and utilizes it to generate plans to sort the data structures. Once you start loading data into the table, the rows on the disk get stored in a sorted order based on their designated columns.

DISTKEY

When you are creating a table in Amazon Redshift, you must define a distribution style for it using the DISTKEY. This is an important step because, at the time of loading data into the table, the rows will be distributed to the compute nodes based on the defined distribution style. While executing queries, Redshift’s query optimizer redistributes the rows among the compute nodes, allowing you to perform joins and aggregation commands on your dataset efficiently. Thus, selecting a table distribution style while creating a table is vital as it places your data in optimal locations beforehand. It greatly reduces the processing time for redistributing queries, giving you swift results.

Experience Hassle-Free Data Ingestion into Redshift with Airbyte
Try FREE for 14 Days

2. Types of SORTKEY And DISTKEY in Redshift

SORTKEY

When using the CREATE TABLE command, you are required to specify the sorting method. You can choose between COMPOUND and INTERLEAVED SORTKEYs. Redshift usually picks the COMPOUND key by default for sorting methods.

The COMPOUND SORTKEY consists of all the columns that you list in the specified order while defining the sort key. This key is quite effective when you perform queries to regularly update your tables through INSERT, UPDATE, or DELETE operations. It is used by Redshift to enhance the speed of joins, GROUP BY, and ORDER BY operations, as well as window functions utilizing PARTITION BY and ORDER BY clauses.

On the other hand, the INTERLEAVED SORTKEY works by assigning equal importance to each column or subset of columns defined under your sort key. This key is utilized by Redshift to boost query performance when you want to execute multiple queries to filter results from different columns. 

The INTERLEAVED key employs an internal compression scheme for zone map values, enabling it to differentiate column values with lengthy common prefixes. This feature is useful when you want to sort column values that have URLs in them. However, it is better if you do not use the INTERLEAVED key for sorting columns that have dates, timestamps, or attributes that increase monotonically.

The multidimensional data layout sort key is a variant of the AUTO SORTKEY. This key is used to identify and organize data based on the repetitive predicates in your workload. It is possible that your dataset may contain multiple repetitive predicates.

DISTKEY

  • EVEN Distribution: The leader node distributes rows evenly across the node slices in a round-robin method. This key does not consider the values of any specific column. You can use this distribution type on tables that do not need to be joined with other tables of the dataset. 
  • KEY Distribution: Here, the rows are distributed based on the values in a designated column. The leader node ensures that matching values are placed on the same node slice, facilitating efficient joins. If you have two tables that are distributed on the basis of their joining key, the leader node will collocate rows on node slices. It will be based on the values of the joining column, ensuring matching values stay together during storage.
  • ALL Distribution: In ALL Distribution, a copy of your entire table is distributed to every node in the cluster. It is unlike the above two distribution types, where only a portion of your table’s rows are placed on each node. Here, every row is collocated for all join operations that you conduct on the table. However, this approach can significantly increase storage requirements and query processing time.
  • AUTO Distribution: If you do not specify any distribution style, Amazon Redshift employs AUTO distribution. This distribution type is automatically based on the size of your table and a few other factors.

3. Recommended Type of Key

SORTKEY

Amazon Redshift recommends you create tables with the SORTKEY AUTO setting. This key allows Redshift to optimize the selection of sort keys automatically based on your data’s usage patterns. By leveraging automatic table optimization, you can get efficient query performance.

DISTKEY

For DISTKEY, Redshift recommends you create tables using the DISTSTYLE AUTO setting. With this key, the selection of your data is automatically optimized based on the volume of your dataset and your usage patterns.

Store Data into Redshift: An Alternate Way

The DISTKEY and SORTKEY in Redshift are quite important when you are creating tables and storing data in them directly. However, the process can be slightly tedious and time-consuming, especially when you are unfamiliar with SQL code. Apart from this, to further speed up data management in your Amazon Redshift, you can use platforms like Airbyte.

Airbyte is one of the leading data integration and replication platforms. It has over 350+ pre-built connectors, and you can find a dedicated Redshift destination connector among its offerings. Without writing a single line of code, you can extract data from diverse sources and migrate it into Redshift in under ten minutes!

The data pipeline that you set up is maintained by the platform itself. Moreover, you can take advantage of Airbyte’s CDC techniques. Here, only new or modified data get ingested into Redshift each time there are changes made at the source.

Wrapping Up

The choice of DISTKEY and SORTKEY in Amazon Redshift is crucial for optimizing query performance. Knowing which type of key should be used while conducting analysis is important. Wrong query output may mislead you and cost you a fortune if you base your strategic decisions on them.

Set up a data pipeline with Airbyte in two simple steps and speed up your data analytics operations on Redshift. This way, all your data will be loaded in a structured and organized manner, granting you more time to conduct in-depth analysis. Sign up for free to get started today!

FAQs

Can the Redshift table have multiple sort keys?

Redshift tables can have multiple sort keys. This can be implemented as a compound sort key, where columns are sorted in the order specified, or as an interleaved sort key, which gives equal weight to each column. The choice depends on your specific query patterns and data access needs.

CREATE TABLE sales (

    sale_id INT,

    sale_date DATE,

    product_id INT,

    customer_id INT,

    amount DECIMAL(10,2),

    SORTKEY (sale_date, product_id)

);

In this example, the data would be primarily sorted by sale_date, and within each date, it would be sorted by product_id.

How to choose SORTKEY & DISTKEY in Redshift?

Choosing SORTKEY:

Select columns frequently used in WHERE clauses, JOIN conditions, or ORDER BY statements. Consider using columns with high cardinality that are often filtered or sorted.

Choosing DISTKEY:

Pick a column that's commonly used in JOIN operations with large tables. Aim for even distribution of data across nodes, avoiding skew. The chosen column should have high cardinality.

Can DISTKEY and SORTKEY be the same?

Yes, the distkey and sortkey can be in the same column in a Redshift table. This can be beneficial in certain scenarios, particularly when the column is frequently used in both JOIN operations and WHERE clauses. However, it's not always the optimal choice and depends on your specific query patterns and data distribution.

How does a SORTKEY improve performance?

Each block of data has a min and max value ("Zone Maps"), allowing Redshift to easily skip blocks that don't contain desired values. This speeds up operations as disk access is typically the slowest part.

Limitless data movement with free Alpha and Beta connectors
Introducing: our Free Connector Program
The data movement infrastructure for the modern data teams.
Try a 14-day free trial