Most data companies spend thousands of dollars on computing and storage of data in data warehouses. Hence, companies are always looking for ways to optimize their data warehousing finances and invest money in other layers of the data stack.
Amazon Redshift is a fully managed cloud data warehouse used to collect, store and analyze data. Amazon Redshift handles large volumes of data and can scale from a few gigabytes to petabytes of data or more. Redshift simplifies handling large datasets in a parallel and distributed manner. Because of this efficiency in processing large datasets, more than ten thousand companies make Amazon Redshift their first choice when powering their analytical workloads in a data warehouse.
If you are a Redshift user, you will be happy to know that you can save your company thousands of dollars by levering some Redshift optimization techniques without interrupting your data flows.
In this article, you will learn how to optimize the performance of your Amazon Redshift cluster.
Why optimize Redshift performance? One of the reasons you need to optimize the performance of Redshift is to improve query performance. By optimizing Redshift, you can increase the speed of execution of your queries when you work with large volumes of data.
Another reason to optimize your Redshift cluster is to reduce storage. If Redshift’s storage is not optimized, file sizes will get bigger and require additional disk space. You can also save money by optimizing your Redshift cluster.
How Redshift costs are calculated Amazon Redshift costs depend on the node type you choose, the use of Redshift spectrum, the amount of data stored in the clusters, and concurrency scaling. If you use Redshift serverless, you would not have to pay independently for concurrency scaling and Redshift spectrum. In Redshift, pricing can either be on-demand or based on a reserved instance. In on-demand pricing, you pay for only the services you use. Charges are not incurred for periods when the cluster is paused.
Redshift node type costs When creating your Redshift cluster, you have to specify a node type. Redshift has different node types to accommodate your workloads. The node types in Redshift are RA3, DC2, and DS2. With the RA3 node type, you pay for compute and managed storage. With the RA3 node type, you choose the number of nodes based on the amount of data you process. If you expect your data to grow, it is recommended to select the RA3 node type. If your pricing model is on-demand, you will pay for the time the cluster is used.
You can also use the pause and resume feature to suspend on-demand billing for periods when your Redshift cluster is not in use. For instance, if your cluster runs for twenty minutes, you will be charged independently for compute and storage for those twenty minutes.
Redshift serverless costs In Amazon Redshift serverless, you also pay for periods your cluster is active. Redshift serverless automatically scale up when the traffic in your cluster increases and shuts down for periods of inactivity. For Redshift serverless, you do not have to pay for concurrency scaling and Redshift spectrum separately because they are included with Redshift serverless.
Redshift managed storage costs In Amazon Redshift, you pay for the data stored at a fixed GB monthly rate for your region. Managed storage charges do not include backup storage charges, such as charges due to manual or automated snapshots.
How to detect and debug Redshift performance issues One way to debug Redshift performance issues is by monitoring cluster performance metrics. With cluster performance metrics, you can get information on the amount of CPU utilized, the amount of disk space used, the health status of your cluster, and the time it took to execute a query. Using cluster performance metrics, you can easily detect and debug performance issues in your Redshift cluster. This metric helps you know the possible cause of performance issues for your Redshift cluster. You can view your performance issue graph by clicking on the cluster performance tab on your Redshift cluster.
It is essential to know what these metrics mean. For instance, if the percentage of your disk space used goes above seventy-five percent, you will experience performance issues. If your disk space usage exceeds 90%, your cluster performance and stability will reduce drastically.
Similarly, for CPU performance, Redshift makes use of all available resources while running queries. So, it is normal to experience spikes in your CPU utilization when the traffic in your Redshift cluster increases. However, the issue comes when CPU utilization starts affecting your cluster performance and query time. When this happens, you can debug it by applying these steps . You can monitor the health status of your Redshift cluster by using Amazon Cloudwatch or Redshift cluster performance metrics dashboards.
You can debug Redshift’s performance issues by using Amazon Redshift Advisor recommendations. To see the Redshift advisor recommendations, go to the Redshift navigator pane and click on Advisor.
When you click on Advisor, you will see some recommendations. You can then sort the recommendations by cluster or by impact. You can also monitor and make proactive decisions about your Redshift cluster by using alarms. To create an alarm, go to the Amazon Redshift navigation pane and click on Alarms and then Create alarm.
In the image above, an alarm was created when the maximum disk space is higher than sixty percent. You also have the option to send this notification to an Amazon SNS topic.
5 ways to optimize your Redshift performance and costs Next, we will go through the most popular advice to optimize Redshift performance
Enable concurrency / auto-scaling Using auto-scaling, you can automatically scale up and down your cluster. When concurrency scaling is enabled, Redshift will automatically add more cluster capabilities when there is a spike in your cluster queries. Redshift will also automatically scale down when there is no activity in your cluster to enable you to free up resources and save costs in Redshift. This helps you effectively handle spikes in workload at specific periods.
To configure auto-scaling, go to the Properties tab on your Redshift cluster. You then select your parameter group.
You then select Edit workload queues.
You then change the concurrency scaling mode from off to auto .
The key is that when the activities grow in the redshift-custom-pg queue, Redshift automatically creates a new cluster to process the queries and shuts down the cluster when the activities in the queue drop.
By default, Redshift spins up one cluster. You can also modify the number of clusters Redshift spins up by heading to the Parameters tab in your parameter group and clicking on Edit parameters .
You can then modify the maximum concurrency scaling clusters.
In the image above, we increased the concurrent scaling clusters to three.
Use sort keys and distribution keys Another way to optimize Redshift performance is by using sort keys and distribution keys. Sort keys determine the order in which rows in your Redshift table are stored. A table can have a single sort key or multiple sort keys (compound sort key). When you create a sort key, metadata is automatically generated, containing information about the block. For example, if you create a table with the sort key of the date column, Redshift automatically creates blocks where chunks of the date column are sorted.
If you want to run a query such as <span class="text-style-code">SELECT * FROM {your_redshift_table} WHERE date=’15-09-2022’</span>, the query optimizer would not have to read through the entire data. It knows that the data would be in the second chunk (as in the image above). This is because the chunks of data get sorted by the sort key. This optimization technique increases query speed and query performance.
Distribution keys determine where and how data is stored in Redshift. If you have a large dataset and all the data is stored on a single Redshift cluster node, there will be a decrease in the query performance. There are three major distribution keys: key distribution, all distribution, and even distribution. The data is distributed across slices using a particular column (distribution key) in the distribution key. In all distributions, the leader node replicates your data across multiple nodes. Even distribution is majorly used for small dimension tables. Here, the leader node distributes your table evenly across slices.
If you perform frequent range filtering on a particular column, specify that column as the sort key. Similarly, if you perform frequent joins on a particular table, the join column should be specified as the distribution key.
Amazon Redshift also uses machine learning models to automatically optimize your tables. Tables created without explicitly defining a column for the sort keys or distributed keys are set to AUTO.
Create materialized views This is another way of optimizing the performance of Redshift queries. Materialized views simplify queries by saving query information. This makes it easier to retrieve data. It is a way to increase the processing speed of queries. Materialized views can consist of a single table or a join of multiple tables.
Imagine you have a sales table with millions of records. In the image below, you want to get the first ten records.
You also have a date table with millions of records, as shown below.
If you want to generate some insights, like the days with the highest sales, you can easily create a materialized view, as shown below.
Here, a materialized view with the name sales_by_date was created. You can preview the results of your materialized view by running the query in the image below.
A very important feature of materialized views is incremental refresh. For instance, if fifty thousand new records are added to both the sales table and the date table, materialized views will read only the updated records. This optimizes query processing because only the updated records are refreshed instead of the entire table.
When new records are added, you can refresh the materialized view by running the command: <span class="text-style-code">REFRESH MATERIALIZED VIEW {name_of_your_materialized_view}</span>.
Using the REFRESH command, your materialized view will get updated with new records. This decreases your query’s processing time and performance.
Rather than manually refreshing the materialized views, Amazon Redshift can also automate refreshing the data in the materialized views. AutoMV (meaning Automated Materialized Views) feature in Redshift has the same behavior as user-created materialized views. AutoMV uses machine learning algorithms to monitor user queries, detect changes in data, and incrementally refresh the materialized view. Asides from improving the performance of redshift queries, AutoMV also improves dashboard performance. Dashboards generally have a common set of queries used repeatedly. Using materialized views, changes are easily captured and refreshed, which helps improve performance and speed up queries.
Use the appropriate compression encoding format Compression is necessary because it saves storage space and reduces the size of data read from the disk. When storage space is saved, and data size is reduced, query performance increases. There are different compression techniques used in Amazon Redshift.
Raw encoding is the default encoding for columns that are sort keys, boolean or double precision data types. Using the COPY command creates an automatic compression.
For VARCHAR and CHAR data type, LZO and ZSTD encoding type works well in most use cases. Also, AZ64 is effective for INT, TIMESTAMP, and DATE data types as it uses SIMD for parallel processing of AZ64 encoded columns.
You can also use the ANALYZE COMPRESSION command to find the optimal compression and generates a report with the suggested compression type.
If you want to see the compression types for your sales table, you can run the query <span class="text-style-code">ANALYZE COMPRESSION {table_name}</span>.
You can also specify a compression encoding of your choice while creating a table using the ENCODE keyword, as shown in the image below.
Using compression encoding, you can reduce storage footprints and increase query performance.
Optimize SQL Queries Another thing to look out for while improving Redshift’s performance is optimizing your SQL queries. For example, if a table has millions of records and you run the SQL query <span class="text-style-code">SELECT * FROM {table_name}</span>.
This goes through all the records and outputs them. If you check your query duration graph, you will notice that it takes longer for the queries to get completed. While running your Redshift queries, it is a good practice to select some particular fields instead of using SELECT * to select all the fields. By selecting some particular columns, the SQL optimizer will retrieve some particular data, and this will help to reduce your Redshift costs.
It is also a good practice to use common table expressions (CTE). Using CTEs, your code will be more organized and readable. You will also avoid repeating subqueries. It is also key to avoid using WHERE for creating joins. If you have a customers table and an orders table, you should avoid running queries like the one shown below.
SELECT customers.customerID, customers.first_name, customers.last_name, orders.order_amount
FROM customers, orders
WHERE customers.customerID=orders.customerID
Instead, you can write more efficient queries by using an inner join. Inner joins are more efficient because it filters down the data and returns rows that have matching records in both tables. This is more computationally efficient and generally faster than outer joins.
SELECT customers.customerID, customers.first_name, customers.last_name, orders.order_amount
FROM customers
INNER JOIN orders
ON customers.customerID=orders.customerID
Conclusion Optimizing your Redshift performance is necessary to improve performance, reduce compute costs and save storage. In this tutorial, you have seen various ways to optimize Redshift’s performance. The methods covered are:
Enable concurrency / auto-scaling Utilize sort keys and distribution styles Make use of materialized views Use the appropriate compression encoding format Optimize SQL queries Do you need to load data to Redshift from APIs, databases, and files? Airbyte is an open-source data integration tool with hundreds of connectors including a Redshift destination connector . You can quickly get started integrating your business data with Airbyte Cloud .