Redshift INSERT INTO Command: 3 Critical Aspects

Jim Kutz
August 4, 2025


What is the INSERT INTO Statement in Redshift?

Amazon Redshift is an enterprise-class relational database query and management system. It is known for its efficient storage and optimal query performance, which is achieved through massively parallel processing and columnar data storage.

In Redshift, each data block stores values of a single column for multiple rows. This design provides effective compression and fast retrieval of specific columns during query operations. Redshift uses SQL to analyze structured and semi-structured data across various data repositories. As new records enter the system, you add them with DML commands such as INSERT.

The INSERT statement in Redshift adds new rows of data to a table. You can insert a single row or multiple rows with the VALUES syntax, or insert the results of a query. Take a look at the syntax:

INSERT INTO table_name [(column[, ...])]
{ DEFAULT VALUES
  | VALUES (expression | DEFAULT [, ...])
    [, (expression | DEFAULT [, ...])
    [, ...]]
  | query }

Let's understand each parameter of the syntax in more detail:

Table_name

The table_name parameter in the Redshift INSERT INTO statement represents either a temporary or persistent table. To add rows, you must own the table or hold the INSERT privilege on it. You can grant INSERT to team members, such as data analysts who are well-versed in SQL operations. Keep in mind that the INSERT privilege alone does not allow them to read the table; granting SELECT as well lets them inspect its columns and existing rows.
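
For example, granting both privileges might look like the following sketch, where the table sales_facts and the user analyst_user are hypothetical names used only for illustration:

-- Allow the analyst to add rows and to read the table's columns and data
GRANT INSERT, SELECT ON sales_facts TO analyst_user;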

Column

The column parameter in the INSERT statement allows you to add data values into one or more specified columns of the table. You can list the target column names in any order. If you do not specify the column list, the values you insert must align with the table columns in the order declared in your CREATE TABLE statement. If you provide fewer values than there are table columns, the first n columns are loaded. Any columns not listed in the INSERT statement are filled by Redshift with their default values or, if no default is defined, with NULL.
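
As a brief illustration, assume a hypothetical users table in which the status column has a default value; the names here are not from the original article:

CREATE TABLE users (
    id INT,
    name VARCHAR(50),
    status VARCHAR(10) DEFAULT 'active'
);

-- Only id and name are listed; status receives its default value ('active')
INSERT INTO users (id, name) VALUES (1, 'Alice');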

Default Values

The DEFAULT VALUES clause adds a row made up of the default values assigned to the columns when the table was created. If a column has no default value, NULL is inserted into that column. However, if any column was declared NOT NULL when the table was created, the INSERT statement returns an error.
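
A minimal sketch, reusing the hypothetical users table from above:

-- Inserts one row built entirely from column defaults;
-- id and name have no default, so they receive NULL
INSERT INTO users DEFAULT VALUES;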

Values

The VALUES keyword is employed to insert one or more rows in the table, where each row is composed of one or more values. The VALUES list for each row must correspond to the specified column list. If you want to insert multiple rows, you must use a comma delimiter between each list of expressions, and the lists must have an identical number of values.

Note that the VALUES keyword appears only once, even when you insert multiple rows.
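
For example, with the hypothetical users table, a single VALUES keyword covers both the single-row and multi-row forms:

-- Single row
INSERT INTO users (id, name) VALUES (2, 'Bob');

-- Multiple rows: VALUES appears once, and the row lists are comma-separated
INSERT INTO users (id, name) VALUES
    (3, 'Carol'),
    (4, 'Dave');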

Expression

The expression parameter in the Redshift INSERT INTO statement represents a single data value. Each value you add must align with the data type of the column into which it is being inserted. If the value has a different data type than the column's declared data type, Redshift automatically converts the new entry to a compatible data type. This conversion can be better understood with the following example:

Let's consider a column with the data type as INT. Initially, it contains a few entries, such as 12, 29, 45, etc. Now, if you attempt to add a decimal value of 3.2 into this column, the value will automatically be converted to 3.

In a single-row INSERT … VALUES statement, you can use a scalar subquery as an expression. In this case, the result of the subquery will be inserted into the corresponding column. However, subqueries are not supported as expressions for multiple-row INSERT … VALUES statements.
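
As an illustration, assuming a hypothetical archive_users table alongside the users table, a scalar subquery can supply one of the values in a single-row insert:

-- The result of the scalar subquery is inserted into the id column
INSERT INTO users (id, name)
VALUES ((SELECT MAX(id) + 1 FROM archive_users), 'Eve');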

Default

The DEFAULT keyword is employed to insert a default value for a column, as defined during the creation of the table. If you have not specified a default value for the column earlier, a null value is inserted. However, you cannot insert a default value into a column with a NOT NULL constraint unless the column has an explicit default value assigned to it.
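
For example, using the hypothetical users table, the DEFAULT keyword can be mixed with explicit values:

-- status falls back to its column default ('active')
INSERT INTO users (id, name, status) VALUES (5, 'Frank', DEFAULT);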

Query

The query parameter in the INSERT statement allows you to insert one or more rows produced by a query, as in INSERT INTO … SELECT. All rows returned by the query are inserted into your table. The query you specify must return a column list that is compatible with the columns in the table, although the column names do not have to match.
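
A minimal sketch, assuming a hypothetical users_staging table with compatible columns:

-- Every row returned by the SELECT is inserted into users
INSERT INTO users (id, name, status)
SELECT id, name, status
FROM users_staging
WHERE status IS NOT NULL;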

What Are the Most Important Considerations When Using INSERT INTO in Redshift?

Data Format Matching

The data types of the values you insert must align with the column definitions you specified during table creation.

Post-insert Operations

After inserting new rows in your table, you must vacuum the table to reclaim storage space and re-sort the rows. You can also analyze the table to update statistics for the query planner.
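
For example, after loading a batch of rows into the hypothetical users table:

-- Reclaim space and re-sort rows, then refresh statistics for the query planner
VACUUM users;
ANALYZE users;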

Handling Decimals

When you insert values into DECIMAL columns and exceed the specified scale, the loaded values are rounded accordingly. For instance, if you insert a value of 30.369 into a DECIMAL(8,2) column, the stored value will be 30.37.
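
A small illustration, using a hypothetical payments table:

CREATE TABLE payments (amount DECIMAL(8,2));

-- 30.369 exceeds the scale of 2 and is stored as 30.37
INSERT INTO payments (amount) VALUES (30.369);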

Data Loading With COPY Command

You can use the COPY command to load large amounts of data into Redshift. Employing individual INSERT statements to add vast datasets to your table might be prohibitively slow. Alternatively, if your data exists in other Amazon Redshift database tables, you can consider using INSERT INTO … SELECT or CREATE TABLE AS statements to enhance performance.
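
A hedged sketch of the COPY alternative; the S3 path and IAM role ARN below are placeholders, not values from this article:

-- Bulk load from Amazon S3 instead of issuing many individual INSERT statements
COPY users
FROM 's3://my-example-bucket/users/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
FORMAT AS CSV;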

Concurrent Operations and Modern Enhancements

Recent enhancements to Amazon Redshift have introduced concurrent insert capabilities, allowing multiple pure write operations to execute simultaneously during their pre-ingestion phases. This improvement addresses traditional bottlenecks where INSERT and COPY operations would execute sequentially, significantly improving throughput in multi-user environments. The concurrent execution model enables scanning, sorting, and aggregation operations to proceed in parallel while maintaining data consistency through serialized final insertion steps.

What Are the Most Common Performance Challenges When Using INSERT INTO in Redshift?

Performance optimization represents one of the most critical aspects of working with INSERT INTO operations in Amazon Redshift. Understanding these challenges enables you to make informed decisions about when to use INSERT statements versus alternative approaches like the COPY command.

Row-by-Row Processing Limitations

The fundamental architecture of INSERT operations creates inherent performance bottlenecks when dealing with large datasets. Each INSERT statement processes data row-by-row, creating substantial overhead that becomes increasingly problematic as data volumes grow. Unlike bulk loading mechanisms, INSERT commands require the cluster to perform multiple operations for each individual statement, including query parsing, execution planning, data distribution across nodes, and transaction management.

This limitation becomes particularly pronounced in production environments where you might attempt to load substantial volumes of data. INSERT operations can achieve only modest throughput rates, with some implementations reporting processing capabilities limited to approximately 30,000 records per hour, which falls far short of enterprise-level requirements for modern data pipelines.

Concurrency and Locking Issues

Multiple concurrent INSERT operations can create deadlock scenarios, particularly in environments where multiple processes or applications attempt to perform concurrent data modifications. Redshift's table-level locking mechanism creates potential conflict scenarios that can result in query blocking and overall system performance degradation when not properly managed.

When multiple processes attempt to insert data simultaneously into the same tables, blocked processes can time out or fail, creating cascading failures that disrupt entire data processing workflows. These deadlock conditions require careful orchestration and error handling strategies to maintain system reliability.

Resource Utilization Inefficiencies

INSERT operations typically result in suboptimal cluster utilization, as the distributed nature of Redshift clusters cannot be fully leveraged when processing individual row insertions. This inefficient resource utilization translates directly into higher operational costs, as you pay for cluster capacity that cannot be effectively utilized due to the inherent limitations of INSERT-based data loading approaches.

Multi-Row INSERT Optimization Strategies

To address some performance limitations, you can implement multi-row INSERT statements that batch multiple rows within a single SQL statement. This approach reduces the per-row processing cost by amortizing transaction and coordination overhead across multiple rows. Optimal batch sizes typically range between 100 and 1,000 rows per INSERT statement, though the ideal size varies based on row width, data types, and cluster configuration.

The syntax for multi-row INSERT operations follows this pattern:

INSERT INTO table_name (column1, column2, column3)
VALUES 
    (value1a, value2a, value3a),
    (value1b, value2b, value3b),
    (value1c, value2c, value3c);

Staging Table Strategies

Professional implementations often utilize staging tables to optimize INSERT performance while maintaining flexibility for complex data transformation requirements. This approach involves loading data into temporary staging tables using high-performance COPY operations, then processing the data using INSERT INTO SELECT statements to transfer processed data to production tables.

Staging table strategies enable you to leverage the performance benefits of bulk loading while maintaining sophisticated data processing capabilities. The staging approach also provides better error isolation, allowing you to validate and clean data before it reaches production tables.
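
A minimal sketch of the staging pattern, with placeholder table, bucket, and role names:

-- 1. Create a temporary staging table that mirrors the production table
CREATE TEMP TABLE users_staging (LIKE users);

-- 2. Bulk load raw data into the staging table
COPY users_staging
FROM 's3://my-example-bucket/users/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
FORMAT AS CSV;

-- 3. Validate, transform, and move rows into the production table
INSERT INTO users
SELECT id, UPPER(name), status
FROM users_staging
WHERE id IS NOT NULL;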

How Can You Troubleshoot Common INSERT INTO Errors and Issues in Redshift?

Effective troubleshooting of INSERT INTO operations requires systematic approaches to identify, diagnose, and resolve common issues that can impact data loading reliability and performance. Understanding Redshift's diagnostic capabilities and error patterns enables you to maintain robust data integration processes.

System Table Utilization for Diagnostics

Redshift provides comprehensive system tables that enable detailed analysis of INSERT operation performance and error conditions. The STL_ERROR system table contains information about query errors, while STL_QUERY provides execution metrics for all queries, including INSERT operations. These tables enable you to correlate error information with specific INSERT operations and analyze performance patterns over time.

For troubleshooting INSERT operations, you can query system tables to identify patterns such as:

SELECT query, starttime, endtime, aborted, substring(querytxt,1,60) as sql_text
FROM stl_query
WHERE querytxt LIKE '%INSERT%'
AND starttime >= dateadd(hour, -1, getdate())
ORDER BY starttime DESC;

Data Type Mismatch Resolution

Data type mismatch errors represent one of the most common categories of INSERT-related problems, occurring when source data formats do not align with target table column definitions. These errors can be particularly challenging to diagnose when they occur sporadically within large datasets or when source data contains subtle formatting variations.

To resolve data type issues, you should implement comprehensive data profiling and validation capabilities to identify potential type mismatch issues before they cause INSERT operation failures. This includes validating data ranges, format consistency, and null value handling across your data sources.
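
One hedged way to make type expectations explicit is to cast in the staging query itself; the column and table names here are hypothetical:

-- Explicit casts surface conversion problems as clear errors
-- instead of silently loading mismatched values
INSERT INTO users (id, name, status)
SELECT CAST(id_raw AS INT),
       CAST(name_raw AS VARCHAR(50)),
       COALESCE(status_raw, 'active')
FROM users_staging_raw;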

Lock Detection and Resolution

Redshift's locking system includes three distinct lock modes that can impact INSERT operations: AccessExclusiveLock acquired during DDL operations, AccessShareLock acquired during SELECT operations, and ShareRowExclusiveLock acquired during INSERT, COPY, DELETE, or UPDATE operations. Understanding these lock types enables you to diagnose and resolve concurrency conflicts.

You can identify current locking situations by querying the SVV_TRANSACTIONS system view, which shows which transactions hold or are waiting on locks:

SELECT txn_owner, pid, xid, relation, lock_mode, granted, txn_start
FROM svv_transactions
WHERE lock_mode = 'ShareRowExclusiveLock'
ORDER BY txn_start;

Error Handling and Recovery Procedures

Comprehensive error handling strategies should account for various failure modes that can occur during INSERT operations, including network connectivity issues, resource constraints, and data quality problems. Implementing robust retry mechanisms with exponential backoff can help recover from transient failures while avoiding system overload.

For production environments, you should establish monitoring systems that can detect INSERT operation failures and provide sufficient diagnostic information to enable rapid problem resolution. This includes implementing alerting systems that can notify administrators of performance degradation or error conditions before they impact business operations.

Performance Monitoring and Optimization

Regular monitoring of INSERT operation metrics enables proactive identification of performance degradation before it significantly impacts data pipeline operations. Key metrics to monitor include execution times, resource utilization patterns, and concurrent operation conflicts.

You can create monitoring queries that track INSERT performance trends over time:

SELECT DATE_TRUNC('hour', starttime) as hour,
       COUNT(*) as insert_count,
       AVG(datediff(seconds, starttime, endtime)) as avg_duration,
       MAX(datediff(seconds, starttime, endtime)) as max_duration
FROM stl_query
WHERE querytxt LIKE '%INSERT INTO%'
AND starttime >= dateadd(day, -7, getdate())
GROUP BY DATE_TRUNC('hour', starttime)
ORDER BY hour DESC;

How Does INSERT INTO Work With External Tables in Redshift?

The INSERT (external table) command is used to insert the results of a SELECT query into existing tables on external catalogs, such as AWS Glue, AWS Lake Formation, or an Apache Hive metastore. To use this command, you need the same AWS Identity and Access Management (IAM) role that you used for the CREATE EXTERNAL SCHEMA command. The command is typically used to write data to external tables backed by Amazon S3.

For non-partitioned tables, you can use the INSERT command to write data into the Amazon S3 location using specified table properties and file formats. For partitioned tables, the command writes data to the S3 location according to the partition key specified in the table. After the INSERT operation is completed, the external table automatically registers the new partitions made in the catalog.

The syntax for the INSERT INTO command is different for external tables:

INSERT INTO external_schema.table_name
select_statement

External_schema.table_name

The external_schema.table_name identifier specifies an existing external schema and the target external table for the INSERT operation.

Select_statement

The select_statement inserts one or more rows into the external table by specifying a query. All rows produced by the query are written to Amazon S3 based on the table definition, in either TEXTFILE or PARQUET format. Your query must return a column list compatible with the column data types in the external table.

It is important to note that your SELECT query must have the same number of columns as the sum of data and partition columns in the external table. You must position partition columns towards the end of your query and in the same order as defined in the CREATE EXTERNAL TABLE command. While the column names do not have to necessarily match, their location, data types, and order are crucial for a successful operation.
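
A hedged sketch, assuming a hypothetical external table spectrum_schema.sales_ext partitioned by sale_date and a local table local_sales:

-- The partition column (sale_date) is positioned last, matching the
-- order declared in the CREATE EXTERNAL TABLE statement
INSERT INTO spectrum_schema.sales_ext
SELECT order_id, amount, sale_date
FROM local_sales;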

How Does Airbyte Optimize INSERT INTO Operations for Redshift?

When dealing with large-scale data integration requirements, traditional INSERT INTO approaches often fall short of enterprise performance expectations. Airbyte's integration with Amazon Redshift addresses these limitations through sophisticated optimization strategies that move beyond row-by-row processing toward bulk loading methodologies.

Strategic COPY Command Implementation

Airbyte's architecture strategically avoids the performance limitations of traditional INSERT operations by implementing a COPY-based approach that leverages Amazon S3 as an intermediary staging layer. Data replication occurs by first uploading data to an S3 bucket and then issuing optimized COPY commands, aligning with Amazon's recommended best practices for high-performance data loading.

The platform automatically handles the complexity of S3 staging, including data format optimization, file organization, and manifest file creation that maximizes Redshift's parallel processing capabilities. This staging mechanism enables near-linear scaling of loading performance as cluster size increases, addressing the fundamental scalability limitations of INSERT-based approaches.

Advanced Performance Optimization Features

Airbyte's integration includes sophisticated data compression and format optimization that occurs automatically without requiring manual intervention. The platform handles data compression before loading, which significantly reduces storage requirements and improves query performance in Redshift. This optimization eliminates the manual compression strategy implementation typically required with custom INSERT-based solutions.

The platform also provides intelligent scheduling capabilities that enable strategic data loading during off-peak hours, minimizing resource contention and ensuring consistent cluster performance during peak usage periods. This scheduling optimization helps distribute data loads evenly over time, preventing resource utilization spikes that commonly occur with batch-oriented INSERT processes.

Comprehensive Error Handling and Data Validation

Unlike traditional INSERT operations that often provide limited visibility into failure causes, Airbyte implements comprehensive error reporting mechanisms that identify specific data quality issues and provide detailed diagnostic information. The platform automatically performs data validation checks and compares loaded data against source data to identify and address discrepancies early in the process.

This automated validation eliminates the manual verification processes typically required with INSERT-based approaches, reducing operational overhead while improving data quality assurance. The platform's approach to handling data size limitations includes intelligent processing strategies that automatically manage Redshift's constraints while preserving data integrity.

Multiple Synchronization Modes for Different Use Cases

Airbyte provides sophisticated synchronization capabilities that address different data loading scenarios while maintaining optimal performance characteristics. The platform supports Full Refresh operations for complete data replacement, Incremental Append modes for efficient change processing, and Incremental Append + Deduped modes for automatic data quality management.

These synchronization modes utilize cursor-based change tracking that syncs only new or modified data, preventing the re-fetching of previously replicated data. This approach utilizes fields like updated_at timestamps to determine which records require replication, enabling efficient processing of large datasets with minimal resource utilization compared to traditional INSERT approaches that process all data regardless of change status.

Enterprise-Grade Security and Governance

The integration addresses enterprise security requirements through comprehensive encryption and access control mechanisms that exceed the capabilities typically available with custom INSERT implementations. All Redshift connections utilize SSL encryption, while the S3 staging architecture enables encryption at rest for staged data, providing comprehensive protection for sensitive information throughout the loading pipeline.

Airbyte's platform includes sophisticated IAM role management that enables secure S3 bucket access while maintaining principle of least privilege. The integration requires specific permissions for staging bucket operations while limiting access to only necessary resources, providing superior security compared to INSERT approaches that may require broader database privileges.

What Are the Key Takeaways for Using INSERT INTO in Redshift?

With Redshift's INSERT INTO command, you can add substantial data to your tables for analyzing consumer behavior, sales patterns, and business metrics. You can also create machine learning models to improve forecasting techniques by integrating your datasets with Amazon Redshift ML.

However, understanding the performance characteristics and limitations of INSERT operations is crucial for making informed decisions about data loading strategies. While INSERT provides flexibility for real-time data processing and complex transformation logic, it creates significant performance bottlenecks when dealing with large datasets due to its row-by-row processing nature.

For enterprise-scale data operations, platforms like Airbyte provide optimized alternatives that leverage COPY-based architectures to achieve superior performance while maintaining the flexibility and control that data teams require. These modern approaches address the fundamental limitations of traditional INSERT operations while providing comprehensive error handling, automated optimization, and enterprise-grade security capabilities.

The recent enhancements to Redshift, including concurrent insert capabilities and improved system monitoring tools, provide expanding opportunities to optimize INSERT operations. However, these advances also introduce new complexity layers that require ongoing education and adaptation of existing practices to fully realize available benefits.

Since the INSERT INTO command is not efficient for loading vast quantities of data in short timeframes, you can leverage modern data integration platforms that provide bulk loading optimizations, automated error handling, and comprehensive monitoring capabilities. Create optimized data pipelines to Redshift and manage large data volumes seamlessly while maintaining the flexibility and control your data operations require.
