Amazon Redshift is an enterprise-class relational database query and management system. It is known for its efficient storage and optimal query performance, which is achieved through massively parallel processing and columnar data storage.
In Redshift, each data block stores values of a single column for multiple rows. This columnar layout enables effective compression and fast retrieval of specific columns during query operations. Redshift uses SQL to analyze structured and semi-structured data across various data repositories. To bring new records into the system, Redshift relies on DML commands such as INSERT. Read on to learn about some of the critical aspects of the INSERT command.
What is the INSERT INTO Statement in Redshift?
In Redshift, the INSERT statement is used to add new rows of data to a table. You can insert a single row or multiple rows with the VALUES syntax, or insert the results of a query. Take a look at the syntax:
INSERT INTO table_name [(column[, ...])]
{DEFAULT VALUES |
VALUES ({expression | DEFAULT}[, ...])
[, ({expression | DEFAULT}[, ...])
[, ...]] |
query}
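Before breaking down each parameter, here is a minimal sketch of a single-row insert, using a hypothetical sales_demo table that the examples below reuse:
-- Hypothetical table used throughout the examples
CREATE TABLE sales_demo (
    sale_id INT NOT NULL,
    product VARCHAR(50),
    amount  DECIMAL(8,2) DEFAULT 0.00
);
-- Insert a single row using the VALUES syntax
INSERT INTO sales_demo VALUES (1, 'Notebook', 24.99);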
Let’s understand each parameter of the syntax in more detail:
Table_name
The table_name parameter in the Redshift INSERT INTO statement represents either a temporary or persistent table. To add rows, you must be the owner of the table or have the INSERT privilege on it. When you grant INSERT privileges to team members, such as data analysts who are well-versed in SQL operations, consider granting SELECT as well; without it, they cannot read the table's objects and attributes, making it difficult to verify the results of their INSERT operations.
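For example, a table owner could grant both privileges on the hypothetical sales_demo table to an illustrative analyst_user account:
GRANT INSERT, SELECT ON sales_demo TO analyst_user;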
Column
The column parameter in the INSERT statement allows you to add data values into one or more specified columns of the table. You can list the target column names in any order. If you do not specify a column list, the values you insert must align with the table columns in the order declared in your CREATE TABLE statement. If you provide fewer values than there are table columns, the first n columns are populated with the data. Any columns not listed in the INSERT statement are assigned default or null values by Redshift.
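As a sketch against the hypothetical sales_demo table, the following statement lists only two of the three columns, so the unlisted amount column falls back to its default:
INSERT INTO sales_demo (sale_id, product)
VALUES (2, 'Stapler');
-- amount is not listed, so it receives its declared default (0.00)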
Default Values
The DEFAULT VALUES clause is used to add a row consisting of the default values assigned to columns when you created the table. If a column lacks a default value, a null value is inserted into it. However, if any column is declared NOT NULL and has no default value, the INSERT statement results in an error.
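In the hypothetical sales_demo table, DEFAULT VALUES would fail because sale_id is NOT NULL with no default; a sketch with a table where every column has a default or is nullable works as expected:
-- Would fail: sale_id is NOT NULL and has no default
-- INSERT INTO sales_demo DEFAULT VALUES;
CREATE TABLE audit_demo (
    created_by VARCHAR(50) DEFAULT 'system',
    note       VARCHAR(100)  -- nullable, no default
);
INSERT INTO audit_demo DEFAULT VALUES;  -- inserts ('system', NULL)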
Values
The VALUES keyword is employed to insert one or more rows in the table, where each row is composed of one or more values. The VALUES list for each row must correspond to the specified column list. If you want to insert multiple rows, you must use a comma delimiter between each list of expressions, and the lists must have an identical number of values.
It is important to note that you must not repeat the VALUES keyword before each list of expressions; it appears only once in the statement.
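A minimal multi-row sketch against the hypothetical sales_demo table:
INSERT INTO sales_demo (sale_id, product, amount)
VALUES
    (3, 'Desk Lamp', 18.50),
    (4, 'Monitor', 129.00);  -- VALUES is written only once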
Expression
The expression parameter in the Redshift INSERT INTO statement represents a single data value. Each value you add must align with the data type of the column into which it is inserted. If the value has a different data type than the column's declared data type, Redshift automatically converts the new entry to a compatible data type where possible. This conversion can be better understood with the following example:
Consider a column with the INT (integer) data type that initially contains entries such as 12, 29, and 45. If you attempt to insert the decimal value 3.2 into this column, the value is automatically converted and stored as 3.
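A quick sketch of this behavior with the hypothetical sales_demo table:
INSERT INTO sales_demo (sale_id, product)
VALUES (5.2, 'Pen');
-- 5.2 is implicitly converted to the INT type of sale_id and stored as 5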
In a single-row INSERT VALUES statement, you can use a scalar subquery as an expression. In this case, the result of the subquery will be inserted into the corresponding column. However, subqueries are not supported as expressions for multiple-row INSERT VALUES statements.
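For instance, a scalar subquery can compute the next ID in a single-row insert (an illustrative pattern using the hypothetical sales_demo table):
INSERT INTO sales_demo (sale_id, product, amount)
VALUES ((SELECT MAX(sale_id) + 1 FROM sales_demo), 'Keyboard', 45.00);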
Default
The DEFAULT keyword is employed to insert a default value for a column, as defined during the creation of the table. If you have not specified a default value for the column earlier, a null value is inserted. However, you cannot insert a default value into a column with a NOT NULL constraint unless the column has an explicit default value assigned to it.
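A sketch using the hypothetical sales_demo table, whose amount column has a declared default:
INSERT INTO sales_demo (sale_id, product, amount)
VALUES (6, 'Mouse', DEFAULT);  -- amount takes its declared default, 0.00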
Query
The query parameter in the INSERT statement allows you to insert one or more rows into the table by defining a query, as in INSERT INTO…SELECT.
All the resultant rows produced by the query are automatically inserted into your table. The query you specify must return a column list that is compatible with the columns in the table, although the column names do not have to necessarily match.
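As an illustration, rows could be copied from a hypothetical staging_sales table whose columns are type-compatible with sales_demo even though the names differ:
INSERT INTO sales_demo (sale_id, product, amount)
SELECT id, item_name, price  -- names differ, but types and order are compatible
FROM staging_sales
WHERE price > 0;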
Important Points to Remember When Using Redshift’s INSERT INTO Statement
Data Format Matching:
The data types of the inserted values must align with those you specified during table creation.
Post-insert Operations:
After inserting new rows into your table, vacuum the table to reclaim storage space and re-sort the rows. You can also analyze the table to update statistics for the query planner.
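For example, against the hypothetical sales_demo table:
VACUUM sales_demo;
ANALYZE sales_demo;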
Handling Decimals:
When you insert values into DECIMAL columns with more digits than the specified scale, the loaded values are rounded to that scale. For instance, if you insert a value of 30.369 into a DECIMAL(8,2) column, the final stored value is 30.37.
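In the hypothetical sales_demo table, whose amount column is DECIMAL(8,2):
INSERT INTO sales_demo (sale_id, product, amount)
VALUES (7, 'Cable', 30.369);  -- stored as 30.37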
Data Loading With COPY Command:
You can use the COPY command to load large amounts of data into Redshift. Employing individual INSERT statements to add vast datasets to your table might be prohibitively slow. Alternatively, if your data exists in other Amazon Redshift database tables, you can consider using INSERT INTO SELECT or CREATE TABLE AS statements to enhance the performance.
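A sketch of a COPY load from Amazon S3; the bucket path and IAM role ARN below are placeholders:
COPY sales_demo
FROM 's3://my-bucket/sales/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
FORMAT AS CSV;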
Using Redshift’s INSERT INTO Command For External Tables
The INSERT (external table) command is used to insert the results of a SELECT query into your existing tables on external catalogs. These catalogs include AWS Glue, AWS Lake Formation, or an Apache Hive metastore. To use this command, you require the same AWS Identity and Access Management (IAM) role that you used for the CREATE EXTERNAL SCHEMA command. This command is usually used to interact with the external databases in Amazon S3.
For non-partitioned tables, you can use the INSERT command to write data to the Amazon S3 location using the specified table properties and file format. For partitioned tables, the command writes data to the S3 location according to the partition key specified in the table. After the INSERT operation completes, new partitions are automatically registered in the external catalog.
The syntax for the INSERT INTO command is different for external tables. Take a look at the syntax below:
INSERT INTO external_schema.table_name
{select_statement}
Let’s understand each parameter of the syntax in more detail:
External_schema.table_name
The external_schema.table_name identifier specifies an existing external schema and the target external table for the INSERT operation.
Select_statement
The select_statement is used to insert one or more rows into the external table by specifying a query. All rows produced by the query are written to Amazon S3 based on the table definition, in either TEXTFILE or PARQUET format. Your query must return a column list compatible with the column data types in the external table.
It is important to note that your SELECT query must have the same number of columns as the sum of data and partition columns in the external table. You must position the partition columns at the end of your query, in the same order as defined in the CREATE EXTERNAL TABLE command. While the column names do not necessarily have to match, their position, data types, and order are crucial for a successful operation.
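A sketch, assuming a hypothetical external schema spectrum_schema with a daily_sales table partitioned by sale_date, loaded from an illustrative local_sales table:
INSERT INTO spectrum_schema.daily_sales
SELECT sale_id, product, amount,
       sale_date  -- partition column positioned last
FROM local_sales;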
Insert Data Into Redshift: An Alternate Way
There are a few different ways to insert data into Redshift, including using programming languages like Python. You can also insert, update, and merge different tables in your Redshift database using DML SQL queries. However, when there is a sudden influx of large datasets, or if you have data spread across multiple locations, it may be challenging to bring it all together. To achieve this, you can use data integration and replication platforms like Airbyte.
Airbyte allows you to extract and load data from diverse sources. In the platform's expansive library of 350+ pre-built connectors, you can easily find a connector for Amazon Redshift. Configure your source and destination in two simple steps and set up your data pipeline without writing a single line of code. It is a much quicker way to load data into Redshift, as Airbyte can create robust data pipelines within minutes.
Airbyte also provides change data capture (CDC) capabilities, so even small changes at the source are synchronized with your Redshift database. Thus, you can conduct in-depth analysis or move data into external tables, knowing that your dataset is accurate and up-to-date.
The Final Word
With Redshift’s INSERT INTO command, you can add substantial data to your tables. Subsequently, this data can be used for analyzing consumer behavior, sales patterns, and business metrics. You can also create machine learning models to improve forecasting techniques by integrating your datasets with Amazon Redshift ML.
Since the INSERT INTO command is not efficient for loading vast quantities of data in a short span, you can sign up with Airbyte to create a data pipeline to Redshift and manage large data volumes seamlessly.