Google BigQuery features two SQL dialects that users can employ to query data—Legacy SQL and Standard SQL. Legacy SQL is the syntax introduced by Google for the BigQuery platform, while Standard SQL fits closer to widely accepted ANSI SQL standards. BigQuery’s SQL options may initially seem identical, but understanding the key differences between Legacy SQL and Standard SQL is essential. This article will dive into the core distinctions between BigQuery’s Legacy and Standard SQL capabilities across syntax, functionality, portability with other tools, and recommended use cases.
What is Legacy SQL?
BigQuery Legacy SQL was the initial version of SQL used by Google. It was created to provide a SQL interface for querying and analyzing large datasets stored in BigQuery. The syntax and query structure of Legacy SQL is different from Standard SQL.
Legacy SQL supports simple SQL operations like SELECT, ORDER BY, and WHERE clauses. In addition, it supports INNER, [FULL | RIGHT | LEFT] OUTER, and CROSS JOIN operations. However, if left unspecified, the default is INNER. With Legacy SQL, you can also use various window functions like RANK(), ROW_NUMBER(), FIRST_VALUE(), AVG(), MIN(), MAX(), etc.
Why Legacy SQL?
Legacy SQL arose because it provided users with a simple and familiar SQL interface to interact with their data. However, as the technology and SQL standards evolved, Google introduced Standard SQL in BigQuery to align with broader industry standards, which offered more features and enhanced compatibility.
Here are some critical syntax and query structure details when working with BigQuery’s Legacy SQL:
- Legacy SQL uses legacy syntax like # instead of -- for comments.
# This is a Legacy SQL comment
- Table data is referenced using [dataset.table] convention in the query.
- Legacy SQL has limited support for things like JOINs, window functions, etc. Queries are more straightforward as a result.
- Each window function of Legacy SQL requires an OVER clause, which specifies the top and bottom of the window. The OVER clause components (partitioning, ordering, and framing) give additional control over the window.
Benefits of Legacy SQL
- Compatibility with Existing Codebase: If a significant amount of code is written in Legacy SQL, maintaining compatibility with the existing codebase might be a reason to continue using it. Migrating to Standard SQL would require updating and testing all current queries.
- Minimal Learning Curve for Legacy Users: Users already familiar with Legacy SQL may find it more convenient to continue using it. Switching to Standard SQL might require additional training and adjustments.
- Transition Flexibility: Organizations may continue using Legacy SQL for existing projects while gradually transitioning to Standard SQL for new projects.
- Stability and Mature Syntax: Legacy SQL might be better suited for specific use cases or scenarios in some cases. Users should evaluate their requirements and consider whether Legacy SQL provides advantages for their data processing needs.
What is Standard SQL?
Standard SQL refers to BigQuery’s SQL dialect that aligns with standard ANSI SQL syntax and capabilities. It adheres closely to syntax and functionality defined in ANSI SQL standards widely used across databases. Standard SQL supports ANSI SQL elements like subqueries, common table expressions (CTE), window functions, CASE statements, etc.
Data handling and manipulations leverage Standard SQL methods like EXTRACT, DATE_TRUNC, and DATETIME data types rather than legacy options. Joins between tables use ANSI SQL JOIN clause semantics and operators like INNER JOIN, LEFT JOIN, etc.
Overall, it provides a portable SQL environment that is familiar to other database platforms like PostgreSQL, SQL Server, and MySQL.
How does it align with ANSI SQL standards?
Here’s how BigQuery’s Standard SQL aligns with the ANSI SQL standard:
- Syntax and Semantics: Standard SQL adheres to the syntax and semantics defined by the ANSI SQL standard. This ensures a consistent and interoperable way of interfacing with relational databases.
- Basic SQL Operations: Standard SQL includes fundamental SQL operations such as SELECT, INSERT, UPDATE, DELETE, and CREATE statements. These operations are part of the ANSI SQL standard and provide a common foundation for database interaction.
- Data Types: Standard SQL defines a set of data types for representing different kinds of data. ANSI SQL standardizes these data types, ensuring consistency across database systems.
- Joins and Relationships: Standard SQL supports defining relationships between tables and performing joins. ANSI SQL specifies the syntax and semantics for various types of joins, allowing for consistent query execution.
- Constraints: StandardSQL includes features for defining constraints on tables, such as primary keys, foreign keys, unique constraints, and check constraints. These constraints follow the guidelines set by the ANSI SQL standard. Here,
- The PRIMARY KEY is the unique key in a database; a table cannot have more than one unique key.
- The FOREIGN KEY is a field in one table labeled as a primary key in another.
- The UNIQUE constraint ensures that every data item in a table is unique.
- The CHECK constraint limits the value range that the user placed in a column.
Benefits of Standard SQL
Standard SQL in Google BigQuery provides several benefits compared to using Legacy SQL:
- Standardization and Familiarity: Standard SQL adheres to the SQL 2011 standard, making it more consistent with other relational databases. Users familiar with traditional SQL syntax will find it easier to work with Standard SQL.
- Improved Handling of Semi-Structured Data: It offers enhanced support for nested and repeated data structures commonly found in a semi-structured format like JSON. This way, you can efficiently handle various data types.
- Advanced Analytical Functions: Standard SQL provides a broad set of analytical functions, including window functions, advanced aggregations, and statistical functions. Users can perform more sophisticated analyses directly within their SQL queries.
- User-Defined Functions (UDFs): It allows the creation and use of User-Defined Functions (UDFs), enabling users to write custom functions in JavaScript. The extensibility enhances the capabilities of SQL queries by incorporating user-defined logic.
- Compatibility with BI Tools: Standard SQL is often the preferred choice for integrating various BI tools and data visualization platforms, including Google Data Studio. This compatibility makes it easier for users to connect and analyze data using their preferred tools.
Example of Standard SQL
BigQuery Legacy SQL vs Standard SQL
Using Airbyte to Enhance Efficiency Beyond SQL Dialects
While understanding the difference between the two SQL dialects is important to enhance the efficiency of your workflows, the technique you use to move data into your BigQuery data warehouse can further enhance your productivity.
By incorporating Airbyte into your workflow, you can seamlessly automate the ELT process, facilitating the extraction and loading of data from various sources into Google BigQuery. This integration provides flexibility and scalability and enhances data management efficiency for analytics and reporting requirements. Airbyte is an open-source platform with 350+ data connectors, allowing you to connect to various data sources. However, if you wish to create your custom connector to connect with specific databases, APIs, or systems, Airbyte provides you with the connector development kit (CDK), where you can build your connectors in only 10 min without any code.
Conclusion
When deciding between BigQuery Legacy SQL and Standard SQL, considerations like an existing codebase, data complexity, and desired functionality play a pivotal role. Legacy SQL suits projects valuing simplicity and backward compatibility, while Standard SQL offers standardization and compatibility with modern tools. Transition decisions should align with specific project needs, ensuring optimal performance and ease of use.