BigQuery 101: A Beginner's Guide to Google's Cloud Data Warehouse
Google BigQuery is popular among data analysts and engineers due to its impressive features and capabilities, like its serverless architecture. This blog post will delve into what makes BigQuery such a powerful and widely used tool. From its scalable design to its ability to quickly and efficiently process and analyze large datasets, I will cover all the essential aspects you need to know about BigQuery.
By the end of this post, you will have a comprehensive, high-level understanding of BigQuery and be able to decide if it’s the tool you need to enhance your data analysis workflows and achieve new levels of insights and knowledge from your data.
The beginnings of BigQuery
In 2010, a team of forward-thinking engineers at Google's Seattle office had a groundbreaking idea: to leverage the company's advanced internal storage, computing, and analytics tools – the same ones used to crawl and index the entire internet – to transform businesses through an external service. This idea would eventually become Google Cloud Platform (GCP) and its flagship product, BigQuery.
BigQuery was created to make web-scale analytics accessible to all types of businesses without requiring them to invest in or manage their own data warehouses. Despite starting with a team of fewer than ten engineers, the BigQuery team has consistently evolved the product in response to customer feedback, resulting in a powerful yet user-friendly system that quickly gained a devoted following.
By 2013, BigQuery had established itself as a go-to platform for cloud-based analytics, particularly among tech startups and digital-native companies.
BigQuery architecture overview
BigQuery is a cloud-native data warehouse explicitly designed for the needs of data-driven enterprises in today's cloud-focused world. As a key component of Google Cloud's data analytics platform, BigQuery offers a range of features for ingesting, processing, storing, and analyzing data and facilitates collaboration among your team members.
One of the key advantages of BigQuery is its serverless design, which allows storage and compute to scale independently on demand. This design enables flexible, cost-effective resource management without maintaining expensive compute infrastructure around the clock.
Unlike on-premises massively parallel processing (MPP) systems or traditional node-based cloud data warehouses like Amazon Redshift, BigQuery's serverless approach simplifies getting data into the warehouse and performing analysis using standard SQL. All you need to do is load your data and start issuing SQL statements – no need to set up clusters, provision resources, configure disks or replication, or perform any other complex setup tasks.
Under the hood, BigQuery relies on a range of multi-tenant services powered by advanced Google infrastructure technologies such as Dremel, Colossus, Jupiter, and Borg. Let’s discuss each of these in more detail.
Dremel
BigQuery is built on top of Dremel, a revolutionary internal Google tool first described in a 2010 research paper. Dremel was developed to enable engineers and product managers to perform interactive ad-hoc analysis of read-only nested data at web scale. Before Dremel, analyzing such data often required hours of MapReduce processing. With the ability to query billions of rows and return results in seconds, Dremel revolutionized how we think about data analysis.
At its core, Dremel is a large, multi-tenant cluster that executes SQL queries by transforming them into execution trees. The leaves of these trees, known as "slots," handle the heavy lifting of reading data from storage and performing the necessary computations, while the branches, known as "mixers," handle aggregation. Dremel dynamically allocates slots to queries as needed, ensuring fairness for concurrent queries from multiple users. A single user can obtain thousands of slots for running queries, making Dremel a highly scalable and robust platform.
Colossus
BigQuery relies on Colossus, a system optimized for reading large volumes of structured data using Capacitor – Google’s columnar storage format – and efficient compression. Colossus also manages replication, recovery in the event of disk failure, and distributed administration to ensure no single point of failure. Colossus allows you to seamlessly scale up to dozens of petabytes of data storage without incurring the cost and complexity of adding additional compute resources, as is often the case with traditional data warehouses.
Jupiter and Borg
BigQuery's compute and storage resources communicate via the high-speed Jupiter network, which quickly transports data from one location to another. The platform is also orchestrated by Borg, Google's predecessor to Kubernetes, which handles resource allocation for Dremel’s mixers and slots.
BigQuery storage overview
As previously mentioned, BigQuery stores data in Colossus, Google's distributed file system. In Colossus, data is automatically compressed, encrypted, replicated, and distributed to ensure durability and security.
Data tables are organized into units called datasets, which are further grouped within the scope of a GCP project. This hierarchical structure of “project, dataset, and table” helps you logically organize and manage your data within the platform.
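For example, a query references a table through this hierarchy using a fully qualified name; the project, dataset, and table names below are hypothetical:

```sql
-- Fully qualified reference: `project.dataset.table`
SELECT order_id, order_date
FROM `my-project.sales_dataset.orders`
LIMIT 10;
```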
Traditional relational databases, like MySQL, store information in a row-based format known as record-oriented storage, which makes them well-suited for transactional updates and online transaction processing (OLTP) applications. In contrast, BigQuery uses a columnar storage method, storing each column in its own file block, making it an excellent choice for online analytical processing (OLAP) applications.
BigQuery takes advantage of innovative columnar storage techniques through Capacitor, its proprietary next-generation columnar storage format.
An important thing to note is that BigQuery allows for unlimited data manipulation language (DML) statements. Through these statements, you can insert, update, or delete a large number of rows in a table in a single job.
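As a sketch, DML statements in BigQuery use standard SQL syntax; the table and column names below are hypothetical:

```sql
-- Insert new rows
INSERT INTO `my-project.sales_dataset.orders` (order_id, status)
VALUES (1001, 'pending'), (1002, 'pending');

-- Update a large number of rows in a single job
UPDATE `my-project.sales_dataset.orders`
SET status = 'shipped'
WHERE ship_date IS NOT NULL;

-- Delete old rows
DELETE FROM `my-project.sales_dataset.orders`
WHERE order_date < '2020-01-01';
```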
BigQuery's storage pricing model is based on the amount of data stored (the first 10 GB is free each month) and whether the storage is considered active or long-term.
If a table or partition is not modified for 90 consecutive days, it is considered long-term storage, and its price is automatically reduced by 50%. This allows you to save on storage costs for data that is not being actively used or updated.
Maintaining your data in BigQuery can help optimize cost and performance. Instead of exporting older data to another storage option like Cloud Storage, consider using BigQuery's long-term storage pricing to keep your data in the platform. Additionally, maintaining data in BigQuery eliminates the need to discard valuable data or implement a separate data archiving solution.
Another effective optimization practice is to organize your data using BigQuery's table partitioning and clustering features to match your typical data access patterns, which can help improve query performance and make managing and analyzing your data easier.
Partition and clustering
A partitioned table in BigQuery is divided into smaller units called partitions to make data management and querying more efficient. You can create partitions based on a TIMESTAMP/DATE column or an INTEGER column.
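For illustration, here is how a table partitioned on a DATE column might be created; all names below are hypothetical:

```sql
-- Partition the table by day on the order_date column
CREATE TABLE `my-project.sales_dataset.orders_partitioned`
(
  order_id INT64,
  order_date DATE,
  amount NUMERIC
)
PARTITION BY order_date;
```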
BigQuery's decoupled storage and compute architecture leverages column-based partitioning to minimize the amount of data that slot workers read from disk. Once the data has been read, BigQuery's in-memory shuffle can automatically rearrange the data into more efficient shards and repartition it as needed.
Clustering in BigQuery involves organizing the data in a table based on the contents of one or more columns in the table's schema. These columns are used to group similar data, and it is usually best to use high cardinality and non-temporal columns for clustering. The order of the clustered columns determines the sort order of the data.
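A clustered (and partitioned) table might be defined as follows; queries that filter on the clustered columns can then scan less data. The names below are hypothetical:

```sql
CREATE TABLE `my-project.sales_dataset.events`
(
  event_time TIMESTAMP,
  customer_id STRING,
  event_type STRING
)
PARTITION BY DATE(event_time)         -- partition on a temporal column
CLUSTER BY customer_id, event_type;   -- cluster on high-cardinality columns
```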
When you add new data to a table or partition, BigQuery automatically performs background re-clustering to maintain the table's or partition's sorted property. This process is free and happens automatically for all users.
To learn more about partitioning and clustering in BigQuery, I recommend reading our comprehensive guide on BigQuery optimization.
BigQuery data ingestion overview
Before talking about data ingestion, it’s worth noting that you don’t necessarily need to load data into BigQuery to query it, since you can also query data from external sources. There are two main approaches:
- Load data into native tables - BigQuery can import datasets from various formats. These tables are kept in the BigQuery native storage, which includes replication, backups, scaling out capacity, and much more.
- Query from external tables without loading - You can query external data sources directly, without loading them into BigQuery storage, by using a federated query. This lets you query data in Google services such as Google Sheets, Google Drive, Google Cloud Storage, Cloud SQL, or Cloud Bigtable.
It's important to note that querying external data sources may not perform as well as querying data stored natively within BigQuery. If query performance is a priority, it is usually best to load the data into BigQuery directly. A federated query's performance depends on the external storage engine hosting the data, so keep this in mind when deciding where to store and access your data.
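As a sketch, a table over files sitting in Cloud Storage can be defined without loading the data; the bucket and dataset names below are hypothetical:

```sql
-- External table over CSV files in a Cloud Storage bucket
CREATE EXTERNAL TABLE `my-project.sales_dataset.raw_orders`
OPTIONS (
  format = 'CSV',
  uris = ['gs://my-bucket/orders/*.csv'],
  skip_leading_rows = 1
);
```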
Cloud Storage is ideal for staging incoming data for batch use cases. It is a reliable, highly available, and cost-effective object storage solution. When loading data from Cloud Storage into BigQuery, you can create a new table, append to an existing table, or overwrite an existing table.
You can load data in various formats, including CSV, JSON, Avro, Parquet, and ORC. When loading data, you must either define the table or partition schema or use schema auto-detection for supported data types.
There are a few ways in which you can load data:
- Using the BigQuery UI - Once your data is available on your local machine or in Cloud Storage, you can load it using the web UI.
- Using the CLI (bq load) - You can use the bq load command to load data from the command line.
- Using REST API - You can use the REST API from runtimes such as Java or Python.
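The same load can also be expressed in SQL with the LOAD DATA statement; the table and bucket names below are hypothetical:

```sql
-- Load CSV files from Cloud Storage into a native table
LOAD DATA INTO `my-project.sales_dataset.orders`
FROM FILES (
  format = 'CSV',
  uris = ['gs://my-bucket/orders/2023-*.csv'],
  skip_leading_rows = 1
);
```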
In addition to the tools listed above, you have the following data pipeline choices for loading data into BigQuery:
- Cloud Dataflow is a fully managed service on GCP built using the open-source Apache Beam API that supports a variety of data sources, including files, databases, message-based systems, and more.
- Cloud Dataproc is a fully managed Apache Spark and Apache Hadoop service on GCP.
Some notes on batch ingestion in BigQuery:
- Batch ingest is free - Loading data into BigQuery using the batch ingest techniques outlined above is free. Quotas and limitations, however, apply.
- Load performance is best effort - Because the computation utilized for data loading is provided at no cost to the user, BigQuery makes no promises about the shared pool's performance.
- Load jobs don’t consume query capacity - The slots used for data querying are separate from those used for data ingestion.
Streaming ingestion
The tabledata.insertAll method in BigQuery allows you to stream data into the platform one record at a time. Ingested data is available for querying from the streaming buffer within seconds of the first insertion, but it may take up to 90 minutes before the data is available for copy and export operations.
A more modern alternative to the tabledata.insertAll option is the BigQuery Storage Write API, a high-performance, unified data-ingestion API that combines both streaming and batch loading. It uses stream offsets for exactly-once semantics and is more efficient than the older tabledata.insertAll method, using gRPC streaming instead of REST. The API has lower cost and allows for up to 2 TB of free monthly ingestion.
One popular approach for ingesting real-time data on the Google Cloud Platform is to use a Cloud Dataflow pipeline in streaming mode to read messages from a cloud pub/sub topic and publish the processed data to BigQuery tables. An added benefit of using a Cloud Dataflow pipeline is that the same code can be used for streaming and batch processing.
There are other options for streaming data to BigQuery besides Cloud Dataflow. For example, the Apache Spark BigQuery Connector allows you to design streaming pipelines in Apache Spark and run them on a Hadoop cluster like Cloud Dataproc.
Data Transfer Service
BigQuery's Data Transfer Service (DTS) is a fully managed service that simplifies the process of importing data from various sources, including Google SaaS applications like Google Ads, external cloud storage providers like Amazon S3, and data warehouses like Teradata and Amazon Redshift.
DTS automates and manages data transfer into BigQuery on a scheduled basis, making it easy to keep your data up to date. It can also be used to recover from any interruptions or gaps in data that may occur.
As stated before, you don’t always need to load data into BigQuery before querying it. In the following situations, you can query without loading data:
- Public datasets - These are BigQuery datasets shared with the general public.
- Shared datasets - You can perform queries on a dataset someone has provided you without loading the data.
- External data sources - You can entirely avoid the data loading procedure by building a table based on an external data source and using query federation.
Aside from the native BigQuery solutions, you can also look at data integration possibilities from Google Cloud partners who have merged their industry-leading products with BigQuery. For example, Airbyte is a Google Cloud partner.
Querying data in BigQuery
BigQuery supports two SQL dialects: standard SQL and legacy SQL. It processes queries in two modes: interactive (the default) and batch.
- Interactive queries - These are queries processed as fast as possible, contributing to the concurrent rate limit and daily limit.
- Batch queries - These queries are queued and started as soon as idle resources in the BigQuery shared resource pool become available, generally within a few minutes.
Every table in BigQuery has a schema that describes its column names, data types, and metadata. Schemas are defined at the table level and give structure to the data.
- Data types can be simple data types, such as integers, or more complex, such as ARRAY and STRUCT, for nested and repeated values.
- Column modes can be NULLABLE, REQUIRED, or REPEATED.
When importing data or creating an empty table, you must define the table schema. You can also use schema auto-detection when loading data from self-describing source formats like Avro, Parquet, ORC, Cloud Firestore, or Cloud Datastore export files. You can define schemas manually or through a JSON file.
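For example, a table schema with nested and repeated fields can be declared directly in SQL; the names below are hypothetical:

```sql
CREATE TABLE `my-project.crm_dataset.customers`
(
  customer_id INT64 NOT NULL,   -- REQUIRED mode
  name STRING,                  -- NULLABLE mode
  addresses ARRAY<STRUCT<       -- REPEATED nested record
    city STRING,
    postal_code STRING
  >>
);
```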
In addition to standard aggregate and analytic functions, BigQuery offers a range of functions and operators for tasks such as text manipulation, date/time calculations, mathematical operations, and JSON extraction. You can find the complete list of available functions in the BigQuery SQL function reference.
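A few of these functions in action; the table and column names below are hypothetical:

```sql
SELECT
  UPPER(name) AS name_upper,                                   -- text manipulation
  DATE_DIFF(CURRENT_DATE(), signup_date, DAY) AS tenure_days,  -- date calculation
  ROUND(lifetime_value / 12, 2) AS monthly_value,              -- math operation
  JSON_EXTRACT_SCALAR(metadata, '$.plan') AS plan              -- JSON extraction
FROM `my-project.crm_dataset.customers`;
```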
Machine learning (ML) on large datasets can be complex due to the need for extensive programming and knowledge of ML frameworks. This complexity limits the number of people within a company who can develop ML solutions, and it excludes data analysts who understand the data but lack ML expertise.
BigQuery ML addresses the above issue by allowing you to use machine learning through standard SQL tools and skills. With BigQuery ML, you can build and evaluate ML models within BigQuery without needing to transfer data or wait for assistance from a data science team. There is no need to program an ML solution using Python or Java.
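As a sketch, training, evaluating, and using a model in BigQuery ML takes only a few SQL statements; the dataset, table, and model names below are hypothetical:

```sql
-- Train a logistic regression model to predict churn
CREATE OR REPLACE MODEL `my-project.analytics.churn_model`
OPTIONS (
  model_type = 'logistic_reg',
  input_label_cols = ['churned']
) AS
SELECT * FROM `my-project.analytics.customer_features`;

-- Evaluate the trained model
SELECT * FROM ML.EVALUATE(MODEL `my-project.analytics.churn_model`);

-- Make predictions on new data
SELECT *
FROM ML.PREDICT(MODEL `my-project.analytics.churn_model`,
                TABLE `my-project.analytics.new_customers`);
```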
There are several reasons why companies of all sizes widely adopt BigQuery:
- Ease of use - BigQuery excels at making your life as a user easier compared to similar cloud data warehouses. Its serverless architecture allows you to get up and running with minimal configuration, its SQL interface is familiar to many data analysts and scientists, and its REST API makes it straightforward to integrate with other systems.
- Scalability - BigQuery is designed to handle enormous amounts of data with high performance, making it an excellent choice for data-intensive applications.
- Cost-effectiveness - BigQuery uses a pay-as-you-go pricing model, which means you only pay for the resources you use. This can be more cost-effective than maintaining on-premises data warehouses or using other cloud platforms that require upfront investments or long-term commitments.
- Integration with other Google Cloud Platform services - BigQuery integrates seamlessly with other Google Cloud Platform (GCP) services and tools, allowing you to build end-to-end data pipelines and workflows.
- Machine learning capabilities - BigQuery offers built-in support for machine learning through its BigQuery ML feature, which enables users to build and deploy machine learning models directly within the BigQuery environment.
The features described above make BigQuery an attractive option for many organizations. In fact, at Airbyte, we selected BigQuery as the data warehouse in our analytics data stack.
I hope this blog post has given you a solid understanding of BigQuery and how it can help you unlock insights from your data.