Data Insights
Article

BigQuery 101: A Beginner's Guide to Google's Cloud Data Warehouse

Thalia Barrera
January 12, 2023
15 min read
Limitless data movement with free Alpha and Beta connectors
Replicate data from or into Bigquery, in minutes
Learn more about the Airbyte Bigquery connector ->

Streaming Ingestion

The tabledata.insertAll function in BigQuery allows you to stream data into the platform one record at a time. Ingested data is immediately accessible for querying from the streaming buffer within seconds of the first insertion. Still, it may take up to 90 minutes for the data to be available for copy and export.

A more modern alternative to the tabledata.insertAll option is the BigQuery Storage Write API, a high-performance, unified data-ingestion API that combines both streaming and batch loading. It uses stream offsets for exactly-once semantics and is more efficient than the older tabledata.insertAll method, using gRPC streaming instead of REST. The API has lower cost and allows for up to 2 TB of free monthly ingestion.

One popular approach for ingesting real-time data on the Google Cloud Platform is to use a Cloud Dataflow pipeline in streaming mode to read messages from a cloud pub/sub topic and publish the processed data to BigQuery tables. An added benefit of using a Cloud Dataflow pipeline is that the same code can be used for streaming and batch processing.

There are other options for streaming data to BigQuery besides Cloud Dataflow. For example, the Apache Spark BigQuery Connector allows you to design streaming pipelines in Apache Spark and run them on a Hadoop cluster like Cloud Dataproc.

Data Transfer Service

BigQuery's Data Transfer Service (DTS) is a fully managed service that simplifies the process of importing data from various sources, including Google SaaS applications like Google Ads, external cloud storage providers like Amazon S3, and data warehouses like Teradata and Amazon Redshift.

DTS automates and manages data transfer into BigQuery on a scheduled basis, making it easy to keep your data up to date. It can also be used to recover from any interruptions or gaps in data that may occur.

Query Materialization

As stated before, you don’t always need to load data into BigQuery. In the following situations, you can use query materialization:

  • Shared datasets - You can perform queries on a dataset someone has provided you without loading the data.
  • External data sources - You can entirely avoid the data loading procedure by building a table based on an external data source and using query federation.

Partner Integrations

Aside from the native BigQuery solutions, you can also look at data integration possibilities from Google Cloud partners who have merged their industry-leading products with BigQuery. For example, Airbyte is a Google Cloud partner, which allows users to import data from Postgres to BigQuery, along with hundreds of other source connectors.

Querying data

BigQuery is compatible with two SQL dialects: standard SQL and legacy SQL. BigQuery processes queries in two modes: interactive (default) and batch.

  • Interactive queries - These are queries processed as fast as possible, contributing to the concurrent rate limit and daily limit.
  • Batch queries - These queries are queued and started as soon as idle resources in the BigQuery shared resource pool become available, generally within a few minutes.

There’s a schema that describes the column names, data types, and metadata for each table in BigQuery. Schemas are established at the table level and offer structure to the data.

  • Data types can be simple data types, such as integers, or more complex, such as ARRAY and STRUCT, for nested and repeated values.
  • Column modes can be NULLABLE, REQUIRED, or REPEATED.

When importing data or creating an empty table, you must define the table schema. You can also use schema auto-detection when loading data from self-describing source formats like Avro, Parquet, ORC, Cloud Firestore, or Cloud Datastore export files. You can define schemas manually or through a JSON file.

In addition to standard aggregate and analytic functions, BigQuery offers a range of functions and operators for tasks such as text manipulation, date/time calculations, mathematical operations, and JSON extraction. You can find the complete list of available functions in the BigQuery SQL function reference.

BigQuery ML

Machine Learning (ML) on large datasets can be complex due to the need for extensive programming and knowledge of ML frameworks. This complexity limits the number of people within a company who can develop solutions and excludes data analysts who understand the data but need more ML expertise.

BigQuery ML addresses the above issue by allowing you to use machine learning through standard SQL tools and skills. With BigQuery ML, you can build and evaluate ML models within BigQuery without needing to transfer data or wait for assistance from a data science team. There is no need to program an ML solution using Python or Java.

Wrapping up

There are several reasons why companies of all sizes widely adopt BigQuery:

  • Easiness of use - BigQuery excels at making your life as a user easier compared to similar cloud data warehouses. Its serverless architecture allows you to get up and running with minimal configuration. Its SQL interface is familiar to many data analysts and scientists, making it easy to learn and use. Its REST API makes it straightforward to interact with other systems. 
  • Scalability - BigQuery is designed to handle enormous amounts of data with high performance, making it an excellent choice for data-intensive applications.
  • Cost-effectiveness - BigQuery uses a pay-as-you-go pricing model, which means you only pay for the resources you use. This can be more cost-effective than maintaining on-premises data warehouses or using other cloud platforms that require upfront investments or long-term commitments.
  • Integration with other Google Cloud Platform services - BigQuery integrates seamlessly with other Google Cloud Platform (GCP) services and tools, allowing you to build end-to-end data pipelines and workflows.
  • Machine learning capabilities - BigQuery offers built-in support for machine learning through its BigQuery ML feature, which enables users to build and deploy machine learning models directly within the BigQuery environment.

The features described above make BigQuery an attractive option for many organizations. In fact, at Airbyte, we selected BigQuery as the data warehouse in our analytics data stack

I hope this blog post has given you a solid understanding of using BigQuery and how it can help you unlock insights from your data. To learn more, check out the article about Google BigQuery Sandbox for a comprehensive overview and insights.

The data movement infrastructure for the modern data teams.
Try a 14-day free trial