
3 Techniques to Write Highly Optimized Queries For BigQuery

Kelvin Gakuo
March 23, 2023
15 min read

In the first part of this series on how to optimize BigQuery for faster (and less costly) querying, we looked at how BigQuery separates compute and storage. We then looked at 4 major ways you can optimize your storage. We also saw the tools you can use to understand how your queries are performing.

Today, we go a step further in the optimization journey: writing your queries in the most optimal way. We’ll go through several techniques you should consider when writing your queries.

Get only what you need

SELECT only the columns you need

As we saw previously, BigQuery does a full scan of all the data requested, i.e. every column you specify. This scanning has an impact on both the speed and the cost of the query.

Therefore, you should reduce the amount of data (columns) scanned as much as possible.

SELECT * FROM `bigquery-public-data.bitcoin_blockchain.transactions`; 

Scans 587.14GB

SELECT timestamp, transaction_id FROM `bigquery-public-data.bitcoin_blockchain.transactions`; 

Scans 23.45GB

For cases where you want to exclude just a few columns, you don’t have to type out all the remaining column names; instead, use the EXCEPT keyword to exclude the columns you don’t need.

SELECT * EXCEPT (inputs, outputs) FROM `bigquery-public-data.bitcoin_blockchain.transactions`;

If you just want to explore the data (which usually means looking at all the columns), use the preview functionality available on the BigQuery console. It’s completely free!

As an extra safety measure, you can set the maximum billable bytes for a specific query to prevent queries that would scan entire datasets. If a query would exceed this limit, it fails before running (and you aren’t charged).

Note: It’s currently not possible to set a limit at the project level.
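For example, with the bq CLI the cap is a per-query flag (the 10GB limit below is just an illustration); the same setting is also available in the console’s query settings and in the client libraries:

bq query --use_legacy_sql=false --maximum_bytes_billed=10000000000 \
'SELECT timestamp, transaction_id FROM `bigquery-public-data.bitcoin_blockchain.transactions`'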

Return only the rows you need

BigQuery doesn’t charge you based on the amount of data your query returns.

However, the amount of data returned by your query will affect the speed/performance of the query. 

Because of the parallelized nature of BigQuery’s processing, the more rows of data you require, the more marshaling and shuffling of data is required between the different processing nodes.

SELECT * FROM `bigquery-public-data.census_bureau_usa.population_by_zip_2010`;

Scans: 81.21MB
Shuffles: 173.19MB
Runs in: 4secs

SELECT * FROM `bigquery-public-data.census_bureau_usa.population_by_zip_2010` WHERE gender IS NOT NULL;

Scans: 81.21MB
Shuffles: 2.11MB
Runs in: 1sec

Notice that the second query scanned the same amount of data (it still scanned all the columns) but shuffled far less data and ran in a quarter of the time the first query took.

If your use case always requires you to return all the rows with no filters or aggregations, an OLTP database is probably what you need instead of an OLAP database like BigQuery.

Avoid join anti-patterns

Denormalization is a better strategy than trying to optimize joins, but for cases where you do need to join tables, keep the following in mind.

Understand BigQuery’s join strategies

Depending on the size of the tables being joined, BigQuery will execute one of the following strategies:

Shuffle/hash join

Where both tables are large, BigQuery shuffles and splits the tables so that records with matching (join) keys end up in the same (parallel) worker. A join is done in each worker, and the workers’ results are then combined for the final output.

Broadcast join

Where one of the tables is small enough to be processed by a single worker, the larger table is split and sent to different workers while the smaller table is shared in full to each of the workers processing the larger table.

With this understanding, there are a couple of things you can do to ensure the most efficient execution:

  1. Reduce the size of your table(s) before joining

This can be achieved by (see the sketch after this list):

  • Aggregating the data
  • Filtering the data

  2. Where the size of the larger table cannot be reduced, you may explicitly place it on the left side of the join. For a join across multiple tables, list the tables in descending order of size.

Note: As BigQuery’s optimizer has matured, this step has become more of a best practice; the optimizer will more often than not decide on the best ordering of tables itself.

  3. For scenarios where a hash join will happen, try clustering your tables beforehand. With correlated values already sitting in the same blocks, there’s less movement and shuffling of data.
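As a rough sketch of point 1, the query below aggregates a (hypothetical) orders table down to one row per customer before joining it to a customers table, instead of joining the raw, row-level data:

WITH order_totals AS ( -- Aggregate first so the join input is one row per customer
   SELECT
       customer_id,
       SUM(order_value) AS total_order_value
   FROM orders
   WHERE order_date >= '2023-01-01' -- Filter out rows you don't need before the join
   GROUP BY customer_id
)
SELECT
   c.customer_id,
   c.country,
   ot.total_order_value
FROM customers c
JOIN order_totals ot ON c.customer_id = ot.customer_id;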

Avoid self joins. Use window functions

A self-join is a situation where a table (or CTE) is joined with itself. 

Say you want to get the total passengers ferried every day by the New York yellow cab companies Creative Mobile Technologies and VeriFone Inc., and then you want to see the day-on-day change in this number. 

WITH daily_sum AS ( -- Group on the date, then sum to get total passengers per day
 SELECT
   vendor_id,
   DATE_TRUNC(pickup_datetime, DAY) AS the_date,
   SUM(passenger_count) AS passengers_ferried
 FROM `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2022`
 GROUP BY 1, 2
)
SELECT
 ds.*,
 dss.passengers_ferried AS passengers_ferried_prev_day,
 ds.passengers_ferried - dss.passengers_ferried AS increase_in_passengers
FROM daily_sum ds
LEFT JOIN daily_sum dss ON ds.the_date = DATE_ADD(dss.the_date, INTERVAL 1 DAY) -- Self-join the CTE on vendor ID and date
 AND ds.vendor_id = dss.vendor_id
ORDER BY the_date ASC;

Scans: 467.1MB

Shuffles: 263.74KB

Consumes: 1min slot time

Runs in: 2sec

Window functions have been part of standard SQL since SQL:2003 and are available in BigQuery. They are much faster and far more optimized than a self-join. Also recall from our denormalization tip that joins are already discouraged.

For our example, we can replace the self-join with the LAG() window function:

WITH daily_sum AS ( -- Group on the date, then sum to get total passengers per day
 SELECT
   vendor_id,
   DATE_TRUNC(pickup_datetime, DAY) AS the_date,
   SUM(passenger_count) AS passengers_ferried
 FROM `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2022`
 GROUP BY 1, 2
)
SELECT
 *,
 LAG(passengers_ferried) OVER(PARTITION BY vendor_id ORDER BY the_date ASC) AS passengers_ferried_prev_day, -- Use LAG() to get the previous day's passengers
 passengers_ferried - LAG(passengers_ferried) OVER(PARTITION BY vendor_id ORDER BY the_date ASC) AS increase_in_passengers
FROM daily_sum;

Scans: 467.1MB

Shuffles: 128.71KB

Consumes: 48sec slot time

Runs in: 1sec

From the two queries, you can see that both scan the same amount of data and return the same rows, but the second one runs at least twice as fast while moving around far less data.

Avoid cross joins

Cross joins produce the Cartesian product of two tables: for each row in one table, a new row is generated for every row in the second table.

Say you have a transactional system where users may or may not transact on a given day. You might want to have records (even if they’re blank) for all the days since the client started using the system. This is a form of the gaps and islands problem.

One way to go about this would be to generate a table with each of your user IDs and for each user, all the dates in the specified date range, like so…

WITH series_of_dates AS ( -- Generate series with each day since Dec 2020.
   SELECT
       day
   FROM UNNEST(
       GENERATE_DATE_ARRAY(
           '2020-12-01',
           CURRENT_DATE(),
           INTERVAL 1 DAY
       )
   ) day
),
all_users AS ( -- Get list of all users
   SELECT
       user_id
   FROM users
),
cross_joined AS ( -- Each user plus all dates
   SELECT
       day AS cr_day,
       user_id AS cr_id
   FROM series_of_dates, all_users
)
SELECT * FROM cross_joined;

This is known as a cross join. Notice that it generates an output larger than either input; the number of rows produced is the product of the row counts of the two tables.

As described above about keeping an eye on how much data your query returns, avoid operations like this that produce far more rows than you actually need.

It can be a hard nut to crack, but try to rework your queries to avoid cross joins, or at least to shrink their inputs so the resulting product stays small.
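For instance, one way to shrink the output is to generate the date range per user instead of crossing every user with one global date series. The sketch below assumes the users table has a signup_date DATE column; the correlated UNNEST is still a cross join under the hood, but it only produces rows each user can actually have:

WITH per_user_days AS ( -- One row per user per day, starting at that user's signup date
   SELECT
       u.user_id,
       day
   FROM users u,
   UNNEST(
       GENERATE_DATE_ARRAY(u.signup_date, CURRENT_DATE(), INTERVAL 1 DAY)
   ) AS day
)
SELECT * FROM per_user_days;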

Choose your operators carefully

Operators (functions) on BigQuery are not created equal. Some are resource-intensive and may not take full advantage of BigQuery’s optimizations.

Where possible, opt for the lighter alternatives:

  1. Use LIKE instead of REGEXP_CONTAINS()

REGEXP_CONTAINS() offers the full power of regular expressions, which comes with high resource usage.
More often than not, LIKE will get you the same results, but faster.
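For a plain prefix or substring match on the transactions table from earlier, both queries below return the same count, but the LIKE version skips the regular-expression engine:

SELECT COUNT(*) FROM `bigquery-public-data.bitcoin_blockchain.transactions` WHERE REGEXP_CONTAINS(transaction_id, r'^0000'); -- Regular expression match

SELECT COUNT(*) FROM `bigquery-public-data.bitcoin_blockchain.transactions` WHERE transaction_id LIKE '0000%'; -- Simple prefix match, usually faster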

  2. Use approximate aggregation functions

For situations where the exactness of your results is not critical, consider approximate aggregate functions.

These functions use statistical estimation to approximate the results of their exact counterparts, typically to within a few percent of the exact answer. They run faster and consume fewer resources. The approximate aggregate functions available include:

  • APPROX_COUNT_DISTINCT()
  • APPROX_QUANTILES() 
  • APPROX_TOP_COUNT()
  • APPROX_TOP_SUM() 

These functions get their speed from sketch-based statistical estimation (for example, APPROX_COUNT_DISTINCT() uses HyperLogLog++) rather than computing the exact answer over every value.

Be careful, though: approximate functions aren’t suitable for all use cases.
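For example, on the transactions table from earlier, the approximate version below estimates the number of distinct transaction IDs without tracking every single value exactly:

SELECT COUNT(DISTINCT transaction_id) FROM `bigquery-public-data.bitcoin_blockchain.transactions`; -- Exact, but heavier

SELECT APPROX_COUNT_DISTINCT(transaction_id) FROM `bigquery-public-data.bitcoin_blockchain.transactions`; -- Estimated, faster and lighter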

  3. Prefer EXISTS over COUNT()

When you’re only concerned with whether a record occurs in your dataset (and not the number of occurrences), use EXISTS.

The COUNT() operator traverses the entire dataset to get the total number of occurrences before returning. EXISTS, on the other hand, can return as soon as a matching record is found.
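Reusing the census table from earlier, both queries below answer whether at least one row has a gender value, but the EXISTS form can stop at the first match:

SELECT COUNT(*) > 0 AS has_match FROM `bigquery-public-data.census_bureau_usa.population_by_zip_2010` WHERE gender IS NOT NULL; -- Counts every matching row first

SELECT EXISTS (
 SELECT 1 FROM `bigquery-public-data.census_bureau_usa.population_by_zip_2010` WHERE gender IS NOT NULL
) AS has_match; -- Can return as soon as one matching row is found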

  4. Use SQL for user-defined functions

For scenarios where you need to create your own functions, implement them in SQL instead of JavaScript.

JavaScript UDFs require BigQuery to create and manage a separate process to execute the JavaScript code, which increases your query’s runtime and compute requirements.

JavaScript functions are also given limited processing capacity by BigQuery, so they have a higher chance of failing if not thoroughly optimized.
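As a small sketch (the function name and logic are just illustrative), the same kind of helper can be written as a SQL UDF instead of a JavaScript one:

CREATE TEMP FUNCTION fare_per_passenger(total_fare FLOAT64, passengers INT64) AS (
 SAFE_DIVIDE(total_fare, passengers) -- Plain SQL expression: no JavaScript runtime needed
);

SELECT fare_per_passenger(42.5, 3) AS example_value;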

  5. ORDER BY only where absolutely necessary

When sorting your data, keep in mind that the data first has to be marshaled from the parallel workers and then sorted, since all the records have to be in the same place to be compared. This means that:

  • Adding a sorting step in the middle of your query adds an extra shuffling step that will slow your query down.
    Note: In some scenarios, BigQuery’s optimizer is intelligent enough to execute only one ordering step at the end of the query
  • You should sort only the important subset of your data, i.e. reduce your data before sorting
  • If you absolutely have to sort your output, put this operation at the end (outermost level) of your query, as in the sketch after this list
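For example, aggregating first and sorting only the small final result keeps the expensive sort away from the raw, row-level data:

SELECT
 vendor_id,
 SUM(passenger_count) AS total_passengers
FROM `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2022`
GROUP BY vendor_id
ORDER BY total_passengers DESC; -- Sort happens last, on just a handful of aggregated rows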

Avoid non-cacheable results

Operators such as CURRENT_TIMESTAMP() and NOW() force BigQuery to evaluate the query every time it’s called, since their results cannot be cached. These are known as non-deterministic operators.

To take full advantage of caching capabilities, go for operators that return cacheable results.
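A small sketch of the difference: the first query below can never be served from the cache because of CURRENT_DATE(), while the second, with a literal date supplied by your application or scheduler, can be on repeat runs:

SELECT SUM(passenger_count) FROM `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2022` WHERE DATE(pickup_datetime) = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY); -- Non-deterministic, not cacheable

SELECT SUM(passenger_count) FROM `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2022` WHERE DATE(pickup_datetime) = DATE '2022-06-01'; -- Deterministic, cacheable on repeat runs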

Avoid CASTing on the fly

You may be storing your data in fields that you constantly need to transform before using in your queries.

These constant transformations introduce unnecessary compute requirements that make your queries slower and costlier. A common example is timestamps stored as strings in non-standard formats or numbers stored as strings.

In your ETL/ELT pipelines where BigQuery is the destination, ensure that the data is stored in the correct/expected formats to avoid CASTing in every query. You could also create a materialized view with the transformed data!
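As an illustration (the events table and its string event_time column are hypothetical), this is the kind of per-query parsing you want to push upstream into your pipeline or into a materialized view:

SELECT
 PARSE_TIMESTAMP('%d/%m/%Y %H:%M:%S', event_time) AS event_ts, -- Paid for on every single query
 user_id
FROM events;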

Other techniques to consider

Some other techniques you should consider include:

  • Filter and join your data on non-string columns where possible
  • Avoid referencing CTEs multiple times. BigQuery materializes results of recursive CTEs only. Non-recursive CTEs are evaluated every time they’re referenced
  • Use BigQuery’s BI Engine, which accelerates your queries using intelligent caching mechanisms

And finally, when in doubt, follow SQL best practices.

Conclusion

BigQuery is a petabyte-scale analytics DB utilizing parallel processing to achieve fast processing of massive datasets. To optimize your workloads on BigQuery, you can:

Optimize your storage by:

  • Partitioning your tables
  • Clustering your tables
  • Pre-aggregating your data into materialized views
  • Denormalizing your data

Optimize your queries by:

  • Selecting and returning only what you need
  • Avoiding join anti-patterns
  • Choosing your operators carefully
  • Following SQL best practices

We hope you’re now equipped to build your petabyte-scale data warehouse in BigQuery in the most optimized, cost-effective way.

In an upcoming article we’ll dig deeper into the concepts of denormalization and why it’s not the devil you might think it is.
