In the first part of this series on how to optimize BigQuery for faster (and less costly) querying, we looked at how BigQuery separates compute and storage. We then looked at 4 major ways you can optimize your storage. We also saw the tools you can use to understand how your queries are performing.
Today, we go a step further in the optimization journey: writing your queries in the most optimal way. We'll go through several techniques you should consider when writing your queries.
Get only what you need
SELECT only the columns you need
As we saw previously, BigQuery does a full scan of all the data requested, i.e. every column referenced in the query. This scanning has a direct impact on both the speed and the cost of the query.
Therefore, you should reduce the amount of data (columns) scanned as much as possible.
For cases where you want all but a few columns, you don't have to type out every column name; instead, use the EXCEPT keyword to exclude the columns you don't need.
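For example, on a hypothetical trips table (the table and column names below are illustrative), you could read everything except two wide geometry columns:

```sql
-- Scan every column except the two wide ones we don't need
SELECT
  * EXCEPT (pickup_location_geom, dropoff_location_geom)
FROM
  `my-project.my_dataset.trips`;
```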
If you just want to explore the data (which usually means looking at all the columns), use the preview functionality available on the BigQuery console. It’s completely free!
As an extra safety measure, you can set the maximum billable bytes for a specific query to prevent queries that would scan entire datasets from running at all. A query that would exceed this limit fails before it runs, and you aren't billed for it.
Note: It’s currently not possible to set a limit at the project level.
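For individual queries run through the bq CLI, this cap can be set with the --maximum_bytes_billed flag; the query and the limit below are illustrative:

```shell
# Fail the query (without billing) if it would scan more than ~1 GB
bq query \
  --use_legacy_sql=false \
  --maximum_bytes_billed=1000000000 \
  'SELECT trip_id FROM `my-project.my_dataset.trips`'
```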
Return only the rows you need
Limiting the rows you return won't reduce the amount of data scanned. However, the amount of data returned by your query will affect its speed and performance.
Especially given the parallelised nature of BigQuery's processing, the more rows of data you require, the more marshaling and shuffling of data is required between the different processing nodes.
The first query, returning every row, runs in 4 secs; the second, returning only the rows needed, runs in 1 sec.
Notice that the second query scanned the same amount of data, i.e. it still read all the columns, but shuffled less data and ran in a quarter of the time the first query took.
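A sketch of what such a comparison could look like (the table name is illustrative); both queries read the same columns, but the second returns far fewer rows:

```sql
-- Query 1: returns every row, so lots of data is shuffled to the output
SELECT * FROM `my-project.my_dataset.trips`;

-- Query 2: scans the same columns but returns only 1,000 rows
SELECT * FROM `my-project.my_dataset.trips` LIMIT 1000;
```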
If your use case always requires returning all the rows with no filters or aggregations, an OLTP database is probably what you need instead of an OLAP database like BigQuery.
Avoid join anti-patterns
Denormalization is a better strategy than trying to optimize joins, but for cases where you do need to join tables, keep the following in mind.
Understand BigQuery’s join strategies
Depending on the size of the tables being joined, BigQuery will execute one of the following strategies:
Hash join: where both tables are large, BigQuery shuffles and splits each table so that records with matching join keys end up in the same (parallel) worker. Each worker joins its slice of the data, and the workers' results are then combined for the final output.
Broadcast join: where one of the tables is small enough to be processed by a single worker, the larger table is split and sent to different workers while the smaller table is sent in full to each of the workers processing the larger table.
With this understanding, there are a couple of things you can do to ensure the most efficient execution:
- Reduce the size of your table(s) before joining
This can be achieved by:
- Aggregating the data
- Filtering the data
- Where the size of the larger table cannot be reduced, you may explicitly place it on the left side of the join. For a multi-table join, list the tables in descending order of size.
Note: As BigQuery's optimizer has matured, this step is more of a best practice than a requirement; the optimizer will more often than not decide on the best ordering of the tables itself.
- For scenarios where a hash join will happen, try clustering your tables beforehand. With the correlated values already being in the same blocks, there’s less movement and shuffling of data.
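Putting these together, a hedged sketch (all names illustrative): the larger table sits on the left, and the smaller table is filtered down before the join:

```sql
SELECT
  t.trip_id,
  z.zone_name
FROM
  `my-project.my_dataset.trips` AS t  -- largest table first
JOIN (
  -- Reduce the smaller table before joining
  SELECT zone_id, zone_name
  FROM `my-project.my_dataset.zones`
  WHERE borough = 'Manhattan'
) AS z
ON t.pickup_zone_id = z.zone_id;
```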
Avoid self joins. Use window functions
A self-join is a situation where a table (or CTE) is joined with itself.
Say you want to get the total passengers ferried every day by the New York yellow cab companies Creative Mobile Technologies and VeriFone Inc., and then you want to see the day-on-day change in this number.
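A self-join version of that query might look like the following sketch (the table and column names are assumptions, not the article's original query):

```sql
WITH daily AS (
  SELECT
    DATE(pickup_datetime) AS trip_date,
    SUM(passenger_count) AS total_passengers
  FROM `my-project.my_dataset.yellow_trips`
  GROUP BY trip_date
)
-- Self-join: each day's row joined to the previous day's row
SELECT
  today.trip_date,
  today.total_passengers,
  today.total_passengers - yesterday.total_passengers AS day_on_day_change
FROM daily AS today
JOIN daily AS yesterday
  ON today.trip_date = DATE_ADD(yesterday.trip_date, INTERVAL 1 DAY);
```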
The self-join version consumes 1 min of slot time and runs in 2 secs.
Window functions have been part of standard SQL since SQL:2003 and are available in BigQuery. They are much faster and far better optimized than a self-join. Also recall from our denormalization tip that joins are already discouraged.
For our example, we can replace the self-join with the LAG() window function.
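A sketch of the rewrite using LAG() (again, the table and column names are illustrative):

```sql
WITH daily AS (
  SELECT
    DATE(pickup_datetime) AS trip_date,
    SUM(passenger_count) AS total_passengers
  FROM `my-project.my_dataset.yellow_trips`
  GROUP BY trip_date
)
SELECT
  trip_date,
  total_passengers,
  -- LAG() fetches the previous day's value without a second pass over the data
  total_passengers - LAG(total_passengers) OVER (ORDER BY trip_date)
    AS day_on_day_change
FROM daily;
```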
The window-function version consumes 48 secs of slot time and runs in 1 sec.
From the two queries, you can see that both scan the same amount of data and return the same number of rows… but the second one runs at least twice as fast while moving around far less data.
Avoid cross joins
Cross joins produce the Cartesian product of two tables: for each row in one table, a row is generated for every row in the second table.
Say you have a transactional system where users may or may not transact on a given day. You might want records (even if they're blank) for all the days since each user started using the system. This is a form of the gaps-and-islands problem.
One way to go about this would be to generate a table with each of your user IDs and for each user, all the dates in the specified date range, like so…
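One way to sketch this in BigQuery (the table and column names are illustrative) is to cross join the distinct user IDs with a generated array of dates:

```sql
SELECT
  u.user_id,
  d AS activity_date
FROM (
  SELECT DISTINCT user_id
  FROM `my-project.my_dataset.transactions`
) AS u
-- One output row per (user, date) combination
CROSS JOIN UNNEST(GENERATE_DATE_ARRAY('2023-01-01', '2023-12-31')) AS d;
```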
This is known as a cross join. Notice that it generates an output larger than its inputs: the number of rows in one table multiplied by the number of rows in the other.
As discussed above about returning only the rows you need, avoid operations like these that return more rows than you actually need.
It might be a hard nut to crack, but try and rework your queries to avoid cross joins.
Choose your operators carefully
Operators (functions) on BigQuery are not created equal. Some operators are resource-intensive and may not be taking full advantage of BigQuery’s optimizations.
Where a cheaper alternative exists, opt for it:
- Use LIKE instead of REGEXP_CONTAINS()
REGEXP_CONTAINS() offers the full power of regular expressions, which comes with high resource usage.
More often than not, LIKE will get you the same results, but faster.
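For a simple substring match, the two queries below return the same result, but the LIKE version avoids the regular-expression engine (names illustrative):

```sql
-- Heavier: fires up the regular-expression engine
SELECT COUNT(*)
FROM `my-project.my_dataset.trips`
WHERE REGEXP_CONTAINS(vendor_name, 'VeriFone');

-- Lighter: simple pattern matching
SELECT COUNT(*)
FROM `my-project.my_dataset.trips`
WHERE vendor_name LIKE '%VeriFone%';
```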
- Use approximate aggregation functions
For situations where the exactness of your results is not important, consider approximate aggregate functions.
These are functions that use statistical methods to approximate the results of their exact counterparts (to around 97% accuracy). They run faster and consume fewer resources. The approximate aggregate functions available include APPROX_COUNT_DISTINCT(), APPROX_QUANTILES(), APPROX_TOP_COUNT() and APPROX_TOP_SUM().
These functions achieve their better performance by estimating the metric with statistical sketches (for example, HyperLogLog++ in the case of APPROX_COUNT_DISTINCT()) rather than computing it exactly.
Take care, though: approximate functions aren't suitable for all use cases.
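For example, counting distinct users exactly versus approximately (the table name is illustrative):

```sql
-- Exact, but more expensive
SELECT COUNT(DISTINCT user_id)
FROM `my-project.my_dataset.transactions`;

-- Approximate, faster and cheaper
SELECT APPROX_COUNT_DISTINCT(user_id)
FROM `my-project.my_dataset.transactions`;
```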
- Prefer EXISTS() over COUNT()
When you’re only concerned about the occurrence of a record in your dataset (and not the number of occurrences), use EXISTS().
The COUNT() operator will traverse the entire dataset to get the total number of occurrences before returning. EXISTS(), on the other hand, returns as soon as a matching record is found.
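A sketch of the difference (names illustrative):

```sql
-- Counts every matching row before returning
SELECT COUNT(*) > 0
FROM `my-project.my_dataset.transactions`
WHERE user_id = 12345;

-- Returns as soon as one matching row is found
SELECT EXISTS (
  SELECT 1
  FROM `my-project.my_dataset.transactions`
  WHERE user_id = 12345
);
```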
- Use SQL for user-defined functions
When writing user-defined functions (UDFs), prefer SQL over JavaScript. JavaScript UDFs run in a separate subprocess and are generally slower than their SQL equivalents.
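A hedged example of the same trivial function in both flavours; the function names and logic are illustrative:

```sql
-- SQL UDF: evaluated inside the query engine
CREATE TEMP FUNCTION total_fare_sql(fare FLOAT64, tip FLOAT64) AS (
  fare + tip
);

-- JavaScript UDF: runs in a separate subprocess, generally slower
CREATE TEMP FUNCTION total_fare_js(fare FLOAT64, tip FLOAT64)
RETURNS FLOAT64
LANGUAGE js AS r"""
  return fare + tip;
""";

SELECT total_fare_sql(10.0, 2.0), total_fare_js(10.0, 2.0);
```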
- Use ORDER BY only where absolutely necessary
When sorting your data, keep in mind that the data first has to be marshaled from the parallel workers and then sorted, since all the records have to be in the same place to be compared. This therefore means that:
- Adding a sorting step in the middle of your query adds an extra shuffling step that will slow your query.
Note: In some scenarios, BigQuery's optimizer is intelligent enough to execute only one ordering step at the end of the query.
- You should sort only the important subsets of your data i.e. reduce your data before sorting
- If you absolutely have to sort your output, put this operation at the end (outermost level) of your query.
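Putting these together (names illustrative): reduce the data first, then sort once at the outermost level:

```sql
SELECT
  trip_date,
  total_passengers
FROM (
  -- Aggregate first to shrink the data
  SELECT
    DATE(pickup_datetime) AS trip_date,
    SUM(passenger_count) AS total_passengers
  FROM `my-project.my_dataset.yellow_trips`
  GROUP BY trip_date
)
ORDER BY total_passengers DESC  -- the only sort, applied last
LIMIT 10;
```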
Avoid non-cacheable results
Functions such as CURRENT_TIMESTAMP() (NOW() in legacy SQL) force BigQuery to evaluate the query every time it's called, since their results cannot be cached. These are known as non-deterministic functions.
To take full advantage of caching capabilities, go for operators that return cacheable results.
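A sketch of the difference (names illustrative); the second query can be served from cache on repeated runs:

```sql
-- Not cacheable: CURRENT_TIMESTAMP() is non-deterministic
SELECT COUNT(*)
FROM `my-project.my_dataset.transactions`
WHERE created_at > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY);

-- Cacheable: the cutoff is a literal, e.g. computed by the calling application
SELECT COUNT(*)
FROM `my-project.my_dataset.transactions`
WHERE created_at > TIMESTAMP '2024-01-01 00:00:00';
```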
Avoid CASTing on the fly
You may be storing your data in fields that you constantly need to transform before using in your queries.
These constant transformations introduce unnecessary compute requirements that make your queries slower and costlier. A common example is timestamps stored as strings in non-standard formats or numbers stored as strings.
In your ETL/ELT pipelines where BigQuery is the destination, ensure that the data is stored in the correct/expected formats to avoid CASTing in every query. You could also create a materialized view with the transformed data!
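For example, if timestamps arrive as strings in a non-standard format, pay for the parse once during loading instead of in every query (all names and the format string are illustrative):

```sql
-- Every query pays for this parse
SELECT
  PARSE_TIMESTAMP('%d/%m/%Y %H:%M:%S', raw_created_at) AS created_at
FROM `my-project.my_dataset.raw_transactions`;

-- Better: parse once in the ELT step that builds the final table
CREATE OR REPLACE TABLE `my-project.my_dataset.transactions` AS
SELECT
  PARSE_TIMESTAMP('%d/%m/%Y %H:%M:%S', raw_created_at) AS created_at,
  * EXCEPT (raw_created_at)
FROM `my-project.my_dataset.raw_transactions`;
```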
Other techniques to consider
Some other techniques you should consider include:
- Filter and join your data on non-string columns where possible
- Avoid referencing CTEs multiple times; BigQuery materializes the results of recursive CTEs only, while non-recursive CTEs are evaluated every time they're referenced
- Use BigQuery’s BI engine which accelerates your queries using intelligent caching mechanisms
And finally, when in doubt, follow SQL best practices.
BigQuery is a petabyte-scale analytics DB utilizing parallel processing to achieve fast processing of massive datasets. To optimize your workloads on BigQuery, you can:
Optimize your storage by:
- Partitioning your tables
- Clustering your tables
- Pre-aggregating your data into materialized views
- Denormalizing your data
Optimize your queries by:
- Selecting and returning only what you need
- Avoiding join anti-patterns
- Choosing your operators carefully
- Following SQL best practices
We hope you're now equipped to build your petabyte-scale data warehouse in BigQuery in the most optimized, cost-effective way.
In an upcoming article we’ll dig deeper into the concepts of denormalization and why it’s not the devil you might think it is.