Query optimization is highly specific to both the query at hand and the nature of the data. It is difficult to make general recommendations to optimize an arbitrary query. With this caveat, the following paragraphs share some basic query-optimization principles that can provide a useful starting point. They are not rules and they may not apply in every case.
💡Suggested Read : How to Read PostgreSQL Query Plans
Database Design A well designed database is the starting point to have efficient queries. Try to use a normalized design. Leaner tables are easier to understand and faster to process. Instead of having tables with too many columns, use views and materialized views to combine tables to offer the same ease of querying. A completely normalized database, however, can be difficult to maintain. So, it is essential to balance normalization with ease of management and maintenance.
Use the Right Datatype Use the smallest datatype possible. For example, if a column uses only small numbers, consider using SMALLINT instead of the standard INT. Similarly, SMALLSERIAL is a good alternative to SERIAL when dealing with small datasets.
Integer based datatypes are always faster than string-based datatypes. If an ID column consisting of numbers is stored as string, it can help to change its datatype to numerical.
Formatting clauses can then be used to prepend zeros to smaller numbers, to ensure that all values are presented with the same number of digits.
💡Suggested Read : PostgreSQL Query Plans for Sorting Data
Index Appropriately Having the right index can help speed up scanning, joining, grouping, sorting, and other operations.
When a scan has to return a small fraction of the total rows in a table, indexing is very helpful. If there is an index on all the columns to be returned by a scan, the planner can use an index-only scan, which is very efficient. Hence, it is better to extract only specific rows instead of SELECT *. This makes it easier to create relevant indices. It is also important to understand that indexing is not a panacea. Overindexing can slow things down - keeping all the indexes updated is resource-intensive. For the same reason, catch-all indexes with all the columns of a table are not very helpful either. It is also advisable to drop unused or scarcely used indexes.
After adding a new index to a table, run VACUUM ANALYZE on the table - this helps the planner pick up the updated statistics for the table and index.
💡Suggested Read : PostgreSQL Query Plans for Reading Tables
Vacuuming Since query planning is based on table statistics, it is essential that the planner has access to the latest statistics. VACUUM clears dead tuples and ANALYZE updates tables statistics. Check the vacuuming strategy to ensure statistics are updated frequently. Also consider vacuuming and analyzing manually when needed.
Use Materialized Views If a frequently run query has expensive subqueries, such as complex joins or views, consider making a materialized view to replicate that subquery. Later queries can use the materialized view, instead of repeatedly executing the complex sub-query. Furthermore, materialized views can also be indexed like regular tables.
Optimize your PostgreSQL queries and streamline data replication with Airbyte
Talk to our team→
Avoid DISTINCT Using DISTINCT is often a suboptimal choice. There are usually ways to rewrite a query with more restrictive filtering conditions to eliminate the need for the DISTINCT operation. When this is not the case, it is often due to poor database design.
Use Prepared Statements If the same query is repeatedly run with different values in the filtering (WHERE) conditions, consider making a prepared statement (also called a parametrized query) from that query. Prepared statements are also useful in data insertion queries. Because prepared statements are preprocessed, the planner is able to cache the execution plan. This is faster because the same query doesn’t need to be repeatedly parsed and planned.
Re-write Complex Queries For complex queries, it is usually possible to express the underlying logic in more than one way. For example, it is always better to run an expensive operation, like sorting or joining, on a smaller dataset. Thus, it can be cheaper to impose a filtering condition before the expensive operation, rather than after it.
Another general principle is to minimize the amount of work the database needs to do. For example, limiting the number of output rows and columns can lead to the planner using a more efficient plan.
Select Fewer Columns SELECT * is usually the least performant approach. It is advisable to select only the desired columns. This makes it easier to use indexes. It also makes it faster to read the data from the disk.
Equality Based Joins Nested Loop Join, which is often the least efficient joining method, is also the only method that can handle inequality based join conditions. Thus, it is advisable to rewrite JOIN clauses using equality based conditions. This can lead to more efficient join plans.
💡Suggested Read : PostgreSQL Query Plans for Joining Tables
Study the Query Plan In general, a good starting point to optimizing a slow query is to use the EXPLAIN command on the query. Study the query plan to find the nodes with the highest cost.
Use EXPLAIN ANALYZE Sometimes, the planner can be wrong. The table statistics (which are based on random sampling) may not reflect the actual distribution of data in the table. Assigning costs to query plans and nodes is only an estimate, and may not reflect the true cost of executing the query. Hence, sometimes, execution times are not in line with estimated costs.
So, use EXPLAIN ANALYZE to find the nodes with the longest execution time and also to determine which variant of a query runs faster.
Use EXPLAIN (ANALYZE, BUFFERS) Disk-based I/O is a common contributor to making a query slow. The more data the executor has to read from disk, the slower the query runs. Conversely, the more data that is read from cache, the faster the execution. Using EXPLAIN with the BUFFERS parameter shows the amount of data read from cache and from disk. The results of commonly used operations are cached in memory. Older and less frequently accessed results are discarded. It is possible to tweak the “shared_buffers” parameter to increase the memory allocated to buffers.
💡Suggested Read : PostgreSQL Query Plans for Aggregating Data
Increase Working Memory Working Memory is the memory available to PostgreSQL to temporarily cache the interim results of operations. For example, to implement quicksort, it needs to hold the entire dataset in memory and sort it.
If the available memory is insufficient to hold the entire dataset, the data has to be processed in batches. The executor fetches one batch from disk, processes it and writes the result. Then it fetches another batch, and so on. Then the results of batches are combined. This slows down the execution time. When many batches have to be used, the impact can be significant.
Increasing the value of the ‘work_mem’ parameter is a common optimization. It is often set to a higher value like 64 MB instead of the default 4 MB.
Note The working memory, as defined by the ‘work_mem’ parameter, is the maximum amount of memory that can be occupied by one process . There can be many processes running simultaneously while processing a given query. If there are multiple connections, there can be multiple queries being processed simultaneously. Thus, the total amount of working memory is often a multiple of the ‘work_mem’ value. Increasing this parameter to too high a value can result in out of memory errors which can hang or crash a busy server.
Caution Increasing work_mem, though reliable when done carefully, is not a bulletproof solution. In rare cases, the planner’s estimate can be inaccurate and it can choose the wrong plan based on the increased memory availability. This can make the query run slower than it did with lower work_mem. So, test thoroughly after changing system settings.
💡Suggested Read: PostgreSQL Use Cases