PostgreSQL internally represents every query in a tree structure - called the query tree. It calls internal functions to execute each part of the query tree. This is represented by the query plan tree - each node of the query plan tree is an internal method. When the query involves joining one or more tables, the query plan uses Join Nodes to do the job. This article does a deep dive into Join Nodes - it includes hands-on examples about using the EXPLAIN commands to analyze plans of queries that include JOIN clauses.
It is assumed that you know how to use EXPLAIN and you are already familiar with the preliminary concepts of query plans . Before joining tables, it is necessary to scan them. The previous article in this series explains how PostgreSQL scans data from tables . The examples in this article use the same demo flight tickets mid-sized database used earlier in the series. To try out these examples, it is necessary to be connected to this database, as explained in the introductory article on reading query plans .
Join Nodes These nodes are used when the query involves combining different tables using a JOIN clause. There are different types of Join nodes:
Hash Join In this method, the planner builds a hash of the values in the join column of the inner table. It then scans the outer table and compares against this hash to find matching rows.
Hash Join is used when:
When the join condition is the ‘=’ operation. The inner table is small in size. Because building the hash of a large leads to a high startup cost. It is not preferred when:
The table is very large. Because if the tables are so large that they cannot fit in the working memory, they have to be worked upon in batches. Each batch is consecutively fetched from disk into memory and processed. This makes it inefficient to use a batched approach. This method works very well to join a large table with a small table. However, it has a high startup cost - because it has to build the before doing the join.
Nested Loop Join In this method, for each row in the outer table, the planner finds the matching rows in the inner table. This process continues till both tables are joined.
This method is used:
Typically for smaller tables. Since it is conceptually a brute force approach, it is inefficient for larger tables. Regardless of table size, when the join condition is anything other than “=”, i.e. inequality based joins always use this method. Nested Loop joins are not preferred for:
Large tables when the join condition is equality (“=”). Though this method has a low startup cost, it is usually less efficient. Because Nested Loop is the only method that can handle inequality based joins, it is advisable to try to prefer equality based joins - this can prompt the planner to choose more efficient methods of joining.
Merge Join Merge Join is possible only when the data in both tables is sorted according to the columns on which the join condition is based. When this is not the case, the planner first sorts the tables, then passes the sorted rows to the Merge Join. This leads to a higher startup cost.
Merge Joins are preferred when:
The tables are too large to be efficiently handled by Hash Join The query needs that the output of the join be sorted - this can lead to higher startup costs. Hash Join or Merge Join If there is sufficient working memory, Hash Joins are more suitable for OLAP queries - which involve large volumes of data. For OLTP queries, the overhead of constructing the hash may be too expensive. Merge Joins are suitable for both OLTP queries and OLAP queries.
Merge Join works on sorted data. So, there is an index covering the join key (column), the data is already implicitly sorted. So, Merge Join can be cheaper. On the other hand, if the data needs to be explicitly sorted (using a sort node), Merge Join can be more expensive. In such cases, Hash Join can be cheaper. In particular, Hash Joins are cheaper if there is sufficient working memory to accommodate the hash in full, or to only need a small number of batches.
The planner decides which Join node to use depending on the query. The following sections illustrate each join node with relevant examples. Most of the examples use the same simple SQL query with slight modifications. This will illustrate how small changes can lead to different query plans.
Hash Join In this demo database, the planner uses Hash Join in many cases.
Example 1 Run a join query on the view “aircrafts” (with 9 rows in total) and the table “seats” (around 1300 rows) with the column “aircraft_code” as the join key:
=# EXPLAIN ANALYZE SELECT * FROM aircrafts a JOIN seats s ON a.aircraft_code = s.aircraft_code ;
The execution plan looks like this:
"Hash Join (cost=1.20..365.86 rows=1339 width=67) (actual time=0.053..2.389 rows=1339 loops=1)"
" Hash Cond: (s.aircraft_code = ml.aircraft_code)"
" -> Seq Scan on seats s (cost=0.00..21.39 rows=1339 width=15) (actual time=0.008..0.164 rows=1339 loops=1)"
" -> Hash (cost=1.09..1.09 rows=9 width=52) (actual time=0.013..0.013 rows=9 loops=1)"
" Buckets: 1024 Batches: 1 Memory Usage: 9kB"
" -> Seq Scan on aircrafts_data ml (cost=0.00..1.09 rows=9 width=52) (actual time=0.003..0.004 rows=9 loops=1)"
"Planning Time: 0.114 ms"
"Execution Time: 2.511 ms"
Below is a plain text explanation of the above plan:
The topmost node is a Hash Join node. some textThe hash join can begin only after the hash of the inner table is ready. Thus, the startup cost (1.20) of this node is higher than the execution cost (1.09) of the hash node. This Hash Join operates on the basis of a Hash Condition - which compares the “aircraft_code” column of both the tables. The hash node has two children. Sequential scan on the outer table, ‘seats’. For each row of the outer table, it uses the hash of the inner table to find matching rows. The matching rows become a part of the join. Hash node which builds the hash of the inner table. To build the hash, it needs to scan the inner table, ‘aircrafts’. Thus the child of this Hash node is a Sequential Scan node. some textNotice that the Hash operation specifies the number of batches. If the available memory is insufficient, the hash is built in batches. In this case, it uses just one batch because it is a small table. To build the hash, it needs the output of the sequential scan. Thus, the startup cost (1.09) of the hash node is equal to the execution cost (1.09) of the child sequential scan node. Sequential scan on the inner table, ‘aircrafts’. But because “aircrafts” is a view, it scans the table(s) underlying the view - in this case, it scans the table “aircrafts_data”. Thus, while doing a Hash Join, if the hashing operation uses many batches, it can be beneficial to increase the amount of working memory.
Nested Loop Join As explained earlier, the main use-cases for Nested Loop Joins are:
Relatively smaller tables Inequality-based joins Joining Smaller Tables Equality Joins also use the Nested Loop node, particularly if the outer table is not big.
Example 2 Example 1 in the previous section, which demonstrated Hash Joins, used two of the smaller tables in the database - aircrafts (9 rows) and seats (1339 rows):
=# EXPLAIN ANALYZE SELECT * FROM aircrafts a JOIN seats s ON a.aircraft_code = s.aircraft_code ;
Even for these tables, the brute force approach of the Nested Loop join is considered inefficient. Since Nested Loop joins are used for smaller tables, try to repeat the same query but limit the size of the output to, say 10.
=# EXPLAIN ANALYZE SELECT * FROM aircrafts a JOIN seats s ON a.aircraft_code = s.aircraft_code LIMIT 10 ;
Now, the planner uses a Nested Loop join, as shown in the plan below:
"Limit (cost=0.15..3.08 rows=10 width=67) (actual time=0.041..0.062 rows=10 loops=1)"
" -> Nested Loop (cost=0.15..392.76 rows=1339 width=67) (actual time=0.041..0.060 rows=10 loops=1)"
" -> Seq Scan on seats s (cost=0.00..21.39 rows=1339 width=15) (actual time=0.007..0.008 rows=10 loops=1)"
" -> Memoize (cost=0.15..0.17 rows=1 width=52) (actual time=0.002..0.002 rows=1 loops=10)"
" Cache Key: s.aircraft_code"
" Cache Mode: logical"
" Hits: 9 Misses: 1 Evictions: 0 Overflows: 0 Memory Usage: 1kB"
" -> Index Scan using aircrafts_pkey on aircrafts_data ml (cost=0.14..0.16 rows=1 width=52) (actual time=0.009..0.009 rows=1 loops=1)"
" Index Cond: (aircraft_code = s.aircraft_code)"
"Planning Time: 0.119 ms"
"Execution Time: 0.084 ms"
Because the planner already knows that it has to return only a very small number (10) of rows, it chooses to avoid the overhead (building the hash) of the Hash Join method and instead does a Nested Loop join. Below are the steps of the query plan, starting from the bottom (leaf) nodes:
The first leaf node is a sequential scan on the outer table, ‘seats’. So, the executor scans the outer table row by row. For each row from the outer table, it finds matching rows by scanning the inner table. The second leaf node is an Index Scan node on the inner table. It imposes the JOIN condition while scanning the index to find relevant rows. The parent of this Index scan node is a Memoize node. This caches the result of the conditional index scan as a memo, for faster access. In the absence of this node, the inner table would have had to be re-scanned for each row of the outer table. The Memoize node and the Sequential Scan node (on the outer table) feed into the Nested Loop node - which executes the JOIN. The topmost node imposes the limit condition using the Limit node. As soon as it receives the required number (10) of rows, the recursive execution stops. It no longer asks child nodes to return more data rows. some textThe Limit node can start returning rows as soon as its child node starts its operation. Thus, the startup cost (0.15) of the Limit node is equal to the startup cost (0.15) of its child Nested Loop node. The Limit node doesn’t have to wait for the Nested Loop node to finish executing. It terminates the query as soon as it gets the required number of rows. Thus, its execution cost (3.08) is less than the theoretical execution cost (392.76) of the child Nested Loop node. Memoize The Memoize node caches the result of a parametrized index scan. In the above example, the planner uses the primary key index and does a parametrized index scan on the inner table (‘aircrafts’). It is inefficient to repeatedly scan the index and check for the JOIN condition - so, the output of the parametrized index scan is cached using the Memoize node.
Limit The Limit node stops its child nodes when the desired number of rows has been returned. It is typically used when the query has a LIMIT clause.
As a practical tip, if only a few columns from a few rows are needed, it is advisable to select only those columns and limit the number of rows that the query returns. For example, if returning the list of suggested products for a customer, it is advisable to return only the top few suggestions from the database. This prompts the planner to choose the most efficient path.
Example 3 - Exercise To artificially simulate a Nested Loop join, you can also explicitly disable Hash Joins and Merge Joins:
=# SET enable_hashjoin = off ;
=# SET enable_mergejoin = off ;
Running any join query will now force the planner to use the Nested Loop join operation. As an exercise, compare the performance of different join methods on the same join query. After testing, be sure to re-enable the methods:
=# SET enable_hashjoin = on ;
=# SET enable_mergejoin = on ;
WARNING Do NOT manually disable join methods in an attempt to optimize a query. The planner chooses the right joining method depending on the available data. These settings are universal and apply to all tables. So doing this can have unintended consequences on other queries.
Inequality Joins Joins are commonly based on the equality condition, e.g. a common structure is:
=# -- pseudo code – table1 JOIN table2 on table1.column1 = table2.column2
In many cases, however, the join condition is based on an inequality, such as:
=# -- pseudo code – table1 JOIN table2 on table1.column1 <> table2.column2
When the JOIN condition is anything other than equality (“=”), Nested Loop Join is the only operation that can implement the JOIN.
Example 4 As a practical example, try this query:
=# EXPLAIN ANALYZE SELECT * FROM aircrafts a JOIN seats s ON a.aircraft_code <> s.aircraft_code ;
The output looks like this:
"Limit (cost=0.15..3.08 rows=10 width=67) (actual time=0.041..0.062 rows=10 loops=1)"
" -> Nested Loop (cost=0.15..392.76 rows=1339 width=67) (actual time=0.041..0.060 rows=10 loops=1)"
" -> Seq Scan on seats s (cost=0.00..21.39 rows=1339 width=15) (actual time=0.007..0.008 rows=10 loops=1)"
" -> Memoize (cost=0.15..0.17 rows=1 width=52) (actual time=0.002..0.002 rows=1 loops=10)"
" Cache Key: s.aircraft_code"
" Cache Mode: logical"
" Hits: 9 Misses: 1 Evictions: 0 Overflows: 0 Memory Usage: 1kB"
" -> Index Scan using aircrafts_pkey on aircrafts_data ml (cost=0.14..0.16 rows=1 width=52) (actual time=0.009..0.009 rows=1 loops=1)"
" Index Cond: (aircraft_code = s.aircraft_code)"
"Planning Time: 0.119 ms"
"Execution Time: 0.084 ms"
In the above execution plan:
The topmost node is a Nested Loop join. This works on the basis of a Join Filter which compares the values of the respective columns to check for the inequality condition. In this case, the planner chooses to do a sequential scan on both tables. Thus, to speed up a Nested Loop join, it can be helpful to have an index on the joining column of the inner table. The planner can choose whether or not to use this index depending on the expected performance.
Materialize Observe in the above query plan that the (orange highlighted) node which sequentially scans the inner table (aircrafts_data) is a child of a Materialize node. Recall that in a nested loop join, for each row of the outer table, the inner table is scanned to find matching rows. Repeatedly scanning the inner table is inefficient. Thus, the planner uses the Materialize node to cache the result of the sequential scan on the inner table. This speeds up the query.
Merge Join In this demo database,the tables are not too large - they are efficiently handled by Hash Join, so the planner doesn’t use Merge Joins in a regular join query by default.
However, Merge Join is considered when the output needs to be sorted. So, start with the same query as before:
=# EXPLAIN ANALYZE SELECT * FROM aircrafts a JOIN seats s ON a.aircraft_code = s.aircraft_code LIMIT 10 ;
As shown in Examples 1 and 2, this uses a Nested Loop join with the LIMIT clause, and a Hash Join without the LIMIT clause.
Example 5 Redo the same query but add an ORDER BY clause:
=# EXPLAIN ANALYZE SELECT * FROM aircrafts a JOIN seats s ON a.aircraft_code = s.aircraft_code ORDER BY a.aircraft_code LIMIT 10 ;
When the output of the join is to be presented as an ordered resultset, i.e. when the join query also contains an ORDER BY clause, it can choose to use Merge Join. The execution plan of the above query is:
"Limit (cost=0.41..3.63 rows=10 width=67) (actual time=0.040..0.059 rows=10 loops=1)"
" -> Merge Join (cost=0.41..431.73 rows=1339 width=67) (actual time=0.039..0.057 rows=10 loops=1)"
" Merge Cond: (s.aircraft_code = ml.aircraft_code)"
" -> Index Scan using seats_pkey on seats s (cost=0.28..64.60 rows=1339 width=15) (actual time=0.013..0.016 rows=10 loops=1)"
" -> Index Scan using aircrafts_pkey on aircrafts_data ml (cost=0.14..12.27 rows=9 width=52) (actual time=0.003..0.003 rows=1 loops=1)"
"Planning Time: 0.131 ms"
"Execution Time: 0.080 ms"
In the above plan:
The tables to be joined are scanned using an Index Scan on each table. The resultsets of the index scans are used as input to the Merge Join. The Merge Join node joins the two resultsets in (sorted) order and on the basis of the Merge Condition some textThe Merge Join can start as soon as it receives rows from the child Index Scan nodes. Thus, the startup cost (0.41) of the Merge Join node is higher than the startup cost of the Index Scan Nodes (0.28 and 0.14) The topmost node imposes the limit on the number of returned rows. Hence, in a query where the planner uses Merge Join, it can be helpful to create the appropriate indexes, to improve the query’s performance. Furthermore, the existence of the right indexes can also lead the planner to choose this method over a less efficient one.
Example 6 - Exercise In the previous example, the sorting is according to one of the columns used in the join - the “aircraft_code” column is used to join as well as to sort. As an exercise, try to re-run this query with a different sorting condition. In the next query, the sorting is according to seat number and the join is based on the aircraft code:
=# EXPLAIN ANALYZE SELECT * FROM seats s JOIN aircrafts a ON a.aircraft_code = s.aircraft_code ORDER BY s.seat_no ;
Observe that the query plan is now very different - even though the output is to be sorted, it doesn't use Merge Join. Instead, it uses Hash Join to join the tables. The output of the Hash Join is sorted by a Sort Node. Compare the performance with the previous query.
Conclusion This article explained how PostgreSQL internally processes operations involving joining tables. It discussed the nodes of the query plan tree that are used in join operations. It also presented some practical tips about potentially improving join performance. In addition to joining tables and selecting data, queries involve operations like aggregating data and sorting the selected data. The next articles in the series discuss query plans and nodes used for aggregation and for sorting .