Article

PostgreSQL Query Plans for Reading Tables

•

May 17, 2024

•

15 min read

The query plan tree consists of the set of individual steps that the executor follows to retrieve data requested by the query. These steps constitute the nodes of the query plan tree.

There are different types of nodes, corresponding to different types of operations. The PostgreSQL executor uses Scan Nodes to read data from tables. This article takes a closer look at Scan Nodes and shows hands-on examples on how to use the EXPLAIN commands to study query plans involving these nodes.

The discussion in this article assumes you are already familiar with preliminary concepts like using the EXPLAIN command to generate query plans and reading simple query plans, which are introduced in the previous articles in the series. The examples used in this article assume that you are already connected to the mid-sized demo flight tickets database, as shown in the introductory article on query plans.

Scan Nodes

These nodes implement the internal methods that go through the data pages of a table to find the most relevant rows. There are four kinds of Scan nodes corresponding to different methods of reading the data. The planner decides the type of scan depending on:

The data sought by the query
Whether the right type of indexes are available on the table
Whether using the index is expected to improve performance

Sequential Scan

The data is read in order, row after row, page after page. This is the default method of reading tables - it can be used in every case, without any preconditions. It has minimal startup cost but it produces unsorted output. The I/O cost is lower because it involves only reading the data.

Parallel Sequential Scan

On multithreaded systems, the parent node can create two or more child worker processes to distribute the task of sequentially scanning the table. Each worker sequentially scans a different part of the table. To manage the workers, parallel scans also include a leader process. In some cases, the leader also takes up a part of the scanning work. The leader can also choose to only distribute the workload (not do any scanning itself) and combine the respective outputs from the workers.

Index Scan

The planner uses an index to identify the relevant rows if:

the table has a relevant index available
using the index is expected to improve the query’s performance

In an index scan, the index is used to determine which rows to read data from. It then reads the actual data of the row. Using an index leads to randomly accessing different pages where the relevant data is stored. Doing a very large number of random accesses is slower than simply sequentially scanning the table. Note than index scans produce sorted output. Furthermore, the I/O costs are higher because it first reads the index, then looks up the data from the disk.

Index Only Scan

In an index-only scan, the index is scanned but the rest of the data row is not read. This method can be used when the query returns only indexed columns - thus scanning the index is sufficient and there’s no need to read the rest of the row.

Bitmap Scan

A bitmap scan combines aspects of the sequential scan and the index scan. It works in two stages:

Bitmap Index Scan

It first creates a bitmap based on the query. The bitmap is an array of single bits (0s and 1s). The length of the array is the number of pages (page files) in the table being scanned. So, the bitmap array has a bit corresponding to each pagefile of the table.

The bitmap is initialized as an array of 0s. It then uses the relevant index to set (update) to 1 the bits corresponding to those pages which are expected to contain a matching row. Thus, the Bitmap indicates which pages contain the relevant rows.

If there are multiple indices, a separate bitmap is built for each index. These bitmaps are combined using a bitwise AND operation.

Bitmap Heap Scan

Recall that a table stores data in a heap data structure. This heap consists of individual page files. Data rows (tuples) of the table are written into these pagefiles.

The Bitmap Heap Scan uses the bitmap to determine which pages in the table’s heap have relevant tuples (rows). It then reads data sequentially from these pages to find the right rows. Because the Bitmap first has to be created, this method has a high startup cost.

Comparison of Sequential Scan and Index Scan

Sequential and Index scans are two of the most common methods of retrieving data. There is often confusion about which is more effective.

In some cases index scans can be less efficient than sequential scans. Indexing allows to read only the relevant rows of data. These rows may be scattered throughout different pages. Indexing is most effective at retrieving a small number of relevant rows from a large table.

It is cheaper to sequentially read data in bulk compared to accessing and reading a lot of random data from different pages. In such cases, the planner can choose to not use the index, even if it is available.
The planner can also choose to ignore the index (and do a sequential scan) if the available index is not suitable for the type of query.
Using the index carries an overhead - the executor has to first check the index to identify relevant rows, check which page that row is stored in, and then read the data from that page. So, sequential scans can be more efficient for small tables.

The next section shows a few examples to illustrate the use of each type of Scan node. The numbers presented here are based on running the queries on the test server. They will not match what you see on your computer. However, they should be proportional and indicative.

Sequential Scan

Example 1

Run a trivial scan, such as:

=# EXPLAIN ANALYZE SELECT * FROM bookings ;

This is not a very large table, so the most efficient way to read it in full is to do a vanilla sequential scan to retrieve the rows.

"Seq Scan on bookings  (cost=0.00..9714.33 rows=593433 width=21) (actual time=0.010..74.698 rows=593433 loops=1)"
"Planning Time: 0.050 ms"
"Execution Time: 116.646 ms"

Parallel Sequential Scan

A parallel sequential scan involves a few overheads:

Dividing the workload among workers
Passing the scanned data from the workers to the leader
Combining the output of the different workers

For simple queries, as in the previous example, these management overheads make it inefficient to use a parallel scan. Therefore, in Example 1 the planner used a plain sequential scan.

Example 2

Add a WHERE condition to the previous query:

=# EXPLAIN ANALYZE SELECT * FROM bookings WHERE total_amount < 4000 ;

Now, in addition to scanning the data, it must also check if the condition in the WHERE clause is satisfied. Thus, it is now efficient to distribute this task among workers.

"Gather  (cost=1000.00..7914.60 rows=438 width=21) (actual time=1.188..39.535 rows=257 loops=1)"
"  Workers Planned: 2"
"  Workers Launched: 2"
"  ->  Parallel Seq Scan on bookings  (cost=0.00..6870.80 rows=182 width=21) (actual time=0.619..30.749 rows=86 loops=3)"
"        Filter: (total_amount < '4000'::numeric)"
"        Rows Removed by Filter: 197725"
"Planning Time: 0.066 ms"
"Execution Time: 39.632 ms"

The above plan shows the use of 2 worker processes which execute the parallel sequential scan. The Gather node at the top combines the outputs of the workers. Creating parallel workers has a fixed overhead cost of 1000, hence the startup cost of the Gather node is 1000.

Sequential vs. Parallel Sequential Scans

Using a more complex query does not always lead to using parallel workers. Even with a WHERE clause, it can sometimes be more efficient to use a vanilla (non-parallel) sequential scan.

Example 3

Invert the condition in the WHERE clause of the previous query:

=# EXPLAIN ANALYZE SELECT * FROM bookings WHERE total_amount > 4000 ;

Instead of “total_amount < 4000”, the query now has “total_amount > 4000”. WIth this simple change, the planner now chooses to go back to doing a sequential scan, instead of doing it in parallel.

"Seq Scan on bookings  (cost=0.00..11197.91 rows=592839 width=21) (actual time=0.012..120.480 rows=593028 loops=1)"
"  Filter: (total_amount > '4000'::numeric)"
"  Rows Removed by Filter: 405"
"Planning Time: 0.092 ms"
"Execution Time: 156.929 ms"

Look closely at the (green) highlighted lines in the query plans in Examples 2 and 3.

In Example 2, each worker removed nearly 200, 000 rows. The query returned 257 rows out of a total of over 590, 000.
In Example 3, using the filter removed only 405 rows. So the number of rows returned is almost the same as the total number of rows in the table.

Removing such a large number of rows is more efficient when the workload is distributed among child processes - because there is more work to be done. In Example 3, only a small number of rows (relative to the total) have to be removed. Hence, it is more efficient to use the sequential scan with the filter condition, instead of adding on the overhead of parallelizing the scan. The planner uses table statistics to estimate how many rows it expected to meet the filter condition.

Bitmap Index Scan

Indexing is one of the most important techniques to improve query performance. Consider again the query in Example 2:

=# EXPLAIN ANALYZE SELECT * FROM bookings WHERE total_amount < 4000 ;

Example 4

The query above has a WHERE clause based on the value of the column “total_amount”. Add an index over this column:

=# CREATE INDEX bookings_totalamount ON bookings (total_amount) ;

=# VACUUM ANALYZE bookings ;

Re-run the earlier SELECT query:

=# EXPLAIN ANALYZE SELECT * FROM bookings WHERE total_amount < 4000 ;

After creating the index, the query:

Has a total cost of 1259 (compared to 7915 earlier) and an execution time of 0.35 ms (compared to 40 ms earlier). some text
- Using the index means it doesn’t have to scan every row to find the few relevant rows. So it executes much faster.
Instead of a (parallel) sequential scan, it now does a Bitmap Index Scan followed by a Bitmap Heap Scan.
The Bitmap Heap Scan can begin only after the Bitmap Index scan is complete. Hence the heap scan has a startup cost of 11.80, which is higher than the execution cost (11.70) of the index scan.
The planning time is 0.4 ms (compared to 0.08 ms earlier)some text
- The planner now has to consider the impact of the index, instead of blindly doing a sequential scan. So, the planning time is higher.

The execution plan for the query after adding the index is shown below.

"Bitmap Heap Scan on bookings  (cost=11.80..1259.71 rows=436 width=21) (actual time=0.056..0.316 rows=257 loops=1)"
"  Recheck Cond: (total_amount 	< '4000'::numeric)"
"  Heap Blocks: exact=246"
"  ->  Bitmap Index Scan on bookings_totalamount  (cost=0.00..11.70 rows=436 width=0) (actual time=0.029..0.029 rows=257 loops=1)"
"        Index Cond: (total_amount < '4000'::numeric)"
"Planning Time: 0.098 ms"
"Execution Time: 0.344 ms"

The execution starts by creating the bitmap. It does this by scanning the index on the column on which the filter (in the WHERE clause) is applied. some text
1. It applies the filer to identify the relevant rows
It then scans the heap to extract the information corresponding to the relevant rows. Notice that almost all the execution time and cost is spent on the heap scan. The index scan is very fast.

Plans vs Actuals

Look closely at the highlighted sections in the first line of the above plan:

"Bitmap Heap Scan on bookings (cost=11.80..1259.71 rows=436 width=21) (actual time=0.056..0.316 rows=257 loops=1)"

The planner originally estimated there to be 436 rows that meet the criteria. After executing the query, there were only 257 rows. This discrepancy is because the plan is based on the statistical properties of the table - these statistical properties are derived via random sampling. This is also why it is important to regularly refresh the table statistics using the VACUUM and ANALYZE commands.

When the Planner Ignores the Index

The “bookings” table has over half a million rows. The query in the previous section returns around 250 rows out of the half-million. This is a very good scenario to use the index - returning a small number of rows from the table. Full table scans are the least efficient when the goal is to return a small percentage of the rows.

To recall, this is the query from the last section:

=# EXPLAIN ANALYZE SELECT * FROM bookings WHERE total_amount < 4000 ;

Example 5

Modify the above query to invert the WHERE condition:

=# EXPLAIN ANALYZE SELECT * FROM bookings WHERE total_amount > 4000 ;
Clearly, this query is going to return a large number of rows. In fact, recall from Example 3 that the size of the returned resultset is almost the same as that of the original table. In such a situation, using the index only adds computational overhead, without adding much (or any) value. Indeed, when you run the modified query, the planner no longer uses the index. This is the query plan on running the modified query:

"Seq Scan on bookings  (cost=0.00..11197.91 rows=592778 width=21) (actual time=0.007..109.540 rows=593028 loops=1)"
"  Filter: (total_amount > '4000'::numeric)"
"  Rows Removed by Filter: 405"
"Planning Time: 0.103 ms"
"Execution Time: 143.515 ms"

Full table scans, and not index scans, are the most efficient when the query has to return a large percentage of the total number of rows. Thus, in this case, the planner does a sequential scan on the table and filters for the relevant rows.

Index-only Scans

Recall that a typical index based scan uses the index to identify the relevant rows. It then uses that information to read the data pages to extract the data in the identified rows.

However, if there is a relevant index that covers all the columns to be returned, there is no need to extract the complete row. Thus, an index-only scan is used when there is a relevant index that includes the columns to be returned.

Example 6

Modify the SELECT query of Example 4 to return only the single column that the index covers:

"Index Only Scan using bookings_totalamount on bookings  (cost=0.42..15.46 rows=402 width=6) (actual time=0.027..0.074 rows=257 loops=1)"
"  Index Cond: (total_amount < '4000'::numeric)"
"  Heap Fetches: 0"
"Planning Time: 0.130 ms"
"Execution Time: 0.110 ms"

The query that returns all the columns in the selected rows has a cost of 1172 and an execution time of 0.7 ms.
The query that returns only the indexed column has a cost of 15 and an execution time of 0.1 ms.
Recall that in the previous example, the bulk of the cost is spent on the heap scan (to read the data), not the index scan.

Thus, limiting the number of returned columns and having an index covering those columns gives a strong performance boost. Index-only scans are the most efficient type of scanning. Thus, consider using SELECT on only the needed columns instead of doing a SELECT *.

Conclusion

This article discussed query plans used in reading data from tables. Operations such as joining, sorting, etc., all start with reading relevant data from the tables. The next articles in this series discuss query plans for joining tables, for aggregating data, for sorting data, and more.

Limitless data movement with free Alpha and Beta connectors

Introducing: our Free Connector Program ->

The data movement infrastructure for the modern data teams.

Try a 14-day free trial

About the Author

Arun's interest in building new things from scratch led him to technology and to learning full stack development skills. He owes much of what he knows to well written documentation. He hopes to pay it forward by sharing his own learnings in the form of hands-on tutorials. His current interests are data analysis and AI/ML tools.