This is the first of a series of articles about studying PostgreSQL query plans. This article introduces the fundamental principles of query planning and shows how to get started with the EXPLAIN commands. Subsequent articles explain how to read a query plan, discuss the individual steps in a query plan tree, show practical examples, and suggest query optimization guidelines.
Background Extracting information from a database is a crucial part of any data-related workflow. Databases store information in data structures called heaps.
Each table gets its own heap. Heaps consist of specially formatted files (page files) which are saved on the disk.some textEach page file is of a fixed size. In PostgreSQL, the fixed page size is 8kB, or 8192 Bytes. The data heap of a single table can span many pages. Data rows are represented as tuples. Tuples are written (saved) in page files. some textThe number of rows (tuples) in a page depends on the width (size of the individual columns) of the rows. The database engine accesses these pages on the disk (or in cache) to retrieve the relevant information. Thus, the database engine needs to decide which pages to access and in what order. Figuring out the most efficient way to do this can be a difficult process, especially for complex queries on large datasets. This is the job of the query planner software.
How PostgreSQL processes Queries The database processes queries in a few steps:
Parsing and preliminary processing. This culminates in a query tree, which is an internal representation of a SQL statement. Each part of the query is a branch in the query tree. Creating a query plan. The planner (also called optimizer) analyzes all the possible paths that it can take to evaluate the query. Each path consists of a sequence of internal methods (for example, methods for scanning tables, for sorting, etc.). It assigns a relative cost to each step in the sequence. Based on these costs, it chooses what it estimates to be the most optimal path. The query plan is also represented as a tree - the query plan tree. Executing the query plan. The executor processes each node (step) in the query plan tree. It is responsible for fetching data from the storage system, filtering for relevant rows, sorting the selected rows, joining tables, and so on. The executor returns the rows and columns requested by the query. How Query Execution Works When executed, each node in the query plan tree returns a set of rows. For example, a sequential scan node returns all the rows of the table. A node with a filter condition returns only those rows that satisfy the filter condition.
The executor recursively processes the entire query plan tree, starting from the topmost node. The following steps roughly describe the recursive algorithm:
The first (topmost) node in the query plan tree calls the executor. It checks if any other (child) nodes need to be processed before the parent node can do its job. If so, the executor runs the child node(s) before executing the parent node. This process continues recursively until all child nodes are processed. some textThus, the bottommost nodes are executed first and the topmost parent node is executed last. A node starts returning rows as soon as it can. For example, a join node starts returning joined rows when it receives one row from each table to be joined. A sequential scan row returns a row as soon as it reads it. However, a sort row has to wait to receive all the rows from the child node before it can even start sorting. This process continues until a child node returns NULL, indicating that it has no more data to return for further processing. The startup time of a node is the time it needs before it can start returning rows. Thus, a sequential scan node has a low startup time - it starts returning rows as soon as it starts reading the table. However, a sort node can have a high startup time - it needs all the input rows before it can start sorting. EXPLAIN commands are used to know the details of the execution path (query plan) and optionally, the execution time and other operational details. Analyzing these details is often the starting point in thinking of ways to optimize queries to run faster.
How EXPLAIN Works EXPLAIN, when used without any special parameters, does not actually execute the query. It uses statistical information about the contents of different tables to estimate the execution cost. It gets this statistical information by randomly sampling the contents of all the data tables.
Table Statistics Table statistics refers to statistical information about tables. These statistics are stored in raw form in the table “pg_statistic”. To get human readable information on table statistics, look at the view “pg_stats”:
=# SELECT * FROM pg_stats WHERE tablename = ‘pg_database’ ;
The above query shows table statistics for one of the internal tables. Run it on a table in a real database to get a better idea of the kind of statistical information Postgres stores. It is also possible to create custom statistics on tables. A detailed discussion of table statistics is beyond the scope of this article.
Vacuuming For Postgres to correctly estimate execution times and choose the most efficient execution path, it needs to have access to the latest table statistics. The VACUUM operation removes dead tuples which store outdated data (such as the data of deleted rows). The ANALYZE operation updates table statistics.
To ensure that the planner has access to the latest statistics it is advisable to run VACUUM ANALYZE on the table:
At periodic intervals After adding a new index to the table How to Run the EXPLAIN Commands Prerequisites To follow the examples in this article and in the rest of the series, you are expected to have access to:
The UNIX/Linux terminal The PSQL command-line frontend for PostgreSQL version 14 or higher The pgAdmin IV GUI interface The “postgres” user to run SQL queries There are two ways to run the EXPLAIN commands:
Using the PSQL interface Using the pgAdmin 4 GUI Query Tool The PSQL Interface Using the PSQL command line, prepend EXPLAIN (or any EXPLAIN-based command) to a query. To start with, run the EXPLAIN command on a trivial query to select everything from an internal system database:
=# EXPLAIN SELECT * FROM pg_database;
The output looks like this:
"Seq Scan on pg_database (cost=0.00..1.09 rows=9 width=271)"
Similarly, try:
=# EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM pg_database;
The next article in the series explains how to read and understand query plans.
The pgAdmin 4 Query Tool GUI You can also use pgAdmin to use the EXPLAIN-based commands to view the query plans and other details in semi-graphical form.
Set up pgAdmin 4 to connect to the right database. The Object Explorer pane on the left shows the list of databases available. Right click on the name of the database and select “Query Tool”. In the Query window, type out the query for which you want to see the plan (there is no need to prepend the EXPLAIN keyword): SELECT * FROM pg_database ;
Just above the Query window, there is a horizontal pane of buttons - to open files, save files, and so on. This pane also has buttons to run and explain queries. Use the Explain Settings (down arrow) button to enable to these options:some textBuffers Verbose Timing Costs Additionally, you can also choose to enable or disable the other parameters, like Settings, Summary, and WAL. Press the Explain button (Marked with an E) to show the query plan without executing the query. The Explain Analyze button is right next to it. You can also use the function keys to perform different actions on the entered query:some textF5 to run the query F7 to run EXPLAIN SHIFT + F7 to run EXPLAIN ANALYZE The Explain window, below the Query window, shows the output. It has three tabs - Graphical, Analysis, and Statistics:
The Graphical tab shows the different stages of executing the query. Click on each stage to see information about it. The Analysis tab shows a textual description of the nodes of the query plan tree. The Statistics tab shows information about the plan itself, such as the number of nodes of each type. Both interfaces (PSQL and pgAdmin) present the same information, in different format and layouts. The output of the graphical interface is largely self-explanatory after one is familiar with the core ideas. Thus, the discussion in this article focuses on the underlying principles and is based on the PSQL command line interface.
Which to Use – PSQL or pgAdmin Query Tool GUI While they both broadly serve the same purpose, there are important differences between both the interfaces.
pgAdmin’s GUI can seem more welcoming than the CLI output, especially for those used to GUIs and IDEs In pgAdmin, the output window only shows the output of the last query. In the CLI, you can always scroll up to see the plans of previous queries. This can be useful for comparing outputs of different queries. The pgAdmin output is nicely formatted in tabular form and it also has some color coding. This makes it much easier on the eyes. It is good for drilling in on a specific query plan. In practice, try out both and see which you like better.
Note that pgAdmin IV also has a PSQL interface, which is functionally equivalent to running PSQL on the command line. To access this, right click the database name and select the option “PSQL Tool”. However, the terminal emulator built into pgAdmin IV has a subpar user-experience compared to a dedicated Linux CLI terminal.
Conclusion This article introduced the core concepts of query plans and showed how to use the EXPLAIN commands. The next article in the series discusses how to read and interpret query plans.