
Friday, September 18, 2020

PostgreSQL : Getting the Most Out Of Your PostgreSQL Indexes


Tips and tricks to better utilize your Postgres indexes


In the Postgres world, indexes are essential to efficiently navigate the table data storage (aka the “heap”). Postgres does not maintain any clustering order for the heap, and the MVCC architecture means multiple versions of the same tuple can be lying around in it. Creating and maintaining effective and efficient indexes to support your applications is therefore an essential skill.

Read on to check out a few tips on optimizing and improving the use of indexes in your deployment.

Note: Queries shown below are run on an unmodified pagila sample database.

Use Covering Indexes

Consider a query to fetch the emails of all inactive customers. The customer table has an active column, and the query is straightforward:

pagila=# EXPLAIN SELECT email FROM customer WHERE active=0;
                        QUERY PLAN
-----------------------------------------------------------
 Seq Scan on customer  (cost=0.00..16.49 rows=15 width=32)
   Filter: (active = 0)
(2 rows)

The query calls for a full sequential scan of the customer table. Let’s create an index on the active column:

pagila=# CREATE INDEX idx_cust1 ON customer(active);
CREATE INDEX
pagila=# EXPLAIN SELECT email FROM customer WHERE active=0;
                                 QUERY PLAN
-----------------------------------------------------------------------------
 Index Scan using idx_cust1 on customer  (cost=0.28..12.29 rows=15 width=32)
   Index Cond: (active = 0)
(2 rows)

This helps: the sequential scan has become an “index scan”. This means Postgres will scan the index “idx_cust1”, and then look up the table’s heap to read the values of the other columns (in this case, the email column) that the query needs.

PostgreSQL 11 introduced covering indexes. This feature allows you to include one or more additional columns in the index itself – that is, the values of these extra columns are stored within the index data storage.

If we use this feature to include the value of email inside the index, Postgres will not need to look into the table’s heap to get it. Let’s see if this works:

pagila=# CREATE INDEX idx_cust2 ON customer(active) INCLUDE (email);
CREATE INDEX
pagila=# EXPLAIN SELECT email FROM customer WHERE active=0;
                                    QUERY PLAN
----------------------------------------------------------------------------------
 Index Only Scan using idx_cust2 on customer  (cost=0.28..12.29 rows=15 width=32)
   Index Cond: (active = 0)
(2 rows)

The “Index Only Scan” tells us that the query is now completely satisfied by the index itself, thus potentially avoiding all the disk I/O for reading the table’s heap.

Covering indexes are available only for B-Tree indexes as of now. Also, the cost of maintaining a covering index is naturally higher than a regular one.
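
To see this cost concretely, you can compare the on-disk sizes of the two indexes created above (exact numbers will vary from system to system):

SELECT relname, pg_size_pretty(pg_relation_size(oid)) AS size
FROM pg_class
WHERE relname IN ('idx_cust1', 'idx_cust2');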

Use Partial Indexes

Partial indexes only index a subset of the rows in a table. This keeps the indexes smaller in size and faster to scan through.

Assume we need to get the list of emails of customers located in California. The query is:

SELECT c.email FROM customer c
JOIN address a ON c.address_id = a.address_id
WHERE a.district = 'California';

which has a query plan that involves scanning both the tables that are joined:

pagila=# EXPLAIN SELECT c.email FROM customer c
pagila-# JOIN address a ON c.address_id = a.address_id
pagila-# WHERE a.district = 'California';
                              QUERY PLAN
----------------------------------------------------------------------
 Hash Join  (cost=15.65..32.22 rows=9 width=32)
   Hash Cond: (c.address_id = a.address_id)
   ->  Seq Scan on customer c  (cost=0.00..14.99 rows=599 width=34)
   ->  Hash  (cost=15.54..15.54 rows=9 width=4)
         ->  Seq Scan on address a  (cost=0.00..15.54 rows=9 width=4)
               Filter: (district = 'California'::text)
(6 rows)

Let’s see what a regular index gets us:

pagila=# CREATE INDEX idx_address1 ON address(district);
CREATE INDEX
pagila=# EXPLAIN SELECT c.email FROM customer c
pagila-# JOIN address a ON c.address_id = a.address_id
pagila-# WHERE a.district = 'California';
                                      QUERY PLAN
---------------------------------------------------------------------------------------
 Hash Join  (cost=12.98..29.55 rows=9 width=32)
   Hash Cond: (c.address_id = a.address_id)
   ->  Seq Scan on customer c  (cost=0.00..14.99 rows=599 width=34)
   ->  Hash  (cost=12.87..12.87 rows=9 width=4)
         ->  Bitmap Heap Scan on address a  (cost=4.34..12.87 rows=9 width=4)
               Recheck Cond: (district = 'California'::text)
               ->  Bitmap Index Scan on idx_address1  (cost=0.00..4.34 rows=9 width=0)
                     Index Cond: (district = 'California'::text)
(8 rows)

The sequential scan of address has been replaced with a bitmap index scan over idx_address1, followed by a bitmap heap scan that fetches the matching rows from address’s heap.

Assuming this is a frequent query and needs to be optimized, we can use a partial index which only indexes those rows of address where the district is ‘California’:

pagila=# CREATE INDEX idx_address2 ON address(address_id) WHERE district='California';
CREATE INDEX
pagila=# EXPLAIN SELECT c.email FROM customer c
pagila-# JOIN address a ON c.address_id = a.address_id
pagila-# WHERE a.district = 'California';
                                           QUERY PLAN
------------------------------------------------------------------------------------------------
 Hash Join  (cost=12.38..28.96 rows=9 width=32)
   Hash Cond: (c.address_id = a.address_id)
   ->  Seq Scan on customer c  (cost=0.00..14.99 rows=599 width=34)
   ->  Hash  (cost=12.27..12.27 rows=9 width=4)
         ->  Index Only Scan using idx_address2 on address a  (cost=0.14..12.27 rows=9 width=4)
(5 rows)

The query now reads only the index idx_address2 and does not touch the table address.
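
Note that the planner can use a partial index only when it can prove that the query’s WHERE clause implies the index predicate. A query for any other district cannot use idx_address2 and falls back to idx_address1 or a sequential scan; for example (output not shown):

EXPLAIN SELECT c.email FROM customer c
JOIN address a ON c.address_id = a.address_id
WHERE a.district = 'Texas';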

Use Multi-Value Indexes

Some columns that need indexing may not have a scalar data type. Column types like jsonb, arrays, and tsvector have composite or multiple values. If you need to index such columns, it is usually the case that you need to search through the individual values in those columns as well.

Let’s try to find all movie titles that include behind the scenes outtakes. The film table has a text array column called special_features, which includes the text array element Behind The Scenes if a movie has that feature. To find all such films, we need to select all rows that have “Behind The Scenes” in any of the values of the array special_features:

SELECT title FROM film WHERE special_features @> '{"Behind The Scenes"}';

The containment operator @> checks whether the left-hand side is a superset of the right-hand side.
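
You can try the operator directly on array literals to get a feel for it:

SELECT ARRAY['Trailers', 'Behind The Scenes'] @> ARRAY['Behind The Scenes'];
-- returns true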

Here is the query plan:

pagila=# EXPLAIN SELECT title FROM film
pagila-# WHERE special_features @> '{"Behind The Scenes"}';
                           QUERY PLAN
-----------------------------------------------------------------
 Seq Scan on film  (cost=0.00..67.50 rows=5 width=15)
   Filter: (special_features @> '{"Behind The Scenes"}'::text[])
(2 rows)

which calls for a full scan of the heap, at a cost of 67.

Let’s see if a regular B-Tree index helps:

pagila=# CREATE INDEX idx_film1 ON film(special_features);
CREATE INDEX
pagila=# EXPLAIN SELECT title FROM film
pagila-# WHERE special_features @> '{"Behind The Scenes"}';
                           QUERY PLAN
-----------------------------------------------------------------
 Seq Scan on film  (cost=0.00..67.50 rows=5 width=15)
   Filter: (special_features @> '{"Behind The Scenes"}'::text[])
(2 rows)

The index is not even considered. The B-Tree index has no idea that there are individual elements in the value it indexed.

What we need is a GIN index.

pagila=# CREATE INDEX idx_film2 ON film USING GIN(special_features);
CREATE INDEX
pagila=# EXPLAIN SELECT title FROM film
pagila-# WHERE special_features @> '{"Behind The Scenes"}';
                                QUERY PLAN
---------------------------------------------------------------------------
 Bitmap Heap Scan on film  (cost=8.04..23.58 rows=5 width=15)
   Recheck Cond: (special_features @> '{"Behind The Scenes"}'::text[])
   ->  Bitmap Index Scan on idx_film2  (cost=0.00..8.04 rows=5 width=0)
         Index Cond: (special_features @> '{"Behind The Scenes"}'::text[])
(4 rows)

The GIN index is able to support matching of the individual value against the indexed composite value, resulting in a query plan with less than half the cost of the original.
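
The same approach works for the other multi-value types mentioned earlier. As a sketch (the orders table and attributes column here are hypothetical, not part of pagila), a GIN index on a jsonb column supports containment queries in the same way:

-- hypothetical table with a jsonb column
CREATE INDEX idx_orders_attrs ON orders USING GIN (attributes);

-- the index can then support containment queries like:
SELECT * FROM orders WHERE attributes @> '{"status": "shipped"}';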

Eliminate Duplicate Indexes

Over time indexes accumulate, and sometimes one gets added which has the exact same definition as another one. You can use the catalog view pg_indexes to get the human-readable SQL definitions of indexes. You can also easily detect identical definitions:

  SELECT array_agg(indexname) AS indexes, replace(indexdef, indexname, '') AS defn
    FROM pg_indexes
GROUP BY defn
  HAVING count(*) > 1;

And here’s the result when run on the stock pagila database:

pagila=#   SELECT array_agg(indexname) AS indexes, replace(indexdef, indexname, '') AS defn
pagila-#     FROM pg_indexes
pagila-# GROUP BY defn
pagila-#   HAVING count(*) > 1;
                                indexes                                 |                                defn
------------------------------------------------------------------------+------------------------------------------------------------------
 {payment_p2017_01_customer_id_idx,idx_fk_payment_p2017_01_customer_id} | CREATE INDEX  ON public.payment_p2017_01 USING btree (customer_id
 {payment_p2017_02_customer_id_idx,idx_fk_payment_p2017_02_customer_id} | CREATE INDEX  ON public.payment_p2017_02 USING btree (customer_id
 {payment_p2017_03_customer_id_idx,idx_fk_payment_p2017_03_customer_id} | CREATE INDEX  ON public.payment_p2017_03 USING btree (customer_id
 {idx_fk_payment_p2017_04_customer_id,payment_p2017_04_customer_id_idx} | CREATE INDEX  ON public.payment_p2017_04 USING btree (customer_id
 {payment_p2017_05_customer_id_idx,idx_fk_payment_p2017_05_customer_id} | CREATE INDEX  ON public.payment_p2017_05 USING btree (customer_id
 {idx_fk_payment_p2017_06_customer_id,payment_p2017_06_customer_id_idx} | CREATE INDEX  ON public.payment_p2017_06 USING btree (customer_id
(6 rows)
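
Once you have confirmed that a pair is truly identical (same table, same columns, same index method), keep one and drop the other, e.g.:

DROP INDEX payment_p2017_01_customer_id_idx;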

Superset Indexes

It is also possible to end up with multiple indexes where one indexes a superset of the columns the other does. This may or may not be desirable: the superset index can enable index-only scans, which is a good thing, but it takes up more space, and the query it was originally intended to optimize may no longer be in use.

If you wish to automate the detection of such indexes, the pg_catalog table pg_index is a good starting point.
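
For example, here is a minimal sketch that flags index pairs on the same table where one index’s key columns form a leading prefix of another’s. It deliberately ignores expression indexes, predicates, and operator classes, so treat its output as candidates to review, not indexes to drop:

SELECT a.indexrelid::regclass AS possibly_redundant,
       b.indexrelid::regclass AS superset_index
FROM pg_index a
JOIN pg_index b
  ON a.indrelid = b.indrelid
 AND a.indexrelid <> b.indexrelid
WHERE b.indkey::text LIKE a.indkey::text || ' %';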

Unused Indexes

As the applications that use the database evolve, so do the queries they run. Indexes that were added earlier may no longer be used by any query. Every time an index is scanned, the statistics manager notes it, and a cumulative count is available in the system catalog view pg_stat_user_indexes as the value idx_scan. Monitoring this value over a period of time (say, a month) gives a good idea of which indexes are unused and can be removed.

Here is the query to get the current scan counts for all indexes in the ‘public’ schema:

SELECT relname, indexrelname, idx_scan
FROM   pg_catalog.pg_stat_user_indexes
WHERE  schemaname = 'public';

with output like this:

pagila=# SELECT relname, indexrelname, idx_scan
pagila-# FROM   pg_catalog.pg_stat_user_indexes
pagila-# WHERE  schemaname = 'public'
pagila-# LIMIT  10;
    relname    |    indexrelname    | idx_scan
---------------+--------------------+----------
 customer      | customer_pkey      |    32093
 actor         | actor_pkey         |     5462
 address       | address_pkey       |      660
 category      | category_pkey      |     1000
 city          | city_pkey          |      609
 country       | country_pkey       |      604
 film_actor    | film_actor_pkey    |        0
 film_category | film_category_pkey |        0
 film          | film_pkey          |    11043
 inventory     | inventory_pkey     |    16048
(10 rows)
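
Once you have a suitable observation window’s worth of numbers, a query like this lists never-scanned indexes. It is a starting point only; it excludes indexes that enforce constraints, which should stay regardless of scan counts:

SELECT s.relname, s.indexrelname
FROM pg_catalog.pg_stat_user_indexes s
JOIN pg_catalog.pg_index i ON i.indexrelid = s.indexrelid
WHERE s.idx_scan = 0
  AND NOT i.indisunique
  AND NOT i.indisprimary;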

Rebuild Indexes With Less Locking

It is not uncommon for indexes to need recreation. Indexes can become bloated, and recreating an index fixes that, making it faster to scan. Indexes can also become corrupt. And altering index parameters may also require recreating the index.
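
With PostgreSQL 12 and later, an index can be rebuilt in place without blocking writes to the table, using REINDEX CONCURRENTLY:

REINDEX INDEX CONCURRENTLY idx_address1;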

Enable Parallel Index Creation

In PostgreSQL 11, B-Tree index creation can be parallelized: multiple parallel workers can be used to speed up the build. However, you need to make sure these configuration entries are set suitably:

SET max_parallel_workers = 32;
SET max_parallel_maintenance_workers = 16;

The default values (8 for max_parallel_workers and 2 for max_parallel_maintenance_workers) are quite conservative. Ideally, these numbers should increase with the number of CPU cores. See the docs for more information.
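
The degree of parallelism for a B-Tree build can also be capped per table via the parallel_workers storage parameter, which CREATE INDEX consults:

ALTER TABLE film SET (parallel_workers = 4);
-- subsequent index builds on film will request up to 4 workers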

Create Indexes in the Background

You can also create an index in the background, using the CONCURRENTLY parameter of the CREATE INDEX command:

pagila=# CREATE INDEX CONCURRENTLY idx_address1 ON address(district);
CREATE INDEX

This differs from a regular CREATE INDEX in that it does not take a lock that blocks writes to the table. On the downside, it takes more time and resources to complete.
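
One caveat: if a concurrent build fails or is cancelled, it can leave behind an INVALID index that takes up space and slows down writes without ever serving queries. You can check for such leftovers (and drop them) with:

SELECT indexrelid::regclass AS invalid_index
FROM pg_index
WHERE NOT indisvalid;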
