Tuning your PostgreSQL for High Performance

How you can make the most out of this powerful database

Photo by Hoover Tung on Unsplash

Table of Contents

  • Common Issues

Assumptions

  • You know basic SQL

Common issues you might have faced in the past

  • A query is slow periodically

General principles

Why use a subdomain instead of IP/Hostname of your database

This will allow you to hot-swap servers with less chance of downtime, de-commission nodes, simple round-robin, and geo-distribution to name a few examples.

When your database is smaller than 10 GB

if your database has less than 10 GB. It will be fast with the default settings unless you really start to go crazy with your queries and that’s usually a good sign that your table layout needs to be improved

Read Replicas. Why and How

Postgres is an amazing database, but it has one “problem”. The Vanilla version can only have one write node.
This means there is only so high you can vertically scale your machine.

A simple way to overcome this is by adding read replicas and only use your write instance for writes and high latency sensible queries.
This will also help you with lock/deadlock issues since you will be splitting your reads and writes (caution when you expect those locks to save you from inconsistent behavior on reads)

Use Read Replicas to increase your throughput and resilience

PGBouncer + Connection Pooling

PostgreSQL was built to handle a small number of long-lasting connections.
That’s why it has a relatively small default limit of connections and it’s not uncommon for people to get the error “connection limit exceeded for non-superusers”.
One might be tempted to just increase that limit, but there are better ways.
Postgres will usually complete the same 10,000 transactions faster by doing them 5, 10, or 20 at a time than by doing them 500 at a time. Determining exactly how many should be done at once varies by workload and requires tuning.

https://wiki.postgresql.org/wiki/Number_Of_Database_Connections
https://www.cybertec-postgresql.com/en/tuning-max_connections-in-postgresql/

Another thing to take into consideration with connections is that creating and destroying them sometimes can be responsible for as much as half of that operation time and resource usage.
Because of that, it’s always recommended to have persistent connections.
This, in turn, will just exacerbate the previous limit issue.

So how can we solve this?

The answer is using a connection pool. Sometimes the framework you’re using offers that (like SQLAlchemy), other times you might need a specialized tool.
But if you’re using multiple machines to serve your API (as you probably are), you will run into the previous problem again since your framework pooler will only be responsible for managing the instance it’s running on.

When you have multiple servers, a good solution is using PGBouncer.

This amazing tool will allow you to connect multiple servers to it instead of connecting directly to Postgres.

It will queue incoming connections when the concurrent amount is greater than the one specified by you.
I’ll leave PGBouncer for another post.

PosgreSQL performance will increase dramatically if you have a small number of long-lasting connections instead of constantly creating new ones.

pgBadger

pgBadger is a PostgreSQL log analyzer built for speed that is able to generate full reports from your PostgreSQL log file and show them on a web interface. It’s a single and small Perl script that outperforms any other PostgreSQL log analyzer.
I always recommend using it if your Postgres is critical to your systems.

Keep accurate logs and metrics of your servers so you can make informed decisions

Your transactional DB is not for logging

Logging using your database can lead to massive bloat and it will create a bottleneck on various fronts.
Use s3, cloudwatch, papertrail, elk, logz.io, etc.
Not only will it be faster, cheaper, and allow better visibility, it will prevent your logs from affecting your transactional queries.

OLTP vs OLAP

Pg is usually used in OLTP scenarios with a bit of OLAP.
That is fine as long as you understand the difference between them.
For example, if a public-facing endpoint contains a very complex query that loads a lot of data and does complex operations, it’s probably a good indicator that it should be treated as OLAP and that endpoint refactored to adopt a different strategy.

Keep your queries lean and use data pipelines to feed a data lake that can be used for analytics

Indexes

There ain’t no such thing as a free lunch (TANSTAAFL)

The main thing to have in mind about indexes is that they are not a silver bullet that will solve all your problems without a gotcha.

Space usage
They have to be stored somewhere right?

Overhead for creating, updating, and deleting rows
They have to be maintained as your data changes

In this image, you can see the space those indexes occupy

SQL to fetch that table

You can find more resources on index maintenance here: https://wiki.postgresql.org/wiki/Index_Maintenance

Using these “tools”, you can (and should) keep an eye on their usage.
If you see that they are not being used, that means keeping them is only costing you resources and hurting your performance.

If you have read replicas, always cross-check between all instances since each instance might have different queries targeting them and the above query will produce different statistics for each database.

Nowadays with the proliferation of ORM and automatic migrations, it is very common to see dozens (and even hundreds) of indexes that are 100% useless and are just bloating your database.

Index Types

Like beer, indexes come in multiple forms and flavors

You should know your indexes, although 99% of the time, you’re going to use B-Tree indexes, there’s that 1% that can make a huge difference if used right.

Balanced Tree (B-Tree)

The B-Tree index type uses a balanced tree structure to speed up equality and range queries on columns of all data types. Since B-Tree index entries are sorted, they are sometimes used to retrieve table rows in order, thereby avoiding manually sorting them after retrieval.

This is the default index type and also the most commonly used.

Hash

Can only be used for equality comparisons and should only be used with Pg 10 or newer. I tend to use it for queries that use PostgreSQL more like a Key/Value store.
This index can be smaller and more performant than a B-Tree index for some use cases. Read more about it here.

Generalized Inverted Index (GIN)

Generalized Inverted indexes are great for indexing columns and expressions that contain more than one value. Good examples are array columns, text search documents (tsvector), and binary JSON documents (jsonb).

Generalized Inverted Search Tree (GiST)

GiST stands for Generalized Search Tree and isn’t a single indexing scheme but rather an infrastructure that makes it possible to implement indexing schemes for new data types by providing a balanced tree-structured access method.

Space partitioned GiST (SP-GiST)

Space partitioned GiST is also an infrastructure for implementing new indexing schemes, but instead of providing balanced tree-structures, it provides partitioned search trees. These can be used to implement different types of non-balanced data structures that can yield fast lookups if they are partitioned using a rule that matches the queries they are used for.

Block Range Indexes (BRIN)

Block range indexes are designed to handle very large tables in which the rows’ natural sort order correlates to certain column values. For example, a table storing log entries might have a timestamp column for when each log entry was written. By using a BRIN index on this column, scanning large parts of the table can be avoided when querying rows by their timestamp value with very little overhead.

Partial Indexes

If you know that your query will always ignore columns with defined values, you can make use of a partial index to save space and make it faster.

This image shows a practical example of space savings with a partial index (indexname ends with _filters)

In this query, you’ll be creating an index that only indexes rows if the column is_main equals TRUE.

If you know you’ll have queries that will focus on specific groups of rows, consider using partial indexes to save space and gain performance.

Simple rules/improvements

Even though building queries can be considered a mix between art and science, we have some general rules of thumb

SELECT * …
Select only the columns you’re going to use. This will translate into a smaller package to transport over the network and potentially avoid fetching TOASTed values.

SELECT DISTINCT …
DISTINCT is a really expensive way to avoid row duplicates and usually a sign of bad table design. Consider refactoring your tables/queries to remove the need to use that keyword.

NOT IN
NOT EXISTS will usually outperform “NOT IN” by a good margin.

UNION VS UNION ALL
If you don’t need deduplication, UNION ALL is much cheaper

JOINS VS Nested queries
JOINS are usually faster than nested queries. That’s why the query planner will try its best to transform subqueries into JOINS

CACHING WHEN OPTIMIZING
When optimizing, be careful when doing similar queries multiple times and thinking you improved it just because it’s faster. There’s a good chance you’re getting cached results (from Pg or even the OS).

ANALYZE
Every time you create a new index or do a major version upgrade, run analyze.
If you don’t, that new index might not be used for a while, and in the case of a major upgrade, you’ll have a very bad time with slow queries since all the statistics used by the query planner are discarded.

Make sure you run ANALYZE after major version upgrades since all stats are discarded, resulting in very slow queries for a while if you don’t run it.

CHAR(N)
Use TEXT instead. TEXT will grow dynamically according to your needs and outperform char(n). If you need size constraints, you have better options like using varchar(n) which is basically TEXT with length limit or you can use a CONSTRAINT

SERIAL
It is recommended to use IDENTITY instead since SERIAL has some weird behaviors.

For more in-depth explanations about the previous points and more anti-patterns, refer to the link below to see more: https://wiki.postgresql.org/wiki/Don%27t_Do_This

GOING BEYOND THE BASICS

Here be dragons

PostgreSQL Query Planner

query plan parsed by depesz explain tool

Above you can see an explain plan parsed by this great tool.
Learning to read and understand explain plans will take your PostgreSQL to the next level.
This gitlab page has a really good tutorial:
https://docs.gitlab.com/ee/development/understanding_explain_plans.html

Locks

Before we go into locks, let’s first understand how PostgreSQL keeps data consistent in a multiuser environment, from the docs:

Data consistency is maintained by using a MVCC (Multi-Version Concurrency Control) model. This means that each SQL statement sees a snapshot of data as it was some time ago, regardless of the current state of the underlying data. This prevents statements from viewing inconsistent data produced by concurrent transactions performing updates on the same data rows, providing transaction isolation for each database session. MVCC, by eschewing the locking methodologies of traditional database systems, minimizes lock contention in order to allow for reasonable performance in multiuser environments.

To make the most out of this MVCC model, PostgreSQL has a whole range of different lock types in PostgreSQL to allow a reasonable concurrency while keeping strict consistency constraints.

TYPES OF LOCKS

In the picture above, you can see different types of locks and when they can’t exist at the same time (X marks the conflict).

A practical example of this is when you issue an UPDATE command. It will allow SELECT to happen to the affected rows, but won’t allow other UPDATE commands while the first one is still executing.

Runaway locks can transform your whole production environment into a dumpster fire, block index creation, or block “alter” queries from running.
Have you ever waited hours for a CREATE INDEX to complete? There’s a good chance that for a good part of that time, it was only waiting to acquire a lock.

A nice way to detect those is to run this query:

And it will give you an output similar to this (if you have a lock):

With this, you can decide if you should wait or terminate the query that is keeping that lock.

You have two options if you want to terminate that query as you can see in the gist below:

If you want to know more about locks in PostgreSQL, this is a great post.

PG CONFIG

PostgreSQL offers more config options than you can count, the ones described here are usually the most important ones to tune for your use case.

work_mem

Sets the amount of memory the database server uses for shared memory buffers. In other words, this defines how much memory Postgres can allocate per each query to run them.

on row 6 you can see “external merge Disk”, meaning that the disk was used to execute the sort

The image above has the default 4 MB work_mem limit. In this case, it’s using a complex sort that spills to disk as you can see by the “external merge Disk”.

on row 5 you can see “quicksort Memory”, meaning that the sort was all done in memory

After increasing work_mem, we see that the sort changed to “quicksort Memory” indicating that now it doesn’t need to spill to disk.
This can give you huge performance gains.

maintenance_work_mem
Specifies the maximum amount of memory to be used by maintenance operations, such as VACUUM, CREATE INDEX, and ALTER TABLE ADD FOREIGN KEY. Increasing it might speed up those operations.

effective_cache_size
effective_cache_size should be set to an estimate of how much memory is available for disk caching by the operating system and within the database itself, after taking into account what’s used by the OS itself and other applications.
If you’re using services like AWS RDS, this setting is automatically set for you to a generally sensible value.

shared_buffers
The shared_buffers configuration parameter determines how much memory is dedicated to PostgreSQL to use for caching data.
This is a value that usually needs tuning unless you’re using a managed service.

More info:
https://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server

Bulk insert/update

You have two main approaches and when to use each will depend on your scenario.

COPY + INSERT INTO

When you have external data that you need to import, “copy” is a good option

In the example above, we needed to import a large amount of data from an external source. Using COPY or \copy (depending on if you have the file in the server of the client) is probably the fastest option.

You can also leverage the power of “INSERT INTO SELECT” to merge the data from a COPY with data from other tables, loading the initial data into a temporary table first.

CTE

If you just need to compute information that already exists in your DB, a CTE might do the trick

CTEs allow you to cache expensive queries and use that multiple times. In the example above, I’m “reducing” the totals into constants and then updating a counter table with the results.

Talks you should watch

If you want to learn more or just prefer learning by watching talks, I recommend the following great videos (not mine):

Improving PostgreSQL
https://www.youtube.com/watch?v=yhOkob2PQFQ

Indexes
https://www.youtube.com/watch?v=clrtT_4WBAw

EXPLAIN Explained
https://www.youtube.com/watch?v=mCwwFAl1pBU

Locks
https://www.youtube.com/watch?v=OUHsBuXzdwk

How does this all sound? Is there anything you’d like me to expand on? Let me know your thoughts in the comments section below (and hit the clap if this was useful)!

Stay tuned for the next post. Follow so you won’t miss it!

Principal Engineer @ Farfetch https://www.linkedin.com/in/luis-sena/