Tuning your PostgreSQL for High Performance

How you can make the most out of this powerful database

Table of Contents

  • General principles
  • Indexes
  • Index Types
  • Simple rules/improvements
  • PostgreSQL Query Planner
  • Locks
  • PG Config
  • Bulk insert/update

Assumptions

  • You’ve already used PostgreSQL in the past
  • Basic knowledge of what indexes and constraints are

Common issues you might have faced in the past

  • A query is slow at times, but only affects certain users
  • High memory usage
  • High query latency, even for simple queries
  • The database is not responding
  • My server code is not able to connect but I’m able to connect with my root account

General principles

Avoid hard-coding how your application reaches the database; point it at a DNS name or a load balancer instead. This will allow you to hot-swap servers with less chance of downtime, decommission nodes, do simple round-robin, and geo-distribute traffic, to name a few examples.

When your database is smaller than 10 GB

If your database is smaller than 10 GB, it will be fast with the default settings unless you really start to go crazy with your queries, and that's usually a good sign that your table layout needs to be improved.

Read Replicas. Why and How

Postgres is an amazing database, but it has one “problem”: the vanilla version can only have a single write node.
That means writes can only scale vertically, and there is a limit to how far you can scale a single machine.

A simple way to overcome this is to add read replicas and use your write instance only for writes and for latency-sensitive queries that can't tolerate replication lag.
This will also help you with lock/deadlock issues, since you will be splitting your reads and writes (be careful if you rely on those locks to protect reads from inconsistent data, because reads served by a replica won't wait on them).

Use Read Replicas to increase your throughput and resilience

PGBouncer + Connection Pooling

PostgreSQL was built to handle a small number of long-lasting connections.
That’s why it has a relatively small default limit of connections and it’s not uncommon for people to get the error “connection limit exceeded for non-superusers”.
One might be tempted to just increase that limit, but there are better ways.
Postgres will usually complete the same 10,000 transactions faster by doing them 5, 10, or 20 at a time than by doing them 500 at a time. Determining exactly how many should be done at once varies by workload and requires tuning.

https://wiki.postgresql.org/wiki/Number_Of_Database_Connections
https://www.cybertec-postgresql.com/en/tuning-max_connections-in-postgresql/
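
If you want to see how close you are to that limit, you can check it from a client session; a minimal sketch:

    -- Configured connection limit
    SHOW max_connections;

    -- Connections currently open, grouped by state
    SELECT state, count(*)
    FROM pg_stat_activity
    GROUP BY state;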

Another thing to take into consideration is that creating and destroying a connection can sometimes account for as much as half of the total time and resources spent serving a query.
Because of that, it’s always recommended to have persistent connections.
This, in turn, will just exacerbate the previous limit issue.

So how can we solve this?

The answer is using a connection pool. Sometimes the framework you’re using offers that (like SQLAlchemy), other times you might need a specialized tool.
But if you’re using multiple machines to serve your API (as you probably are), you will run into the previous problem again, since each framework pooler only manages connections for the instance it’s running on.

When you have multiple servers, a good solution is using PGBouncer.

This amazing tool will allow you to connect multiple servers to it instead of connecting directly to Postgres.

It will queue incoming connections whenever the number of concurrent connections exceeds the limit you specify.
I’ll leave PGBouncer for another post.

PostgreSQL performance will increase dramatically if you keep a small number of long-lasting connections instead of constantly creating new ones.

pgBadger

pgBadger is a PostgreSQL log analyzer built for speed that is able to generate full reports from your PostgreSQL log file and show them on a web interface. It’s a single and small Perl script that outperforms any other PostgreSQL log analyzer.
I always recommend using it if your Postgres is critical to your systems.

Keep accurate logs and metrics of your servers so you can make informed decisions

Your transactional DB is not for logging

Logging using your database can lead to massive bloat and it will create a bottleneck on various fronts.
Use S3, CloudWatch, Papertrail, ELK, logz.io, etc.
Not only will it be faster, cheaper, and allow better visibility, it will prevent your logs from affecting your transactional queries.

OLTP vs OLAP

Pg is usually used in OLTP scenarios with a bit of OLAP.
That is fine as long as you understand the difference between them.
For example, if a public-facing endpoint contains a very complex query that loads a lot of data and does complex operations, it’s probably a good indicator that it should be treated as OLAP and that endpoint refactored to adopt a different strategy.

Keep your queries lean and use data pipelines to feed a data lake that can be used for analytics

Indexes

There ain’t no such thing as a free lunch (TANSTAAFL)

The main thing to have in mind about indexes is that they are not a silver bullet that will solve all your problems without a gotcha.

Space usage
They have to be stored somewhere, right?

Overhead for creating, updating, and deleting rows
They have to be maintained as your data changes

You can list your indexes and see how much space each one occupies (and how often each one is used) by querying the system statistics views.
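
A minimal sketch of such a query, using the standard pg_stat_user_indexes view:

    -- List every user index with its size and how many times it was scanned
    SELECT
        schemaname,
        relname      AS table_name,
        indexrelname AS index_name,
        pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
        idx_scan     AS times_used
    FROM pg_stat_user_indexes
    ORDER BY pg_relation_size(indexrelid) DESC;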

You can find more resources on index maintenance here: https://wiki.postgresql.org/wiki/Index_Maintenance

Using these tools and queries, you can (and should) keep an eye on index usage.
If you see that they are not being used, that means keeping them is only costing you resources and hurting your performance.

If you have read replicas, always cross-check all instances: different queries target each instance, so the usage statistics will differ between them.

Nowadays with the proliferation of ORM and automatic migrations, it is very common to see dozens (and even hundreds) of indexes that are 100% useless and are just bloating your database.

Index Types

Like beer, indexes come in multiple forms and flavors

You should know your index types. Although 99% of the time you're going to use B-Tree indexes, there's that 1% that can make a huge difference if used right.

Balanced Tree (B-Tree)

This is the default index type and also the most commonly used.

Hash

Only useful for simple equality comparisons (=). Since PostgreSQL 10 they are WAL-logged and crash-safe, but B-Tree covers most of the same use cases.

Generalized Inverted Index (GIN)

Designed for values that contain multiple component items, such as arrays, JSONB documents, and full-text search vectors.

Generalized Search Tree (GiST)

A general framework for building balanced tree indexes over arbitrary data types; commonly used for geometric data, range types, and full-text search.

Space partitioned GiST (SP-GiST)

Supports non-balanced, space-partitioned structures such as quadtrees, k-d trees, and radix tries.

Block Range Indexes (BRIN)

Stores small summaries (such as min/max) for ranges of physically adjacent blocks. BRIN indexes are tiny and work very well on huge tables where the column value correlates with the physical row order, like append-only timestamp columns.

Partial Indexes

Partial indexes only store the rows that match a predicate, which can translate into large space savings. For example, you can create an index that only covers rows where the column is_main equals TRUE.
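
A minimal sketch, assuming a hypothetical products table with a boolean is_main column:

    -- Only rows where is_main is TRUE end up in the index
    CREATE INDEX products_main_idx
        ON products (category_id)
        WHERE is_main = TRUE;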

If you know you’ll have queries that will focus on specific groups of rows, consider using partial indexes to save space and gain performance.

Simple rules/improvements

SELECT * …
Select only the columns you're going to use. This translates into a smaller payload to transport over the network and can potentially avoid fetching TOASTed values.

SELECT DISTINCT …
DISTINCT is a really expensive way to avoid row duplicates and usually a sign of bad table design. Consider refactoring your tables/queries to remove the need to use that keyword.

NOT IN
NOT EXISTS will usually outperform “NOT IN” by a good margin.
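
A small sketch with hypothetical customers and orders tables (NOT EXISTS also sidesteps the surprising way NOT IN behaves when the subquery returns NULLs):

    -- Instead of:
    SELECT * FROM customers
    WHERE id NOT IN (SELECT customer_id FROM orders);

    -- Prefer:
    SELECT * FROM customers c
    WHERE NOT EXISTS (
        SELECT 1 FROM orders o WHERE o.customer_id = c.id
    );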

UNION VS UNION ALL
If you don’t need deduplication, UNION ALL is much cheaper

JOINS VS Nested queries
JOINS are usually faster than nested queries. That’s why the query planner will try its best to transform subqueries into JOINS

CACHING WHEN OPTIMIZING
When optimizing, be careful when doing similar queries multiple times and thinking you improved it just because it’s faster. There’s a good chance you’re getting cached results (from Pg or even the OS).

ANALYZE
Every time you create a new index or do a major version upgrade, run analyze.
If you don’t, that new index might not be used for a while, and in the case of a major upgrade, you’ll have a very bad time with slow queries since all the statistics used by the query planner are discarded.

Make sure you run ANALYZE after major version upgrades since all stats are discarded, resulting in very slow queries for a while if you don’t run it.

CHAR(N)
Use TEXT instead. TEXT will grow dynamically according to your needs and outperform char(n). If you need size constraints, you have better options like using varchar(n) which is basically TEXT with length limit or you can use a CONSTRAINT
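
A small sketch of both options, using a hypothetical users table:

    CREATE TABLE users (
        -- varchar(n) is basically TEXT with a length limit
        username varchar(50),
        -- TEXT plus a CHECK constraint is easier to change later
        bio text CONSTRAINT bio_length CHECK (char_length(bio) <= 500)
    );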

SERIAL
It is recommended to use IDENTITY instead since SERIAL has some weird behaviors.
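
A small sketch of an IDENTITY column on a hypothetical orders table:

    CREATE TABLE orders (
        -- instead of: id serial PRIMARY KEY
        id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        total numeric NOT NULL
    );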

For more in-depth explanations of the previous points and other anti-patterns, see: https://wiki.postgresql.org/wiki/Don%27t_Do_This

GOING BEYOND THE BASICS

Here be dragons

PostgreSQL Query Planner

A great way to study a query plan is to paste the output of EXPLAIN into the depesz explain tool: https://explain.depesz.com/
Learning to read and understand explain plans will take your PostgreSQL to the next level.
This gitlab page has a really good tutorial:
https://docs.gitlab.com/ee/development/understanding_explain_plans.html
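
As a starting point, EXPLAIN with the ANALYZE and BUFFERS options runs the query and reports actual timings and buffer usage (the table and filter below are hypothetical):

    -- ANALYZE executes the query for real; BUFFERS shows cache hits vs. disk reads
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT *
    FROM orders
    WHERE customer_id = 42;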

Locks

Before we go into locks, let’s first understand how PostgreSQL keeps data consistent in a multiuser environment, from the docs:

Data consistency is maintained by using a MVCC (Multi-Version Concurrency Control) model. This means that each SQL statement sees a snapshot of data as it was some time ago, regardless of the current state of the underlying data. This prevents statements from viewing inconsistent data produced by concurrent transactions performing updates on the same data rows, providing transaction isolation for each database session. MVCC, by eschewing the locking methodologies of traditional database systems, minimizes lock contention in order to allow for reasonable performance in multiuser environments.

To make the most of this MVCC model, PostgreSQL offers a whole range of lock types that allow reasonable concurrency while keeping strict consistency constraints.

TYPES OF LOCKS

PostgreSQL has several table-level lock modes, and the documentation includes a conflict matrix showing which modes cannot be held at the same time (an X marks a conflict): https://www.postgresql.org/docs/current/explicit-locking.html

A practical example of this is when you issue an UPDATE command. It will allow SELECT to happen to the affected rows, but won’t allow other UPDATE commands while the first one is still executing.
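
A minimal sketch of that behavior, using a hypothetical accounts table and two separate sessions:

    -- Session 1: update a row inside an open transaction (the row stays locked)
    BEGIN;
    UPDATE accounts SET balance = balance - 100 WHERE id = 1;

    -- Session 2: reading the same row is not blocked
    SELECT balance FROM accounts WHERE id = 1;

    -- Session 2: updating the same row blocks until session 1 commits or rolls back
    UPDATE accounts SET balance = balance + 100 WHERE id = 1;

    -- Session 1: releasing the lock unblocks session 2
    COMMIT;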

Runaway locks can transform your whole production environment into a dumpster fire, block index creation, or block “alter” queries from running.
Have you ever waited hours for a CREATE INDEX to complete? There’s a good chance that for a good part of that time, it was only waiting to acquire a lock.

A nice way to detect those is to run a query like the one below:
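
A minimal sketch, built on pg_blocking_pids() and pg_stat_activity (both available since PostgreSQL 9.6):

    -- For every blocked session, show who is blocking it and what both are running
    SELECT
        blocked.pid    AS blocked_pid,
        blocked.query  AS blocked_query,
        blocking.pid   AS blocking_pid,
        blocking.query AS blocking_query
    FROM pg_stat_activity AS blocked
    JOIN pg_stat_activity AS blocking
      ON blocking.pid = ANY (pg_blocking_pids(blocked.pid));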

It will show you, for each blocked session, its PID and query together with the PID and query of the session that is blocking it.

With this, you can decide if you should wait or terminate the query that is keeping that lock.

You have two options if you want to terminate that query:
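
A sketch of both options (replace 12345 with the blocking PID returned by the query above):

    -- Option 1: cancel the statement the blocking backend is running
    SELECT pg_cancel_backend(12345);

    -- Option 2: terminate the blocking connection entirely
    SELECT pg_terminate_backend(12345);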

If you want to know more about locks in PostgreSQL, this is a great post.

PG CONFIG

work_mem

Sets the maximum amount of memory a query operation (such as a sort or hash table) can use before PostgreSQL starts writing data to temporary disk files. In other words, this defines how much memory Postgres can allocate per operation within each query.

With the default work_mem of 4 MB, a complex sort shows “Sort Method: external merge Disk” in the plan, meaning the disk was used to execute the sort.

After increasing work_mem, the same sort shows “Sort Method: quicksort Memory”, indicating it now fits entirely in memory and no longer spills to disk.
This can give you huge performance gains.
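
A sketch of how to reproduce this comparison (the events table and the values in the comments are purely illustrative):

    -- Check the current value (default is 4MB)
    SHOW work_mem;

    -- With a low work_mem, a big sort spills to disk:
    EXPLAIN ANALYZE SELECT * FROM events ORDER BY created_at;
    -- ... Sort Method: external merge  Disk: 48112kB ...

    -- Raise work_mem for the current session only and compare
    SET work_mem = '64MB';
    EXPLAIN ANALYZE SELECT * FROM events ORDER BY created_at;
    -- ... Sort Method: quicksort  Memory: 55234kB ...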

maintenance_work_mem
Specifies the maximum amount of memory to be used by maintenance operations, such as VACUUM, CREATE INDEX, and ALTER TABLE ADD FOREIGN KEY. Increasing it might speed up those operations.

effective_cache_size
effective_cache_size should be set to an estimate of how much memory is available for disk caching by the operating system and within the database itself, after taking into account what’s used by the OS itself and other applications.
If you’re using services like AWS RDS, this setting is automatically set for you to a generally sensible value.

shared_buffers
The shared_buffers configuration parameter determines how much memory is dedicated to PostgreSQL to use for caching data.
This is a value that usually needs tuning unless you’re using a managed service.
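
On a self-managed server, these settings can be inspected and changed from SQL (managed services usually expose them through parameter groups instead); a minimal sketch:

    -- Inspect current values
    SHOW work_mem;
    SHOW shared_buffers;

    -- Persist new values (written to postgresql.auto.conf)
    ALTER SYSTEM SET work_mem = '32MB';
    ALTER SYSTEM SET effective_cache_size = '12GB';
    SELECT pg_reload_conf();  -- applies reloadable settings

    -- shared_buffers only takes effect after a full server restart
    ALTER SYSTEM SET shared_buffers = '4GB';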

More info:
https://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server

Bulk insert/update

COPY + INSERT INTO

When you have external data that you need to import, “copy” is a good option

When you need to import a large amount of data from an external source, using COPY or \copy (depending on whether the file is on the server or on the client) is probably the fastest option.

You can also leverage the power of “INSERT INTO SELECT” to merge the data from a COPY with data from other tables, loading the initial data into a temporary table first.
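
A minimal sketch of that flow, with hypothetical staging_orders, orders, and customers tables and a CSV file on the client side:

    -- Temporary staging table matching the CSV layout
    CREATE TEMP TABLE staging_orders (
        customer_email text,
        total          numeric
    );

    -- \copy streams the file from the client; use COPY if the file is on the server
    \copy staging_orders FROM 'orders.csv' WITH (FORMAT csv, HEADER true)

    -- Merge the staged rows with data from other tables
    INSERT INTO orders (customer_id, total)
    SELECT c.id, s.total
    FROM staging_orders s
    JOIN customers c ON c.email = s.customer_email;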

CTE

If you just need to compute information that already exists in your DB, a CTE might do the trick

CTEs allow you to compute an expensive result once and reuse it multiple times within the same statement. For example, you can “reduce” totals into a single row and then update a counter table with the results.
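
A minimal sketch of that pattern, with hypothetical orders and daily_counters tables:

    -- Compute today's totals once, then update the counter table with the result
    WITH totals AS (
        SELECT
            count(*)                AS order_count,
            coalesce(sum(total), 0) AS revenue
        FROM orders
        WHERE created_at >= date_trunc('day', now())
    )
    UPDATE daily_counters dc
    SET order_count = t.order_count,
        revenue     = t.revenue
    FROM totals t
    WHERE dc.day = current_date;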

Talks you should watch

Improving PostgreSQL
https://www.youtube.com/watch?v=yhOkob2PQFQ

Indexes
https://www.youtube.com/watch?v=clrtT_4WBAw

EXPLAIN Explained
https://www.youtube.com/watch?v=mCwwFAl1pBU

Locks
https://www.youtube.com/watch?v=OUHsBuXzdwk

How does this all sound? Is there anything you’d like me to expand on? Let me know your thoughts in the comments section below (and hit the clap if this was useful)!

Stay tuned for the next post. Follow so you won’t miss it!

Principal Engineer @ Farfetch https://www.linkedin.com/in/luis-sena/