Best Practices for Optimizing Postgres Query Performance

Over the last five years, we have learned a lot about how to optimize Postgres performance. In this eBook, we have written down our key learnings on how to get the most out of your database.

Have you ever been asked by your team why your product’s application is running slowly? Most probably you have. But did you ever consider whether your database might actually be at fault?

In our experience: Database Performance = Application Performance.

Fastest Way to Load Data Into PostgreSQL Using Python

From two minutes to less than half a second!

Data written to an unlogged table will not be logged to the write-ahead log (WAL), making it ideal for intermediate tables. Note that an UNLOGGED table will not be restored in case of a crash (its contents are truncated during crash recovery), and it will not be replicated.
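As a sketch of what such a staging helper might look like with psycopg2 (the table and column names here are illustrative, not the article's exact schema):

```python
# Sketch of a helper that (re)creates an UNLOGGED staging table.
# Table and column names are illustrative, not the article's exact schema.
create_staging_table_sql = """
    DROP TABLE IF EXISTS staging_beers;
    CREATE UNLOGGED TABLE staging_beers (
        id    INTEGER,
        name  TEXT,
        abv   NUMERIC
    );
"""

def create_staging_table(cursor) -> None:
    # UNLOGGED skips the write-ahead log: faster loads, but the table
    # is truncated after a crash and is not replicated.
    cursor.execute(create_staging_table_sql)
```

Dropping and recreating the table on every run keeps the load idempotent, which is exactly what you want for an intermediate table.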

Copy Data From a String Iterator with Buffer Size

In an attempt to squeeze out one final drop of performance, we notice that, just like page_size, the copy_from command also accepts a similar argument called size:

size – size of the buffer used to read from the file.

Let’s add a size argument to the function:

from typing import Any, Dict, Iterator

# StringIteratorIO, clean_csv_value, parse_first_brewed, create_staging_table
# and the @profile decorator are defined in the previous sections.
@profile
def copy_string_iterator(connection, beers: Iterator[Dict[str, Any]], size: int = 8192) -> None:
    with connection.cursor() as cursor:
        create_staging_table(cursor)
        beers_string_iterator = StringIteratorIO((
            '|'.join(map(clean_csv_value, (
                beer['id'],
                beer['name'],
                beer['tagline'],
                parse_first_brewed(beer['first_brewed']).isoformat(),
                beer['description'],
                beer['image_url'],
                beer['abv'],
                beer['ibu'],
                beer['target_fg'],
                beer['target_og'],
                beer['ebc'],
                beer['srm'],
                beer['ph'],
                beer['attenuation_level'],
                beer['brewers_tips'],
                beer['contributed_by'],
                beer['volume']['value'],
            ))) + '\n'
            for beer in beers
        ))
        cursor.copy_from(beers_string_iterator, 'beers', sep='|', size=size)

The default value for size is 8192, which is 2 ** 13, so we will keep sizes in powers of 2:

>>> copy_string_iterator(connection, iter(beers), size=1024)
copy_string_iterator(size=1024)
Time   0.4536
Memory 0.0

>>> copy_string_iterator(connection, iter(beers), size=8192)
copy_string_iterator(size=8192)
Time   0.4596
Memory 0.0

>>> copy_string_iterator(connection, iter(beers), size=16384)
copy_string_iterator(size=16384)
Time   0.4649
Memory 0.0

>>> copy_string_iterator(connection, iter(beers), size=65536)
copy_string_iterator(size=65536)
Time   0.6171
Memory 0.0
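To make the effect of size concrete, here is a minimal sketch of a StringIteratorIO-style adapter (the article's actual implementation may differ): copy_from repeatedly calls read(size), and each call pulls just enough lines from the iterator to fill a size-character buffer.

```python
import io
from typing import Iterator

class StringIteratorIO(io.TextIOBase):
    """Minimal file-like wrapper over a string iterator (illustrative)."""

    def __init__(self, it: Iterator[str]):
        self._it = it
        self._buf = ''

    def readable(self) -> bool:
        return True

    def read(self, n: int = -1) -> str:
        # Pull lines from the iterator until the buffer holds n characters
        # (or the iterator is exhausted); n < 0 means "read everything".
        while n < 0 or len(self._buf) < n:
            try:
                self._buf += next(self._it)
            except StopIteration:
                break
        if n < 0:
            chunk, self._buf = self._buf, ''
        else:
            chunk, self._buf = self._buf[:n], self._buf[n:]
        return chunk
```

A larger size therefore means fewer, bigger reads; past a point the extra buffering stops paying off, which matches the 65536 measurement above.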

SEARCHING FOR THE BEST COMPROMISE: “SKYLINE” QUERIES

Fortunately, PostgreSQL allows us to use more sophisticated sort criteria. Sorting by a single column is boring. What we want is to somehow treat different columns differently. In this case, customers might feel that distance is not really linear: whether you are 20 or 50 meters away from the beach does not really matter anymore, but being 50 meters or 1 km away makes a real difference. To keep it simple, I decided to go for the square root of the distance while taking the price as it is. The result looks way more promising than before:

test=# SELECT price * sqrt(distance_beach), *
FROM t_hotel
ORDER BY 1;
   ?column?   | id |          name          | price | distance_beach
--------------+----+------------------------+-------+----------------
   133.491572 |  2 | Crapstone Hotel        |    90 |            2.2
   139.427400 |  4 | Nowhere Middle Hotel   |    45 |            9.6
   185.903200 |  1 | ABC Motel              |   120 |            2.4
 221.37072977 |  3 | Luxury Watch Spa Hotel |   495 |            0.2
(4 rows)

It seems that the Crapstone Hotel is the best bargain here. It is not the cheapest hotel, but it is pretty close to the beach and still reasonably priced, so it is probably the best one to book.
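The ranking can be double-checked with a few lines of plain Python (the hotel data is taken from the result set above):

```python
from math import sqrt

# (id, name, price, distance_beach) rows from t_hotel
hotels = [
    (1, 'ABC Motel',              120, 2.4),
    (2, 'Crapstone Hotel',         90, 2.2),
    (3, 'Luxury Watch Spa Hotel', 495, 0.2),
    (4, 'Nowhere Middle Hotel',    45, 9.6),
]

# Same score as in the query: price * sqrt(distance_beach), lower is better
ranked = sorted(hotels, key=lambda h: h[2] * sqrt(h[3]))
```

The sorted order matches the query output, with the Crapstone Hotel on top.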

The trouble starts when we look at the execution plan of this tiny PostgreSQL query:

test=# explain SELECT price * sqrt(distance_beach), *
FROM t_hotel
ORDER BY 1;
QUERY PLAN
------------------------------------------------------------------
 Sort  (cost=48.74..50.32 rows=630 width=132)
   Sort Key: ((price * sqrt(distance_beach)))
   ->  Seq Scan on t_hotel  (cost=0.00..19.45 rows=630 width=132)
(3 rows)

PostgreSQL will read all the data and sort it by our custom criteria. While this is fine for a small data set, it will kill us if the amount of data keeps growing.

It took almost 19 seconds (on my laptop) to run the query. For sure, most users would not tolerate this kind of behavior too often, so we somehow need to improve the runtime.

The SKYLINE OF operator does not exist in PostgreSQL (nor in any other database engine I am aware of). However, PostgreSQL offers functional indexes, which are ideal in this case:

test=# CREATE FUNCTION rate_hotel(numeric, numeric)
RETURNS numeric AS
$$
SELECT $1 * sqrt($2)
$$
LANGUAGE 'sql' IMMUTABLE;
CREATE FUNCTION

The important thing here is to use an IMMUTABLE function. We must ensure that the function used to rank the data is perfectly deterministic, i.e. its result does not change over time given the same input parameters.
Creating the index is easy:

test=# CREATE INDEX idx_fix_hotel
ON t_hotel (rate_hotel(price, distance_beach));
CREATE INDEX
Time: 22706.882 ms (00:22.707)
test=# ANALYZE ;
ANALYZE
Time: 354.660 ms
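For the planner to consider idx_fix_hotel, a query has to use the exact same expression that was indexed. A psycopg2-style sketch (helper name and connection handling are illustrative):

```python
# The ORDER BY expression must match the indexed expression exactly,
# otherwise the planner cannot use idx_fix_hotel.
query = """
    SELECT rate_hotel(price, distance_beach), *
    FROM t_hotel
    ORDER BY rate_hotel(price, distance_beach)
"""

def best_bargains(connection, limit: int = 3):
    # Illustrative helper; assumes an open psycopg2 connection to the
    # database that holds t_hotel and the rate_hotel function.
    with connection.cursor() as cursor:
        cursor.execute(query + " LIMIT %s", (limit,))
        return cursor.fetchall()
```

Writing `ORDER BY price * sqrt(distance_beach)` instead would bypass the index, because the planner matches the indexed expression textually, not semantically.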