SEARCHING FOR THE BEST COMPROMISE: “SKYLINE” QUERIES

Fortunately PostgreSQL allows us to use more sophisticated sort criteria. Sorting by a single column is boring. What we want is to somehow treat different columns differently. In this case customers might feel that distance is not really linear: whether you are 20 or 50 meters away from the beach does not really matter anymore. However, the difference between 50 meters and 1 km matters a lot. To keep things simple I decided to go for the square root of the distance while still taking the price as it is. The result looks way more promising than before:

test=# SELECT price * sqrt(distance_beach), *
FROM t_hotel
ORDER BY 1;
?column? | id | name | price | distance_beach
--------------+----+------------------------+-------+----------------
133.491572 | 2 | Crapstone Hotel | 90 | 2.2
139.427400 | 4 | Nowhere Middle Hotel | 45 | 9.6
185.903200 | 1 | ABC Motel | 120 | 2.4
221.37072977 | 3 | Luxury Watch Spa Hotel | 495 | 0.2
(4 rows)

It seems that the Crapstone Hotel is the best bargain here. It is not the cheapest hotel, but it is pretty close to the beach and still reasonably priced, so maybe it is best to book that one.

The trouble starts when we look at the execution plan of this tiny PostgreSQL query:

test=# explain SELECT price * sqrt(distance_beach), *
FROM t_hotel
ORDER BY 1;
QUERY PLAN
------------------------------------------------------------------
Sort (cost=48.74..50.32 rows=630 width=132)
Sort Key: ((price * sqrt(distance_beach)))
-> Seq Scan on t_hotel (cost=0.00..19.45 rows=630 width=132)
(3 rows)

PostgreSQL will read all the data and sort it by our custom criterion. While this is fine for a small data set, it will kill us as the amount of data keeps growing.

It took almost 19 seconds (on my laptop) to run the query. For sure: most users would not tolerate this kind of response time on a regular basis, so we somehow need to improve runtime.

The SKYLINE OF operator does not exist in PostgreSQL (nor in any other database engine I am aware of). However, PostgreSQL offers functional indexes, which are ideal in this case:
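A real skyline (the set of Pareto-optimal hotels) can at least be emulated in plain SQL. The following sketch, assuming the t_hotel table from above, keeps every hotel for which no other hotel is at least as cheap and at least as close, and strictly better on one of the two criteria:

```sql
-- Sketch: emulate SKYLINE OF with NOT EXISTS.
-- A hotel survives if no other hotel dominates it on both criteria.
SELECT *
FROM t_hotel AS h
WHERE NOT EXISTS (
    SELECT 1
    FROM t_hotel AS o
    WHERE o.price          <= h.price
      AND o.distance_beach <= h.distance_beach
      AND (o.price < h.price OR o.distance_beach < h.distance_beach)
);
```

For the sample data this would drop the ABC Motel, because the Crapstone Hotel beats it on both price and distance. The downside is that without index support this self-join is roughly O(n²), which is exactly why the weighted ranking combined with a functional index is attractive.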

test=# CREATE FUNCTION rate_hotel(numeric, numeric)
RETURNS numeric AS
$$
SELECT $1 * sqrt($2)
$$
LANGUAGE sql IMMUTABLE;
CREATE FUNCTION

The important thing here is to use an IMMUTABLE function. We have to ensure that the function used to rank the data is perfectly deterministic: given the same input parameters, its result must never change over time.
Creating the index is easy:

test=# CREATE INDEX idx_fix_hotel
ON t_hotel (rate_hotel(price, distance_beach));
CREATE INDEX
Time: 22706.882 ms (00:22.707)
test=# ANALYZE ;
ANALYZE
Time: 354.660 ms
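To actually benefit from the index, a query has to reference the very same expression that was indexed. A sketch (costs and timings will of course differ on your machine):

```sql
-- Order by the indexed expression; with a LIMIT, PostgreSQL can walk
-- idx_fix_hotel instead of reading and sorting the whole table.
SELECT rate_hotel(price, distance_beach), *
FROM t_hotel
ORDER BY rate_hotel(price, distance_beach)
LIMIT 10;
```

EXPLAIN should now show an Index Scan using idx_fix_hotel instead of the Seq Scan + Sort seen before.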


COPY IN POSTGRESQL: MOVING DATA BETWEEN SERVERS

Here is an example:

echo "Lots of data" | ssh user@some.example.com 'cat > /directory/big.txt'

In this case “Lots of data” will be copied over SSH and stored in /directory/big.txt.

The beauty is that we can apply the same technique to PostgreSQL:

test=# COPY (SELECT * FROM pg_available_extensions)
TO PROGRAM 'ssh user@some.example.com ''cat > /tmp/result.txt'' ';
COPY 43

To make this work in real life you have to make sure that SSH keys are in place and ready to use; otherwise the system will prompt for a password, which is of course not desirable at all. Also keep in mind that the SSH command is executed by the operating system user running PostgreSQL (usually “postgres”).
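The same trick can move data straight from one PostgreSQL server to another without an intermediate file. A sketch – host, database and table names are placeholders, and the target table has to exist on the remote side already:

```sql
-- Stream a table into a remote database via SSH + psql on the target host.
COPY t_hotel TO PROGRAM
  'ssh user@some.example.com ''psql -d test -c "COPY t_hotel FROM stdin"''';
```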

IDEAS FOR SCALING POSTGRESQL TO MULTI-TERABYTE AND BEYOND

If you don’t want to sweat too much or do some pioneering, the safest way to scale is of course to stick with proven out-of-the-box features of Postgres. So first I’d recommend taking a look at the following keywords, with some short explanations – maybe that’s all you need.

  • Light-weight / special purpose indexes

For a complex OLTP system supporting hundreds of freaky queries, it is very common that the indexes take up much more disk space than the table files holding the data. To improve on that (especially for indexes that are used infrequently), one can reduce index sizes drastically with appropriate use of partial, BRIN, GIN or even the slightly experimental BLOOM indexes. In total there are 7 different index types supported… but mostly people only know about and use the default B-tree – a big mistake in a multi-TB setting!

Partial indexes allow indexing only a subset of the data. For example, in a sales system we might not be interested in fast access to orders in status “FINISHED” (some nightly report usually deals with those, and it can take its time), so why should we index such rows?
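As a sketch – table and column names are made up for illustration:

```sql
-- Index only rows that are still "interesting"; FINISHED orders are
-- skipped, keeping the index small and hot in cache.
CREATE INDEX idx_orders_open
ON t_orders (order_id)
WHERE status <> 'FINISHED';
```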

GIN, perhaps the best-known non-default index type, has actually been around for ages (full-text search) and, in short, is perfect for indexing columns with lots of repeating values – think all kinds of statuses or good old Mr/Mrs/Miss. GIN stores every unique column value only once, whereas with the default B-tree you’ll have e.g. 1 million leaf entries containing the integer “1”.
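To use GIN on a plain scalar column such as a status flag, the btree_gin extension (shipped with PostgreSQL) provides the necessary operator classes. A sketch, again with a hypothetical t_orders table:

```sql
CREATE EXTENSION IF NOT EXISTS btree_gin;

-- One key per distinct status value instead of one leaf entry per row
CREATE INDEX idx_orders_status_gin
ON t_orders USING gin (status);
```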

BRIN (block-range a.k.a. min-max index), on the other hand, is something newer and very different – it’s a lossy index type with a very small disk footprint where not all column values are indexed, but only the biggest and smallest values for a range of rows (a 1 MB section of a table by default). This still works very well on ordered values and is for example perfect for time-series data or other “log” types of tables.
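A BRIN index is created like any other; the sketch below assumes an append-only log table where rows arrive roughly in timestamp order:

```sql
-- Tiny on disk: stores only min/max per block range, not one entry per row
CREATE INDEX idx_log_time_brin
ON t_log USING brin (created_at);
```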

BLOOM might be an exotic one, but if you manage to find a good use case (“bitmap/matrix search”) for it, it can be up to 20x more efficient than traditional indexing – see here for an example use case if it seems too abstract.
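BLOOM ships as a contrib extension. A sketch for equality searches across many columns in arbitrary combinations (table and column names invented):

```sql
CREATE EXTENSION IF NOT EXISTS bloom;

-- One small signature index covers equality lookups
-- on any subset of the listed columns
CREATE INDEX idx_props_bloom
ON t_properties USING bloom (c1, c2, c3, c4);
```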

.. advantages of partitioning are: it’s possible to cleanly separate “cold data” and “hot data” – and this gives us some nice options, like compacting old data maximally with VACUUM FULL or placing it on different storage media
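Declarative partitioning (available since PostgreSQL 10) makes the hot/cold split straightforward; the names below are illustrative:

```sql
CREATE TABLE t_sales (
    sale_id   bigint,
    sale_time timestamptz NOT NULL,
    amount    numeric
) PARTITION BY RANGE (sale_time);

-- Old partitions can be compacted or moved to cheap storage independently
CREATE TABLE t_sales_2017 PARTITION OF t_sales
    FOR VALUES FROM ('2017-01-01') TO ('2018-01-01');
```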

  • Tablespaces

As mentioned above, it is possible to move tables / indexes selectively to different disk media with the help of tablespaces. This can serve different goals: saving money by putting “cold” data on slower/affordable disk partitions while keeping only the most recent/important data on fast/expensive media, using special compressed file systems for data with a lot of repetition, or even using network shares or in-memory file systems on remote nodes for massive non-persistent data – there are quite some options. Management of tablespaces is also quite straightforward; only transferring existing tables / indexes during live operation can be problematic, because it requires a full lock.
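A minimal sketch – the mount point is an example, and the directory must exist and be owned by the postgres OS user:

```sql
CREATE TABLESPACE slow_disk LOCATION '/mnt/slow_disk/pgdata';

-- Moving an existing table rewrites it and takes a full table lock
ALTER TABLE t_archive SET TABLESPACE slow_disk;
```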

.. What I call hybrid tables are actually based on Postgres’ excellent implementation of the SQL/MED standard, also known as Foreign Data Wrappers. They basically look like normal Postgres tables for read queries, but the data might reside in, or be piped over from, literally anywhere – it might be coming from Twitter, LDAP or Amazon S3; see here for the full list of crazy data sources supported. In practice the most common application of Foreign Data Wrappers (FDWs) is probably making normal (correctly formatted) files look like tables, for example exposing the server log as a table to make monitoring easier.

.. Where’s the scaling part, you may ask? The FDW approach works very well in the sense that it allows reducing the amount of data by using some clever file formats, or just compression, which typically shrinks the data 10-20x – so that the data fits on the node! This works very well for “cold” data, leaving more disk space/cache available for real tables with “hot” data. Since Postgres 10 it is also very easy to implement – sample code here.
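The simplest FDW to try is file_fdw, which ships with PostgreSQL; the file path and columns below are invented for illustration:

```sql
CREATE EXTENSION IF NOT EXISTS file_fdw;
CREATE SERVER files FOREIGN DATA WRAPPER file_fdw;

-- A read-only "table" backed by a CSV file on disk
CREATE FOREIGN TABLE archived_orders (
    order_id bigint,
    status   text
) SERVER files
OPTIONS (filename '/archive/orders_2017.csv', format 'csv');
```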

Another very promising use case is the columnar data storage format (ORC) – take a look at the cstore_fdw extension project for more info. It is especially suited for helping to scale large data warehouses, with tables being up to 10x smaller and queries up to 100% faster.