SQL: Using Filter to turn EAV (entity/attribute/value) rows into rows of entities

#302 — APRIL 24, 2019 READ ON THE WEB
👋 This issue’s Tip of the Week looks at SQL’s <span style="color: #5a5a5a;">FILTER</span> clause. Scroll to the end of this issue to check it out.. 👀
Postgres Weekly
.. snip ..
supported by Hasura

💡 Tip of the Week

Using <span style="color: #5a5a5a;">FILTER</span> to turn entity/value tables into rows of entities

An entire article could be written on this topic, but I wanted to show off the most basic use of SQL:2003’s <span style="color: #5a5a5a;">FILTER</span> clause that was added to Postgres 9.4.

Let’s say you have a table called <span style="color: #5a5a5a;">props</span> that represents entities, attributes, and values with an integer <span style="color: #5a5a5a;">id</span> column, textual attribute and value columns, and the following contents:

The FILTER clause essentially adds an extra <span style="color: #5a5a5a;">WHERE</span> clause to aggregate functions (such as <span style="color: #5a5a5a;">MIN</span><span style="color: #5a5a5a;">MAX</span> and <span style="color: #5a5a5a;">SUM</span>) allowing you to scope them.

This comes in very handy for pulling out values from our <span style="color: #5a5a5a;">val</span> column based upon the value of the attr column, therefore allowing us to turn a table of entities, attributes and values into a more classical set of columns.

SELECT id,
MAX(val) FILTER(WHERE attr=’name’) AS name,
MAX(val) FILTER(WHERE attr=’age’) AS age,
MAX(val) FILTER(where attr=’city’) AS city
FROM props GROUP BY id;
You might need to reproduce this table and play with the query to get the feel of what’s going on, but essentially we are selecting each row grouped by the ID (i.e. the ID of the underlying entity) and then picking the value associated with each ID that matches certain attribute names, allowing us, in this case, to extract the <span style="color: #5a5a5a;">name</span><span style="color: #5a5a5a;">age</span>, and <span style="color: #5a5a5a;">city</span> values. <span style="color: #5a5a5a;">MAX</span> works fine as an aggregate function here as there is only one attr/val pair per entity.

The <span style="color: #5a5a5a;">FILTER</span> clause has a lot more uses than this, but I felt this was both a pretty neat and perhaps unexpected example of its use.

You can learn more in this article and in the official Postgres documentation.

This week’s tip is sponsored by Hasura, creators of the high-performance GraphQL engine on new and existing Postgres databases. Check them out on GitHub.

Be careful with CTE in PostgreSQL

>A lesser known fact about CTE in PostgreSQL is that the database will evaluate the query inside the CTE and store the results.

PostgreSQL materialized the CTE, meaning, it created a temporary structure with the results of the query defined in the CTE, and only then applied the filter to it. Because the predicate was not applied on the table (but the CTE) PostgreSQL was unable to utilize the index on the ID column.

Unlike PostgreSQL, Oracle is not materializing CTEs by default and the two queries generate the same execution plan.

SEARCHING FOR THE BEST COMPROMISE: “SKYLINE” QUERIES

Fortunately PostgreSQL allows us to use more sophisticated sort criteria. Sorting by a single column is boring. What we want is to somehow treat different columns differently. In this case customers might feel that distance is not really linear. If you are 20 or 50 meters away from the beach does not really matter anymore. However, being 50 meters or 1 km away really matters already. To make it easy I decided to go for the square root of the distance while still taking the price as it is. The result looks ways more promising as before:

1
2
3
4
5
6
7
8
9
10
test=# SELECT price * sqrt(distance_beach), *
FROM t_hotel
ORDER BY 1;
?column? | id | name | price | distance_beach
--------------+----+------------------------+-------+----------------
133.491572 | 2 | Crapstone Hotel | 90 | 2.2
139.427400 | 4 | Nowhere Middle Hotel | 45 | 9.6
185.903200 | 1 | ABC Motel | 120 | 2.4
221.37072977 | 3 | Luxury Watch Spa Hotel | 495 | 0.2
(4 rows)

It seems that the Crapstone hotel is the best bargain here. It is not the cheapest hotel but it is pretty close and still reasonably priced so maybe it is best to book that one.

The trouble starts when we look at the execution plan of this tiny PostgreSQL query:

1
2
3
4
5
6
7
8
9
test=# explain SELECT price * sqrt(distance_beach), *
FROM t_hotel
ORDER BY 1;
QUERY PLAN
------------------------------------------------------------------
Sort (cost=48.74..50.32 rows=630 width=132)
Sort Key: ((price * sqrt(distance_beach)))
-&gt; Seq Scan on t_hotel (cost=0.00..19.45 rows=630 width=132)
(3 rows)

PostgreSQL will read all the data and sort by our custom criterial. While this is nice for a small data set, it will kill us if the amount of data keeps growing

It took almost 19 seconds (my laptop) to run the query. For sure: Most users would not tolerate this kind of behavior for too often, so we somehow need to improve runtime.

The SKYLINE OF operator does not exist in PostgreSQL (as in any other database engine I am aware of). However: PostgreSQL offers functional indexes, which are ideal in this case:

1
2
3
4
5
6
7
test=# CREATE FUNCTION rate_hotel(numeric, numeric)
RETURNS numeric AS
$$
SELECT $1 * sqrt($2)
$$
LANGUAGE 'sql' IMMUTABLE;
CREATE FUNCTION

The important thing here is to use an IMMUTABLE function. We must assure that the function used to rank the data is perfectly deterministic and its result does not change over time given the same input parameters.
Creating the index is easy:

1
2
3
4
5
6
7
test=# CREATE INDEX idx_fix_hotel
ON t_hotel (rate_hotel(price, distance_beach));
CREATE INDEX
Time: 22706.882 ms (00:22.707)
test=# ANALYZE ;
ANALYZE
Time: 354.660 ms