Five ways to paginate in Postgres, from the basic to the exotic

However, PostgreSQL maintains per-column histograms of value distribution as part of the planner statistics that ANALYZE gathers. We can use these estimates in conjunction with limits and small offsets to get fast random-access pagination through a hybrid approach.

First let’s look at the statistics of our medley:

SELECT array_length(histogram_bounds, 1) - 1
  FROM pg_stats
 WHERE tablename = 'medley'
   AND attname = 'n';

In my database the column n has 101 bound-markers, i.e. 100 ranges between bound-markers. The particular values aren’t too surprising because my data is uniformly distributed:

{719,103188,193973,288794, … ,9690475,9791775,9905770,9999847}

Notice that the values are approximate. The first number is not exactly zero, and the last is not exactly ten million. The ranges divide our information into a block size B = 10,000,000 / 100 = 100,000 rows.

We can use the histogram ranges exposed in pg_stats to obtain probabilistically correct pages. If we choose a client-side page width of W, how do we request the ith page? It will reside in block iW / B (integer division), at offset iW % B.
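The block and offset arithmetic can be sketched in a few lines of Python. This is only an illustration: the function name `locate_page` and its parameters are hypothetical, pages are assumed to be numbered from zero, and the block size is assumed to be the row count divided evenly by the number of histogram ranges.

```python
def locate_page(page, page_width, total_rows, num_buckets):
    """Return (block, offset) for a zero-based page number.

    block  -- zero-based index of the histogram range holding the page
    offset -- rows to skip inside that range before the page starts
    """
    if total_rows <= 0 or num_buckets <= 0:
        raise ValueError("total_rows and num_buckets must be positive")
    block_size = total_rows // num_buckets   # B, rows per histogram range
    first_row = page * page_width            # iW, rows preceding the page
    block, offset = divmod(first_row, block_size)
    if block >= num_buckets:
        raise ValueError("page lies beyond the last histogram range")
    return block, offset


block, offset = locate_page(page=270_000, page_width=20,
                            total_rows=10_000_000, num_buckets=100)
print(block, offset)  # 54 0
```

Because the histogram bounds are estimates, the block index only says which range to scan. The rows returned are approximately, not exactly, the requested page.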

Choosing W=20, let’s request page 270,000 from the medley table. Note that PostgreSQL arrays are one-based, so we have to adjust the values in the array lookups:

WITH bookmark AS (
    SELECT (histogram_bounds::text::int[])[((270000 * 20) / 100000)+1] AS start,
           (histogram_bounds::text::int[])[((270000 * 20) / 100000)+2] AS stop
    FROM pg_stats
    WHERE tablename = 'medley'
    AND attname = 'n'
    LIMIT 1
  )
SELECT *
FROM medley
WHERE n >= (select start from bookmark)
AND n < (select stop from bookmark)
ORDER BY n ASC
LIMIT 20
OFFSET ((270000 * 20) % 100000);

This performs blazingly fast (notice the offset happens to be zero here). It gives back rows with n = 5407259 through 5407278. The true values on page 270000 are n = 5400001 through 5400020. The values are off by 7258, or about 0.1%.

Implementing faceted search with Django and PostgreSQL

I’ve added a faceted search engine to this blog, powered by PostgreSQL. It supports regular text search (proper search, not just SQL “like” queries), filter by tag, filter by date, filter by content type (entries vs blogmarks vs quotations) and any combination of the above.

It also provides facet counts, so you can tell how many results you will get back before you apply one of these filters – and get a general feeling for the shape of the corpus as you navigate it.
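To make the idea of facet counts concrete, here is a minimal in-memory sketch in Python. It is not how the blog computes them: the `results` list, its field names, and the `facet_counts` helper are all hypothetical, and a real implementation would more likely push the counting into the database.

```python
from collections import Counter

# Hypothetical search results; each dict stands in for one matching item.
results = [
    {"type": "entry", "year": 2017, "tags": ["postgresql", "django"]},
    {"type": "blogmark", "year": 2017, "tags": ["postgresql"]},
    {"type": "quotation", "year": 2016, "tags": ["search"]},
    {"type": "entry", "year": 2016, "tags": ["django", "search"]},
]


def facet_counts(items):
    """Count how many items fall under each value of each facet."""
    return {
        "type": Counter(item["type"] for item in items),
        "year": Counter(item["year"] for item in items),
        "tag": Counter(tag for item in items for tag in item["tags"]),
    }


counts = facet_counts(results)
print(counts["type"]["entry"])      # 2
print(counts["tag"]["postgresql"])  # 2
```

Each count tells the reader how many results would remain if that filter were applied on top of the current search.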

Postgres Backups: Logical vs. Physical an Overview

Logical vs. Physical which to choose

Both are useful and provide different benefits. At smaller scale, say under 100 GB of data, logical backups via pg_dump are something you should absolutely be doing. Because backups happen quickly on smaller databases, you may be able to get by without functionality like point-in-time recovery. At larger scale, as you approach 1 TB, physical backups start to become your only option. Because of the load introduced by logical backups and the time lapse between capturing them, they become less suitable for production.
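As a rough sketch of what a scripted logical backup might look like, the Python below builds a pg_dump command line and a matching pg_restore command for loading the archive into a scratch database. The helper names, database names, and file path are hypothetical, and connection settings (host, user, password) are assumed to come from the environment.

```python
import subprocess


def pg_dump_command(dbname, outfile):
    """Build a pg_dump call that writes a custom-format archive."""
    return [
        "pg_dump",
        "--format=custom",       # archive format that pg_restore can read
        f"--file={outfile}",
        f"--dbname={dbname}",
    ]


def pg_restore_command(scratch_dbname, archive):
    """Build a pg_restore call that loads the archive into a scratch database."""
    return ["pg_restore", f"--dbname={scratch_dbname}", archive]


def run(cmd):
    """Run a command, raising CalledProcessError on a non-zero exit."""
    subprocess.run(cmd, check=True)


# Hypothetical names; nothing is executed here, the commands are only printed.
print(" ".join(pg_dump_command("blog", "/tmp/blog.dump")))
print(" ".join(pg_restore_command("blog_restore_test", "/tmp/blog.dump")))
```

Restoring into a scratch database on a schedule is one simple way to check that the dumps are actually usable.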

Hopefully this primer provides a high-level overview of the two primary types of backups available for Postgres. Of course you can go much deeper on each, but make sure you have at least one of the two, if not both, in place. Oh, and make sure to test them: an untested backup isn’t a backup at all.

PostgreSQL, Aggregates and Histograms

Let’s have a look at our dataset of NBA games and statistics, and get back to counting rebounds in the drb field. A preliminary query informs us that we have stats ranging from 10 to 54 rebounds per team in a single game, which is useful information for the following query:

  select width_bucket(drb, 10, 54, 9) as buckets,
         count(*)
    from team_stats
group by buckets
order by buckets;

We asked for 9 buckets between 10 and 54, but the upper bound is exclusive, so the single game with 54 rebounds falls into an overflow bucket numbered 10:

 width_bucket | count 
--------------+-------
            1 |    52
            2 |  1363
            3 |  8832
            4 | 20917
            5 | 20681
            6 |  9166
            7 |  2093
            8 |   247
            9 |    20
           10 |     1
(10 rows)
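The bucket numbering can be reproduced with a small Python sketch of width_bucket. This is a simplified reimplementation for illustration, assuming integer inputs and low < high. It is not PostgreSQL's own code.

```python
def width_bucket(operand, low, high, count):
    """Mimic width_bucket for integer inputs with low < high.

    Values below low map to bucket 0, values at or above high map to
    bucket count + 1, and everything else lands in buckets 1..count.
    """
    if count <= 0:
        raise ValueError("count must be positive")
    if low >= high:
        raise ValueError("this sketch only handles low < high")
    if operand < low:
        return 0
    if operand >= high:
        return count + 1
    # Integer arithmetic: equal-width buckets over the range [low, high).
    return (operand - low) * count // (high - low) + 1


print([width_bucket(drb, 10, 54, 9) for drb in (10, 32, 53, 54)])
# [1, 5, 9, 10]
```

A value of 53 still falls in bucket 9, while 54 sits exactly on the exclusive upper bound and is therefore assigned to bucket 10.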