How to do histograms in PostgreSQL

SELECT * FROM histogram($table_name_or_subquery, $column_name);  

. . . to give sweet results like this, checking the distribution of 2016 political contributions in Vermont:

fec=# SELECT * FROM histogram('(SELECT * FROM small_donors_vt LIMIT 50000)', 'transaction_amt');

 bucket |   range   | freq |       bar       
--------+-----------+------+-----------------
      1 | [0,9]     | 2744 | ******
      2 | [10,19]   | 5630 | *************
      3 | [20,29]   | 6383 | ***************
      4 | [30,39]   | 1290 | ***
      5 | [40,49]   |  369 | *
      6 | [50,59]   | 3541 | ********
      7 | [60,69]   |  174 | 
      8 | [70,79]   |  313 | *
      9 | [80,89]   |  171 | 
     10 | [90,99]   |   65 | 
     11 | [100,109] | 2363 | ******
     12 | [110,119] |   51 | 
     13 | [120,129] |  115 | 
     14 | [130,139] |   32 | 
     15 | [140,146] |   11 | 
     16 | [150,159] |  187 | 
     17 | [160,169] |   24 | 
     18 | [170,177] |   33 | 
     19 | [180,189] |   19 | 
     20 | [191,199] |   24 | 
     21 | [200,200] |  795 | **
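
The post wraps this in a PL/pgSQL function that accepts either a table name or a parenthesized subquery. Here is a minimal sketch of one way such a function can be written, built on width_bucket(); the 20-bucket count and the 15-character bar scale are assumptions for illustration, not necessarily the post's exact code:

CREATE OR REPLACE FUNCTION histogram(table_name_or_subquery text, column_name text)
RETURNS TABLE(bucket int, "range" text, freq bigint, bar text) AS $func$
BEGIN
  RETURN QUERY EXECUTE format($q$
    WITH source AS (
      -- %%s interpolates raw SQL, which is what lets a subquery be passed in
      SELECT %2$s AS val FROM %1$s WHERE %2$s IS NOT NULL
    ),
    bounds AS (
      SELECT min(val) AS lo, max(val) AS hi FROM source
    )
    SELECT width_bucket(val, lo, hi, 20)             AS bucket,
           '[' || min(val) || ',' || max(val) || ']' AS range,
           count(*)                                  AS freq,
           -- scale the bars so the fullest bucket gets 15 stars
           repeat('*', (count(*)::float
                        / max(count(*)) OVER () * 15)::int) AS bar
    FROM source, bounds
    GROUP BY bucket
    ORDER BY bucket
  $q$, table_name_or_subquery, column_name);
END
$func$ LANGUAGE plpgsql;

Note that width_bucket() puts values equal to the maximum into an overflow bucket, which is why the [200,200] spike above gets a row of its own.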

Find the longest run of continuously rising days for a stock

Today I want to react to an article claiming that Relational Algebra Is the Root of SQL Problems, in which the author hand-waves the following position:

SQL becomes more a hindrance to data manipulation than an efficient tool. SQL’s greatest problem isn’t in the implementation level, but at its theory foundation. The problem can’t be solved by application optimization. Relational algebra isn’t sophisticated enough for handling the complicated data manipulation scenarios.
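
For what it's worth, the stock-streak task in the heading is a standard gaps-and-islands problem, and SQL's window functions dispatch it declaratively. A sketch, assuming a hypothetical table stock(trade_date date, close numeric):

WITH deltas AS (
  -- flag each day that closed higher than the previous day
  SELECT trade_date,
         CASE WHEN close > lag(close) OVER (ORDER BY trade_date)
              THEN 1 ELSE 0 END AS rising
  FROM stock
),
runs AS (
  -- every non-rising day starts a new group, so consecutive
  -- rising days share the same run_id
  SELECT rising,
         sum(CASE WHEN rising = 0 THEN 1 ELSE 0 END)
           OVER (ORDER BY trade_date) AS run_id
  FROM deltas
)
SELECT coalesce(max(cnt), 0) AS longest_rising_run
FROM (SELECT count(*) AS cnt
      FROM runs
      WHERE rising = 1
      GROUP BY run_id) streaks;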

Datasette: instantly create and publish an API for your SQLite databases

A key feature of datasette is that the API it provides is very deliberately read-only. This provides a number of interesting benefits:

  • It lets us use SQLite in production in high traffic scenarios. SQLite is an incredible piece of technology, but it is rarely used in web application contexts due to its limitations with respect to concurrent writes. Datasette opens SQLite files using the immutable option, eliminating any concurrency concerns and allowing SQLite to go even faster for reads.
  • Since the database is read-only, we can accept arbitrary SQL queries from our users!

Five ways to paginate in Postgres, from the basic to the exotic

However, the PostgreSQL statistics collector maintains per-column histograms of value distribution. We can use these estimates, in conjunction with limits and small offsets, to get fast random-access pagination through a hybrid approach.

First let’s look at the statistics of our medley:

SELECT array_length(histogram_bounds, 1) - 1
  FROM pg_stats
 WHERE tablename = 'medley'
   AND attname = 'n';

In my database the column n has 101 bound markers, i.e. 100 ranges between bound markers. The particular values aren't too surprising, because my data is uniformly distributed.
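
The bounds themselves can be inspected by selecting the histogram_bounds column from the same view:

SELECT histogram_bounds
  FROM pg_stats
 WHERE tablename = 'medley'
   AND attname = 'n';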

{719,103188,193973,288794,…,9690475,9791775,9905770,9999847}

Notice that the values are approximate. The first number is not exactly zero, and the last is not exactly ten million. The ranges divide our information into a block size B = 10,000,000 / 100 = 100,000 rows.

We can use the histogram ranges from the PostgreSQL stats collector to obtain probabilistically correct pages. If we choose a client-side page width of W, how do we request the ith page? It will reside in block iW / B (integer division), at offset iW % B.
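
For instance, with W = 20 and B = 100,000, page i = 270,000 starts at row iW = 5,400,000, which falls in block 5,400,000 / 100,000 = 54 at offset 5,400,000 % 100,000 = 0.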

Let's request page 270,000 from the medley table. Note that PostgreSQL arrays are one-based, so we have to adjust the values in the array lookups:

WITH bookmark AS (
    -- row i*W = 270,000 * 20 = 5,400,000 falls in histogram block 54;
    -- pg_stats arrays are one-based, hence the +1 and +2
    SELECT (histogram_bounds::text::int[])[((270000 * 20) / 100000)+1] AS start,
           (histogram_bounds::text::int[])[((270000 * 20) / 100000)+2] AS stop
    FROM pg_stats
    WHERE tablename = 'medley'
    AND attname = 'n'
    LIMIT 1
  )
SELECT *
FROM medley
WHERE n >= (select start from bookmark)
AND n < (select stop from bookmark)
ORDER BY n ASC
LIMIT 20
OFFSET ((270000 * 20) % 100000);  -- remaining offset within the block (zero here)

This performs blazingly fast (notice that the offset happens to be zero here). It gives back rows with n = 5407259 through 5407278. The true values on page 270000 are n = 5400001 through 5400020. The values are off by 7239, or about 0.1%.