Citus: Scale-Out Clustering and Sharding for PostgreSQL

Citus runs on standard, unpatched PostgreSQL servers. The only modification is installing the extensions into the server. This is a unique and extremely important advantage: most clustered databases that are derived from another database inevitably lag behind and get stuck on an old version of the original database, unable to keep up with the grueling workload of constantly refactoring to build on new releases. Not so for Citus, which doesn’t fork Postgres—it extends it with Postgres’s own extension mechanisms. This means that Citus is positioned to continue innovating on its own software, while continuing to benefit from the strong progress that the PostgreSQL community is delivering on a regular cadence too.

Won’t You Be My Neighbor? Quickly Finding Who is Nearby

Many applications these days want us to know how close we are to things:

  • What are the three closest coffee shops to my current location?
  • Which is the nearest airport to the office?
  • What are the two closest subway stops to the restaurant?

and countless more examples.

Another way of asking these questions is to say “who are my nearest neighbors to me?” This maps to a classic algorithmic problem: efficiently finding the K-nearest neighbors (or K-NN), where K is a constant. For example, the first question would be a 3-NN problem as we are trying to find the 3 closest coffee shops.

(If you are interested in learning more about K-NN problems in general, I highly recommend looking at how you can solve this using n-dimensional Voronoi diagrams, a wonderful data structure developed in the field of computational geometry.)

PostgreSQL defines a distance operator for geometric types that looks like this “<->” that, in the case of points, calculates the 2-Dimensional Euclidean distance. For example:

SELECT POINT(0,0) <-> POINT(1,1);

    ?column?
-----------------
 1.4142135623731


.. If we want to find the three friends who were closest to us on October 1, 2012 between 7:00am and 9:00am, we could construct a query like this:

SELECT visitor, visited_at, geocode
FROM visits
WHERE
    visited_at BETWEEN '2012-10-01 07:00' AND '2012-10-01 09:00'
ORDER BY POINT(40.7127263,-74.0066592) <-> geocode
LIMIT 3;


PostgreSQL 9.1 introduced the KNN-GiST index as a way to accelerate searching for neighbors. It has been implemented on several data types, including points and trigrams, and is also leveraged by the PostGIS geospatial extension.

.. You can use a KNN-GiST index simply by creating a GiST index on a supported data type, which in this case, is the geocode column:

CREATE INDEX visits_geocode_gist_idx ON visits USING gist(geocode);
VACUUM ANALYZE visits;

To demonstrate its power, let’s see what happens if we try to find the 3 nearest points to a given location:

EXPLAIN ANALYZE SELECT visitor, visited_at, geocode
FROM visits
ORDER BY POINT(40.7127263,-74.0066592) <-> geocode
LIMIT 3;

Voronoi diagram (Wikipedia)

20 points and their Voronoi cells (larger version below)

In mathematics, a Voronoi diagram is a partitioning of a plane into regions based on distance to points in a specific subset of the plane. That set of points (called seeds, sites, or generators) is specified beforehand, and for each seed there is a corresponding region consisting of all points closer to that seed than to any other. These regions are called Voronoi cells. The Voronoi diagram of a set of points is dual to its Delaunay triangulation.

It is named after Georgy Voronoi, and is also called a Voronoi tessellation, a Voronoi decomposition, a Voronoi partition, or a Dirichlet tessellation (after Peter Gustav Lejeune Dirichlet). Voronoi diagrams have practical and theoretical applications in a large number of fields, mainly in science and technology, but also in visual art.[1][2] They are also known as Thiessen polygons