Citus Unforks From PostgreSQL, Goes Open Source

First, Citus 5.0 now fully uses the PostgreSQL extension APIs. In other words, Citus becomes the first distributed database in the world that doesn’t fork the underlying database. This means Citus users can immediately benefit from new features in PostgreSQL, such as semi-structured data types (json, jsonb), UPSERT, or when 9.6 arrives no more full table vacuums. Also, users can keep working with their existing Postgres drivers and tools.

.. Second, Citus is going open source! The project, codebase, and all open issues are now available on Github

 

Replacing EAV with JSONB in PostgreSQL

Enter Entity-Attribute-Value. I’ve seen this pattern in almost every database I’ve worked with. One table contains the entities, another table contains the names of the properties (attributes) and a third table links the entities with their attributes and holds the value. This gives you the flexibility for having different sets of properties (attributes) for different entities, and also for adding properties on the flywithout locking your table for 3 days.

Nonetheless, I wouldn’t be writing this post if there were no downsides to this approach. Selecting one or more entities based on 1 attribute value requires 2 joins: one with the attribute table and one with the value table. Need entities bases on 2 attributes? That’s 4 joins! Also, the properties usually are all stored as strings, which results in type casting, both for the result as for the WHERE clause. If you write a lot of ad-hoc queries, this is very tedious.

Despite these obvious shortcomings, EAV has been used for a long time to solve this kind of problem. It was a necessary evil, and there just was no better alternative. But then PostgreSQL came along with a new feature…

Starting from PostgreSQL 9.4, a JSONB datatype was added for storing binary JSON data.

Generated Primary and Foreign Keys for Distributed Databases

Any solution to extend or replace sequences as generated keys must meet the following requirements:

  1. They must guarantee uniqueness across multiple systems and databases.
  1. They must not impair the efficiency of clustered indexes as commonly implemented by RDBMSs.
  1. They must allow for the effective migration of pre-existing sequence-based databases.
  1. They must be reliable and maintainable.

.. A refinement of solution 3 is to create a single key column that encapsulates both the database identifier and a serial integer.