$3000 Data Warehouse — Redshift vs. Postgres
Whether it’s 30-day retention or unique sessions, many analytics queries rely on counting the distinct elements of a set very quickly. On average, Redshift was 200x faster than RDS Postgres for these queries.
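For concreteness, a distinct-count query of this kind looks like the following minimal sketch; SQLite (via Python’s sqlite3 module) stands in for the warehouse here, and the sessions table, its columns, and the sample data are invented for illustration:

```python
import sqlite3

# In-memory database standing in for the warehouse (table/columns are hypothetical).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sessions (user_id INTEGER, day TEXT)")
conn.executemany(
    "INSERT INTO sessions VALUES (?, ?)",
    [(1, "2015-06-01"), (1, "2015-06-02"), (2, "2015-06-01"), (3, "2015-06-02")],
)

# Unique sessions per day: a COUNT(DISTINCT ...) per group.
rows = conn.execute(
    "SELECT day, COUNT(DISTINCT user_id) FROM sessions GROUP BY day ORDER BY day"
).fetchall()
print(rows)  # [('2015-06-01', 2), ('2015-06-02', 2)]
```

On a row store, every such query has to scan and de-duplicate the raw rows, which is exactly where the columnar engine pulls ahead.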
1. Postgres stores data by row. This means you have to read the whole table to sum the price column. Redshift stores its data organized by column, which allows the database to compress records because they’re all the same data type. Once they’re compressed, there’s less data to read off disk and store in RAM.
2. Postgres does not use multiple cores for a single query. While this allows more queries to run in parallel, no single query can use all of the machine’s resources.
3. A Redshift cluster can achieve much higher IOPS. Each node reads from a different disk and the IOPS sum across the cluster. Our benchmark cluster achieved over 50K IOPS.
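The row-versus-column compression point can be sketched in a few lines of Python; the layout and the compressor (zlib, standing in for the warehouse’s encodings) are simplified assumptions, not how either engine actually stores bytes:

```python
import struct
import zlib

# 100k orders stored row-wise (price interleaved with other fields) versus
# column-wise (all prices contiguous and the same type).
n = 100_000
row_store = b"".join(struct.pack("if8s", i, 9.99, b"widget  ") for i in range(n))
price_column = struct.pack(f"{n}f", *([9.99] * n))

# Same-typed, repetitive column data compresses far better than mixed rows,
# so there is less to read off disk when summing just the price column.
print(len(zlib.compress(row_store)), len(zlib.compress(price_column)))
```

The exact ratio depends on the data, but the contiguous, single-type column consistently compresses to a small fraction of the interleaved rows.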
PostgreSQL Select Filters
But what happens when Jeff from Accounting wants to know how many people ordered five specific products as a column list? In the old days, we might do something like this:
SELECT sum(CASE WHEN product_id = 1 THEN item_count ELSE 0 END) AS horse_mask_count,
       sum(CASE WHEN product_id = 7 THEN item_count ELSE 0 END) AS eyeball_count,
       sum(CASE WHEN product_id = 13 THEN item_count ELSE 0 END) AS badger_count,
       sum(CASE WHEN product_id = 29 THEN item_count ELSE 0 END) AS orb_count,
       sum(CASE WHEN product_id = 73 THEN item_count ELSE 0 END) AS memebox_count
  FROM sys_order;

 horse_mask_count | eyeball_count | badger_count | orb_count | memebox_count
------------------+---------------+--------------+-----------+---------------
            59870 |         59951 |        59601 |     59887 |         60189
Gross.
As a DBA, I’ve seen more of these than I can reasonably stand, and hate them every single time. It’s not the use of the CASE statement that is so irksome, but the micromanaging methodology necessary to reduce the count to zero for unwanted items. With FILTER, however, this query changes quite a bit:

SELECT sum(item_count) FILTER (WHERE product_id = 1) AS horse_mask_count,
       sum(item_count) FILTER (WHERE product_id = 7) AS eyeball_count,
       sum(item_count) FILTER (WHERE product_id = 13) AS badger_count,
       sum(item_count) FILTER (WHERE product_id = 29) AS orb_count,
       sum(item_count) FILTER (WHERE product_id = 73) AS memebox_count
  FROM sys_order;

 horse_mask_count | eyeball_count | badger_count | orb_count | memebox_count
------------------+---------------+--------------+-----------+---------------
            59870 |         59951 |        59601 |     59887 |         60189
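You can verify that the two forms agree with a small runnable sketch; SQLite (3.30+ also supports FILTER on aggregates) stands in for Postgres here, and the sample rows are invented:

```python
import sqlite3

# Toy sys_order table; product ids and counts are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sys_order (product_id INTEGER, item_count INTEGER)")
conn.executemany(
    "INSERT INTO sys_order VALUES (?, ?)",
    [(1, 10), (1, 5), (7, 3), (13, 8), (7, 4)],
)

# The old CASE-based pivot...
case_row = conn.execute("""
    SELECT sum(CASE WHEN product_id = 1 THEN item_count ELSE 0 END),
           sum(CASE WHEN product_id = 7 THEN item_count ELSE 0 END)
    FROM sys_order
""").fetchone()

# ...and the FILTER version, which produces the same totals.
filter_row = conn.execute("""
    SELECT sum(item_count) FILTER (WHERE product_id = 1),
           sum(item_count) FILTER (WHERE product_id = 7)
    FROM sys_order
""").fetchone()

print(case_row, filter_row)  # (15, 7) (15, 7)
```

One behavioral difference worth knowing: when no rows match a product at all, the CASE form still sums zeros and returns 0, while the FILTER form aggregates over an empty set and returns NULL.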
SQL Joins: O(log(n))
Each join is O(log(n)), so I believe you’re stuck with O((log n)^7) as a worst case, although in practice it will probably not be so bad, since one of the joins will probably limit the result set significantly.
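The O(log n) per join comes from the index probe behind each lookup; a minimal stand-in is binary search over a sorted key list (bisect here is only a toy model of a B-tree index, and the key set is invented):

```python
from bisect import bisect_left

def index_lookup(sorted_keys, key):
    """O(log n) probe, standing in for one B-tree index lookup during a join."""
    i = bisect_left(sorted_keys, key)
    return i if i < len(sorted_keys) and sorted_keys[i] == key else None

keys = list(range(0, 1_000_000, 2))  # even keys only
print(index_lookup(keys, 424242), index_lookup(keys, 7))  # 212121 None
```

Each probe touches about log2(n) keys regardless of table size, which is why a selective early join that shrinks the candidate set dominates the real-world cost far more than the asymptotic exponent does.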