scalability

Setting Up MongoDB Replica Sets on Amazon EC2

Before you set up MongoDB on EC2 make sure you understand the various aspects of running MongoDB in the Amazon cloud:

$3000 Data Warehouse — Redshift vs. Postgres

Whether it’s 30-day retention or unique sessions, many analytics queries rely on being able to count the distinct number of elements in a set very fast. On average, Redshift was 200x faster than RDS Postgres for these queries.

..[1] Postgres stores data by row. This means you have to read the whole table to sum the price column.

Redshift stores its data organized by column. This allows the database to compress records because they’re all the same data type. Once they’re compressed, there’s less data to read off disk and store in RAM

.. [2] Postgres does not use multiple cores for a single query. While this allows more queries to run in parallel, no single query can use all of the machine’s resources.

.. [3] A Redshift cluster can achieve higher much IOPS. Each node reads from a different disk and the IOPS sums across the cluster. Our benchmark cluster achieved over 50K IOPS.

10 Things You Should Know About Running MongoDB At Scale

It turns out that in real-life large deployments the biggest impact to performance is how well the schema design fits with the application needs. Second biggest impact is from lack of indexes or wrong indexes or way too many indexes. But even when the schema design is perfect and indexes are optimal, it is the disk IO throughput capacity that ends up being the next limiting factor, especially to the write throughput. Insufficient RAM will cause a lot of page faulting and add pressure to the disk IO, more on RAM needs later.

Python: Task queues

Tasks are handled asynchronously either because they are not initiated by an HTTP request or because they are long-running jobs that would dramatically reduce the performance of an HTTP response.

For example, a web application could poll the GitHub API every 10 minutes to collect the names of the top 100 starred repositories. A task queue would handle invoking code to call the GitHub API, process the results and store them in a persistent database for later use.