The Guardian Switched from MongoDB to PostgreSQL

In April the Guardian switched off the MongoDB cluster used to store our content, after completing a migration to PostgreSQL on Amazon RDS. This post covers why and how.

At the Guardian, the majority of content – including articles, live blogs, galleries and video content – is produced in our in-house CMS tool, Composer. This, until recently, was backed by a MongoDB database running on AWS. This database is essentially the “source of truth” for all Guardian content that has been published online – approximately 2.3m content items. We’ve just completed our migration away from MongoDB to PostgreSQL.

My experience with MongoDB

For starters, while a schemaless solution makes early tinkering more frictionless, sometimes you want the checks and data protection that a schema can provide. Typed columns also allow relational databases to make optimizations that MongoDB can’t.
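The kind of protection a typed schema gives can be shown with a tiny sketch, here using Python’s built-in sqlite3 purely as a stand-in relational database (the `articles` table and its columns are invented for illustration):

```python
import sqlite3

# Hypothetical articles table: the schema itself rejects malformed rows,
# a check that a schemaless document store would not perform on insert.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE articles (
        id         INTEGER PRIMARY KEY,
        title      TEXT NOT NULL,
        word_count INTEGER CHECK (word_count >= 0)
    )
""")

# a valid row is accepted
conn.execute("INSERT INTO articles (title, word_count) VALUES (?, ?)",
             ("Hello", 1200))

# a negative word count violates the CHECK constraint and is refused
rejected = False
try:
    conn.execute("INSERT INTO articles (title, word_count) VALUES (?, ?)",
                 ("Broken", -5))
except sqlite3.IntegrityError:
    rejected = True
```

The bad document never reaches storage; in a schemaless collection it would be silently accepted and the error surfaced much later, at read time.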

.. The docs suggest running it on at least three dedicated servers, each with ample resources. This was a bit much for me, so I ran it on a single server that it shared with the application. As a result my application was slow and the server crashed periodically. Now, you could criticize me for not following the recommended procedure, and you’d be right, but understand that when I switched to PostgreSQL, without increasing the hardware capacity at all, all of my performance and stability problems went away. MongoDB demanded too much hardware for less performance on essentially the same queries.

.. Besides, unless your system is particularly write-heavy, relational databases can use replication to scale out anyhow. MongoDB’s model really isn’t an advantage unless you are solving a write-heavy, Big Data problem. Until you reach that scale, it’s actually slower than the alternative.

Which database should I use for a killer web application: MongoDB, PostgreSQL, or MySQL?

Fan Out On Write
… is best if the timeline most resembles Twitter: a roughly permanent, append-only, per-user aggregation of a set of source feeds. When an incoming event is recorded, it is stored in the source feed, and then a record containing a copy is created for each timeline it must be visible in. Each timeline is essentially a materialized view of every event the end user has visibility into.
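The write-time copy step described above can be sketched in a few lines of Python; the names (`follow`, `post`, the in-memory dicts) are illustrative, not any real API:

```python
from collections import defaultdict

# Fan-out-on-write sketch: the write copies the event into every follower's
# timeline, so a read is a plain list lookup with no joins.
source_feeds = defaultdict(list)  # author -> events they published
timelines = defaultdict(list)     # user   -> materialized timeline
followers = defaultdict(set)      # author -> users who see their events

def follow(user, author):
    followers[author].add(user)

def post(author, event):
    source_feeds[author].append(event)
    for user in followers[author]:
        # the write-time copy that makes the later read cheap
        timelines[user].append((author, event))

follow("alice", "guardian")
post("guardian", "event-1")
```

The cost moves entirely to write time: one post by an author with a million followers means a million timeline writes, which is why this model suits append-only, Twitter-like feeds.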

.. The on-read strategy is best if the timeline most resembles Facebook’s news feed, in that it is temporal and/or you want dynamic features like relevance (the “Top News” feed on Facebook) or event aggregation. When an event is recorded, it is stored only in the source feed. Every time an end user requests their individual timeline, the system must read all of the source feeds the end user has visibility into and aggregate them together.
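The contrasting read path can be sketched the same way; the feed contents and names here are invented for illustration:

```python
# Fan-out-on-read sketch: events live only in each author's source feed; a
# user's timeline is assembled on demand by merging the feeds they follow.
source_feeds = {
    "guardian": [(3, "story C"), (1, "story A")],  # (timestamp, event)
    "bbc":      [(2, "story B")],
}
follows = {"alice": ["guardian", "bbc"]}

def read_timeline(user):
    merged = []
    for author in follows[user]:
        merged.extend(source_feeds[author])
    # newest first; relevance ranking or aggregation would slot in here
    return sorted(merged, reverse=True)
```

Writes are now a single append, but every timeline request pays the cost of reading and merging all followed feeds, which is what enables dynamic re-ranking on each view.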

If you’re doing fan-out-on-read, you can’t really ever afford to go to disk, even on SSDs. All of the timeline data must be in memory, all the time. This means you’ll probably have to monitor the size of the data set and actively purge the oldest data as the feed grows beyond a safe threshold.
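That purging policy can be as simple as a bounded buffer; a minimal sketch, where the threshold of 3 is purely illustrative (a real system would size it from its RAM budget):

```python
from collections import deque

# Keeping a feed bounded in memory: a deque with maxlen silently drops the
# oldest entries once the feed grows past the safe threshold.
SAFE_THRESHOLD = 3  # illustrative; real systems derive this from memory size
timeline = deque(maxlen=SAFE_THRESHOLD)

for event in ["e1", "e2", "e3", "e4", "e5"]:
    timeline.append(event)  # e1 and e2 are evicted automatically
```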

.. There is however an advanced use case that PostgreSQL isn’t that good for: graphs.

.. MongoDB is not a graph database, or even a relational database. Like CouchDB, it’s a document database and it represents the other end of the scale. What if you want to store all details of a movie in one place and you aren’t interested in being able to cross-reference the data? The relational databases want you to “normalise” your data by storing every little detail in a separate table, but you find that annoying. You just want to store your data in one place without having to think it through too much. You’re still figuring out what data you want to store, so for now you just want to dump it somewhere and worry about querying it later. That’s exactly what document databases are for and MongoDB is king of that hill.

.. Just use MySQL or Postgres. Why? If your entire _active set_ fits in a single machine’s main memory (which can be as high as 128GB+ on modern commodity machines), you don’t have a horizontal scalability problem: there is absolutely no reason for you to partition (“shard”) your database and give up relations. If your active data set fits in memory, almost any properly tuned database with an index will perform well enough to saturate your Ethernet card before the database itself becomes the limitation.

.. MySQL’s InnoDB (or Postgres’s storage engine), however, will still allow you to maintain (depending on your request distribution) roughly a 2:1 to 5:1 data-to-memory ratio on a spinning disk. Once you’ve gone beyond that, performance begins to fall off rapidly (as you’re making multiple disk seeks for every request). At that point, your best course of action is just to upgrade to SSDs (solid state drives), which again allow you to saturate your Ethernet card *before* the database becomes the limitation.

.. the non-relational model has palpable advantages for the typical web app and opens new opportunities when occupying the position in the stack where MySQL used to sit. The reason for this is the dynamic nature of web-app development. Features often change with iterations on user feedback, and these new features often have different persistence requirements that need to be fitted into the existing DB schema.

MongoDB is perfectly suited for that. In some ways, it is the best drop-in non-relational replacement for a relational DB like MySQL, in that it puts the most effort into making the transition smooth. What you get is freedom to change your data schema on the fly – there are no named columns with fixed datatypes. You can essentially put any “stuff” into a document, and documents within a collection (the Mongo equivalent of a table) don’t need to share a consistent structure. One document may have many keys where another has only a few.
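That flexibility is easy to picture with plain Python dicts standing in for documents (the movie data here is invented for illustration):

```python
# Two documents in the same "collection" with different shapes: legal in a
# schemaless store, impossible in one relational table without a migration.
movies = [
    {"title": "Alien", "year": 1979, "director": "Ridley Scott"},
    {"title": "Clip", "tags": ["short"], "runtime_min": 4},  # different keys
]

# the flip side: every query must tolerate missing fields
years = [doc.get("year") for doc in movies]
```

The trade-off is visible in the last line: the schema checks the database no longer performs have to live in application code instead.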

.. If data is not append-only like logging and can be updated randomly, MongoDB is a disaster.

.. If you’re just starting out, MongoDB would be helpful for things like log files. Being flexible about your documents means you can change fields on the fly and not have to worry about four-day ALTER TABLE statements on a 1-billion-row log table, as you would with MySQL. It’s also handy if you have known data structures that can be represented in a single document (note: the 4MB limit on a document means you shouldn’t ever add things like comments or any expanding data set to your document, as you WILL fail on the next insert that goes over 4MB).
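The usual way around that limit is to keep expanding data like comments in a separate collection keyed by the parent document’s id, sketched here with dicts standing in for collections (all names are invented):

```python
# The post document stays small and fixed-size; comments accumulate in
# their own unbounded "collection", referenced by post_id.
posts = {"post1": {"title": "Hello", "body": "body text"}}
comments = []  # separate collection, free to grow without bound

def add_comment(post_id, text):
    comments.append({"post_id": post_id, "text": text})

for i in range(3):
    add_comment("post1", f"comment {i}")

# reading a post's comments is a lookup by the parent id
post1_comments = [c for c in comments if c["post_id"] == "post1"]
```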

Upserts for counters are useful too.
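What an upsert counter buys you can be sketched with a dict standing in for a collection; in MongoDB itself the equivalent is an update with the `$inc` operator and `upsert=True`, which creates the document if it is missing and increments it if it exists:

```python
# Upsert-style counter sketch: one operation covers both "create the
# counter" and "increment the counter", so callers never need to check
# whether the document exists first. All names here are invented.
counters = {}

def incr(key, by=1):
    doc = counters.setdefault(key, {"_id": key, "n": 0})
    doc["n"] += by

for _ in range(3):
    incr("page:home")
```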

.. I think you should hold off optimizing your website to use a NoSQL database until you actually discover a performance problem. Your database will have to be taking a huge number of writes before a relational database becomes a problem. Any number of reads can be handled using replication.