Fan Out On Write
… is best if the timeline most resembles Twitter’s: a roughly permanent, append-only, per-user aggregation of a set of source feeds. When an incoming event is recorded, it is stored in the source feed and then a record containing a copy is created for each timeline in which it must be visible. Each timeline is essentially a materialized view of every event the end user has visibility into.
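A rough sketch of that write path, with hypothetical in-memory stores (a real system would use something like one Redis list per timeline):

```python
from collections import defaultdict

# Hypothetical in-memory stores, purely for illustration.
source_feeds = defaultdict(list)   # author -> their events
timelines = defaultdict(list)      # user -> copied events (materialized view)
followers = defaultdict(set)       # author -> ids of their followers

def post_event(author, event):
    # Fan out on write: store the event once in the author's source
    # feed, then copy it into every follower's timeline.
    source_feeds[author].append(event)
    for follower in followers[author]:
        timelines[follower].append((author, event))

followers["alice"] = {"bob", "carol"}
post_event("alice", "hello")
# bob's and carol's timelines now each hold their own copy of the event
```

Reads are then trivially cheap (one lookup per timeline); the cost was paid at write time, once per follower.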
.. The fan-out-on-read strategy is best if the timeline most resembles Facebook’s news feed, in that it is temporal and/or you want dynamic features like relevance (the “Top News” feed on Facebook) or event aggregation. When an event is recorded, it is stored only in the source feed. Every time an end user requests their individual timeline, the system must read all of the source feeds that the end user has visibility into and aggregate them together.
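A toy model of that read path, assuming each author’s feed is kept sorted by timestamp (the `following` map and the tuple layout are invented for illustration):

```python
import heapq
from collections import defaultdict

source_feeds = defaultdict(list)   # author -> [(timestamp, event), ...]
following = defaultdict(set)       # user -> authors they follow

def post_event(author, timestamp, event):
    # Fan out on read: the event is written only to the source feed.
    source_feeds[author].append((timestamp, event))

def read_timeline(user):
    # Aggregation happens at request time: merge every followed
    # feed (each already sorted by timestamp), newest first.
    merged = heapq.merge(*(source_feeds[a] for a in following[user]))
    return list(merged)[::-1]

following["bob"] = {"alice", "carol"}
post_event("alice", 1, "first")
post_event("carol", 2, "second")
read_timeline("bob")  # → [(2, "second"), (1, "first")]
```

Here writes are trivially cheap and every read pays the aggregation cost, which is why this strategy is so sensitive to read latency.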
If you’re doing fan-out-on-read, you can’t really ever afford to go to disk, even to SSDs. All of the timeline data must be in memory, all the time. This means you’ll probably have to monitor the size of the data set and actively purge the oldest data as the feed grows beyond a safe threshold.
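One crude way to enforce such a threshold is a bounded buffer per feed, sketched here with Python’s `collections.deque` (the cap of 3 is tiny purely for illustration; real systems keep hundreds of entries, and would purge on a schedule rather than per append):

```python
from collections import deque

# deque(maxlen=N) silently discards the oldest entry once
# the feed exceeds the threshold.
MAX_TIMELINE_LENGTH = 3  # illustrative only

timeline = deque(maxlen=MAX_TIMELINE_LENGTH)
for event in ["e1", "e2", "e3", "e4"]:
    timeline.append(event)

list(timeline)  # → ["e2", "e3", "e4"]  (oldest entry purged)
```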
.. There is however an advanced use case that PostgreSQL isn’t that good for: graphs.
.. MongoDB is not a graph database, or even a relational database. Like CouchDB, it’s a document database and it represents the other end of the scale. What if you want to store all details of a movie in one place and you aren’t interested in being able to cross-reference the data? The relational databases want you to “normalise” your data by storing every little detail in a separate table, but you find that annoying. You just want to store your data in one place without having to think it through too much. You’re still figuring out what data you want to store, so for now you just want to dump it somewhere and worry about querying it later. That’s exactly what document databases are for and MongoDB is king of that hill.
.. Just use MySQL or Postgres. Why? If your entire _active set_ fits in a single machine’s main memory (which can be as high as 128GB+ with modern commodity machines) you don’t have a horizontal scalability problem: i.e., there is absolutely no reason for you to partition (“shard”) your database and give up relations. If your active data set fits in memory, almost any properly tuned database with an index will perform well enough to saturate your Ethernet card before the database itself becomes the limitation.
.. MySQL’s InnoDB (or Postgres’s storage engine), however, will still allow you to maintain (depending on your request distribution) a ~2:1-5:1 data-to-memory ratio with a spinning disk. Once you’ve gone beyond that, performance begins to fall off rapidly (as you’re making multiple disk seeks for every request). At that point, your best course of action is simply to upgrade to SSDs (solid state drives), which — again — allow you to saturate your Ethernet card *before* the database becomes a limitation.
.. the non-relational model has palpable advantages for the typical webapp and opens new opportunities when occupying the position in the stack where MySQL used to sit. The reason for this is the dynamic nature of webapp development. Features often change with iterations on user feedback, and these new and different features often have different persistence requirements that need to be fitted into the existing DB schema.
MongoDB is perfectly suited for that. In some ways it is the best drop-in non-relational replacement for a relational DB like MySQL, in that it puts the most effort into making the transition smooth. What you get is the freedom to change your data schema on the fly – there are no named columns with fixed datatypes. You can essentially put any “stuff” into a document, and documents within a collection (the Mongo equivalent of a table) don’t need to share a consistent shape. One document may have many keys where another has only a few.
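That flexibility can be sketched without a running MongoDB at all: treat a collection as a list of dicts, and note that the two documents below share no required shape (the `find` helper here is an invented stand-in for Mongo’s query API):

```python
# A "collection" is just a list of documents (dicts); documents
# need not share fields. With pymongo this would be roughly
# collection.insert_one({...}).
movies = []

movies.append({"title": "Alien", "year": 1979,
               "cast": ["Sigourney Weaver"]})
movies.append({"title": "Blade Runner",   # an entirely different shape:
               "reviews": [{"stars": 5, "text": "A classic."}]})

def find(collection, **query):
    # Match documents whose fields equal the query values.
    return [doc for doc in collection
            if all(doc.get(k) == v for k, v in query.items())]

find(movies, title="Alien")[0]["year"]  # → 1979
```

The second document simply has no `year` or `cast` at all, and nothing complains: that is the schema-free trade-off in miniature.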
.. If your data isn’t append-only like a log and gets updated randomly, MongoDB is a disaster.
.. If you’re just starting out, MongoDB would be helpful for things like log files. The flexibility of its documents means you can change fields on the fly and not have to worry about 4-day ALTER TABLE statements on a 1-billion-row log table, as you would with MySQL. It’s also handy if you have known data structures that can be represented in a single document (note: the 4 MB limit on a document means you should never add things like comments or any expanding data set to your document, as you WILL fail on the insert that pushes it past 4 MB)
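A sketch of guarding against that failure mode before appending to an embedded array, using JSON size as a rough stand-in for BSON size (the helper name and the check itself are illustrative, not a pymongo API; the hard limit has since been raised to 16 MB):

```python
import json

MAX_DOC_BYTES = 4 * 1024 * 1024  # the historical 4 MB document cap

def can_append_comment(doc, comment):
    # Build the would-be document and check its serialized size
    # before attempting the insert. JSON size only approximates
    # BSON size, but it catches the runaway-array case.
    candidate = dict(doc, comments=doc.get("comments", []) + [comment])
    return len(json.dumps(candidate).encode("utf-8")) <= MAX_DOC_BYTES

post = {"title": "hello", "comments": []}
can_append_comment(post, {"text": "nice"})  # → True for a small doc
```

The safer design, as the quote implies, is to keep unbounded data sets (comments, events) in their own collection rather than embedded in the parent document.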
Upserts for counters are useful too.
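The pattern being referred to is Mongo’s atomic increment with upsert, roughly `update({"page": url}, {"$inc": {"hits": 1}}, upsert=True)` in the old pymongo API: insert the counter document if it is missing, otherwise increment in place, with no read-modify-write round trip in application code. A dict-based sketch of the same semantics:

```python
counters = {}

def incr(key, amount=1):
    # Upsert: create the counter on first touch, then increment.
    # In Mongo this is a single server-side atomic operation.
    counters[key] = counters.get(key, 0) + amount

incr("/home")
incr("/home")
counters["/home"]  # → 2
```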
.. I think you should hold off optimizing your website to use a NoSQL database until you actually discover a performance problem. Your database will have to be handling a huge number of writes before a relational database becomes a problem; any number of reads can be handled using replication.
Cassandra’s memory footprint is more dependent on the number of column families than on the size of the data set. Cassandra scales pretty well horizontally for storage and IO, but not for memory footprint, which is tied to your schema and your cache settings regardless of the size of your cluster. Planning for the smallest number of column families possible reduces your memory footprint and allows more memory for caching.
.. Cassandra runs most efficiently with data that is written once. Data that is frequently deleted or updated puts more pressure on compaction. Cassandra’s compaction tends to be CPU intensive, and heavy compaction loads can cause nodes to fall out of the ring.