NoSQL Design for DynamoDB

Differences Between Relational Data Design and NoSQL

Relational database systems (RDBMS) and NoSQL databases have different strengths and weaknesses:

  • In an RDBMS, data can be queried flexibly, but queries are relatively expensive and don’t scale well in high-traffic situations (see First Steps for Modeling Relational Data in DynamoDB).

  • In a NoSQL database such as DynamoDB, data can be queried efficiently in a limited number of ways, outside of which queries can be expensive and slow.

These differences make database design very different between the two systems:

  • In an RDBMS, you design for flexibility without worrying about implementation details or performance. Query optimization generally doesn’t affect schema design, but normalization is very important.

  • In DynamoDB, you design your schema specifically to make the most common and important queries as fast and as inexpensive as possible. Your data structures are tailored to the specific requirements of your business use cases.

Two Key Concepts for NoSQL Design

NoSQL design requires a different mindset than RDBMS design. For an RDBMS, you can go ahead and create a normalized data model without thinking about access patterns. You can then extend it later when new questions and query requirements arise. You can organize each type of data into its own table.

NoSQL design is different:

  • In DynamoDB, you shouldn’t start designing your schema until you know the questions it will need to answer. Understanding the business problems and the application use cases up front is essential.

  • You should maintain as few tables as possible in a DynamoDB application. Most well-designed applications require only one table.

Approaching NoSQL Design

The first step in designing your DynamoDB application is to identify the specific query patterns that the system must satisfy.

In particular, it is important to understand three fundamental properties of your application’s access patterns before you begin:

  • Data size: Knowing how much data will be stored and requested at one time will help determine the most effective way to partition the data.

  • Data shape: Instead of reshaping data when a query is processed (as an RDBMS does), a NoSQL database organizes data so that its shape in the database corresponds with what will be queried. This is a key factor in increasing speed and scalability.

  • Data velocity: DynamoDB scales by increasing the number of physical partitions that are available to process queries, and by efficiently distributing data across those partitions. Knowing in advance what the peak query loads might be helps determine how to partition data to best use I/O capacity.
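
As a rough illustration of how data size and data velocity combine, the sketch below estimates provisioned throughput for a uniform workload. It assumes provisioned capacity mode and the documented unit sizes (one read capacity unit covers one strongly consistent read per second of an item up to 4 KB, or two eventually consistent reads; one write capacity unit covers one write per second of an item up to 1 KB). The function names are hypothetical, and real workloads are rarely this uniform.

```python
import math

READ_UNIT_BYTES = 4 * 1024   # item size covered by one read capacity unit
WRITE_UNIT_BYTES = 1024      # item size covered by one write capacity unit


def read_capacity_units(item_size_bytes, reads_per_second, strongly_consistent=True):
    """Estimate read capacity units for same-sized items read at a steady rate."""
    units_per_read = math.ceil(item_size_bytes / READ_UNIT_BYTES)
    total = units_per_read * reads_per_second
    if not strongly_consistent:
        # Eventually consistent reads consume half as much capacity.
        total = math.ceil(total / 2)
    return total


def write_capacity_units(item_size_bytes, writes_per_second):
    """Estimate write capacity units for same-sized items written at a steady rate."""
    units_per_write = math.ceil(item_size_bytes / WRITE_UNIT_BYTES)
    return units_per_write * writes_per_second


# Hypothetical peak load: 6,000-byte items, 100 reads/s and 50 writes/s.
peak_reads = read_capacity_units(6000, 100)
peak_writes = write_capacity_units(6000, 50)
```

An estimate like this only bounds total capacity; how evenly that capacity is used still depends on how the keys spread traffic across partitions.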

After you identify specific query requirements, you can organize data according to general principles that govern performance:

  • Keep related data together.   Research on routing-table optimization 20 years ago found that “locality of reference” was the single most important factor in speeding up response time: keeping related data together in one place. This is equally true in NoSQL systems today, where keeping related data in close proximity has a major impact on cost and performance. Instead of distributing related data items across multiple tables, you should keep related items in your NoSQL system as close together as possible.

    As a general rule, you should maintain as few tables as possible in a DynamoDB application. As emphasized earlier, most well-designed applications require only one table, unless there is a specific reason for using multiple tables.

    Exceptions include high-volume time series data and datasets that have very different access patterns, but such cases are uncommon. A single table with inverted indexes can usually enable simple queries to create and retrieve the complex hierarchical data structures required by your application.

  • Use sort order.   Related items can be grouped together and queried efficiently if their key design causes them to sort together. This is an important NoSQL design strategy.

  • Distribute queries.   It is also important that a high volume of queries not be focused on one part of the database, where they can exceed I/O capacity. Instead, you should design data keys to distribute traffic evenly across partitions as much as possible, avoiding “hot spots.”

  • Use global secondary indexes.   By creating specific global secondary indexes, you can enable queries that your main table cannot support, while keeping them fast and relatively inexpensive.
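
The first, second, and fourth principles can be illustrated with a toy in-memory model. This is only a sketch: ToyTable, its methods, and the key formats (CUSTOMER#..., ORDER#..., USER#..., GROUP#...) are hypothetical and do not use the DynamoDB API. It shows related items sharing one partition key, a sort-key prefix selecting a sorted group of them, and an inverted index that swaps the roles of the two keys.

```python
class ToyTable:
    """Toy stand-in for one table with a partition key (pk) and a sort key (sk)."""

    def __init__(self):
        self._items = {}  # (pk, sk) -> attribute dict

    def put(self, pk, sk, **attrs):
        self._items[(pk, sk)] = attrs

    def query(self, pk, sk_prefix=""):
        """Items in one partition whose sort key starts with sk_prefix, in sort-key order."""
        keys = sorted(k for k in self._items
                      if k[0] == pk and k[1].startswith(sk_prefix))
        return [(k[1], self._items[k]) for k in keys]

    def query_inverted(self, sk, pk_prefix=""):
        """Mimic an inverted index: the sort key is looked up, results ordered by pk."""
        keys = sorted(k for k in self._items
                      if k[1] == sk and k[0].startswith(pk_prefix))
        return [(k[0], self._items[k]) for k in keys]


table = ToyTable()

# Keep related data together: a customer's profile and orders share one partition key.
table.put("CUSTOMER#42", "PROFILE", name="Ada")
table.put("CUSTOMER#42", "ORDER#2024-03-02", total=12)
table.put("CUSTOMER#42", "ORDER#2024-01-15", total=30)
table.put("CUSTOMER#7", "ORDER#2024-02-01", total=99)

# Use sort order: date-prefixed sort keys return the orders oldest first.
orders = table.query("CUSTOMER#42", sk_prefix="ORDER#")

# Inverted index: membership items answer "which users are in this group?"
table.put("USER#bob", "GROUP#admins")
table.put("USER#alice", "GROUP#admins")
admins = table.query_inverted("GROUP#admins")
```

Here a single lookup on CUSTOMER#42 returns the profile and every order together, which is the effect a join would otherwise provide in a relational design.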

These general principles translate into some common design patterns that you can use to model data efficiently in DynamoDB.
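
One such pattern, for distributing queries, is to append a calculated suffix to a partition key that would otherwise receive most of the traffic. The sketch below is a minimal illustration under assumed names: SHARD_COUNT, the key format, and both helper functions are hypothetical, and the right number of shards depends on the workload.

```python
import hashlib

SHARD_COUNT = 10  # hypothetical; choose based on expected peak traffic


def sharded_partition_key(base_key, item_id, shards=SHARD_COUNT):
    """Derive a stable shard suffix from item_id so writes spread across partitions."""
    digest = hashlib.sha256(item_id.encode("utf-8")).digest()
    shard = int.from_bytes(digest[:4], "big") % shards
    return f"{base_key}#{shard}"


def all_shard_keys(base_key, shards=SHARD_COUNT):
    """Every partition key a reader must query to see all items under base_key."""
    return [f"{base_key}#{n}" for n in range(shards)]


# Writes for one busy day land on ten partition keys instead of one.
key = sharded_partition_key("2024-03-02", "order-1138")
read_keys = all_shard_keys("2024-03-02")
```

The trade-off is on the read side: a query for everything under the base key has to visit each shard key and merge the results, so sharding suits keys that are written far more often than they are read in full.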

Why You Should Never Use MongoDB

For quite a few years now, the received wisdom has been that social data is not relational, and that if you store it in a relational database, you’re doing it wrong.

.. When Diaspora decided to store social data in MongoDB, we were conflating a database with a cache. Databases and caches are very different things. They have very different ideas about permanence, transience, duplication, references, data integrity, and speed.

.. MongoDB’s ideal use case is even narrower than our television data. The only thing it’s good at is storing arbitrary pieces of JSON. “Arbitrary,” in this context, means that you don’t care at all what’s inside that JSON. You don’t even look. There is no schema, not even an implicit schema, as there was in our TV show data. Each document is just a blob whose interior you make absolutely no assumptions about.

Google goes back to the future with SQL F1 database: ‘Can you have a truly scalable database without going NoSQL? Yes!’

The AdWords system includes “100s of applications and 1000s of users,” which all share a database over 100TB serving up “hundreds of thousands of requests per second, and runs SQL queries that scan tens of trillions of data rows per day,” Google said. And it’s got five nines of availability.

.. Google describes it as “a hybrid, combining the best aspects of traditional relational databases and scalable NoSQL systems”.

.. The technology comes with a cost: Google said its design choices resulted in “higher latency for typical reads and writes,” though the company has developed workarounds for this. It has “relatively high” commit latencies of 50-150 ms, Google writes.

.. “We’ve moved this huge AdWords system onto this new DB and proved it actually works. The system is way more scalable than what it was before – better availability than MySQL, better consistency than MySQL, we’ve got SQL-query that’s just as good as what we started with.”

.. “We also have a lot of experience with eventual consistency systems at Google,” they write in the paper. “In all such systems, we find developers spend a significant fraction of their time building extremely complex and error-prone mechanisms to cope with eventual consistency and handle data that may be out of date.”

.. Right now, companies such as Cloudera, ParAccel, MongoDB, and Cascading are all trying to layer a SQL-like query engine over a datastore like HDFS or MongoDB. But though these can scale well, they lack the transactional capabilities of some systems. This, for enterprises, is an unpleasant pill to swallow.