CAP Twelve Years Later: How the “Rules” Have Changed

The easiest way to understand CAP is to think of two nodes on opposite sides of a partition. Allowing at least one node to update state will cause the nodes to become inconsistent, thus forfeiting C. Likewise, if the choice is to preserve consistency, one side of the partition must act as if it is unavailable, thus forfeiting A. Only when nodes communicate is it possible to preserve both consistency and availability, thereby forfeiting P. The general belief is that for wide-area systems, designers cannot forfeit P and therefore have a difficult choice between C and A. In some sense, the NoSQL movement is about creating choices that focus on availability first and consistency second; databases that adhere to ACID properties (atomicity, consistency, isolation, and durability) do the opposite. The “ACID, BASE, and CAP” sidebar explains this difference in more detail.

.. Finally, all three properties are more continuous than binary. Availability is obviously continuous from 0 to 100 percent, but there are also many levels of consistency, and even partitions have nuances, including disagreement within the system about whether a partition exists.

Explain Extended: SQL Explained

This series of articles was inspired by questions from site visitors and Stack Overflow users, including Tony, Philip, Rexem, and others.

Which method (NOT IN vs. NOT EXISTS vs. LEFT JOIN / IS NULL) is best to select values present in one table but missing in another one?

This:

SELECT  l.*
FROM    t_left l
LEFT JOIN
        t_right r
ON      r.value = l.value
WHERE   r.value IS NULL

this:

SELECT  l.*
FROM    t_left l
WHERE   l.value NOT IN
        (
        SELECT  value
        FROM    t_right r
        )

or this:

SELECT  l.*
FROM    t_left l
WHERE   NOT EXISTS
        (
        SELECT  NULL
        FROM    t_right r
        WHERE   r.value = l.value
        )
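Before comparing performance, note one behavioral difference between the three: if t_right.value can contain NULLs, the NOT IN variant returns no rows at all, because the predicate evaluates to UNKNOWN for every row under SQL's three-valued logic, while LEFT JOIN / IS NULL and NOT EXISTS are unaffected. A minimal guard restores the expected behavior:

SELECT  l.*
FROM    t_left l
WHERE   l.value NOT IN
        (
        SELECT  value
        FROM    t_right r
        WHERE   r.value IS NOT NULL
        )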


Analyzing S3 and CloudFront Access Logs with AWS RedShift

Log data is an interesting case for RedShift. As mentioned previously, our environment generates so much log data from CloudFront and S3 usage that nobody could conceivably work with those datasets using standard text tools such as grep or tail. Many people load their access logs into databases, but we have not found this feasible with MySQL or PostgreSQL, because ad-hoc queries against sets with billions of rows can take hours. Once the data is imported into RedShift, the same queries take minutes at most.

.. For our simple example though, we’ll just load one month of logs from just one of our CloudFront distributions:

COPY cf_logentries
  FROM 's3://cloudfront-logs/E1DHT7QI9H0ZOB.2014-04-'
  CREDENTIALS 'aws_access_key_id=;aws_secret_access_key='
  DELIMITER '\t' MAXERROR 200 FILLRECORD IGNOREHEADER 2 gzip;

.. With CloudFront you should really care about your cache hit ratio - it may be obvious, but the load on your origin systems decreases as your content becomes easier to cache. This query looks at the most-used URLs and gives you a cache hit ratio:


AWS RDS Provisioned IOPS really worth it?

If you’re OK with running replicas, we recommend running a read-only replica as a non-RDS instance on a regular EC2 instance. By managing the replica yourself, you can get better read IOPS at a much lower price. We have even set up replicas outside AWS using stunnel, with SSD drives as the primary block device, and we get ridiculous read speeds for our reporting systems – literally 100 times faster than we get from RDS.
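As an illustration of the stunnel approach: replication traffic is simply tunneled through TLS by forwarding a local port to the master's stunnel endpoint. A minimal client-side stunnel config might look like this (the section name, hostnames, and ports are placeholders, not the authors' actual setup):

```ini
; stunnel on the external replica: forward a local port to the
; master's stunnel endpoint (all names and ports are illustrative)
[mysql-replication]
client = yes
accept = 127.0.0.1:3307
connect = master.example.com:3307
```

The replica's CHANGE MASTER TO statement then points at 127.0.0.1:3307 instead of the master's real address, so the MySQL replication stream travels over the encrypted tunnel.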