How Redshift differs from PostgreSQL

  • Redshift is a columnar database better suited for analytics, and thus a more appropriate platform for a data warehouse.
  • In PostgreSQL a single database connection cannot utilize more than one CPU, while Redshift is architected for parallel processing across multiple nodes.
  • Redshift doesn’t enforce primary key, foreign key, or uniqueness constraints, though Amazon says “primary keys and foreign keys are used as planning hints and they should be declared if your ETL process or some other process in your application enforces their integrity.”

Analyzing S3 and CloudFront Access Logs with AWS RedShift

Log data is an interesting case for RedShift. In our environment as mentioned previously we have so much log data from our CloudFront and S3 usage that nobody could conceivably work with those datasets using standard text tools such as grep or tail. Many people load their access logs into databases, but we have not found this to be feasible using MySQL or PostgreSQL due to the fact that ad-hoc queries run against sets with billions of rows can take hours. Once imported into RedShift the same queries take minutes at the most.

.. For our simple example though, we’ll just load one month of logs from just one of our CloudFront distributions:

COPY cf_logentries
  FROM 's3://cloudfront-logs/E1DHT7QI9H0ZOB.2014-04-'
  CREDENTIALS 'aws_access_key_id=;aws_secret_access_key='

.. With CloudFront you really should care about your cache hit ratio - maybe it's obvious, but the load on your origin systems decrease as your content becomes easier to cache. This query will look at the most used URLs and give you a cache hit ratio: