A Rare Peek Into The Massive Scale of AWS

The Amazon online retail business may be a $70 billion behemoth, but it does not throw off a lot of cash. Amazon founder and CEO Jeff Bezos is not interested in profits as much as he is about transforming the world around him, but the cloud computing business is one of the most capital intensive businesses there are in the world. Thanks to its near monopoly in online search, Google can spend tens of billions of dollars on datacenters and not bat an eyelash. Microsoft, thanks to its near monopoly on desktop software and its dominant position in datacenter software, also has very deep pockets and can spend as much.

.. So, the answer is that AWS probably has somewhere between 2.8 million and 5.6 million servers across its infrastructure. I realize those are some pretty big error bars, but this is the data we have to work with.

.. “Networking is a red alert situation for us right now,” explained Hamilton. “The cost of networking is escalating relative to the cost of all other equipment. It is Anti-Moore. All of our gear is going down in cost, and we are dropping prices, and networking is going the wrong way. That is a super-big problem, and I like to look out a few years, and I am seeing that the size of the networking problem is getting worse constantly. At the same time that networking is going Anti-Moore, the ratio of networking to compute is going up.”

.. The first thing that Amazon learned from its custom network gear is what it learned about servers and storage a long time ago: If you build it yourself with minimalist attitudes and only with the features you need, it is a lot cheaper.

.. But the surprising thing, even to Hamilton, was that network availability went up, not down. And that is because AWS switches and routers only had features that AWS needed in its network.

.. And so when it first tested out its homegrown networking, it did so on a 3 megawatt datacenter with 8,000 servers that cost something on the order of $40 million to build. This is not something even the largest network equipment providers can do, but AWS could, and did, literally rent the capacity from itself for a couple of hundred thousand dollars to test at this vast scale for a couple of months. (Yet another example of scale and how to leverage it.) Today all of the AWS network is using this custom network stack. Equally important to owning the stack and testing it thoroughly, Amazon continuously develops the code and puts it into production. “Even if it doesn’t start out better, it keeps getting better.”

.. The AZs are usually under 1 millisecond apart in terms of latency and are always less than 2 milliseconds apart; this speed is what allows for synchronous data replication, since committing data to a solid state drive takes – wait for it – between 1 and 2 milliseconds.