Some weeks back, somebody asked me “What’s the big deal with the cloud? I don’t even understand WHAT it is!”. This is a common problem, and one I’m going to try and clear up here and now.
Why is it so hard to define?
The cloud is so hard to define because it is made up of several different ideas and technologies. As I see it, the cloud comprises the following things:
- File Storage
- Remote Computing Power
- Clustered Web Hosting
- Data Storage
- Web Applications
- Data Exchange
I will attempt to define all of these, and end with a real world scenario (though fake) of how several of these can be brought together.
File Storage

Disks are cheap, we know that. You can buy 1TB for US$75, and that's peanuts! The problem is high availability and data throughput. This is where "old skool" CDNs typically played a role. However, with the introduction of Amazon's Simple Storage Service (S3), things changed. While there is little difference between the two services in terms of why you would use them, what S3 brought to the table was a unique pricing plan (no huge setup fees, just pay pennies for what you use), making it available to every company at every level. More importantly, they also introduced an API.
Through the API, those looking for a standard CDN-type service can upload their resources transparently as an integral part of their process. In addition, many services capitalize on this API to provide non-CDN services, such as data backup.
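To make that concrete, here is a minimal sketch of that upload flow using boto3, the current AWS SDK for Python (which post-dates this post); the bucket and key names are invented for illustration:

```python
def public_url(bucket: str, key: str) -> str:
    """Build the public URL S3 serves an object from."""
    return f"https://{bucket}.s3.amazonaws.com/{key}"

def upload_asset(path: str, bucket: str, key: str) -> str:
    """Upload a local file via the S3 API and return the URL it is served from."""
    import boto3  # imported lazily; requires `pip install boto3` and AWS credentials
    s3 = boto3.client("s3")
    s3.upload_file(path, bucket, key)
    return public_url(bucket, key)
```

A build script could call `upload_asset("logo.png", "my-site-assets", "img/logo.png")` as part of its deploy step, which is exactly the "integral part of their process" idea above.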
Since the introduction of S3, Rackspace has also entered the space with its Cloud Files service.
Remote Computing Power
Another facet of The Cloud is remote computing power, which originally took the form of Amazon's Elastic Compute Cloud (EC2). The idea behind this service is the ability to configure what I can best describe as virtual machines to perform specific tasks (e.g. crunch data). Then, using the API, you can "spin up" multiple virtual appliances from the disk image as you need them.
This means you have the resources of a giant enterprise company at your disposal on an as-needed basis, and again, one of the breakthroughs is Amazon's pricing: pay for what you use.
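A back-of-the-envelope helper shows how "as-needed" sizing works out; all the numbers here (requests per second, per-instance capacity, headroom) are hypothetical:

```python
import math

def instances_needed(requests_per_sec: float, capacity_per_instance: float,
                     headroom: float = 0.25) -> int:
    """How many identical appliances to run for a given load,
    keeping some spare headroom for spikes."""
    if requests_per_sec <= 0:
        return 0
    target = requests_per_sec * (1 + headroom)
    return math.ceil(target / capacity_per_instance)

# e.g. 900 req/s, each instance handling 200 req/s, with 25% headroom:
# 900 * 1.25 = 1125; 1125 / 200 = 5.625, so run 6 instances
```

With pay-per-use pricing, you pay for those 6 instances only while the load lasts, instead of owning 6 servers forever.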
In 2008, a small company loved by geeks around the world entered this space: SliceHost. Known (at least by a savvy few) for its excellent VPS services, SliceHost's introduction of an API put it in direct competition with Amazon's EC2. In October 2008, Rackspace purchased SliceHost, and while SliceHost is still a separate company, the technology now powers Rackspace's Cloud Servers offering.
Clustered Web Hosting
Clustered web hosting is nothing new. Companies have been creating clusters of servers for eons, for many tasks ranging from number crunching to data analysis, through to web servers and database servers. Where this space enters The Cloud is through a service like Rackspace's Mosso/Cloud Sites service. Like a traditional cluster, it provides high availability, lots of power, and reliability. (Note: I use Mosso for this blog and a number of other sites.)
However, where this blends into the cloud, versus a traditional cluster, is that Mosso operates one giant cluster shared by huge numbers of websites, with the infrastructure in place to let those sites grow as large as they need, autonomously and transparently.
Another (perhaps the first, but I’m not that familiar with them) player in this space, is MediaTemple’s Grid-Service.
Data Storage

You might ask yourself: what is the difference between file storage and data storage? The answer is the same as the difference between the file system and a database.
This area is the newest addition to the cloud, and the one I think most people saw as necessary to really replace the old-style, non-cloud systems. The biggest player in this market is Amazon's SimpleDB (beta), with Google's BigTable only available through their Python-based App Engine.
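The filesystem-versus-database distinction can be shown in a few lines. Here, plain files stand in for S3-style file storage and an in-memory SQLite table stands in for a SimpleDB-style attribute store; the schema and values are invented for illustration:

```python
import os
import sqlite3
import tempfile

# File storage: opaque blobs addressed by name (the S3 model).
blob_dir = tempfile.mkdtemp()
with open(os.path.join(blob_dir, "avatar-123.png"), "wb") as f:
    f.write(b"\x89PNG...")  # raw bytes; the store knows nothing about their meaning

# Data storage: structured attributes you can query (the SimpleDB model).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE avatars (user_id TEXT PRIMARY KEY, url TEXT)")
db.execute("INSERT INTO avatars VALUES (?, ?)",
           ("123", "http://example.com/avatar-123.png"))
(url,) = db.execute("SELECT url FROM avatars WHERE user_id = ?",
                    ("123",)).fetchone()
```

One stores bytes you can only fetch whole by name; the other stores attributes you can search and filter on. That querying ability is what the cloud was missing until services like SimpleDB arrived.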
Web Applications

Arguably the meat of Web 2.0, web applications allow people to create and work in the cloud without any knowledge of the technology. To them, data held by web applications lives in the same place as their webmail. API access to integrate these applications into other services is part of how they are used within the cloud. The obvious player in this area is Google, with its Gmail, Google Calendar, and other Google apps such as Google Docs.
Data Exchange

Data exchange using web services is the heart of Web 2.0: the mashup. Data exchange is not strictly part of the cloud, but web services are; almost all of the cloud is interacted with using web services. In addition, thanks to single sign-on ideas like OpenID, we are starting to see different facets of our data migrate across the websites we use, making it more useful and accessible. This, too, is part of the cloud.
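A mashup is, at heart, two web-service responses joined on a shared key. With the HTTP fetches elided, the merge step looks like this; both payloads are invented for illustration:

```python
import json

# Imagine these JSON strings came back from two different services' APIs.
profile_json = '{"user": "anna", "city": "Portland"}'
weather_json = '{"Portland": {"temp_f": 62, "sky": "overcast"}}'

profile = json.loads(profile_json)
weather = json.loads(weather_json)

# Join on the shared key (the city) to produce something
# neither service offers on its own.
mashup = {"user": profile["user"], **weather[profile["city"]]}
```

The value is in the combination: a profile service and a weather service, neither aware of the other, yield a personalized forecast.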
A Real-World Scenario

For this scenario, I'm going to make up a fictional situation involving Twitter. I have absolutely no idea what they have going on technically, and no idea if this is how they might handle it; it's just a well-known scenario that could be solved using the cloud.
The scenario is this: Oprah joins your service, and suddenly you have an influx of new users. In addition, Ashton Kutcher and CNN are duking it out to reach 1 million followers.
You have two weeks to prepare. You could call your Dell representative and order 50 new servers, clone disks, and put them into your cluster… but what if it's not enough? How do you justify spending that much money when the hype might only last two weeks? A month? The simple answer is: you don't. Instead, you configure a couple of EC2 or Cloud Servers instances, and as your load starts to ramp up, you simply initiate more and more appliances on-the-fly using their respective APIs.
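A naive control loop for that ramp-up might look like the sketch below. The thresholds are invented, and the fleet-size decision would feed into EC2 or Cloud Servers API calls to actually start or stop appliances:

```python
def rebalance(current: int, utilisation: float,
              high: float = 0.8, low: float = 0.3,
              min_instances: int = 2) -> int:
    """Decide the new fleet size from average per-instance utilisation (0..1).

    Running hot: grow the fleet by 50% (at least one instance).
    Mostly idle: shed one instance, never dropping below a safety floor.
    """
    if utilisation > high:
        return current + max(1, current // 2)
    if utilisation < low:
        return max(min_instances, current - 1)
    return current

# 4 instances at 90% utilisation grow to 6; at 10% they shrink toward the floor.
```

The point is that the decision is cheap and reversible: a wrong guess costs you an hour of instance time, not 50 servers.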
Knowing that Oprah's show is going to air at a specific time, you might fire up several instances an hour beforehand to get the ball rolling.
You have one appliance which functions as the web server for twitter.com, one for handling API requests, perhaps even registration split out to its own appliance, and then, of course, clustered copies of the traditional RDBMS (they're not typically going to use Amazon SimpleDB for their regular storage, as its functionality just isn't up to par).
You already have S3 in place for user avatars, but instead of calculating the filename hash on every request, or retrieving it from your local database, you push that into Amazon SimpleDB.
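The idea is to hash once at upload time and look the result up per request. In this sketch a plain dict stands in for the SimpleDB domain, and the hashing scheme is invented:

```python
import hashlib

avatar_index = {}  # stand-in for a SimpleDB domain: user_id -> S3 object key

def store_avatar(user_id: str, image_bytes: bytes) -> str:
    """Compute the filename hash once, at upload time, and remember the key."""
    key = "avatars/" + hashlib.sha1(image_bytes).hexdigest() + ".png"
    avatar_index[user_id] = key      # in reality: a SimpleDB put call
    return key

def avatar_key(user_id: str) -> str:
    """Per-request lookup: no hashing, no hit on your main database."""
    return avatar_index[user_id]     # in reality: a SimpleDB get call
```

Every avatar request now hits the cheap, scalable attribute store instead of your already-overloaded RDBMS.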
And that’s it. As the load starts to drop off, you shut down EC2 instances, knowing if you get a sudden influx, you can always spin them back up.
Eventually, you get a handle on what your new average load will be (presumably only some small portion of the initial influx of "zomg, Oprah says this is awesome so it must be" people will stay), and then you can purchase the right amount of actual hardware to add to your own systems.
Or not. Keep it in the cloud. That’s a decision you can now make at your leisure, instead of scrambling to make your best guess in that two week period before things go nuts.
The reason the cloud is so hard to define is that it's no single thing. It is, like its namesake, nebulous. It is simply there, and will look like what you make it.
Please read Rob's reply below; he is an employee of Rackspace, and usually (always?) the guy behind @mosso.