A recent Amazon outage resulted in a small number of customers losing production data stored in their accounts. This, of course, led to the typical anti-cloud comments that follow such events.
The reality is that these customers' data loss had nothing to do with the cloud and everything to do with them not understanding the storage they were using and not backing it up.
Over Labor Day weekend there was a power outage in one of the availability zones in the AWS US-East-1 region. Back-up generators came on, but quickly failed for unknown reasons.
Customers’ Elastic Block Store (EBS) data is replicated among multiple servers, but the outage affected multiple servers. While the bulk of the data stored in EBS was fine or could easily be recovered after the outage, 0.5 per cent of it could not be recovered. Customers among that 0.5 per cent who did not have a back-up of their EBS data actually lost data.
It is often said that you can outsource IT but you cannot outsource the responsibility for IT. If you are going to use another company's service to store your company's important data, you need to understand how that service works.
That includes what native protection tools it offers and even more importantly what protection it does not offer. This article will discuss Amazon's Elastic Block Store (EBS), and the next one will explain its Simple Storage Service (S3) – both with an eye toward backing up the data stored by the service. The purpose, design and protection capabilities of the two services couldn't be more different.
How EBS works
EBS is essentially a very reliable virtual hard drive in the cloud. If you think of EBS as nothing more than a very fancy hard drive, what you need to do to protect it will become immediately obvious.
The challenge is that many people think that all cloud storage is automatically protected against everything, and that simply isn’t true. The events of the recent Amazon outage, where customers lost data due to an event beyond their control, should drive that point home.
EBS volumes are protected via replication between multiple servers within an availability zone – a specific geographical location. It is block-level replication that is essentially the same as what you would get with a RAID array in a data centre.
Just like RAID-protected storage, any logical corruption that may occur will likely be replicated as well, causing all data stored on that EBS volume to be corrupted or deleted.
Logical corruption can happen in a number of ways, including human error (accidentally deleting a directory), software error (bugs) or an electrical spike. This is why we back up RAID arrays, and this is why you should back up EBS volumes.
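To make the difference between replication and back-up concrete, here is a toy model in Python. It is only an illustration of the principle, not a description of how EBS is implemented; the class name, the replica count and the file path are all invented for the example.

```python
import copy


class ReplicatedVolume:
    """Toy model of a volume whose writes are mirrored to several replicas."""

    def __init__(self, replica_count=3):
        # Each replica maps a file path to its contents.
        self.replicas = [{} for _ in range(replica_count)]

    def write(self, path, data):
        # Every write is applied to every replica.
        for replica in self.replicas:
            replica[path] = data

    def delete(self, path):
        # A delete is just another write: it is mirrored everywhere too.
        for replica in self.replicas:
            replica.pop(path, None)

    def backup(self):
        # A point-in-time copy that later writes cannot touch.
        return copy.deepcopy(self.replicas[0])


volume = ReplicatedVolume()
volume.write("/data/orders.db", "important records")
saved = volume.backup()

volume.delete("/data/orders.db")  # simulated human error

# Replication faithfully repeated the mistake on every replica...
print(any("/data/orders.db" in r for r in volume.replicas))  # False
# ...but the independent back-up still holds the file.
print("/data/orders.db" in saved)  # True
```

Replication protects against a replica failing; only the separate point-in-time copy protects against the mistake itself.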
Amazon says in its EBS product description to expect an annual failure rate between 0.1 per cent and 0.2 per cent, which means that if you have 1,000 volumes, you should expect to lose one or two volumes a year.
While that is roughly 10 times better than the numbers for an individual hard drive, it is not an insignificant number once you start talking about thousands of volumes. The next sentence in that description mentions that Amazon offers snapshot services to protect these volumes. Take the hint: protect your EBS volumes with EBS snapshots.
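The failure-rate arithmetic is easy to check for your own volume count. The sketch below multiplies the quoted rates by the number of volumes, and also estimates the chance of losing at least one volume in a year. That second figure assumes volume failures are independent of one another, which is a simplification and not something Amazon states.

```python
def expected_losses(volume_count, annual_failure_rate):
    """Average number of volumes lost per year."""
    return volume_count * annual_failure_rate


def chance_of_any_loss(volume_count, annual_failure_rate):
    """Probability of losing at least one volume in a year,
    assuming failures are independent (a simplification)."""
    return 1 - (1 - annual_failure_rate) ** volume_count


# The two ends of the quoted range, applied to 1,000 volumes.
for rate in (0.001, 0.002):
    print(
        f"rate {rate:.1%}: "
        f"expect {expected_losses(1000, rate):.1f} lost volumes, "
        f"chance of at least one loss {chance_of_any_loss(1000, rate):.0%}"
    )
```

Under that independence assumption, the chance of at least one loss a year across 1,000 volumes is well over half even at the lower rate.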
What are EBS snapshots?
An EBS snapshot is an image copy of the volume at a particular time; it’s very different from what we mean when we use the term snapshot in storage circles. The first snapshot, the image copy, is a full back-up, and subsequent snapshots are block-level incremental back-ups.
The image is stored as an object in S3, although it is not visible in S3. If you are using EC2 for important applications that are creating data that you would like to keep, any data volumes need to be backed up, and EBS snapshots are an easy, automated way to do that.
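The full-then-incremental scheme described above can be illustrated with a toy model. This is a sketch of the general idea only and makes no claim about how Amazon stores snapshots internally; the volume is modelled as a dictionary of block number to contents, and both function names are invented for the example.

```python
def take_snapshot(volume, previous_state):
    """Return only the blocks that differ from the previous state.

    With no previous state (None) every block is stored: a full back-up.
    """
    if previous_state is None:
        return dict(volume)
    return {
        block: data
        for block, data in volume.items()
        if previous_state.get(block) != data
    }


def restore(snapshots):
    """Rebuild the volume by layering snapshots, oldest first."""
    volume = {}
    for snapshot in snapshots:
        volume.update(snapshot)
    return volume


volume = {0: "boot", 1: "logs-v1", 2: "db-v1"}
first = take_snapshot(volume, None)  # full: all three blocks

state_at_first = dict(volume)
volume[2] = "db-v2"  # one block changes
second = take_snapshot(volume, state_at_first)  # incremental: block 2 only

print(len(first), len(second))  # 3 1
print(restore([first, second]) == volume)  # True
```

The second snapshot holds a single block, yet layering it over the first reproduces the whole volume.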
When looking at the best practices pages for EBS snapshots, the phrase "everything old is new again" comes to mind. Since EBS snapshots are an image-level copy of a volume, you need to make sure you are not changing the volume as you are creating the snapshot.
The recommended way is to make sure that any instance using the volume is turned off so that you are not writing data during a back-up.
That's not really possible for most people, so the best they can do is to run a command inside the VM to temporarily halt writes to the volume while they take a snapshot.
Or, if the VM in question is running Windows, it is also possible to integrate with the Windows VSS service so that Windows takes an application-consistent snapshot before you take your volume-level snapshot. If you are not running Windows, pre- and post-scripting is really your only option for ensuring data integrity when taking a snapshot.
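The pre- and post-scripting pattern comes down to three steps: pause writes, take the snapshot, resume writes. The one thing that must not go wrong is leaving the volume frozen if the snapshot fails. Here is a minimal sketch of that wrapper in Python. The three callables are placeholders; on a Linux instance they might run `fsfreeze --freeze` and `fsfreeze --unfreeze` against the mount point and call boto3's `create_snapshot`, but those choices are assumptions, not something the EBS documentation prescribes here.

```python
def snapshot_with_quiesce(freeze, take_snapshot, thaw):
    """Pause writes, take the snapshot, then always resume writes."""
    freeze()
    try:
        return take_snapshot()
    finally:
        # Runs even if the snapshot call raises, so the
        # volume is never left frozen.
        thaw()


# Stand-in callables that only record what happened; "snap-example"
# is a made-up identifier used for illustration.
events = []
snapshot_id = snapshot_with_quiesce(
    freeze=lambda: events.append("freeze"),
    take_snapshot=lambda: events.append("snapshot") or "snap-example",
    thaw=lambda: events.append("thaw"),
)
print(events)  # ['freeze', 'snapshot', 'thaw']
print(snapshot_id)  # snap-example
```

Putting the thaw step in a `finally` clause is the design choice that matters: an instance stuck with frozen writes is a worse outcome than a missed back-up.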
Snapshots do have a cost, so make sure that your process for creating snapshots as back-ups of your EBS data also automatically deletes them once they pass a particular age.
This will help reduce your S3 bill. One way to do this is to use the Amazon Data Lifecycle Manager. It identifies each snapshot with a tag, and you can specify the snapshot’s entire lifecycle on creation, including how long it should be kept.
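If you script retention yourself rather than relying on a lifecycle tool, the core of it is a simple age check. The sketch below shows only that selection logic, under the assumption that you already have a list of snapshot IDs and start times; the function name, the IDs and the dates are invented for the example, and the actual deletion call is deliberately left out.

```python
from datetime import datetime, timedelta, timezone


def expired_snapshots(snapshots, max_age_days, now=None):
    """Return the IDs of snapshots older than max_age_days.

    `snapshots` is a list of (snapshot_id, start_time) pairs with
    timezone-aware start times.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    return [snap_id for snap_id, started in snapshots if started < cutoff]


# An arbitrary fixed "now" keeps the example reproducible.
now = datetime(2019, 10, 1, tzinfo=timezone.utc)
inventory = [
    ("snap-old", now - timedelta(days=45)),
    ("snap-recent", now - timedelta(days=3)),
]
print(expired_snapshots(inventory, max_age_days=30, now=now))  # ['snap-old']
```

Keeping the selection separate from the deletion makes it easy to review what would be removed before anything is.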
There are also third-party tools, both free and commercial, that add functionality to what Amazon offers out of the box. Commercial tools can help automate copying EBS snapshots to another region and another account, to protect against a hacker gaining access to your account.
EBS is the main option for block storage in Amazon Web Services (AWS). But it is not magic, and it is not automatically protected against all things that might do it harm. Make sure to take advantage of the back-up services AWS provides so that if the worst happens you can easily recover.