Amazon has posted an essay-length explanation of the cloud outage that took offline some of the Web's most popular services last week. In summary, it appears that human error during a system upgrade meant a redundant backup network for the Elastic Block Store (EBS) accidentally took on all the network traffic in the U.S. East Region, overloading it and jamming up the system.
At the end of a long battle to restore services, Amazon says it managed to recover most data but 0.07 percent "could not be restored for customers in a consistent state". A rather miserly 10-day usage credit is being given to users, although users should check their Amazon Web Services (AWS) control panel to see if they qualify. No doubt several users are also consulting the AWS terms and conditions right now, if not lawyers.
A software bug played a part, too. Although unlikely to occur in normal EBS usage, the bug became a substantial problem because of the sheer volume of failures that were occurring. Amazon also says its warning systems were not "fine-grained enough" to spot additional issues while other, louder alarm bells were ringing.
Amazon calls the outage a "re-mirroring storm." EBS is essentially the storage component of the Elastic Compute Cloud (EC2), which lets users hire computing capacity in Amazon's cloud service.
EBS works via two networks: a primary one and a secondary network that's slower and used for backup and intercommunication. Both are made up of clusters containing nodes, and each node acts as a separate storage unit.
There are always two copies of a node, meant to preserve data integrity. Creating a fresh copy when one goes missing is called re-mirroring. Crucially, if one node is unable to find a partner node to back up to, it gets stuck and keeps searching until it finds a replacement. Similarly, new nodes also need to create a partner to be valid, and will get stuck until they succeed.
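To make the "stuck until it finds a partner" behavior concrete, here is a toy sketch in Python. It is not Amazon's code: the `Node` class, the `free_space` units and the `re_mirror` function are all hypothetical names invented for illustration, and the real system is far more involved.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical stand-in for an EBS node (illustration only)."""
    name: str
    free_space: int                 # made-up capacity units
    partner: Optional[str] = None   # name of the node holding the copy
    stuck: bool = False             # True while no partner can be found


def re_mirror(node: Node, cluster: List[Node], size: int = 1) -> bool:
    """Look for another node with room for a copy; mark the node stuck if none exists."""
    for candidate in cluster:
        if candidate.name != node.name and candidate.free_space >= size:
            candidate.free_space -= size
            node.partner = candidate.name
            node.stuck = False
            return True
    node.stuck = True   # in the real system the node would keep retrying
    return False


# Three nodes, but only one spare slot in the whole cluster.
cluster = [Node("a", 0), Node("b", 1), Node("c", 0)]
first = re_mirror(cluster[0], cluster)    # "a" claims the spare slot on "b"
second = re_mirror(cluster[2], cluster)   # "c" finds nothing and gets stuck
```

In this sketch the first node succeeds and the second is left stuck, which is the situation the nodes in the outage found themselves in once spare capacity ran out.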
It appears that during a routine system upgrade, all network traffic for the U.S. East Region was accidentally sent to the secondary network. Being slower and of lower capacity, the secondary network couldn't handle this traffic. The error was realized and the changes rolled back, but by that point the secondary network had been largely filled -- leaving some nodes on the primary network unable to re-mirror successfully. When unable to re-mirror, a node stops all data access until it's sorted out a backup, a process that ordinarily takes milliseconds but -- it would transpire -- would now take days, as Amazon engineers fought to fix the system.
Because of the re-mirroring storm that had arisen, it became difficult to create new nodes, as happens normally during everyday EC2 usage. In fact, so many new node creation requests arose, which couldn't be serviced, that the EBS control system also became partially unavailable.
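One way to picture how unserviceable requests can drag down a control system is the toy model below. It assumes the control system has a fixed number of worker slots and that a request which cannot complete holds its slot indefinitely. That is my assumption for illustration, not something the explanation above spells out, and `ControlPlane`, `worker_slots` and `submit` are hypothetical names.

```python
class ControlPlane:
    """Toy control system with a fixed number of worker slots (assumed)."""

    def __init__(self, worker_slots: int) -> None:
        self.worker_slots = worker_slots
        self.in_flight = 0   # requests currently hanging on a slot

    def submit(self, can_complete: bool) -> str:
        """Return 'done', 'blocked' or 'rejected' for one request."""
        if self.in_flight >= self.worker_slots:
            return "rejected"        # every slot is held by a hung request
        if can_complete:
            return "done"            # finishes at once and frees its slot
        self.in_flight += 1          # hangs, holding its slot
        return "blocked"


plane = ControlPlane(worker_slots=3)

# Five node-creation requests arrive while no storage is available.
results = [plane.submit(can_complete=False) for _ in range(5)]

# A perfectly healthy request now arrives and is turned away.
healthy = plane.submit(can_complete=True)
```

Here the first three requests hang, the next two are rejected, and even a request that could have succeeded is refused because nothing is left to handle it.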
Amazon engineers then turned off the capability to create new nodes, essentially putting the brakes on EBS (and therefore EC2 -- this is probably the moment at which many websites and services went offline). Things began to improve but that's when a software bug struck. When many EBS nodes close their requests for re-mirroring at the same time, they fail. This issue hadn't reared its head before because there'd never been a situation in which so many nodes were closing requests simultaneously.
As a result, even more nodes attempted to re-mirror and the situation became worse. The EBS control system was again adversely affected.
Fixing the problem was difficult because EBS was configured not to trust any nodes it thought had failed. Therefore, the Amazon engineers had to physically locate and connect new storage in order to create new nodes to meet the demand -- around 13 percent of existing volumes, which is likely a huge amount of storage. Additionally, they had reconfigured the system to avoid any more failures, but this made bringing the new hardware online very difficult.
Some system reprogramming took place and eventually everything began to return to normal. A snapshot had been made when the crisis hit and Amazon engineers had to restore 2.2 percent of this manually. Eventually 1.04 percent of the data had to be forensically restored (I'm guessing they had to dip into archives and manually extract and restore files). In the end, 0.07 percent of files couldn't be restored. That might not sound a lot, but bearing in mind Amazon Web Services is the steam train driving the Internet, I suspect it's quite a lot of data.
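To get a feel for what those percentages could mean in absolute terms, here is a back-of-the-envelope calculation. Amazon's figures are only percentages, so the total of one million volumes is a number I have made up purely for illustration, and I am assuming every percentage is measured against that same base, which may not be the case.

```python
def affected(total_volumes: int, percent: float) -> int:
    """Number of volumes that a given percentage of the total represents."""
    return round(total_volumes * percent / 100)


HYPOTHETICAL_TOTAL = 1_000_000   # invented figure, not from Amazon

# Percentages quoted in the article, keyed by recovery stage.
stages = {
    "stuck": 13,        # volumes needing new storage
    "manual": 2.2,      # restored manually
    "forensic": 1.04,   # restored forensically
    "lost": 0.07,       # could not be restored
}

counts = {stage: affected(HYPOTHETICAL_TOTAL, pct) for stage, pct in stages.items()}
```

Under that invented total, the 0.07 percent that couldn't be restored would come to 700 volumes. Scale the total up or down and the count moves with it.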
Amazon has, of course, promised to improve across the board -- everything from auditing processes to avoid the error that kicked off the event, to speeding up recovery. There's an apology too, but it's surprisingly short and perhaps not as grovelling as some would like. At this stage of the game I suspect all the AWS engineers want to do is take a few days off.
I was among those who assumed this outage was an extraordinary event. I thought an act of God might be involved somewhere -- maybe a seagull fell into a ventilation pipe and blew up a server.
Sadly, it looks like I was wrong. There are clear failures that could have been seen in advance, and they're going to dent the confidence of anybody using Amazon Web Services. Ultimately, it's clear that nobody ever asked, "What if?"
I don't expect anybody to be giving up on Amazon Web Services right now, largely because it remains one of the cheapest and most accessible services out there. But Amazon's going to have to keep its nose clean in the coming months and years until the great cloud outage is just a memory.