When good disks go bad

It's easy to forget that when things do go wrong, they can go very wrong


Despite their increasing complexity in both size and functionality, storage systems have achieved an impressive level of reliability. This is particularly noteworthy given that they are engineered around electromechanical devices (i.e., disks) that are among the components most prone to failure in the datacenter. The safeguards and redundancies designed into modern storage systems routinely handle most device failures as a matter of course, with little or no impact on overall operation.

As is often the case with technology, storage systems have reached a point where this reliability is taken for granted, especially by those who aren't spending their days (and nights) on the storage management front lines. It's easy to forget that when things do go wrong, they can go very wrong. Occasionally, it's helpful to be reminded.

A friend of mine, who is the CIO of a mid-sized organization, recently shared his 72-hour nightmare with me. Their storage system, which housed key applications and email, inexplicably went down one evening. After contacting vendor support, they learned that the apparent cause of the outage was a known firmware bug that led the controller to believe that multiple drives had failed.

Now, one might ask why the organization, under a valid support contract, had received no prior notification of the firmware update that addressed such a serious problem. It seems that such a notification should have been sent, but it had not been.

If this had been the only problem, the outage, while serious, could have been resolved in fairly short order. Unfortunately, the problem was exacerbated by a series of tech support mishaps during the firmware update and system recovery process. These led to multiple rebuilds and left the system operating needlessly at risk, in a severely degraded mode, for an extended period.

This organization fell victim to what can be described as the Achilles' heel of storage infrastructure: the intersection of technology bugs and human error. This type of risk is highly unpredictable, and unfortunately there are few opportunities to prevent or avoid it. Some steps that can reduce the likelihood of such a situation include:

Verifying that you are receiving notifications of critical patches and updates
Keeping configuration management information current
Establishing a process to quickly flag and update at-risk systems
Triple-checking, escalating, and obtaining expert approval when dealing with vendor technical support in critical recovery situations
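
As a rough illustration of what a process for flagging at-risk systems might look like, the sketch below compares a configuration inventory against a list of critical firmware advisories. Everything in it is hypothetical: the StorageSystem and Advisory records, the find_at_risk function, and the assumption that firmware versions are dotted numeric strings. In practice the inventory would come from your configuration management records and the advisories from your vendor's notices.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class StorageSystem:
    """One entry in a (hypothetical) configuration inventory."""
    name: str
    model: str
    firmware: str  # assumed to be a dotted numeric version, e.g. "7.2.1"


@dataclass(frozen=True)
class Advisory:
    """A (hypothetical) critical notice: a model and the first fixed firmware."""
    model: str
    fixed_in: str
    summary: str


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn "7.10.0" into (7, 10, 0) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def find_at_risk(
    inventory: List[StorageSystem], advisories: List[Advisory]
) -> List[Tuple[StorageSystem, Advisory]]:
    """Return each system running firmware older than an advisory's fix."""
    flagged = []
    for system in inventory:
        for advisory in advisories:
            same_model = system.model == advisory.model
            if same_model and parse_version(system.firmware) < parse_version(advisory.fixed_in):
                flagged.append((system, advisory))
    return flagged


if __name__ == "__main__":
    # Illustrative data only; none of these names refer to real products.
    inventory = [
        StorageSystem("array-a", "X100", "7.2.1"),
        StorageSystem("array-b", "X100", "7.3.0"),
    ]
    advisories = [Advisory("X100", "7.3.0", "controller misreports drive failures")]
    for system, advisory in find_at_risk(inventory, advisories):
        print(f"{system.name}: update to {advisory.fixed_in} ({advisory.summary})")
```

A check like this is only as good as the inventory and advisory data behind it, which is why keeping both current comes first.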

Even with these steps, however, there is no guarantee of risk elimination, as my friend can now attest.

Jim Damoulakis is chief technology officer of GlassHouse Technologies, a leading provider of independent storage services. He can be reached at jimd@glasshouse.com.
