
Microsoft Sidekick Debacle & the Cloud: Lessons Learned

When Microsoft's storage service for Sidekick users broke down, cloud computing questions sprang up -- both fair and unfair.

This week's cloud tempest is the very visible breakdown of Microsoft's Danger storage service for the T-Mobile Sidekick phone. An apologetic email (as reported by TechCrunch) first went out from Microsoft to users noting that all data had been lost with no way to recover it. It now appears that some or most of the data will be recovered, which is, of course, good news. I don't know that Microsoft has provided any formal explanation of what went wrong, but most of the speculation I've seen identifies a failed SAN upgrade, with no data backup available, as the cause of the loss.

People on all sides of the cloud debate have seized on this incident, treating it as though it were a proxy for the entire concept of cloud computing.

While one shouldn't conflate this situation with the totality of cloud computing, it highlights some very important issues that are worth exploring and understanding.

Lessons to be Drawn

It's a cloud: Some of the writing I've seen on this incident downplays it because, in the view of the authors, this service isn't really a cloud offering. They say it's a limited application, or an adjunct service to a hardware device, or that it's really a consumer service and therefore not a "real" cloud application because those are aimed at business users. That's baloney.

First of all, it is a cloud application. It certainly fits the common SaaS definitions. The "it's really a consumer service" rationale won't wash, either. With the blurring of consumer and commercial use, what's personal to one person might be mission-critical to another. And trying to deflect concern about this incident by defining it away misses the point. Cloud computing is a big tent (if I may mix a metaphor), and one of its strengths is that many different approaches can be considered cloud computing. In any case, clever dissembling won't work: if it walks like a duck and quacks like a duck, trying to convince someone that it's not a duck because it's actually a similar-looking, slightly different species is unlikely to succeed.

This attention bespeaks intense interest in the cloud: Let's face it, all the hullabaloo about this incident is good news, because it means people recognize cloud computing is an important development. You don't spend a lot of time worrying about something you don't care about. It's obvious that the concept of cloud computing has garnered attention, which I attribute to the fact that everyone recognizes that the old methods of running IT infrastructure are expensive and don't scale.

This incident represents a breach of best practices: Losing data is the greatest shortcoming an operations group can suffer. A service outage is bad, but losing data is inexcusable. In fact, calling this a breach of best practices gives it too much credit: the term "best practice" describes a set of processes performed by the leaders in a field, not the mainstream. Backing up data is data management 101; really, it's 01. If this incident is truly the result of a failure to back up, it contravenes the most basic practice of data management. Whatever the cause turns out to be, the data should never have been lost.

It calls into question one of the tenets of cloud computing: The expertise of cloud providers. My company does not run its own email service; we use Google to manage our mail system. Is this because we don't know how to run a mail server? Of course not. We do it for a very simple reason: using Google allows us to focus on our core mission, serving our clients.

We are very aware of what would happen if we ran our own mail server. Every time there was a problem, we'd treat it like an inconvenient interruption, and do just the minimum to patch the problem and get back to our real work. We would never devote the full amount of time running a mail server deserves. Therefore, our mail service would always be fragile, subject to interruption, and (most likely) vulnerable to security penetration. So we turn to a company that can devote real resources to running our mail server, one that follows best practices, and one that can take the necessary time to do it right.


An article on CRN blamed the outage on the fact that Microsoft had pulled engineers from Danger onto another project. Frankly, this is, or should be, irrelevant from a user perspective. A cloud provider is running a service and has to be committed to operational excellence, despite any other distractions or competing priorities. Otherwise, it forces the customer to examine the internals of the cloud service. From the customer's perspective, that is impractical, since everyone has limited time to devote to such scrutiny, a problem that will only get worse as we move into a world in which the use of cloud services is rapidly multiplying.

Moreover, most cloud providers don't want a horde of customers insisting on auditing the service; the support required for customer audits is not scalable. Finally, a customer shouldn't have to examine the inner workings of the cloud service. One doesn't question how the local electric utility schedules its generator maintenance, so why should that be necessary for a cloud service? Customers should not have to do detailed evaluations of a cloud service: it's the job of the service provider to ensure appropriate operational processes are in place.

Whatever the reason for the data loss, it calls into question the tenet that cloud computing enables a better level of discipline and expertise to be devoted to a service offering. If a customer can't depend on a cloud provider to perform at a higher level than the customer could do on its own, why should it turn to the cloud?

Likely Outcomes of this Incident

Microsoft evaluates its practices throughout its cloud offerings: I guarantee that one outcome of this incident is that an edict came down from on high: "Make sure no other system is vulnerable to this problem!" There are undoubtedly a bunch of operations groups at Microsoft digging through backup practices to ensure data is stored redundantly and that reliable backups are being performed. Also undoubted is the response of these groups: "How come we're being stuck with a ton of extra work because they screwed up?" Fellas, that's just the way organizations work.

Other cloud providers use this as a "teaching moment": While these cloud companies are wiping their brows in relief, thinking "there but for the grace of God go I," senior management is regarding this incident as an inexpensive way to learn an important lesson and taking it as an opportunity to run a low-risk drill. Of course, if other Microsoft operations groups resent having to do work because of this incident, imagine how ops groups in other companies feel!

Microsoft's credibility suffers a short-term hit: Some people will generalize this situation to all of Microsoft's offerings, and be more cautious about using them. Let me be clear: I don't believe this situation represents Microsoft's typical operations practices. Hotmail is a far larger service, and I don't recall hearing anything like this happening with it. Nevertheless, Microsoft's overall cloud reputation will be tarnished for a while.

The best thing for Microsoft would be to treat this as a crisis-management event and follow the established playbook: early apologies, full transparency, frequent updates. That still won't prevent people from re-evaluating their opinions in the short term, but it will help those initial re-evaluations return to their long-term assessments more quickly.

Cloud computing in general suffers a short-term hit: Any time one market participant suffers a significant blow, the concern spreads to others. All cloud providers are going to be questioned about their competence regarding storage practices. It's inevitable and unavoidable. Rather than resisting it, they should take it as an opportunity to explain how seriously they take this topic and describe at length the extensive, redundant, and highly structured processes they have in place to avoid issues like this one. That won't stop people from querying the provider, but it shows responsiveness and provides an opportunity to pick up share.

Long-term, this is a minor bump in the road: Of course this is a significant incident, and of course a very difficult situation for those affected by it, but in the long run it will be looked back on as a minor bump. Cloud computing is gaining momentum, driven by an appreciation of its strengths and cost efficiencies, and a problem, even one as serious as this, will not long hinder its progress.

Bernard Golden is CEO of consulting firm HyperStratus, which specializes in virtualization, cloud computing and related issues. He is also the author of "Virtualization for Dummies," the best-selling book on virtualization to date.