This vendor-written tech primer has been edited by Network World to eliminate product promotion, but readers should note it will likely favor the submitter's approach.
The economics, scale, and manageability of cloud storage simply cannot be matched even by the largest enterprise datacenters.
Hyperscale cloud storage providers like AWS, Google and Azure dropped prices by up to 65% last year and promised a Moore's Law pricing model going forward. AWS provides eleven 9's of durability, meaning if you store 10,000 objects with Amazon S3, you can, on average, expect to incur a loss of a single object once every 10,000,000 years. Further, Amazon S3 is designed to sustain the concurrent loss of data in two facilities by storing objects on multiple devices across multiple facilities.
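The durability figure above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming that "eleven 9's" means each object has an independent 1-in-100-billion chance of being lost in any given year; the function name is illustrative, not part of any AWS API.

```python
from fractions import Fraction

# Assumption: eleven 9's of durability = annual loss probability of 10^-11 per object.
ANNUAL_LOSS_PROBABILITY = Fraction(1, 10**11)


def years_per_expected_loss(object_count: int) -> Fraction:
    """Average number of years between single-object losses for a given object count."""
    expected_losses_per_year = object_count * ANNUAL_LOSS_PROBABILITY
    return 1 / expected_losses_per_year


print(years_per_expected_loss(10_000))  # 10000000
```

Exact fractions are used instead of floats so the result is not distorted by rounding at eleven decimal places.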
Unfortunately, until recently, cloud storage has really only been useful for the data you don't use rather than the data you actually use. In other words, cloud storage is cheap and deep but hasn't been able to offer the performance of local storage. For cloud storage to be useful for unstructured data storage, it needs to provide the same flexibility, performance and productivity as enterprise storage systems. The cost advantage by itself, compelling as it is, simply won't be enough.
In order to use the cloud for both your active data and your inactive data, it has to feel equal to or better than the local filers that are already in place. For this to happen, the following key requirements must be in place:
* Cache locally: Given the user expectation of LAN-like file access times, active data needs to be cached locally while inactive data is stored in the cloud. While most data isn't accessed very often and is perfectly suited for the cloud, active data needs to remain close to the user. Machine learning based on file usage, "pinned folders," or a combination of both methods needs to be employed to make sure the right files are cached locally while less-used files recede back into the cloud.
* Global deduplication: Global deduplication ensures that only one copy of each unique block of data is stored in the cloud and cached locally. Because many blocks are common across files, global deduplication reduces the amount of data that is stored in the cloud and sent between the cloud and the local caches, as only changes are stored and sent. For example, when Electronic Arts centralized its data in cloud storage, its total storage footprint dropped from 1.5 PB to only 45 TB. The time needed to transfer 50 GB game builds between offices dropped from up to 10 hours to minutes, as only the changes to the builds were actually sent.
* NAS-like responsiveness: File directory browsing must be as responsive as a local NAS. In order to do this, not only should the active data be cached locally, but the metadata of all files, not just cached files, must also be cached in SSD at all sites. SSDs are necessary because the user is seeing a full representation of all the files in the entire file system even though less than 5% of the files are locally cached. When the user navigates up and down files and folders in the network drive, it has to "feel" like all those files are there. As a portion of the file metadata is often displayed alongside the file name, and file locking has to be instantaneous for any file even if not locally cached, metadata has to be accessed as fast as possible. Without all the file metadata in cache, users think that their computer or network is running slow, as navigating a folder is one of the most basic functions.
* Support for "chatty" applications: Applications must work across sites as well as they work at a single site. Many technical applications (CAD, PLM, BIM) are extremely chatty, which normally increases the time to open, save, or sync a file from less than 30 seconds on a local NAS to over 20 minutes when centralized in the cloud. Most people think this is a bandwidth issue, but in fact it is a latency issue: each of the application's many sequential file operations has to wait for a network round trip.
For example, a common CAD application has nearly 16,000 sequential file operations that need to occur before a file is opened. If the authoritative copy is on the same LAN, the file lock is only 0.5 milliseconds away, so the file opens in 8 seconds (16,000 * 0.5 milliseconds). However, chattiness causes massive delays over a WAN. If a file that is centralized in Syracuse is opened from San Diego, the file lock is 86 milliseconds away (the round trip latency from San Diego to Syracuse), so it takes 16,000 * 86 milliseconds for the file to open -- approximately 23 minutes. The actual data transfer is a fraction of that time.
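The arithmetic in this example is easy to reproduce. The snippet below is a back-of-the-envelope sketch that assumes every file operation is strictly sequential and costs exactly one round trip, ignoring data transfer time; the function name is illustrative.

```python
def open_time_seconds(sequential_ops: int, round_trip_ms: float) -> float:
    """Time spent waiting on round trips when each operation must finish before the next starts."""
    return sequential_ops * round_trip_ms / 1000.0


SEQUENTIAL_OPS = 16_000  # file operations before the CAD file opens (figure from the example above)

lan_seconds = open_time_seconds(SEQUENTIAL_OPS, 0.5)  # lock holder on the same LAN
wan_seconds = open_time_seconds(SEQUENTIAL_OPS, 86)   # San Diego to Syracuse round trip

print(f"LAN: {lan_seconds:.0f} seconds")        # LAN: 8 seconds
print(f"WAN: {wan_seconds / 60:.1f} minutes")   # WAN: 22.9 minutes
```

Note that bandwidth never appears in the calculation: doubling the size of the pipe would not change either number, while halving the round trip time would halve both.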
* Data integrity and cross-site locking: When data lives on a file server, we only have to worry about maintaining one consistent copy (as long as the file is locked when a user is editing it). This changes when data lives in the cloud but is accessed from many sites. To avoid file corruption when using cloud storage, you need two things:
- A clear separation between the authoritative copy of data in the cloud and the local cache copy at each site. A "transactionally consistent" file system can maintain file integrity even if there's a hardware or power failure -- without falling back on a file system check or earlier file version. This assures data integrity in a distributed environment.
- Granular component-level locking that works across sites and can lock portions of files rather than just entire files. When you're working across sites, the cloud can't be an intermediary for file lock data. There needs to be direct connectivity between sites to keep data current and maintain effective byte-level locking.
* Better-than-local security: Look for four security capabilities: encryption across the file system; secure key management -- keys should never be sent to or stored in the cloud; lock management integration with other security tools; and compliance with relevant security standards like FIPS 140-2.
* Flexibility to change: You never know when you might need to change cloud providers -- remember Nirvanix? You might also want to use two cloud providers, essentially using one as a secondary site. A global file system should support both scenarios.
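As a rough illustration of the global deduplication requirement above, the sketch below stores each unique block once, keyed by its SHA-256 hash, and reports how many bytes actually had to be stored or sent for each file. The `BlockStore` class, the fixed 4-byte block size and the sample "builds" are all hypothetical; production systems use far larger blocks and may chunk data differently.

```python
import hashlib

BLOCK_SIZE = 4  # deliberately tiny so the example is easy to follow


class BlockStore:
    """Hypothetical content-addressed store that keeps one copy of each unique block."""

    def __init__(self):
        self.blocks = {}  # SHA-256 hex digest -> block bytes

    def put_file(self, data: bytes):
        """Store a file; return (list of block digests, number of bytes that were new)."""
        recipe, new_bytes = [], 0
        for offset in range(0, len(data), BLOCK_SIZE):
            block = data[offset:offset + BLOCK_SIZE]
            digest = hashlib.sha256(block).hexdigest()
            if digest not in self.blocks:  # only previously unseen blocks are stored
                self.blocks[digest] = block
                new_bytes += len(block)
            recipe.append(digest)
        return recipe, new_bytes

    def get_file(self, recipe) -> bytes:
        """Reassemble a file from its list of block digests."""
        return b"".join(self.blocks[digest] for digest in recipe)


store = BlockStore()
build_v1 = b"AAAABBBBCCCCDDDD"
build_v2 = b"AAAABBBBXXXXDDDD"  # same build with one block changed

recipe_v1, sent_v1 = store.put_file(build_v1)
recipe_v2, sent_v2 = store.put_file(build_v2)

print(sent_v1, sent_v2)  # 16 4
```

The second build is 16 bytes long, but only the 4 changed bytes are new to the store, which is the same effect that shrinks storage footprints and build transfer times at scale.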
Many companies already use the cloud as primary storage for multiple sites. For example, C&S Companies and Mead & Hunt both support Autodesk Revit and CAD files in the cloud for distributed project teams; Electronic Arts runs intensive software development applications across 40 sites with file data in cloud storage; and Milwaukee Electric Tool uses cloud storage for all its files, a move driven by the need to collaborate on CAD and video files between the US and China.
Many complex applications and data will continue to need a local SAN or NAS -- or something that behaves exactly like one. Data security, application type, file size or complexity and other concerns mean some data needs to stay in the organization. But the costs and inflexibility of traditional storage -- particularly when application data is shared across multiple offices -- are slowing businesses down. Finding a cloud storage solution that supports the requirements outlined here can make the cloud a primary storage option in addition to the DR, backup and archiving role that it has played to date.
Randy Chou is the co-founder and CEO of Panzura.