Interest in AI applications is on the rise, but infrastructure and operations (I&O) leaders are often unprepared to address the storage requirements of the growing and diverse datasets used in large-scale machine learning (ML).
If you’re selecting infrastructure for AI workloads involving ML and deep learning, you must understand the unique requirements of these emerging workloads, especially if you’re looking to use them to accelerate innovation and agility.
There are three main impacts that AI workloads have on data management and storage.
1. Distinct characteristics require complementary storage architectures
The different stages of AI workloads, which comprise compute-intensive ML and deep neural networks (DNNs), have distinct input/output (I/O) characteristics. This requires complementary storage architectures to be deployed.
ML workloads are complex and differ not just from traditional enterprise stacks, but also from high-performance computing workloads. The success of ML and AI initiatives relies on orchestrating effective data pipelines that deliver high-quality data in the right formats in a timely manner during the different stages of the AI pipeline.
These stages are data collection and integration; data preparation and cleaning; model training, where initiatives can be broadly classified as statistical ML or DNN workloads; and inference, where the deployed model needs to analyse the most current snapshot of the data and provide analysis in near real time.
As you start evaluating new platforms and products to support AI projects, remember that it will be difficult to find a “one size fits all” storage and data platform for the entire workflow.
On one hand, compute-intensive ML and DNN workloads present new and unique performance challenges that require new approaches for accommodating both high throughput and low latency at scale. On the other, they require the ability to store multi-petabyte datasets at the lowest possible cost, while enabling data mobility between edge, core and public cloud deployments.
2. Unique requirements will cause approaches to be re-evaluated
The unique requirements of AI and ML workloads will lead you to re-evaluate your approach to storage selection and embrace new technology and deployment methods. Deploying a storage platform for AI and ML presents new challenges, because a large-scale deployment needs to accommodate all stages of AI data processing.
To determine your storage characteristics and requirements, it’s important to assess the storage platform against workload type; protocol; deployment method (edge, on-premises or cloud); and technologies.
Then select and evaluate storage platforms against the requirements of large-scale AI and ML implementations. There are a number of factors to consider, particularly whether the platforms offer portability, scalability, performance, interoperability, a software-defined architecture and cost optimisation.
Choose vendors and products that can deliver high performance for both bandwidth-oriented batch workloads and small-file workloads, as most traditional solutions can’t deliver good performance for both sequential and random storage I/O.
Avoid focusing on performance characteristics alone. Evaluate all storage offerings across cost, portability, scale, deployment options and interoperability requirements.
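One way to keep performance from dominating the decision is to score each candidate against all of the criteria above and compare weighted totals. The following is a minimal sketch in Python, not a prescribed method: the weights, the platform names (platform_a, platform_b) and the 1 to 5 scores are hypothetical placeholders that you would replace with your own priorities and evaluation results.

```python
# Hypothetical weighted-scoring sketch. The weights, platform names and
# scores are illustrative placeholders, not measurements or recommendations.

CRITERIA_WEIGHTS = {
    "performance": 0.25,
    "cost": 0.20,
    "portability": 0.15,
    "scale": 0.15,
    "deployment_options": 0.10,
    "interoperability": 0.15,
}


def weighted_score(scores, weights=CRITERIA_WEIGHTS):
    """Return the weighted average of per-criterion scores (e.g. 1 to 5)."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    total_weight = sum(weights.values())
    return sum(scores[name] * w for name, w in weights.items()) / total_weight


def rank_platforms(candidates, weights=CRITERIA_WEIGHTS):
    """Sort candidate platforms from highest to lowest weighted score."""
    scored = [(name, weighted_score(s, weights)) for name, s in candidates.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Assumed example: one platform that is fastest but weak elsewhere,
# and one that is balanced across every criterion.
candidates = {
    "platform_a": {
        "performance": 5, "cost": 2, "portability": 2,
        "scale": 4, "deployment_options": 2, "interoperability": 3,
    },
    "platform_b": {
        "performance": 4, "cost": 4, "portability": 4,
        "scale": 4, "deployment_options": 4, "interoperability": 4,
    },
}

for name, score in rank_platforms(candidates):
    print(f"{name}: {score:.2f}")
```

With these placeholder numbers the balanced platform ranks above the one that wins on performance alone, which is the kind of trade-off a performance-only comparison would hide.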
Develop an integrated approach for data management across the different AI pipeline phases and deployment options (edge, core and public cloud) to avoid introducing storage silos.
3. Nascent vendor ecosystem is complicating vendor selection
The vendor ecosystem supporting ML workloads is nascent but rapidly evolving, which complicates vendor selection for I&O leaders.
ML and DNN workloads have a significant impact on storage architecture. Due to the parallel processing capabilities and sheer density of specialised processors like graphics processing units (GPUs), reading training data from disk-based systems is one of the most common bottlenecks.
Design your network and storage subsystems to reduce I/O bottlenecks, so that you get the full value of your investments in specialised compute hardware such as GPUs.
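A rough way to size the problem is to estimate the aggregate read rate the GPUs can consume and compare it with what the storage layer can sustain. The Python sketch below is a back-of-envelope calculation under loud assumptions: every sample is read from storage rather than from a cache, and the GPU count, per-GPU sample rate, average sample size and storage throughput are all hypothetical figures chosen only to illustrate the arithmetic.

```python
# Back-of-envelope sizing sketch. All input figures are hypothetical
# placeholders; measure your own training jobs before relying on any of them.


def required_read_gb_per_s(num_gpus, samples_per_sec_per_gpu, avg_sample_bytes):
    """Aggregate read rate in GB/s (decimal) needed so that no GPU waits for data.

    Assumes every sample is fetched from storage, with no caching.
    """
    return num_gpus * samples_per_sec_per_gpu * avg_sample_bytes / 1e9


def gpu_utilisation_ceiling(storage_gb_per_s, required_gb_per_s):
    """Upper bound on the fraction of time the GPUs can be kept busy by storage."""
    if required_gb_per_s <= 0:
        raise ValueError("required_gb_per_s must be positive")
    return min(1.0, storage_gb_per_s / required_gb_per_s)


# Assumed example: 8 GPUs, each consuming 2,000 samples per second,
# with an average sample size of 150 KB.
needed = required_read_gb_per_s(
    num_gpus=8,
    samples_per_sec_per_gpu=2_000,
    avg_sample_bytes=150_000,
)

# Assumed storage system that sustains 1.2 GB/s of reads.
ceiling = gpu_utilisation_ceiling(storage_gb_per_s=1.2, required_gb_per_s=needed)

print(f"required read throughput: {needed:.1f} GB/s")
print(f"GPU utilisation ceiling: {ceiling:.0%}")
```

In this assumed scenario the storage layer delivers half of what the GPUs could consume, so the GPUs sit idle for roughly half the time regardless of how much was spent on them.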
Many large incumbent vendors are repositioning their distributed file systems for AI workloads, while we’re also seeing several emerging vendors in this space.
Vendors competing for AI workloads can be broadly categorised as appliance vendors primarily focused on on-premises deployments; software-defined storage vendors agnostic to where workloads are run; and native cloud storage services primarily provided by large public cloud IaaS vendors such as Amazon Web Services (AWS), Azure and Google.
Choose vendors that provide the broadest platform support and flexibility in deployment models, since many AI workloads will span from data centre to cloud to edge.
Julia Palmer is a research director at Gartner, covering emerging infrastructure technologies such as hyperconverged infrastructure, software-defined storage, hybrid and public cloud storage, distributed file systems, object storage and solid-state storage.