How can you improve HPC cluster utilisation?

Building an on-premises elastic HPC cluster

Comments

HPC clusters require both high capex investment to procure and build, and substantial opex to operate. Therefore, high ROI is expected. However, this is not the case for most HPC installations. True, HPC clusters are mostly used for research purposes so they cannot be equally compared to commercial investments. Nevertheless, considering this context the ROI here can be measured in terms of research productivity.

HPC clusters are widely used at universities and research organisations, and due to the multidisciplinary nature of most of these organisations, the HPC clusters are logically expected to run a variety of computation programs to solve different kinds of problems.

However, in most cases, they are exclusively used by one or a couple of disciplines because of organisational and funding reasons which are in fact invisibly influenced by technical reasons, which are in turn invisibly caused by a poor technical design.

Most HPC clusters run a specific operating system as their sole software platform, and consequently the applications that can be used on a particular HPFC are naturally limited. This static model is acceptable if the HPC cluster was procured and built for specific users, and is actually being utilised above 70 per cent throughout its lifecycle, otherwise the HPC cluster is not wisely utilised.

HPC clusters can be elastic. HPC computing nodes should be viewed as a pool of reusable hardware resources that can dynamically host multiple software platforms; hence they can accommodate a larger variety of applications and serve larger number of different users.

HPC computing nodes can be dynamically segmented into multiple auto-scalable computing platforms based on demands and priorities.

For instance, one segment runs a Linux operating system, and another segment runs a Windows operating system. If the Linux segment is under-utilised and there is more demand on the Windows segment, then automatically remove idle computing nodes from the Linux segment and allocate them to the Windows segment, and vice versa.

Elastic HPC clusters can easily be built on public cloud computing platforms due to their inherent elasticity. However, the same model can technically be achieved for on-premises HPC clusters by building it as a private cloud platform, or by virtualizing the computing nodes and introducing an automation layer on top of the virtualized computing nodes, or by only introducing an automation layer on top of the bare-metal computing nodes.

Click here for a high-resolution version.

The automation layer can be integrated with a billing system in order to charge all disciplines separately. In this manner, the HPC cluster can be leveraged efficiently and effectively by all the disciplines of the organisation as needed, and hence maximise the ROI on the HPC investment.

Abraham Alawi is a solutions architect and DevOps engineer who has worked across a number of prominent Australian enterprises.