Understanding Azure HPC

Way back when, so the story goes, someone said we’d only need five computers for the whole world. It’s quite easy to argue that Azure, Amazon Web Services, Google Cloud Platform, and the like are all implementations of a massively scalable compute cluster, with each server and each data center another component that adds up to build a huge, planetary-scale computer. In fact, many of the technologies that power our clouds were originally developed to build and run supercomputers using off-the-shelf commodity hardware.

Why not take advantage of the cloud to build, deploy, and run HPC (high-performance computing) systems that exist for only as long as we need them to solve problems? You can think of clouds in much the same way the filmmakers at Weta Digital thought about their render farms, server rooms of hardware built out to be ready to deliver the CGI effects for films like King Kong and The Hobbit. The equipment doubled as a temporary supercomputer for the New Zealand government while waiting to be used for filmmaking.

The first big case studies of the public clouds focused on this capability, using them for burst capacity that in the past might have gone to on-premises HPC hardware. They showed a considerable cost saving with no need to invest in data center space, storage, and power.

Introducing Azure HPC

HPC capabilities remain an important feature for Azure and other clouds, no longer relying on commodity hardware but now offering HPC-focused compute instances and working with HPC vendors to offer their tools as a service, treating HPC as a dynamic service that can be launched quickly and easily while being able to scale with your requirements.

Azure’s HPC tools can perhaps best be thought of as a set of architectural principles, focused on delivering what Microsoft describes as “big compute.” You’re taking advantage of the scale of Azure to perform large-scale mathematical tasks. Some of these tasks might be big data tasks, whereas others might be more focused on compute, using a limited number of inputs to perform a simulation, for instance. These tasks include creating time-based simulations using computational fluid dynamics, or running through multiple Monte Carlo statistical analyses, or putting together and running a render farm for a CGI movie.

Azure’s HPC features are intended to make HPC available to a wider class of users who may not need a supercomputer but do need a higher level of compute than an engineering workstation or even a small cluster of servers can provide. You won’t get a turnkey HPC system; you’ll still need to build out either a Windows or Linux cluster infrastructure using HPC-focused virtual machines and an appropriate storage platform, as well as interconnects using Azure’s high-throughput RDMA networking features.

Building an HPC architecture in the cloud

Technologies such as ARM and Bicep are key to building out and maintaining your HPC environment. It’s not like Azure’s platform services, as you are responsible for most of your own maintenance. Having an infrastructure-as-code basis for your deployments should make it easier to treat your HPC infrastructure as something that can be built up and torn down as necessary, with identical infrastructures each time you deploy your HPC service.

Microsoft provides several different VM types for HPC workloads. Most applications will use the H-series VMs which are optimized for CPU-intensive operations, much like those you’d expect from computationally demanding workloads focused on simulation and modelling. They’re hefty VMs, with the HBv3 series giving you as many as 120 AMD cores and 448GB of RAM; a single server costs $9.12 an hour for Windows or $3.60 an hour for Ubuntu. An Nvidia InfiniBand network helps build out a low-latency cluster for scaling. Other options offer older hardware for lower cost, while smaller HC and H-series VMs use Intel processors as an alternative to AMD. If you need to add GPU compute to a cluster, some N-series VMs offer InfiniBand connections to help build out a hybrid CPU and GPU cluster.

It’s important to note that not all H-series VMs are available in all Azure regions, so you may need to choose a region away from your location to find the right balance of hardware for your project. Be prepared to budget several thousand dollars a month for large projects, especially when you add storage and networking. On top of VMs and storage, you’re likely to need a high-bandwidth link to Azure for data and results.

Once you’ve chosen your VMs, you need to pick an OS, a scheduler, and a workload manager. There are many different options in the Azure Marketplace, or if you prefer, you can deploy a familiar open source solution. This approach makes it relatively simple to bring existing HPC workloads to Azure or build on existing skill sets and toolchains. You even have the option of working with cutting-edge Azure services like its growing FPGA support. There’s also a partnership with Cray that delivers a managed supercomputer you can spin up as needed, and well-known HPC applications are available from the Azure Marketplace, simplifying installation. Be prepared to bring your own licenses where necessary.

Managing HPC with Azure CycleCloud

You don’t have to build an entire architecture from scratch; Azure CycleCloud is a service that helps manage both storage and schedulers, giving you an environment to manage your HPC tools. It’s perhaps best compared to tools like ARM, as it’s a way to build infrastructure templates that focus on a higher level than VMs, treating your infrastructure as a set of compute nodes and then deploying VMs as necessary, using your choice of scheduler and providing automated scaling.

Everything is managed through a single pane of glass, with its own portal to help control your compute and storage resources, integrated with Azure’s monitoring tools. There’s even an API where you can write your own extensions to add additional automation. CycleCloud isn’t part of the Azure portal, it installs as a VM with its own web-based UI.

Big compute with Azure Batch

Although most of the Azure HPC tools are infrastructure as a service, there is a platform option in the shape of Azure Batch. This is designed for intrinsically parallel workloads, like Monte Carlo simulations, where each part of a parallel application is independent of every other part (though they may share data sources). It’s a model suitable for rendering frames of a CGI movie or for life sciences work, for example analyzing DNA sequences. You provide software to run your task, built to the Batch APIs. Batch allows you to use spot instances of VMs where you’re cost sensitive but not time dependent, running your jobs when capacity is available.

Not every HPC job can be run in Azure Batch, but for the ones that can, you get interesting scalability options that help keep costs to a minimum. A monitor service helps manage Batch jobs, which may run several thousand instances at the same time. It’s a good idea to prepare data in advance and use separate pre- and post-processing applications to handle input and output data.

Using Azure as a DIY supercomputer makes sense. H-series VMs are powerful servers that provide plenty of compute capability. With support for familiar tools, you can migrate on-premises workloads to Azure HPC or build new applications without having to learn a whole new set of tools. The only real question is economical: Does the cost of using on-demand high-performance computing justify switching away from your own data center?

Understanding Azure HPC

Build your own supercomputer in the cloud with Azure’s high-performance computing tools.

Introducing Azure HPC

Building an HPC architecture in the cloud

Managing HPC with Azure CycleCloud

Big compute with Azure Batch