Announcing AWS Parallel Computing Service to run HPC workloads at virtually any scale

Today we announce AWS Parallel Computing Service (AWS PCS), a new managed service that helps customers set up and manage High Performance Computing (HPC) clusters so they can run their simulations seamlessly on AWS at virtually any scale. Using the Slurm scheduler, they can work in a familiar HPC environment and get to results faster, instead of worrying about infrastructure.

In November 2018, we introduced AWS ParallelCluster, an AWS-supported open source cluster management tool that helps you deploy and manage HPC clusters in the AWS Cloud. AWS ParallelCluster also enables customers to quickly build and deploy proof-of-concept and production HPC compute environments. You can use the AWS ParallelCluster command line interface, API, Python library, or user interface, installed from open source packages. Customers are responsible for updates, which may include tearing down and redeploying clusters. However, many customers have asked us for a fully managed AWS service that eliminates these operational tasks when building and running HPC environments.

AWS PCS simplifies running managed HPC environments on AWS and is accessible through the AWS Management Console, AWS SDKs, and AWS Command Line Interface (AWS CLI). Your system administrators can create managed Slurm clusters that use their compute and storage configurations, identities, and job allocation preferences. AWS PCS uses Slurm, a highly scalable, fault-tolerant job scheduler used by a wide range of HPC customers, to schedule and orchestrate simulations. End users such as scientists, researchers, and engineers can log in to AWS PCS clusters to run and manage HPC jobs, use interactive software on virtual desktops, and access data. They can bring their workloads to AWS PCS quickly, without significant effort to port code.

You can also use fully managed NICE DCV remote desktops for visualization, and access job telemetry or application logs, so specialists can manage your HPC workflows in one place.

AWS PCS is designed for a wide range of traditional and emerging compute- or data-intensive engineering and scientific workloads in areas such as computational fluid dynamics, weather modeling, finite element analysis, electronic design automation, and reservoir simulations, using familiar methods to prepare, run, and analyze simulations and computations.

Getting Started with AWS Parallel Computing Service
To try out AWS PCS, you can work through our tutorial on creating a simple cluster in the AWS documentation. First, you create a virtual private cloud (VPC) using an AWS CloudFormation template and shared storage in Amazon Elastic File System (Amazon EFS) in your account, in the AWS Region where you want to test AWS PCS. For more information, see Create a VPC and Create shared storage in the AWS documentation.
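
If you prefer scripting this setup, the stack that creates the VPC can also be launched with the AWS CLI. The sketch below is illustrative only: the stack name is arbitrary and the template URL is a placeholder for the template referenced in the tutorial.

# Launch the CloudFormation stack that creates the tutorial VPC (template URL is a placeholder)
$ aws cloudformation create-stack \
    --stack-name pcs-tutorial-vpc \
    --template-url https://<bucket>.s3.amazonaws.com/<path-to-vpc-template>.yaml \
    --capabilities CAPABILITY_IAM

# Wait until the stack, including its subnets and security groups, has finished creating
$ aws cloudformation wait stack-create-complete --stack-name pcs-tutorial-vpc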

1. Create a cluster
In the AWS PCS console, choose Create cluster. A cluster is a persistent resource for managing resources and running workloads.

Next, enter your cluster name and choose the controller size of your Slurm scheduler. You can choose Small (up to 32 nodes and 256 jobs), Medium (up to 512 nodes and 8,192 jobs), or Large (up to 2,048 nodes and 16,384 jobs), depending on the limits of your cluster workloads. In the Networking section, select your created VPC, the subnet in which to launch the cluster, and the security group applied to your cluster.

Optionally, you can set Slurm configuration options, such as the idle time before compute nodes scale down, a Prolog and Epilog script directory on launched compute nodes, and a resource selection algorithm parameter used by Slurm.

Choose Create cluster. It takes some time for the cluster to be deployed.
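
If you prefer the AWS CLI over the console, cluster creation looks roughly like the sketch below. The option names and shorthand syntax are my assumptions based on the settings described above, so verify them against the AWS PCS CLI reference; the subnet and security group IDs are placeholders.

# Create a small Slurm 23.11 cluster in an existing subnet and security group (IDs are placeholders)
$ aws pcs create-cluster \
    --cluster-name my-cluster \
    --scheduler type=SLURM,version=23.11 \
    --size SMALL \
    --networking subnetIds=subnet-0123456789abcdef0,securityGroupIds=sg-0123456789abcdef0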

2. Create compute node groups
After creating your cluster, you can create compute node groups, a virtual collection of Amazon Elastic Compute Cloud (Amazon EC2) instances that AWS PCS uses to provide interactive access to a cluster or to run jobs in a cluster. When you define a compute node group, you specify general characteristics such as EC2 instance types, minimum and maximum instance count, target VPC subnets, Amazon Machine Image (AMI), purchase option, and custom launch configuration. Compute node groups require an instance profile for an AWS Identity and Access Management (IAM) role for the EC2 instances and an EC2 launch template that AWS PCS uses to configure the EC2 instances it launches. For more information, see Create a launch template and Create an instance profile in the AWS documentation.

To create a compute node group in the console, go to your cluster, select the Compute node groups tab, and then choose the Create compute node group button.

You can create two compute node groups: a login node group for end-user access and a job node group for running HPC jobs.

To create a compute node group that runs HPC jobs, enter a compute node group name, and select a previously created EC2 launch template, IAM instance profile, and subnets in which to launch compute nodes in your cluster VPC.

Next, select the preferred EC2 instance types to use when launching compute nodes, as well as the minimum and maximum instance count for scaling. In this example, I chose the hpc6a.48xlarge instance type and limited scaling to a maximum of eight instances. For a login node, you can choose a smaller instance type, such as c6i.xlarge. You can also use either the On-Demand or Spot EC2 purchase option if the instance type supports it. Optionally, you can select a specific AMI.

Choose Create. It takes some time for the compute node group to be deployed. For more information, see Create a compute node group to run jobs and Create a compute node group for login nodes in the AWS documentation.
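
The same step can be scripted. The following is a minimal sketch that mirrors the console settings above; the option names are assumptions to check against the AWS PCS CLI reference, and the launch template, instance profile, and subnet IDs are placeholders.

# Create a job compute node group that scales between 0 and 8 hpc6a.48xlarge instances
$ aws pcs create-compute-node-group \
    --cluster-identifier my-cluster \
    --compute-node-group-name compute-1 \
    --subnet-ids subnet-0123456789abcdef0 \
    --custom-launch-template id=lt-0123456789abcdef0,version=1 \
    --iam-instance-profile-arn arn:aws:iam::111122223333:instance-profile/PCSInstanceProfile \
    --scaling-configuration minInstanceCount=0,maxInstanceCount=8 \
    --instance-configs instanceType=hpc6a.48xlarge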

3. Create and run your HPC jobs
After you create your compute node groups, you submit a job to a queue to run. The job remains in the queue until AWS PCS schedules it to run on a compute node group based on the available provisioned capacity. Each queue is associated with one or more compute node groups that provide the necessary EC2 instances to perform the processing.

To create a queue in the console, go to your cluster, select the Queues tab, and then choose the Create queue button.

Enter your queue name and select the compute node groups assigned to your queue.

Choose Create and wait while the queue is created.
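
You can also create the queue from the CLI; again, the option names below are assumptions to verify against the AWS PCS documentation, and the compute node group ID is a placeholder.

# Create a queue named demo backed by the job compute node group
$ aws pcs create-queue \
    --cluster-identifier my-cluster \
    --queue-name demo \
    --compute-node-group-configurations computeNodeGroupId=<compute-node-group-id>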

If the login compute node group is active, you can use AWS Systems Manager to connect to the EC2 instance it created. Go to the Amazon EC2 console and select the EC2 instance of the login compute node group. For more information, see Create a queue to submit and manage jobs and Connect to your cluster in the AWS documentation.
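
For example, once you know the instance ID of the login node, you can open a shell on it with Session Manager (the instance ID below is a placeholder):

# Open an interactive shell on the login node through AWS Systems Manager Session Manager
$ aws ssm start-session --target i-0123456789abcdef0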

To run a job with Slurm, prepare a submission script that specifies the job requirements and submit it to a queue with the sbatch command. Typically this is done from a shared directory so that the login and compute nodes have a common storage space for file access.
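 
As a minimal sketch, a single-node job script and its submission could look like the following; the script contents are illustrative, and the partition name demo matches the queue created earlier.

$ cat hello.sh
#!/bin/bash
#SBATCH --job-name=hello          # job name shown in squeue output
#SBATCH --output=hello_%j.out     # write output to hello_<job id>.out in the shared directory
#SBATCH --nodes=1                 # request a single compute node
echo "Hello from $(hostname)"

# Submit the script to the demo queue (Slurm partition) and check its state
$ sbatch -p demo hello.sh
$ squeue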

You can also run a Message Passing Interface (MPI) job in AWS PCS using Slurm. For more information, see Running a single node job with Slurm or Running a multi-node MPI job with Slurm in the AWS documentation.
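
A multi-node MPI submission script follows the same pattern, requesting several nodes and launching ranks with srun. The sketch below assumes the compute image provides an MPI installation as an environment module; the module name and application binary are placeholders.

$ cat mpi_job.sh
#!/bin/bash
#SBATCH --job-name=mpi-test
#SBATCH --nodes=3                 # three hpc6a.48xlarge instances
#SBATCH --ntasks-per-node=96      # 96 ranks per node, 288 ranks in total
#SBATCH --output=mpi_%j.out
module load openmpi               # assumption: an Open MPI module is available on the AMI
srun ./my_mpi_app                 # launch one rank per Slurm task across the allocated nodes

$ sbatch -p demo mpi_job.sh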

You can also connect to a fully managed NICE DCV remote desktop for visualization. First, use the CloudFormation template from the HPC Recipes for AWS GitHub repository.

In this example, I ran the OpenFOAM motorBike simulation to calculate the steady flow around a motorcycle and rider. The simulation ran on 288 cores across three hpc6a instances. The output can be viewed in a ParaView session after logging in to the web interface of the DCV instance.

After you complete the HPC jobs using the cluster and node groups you created, you should delete the resources you created to avoid unnecessary costs. For more information, see Delete your AWS resources in the AWS documentation.

What you should know
Here are some things you should know about this feature:

  • Slurm versions – AWS PCS initially supports Slurm 23.11 and provides mechanisms to allow customers to upgrade their Slurm major versions as new versions are added. In addition, AWS PCS is designed to automatically update the Slurm controller with patch releases. For more information, see Slurm versions in the AWS documentation.
  • Capacity reservations – You can reserve EC2 capacity in a specific Availability Zone and for a specific duration using On-Demand Capacity Reservations to ensure that you have the compute capacity you need when you need it. For more information, see Capacity reservations in the AWS documentation.
  • Network file systems – You can attach network storage volumes where data and files can be written and read, including Amazon FSx for NetApp ONTAP, Amazon FSx for OpenZFS, and Amazon File Cache, as well as Amazon EFS and Amazon FSx for Lustre. You can also use self-managed volumes, such as NFS servers; see the example mount command after this list. For more information, see Network file systems in the AWS documentation.
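
As an illustration of that last point, a shared Amazon EFS file system can be mounted on cluster nodes with a standard NFS mount; the file system ID, Region, and mount point below are placeholders.

# Mount an Amazon EFS file system at /shared using NFS v4.1
$ sudo mkdir -p /shared
$ sudo mount -t nfs4 -o nfsvers=4.1 fs-0123456789abcdef0.efs.us-east-1.amazonaws.com:/ /shared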

Now available
AWS Parallel Computing Service is now available in the US East (N. Virginia), US East (Ohio), US West (Oregon), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo), Europe (Frankfurt), Europe (Ireland), and Europe (Stockholm) Regions.

AWS PCS launches all resources in your AWS account, and you are billed for those resources. For more information, see the AWS PCS pricing page.

Try it out and send feedback to AWS re:Post or through your usual AWS support contacts.

— Channy

P.S. Many thanks to Matt Vaughn, a principal developer advocate at AWS, for his contribution to creating an HPC test environment.