Set up AWS ParallelCluster

Important

AWS ParallelCluster and FSx for Lustre can cost hundreds or thousands of dollars per month to use. See FSx for Lustre Pricing and EC2 Pricing for details.

AWS ParallelCluster is an AWS-supported open-source tool that lets you create your own HPC cluster on AWS. Using GCHP on AWS ParallelCluster is similar to using GCHP on any other HPC cluster. We provide up-to-date Amazon Machine Images (AMIs), listed in the AMI list, with GCHP’s dependencies built and GCHP compiled. These images contain GCHP source code that has already been built, along with the tools for creating a GCHP run directory. This page has instructions on using the AMIs to create your own ParallelCluster. Alternatively, you can set up AWS ParallelCluster for running GCHP simulations yourself. In that case the other GCHP documentation, such as Build GCHP’s dependencies, Download the model, Compile, Download Input Data, and Run the model, applies to AWS ParallelCluster as it does to any other cluster.

The workflow for getting started with GCHP simulations on AWS ParallelCluster using our public AMIs is:

Create an FSx for Lustre file system for input data (described on this page)
Configure AWS CLI (described on this page)
Configure AWS ParallelCluster (described on this page)
Create AWS ParallelCluster with GCHP public AMIs (described on this page)
Follow the normal GCHP User Guide
1. Create a Run Directory
2. Download Input Data
Running GCHP on ParallelCluster (described on this page)

These instructions were written using AWS ParallelCluster 3.7.0.

1. Create an FSx for Lustre file system

Start by creating an FSx for Lustre file system. This is persistent storage that will be mounted to your AWS ParallelCluster cluster. This file system will be used for storing GEOS-Chem input data and for housing your GEOS-Chem run directories.

Refer to the official FSx for Lustre Instructions to create the file system. Only Step 1, Create your Amazon FSx for Lustre file system, is necessary. Step 2, Install the Lustre client, and the subsequent steps describe mounting your file system to EC2 instances, which AWS ParallelCluster automates for you.

In subsequent steps you will need the following information about your FSx for Lustre file system:

its ID (fs-XXXXXXXXXXXXXXXXX)
its subnet (subnet-YYYYYYYYYYYYYYYYY)
its security group that has the inbound network rules (sg-ZZZZZZZZZZZZZZZZZ).
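If the file system already exists, these three identifiers can also be looked up from a terminal once the AWS CLI from step 2 is configured. The following is a minimal sketch, not an official procedure: the function names are hypothetical, and it assumes the file system has at least one subnet and one network interface and that IDs consist of a prefix followed by eight or more lowercase hexadecimal characters.

```shell
# Hypothetical helpers; they only wrap standard AWS CLI calls.

# Print the ID, first subnet, and first network interface of a file system.
# usage: fsx_summary fs-...
fsx_summary() {
  aws fsx describe-file-systems \
    --file-system-ids "$1" \
    --query 'FileSystems[0].[FileSystemId, SubnetIds[0], NetworkInterfaceIds[0]]' \
    --output text
}

# Print the security groups attached to a network interface.
# usage: eni_security_groups eni-...
eni_security_groups() {
  aws ec2 describe-network-interfaces \
    --network-interface-ids "$1" \
    --query 'NetworkInterfaces[0].Groups[].GroupId' \
    --output text
}

# Sanity-check the shape of an ID before pasting it into a config file.
# Assumption: IDs look like <prefix>-<8 or more lowercase hex characters>.
# usage: looks_like_id PREFIX ID
looks_like_id() {
  printf '%s\n' "$2" | grep -Eq "^$1-[0-9a-f]{8,}$"
}
```

For example, fsx_summary fs-XXXXXXXXXXXXXXXXX prints the file system's ID, subnet, and a network interface ID. Passing that interface ID to eni_security_groups lists the security groups attached to it; check in the console which of them carries the inbound rules.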

Once you have created the file system, proceed with 2. AWS CLI Installation and First-Time Setup.

2. AWS CLI Installation and First-Time Setup

Next, make sure you have the AWS CLI installed and configured. The AWS CLI is a terminal command, aws, for working with AWS services. If you have already installed and configured the AWS CLI, continue to 3. Create your AWS ParallelCluster.

Install the aws command: Official AWS CLI Install Instructions. Once you have installed the aws command, you need to configure it with the credentials for your AWS account:

$ aws configure

For instructions on aws configure, refer to the Official AWS Instructions or this YouTube tutorial.
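After running aws configure, it can be worth confirming that the command is on your PATH and that AWS accepts the stored credentials. The sketch below is one way to do that; the function names require_cmd and check_aws_cli are hypothetical, and the aws subcommands are standard AWS CLI calls.

```shell
# Return success only if the named command is available on PATH.
require_cmd() {
  command -v "$1" >/dev/null 2>&1
}

# Confirm the CLI is installed and that the stored credentials are accepted.
check_aws_cli() {
  require_cmd aws || { echo "aws command not found" >&2; return 1; }
  aws --version &&            # installed CLI version
  aws configure list &&       # active profile, region, and credential source
  aws sts get-caller-identity # fails if the credentials are rejected
}
```

Running check_aws_cli should print the account and identity that the CLI is using. An error at the last step usually points to a problem with the access keys entered during aws configure.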

3. Create your AWS ParallelCluster

Note

You should also refer to the official AWS documentation on Configuring AWS ParallelCluster. Those instructions will have the latest information on using AWS ParallelCluster. The instructions on this page are meant to supplement the official instructions and point out the parts of the configuration that matter for GCHP.

Next, install AWS ParallelCluster with pip. This requires Python 3.

$ pip install aws-parallelcluster
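Installing into a Python virtual environment keeps AWS ParallelCluster separate from other Python packages, and pinning the version keeps the pcluster command consistent with the version these instructions were written for. This is a sketch under assumptions: the directory ~/pcluster-venv and the function name install_pcluster are hypothetical, and the official installation instructions may list further prerequisites.

```shell
# Hypothetical location for the virtual environment; any writable path works.
VENV_DIR="${VENV_DIR:-$HOME/pcluster-venv}"
# Version these instructions were written with; override to use another one.
PCLUSTER_VERSION="${PCLUSTER_VERSION:-3.7.0}"

# Create the environment, activate it, and install the pinned release.
install_pcluster() {
  python3 -m venv "$VENV_DIR" &&
  . "$VENV_DIR/bin/activate" &&
  pip install --upgrade pip &&
  pip install "aws-parallelcluster==$PCLUSTER_VERSION" &&
  pcluster version  # prints the installed version as a quick check
}
```

In later terminal sessions, activate the environment again with . ~/pcluster-venv/bin/activate before using pcluster.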

Now you should have the pcluster command. You will use this command to perform actions such as creating a cluster, temporarily shutting your cluster down, and destroying a cluster.

Create a cluster config file by running the pcluster configure command:

$ pcluster configure --config cluster-config.yaml

For instructions on pcluster configure, refer to the official instructions Configuring AWS ParallelCluster.

The following settings are recommended:

Scheduler: slurm
Operating System: alinux2
Head node instance type: c5n.large
Number of queues: 1
Compute instance type: c5n.18xlarge
Maximum instance count: Your choice. This is the maximum number of execution nodes that can run concurrently. Execution nodes automatically spin up and shut down according to whether there are jobs in your queue.

Now you should have a file named cluster-config.yaml. This is the configuration file with the settings for your cluster.

Before starting your cluster with the pcluster create-cluster command, modify cluster-config.yaml so that the cluster is created from one of our AMIs. The available AMI IDs are given in the AMI list.

You also need to modify cluster-config.yaml so that your FSx for Lustre file system is mounted to your cluster. Use the following cluster-config.yaml as a template for these changes.

Region: us-east-1  # [replace with] the region with your FSx for Lustre file system
Image:
  Os: alinux2
  CustomAmi: ami-AAAAAAAAAAAAAAAAA # [replace with] the AMI ID you want to use
HeadNode:
  InstanceType: c5n.large  # smallest c5n node to minimize costs when head-node is up
  Networking:
    SubnetId: subnet-YYYYYYYYYYYYYYYYY  # [replace with] the subnet of your FSx for Lustre file system
    AdditionalSecurityGroups:
      - sg-ZZZZZZZZZZZZZZZZZ  # [replace with] the security group with inbound rules for your FSx for Lustre file system
  LocalStorage:
    RootVolume:
      VolumeType: io2
  Ssh:
    KeyName: AAAAAAAAAA  # [replace with] the name of your EC2 SSH key pair
SharedStorage:
  - MountDir: /fsx  # [replace with] where you want to mount your FSx for Lustre file system
    Name: FSxExtData
    StorageType: FsxLustre
    FsxLustreSettings:
      FileSystemId: fs-XXXXXXXXXXXXXXXXX  # [replace with] the ID of your FSx for Lustre file system
Scheduling:
  Scheduler: slurm
  SlurmQueues:
  - Name: main
    ComputeResources:
    - Name: c5n18xlarge
      InstanceType: c5n.18xlarge
      MinCount: 0
      MaxCount: 10  # max number of concurrent exec-nodes
      DisableSimultaneousMultithreading: true  # disable hyperthreading (recommended)
      Efa:
        Enabled: true
    Networking:
      SubnetIds:
      - subnet-YYYYYYYYYYYYYYYYY  # [replace with] the subnet of your FSx for Lustre file system (same as above)
      AdditionalSecurityGroups:
        - sg-ZZZZZZZZZZZZZZZZZ  # [replace with] the security group with inbound rules for your FSx for Lustre file system
      PlacementGroup:
        Enabled: true
    ComputeSettings:
      LocalStorage:
        RootVolume:
          VolumeType: io2
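Filling in the placeholders by hand is error-prone, so the substitutions can be scripted. The sketch below assumes the template above has been saved unchanged to a file; the function name fill_cluster_config and the file name template.yaml are hypothetical. It also assumes the key name contains no / or & characters, which sed would misinterpret.

```shell
# Hypothetical helper: print TEMPLATE with the placeholder IDs replaced.
# usage: fill_cluster_config TEMPLATE AMI_ID SUBNET_ID SG_ID FS_ID KEY_NAME
fill_cluster_config() {
  sed \
    -e "s/ami-AAAAAAAAAAAAAAAAA/$2/g" \
    -e "s/subnet-YYYYYYYYYYYYYYYYY/$3/g" \
    -e "s/sg-ZZZZZZZZZZZZZZZZZ/$4/g" \
    -e "s/fs-XXXXXXXXXXXXXXXXX/$5/g" \
    -e "s/KeyName: AAAAAAAAAA/KeyName: $6/" \
    "$1"
}

# Example (replace every argument with your own values):
#   fill_cluster_config template.yaml ami-... subnet-... sg-... fs-... my-key \
#     > cluster-config.yaml
#
# The resulting file can then be validated without creating any resources:
#   pcluster create-cluster --cluster-name pcluster \
#     --cluster-configuration cluster-config.yaml --dryrun true
```

The region and the MaxCount value are not touched by this helper, so review them in the generated file before creating the cluster.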

When you are ready, run the pcluster create-cluster command.

$ pcluster create-cluster --cluster-name pcluster --cluster-configuration cluster-config.yaml

It may take anywhere from several minutes to an hour for your cluster’s status to change to CREATE_COMPLETE. You can check the status of your cluster with the following command.

$ pcluster describe-cluster --cluster-name pcluster
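Rather than rerunning the command by hand, you can poll it until creation finishes. The sketch below assumes that pcluster describe-cluster prints JSON containing a "clusterStatus" field. The function names are hypothetical, and any status other than CREATE_COMPLETE or CREATE_IN_PROGRESS is treated as a failure.

```shell
# Extract the clusterStatus value from describe-cluster JSON on stdin.
cluster_status_from_json() {
  sed -n 's/.*"clusterStatus": *"\([A-Z_]*\)".*/\1/p' | head -n 1
}

# Poll until the cluster is created; return non-zero on any other end state.
# usage: wait_for_cluster CLUSTER_NAME [INTERVAL_SECONDS]
wait_for_cluster() {
  while :; do
    status=$(pcluster describe-cluster --cluster-name "$1" | cluster_status_from_json)
    echo "$(date '+%H:%M:%S') $1: ${status:-unknown}"
    case "$status" in
      CREATE_COMPLETE) return 0 ;;
      CREATE_IN_PROGRESS) sleep "${2:-60}" ;;
      *) return 1 ;;  # e.g. CREATE_FAILED, or no status could be read
    esac
  done
}
```

For the cluster created above, wait_for_cluster pcluster checks once a minute and returns as soon as the status is CREATE_COMPLETE.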

Once your cluster’s status is CREATE_COMPLETE, run the pcluster ssh command to ssh into it.

$ pcluster ssh --cluster-name pcluster -i ~/path/to/keyfile.pem

At this point, your cluster is set up and you can use it like any other HPC cluster. You can now create a run directory by running the createRunDir.sh script. Your next steps are to follow the normal instructions found in the User Guide.

4. Running GCHP on ParallelCluster

AWS ParallelCluster supports the Slurm and AWS Batch job schedulers. Your cluster is set to use the Slurm scheduler according to the configuration file. You might need root permission to run Slurm commands or to restart Slurm. Before you submit your job, you can start a shell as superuser by running sudo -s.

You can follow Run the model to run GCHP with the Slurm scheduler.
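When writing a batch script for this cluster, the Slurm resource requests should match the configuration used on this page: the queue is named main, and each c5n.18xlarge node has 72 vCPUs, which leaves 36 cores with hyperthreading disabled. The sketch below generates a matching script header. The function name gchp_sbatch_header and the file name gchp.run are hypothetical, and the commands that launch GCHP are deliberately left out because they come from Run the model.

```shell
# Physical cores per c5n.18xlarge node with hyperthreading disabled
# (72 vCPUs / 2), matching the cluster configuration on this page.
CORES_PER_NODE=36

# Print a Slurm batch-script header requesting NODES full nodes on "main".
# usage: gchp_sbatch_header NODES
gchp_sbatch_header() {
  nodes=$1
  cat <<EOF
#!/bin/bash
#SBATCH --partition=main
#SBATCH --nodes=$nodes
#SBATCH --ntasks-per-node=$CORES_PER_NODE
#SBATCH --exclusive
EOF
  echo "# total cores: $((nodes * CORES_PER_NODE))"
}
```

For example, gchp_sbatch_header 2 > gchp.run writes a header for a 72-core job. After appending the launch commands, submit the script with sbatch gchp.run, and use squeue and sinfo to watch the job and the execution nodes start up.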