From 0f412f5888584d8828a6bd507fc08916357049e7 Mon Sep 17 00:00:00 2001 From: Allan Carter Date: Wed, 8 May 2024 23:09:05 +0000 Subject: [PATCH] Deployed ded618c with MkDocs version: 1.5.3 --- CONTRIBUTING/index.html | 14 +-- config/index.html | 176 ++++++++++++++-------------- custom-amis/index.html | 6 +- debug/index.html | 12 +- delete-cluster/index.html | 2 +- deploy-parallel-cluster/index.html | 20 ++-- deployment-prerequisites/index.html | 24 ++-- federation/index.html | 2 +- implementation/index.html | 12 +- index.html | 12 +- job_preemption/index.html | 4 +- onprem/index.html | 12 +- res_integration/index.html | 2 +- rest_api/index.html | 4 +- run_jobs/index.html | 22 ++-- sitemap.xml.gz | Bin 127 -> 127 bytes soca_integration/index.html | 2 +- 17 files changed, 163 insertions(+), 163 deletions(-) diff --git a/CONTRIBUTING/index.html b/CONTRIBUTING/index.html index c44ced99..d9fa3516 100644 --- a/CONTRIBUTING/index.html +++ b/CONTRIBUTING/index.html @@ -133,12 +133,12 @@
-

Contributing Guidelines

+

Contributing Guidelines

Thank you for your interest in contributing to our project. Whether it's a bug report, new feature, correction, or additional documentation, we greatly value feedback and contributions from our community.

Please read through this document before submitting any issues or pull requests to ensure we have all the necessary information to effectively respond to your bug report or contribution.

-

Reporting Bugs/Feature Requests

+

Reporting Bugs/Feature Requests

We welcome you to use the GitHub issue tracker to report bugs or suggest features.

When filing an issue, please check existing open, or recently closed, issues to make sure somebody else hasn't already reported the issue. Please try to include as much information as you can. Details like these are incredibly useful:

@@ -148,7 +148,7 @@

Reporting Bugs/Feature RequestsAny modifications you've made relevant to the bug
  • Anything unusual about your environment or deployment
  • -

    Contributing via Pull Requests

    +

    Contributing via Pull Requests

    Contributions via pull requests are much appreciated. Before sending us a pull request, please ensure that:

    1. You are working against the latest source on the main branch.
    2. @@ -166,15 +166,15 @@

      Contributing via Pull Requestsforking a repository and creating a pull request.

      -

      Finding contributions to work on

      +

      Finding contributions to work on

      Looking at the existing issues is a great way to find something to contribute on. As our projects, by default, use the default GitHub issue labels (enhancement/bug/duplicate/help wanted/invalid/question/wontfix), looking at any 'help wanted' issues is a great place to start.

      -

      Code of Conduct

      +

      Code of Conduct

      This project has adopted the Amazon Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opensource-codeofconduct@amazon.com with any additional questions or comments.

      -

      Security issue notifications

      +

      Security issue notifications

      If you discover a potential security issue in this project we ask that you notify AWS/Amazon Security via our vulnerability reporting page. Please do not create a public github issue.

      -

      Licensing

      +

      Licensing

      See the LICENSE file for our project's licensing. We will ask you to confirm the licensing of your contribution.

    diff --git a/config/index.html b/config/index.html index a2d7ee6e..eb8e2f16 100644 --- a/config/index.html +++ b/config/index.html @@ -327,7 +327,7 @@
    -

Configuration File Format

    +

Configuration File Format

    This project creates a ParallelCluster configuration file that is documented in the ParallelCluster User Guide.

     termination_protection: bool
    @@ -458,73 +458,73 @@ 

Configuration File Format StatusScript:

    -

    Top Level Config

    -

    termination_protection

    +

    Top Level Config

    +

    termination_protection

Enable CloudFormation stack termination protection

    default=True

    -

    StackName

    +

    StackName

    The name of the configuration stack that will configure ParallelCluster and deploy it.

    If you do not specify the ClusterName then it will default to a value based on the StackName. If StackName ends in -config then ClusterName will be the StackName with -config stripped off. Otherwise it will be the StackName with -cl (for cluster) appended.

    Optional so can be specified on the command-line

    default='slurm-config'
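For example, a minimal sketch of the naming rule described above (the stack name is a placeholder):

StackName: eda-slurm-config   # ClusterName defaults to eda-slurm because -config is stripped
# StackName: eda-slurm        # ClusterName would instead default to eda-slurm-cl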

    -

    Region

    +

    Region

    AWS region where the cluster will be deployed.

    Optional so can be specified on the command-line

    -

    SshKeyPair

    +

    SshKeyPair

    Default EC2 key pair that will be used for all cluster instances.

    Optional so can be specified on the command-line

    -

    VpcId

    +

    VpcId

    The ID of the VPC where the cluster will be deployed.

    Optional so can be specified on the command-line

    -

    CIDR

    +

    CIDR

    The CIDR of the VPC. This is used in security group rules.

    -

    SubnetId

    +

    SubnetId

    The ID of the VPC subnet where the cluster will be deployed.

Optional. If not specified then the first private subnet is chosen. If no private subnets exist, then the first isolated subnet is chosen. If no isolated subnets exist, then the first public subnet is chosen.

    We recommend using a private or isolated subnet.

    -

    ErrorSnsTopicArn

    +

    ErrorSnsTopicArn

    The ARN of an existing SNS topic. Errors will be published to the SNS topic. You can subscribe to the topic so that you are notified for things like script or lambda errors.

    Optional, but highly recommended

    -

    TimeZone

    +

    TimeZone

    The time zone to use for all EC2 instances in the cluster.

    default='US/Central'

    -

    RESEnvironmentName

    +

    RESEnvironmentName

    If you are deploying the cluster to use from Research and Engineering Studio (RES) virtual desktops, then you can specify the environment name so that the virtual desktops automatically get configured to use the cluster.

    The security group of the desktops will be updated with rules that allow them to talk to the cluster and the cluster will be configured on the desktop.

The Slurm binaries will be compiled for the OS of the desktops and an environment modulefile will be created so that the users just need to load the cluster modulefile to use the cluster.
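For example, a minimal sketch (the environment name is a placeholder):

RESEnvironmentName: res-demo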

    -

    slurm

    +

    slurm

    Slurm configuration parameters.

    -

    ParallelClusterConfig

    +

    ParallelClusterConfig

    ParallelCluster specific configuration parameters.

    -

    Version

    +

    Version

    The ParallelCluster version.

    This is required and cannot be changed after the cluster is created.

    Updating to a new version of ParallelCluster requires either deleting the current cluster or creating a new cluster.

    -

    ClusterConfig

    +

    ClusterConfig

    type: dict

    Additional ParallelCluster configuration settings that will be directly added to the configuration without checking.

This will be used to create the initial ParallelCluster configuration, and other settings in this configuration file will override values in the dict.

    This exists to enable further customization of ParallelCluster beyond what this configuration supports.
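As an illustration, a hedged sketch that passes additional settings straight through; the tag key and value are placeholders and anything under ClusterConfig must follow the ParallelCluster configuration schema:

slurm:
  ParallelClusterConfig:
    ClusterConfig:
      Tags:
        - Key: Project
          Value: eda-cluster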

    -

    Image

    +

    Image

    The OS and AMI to use for the head node and compute nodes.

    -
    OS
    +
    OS

    See the ParallelCluster docs for the supported OS distributions and versions.

    -
    CustomAmi
    +
    CustomAmi

    See the ParallelCluster docs for the custom AMI documentation.

    NOTE: A CustomAmi must be provided for Rocky8. All other distributions have a default AMI that is provided by ParallelCluster.

    -

    Architecture

    +

    Architecture

    The CPU architecture to use for the cluster.

    ParallelCluster doesn't support heterogeneous clusters. All of the instances must have the same CPU architecture and the same OS.

    @@ -535,11 +535,11 @@

    Architecturex86_64

    default: x86_64

    -

    ComputeNodeAmi

    +

    ComputeNodeAmi

    AMI to use for compute nodes.

    All compute nodes will use the same AMI.

    The default AMI is selected by the Image parameters.

    -

    DisableSimultaneousMultithreading

    +

    DisableSimultaneousMultithreading

    type: bool

    default=True

    Disable SMT on the compute nodes.

    @@ -547,12 +547,12 @@

    DisableSimultaneousMultithreadingNot all instance types can disable multithreading. For a list of instance types that support disabling multithreading, see CPU cores and threads for each CPU core per instance type in the Amazon EC2 User Guide for Linux Instances.

    Update policy: The compute fleet must be stopped for this setting to be changed for an update.

    ParallelCluster documentation

    -

    EnableEfa

    +

    EnableEfa

    type: bool

    default: False

We recommend not using EFA unless necessary, to avoid insufficient capacity errors when starting new instances in a placement group or when the compute resource contains multiple instance types.

    See https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/placement-groups.html#placement-groups-cluster

    -

    Database

    +

    Database

    Optional

    Configure the Slurm database to use with the cluster.

    This is created independently of the cluster so that the same database can be used with multiple clusters.

    @@ -560,7 +560,7 @@

    DatabaseDatabaseStackName. All of the other parameters will be pulled from the stack.

    See the ParallelCluster documentation.
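For example, a minimal sketch assuming the DatabaseStackName parameter described below (the stack name is a placeholder):

slurm:
  ParallelClusterConfig:
    Database:
      DatabaseStackName: pcluster-slurm-db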

    -

    DatabaseStackName
    +
    DatabaseStackName

    Name of the ParallelCluster CloudFormation stack that created the database.

    The following parameters will be set using the outputs of the stack:

    default='REQUEUE'

    -

    PreemptType

    +

    PreemptType

    Slurm documentation

    Valid values:

    default='preempt/partition_prio'

    -

    PreemptExemptTime

    +

    PreemptExemptTime

    Slurm documentation

    Global option for minimum run time for all jobs before they can be considered for preemption.

    A time of -1 disables the option, equivalent to 0. Acceptable time formats include "minutes", "minutes:seconds", "hours:minutes:seconds", "days-hours", "days-hours:minutes", and "days-hours:minutes:seconds".

    default='0'

    type: str

    -

    SlurmConfOverrides

    +

    SlurmConfOverrides

    File that will be included at end of slurm.conf to override configuration parameters.

    This allows you to customize the slurm configuration arbitrarily.

    This should be used with caution since it can result in errors that make the cluster non-functional.

    type: str

    -

    SlurmrestdUid

    +

    SlurmrestdUid

    User ID for the slurmrestd daemon.

    type: int

    default=901

    -

    SlurmRestApiVersion

    +

    SlurmRestApiVersion

    The REST API version.

    This is automatically set based on the Slurm version being used by the ParallelCluster version.

    type: str

default: '0.0.39'

    -

    Head Node AdditionalSecurityGroups

    +

    Head Node AdditionalSecurityGroups

    Additional security groups that will be added to the head node instance.

    -

    Head Node AdditionalIamPolicies

    +

    Head Node AdditionalIamPolicies

    List of Amazon Resource Names (ARNs) of IAM policies for Amazon EC2 that will be added to the head node instance.

    -

    SubmitterSecurityGroupIds

    +

    SubmitterSecurityGroupIds

    External security groups that should be able to use the cluster.

    Rules will be added to allow it to interact with Slurm.

    -

    SubmitterInstanceTags

    +

    SubmitterInstanceTags

    Tags of instances that can be configured to submit to the cluster.

    When the cluster is deleted, the tag is used to unmount the slurm filesystem from the instances using SSM.
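For example, a hypothetical sketch; the tag name and value are placeholders and the exact format should be checked against default_config.yml:

slurm:
  SubmitterInstanceTags:
    RESEnvironmentName:
      - res-demo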

    -

    InstanceConfig

    +

    InstanceConfig

    Configure the instances used by the cluster.

    A partition will be created for each combination of Base OS, Architecture, and Spot.

    -

    UseSpot

    +

    UseSpot

    Configure spot instances.

    type: bool

    default: True

    -

    Exclude

    +

    Exclude

    Instance families and types to exclude.

Exclude patterns are processed first and take precedence over any includes.

Instance families and types are regular expressions with implicit '^' and '$' at the beginning and end.

    -
    Exclude InstanceFamilies
    +
    Exclude InstanceFamilies

Regular expressions with implicit '^' and '$' at the beginning and end.

    An empty list is the same as '.*'.

    Default:

    @@ -702,7 +702,7 @@
    Exclude InstanceFamilies -
    Exclude InstanceTypes
    +
    Exclude InstanceTypes

Regular expressions with implicit '^' and '$' at the beginning and end.

    An empty list is the same as '.*'.

    Default:

    @@ -711,15 +711,15 @@
    Exclude InstanceTypesInclude
    +

    Include

    Instance families and types to include.

Exclude patterns are processed first and take precedence over any includes.

Instance families and types are regular expressions with implicit '^' and '$' at the beginning and end.

    -
    MaxSizeOnly
    +
    MaxSizeOnly

    type: bool

    default: False

    If MaxSizeOnly is True then only the largest instance type in a family will be included unless specific instance types are included.

    -
    Include InstanceFamilies
    +
    Include InstanceFamilies

Regular expressions with implicit '^' and '$' at the beginning and end.

    An empty list is the same as '.*'.

    Default:

    @@ -765,7 +765,7 @@
    Include InstanceFamilies -
    Include InstanceTypes
    +
    Include InstanceTypes

Regular expressions with implicit '^' and '$' at the beginning and end.

    An empty list is the same as '.*'.

    Default:

    @@ -776,63 +776,63 @@
    Include InstanceTypesNodeCounts
    +

    NodeCounts

    Configure the number of compute nodes of each instance type.

    -
    DefaultMinCount
    +
    DefaultMinCount

    type: int

    default: 0

    Minimum number of compute nodes to keep running in a compute resource. If the number is greater than zero then static nodes will be created.

    -
    DefaultMaxCount
    +
    DefaultMaxCount

    type: int

    The maximum number of compute nodes to create in a compute resource.

    -
    ComputeResourceCounts
    +
    ComputeResourceCounts

    Define compute node counts per compute resource.

    These counts will override the defaults set by DefaultMinCount and DefaultMaxCount.

    -
    ComputeResourceName
    +
    ComputeResourceName

    Name of the ParallelCluster compute resource. Can be found using sinfo.

    -
    # Compute Resource MinCount
    +
    # Compute Resource MinCount

    type: int

    default: 0

    -
    # Compute Resource MaxCount
    +
    # Compute Resource MaxCount

    type: int
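Pulling the InstanceConfig parameters above together, a hedged sketch; the instance family, excluded types, compute resource name, and counts are all placeholders:

slurm:
  InstanceConfig:
    UseSpot: true
    Include:
      MaxSizeOnly: false
      InstanceFamilies:
        - r6a
      InstanceTypes: []
    Exclude:
      InstanceTypes:
        - '.*\.metal'
    NodeCounts:
      DefaultMinCount: 0
      DefaultMaxCount: 10
      ComputeResourceCounts:
        od-r6a-l-x86-64:      # compute resource name as reported by sinfo
          MinCount: 1
          MaxCount: 20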

    -

    Compute Node AdditionalSecurityGroups

    +

    Compute Node AdditionalSecurityGroups

    Additional security groups that will be added to the compute node instances.

    -

    Compute Node AdditionalIamPolicies

    +

    Compute Node AdditionalIamPolicies

    List of Amazon Resource Names (ARNs) of IAM policies for Amazon EC2 that will be added to the compute node instances.

    -

    OnPremComputeNodes

    +

    OnPremComputeNodes

    Define on-premises compute nodes that will be managed by the ParallelCluster head node.

    The compute nodes must be accessible from the head node over the network and any firewalls must allow all of the Slurm ports between the head node and compute nodes.

ParallelCluster will be configured to allow the necessary network traffic and the on-premises firewall can be configured to match the ParallelCluster security groups.

    -
    ConfigFile
    +
    ConfigFile

    Configuration file with the on-premises compute nodes defined in Slurm NodeName format as described in the Slurm slurm.conf documentation.

    The file will be included in the ParallelCluster slurm.conf so it can technically include any Slurm configuration updates including custom partition definitions.

    NOTE: The syntax of the file isn't checked and syntax errors can result in the slurmctld daemon failing on the head node.

    -
    On-Premises CIDR
    +
    On-Premises CIDR

    The CIDR that contains the on-premises compute nodes.

    This is to allow egress from the head node to the on-premises nodes.

    -
    Partition
    +
    Partition

    A partition that will contain all of the on-premises nodes.
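A hedged sketch of the three parameters above; the file name, CIDR, and partition name are placeholders and the exact nesting should be checked against default_config.yml:

slurm:
  InstanceConfig:
    OnPremComputeNodes:
      ConfigFile: slurm_nodes_on_prem.conf
      CIDR: 10.1.0.0/16
      Partition: onprem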

    -

    SlurmUid

    +

    SlurmUid

    type: int

    default: 900

    The user id of the slurm user.

    -

    storage

    -

    ExtraMounts

    +

    storage

    +

    ExtraMounts

    Additional mounts for compute nodes.

    This can be used so the compute nodes have the same file structure as the remote desktops.

    This is used to configure ParallelCluster SharedStorage.

    -
    dest
    +
    dest

    The directory where the file system will be mounted.

    This sets the MountDir.

    -
    src
    +
    src

    The source path on the file system export that will be mounted.

    -
    type
    +
    type

    The type of mount. For example, nfs3.

    -
    options
    +
    options

    Mount options.

    -
    StorageType
    +
    StorageType

    The type of file system to mount.

    Valid values:

    -
    FileSystemId
    +
    FileSystemId

    Specifies the ID of an existing FSx for Lustre or EFS file system.

    -
    VolumeId
    +
    VolumeId

    Specifies the volume ID of an existing FSx for ONTAP or FSx for OpenZFS file system.
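For example, a hedged sketch of an NFS mount from an existing EFS file system; the IDs, DNS name, and paths are placeholders:

slurm:
  storage:
    ExtraMounts:
      - dest: /tools
        StorageType: Efs
        FileSystemId: fs-0123456789abcdef0
        src: fs-0123456789abcdef0.efs.us-east-1.amazonaws.com:/
        type: nfs4
        options: nfsvers=4.1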

    -

    ExtraMountSecurityGroups

    +

    ExtraMountSecurityGroups

The security groups used by the file systems so that the head and compute nodes can be configured to connect to them.

    For example:

    @@ -861,7 +861,7 @@

    ExtraMountSecurityGroups -
    FileSystemType
    +

    FileSystemType

    Type of file system so that the appropriate ports can be opened.

    Valid values:

    diff --git a/custom-amis/index.html b/custom-amis/index.html index 6704fa1b..1c4af9bd 100644 --- a/custom-amis/index.html +++ b/custom-amis/index.html @@ -127,7 +127,7 @@
    -

    Custom AMIs for ParallelCluster

    +

    Custom AMIs for ParallelCluster

    ParallelCluster supports building custom ParallelCluster AMIs for the head and compute nodes. You can specify a custom AMI for the entire cluster (head and compute nodes) and you can also specify a custom AMI for just the compute nodes. By default, ParallelCluster will use pre-built AMIs for the OS that you select. The exception is Rocky 8 and 9, for which ParallelCluster does not provide pre-built AMIs. @@ -161,13 +161,13 @@

    Custom AMIs for ParallelCluster -

    FPGA Developer AMI

    +

    FPGA Developer AMI

The build file with fpga in the name is based on the FPGA Developer AMI. The FPGA Developer AMI has the Xilinx Vivado tools that can be used free of additional charges when run on AWS EC2 instances to develop FPGA images that can be run on AWS F1 instances.

    First subscribe to the FPGA developer AMI in the AWS Marketplace. There are 2 versions, one for CentOS 7 and the other for Amazon Linux 2.

    -

    Deploy or update the Cluster

    +

    Deploy or update the Cluster

    After the AMI is built, add it to the config and create or update your cluster to use the AMI. You can set the AMI for the compute and head nodes using slurm/ParallelClusterConfig/Os/CustomAmi and for the compute nodes only using slurm/ParallelClusterConfig/ComputeNodeAmi.
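For example, a hedged sketch; the AMI IDs are placeholders and the layout follows the Image and ComputeNodeAmi parameters in the configuration reference:

slurm:
  ParallelClusterConfig:
    Image:
      OS: rocky8
      CustomAmi: ami-0123456789abcdef0     # head and compute nodes
    ComputeNodeAmi: ami-0fedcba9876543210  # compute nodes only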

    Note: You cannot update the OS of the cluster or the AMI of the head node. If they need to change then you will need to create a new cluster.

    diff --git a/debug/index.html b/debug/index.html index f57f3fb0..1b324f34 100644 --- a/debug/index.html +++ b/debug/index.html @@ -139,9 +139,9 @@
    -

    Debug

    +

    Debug

    For ParallelCluster and Slurm issues, refer to the official AWS ParallelCluster Troubleshooting documentation.

    -

    Slurm Head Node

    +

    Slurm Head Node

    If slurm commands hang, then it's likely a problem with the Slurm controller.

    Connect to the head node from the EC2 console using SSM Manager or ssh and switch to the root user.

    sudo su

    @@ -160,14 +160,14 @@

    Slurm Head NodeCompute Nodes

    +

    Compute Nodes

    If there are problems with the compute nodes, connect to them using SSM Manager.

Check for cloud-init errors the same way as for the slurmctl instance. The compute nodes do not run ansible at boot; their AMIs are configured using ansible when the AMIs are built.

    Also check the slurmd.log.

    Check that the slurm daemon is running.

    systemctl status slurmd

    -

    Log Files

    +

    Log Files

    @@ -182,11 +182,11 @@

    Log FilesJob Stuck in Pending State

    +

    Job Stuck in Pending State

    You can use scontrol to get detailed information about a job.

    scontrol show job *jobid*
     
    -

    Job Stuck in Completing State

    +

    Job Stuck in Completing State

When a node starts it reports its number of cores and free memory to the controller. If the memory is less than in slurm_node.conf then the controller will mark the node as invalid. diff --git a/delete-cluster/index.html b/delete-cluster/index.html index 8b947016..b7118b61 100644 --- a/delete-cluster/index.html +++ b/delete-cluster/index.html @@ -119,7 +119,7 @@

    -

    Delete Cluster

    +

    Delete Cluster

    To delete the cluster all you need to do is delete the configuration CloudFormation stack. This will delete the ParallelCluster cluster and all of the configuration resources.

    If you specified RESEnvironmentName then it will also deconfigure the creation of users_groups.json and also deconfigure the VDI diff --git a/deploy-parallel-cluster/index.html b/deploy-parallel-cluster/index.html index c553566d..fae81e63 100644 --- a/deploy-parallel-cluster/index.html +++ b/deploy-parallel-cluster/index.html @@ -155,15 +155,15 @@

    -

    Deploy AWS ParallelCluster

    +

    Deploy AWS ParallelCluster

    A ParallelCluster configuration will be generated and used to create a ParallelCluster slurm cluster. The first supported ParallelCluster version is 3.6.0. Version 3.7.0 is the recommended minimum version because it supports compute node weighting that is proportional to instance type cost so that the least expensive instance types that meet job requirements are used. The current latest version is 3.8.0.

    -

    Prerequisites

    +

    Prerequisites

    See Deployment Prerequisites page.

    - +

It is highly recommended to create a ParallelCluster UI to manage your ParallelCluster clusters. A different UI is required for each version of ParallelCluster that you are using. The versions are listed in the ParallelCluster Release Notes. @@ -171,12 +171,12 @@

    Create ParallelCluster Slurm Database

    +

    Create ParallelCluster Slurm Database

The Slurm Database is required for configuring Slurm accounts, users, groups, and fair share scheduling. If you need these and other features then you will need to create a ParallelCluster Slurm Database. You do not need to create a new database for each cluster; multiple clusters can share the same database. Follow the directions in this ParallelCluster tutorial to configure slurm accounting.

    -

    Create the Cluster

    +

    Create the Cluster

    To install the cluster run the install script. You can override some parameters in the config file with command line arguments, however it is better to specify all of the parameters in the config file.

    ./install.sh --config-file <config-file> --cdk-cmd create
    @@ -184,7 +184,7 @@ 

    Create the ClusterCreate users_groups.json

    +

    Create users_groups.json

Before you can use the cluster you must configure the Linux users and groups for the head and compute nodes. One way to do that would be to join the cluster to your domain. But joining each compute node to a domain effectively creates a distributed denial of service (DDoS) attack on the domain controller @@ -231,7 +231,7 @@

    Create users_groups.jsonNow the cluster is ready to be used by sshing into the head node or a login node, if you configured one.

    If you configured extra file systems for the cluster that contain the users' home directories, then they should be able to ssh in with their own ssh keys.

    -

    Configure submission hosts to use the cluster

    +

    Configure submission hosts to use the cluster

ParallelCluster was built assuming that users would ssh into the head node or login nodes to execute Slurm commands. This can be undesirable for a number of reasons. First, users shouldn't be given ssh access to critical infrastructure like the cluster head node. @@ -266,7 +266,7 @@

    Configure submission host It also configures the modulefile that sets up the environment to use the slurm cluster.

    The clusters have been configured so that a submission host can use more than one cluster by simply changing the modulefile that is loaded.

    On the submission host just open a new shell and load the modulefile for your cluster and you can access Slurm.

    -

    Customize the compute node AMI

    +

    Customize the compute node AMI

    The easiest way to create a custom AMI is to find the default ParallelCluster AMI in the UI. Create an instance using the AMI and make whatever customizations you require such as installing packages and configuring users and groups.

    @@ -279,7 +279,7 @@

    Customize the compute node AMIRun Your First Job

    +

    Run Your First Job

    Run the following command in a shell to configure your environment to use your slurm cluster.

    module load {{ClusterName}}
     
    @@ -292,7 +292,7 @@

    Run Your First JobSlurm Documentation

    +

    Slurm Documentation

    https://slurm.schedmd.com

    diff --git a/deployment-prerequisites/index.html b/deployment-prerequisites/index.html index 9b631528..e8a37d4f 100644 --- a/deployment-prerequisites/index.html +++ b/deployment-prerequisites/index.html @@ -163,28 +163,28 @@
    -

    Deployment Prerequisites

    +

    Deployment Prerequisites

    This page shows common prerequisites that need to be done before deployment.

    -

    Deployment Server/Instance Requirements

    +

    Deployment Server/Instance Requirements

The deployment process was developed and tested using Amazon Linux 2. It has also been tested on RHEL 8 and RHEL 9. An easy way to create a deployment instance is to use an AWS Cloud9 desktop. This will give you a code editor IDE and shell environment that you can use to deploy the cluster.

    If the required packages aren't installed then you will need sudo or root access on the instance.

    -

    Configure AWS CLI Credentials

    +

    Configure AWS CLI Credentials

You will need AWS credentials that provide admin access to deploy the cluster.

    -

    Clone or Download the Repository

    +

    Clone or Download the Repository

    Clone or download the aws-eda-slurm-cluster repository to your system.

    git clone git@github.com:aws-samples/aws-eda-slurm-cluster.git
     
    - +

    The Slurm cluster allows you to specify an SNS notification that will be notified when an error is detected. You can provide the ARN for the topic in the config file or on the command line.

    You can use the SNS notification in various ways. The simplest is to subscribe your email address to the topic so that you get an email when there is an error. You could also use it to trigger a CloudWatch alarm that could be used to trigger a lambda to do automatic remediation or create a support ticket.

    -

Make sure you are using at least Python version 3.7

    +

Make sure you are using at least Python version 3.7

    This application requires at least python version 3.7.

    Many distributions use older versions of python by default such as python 3.6.8 in RHEL 8 and Rocky Linux 8. Newer versions are available, but can't be made the system default without breaking OS tools such as yum. @@ -198,14 +198,14 @@

    Make sure using at least pyt $ python3 --version Python 3.11.5 -

    Make sure required packages are installed

    +

    Make sure required packages are installed

    cd aws-eda-slurm-cluster
     source setup.sh
     

    The setup script assumes that you have sudo access so that you can install or update packages. If you do not, then contact an administrator to help you do the updates. If necessary modify the setup script for your environment.

    -

    Install Cloud Development Kit (CDK) (Optional)

    +

    Install Cloud Development Kit (CDK) (Optional)

    The setup script will attempt to install all of the prerequisites for you. If the install script fails on your system then you can refer to this section for instructions on how to install or update CDK.

    @@ -226,7 +226,7 @@

    Install Cloud Development Ki

    Note that the version of aws-cdk changes frequently. The version that has been tested is in the CDK_VERSION variable in the install script.

    The install script will try to install the prerequisites if they aren't already installed.

    -

    Create Configuration File

    +

    Create Configuration File

    Before you deploy a cluster you need to create a configuration file. A default configuration file is found in source/resources/config/default_config.yml. You should create a new config file and update the parameters for your cluster. @@ -296,7 +296,7 @@
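A minimal hedged sketch of a new config file using the parameters described in the configuration reference; all values are placeholders:

StackName: eda-slurm-config
Region: us-east-1
SshKeyPair: my-keypair
VpcId: vpc-0123456789abcdef0
SubnetId: subnet-0123456789abcdef0
ErrorSnsTopicArn: arn:aws:sns:us-east-1:123456789012:slurm-errors
slurm:
  ParallelClusterConfig:
    Version: 3.8.0
    Image:
      OS: rhel8
  InstanceConfig:
    UseSpot: true
    NodeCounts:
      DefaultMaxCount: 10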

    Create Configuration FileConfigure the Compute Instances

    +

    Configure the Compute Instances

    The slurm/InstanceConfig configuration parameter configures the base operating systems, CPU architectures, instance families, and instance types that the Slurm cluster should support. ParallelCluster currently doesn't support heterogeneous clusters; @@ -381,7 +381,7 @@

    Configure the Compute Instances -

    Configure Fair Share Scheduling (Optional)

    +

    Configure Fair Share Scheduling (Optional)

    Slurm supports fair share scheduling, but it requires the fair share policy to be configured. By default, all users will be put into a default group that has a low fair share. The configuration file is at source/resources/playbooks/roles/ParallelClusterHeadNode/files/opt/slurm/config/accounts.yml.example @@ -450,7 +450,7 @@

    Configure Fair Share Schedulin PriorityWeightJobSize=0

    These weights can be adjusted based on your needs to control job priorities.

    -

    Configure Licenses

    +

    Configure Licenses

    Slurm supports configuring licenses as a consumable resource. It will keep track of how many running jobs are using a license and when no more licenses are available then jobs will stay pending in the queue until a job completes and frees up a license. diff --git a/federation/index.html b/federation/index.html index c381475f..9be70b8b 100644 --- a/federation/index.html +++ b/federation/index.html @@ -109,7 +109,7 @@

    -

    Federation (legacy)

    +

    Federation (legacy)

    To maximize performance, EDA workloads should run in a single AZ. If you need to run jobs in more than one AZ then you can use the federation feature of Slurm so that you can run jobs on multiple clusters.

The config directory has example configuration files that demonstrate how to deploy a federated cluster into 3 AZs.

    diff --git a/implementation/index.html b/implementation/index.html index b9c5e566..ce74f93f 100644 --- a/implementation/index.html +++ b/implementation/index.html @@ -129,15 +129,15 @@
    -

    Implementation Details (legacy)

    -

    Slurm Infrastructure

    +

    Implementation Details (legacy)

    +

    Slurm Infrastructure

    All hosts in the cluster must share a uniform user and group namespace.

    The munged service must be running before starting any slurm daemons.

    -

    Directory Structure

    +

    Directory Structure

    All of the configuration files, scripts, and logs can be found under the following directory.

    /opt/slurm/{{ClusterName}}
     
    -

    CloudWatch Metrics

    +

    CloudWatch Metrics

    CloudWatch metrics are published by the following sources, but the code is all in SlurmPlugin.py.

    • Slurm power saving scripts
        @@ -157,7 +157,7 @@

        CloudWatch MetricsDown Node Handling

        +

        Down Node Handling

        If a node has a problem running jobs then Slurm can mark it DOWN. This includes if the resume script cannot start an instance for any reason include insufficient EC2 capacity. This can create 2 issues. First, if the compute node is running then it is wasting EC2 costs. @@ -168,7 +168,7 @@

        Down Node HandlingInsufficient Capacity Exception (ICE) Handling

        +

        Insufficient Capacity Exception (ICE) Handling

        When Slurm schedules a powered down node it calls the ResumeScript defined in slurm.conf. This is in /opt/slurm/{{ClusterName}}/bin/slurm_ec2_resume.py. The script will attempt to start an EC2 instance and if it receives and InsufficientCapacityException (ICE) then the node will be marked down and Slurm will requeue the job. diff --git a/index.html b/index.html index 30a687b5..8e877341 100644 --- a/index.html +++ b/index.html @@ -135,7 +135,7 @@

    -

    AWS EDA Slurm Cluster

    +

    AWS EDA Slurm Cluster

    This repository contains an AWS Cloud Development Kit (CDK) application that creates a Slurm cluster that is suitable for running production EDA workloads on AWS.

    The original (legacy) version of this repo that used a custom Python plugin to integrate Slurm with AWS has been deprecated and is no longer supported. It can be found on the v1 branch. @@ -183,7 +183,7 @@

    AWS EDA Slurm ClusterOperating System and Processor Architecture Support

    +

    Operating System and Processor Architecture Support

    This Slurm cluster supports the following OSes:

    ParallelCluster:

      @@ -205,7 +205,7 @@

      Operating System an
    • RedHat 8 and x86_64

    Note that in ParallelCluster, all compute nodes must have the same OS and architecture.

    -

    Documentation

    +

    Documentation

    View on GitHub Pages

You can also view the docs locally. The docs are in the docs directory. You can view them in an editor or using the mkdocs tool.

    @@ -223,9 +223,9 @@

    DocumentationOr you can simply let make do this for you.

    make local-docs
     
    -

    Security

    +

    Security

    See CONTRIBUTING for more information.

    -

    License

    +

    License

    This library is licensed under the MIT-0 License. See the LICENSE file.

    @@ -309,5 +309,5 @@ diff --git a/job_preemption/index.html b/job_preemption/index.html index 05bc0651..5205db9d 100644 --- a/job_preemption/index.html +++ b/job_preemption/index.html @@ -123,7 +123,7 @@
    -

    Job Preemption

    +

    Job Preemption

    The cluster is set up with an interactive partition that has a higher priority than all other partitions. All other partitions are configured to allow jobs to be preempted by the interactive queue. When an interactive job is pending because of compute resources then it can preempt another job and use the resources. @@ -132,7 +132,7 @@

    Job PreemptionDocumentation

    +

    Documentation

    https://slurm.schedmd.com/preempt.html

    diff --git a/onprem/index.html b/onprem/index.html index e4b6095b..8cbd9401 100644 --- a/onprem/index.html +++ b/onprem/index.html @@ -139,23 +139,23 @@
    -

    On-Premises Integration

    +

    On-Premises Integration

    The Slurm cluster can also be configured to manage on-premises compute nodes. The user must configure the on-premises compute nodes and then give the configuration information.

    -

    Network Requirements

    +

    Network Requirements

The on-prem network must have a CIDR range that doesn't overlap the Slurm cluster's VPC and the two networks need to be connected using VPN or AWS Direct Connect. The on-prem firewall must allow ingress and egress from the VPC. The open ports must allow connections to the file systems and Slurm controllers, and allow traffic between virtual desktops and compute nodes.

    -

    DNS Requirements

    +

    DNS Requirements

    Local network DNS must have an entry for the slurm controller or have a forwarding rule to the AWS provided DNS in the Slurm VPC.

    -

    File System Requirements

    +

    File System Requirements

    All of the compute nodes in the cluster, including the on-prem nodes, must have file system mounts that replicate the same directory structure. This can involve mounting filesystems across VPN or Direct Connect or synchronizing file systems using tools like rsync or NetApp FlexCache or SnapMirror. Performance will dictate the architecture of the file system.

    The onprem compute nodes must mount the Slurm controller's NFS export so that they have access to the Slurm binaries and configuration file. They must then be configured to run slurmd so that they can be managed by Slurm.

    -

    Slurm Configuration of On-Premises Compute Nodes

    +

    Slurm Configuration of On-Premises Compute Nodes

    The slurm cluster's configuration file allows the configuration of on-premises compute nodes. The Slurm cluster will not provision any of the on-prem nodes, network, or firewall, but it will configure the cluster's resources to be used by the on-prem nodes. @@ -205,7 +205,7 @@

    Slurm Configuration of SuspendExcParts=onprem -

    Simulating an On-Premises Network Using AWS

    +

    Simulating an On-Premises Network Using AWS

    Create a new VPC with public and private subnets and NAT gateways. To simulate the latency between an AWS region and on-prem you can create the VPC in a different region in your account. The CIDR must not overlap with the Slurm VPC.

    diff --git a/res_integration/index.html b/res_integration/index.html index eba5089a..63ee3c43 100644 --- a/res_integration/index.html +++ b/res_integration/index.html @@ -119,7 +119,7 @@
    -

    RES Integration

    +

    RES Integration

Integration with Research and Engineering Studio (RES) is straightforward. You simply specify the --RESEnvironmentName option for the install.sh script or add the RESEnvironmentName configuration parameter to your configuration file. diff --git a/rest_api/index.html b/rest_api/index.html index 8f0c9d0f..fa943e5a 100644 --- a/rest_api/index.html +++ b/rest_api/index.html @@ -123,10 +123,10 @@

    -

    Slurm REST API

    +

    Slurm REST API

The Slurm REST API gives a programmatic way to access the features of Slurm. The REST API can be used, for example, by a Lambda function to submit jobs to the Slurm cluster.

    -

    How to use the REST API

    +

    How to use the REST API

    The following shows how to run a simple REST call.

    source /opt/slurm/{{ClusterName}}/config/slurm_config.sh
     unset SLURM_JWT
    diff --git a/run_jobs/index.html b/run_jobs/index.html
    index 87b009e3..2a701efe 100644
    --- a/run_jobs/index.html
    +++ b/run_jobs/index.html
    @@ -159,11 +159,11 @@
     
    -

    Run Jobs

    +

    Run Jobs

This page gives some basic instructions on how to run and monitor jobs on Slurm. Slurm provides excellent man pages for all of its commands, so if you have questions refer to the man pages.

    -

    Set Up

    +

    Set Up

    Load the environment module for Slurm to configure your PATH and Slurm related environment variables.

    module load {{ClusterName}}
    @@ -176,7 +176,7 @@ 

    Set UpKey Slurm Commands

    +

    Key Slurm Commands

    The key Slurm commands are

    @@ -254,7 +254,7 @@

    Key Slurm Commandssbatch

    +

    sbatch

    The most common options for sbatch are listed here. For more details run man sbatch.

    @@ -323,7 +323,7 @@

    sbatchRun a simulation build followed by a regression

    +

    Run a simulation build followed by a regression

build_jobid=$(sbatch --parsable -c 4 --mem 4G -L vcs_build -C 'GHz:4|GHz:4.5' -t 30:0 sim-build.sh)  # --parsable prints only the job ID so it can be used in the dependency below
     if sbatch -d "afterok:$build_jobid" -c 1 --mem 100M --wait submit-regression.sh; then
         echo "Regression Passed"
    @@ -331,7 +331,7 @@ 

    Run a simulation build echo "Regression Failed" fi

    -

    srun

    +

    srun

The srun command is usually used to open a pseudo terminal on a compute node for you to run interactive jobs. It accepts most of the same options as sbatch to request CPUs, memory, and node features.

    To open up a pseudo terminal in your shell on a compute node with 4 cores and 16G of memory, execute the following command.

    @@ -349,18 +349,18 @@

    srun

    Another way to run interactive GUI jobs is to use srun's --x11 flag to enable X11 forwarding.

    srun -c 1 --mem 8G --pty --x11 emacs
     
    -

    squeue

    +

    squeue

    The squeue command shows the status of jobs.

    The output format can be customized using the --format or --Format options and you can configure the default output format using the corresponding SQUEUE_FORMAT or SQUEUE_FORMAT2 environment variables.

    squeue
     
    -

    sprio

    +

    sprio

    Use sprio to get information about a job's priority. This can be useful to figure out why a job is scheduled before or after another job.

    sprio -j10,11
     
    -

    sacct

    +

    sacct

    Display accounting information about jobs. For example, it can be used to get the requested CPU and memory and see the CPU time and memory actually used.

    sacct -o JobID,User,JobName,AllocCPUS,State,ExitCode,Elapsed,CPUTime,MaxRSS,MaxVMSize,ReqCPUS,ReqMem,SystemCPU,TotalCPU,UserCPU -j 44
    @@ -371,9 +371,9 @@ 

    sacct<

    For more information:

    man sacct
     
    -

    sreport

    +

    sreport

The sreport command can be used to generate reports from the Slurm database.

    -

    Other Slurm Commands

    +

    Other Slurm Commands

Use the man command to get information about these less commonly used Slurm commands.

    diff --git a/sitemap.xml.gz b/sitemap.xml.gz index ffd98bafa492d3003ee4e2cf916dc4f0880a703f..1e3ac6493a6d4994f4fd8a7871188149074d529b 100644 GIT binary patch delta 13 Ucmb=gXP58h;1FQ5naExN02cWJA^-pY delta 13 Ucmb=gXP58h;Mn`mdLnxT03X`~#sB~S diff --git a/soca_integration/index.html b/soca_integration/index.html index bc117c76..e32f1ae1 100644 --- a/soca_integration/index.html +++ b/soca_integration/index.html @@ -119,7 +119,7 @@
    -

    SOCA Integration

    +

    SOCA Integration

Scale Out Computing on AWS (SOCA) is an AWS solution that was the basis for the Research and Engineering Studio (RES) service. Unless you are already a SOCA user, it is highly recommended that you use RES, which is a fully supported AWS service.