Home
hgreebe edited this page Nov 18, 2024 · 149 revisions
Welcome to the AWS ParallelCluster Wiki
- Upgrade NVIDIA GPU Drivers on a cluster
- Upgrade the OpenPMIx package on a Slurm cluster managed with AWS ParallelCluster
- Upgrade Slurm in an AWS ParallelCluster cluster
- Interactive Jobs with qlogin, qrsh (sge) or srun (slurm)
- Deprecation of SGE and Torque in ParallelCluster
- Transition from SGE to SLURM
- How to enable slurmrestd on ParallelCluster
- How to set up Public Private Networking
- Open MPI Install from Source and Uninstall
- Git Pull Request Instructions
- Use ED25519 Keys with Ubuntu 22.04
- Using a Multi-NIC instance as single NIC
- ParallelCluster: Launching a Login Node
- Launch instances with ODCR (On-Demand-Capacity-Reservations)
- Configuring all_or_nothing_batch launches
- MultiUser Support
- ParallelCluster Awesomeness
- Self patch a Cluster Used for Submitting Multi node Parallel Jobs through AWS Batch
- AWS Batch with a custom Dockerfile
- Use an Existing Elastic IP
- Create cluster with encrypted root volumes
- How to use a native NICE DCV Client
- Create Ubuntu AMI with Unattended Upgrades disabled
- Update cluster when snapshot associated to EBS volume is deleted
- Installing Alternate CUDA Versions on AWS ParallelCluster
- (3.11.x) Job submission failure with Amazon Linux 2023
- (3.9.1 ‐ latest) Speculative Return Stack Overflow (SRSO) mitigations introducing potential performance impact on some AMD processors
- (3.11.0) Job submission failure caused by race condition in Pyxis configuration
- (3.9.0‐current) Cluster creation fails on Rocky 9.4
- (3.9.0‐3.10.1) Cluster update intermittently fails because some compute nodes don’t execute update procedure
- (3.8.0+) Newer Linux kernels are no longer compatible with EFA and closed Source Nvidia drivers in instances with GPU Direct RDMA support
- (3.8.0 ‐ 3.9.3) ParallelCluster Build Image Failing during Installation of Minitar Ruby Gem Dependency
- (3.10.0) Build image fails in China regions
- (3.9.0‐3.9.1) Default ThreadsPerCore Slurm setting causes reduced CPU utilization
- (3.8.0-3.9.1) SharedStorageType: Efs not working on arm instances
- (3.3.0‐3.9.0) Potential data loss issue when removing storage with update‐cluster in AWS ParallelCluster 3.3.0‐3.9.0
- (3.4.0-3.9.0) Updating a cluster to include an EFS fs with encryption in transit fails
- (3.8.0-3.9.0) Slurmd Does not Start with EFS SharedStorageType on reboot
- (3.9.0-latest) SSH bootstrap cannot launch processes on remote host when using Intel MPI with Slurm 23.11
- (3.0.0-latest) Build image CloudFormation stacks fail to delete after images are successfully built
- (3.0.0-3.8.0) Interactive job submission through srun can fail after increasing the number of compute nodes in the cluster
- (3.0.0-3.7.2) Cluster update rollback can fail when modifying the list of instance types declared in the Compute Resources
- (3.6.0‐3.6.1) Slurm NodeHostName and NodeAddr mismatch for MultiNIC instance when managed DNS is disabled and EC2 Hostnames are used
- (3.6.0) NVIDIA GPU nodes fail to start with custom AMI built from DLAMI
- (3.0.0-3.6.0) Ptrace_scope not disabled for Ubuntu compute nodes
- (3.0.0-3.6.0) Compute Nodes Belonging To More Than One Partition Causes Compute Scaling To Overscale
- (3.2.0-3.5.1) GPU nodes not coming back online after `scontrol reboot`
- (3.0.0-3.5.1) ParallelCluster CLI raises exception “module 'flask.json' has no attribute 'JSONEncoder'”
- (3.3.0-3.5.1) Cluster updates can break Slurm accounting functionality
- (3.3.0-3.5.0) Update cluster to remove shared EBS volumes can potentially cause node launching failures
- (3.0.0-3.5.0) DCV virtual session on Ubuntu 20.04 might show a black screen
- (3.3.0-3.4.1) Custom AMI creation fails on Ubuntu 20.04 during MySQL packages installation
- (3.3.0-3.4.0) Slurm cluster NodeName and NodeAddr mismatch after cluster scaling
- (3.0.0-3.2.1) Running nodes might be mistakenly replaced when new jobs are scheduled
- (3.0.0-3.2.1) ParallelCluster API cannot create new cluster
- (3.1.x) Termination of idle dynamic compute nodes potentially broken after performing a cluster update
- (3.0.0-3.1.4) ParallelCluster API Stack Upgrade Fails for ECR resources
- (3.0.0-3.1.4) Unable to perform cluster update when using API or documented user policies
- (3.0.0-3.1.3) Unable to create cluster or custom image when using API or CLI with documented user policies
- (3.0.0-3.1.3) AWSBatch Multi node Parallel jobs fail if no EBS defined in cluster
- (3.1.1-3.1.2) Profiles not loaded when connected through NICE DCV session
- (3.0.0-3.1.3) build image creates invalid images when using aws-cdk.aws-imagebuilder==1.153
- (3.0.0 and later) build image stack deletion failed after image successfully created
- (3.1.1) Issue with clusters in isolated networks
- (3.0.0) Cluster scaling fails after a head node reboot on Ubuntu 18.04 and Ubuntu 20.04
- (3.0.0) Deleting API Infrastructure produces CFN Stacks failure
- (2.2.1-3.3.0) Risk of deletion of managed FSx for Lustre file system when updating a cluster
- (3.0.2 / 2.11.3 and earlier) Possible performance degradation due to Log4j CVE-2021-44228 hotpatch service on Amazon Linux 2
- (2.10.1-2.11.2 and 3.0.0) Custom AMI creation (`pcluster createami` or `pcluster build-image`) fails with ARM architecture
- (3.0.2 / 2.11.3 and earlier) Custom AMI creation fails for centos7 and ubuntu1804 (issue started on 12/8/2021, resolved on 1/20/2022)
- (2.8.0-2.10.1) Configuration validation failure: architecture of AMI and instance type does not match
- (2.10.0) Issue with CentOS 8 Custom AMI creation
- (2.5.0-2.10.0) Issue with Ubuntu 18.04 Custom AMI creation
- (2.10.1-2.10.2) Issue running Ubuntu 18 ARM AMI on first generation AWS Graviton instances
- (2.10.1-2.10.2) P4d support on Amazon Linux 1
- (2.6.0-2.10.3) Custom AMI creation (`pcluster createami`) fails
- (2.9.1 and earlier) Custom AMI creation (`pcluster createami`) fails
- (2.10.0 and earlier) Cluster creation fails if `enable_intel_hpc_platform=true` is in the configuration file
- (2.10.4 and earlier) Batch cluster creation fails in China regions
- (2.11.0) Possible performance degradation on Amazon Linux 2 when enabling CloudWatch Logging
- (2.10.0-2.11.1) NVIDIA Fabric Manager stops running on Ubuntu 18.04 and Ubuntu 20.04
- (2.11.2 and earlier) Custom AMI creation (pcluster createami) fails when building SGE
- (2.11.4) DCV Connection Through Web Browsers Does Not Work
- (2.10.0-2.11.4) Numeric tag values interpreted as integers instead of strings may cause a value error in the compute resource launch template
- (2.11.7 and earlier) Cluster creation fails with awsbatch scheduler