PCluster 3.8 CREATE_FAILED using SharedStorage > FsxLustreSettings > FileSystemId #6353

Closed
enlznep opened this issue Jul 16, 2024 · 1 comment
enlznep commented Jul 16, 2024

During pcluster create, the cluster fails with the FSx mount error shown in the log below. Setup so far:

  • The FSx for Lustre file system is in the same VPC as the cluster
  • A SecurityGroup for Lustre was added with inbound ports 988 and 1018-1023
  • The SecurityGroup for Lustre is attached via AdditionalSecurityGroups
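
The inbound ports listed above (988 and 1018-1023) can be checked mechanically against a security group's rules. The following is a minimal Python sketch, assuming the rules have already been fetched and reduced to (from_port, to_port) pairs for TCP; the function name and the sample rule sets are hypothetical. It only helps rule networking in or out as the cause and does not touch the mount itself.

```python
# Ports the Lustre security group must allow inbound, as listed in this issue.
REQUIRED_PORTS = {988, *range(1018, 1024)}


def missing_lustre_ports(inbound_rules):
    """Return the required ports not covered by any (from_port, to_port) rule."""
    covered = set()
    for from_port, to_port in inbound_rules:
        covered.update(p for p in REQUIRED_PORTS if from_port <= p <= to_port)
    return sorted(REQUIRED_PORTS - covered)


# Hypothetical rule sets, not read from a real security group.
print(missing_lustre_ports([(988, 988), (1018, 1023)]))  # prints []
print(missing_lustre_ports([(988, 988), (1021, 1023)]))  # prints [1018, 1019, 1020]
```

An empty list means every required port is covered by at least one rule.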

Question:

  • I created the base AMI with pcluster build-image, but I needed some extra packages on it, so I ran yum update or similar. Could that be the cause of this?
Running handlers complete
[2024-07-16T13:55:25+09:00] ERROR: Exception handlers complete
Infra Phase failed. 47 resources updated in 01 minutes 16 seconds
[2024-07-16T13:55:25+09:00] FATAL: Stacktrace dumped to /etc/chef/local-mode-cache/cache/cinc-stacktrace.out
[2024-07-16T13:55:25+09:00] FATAL: ---------------------------------------------------------------------------------------
[2024-07-16T13:55:25+09:00] FATAL: PLEASE PROVIDE THE CONTENTS OF THE stacktrace.out FILE (above) IF YOU FILE A BUG REPORT
[2024-07-16T13:55:25+09:00] FATAL: ---------------------------------------------------------------------------------------
[2024-07-16T13:55:25+09:00] FATAL: Mixlib::ShellOut::ShellCommandFailed: lustre[mount fsx] (aws-parallelcluster-environment::fsx line 33) had an error: Mixlib::ShellOut::ShellCommandFailed: mount[/scratch] (aws-parallelcluster-environment::fsx line 33) had an error: Mixlib::ShellOut::ShellCommandFailed: Expected process to exit with [0], but received '19'
---- Begin output of ["mount", "-t", "lustre", "-o", "defaults,_netdev,flock,user_xattr,noatime,noauto,x-systemd.automount", "fs-****.fsx.ap-northeast-1.amazonaws.com@tcp:/1234567", "/scratch"] ----
STDOUT:
STDERR: mount.lustre: mount fs-****.fsx.ap-northeast-1.amazonaws.com@tcp:/1234567 at /scratch failed: No such device
Are the lustre modules loaded?
Check /etc/modprobe.conf and /proc/filesystems
---- End output of ["mount", "-t", "lustre", "-o", "defaults,_netdev,flock,user_xattr,noatime,noauto,x-systemd.automount", "fs-****.fsx.ap-northeast-1.amazonaws.com@tcp:/1234567", "/scratch"] ----
Ran ["mount", "-t", "lustre", "-o", "defaults,_netdev,flock,user_xattr,noatime,noauto,x-systemd.automount", "fs-****.fsx.ap-northeast-1.amazonaws.com@tcp:/1234567", "/scratch"] returned 19
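
The log above ends with exit status 19 and the hint to check /proc/filesystems. On Linux, 19 is the errno value ENODEV, which matches the "No such device" text in STDERR and usually means the kernel has no lustre filesystem type registered. A minimal Python sketch of that check, assuming the text of /proc/filesystems is passed in as a string; the function name and the sample text are illustrative and not taken from the failing node.

```python
import errno
import os


def filesystem_registered(proc_filesystems_text, fs_name="lustre"):
    """Return True if fs_name is listed in the text of /proc/filesystems."""
    for line in proc_filesystems_text.splitlines():
        fields = line.split()
        # Each line is "[nodev] <name>", so the name is always the last field.
        if fields and fields[-1] == fs_name:
            return True
    return False


# Map the exit status from the log back to its errno name and message.
print(errno.errorcode[19], "-", os.strerror(19))

# Illustrative sample without a lustre entry.
sample = "nodev\tsysfs\nnodev\tproc\n\text4\n\txfs\n"
print(filesystem_registered(sample))  # prints False
```

On the head node itself, the input would be the contents of /proc/filesystems, for example `open("/proc/filesystems").read()`.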

Required Info:

  • AWS ParallelCluster version [e.g. 3.1.1]: 3.8
  • Full cluster configuration without any credentials or personal data.
Region: ap-northeast-1
Image:
  Os: rocky8
  CustomAmi: ami-customAMI
Tags:
  - Key: Name
    Value: pcluster-101
SharedStorage:
  - MountDir: /scratch
    Name: myfsx
    StorageType: FsxLustre
    FsxLustreSettings:
      FileSystemId: fs-****
HeadNode:
  InstanceType: t2.small
  Networking:
    SubnetId: subnet-*****
    AdditionalSecurityGroups:
    - sg-****
    - sg-FOR-LUSTRE-[988,1018-1023]
  Ssh:
    KeyName: SSHKEY-001
  LocalStorage:
    RootVolume:
      Size: 100
  CustomActions:
    OnNodeStart:
      Script: s3://parallelcluster-202407/empty-file.sh
      Args:
        - test1
        - test2
        - test3
    OnNodeConfigured:
      Script: s3://parallelcluster-202407/empty-file.sh
      Args:
        - test1
        - test2
        - test3
  Iam:
    AdditionalIamPolicies:
      - Policy: arn:aws:iam::aws:policy/service-role/AmazonEC2RoleforSSM
  Image:
    CustomAmi: ami-customAMI
Scheduling:
  Scheduler: slurm
  SlurmSettings:
    ScaledownIdletime: 5
  SlurmQueues:
    - Name: queue_XYZ
      ComputeResources:
      - Name: sapphirerapids
        Instances:
        - InstanceType: t2.micro
        # - InstanceType: r7i.8xlarge
        DisableSimultaneousMultithreading: true
        MinCount: 0
        MaxCount: 10
      CustomActions:
        OnNodeStart:
          Script: s3://parallelcluster-202407/empty-file.sh
          Args:
            - test1
            - test2
            - test3
        OnNodeConfigured:
          Script: s3://parallelcluster-202407/empty-file.sh
          Args:
            - test1
            - test2
            - test3
      Iam:
        AdditionalIamPolicies:
        - Policy: arn:aws:iam::aws:policy/service-role/AmazonEC2RoleforSSM
      Networking:
        SubnetIds:
        - subnet-*****
        AssignPublicIp: false
        AdditionalSecurityGroups:
        - sg-****
        - sg-FOR-LUSTRE-[988,1018-1023]
        PlacementGroup:
          Enabled: false
      Image:
        CustomAmi: ami-customAMI
  • Cluster name: xd-parallelcluster-20240702
  • Output of pcluster describe-cluster command.
{
  "creationTime": "2024-07-16T04:46:59.816Z",
  "headNode": {
    "launchTime": "2024-07-16T04:51:15.000Z",
    "instanceId": "i-*****",
    "instanceType": "t2.small",
    "state": "running",
    "privateIpAddress": "172.21.3.117"
  },
  "version": "3.8.0",
  "clusterConfiguration": {
    "url": "https://parallelcluster-038ab13aec538019-v1-do-not-delete.s3.ap-northeast-1.amazonaws.com/parallelcluster/3.8.0/clusters/xd-parallelcluster-20240702-tba5icboyki1lt4j/configs/cluster-config.yaml?versionId=gYmKO7j7DhpvURhkjXeGUkme8SD6EU85&AWSAccessKeyId=ASIASTACKNIABG3XUTBJ&Signature=1ZG%2Bb3MVwG79GrucEGAWuhxfq%2B4%3D&x-amz-security-token=FwoGZXIvYXdzEBcaDAJlWwUB2B3RqZaEuyK%2FAY9fC2Sd8intCz%2FLtpZJsgNwC43BvrC4JRFE%2FSUSF2S2NYsF7Fnc1AaSXplSHqMRGrBolev0zCz7FRCkTqi9k1Yl%2FJkLp7JiSUawC88BcHhgvZkI2ZU1x2cSZ%2B%2BIzxT7%2FBcBcSMEhbpmJhLjStSQE6f9pPc34c4silR%2F4Sx%2B6e4OBZp86Ve7%2FF9cfYWHbgAwppVGC%2B6VKUYY0CKDfxRb%2F5JbbqXEiArkNgDiZ10%2FFDGbhz%2BbnctZ51jEhLVM%2BgDiKKiB2LQGMi0YajErSdR4BkPHiLFe7Db8dEM6jV8wpm0D9Ol1h9ZnhU0AFrtQYmp4eaQX8Rc%3D&Expires=1721110202"
  },
  "tags": [
    {
      "value": "3.8.0",
      "key": "parallelcluster:version"
    },
    {
      "value": "xd-parallelcluster-20240702",
      "key": "parallelcluster:cluster-name"
    },
    {
      "value": "xd-parallelcluster-20240702",
      "key": "Name"
    }
  ],
  "cloudFormationStackStatus": "CREATE_FAILED",
  "clusterName": "xd-parallelcluster-20240702",
  "computeFleetStatus": "UNKNOWN",
  "cloudformationStackArn": "arn:aws:cloudformation:ap-northeast-1:*******:stack/xd-parallelcluster-20240702/6fe267c0-432e-11ef-9ea3-0e6af7e9d173",
  "lastUpdatedTime": "2024-07-16T04:46:59.816Z",
  "region": "ap-northeast-1",
  "clusterStatus": "CREATE_FAILED",
  "scheduler": {
    "type": "slurm"
  },
  "failures": [
    {
      "failureCode": "FsxMountFailure",
      "failureReason": "Failed to mount FSX."
    }
  ]
}
  • [Optional] Arn of the cluster CloudFormation main stack:
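
The describe-cluster output above already names the failure in its "failures" list. A minimal Python sketch for pulling that out of the JSON, assuming the command output has been captured as text; the function name is hypothetical and the sample is trimmed to the two fields used.

```python
import json


def cluster_failures(describe_output):
    """Return (failureCode, failureReason) pairs from describe-cluster JSON text."""
    data = json.loads(describe_output)
    return [
        (failure.get("failureCode"), failure.get("failureReason"))
        for failure in data.get("failures", [])
    ]


# Trimmed copy of the output posted in this issue.
sample = (
    '{"clusterStatus": "CREATE_FAILED", "failures": '
    '[{"failureCode": "FsxMountFailure", "failureReason": "Failed to mount FSX."}]}'
)
print(cluster_failures(sample))  # prints [('FsxMountFailure', 'Failed to mount FSX.')]
```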
enlznep added the 3.x label Jul 16, 2024
enlznep changed the title from "PCluster CREATE_FAILED using SharedStorage > FsxLustreSettings > FileSystemId" to "PCluster 3.8 CREATE_FAILED using SharedStorage > FsxLustreSettings > FileSystemId" Jul 16, 2024

enlznep commented Jul 17, 2024

Resolved by recreating the AMI without updating the kernel version.
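
A likely explanation, given this fix, is that the package update installed a newer kernel for which the AMI had no Lustre client module, so the lustre filesystem type was never registered at mount time. A minimal Python sketch for checking an image before using it as CustomAmi, assuming the client module file is named lustre.ko (possibly compressed) and lives somewhere under /lib/modules/<kernel release>; the function name and that layout are assumptions, not something confirmed in this thread.

```python
import platform
from pathlib import Path


def lustre_module_present(kernel_release, modules_root="/lib/modules"):
    """Return True if a lustre.ko* file exists in the module tree for kernel_release."""
    kernel_dir = Path(modules_root) / kernel_release
    if not kernel_dir.is_dir():
        return False
    # Matches lustre.ko as well as compressed variants such as lustre.ko.xz.
    return any(kernel_dir.rglob("lustre.ko*"))


if __name__ == "__main__":
    running = platform.release()  # same string that `uname -r` reports
    print(running, lustre_module_present(running))
```

If this reports False for the running kernel, the mount would be expected to fail in the same way as in the log above.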

enlznep closed this as completed Jul 17, 2024