
ParallelCluster 3.10.1 fails to set up accounting for Slurm cluster (port 6819 unreachable) #6398

Closed
ElDeveloper opened this issue Aug 16, 2024 · 2 comments

Required Info:

  • AWS ParallelCluster version: 3.10.1
  • Full cluster configuration (without any credentials or personal data):
Region: us-east-1
Image:
  Os: rhel8
HeadNode:
  InstanceType: t2.large
  Networking:
    SubnetId: subnet-xxxxxx
  Ssh:
    KeyName: personal-login
  Iam:
    S3Access:
      - BucketName: xxxxxx
        EnableWriteAccess: true
    AdditionalIamPolicies:
      - Policy: arn:aws:iam::aws:policy/AmazonS3FullAccess
      - Policy: arn:aws:iam::aws:policy/SecretsManagerReadWrite

Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: mainq
      ComputeResources:
        - Name: c52xlarge
          Instances:
            - InstanceType: c5.2xlarge
          MinCount: 0
          MaxCount: 32
        - Name: c5xlarge
          Instances:
            - InstanceType: c5.xlarge
          MinCount: 0
          MaxCount: 32
        - Name: r5a4xlarge
          Instances:
            - InstanceType: r5a.4xlarge
          MinCount: 0
          MaxCount: 2
      Networking:
        SubnetIds:
          - subnet-xxxx
      Iam:
        S3Access:
          - BucketName: xxxxxxxxx
            EnableWriteAccess: true

SharedStorage:
  - MountDir: "/scratch"
    Name: scratch
    StorageType: FsxLustre
    FsxLustreSettings:
      StorageCapacity: 1200
      DeploymentType: SCRATCH_1
  • Cluster name: clstr-a39
  • Output of the pcluster describe-cluster command:
{
  "creationTime": "2024-08-15T23:54:47.078Z",
  "headNode": {
    "launchTime": "2024-08-16T00:04:13.000Z",
    "instanceId": "i-xxxxx",
    "publicIpAddress": "xxxxx",
    "instanceType": "t2.large",
    "state": "running",
    "privateIpAddress": "xxxxx"
  },
  "version": "3.10.1",
  "clusterConfiguration": {
    "url": "xxxxxxx"
  },
  "tags": [
    {
      "value": "3.10.1",
      "key": "parallelcluster:version"
    },
    {
      "value": "clstr-a39",
      "key": "parallelcluster:cluster-name"
    }
  ],
  "cloudFormationStackStatus": "CREATE_COMPLETE",
  "clusterName": "clstr-a39",
  "computeFleetStatus": "STOPPING",
  "cloudformationStackArn": "xxxxxxx",
  "lastUpdatedTime": "2024-08-15T23:54:47.078Z",
  "region": "us-east-1",
  "clusterStatus": "CREATE_COMPLETE",
  "scheduler": {
    "type": "slurm"
  }
}

Bug description and how to reproduce:
The configuration file listed above lets me successfully create a cluster. However, when I add the following SlurmSettings section:

  SlurmSettings:
    Database:
      Uri: xxxxxx.rds.amazonaws.com:3306
      UserName: admin
      PasswordSecretArn: arn:aws:secretsmanager:xxxxxxxxx

The commands below then fail to set up accounting:

pcluster update-compute-fleet --cluster-name clstr-a39 --status STOP_REQUESTED
pcluster update-cluster -n clstr-a39 -c auto.yaml
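
For reference, an update that changes SlurmSettings requires the fleet to be fully stopped first; the state can be confirmed (a quick check, using the same cluster name) with:

pcluster describe-compute-fleet --cluster-name clstr-a39

which should report "status": "STOPPED" before the update is attempted.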

The errors that show up in the logs are:

sacctmgr: error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:ip-10-0-0-25:6819: Connection refused
sacctmgr: error: Sending PersistInit msg: Connection refused
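
For anyone triaging a similar failure: "Connection refused" on port 6819 means nothing is listening on the slurmdbd port, i.e. the accounting daemon on the head node never started or exited early. A minimal set of checks (a sketch; slurmdbd is the standard service name, and the log path is an assumption about this setup):

# on the head node
systemctl status slurmdbd              # is the accounting daemon running?
sudo ss -tlnp | grep 6819              # is anything listening on the slurmdbd port?
sudo tail -n 50 /var/log/slurmdbd.log  # database connection errors surface here

In this case the daemon was failing at startup because it could not authenticate to the database (see the resolution below).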

Attached logs:
cfn-init.log
chef-client.log
completed.log

@ElDeveloper (Author) commented:

The problem was that the password contained a # character, and Slurm failed to use the correct string: instead of pass#word it used only pass and could not connect to the database. Presumably this is because # begins a comment in Slurm configuration files such as slurmdbd.conf, so everything after it is dropped. FWIW, Secrets Manager or RDS autogenerated that password.
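
If others hit this before a fix lands, one workaround (a sketch; the RDS instance identifier is a placeholder, and the secret ARN is the redacted one from the config above) is to regenerate the password without the offending character and apply it to both the secret and the database:

# generate a password that avoids characters slurmdbd mis-parses
NEWPASS=$(aws secretsmanager get-random-password \
  --exclude-characters '#' --password-length 32 \
  --query RandomPassword --output text)

# store it in the existing secret
aws secretsmanager put-secret-value \
  --secret-id arn:aws:secretsmanager:xxxxxxxxx \
  --secret-string "$NEWPASS"

# apply the same password to the RDS instance
aws rds modify-db-instance \
  --db-instance-identifier my-slurm-db \
  --master-user-password "$NEWPASS" \
  --apply-immediately

and then re-run pcluster update-cluster so the head node picks up the new value.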

@VsniperKS commented:

I encountered the same situation. Will this bug be fixed in future versions?
