Unable to create new pcluster AMI #6358

Closed
jagga13 opened this issue Jul 18, 2024 · 2 comments

jagga13 commented Jul 18, 2024

Hello,

I am trying to create a new AWS ParallelCluster AMI based on a custom AMI. The build keeps failing even though I have granted it full IAM access to S3, KMS, and SSM. Here are the details:

AWS ParallelCluster version 3.10.1

Image config:

Region: us-west-2
Build:
  InstanceType: c5.xlarge
  ParentImage: ami-XXXXX
  Iam:
    AdditionalIamPolicies:
      - Policy: arn:aws:iam::{xxxxxxx}:policy/kms-full-access
      - Policy: arn:aws:iam::aws:policy/AmazonS3FullAccess
      - Policy: arn:aws:iam::aws:policy/AmazonSSMFullAccess
  SecurityGroupIds:
    - sg-XXXXX
  SubnetId: subnet-XXXXX
  UpdateOsPackages:
    Enabled: false

I see the following errors in CloudFormation:

The following resource(s) failed to create: [ParallelClusterImage]. 
Resource handler returned message: "Error occurred during operation 'Workflow Execution ID: 'wf-b867ea03-6bf2-4910-a834-a548aa0728d2' failed with reason: Unable to bootstrap TOE'." (RequestToken: 395ac732-4b6b-1535-c8a3-b3c7413fa788, HandlerErrorCode: GeneralServiceException)

I see the following errors in CloudWatch:

Started step ApplyBuildComponents with action ExecuteComponents
Sending command to instance to run
Running command (command id: 3ec02c84-29d5-444d-9cb1-a639c91ac362)
Waiting for command to complete (command id: 3ec02c84-29d5-444d-9cb1-a639c91ac362). Attempt number: 1.
Command failed (command id: 3ec02c84-29d5-444d-9cb1-a639c91ac362, state: Failed)

The EC2 builder instance seems to come up in a healthy state but is terminated after the failed step above. I also can't disable rollback on failure by passing in the option, possibly because the failure happens too early in the build process. Any help would be appreciated!

Thanks!

jagga13 added the 3.x label Jul 18, 2024

jagga13 commented Jul 18, 2024

I see the following corresponding error in the SSM agent logs on the instance, which might be a clue:

2024-07-18 01:31:08 INFO [ssm-agent-worker] [MessageService] [MGSInteractor] Got reply msg Id 95b23bb7-8d6b-45e2-825c-7ff1aedae581 for RunCommandResult aws.ssm.77068c10-b575-4d48-bd5b-69f726df3fdf.i-0ee71b5fa4a9c568b, starting reply thread
2024-07-18 01:31:08 INFO [ssm-agent-worker] [MessageService] [MGSInteractor] Got reply msg Id 4de7ad92-7487-42b1-8849-03c78b9e7c41 for RunCommandResult aws.ssm.77068c10-b575-4d48-bd5b-69f726df3fdf.i-0ee71b5fa4a9c568b, starting reply thread
2024-07-18 01:31:08 INFO [ssm-agent-worker] [MessageService] [MGSInteractor] started reply processing - 4de7ad92-7487-42b1-8849-03c78b9e7c41
2024-07-18 01:31:08 INFO [ssm-agent-worker] [MessageService] [MGSInteractor] Sending reply {
  "additionalInfo": {
    "agent": {
      "lang": "en-US",
      "name": "amazon-ssm-agent",
      "os": "",
      "osver": "1",
      "ver": ""
    },
    "dateTime": "2024-07-18T01:31:08.161Z",
    "runId": "",
    "runtimeStatusCounts": {
      "Failed": 1
    }
  },
  "documentStatus": "Failed",
  "documentTraceOutput": "",
  "runtimeStatus": {
    "aws:runShellScript": {
      "status": "Failed",
      "code": 1,
      "name": "aws:runShellScript",
      "output": "Waiting for Cloud-init to initialize ...\nURL 'https://ec2imagebuilder-toe-us-west-2-prod.s3.us-west-2.amazonaws.com/bootstrap_scripts/bootstrap.sh' returned HTTP status '200'\n/var/lib/amazon/ssm/i-0ee71b5fa4a9c568b/document/orchestration/77068c10-b575-4d48-bd5b-69f726df3fdf/awsrunShellScript/0.awsrunShellScript/_script.sh: line 62: /tmp/imagebuilder/TaskOrchestratorAndExecutor/bootstrap.sh: Permission denied\n{\"failureMessage\":\"Unable to bootstrap TOE\"}\n\n----------ERROR-------\nfailed to run commands: exit status 1",
      "startDateTime": "2024-07-18T01:31:07.755Z",
      "endDateTime": "2024-07-18T01:31:08.160Z",
      "outputS3BucketName": "",
      "outputS3KeyPrefix": "",
      "stepName": "",
      "standardOutput": "Waiting for Cloud-init to initialize ...\nURL 'https://ec2imagebuilder-toe-us-west-2-prod.s3.us-west-2.amazonaws.com/bootstrap_scripts/bootstrap.sh' returned HTTP status '200'\n/var/lib/amazon/ssm/i-0ee71b5fa4a9c568b/document/orchestration/77068c10-b575-4d48-bd5b-69f726df3fdf/awsrunShellScript/0.awsrunShellScript/_script.sh: line 62: /tmp/imagebuilder/TaskOrchestratorAndExecutor/bootstrap.sh: Permission denied\n{\"failureMessage\":\"Unable to bootstrap TOE\"}\n",
      "standardError": "failed to run commands: exit status 1"
    }
  }
}
2024-07-18 01:31:08 INFO [ssm-agent-worker] [MessageService] [MGSInteractor] successfully sent reply message id: 4de7ad92-7487-42b1-8849-03c78b9e7c41
2024-07-18 01:31:08 INFO [ssm-agent-worker] [MessageService] [MGSInteractor] started reply processing - 95b23bb7-8d6b-45e2-825c-7ff1aedae581
2024-07-18 01:31:11 INFO [ssm-document-worker] [77068c10-b575-4d48-bd5b-69f726df3fdf] Stop the cloudwatchlogs publisher
2024-07-18 01:31:08 INFO [ssm-agent-worker] [MessageService] [MGSInteractor] Sending reply {
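
The escaped "standardOutput" field in the reply above is easier to read once decoded. Here is a minimal Python sketch, assuming the reply JSON has been copied out of the log into a string. The function name `failed_step_output` and the trimmed `sample` payload are hypothetical and only mirror the shape of the logged reply; this is not part of the SSM agent or pcluster.

```python
import json


def failed_step_output(reply_json: str) -> dict:
    """Map each failed plugin name to its decoded standard output."""
    reply = json.loads(reply_json)
    return {
        name: step.get("standardOutput", "")
        for name, step in reply.get("runtimeStatus", {}).items()
        if step.get("status") == "Failed"
    }


# Trimmed-down payload with the same shape as the reply logged above
# (hypothetical sample, not the full log entry).
sample = json.dumps({
    "documentStatus": "Failed",
    "runtimeStatus": {
        "aws:runShellScript": {
            "status": "Failed",
            "code": 1,
            "standardOutput": (
                "line 62: /tmp/imagebuilder/TaskOrchestratorAndExecutor/"
                "bootstrap.sh: Permission denied\n"
            ),
        }
    },
})

for name, output in failed_step_output(sample).items():
    print(name)
    print(output, end="")
```

With the newlines decoded, the "Permission denied" line on bootstrap.sh stands out as the actual failure behind "Unable to bootstrap TOE".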

jagga13 commented Jul 18, 2024

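Please disregard. This turned out to be a documented issue with /tmp being mounted with the noexec option, which is consistent with the "Permission denied" error on /tmp/imagebuilder/TaskOrchestratorAndExecutor/bootstrap.sh in the log above. After fixing /tmp, I was able to build the AMI successfully.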
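
For anyone hitting the same error, here is a rough way to check a parent AMI for a noexec /tmp before starting a build. It is a sketch only: it assumes a Linux instance where mounts are listed in the /proc/mounts format, and the helper names `mount_options` and `is_noexec` are made up for illustration. It does not show how /tmp was fixed in this case; one common approach (an assumption, not taken from this thread) is to remount /tmp with exec or drop noexec from its /etc/fstab entry.

```python
from typing import List, Optional


def mount_options(mounts_text: str, mount_point: str) -> Optional[List[str]]:
    """Return the options of the last entry mounted at mount_point, or None."""
    options = None
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts fields: device, mount point, fs type, options, dump, pass
        if len(fields) >= 4 and fields[1] == mount_point:
            options = fields[3].split(",")  # later mounts shadow earlier ones
    return options


def is_noexec(mounts_text: str, mount_point: str = "/tmp") -> bool:
    """True if mount_point has its own mount entry carrying the noexec option."""
    options = mount_options(mounts_text, mount_point)
    return options is not None and "noexec" in options


# Hypothetical sample in /proc/mounts format; on a real instance, pass
# open("/proc/mounts").read() instead.
sample_mounts = (
    "/dev/nvme0n1p1 / xfs rw,noatime 0 0\n"
    "tmpfs /tmp tmpfs rw,nosuid,nodev,noexec,relatime 0 0\n"
)

print(is_noexec(sample_mounts))  # True
```

Note that if /tmp is not a separate mount, this returns False even when the root filesystem itself is mounted noexec, so checking "/" as well may be worthwhile.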

jagga13 closed this as completed Jul 18, 2024