Unable to bootstrap pcluster-3.10.1 on Rocky Linux 9.4 #6371

Closed
rmarable-flaretx opened this issue Jul 29, 2024 · 5 comments

@rmarable-flaretx

We are unable to bootstrap a custom Rocky Linux 9.4 AMI using ParallelCluster 3.10.1.

Here is the cfn-init log stream:

    {
      "message": "2024-07-29 14:07:13,212 [ERROR] Error encountered during build of chefConfig: Command chef failed",
      "timestamp": "2024-07-29T14:07:13.212Z"
    },
    {
      "message": "Traceback (most recent call last):\n  File \"/opt/parallelcluster/pyenv/versions/3.9.19/envs/cfn_bootstrap_virtualenv/lib/python3.9/site-packages/cfnbootstrap/construction.py\", line 579, in run_config\n    CloudFormationCarpenter(config, self._auth_config, self.strict_mode).build(worklog)\n  File \"/opt/parallelcluster/pyenv/versions/3.9.19/envs/cfn_bootstrap_virtualenv/lib/python3.9/site-packages/cfnbootstrap/construction.py\", line 277, in build\n    changes['commands'] = CommandTool().apply(\n  File \"/opt/parallelcluster/pyenv/versions/3.9.19/envs/cfn_bootstrap_virtualenv/lib/python3.9/site-packages/cfnbootstrap/command_tool.py\", line 127, in apply\n    raise ToolError(u\"Command %s failed\" % name)",
      "timestamp": "2024-07-29T14:07:13.212Z"
    },
    {
      "message": "cfnbootstrap.construction_errors.ToolError: Command chef failed",
      "timestamp": "2024-07-29T14:07:13.212Z"
    },
    {
      "message": "2024-07-29 14:07:13,296 [ERROR] -----------------------BUILD FAILED!------------------------",
      "timestamp": "2024-07-29T14:07:13.296Z"
    },
    {
      "message": "2024-07-29 14:07:13,296 [ERROR] Unhandled exception during build: Command chef failed",
      "timestamp": "2024-07-29T14:07:13.296Z"
    },
    {
      "message": "Traceback (most recent call last):\n  File \"/opt/parallelcluster/pyenv/versions/3.9.19/envs/cfn_bootstrap_virtualenv/bin/cfn-init\", line 181, in <module>\n    worklog.build(metadata, configSets, strict_mode)\n  File \"/opt/parallelcluster/pyenv/versions/3.9.19/envs/cfn_bootstrap_virtualenv/lib/python3.9/site-packages/cfnbootstrap/construction.py\", line 137, in build\n    Contractor(metadata, strict_mode).build(configSets, self)\n  File \"/opt/parallelcluster/pyenv/versions/3.9.19/envs/cfn_bootstrap_virtualenv/lib/python3.9/site-packages/cfnbootstrap/construction.py\", line 567, in build\n    self.run_config(config, worklog)\n  File \"/opt/parallelcluster/pyenv/versions/3.9.19/envs/cfn_bootstrap_virtualenv/lib/python3.9/site-packages/cfnbootstrap/construction.py\", line 579, in run_config\n    CloudFormationCarpenter(config, self._auth_config, self.strict_mode).build(worklog)\n  File \"/opt/parallelcluster/pyenv/versions/3.9.19/envs/cfn_bootstrap_virtualenv/lib/python3.9/site-packages/cfnbootstrap/construction.py\", line 277, in build\n    changes['commands'] = CommandTool().apply(\n  File \"/opt/parallelcluster/pyenv/versions/3.9.19/envs/cfn_bootstrap_virtualenv/lib/python3.9/site-packages/cfnbootstrap/command_tool.py\", line 127, in apply\n    raise ToolError(u\"Command %s failed\" % name)",
      "timestamp": "2024-07-29T14:07:13.296Z"
    },
    {
      "message": "cfnbootstrap.construction_errors.ToolError: Command chef failed",
      "timestamp": "2024-07-29T14:07:13.296Z"
    }

From the system-messages log stream:

    {
      "message": "Jul 29 14:07:23 ip-10-2-34-41 cloud-init[1084]: + /opt/parallelcluster/pyenv/versions/3.9.19/envs/cfn_bootstrap_virtualenv/bin/cfn-signal --exit-code=1 '--reason=Failed to run chef recipe aws-parallelcluster-slurm::config_munge_key line 27. Please check /var/log/chef-client.log in the head node, or check the chef-client.log in CloudWatch logs. Please refer to https://docs.aws.amazon.com/parallelcluster/latest/ug/troubleshooting-v3.html for more details.' 'https://cloudformation-waitcondition-us-east-2.s3.us-east-2.amazonaws.com/arn%3Aaws%3Acloudformation%3Aus-east-2%3A227394971585%3Astack/darius/3a0f8320-4db1-11ef-a95c-0a041a247431/3a117ef0-4db1-11ef-a95c-0a041a247431/HeadNodeWaitConditionHandle20240729134822?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Date=20240729T134828Z&X-Amz-SignedHeaders=host&X-Amz-Expires=86399&X-Amz-Credential=AKIAVRFIPK6PEIG2DZWK%2F20240729%2Fus-east-2%2Fs3%2Faws4_request&X-Amz-Signature=a7a1c96d932fa315e993bee2c2909d6ed8bbe74aa1377a0d97b064a6961a15fc' --region us-east-2 --url https://cloudformation.us-east-2.amazonaws.com",
      "timestamp": "2024-07-29T14:07:23.000Z"
    },

From the chef-client log:

    {
      "message": "    \n    ================================================================================\n    Error executing action `restart` on resource 'service[munge]'\n    ================================================================================\n    \n    Mixlib::ShellOut::ShellCommandFailed\n    ------------------------------------\n    Expected process to exit with [0], but received '1'\n    ---- Begin output of [\"/bin/systemctl\", \"--system\", \"restart\", \"munge\"] ----\n    STDOUT: \n    STDERR: Job for munge.service failed because the control process exited with error code.\n    See \"systemctl status munge.service\" and \"journalctl -xeu munge.service\" for details.\n    ---- End output of [\"/bin/systemctl\", \"--system\", \"restart\", \"munge\"] ----\n    Ran [\"/bin/systemctl\", \"--system\", \"restart\", \"munge\"] returned 1\n    \n    Resource Declaration:\n    ---------------------\n    # In /etc/chef/local-mode-cache/cache/cookbooks/aws-parallelcluster-slurm/resources/munge_key_manager.rb\n    \n     27:   declare_resource(:service, \"munge\") do\n     28:     supports restart: true\n     29:     action :restart\n     30:     retries 5\n     31:     retry_delay 10\n     32:   end unless on_docker?\n     33: end\n     34: \n    \n    Compiled Resource:\n    ------------------\n    # Declared in /etc/chef/local-mode-cache/cache/cookbooks/aws-parallelcluster-slurm/resources/munge_key_manager.rb:27:in `restart_munge_service'\n    \n    service(\"munge\") do\n      action [:restart]\n      default_guard_interpreter :default\n      declared_type :service\n      cookbook_name \"aws-parallelcluster-slurm\"\n      recipe_name \"config_munge_key\"\n      supports {:restart=>true}\n      retries 5\n      retry_delay 10\n    end\n    \n    System Info:\n    ------------\n    chef_version=18.4.12\n    platform=rocky\n    platform_version=9.4\n    ruby=ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [x86_64-linux]\n    program_name=/bin/cinc-client\n    executable=/opt/cinc/bin/cinc-client\n    ",
      "timestamp": "2024-07-29T14:07:13.246Z"
    },
    {
      "message": "[2024-07-29T14:07:13+00:00] INFO: Running queued delayed notifications before re-raising exception\n",
      "timestamp": "2024-07-29T14:07:13.246Z"
    },
    {
      "message": "Running handlers:",
      "timestamp": "2024-07-29T14:07:13.246Z"
    },
    {
      "message": "[2024-07-29T14:07:13+00:00] ERROR: Running exception handlers\n  - WriteChefError::WriteHeadNodeChefError",
      "timestamp": "2024-07-29T14:07:13.246Z"
    },
    {
      "message": "Running handlers complete",
      "timestamp": "2024-07-29T14:07:13.246Z"
    },
    {
      "message": "[2024-07-29T14:07:13+00:00] ERROR: Exception handlers complete",
      "timestamp": "2024-07-29T14:07:13.246Z"
    },
    {
      "message": "Infra Phase failed. 64 resources updated in 01 minutes 09 seconds",
      "timestamp": "2024-07-29T14:07:13.246Z"
    },
    {
      "message": "[2024-07-29T14:07:13+00:00] FATAL: Stacktrace dumped to /etc/chef/local-mode-cache/cache/cinc-stacktrace.out",
      "timestamp": "2024-07-29T14:07:13.247Z"
    },
    {
      "message": "[2024-07-29T14:07:13+00:00] FATAL: ---------------------------------------------------------------------------------------",
      "timestamp": "2024-07-29T14:07:13.247Z"
    },
    {
      "message": "[2024-07-29T14:07:13+00:00] FATAL: PLEASE PROVIDE THE CONTENTS OF THE stacktrace.out FILE (above) IF YOU FILE A BUG REPORT",
      "timestamp": "2024-07-29T14:07:13.247Z"
    },
    {
      "message": "[2024-07-29T14:07:13+00:00] FATAL: ---------------------------------------------------------------------------------------",
      "timestamp": "2024-07-29T14:07:13.247Z"
    },
    {
      "message": "[2024-07-29T14:07:13+00:00] FATAL: Mixlib::ShellOut::ShellCommandFailed: service[munge] (aws-parallelcluster-slurm::config_munge_key line 27) had an error: Mixlib::ShellOut::ShellCommandFailed: Expected process to exit with [0], but received '1'",
      "timestamp": "2024-07-29T14:07:13.247Z"
    },
    {
      "message": "---- Begin output of [\"/bin/systemctl\", \"--system\", \"restart\", \"munge\"] ----",
      "timestamp": "2024-07-29T14:07:13.247Z"
    },
    {
      "message": "STDOUT: ",
      "timestamp": "2024-07-29T14:07:13.247Z"
    },
    {
      "message": "STDERR: Job for munge.service failed because the control process exited with error code.",
      "timestamp": "2024-07-29T14:07:13.247Z"
    },
    {
      "message": "See \"systemctl status munge.service\" and \"journalctl -xeu munge.service\" for details.",
      "timestamp": "2024-07-29T14:07:13.247Z"
    },
    {
      "message": "---- End output of [\"/bin/systemctl\", \"--system\", \"restart\", \"munge\"] ----",
      "timestamp": "2024-07-29T14:07:13.247Z"
    },
    {
      "message": "Ran [\"/bin/systemctl\", \"--system\", \"restart\", \"munge\"] returned 1",
      "timestamp": "2024-07-29T14:07:17.561Z"
    }

We can't get into the head node so unfortunately we are unable to provide the log files referenced above.

For now, we are dropping back to Rocky Linux 8.

Any guidance you can provide would be appreciated.

@hanwen-pcluste
Contributor

Sorry for the late reply.

This error seems to be related to #6378.

@rmarable-flaretx
Author

The munge key issue referred to in #6378 has been fixed, but Rocky Linux 9 clusters are still failing.

    Recipe: aws-parallelcluster-slurm::config_munge_key
      * munge_key_manager[manage_munge_key] action setup_munge_key[2024-08-27T14:25:47+00:00] INFO: Processing munge_key_manager[manage_munge_key] action setup_munge_key (aws-parallelcluster-slurm::config_munge_key line 73)
     (up to date)
      * execute[fetch_and_decode_munge_key] action run[2024-08-27T14:25:47+00:00] INFO: Processing execute[fetch_and_decode_munge_key] action run (aws-parallelcluster-slurm::config_munge_key line 66)

    [execute] Fetching munge key from AWS Secrets Manager: arn:aws:secretsmanager:us-east-2:[redacted]:secret:munge-key-blah-blah-blah
              Created symlink /etc/systemd/system/multi-user.target.wants/munge.service → /usr/lib/systemd/system/munge.service.
              Restarting munge service
              Job for munge.service failed because the control process exited with error code.
              See "systemctl status munge.service" and "journalctl -xeu munge.service" for details.
    
    ================================================================================
    Error executing action `run` on resource 'execute[fetch_and_decode_munge_key]'
    ================================================================================
    
    Mixlib::ShellOut::ShellCommandFailed
    ------------------------------------
    Expected process to exit with [0], but received '1'
    ---- Begin output of //opt/parallelcluster/scripts/slurm/update_munge_key.sh -d ----
    STDOUT: Fetching munge key from AWS Secrets Manager: arn:aws:secretsmanager:us-east-2:[redacted]:secret:munge-key-blah-blah-blah
    Restarting munge service
    STDERR: Created symlink /etc/systemd/system/multi-user.target.wants/munge.service → /usr/lib/systemd/system/munge.service.
    Job for munge.service failed because the control process exited with error code.
    See "systemctl status munge.service" and "journalctl -xeu munge.service" for details.
    ---- End output of //opt/parallelcluster/scripts/slurm/update_munge_key.sh -d ----
    Ran //opt/parallelcluster/scripts/slurm/update_munge_key.sh -d returned 1
    
    Resource Declaration:
    ---------------------
    # In /etc/chef/local-mode-cache/cache/cookbooks/aws-parallelcluster-slurm/resources/munge_key_manager.rb
    
     66:   declare_resource(:execute, 'fetch_and_decode_munge_key') do
     67:     user 'root'
     68:     group 'root'
     69:     command "/#{node['cluster']['scripts_dir']}/slurm/update_munge_key.sh -d"
     70:   end
     71: end
    
    Compiled Resource:
    ------------------
    # Declared in /etc/chef/local-mode-cache/cache/cookbooks/aws-parallelcluster-slurm/resources/munge_key_manager.rb:66:in `fetch_and_decode_munge_key'
    
    execute("fetch_and_decode_munge_key") do
      action [:run]
      default_guard_interpreter :execute
      command "//opt/parallelcluster/scripts/slurm/update_munge_key.sh -d"
      declared_type :execute
      cookbook_name "aws-parallelcluster-slurm"
      recipe_name "config_munge_key"
      user "root"
      group "root"
    end
    
    System Info:
    ------------
    chef_version=18.4.12
    platform=rocky
    platform_version=9.4
    ruby=ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [x86_64-linux]
    program_name=/bin/cinc-client
    executable=/opt/cinc/bin/cinc-client

More logs:

    [2024-08-27T14:25:49+00:00] ERROR: Running exception handlers
      - WriteChefError::WriteHeadNodeChefError

And more:

    [2024-08-27T14:25:49+00:00] FATAL: Mixlib::ShellOut::ShellCommandFailed: execute[fetch_and_decode_munge_key] (aws-parallelcluster-slurm::config_munge_key line 66) had an error: Mixlib::ShellOut::ShellCommandFailed: Expected process to exit with [0], but received '1'

So to reiterate, this works with Rocky 8 but NOT with Rocky 9.
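
In both log excerpts the failing step is the munge restart itself, and the output points to `systemctl status munge.service` and `journalctl -xeu munge.service` for the actual reason. One further check that may help on a node that can be reached is to print the mode and owner of the munge key and of every directory above it, since a daemon that guards a secret key can be sensitive to loose permissions along that path. This is only a sketch: `/etc/munge/munge.key` is an assumed key location (it does not appear in the logs above), `stat -c` and `readlink -f` assume GNU coreutils, and `show_path_modes` and `KEY_PATH` are hypothetical names.

```shell
#!/bin/bash
# Diagnostic sketch: list mode, owner and path for a file and each parent
# directory up to "/". Run as root, because /etc/munge is usually not
# readable by other users.

show_path_modes() {
    local p
    # Resolve to an absolute path so the loop below always terminates at "/"
    p="$(readlink -f "$1")" || return 1
    while :; do
        stat -c '%a %U:%G %n' "$p" || return 1
        [ "$p" = "/" ] && break
        p="$(dirname "$p")"
    done
}

# Assumed default key location; override with KEY_PATH if it differs
KEY_PATH="${KEY_PATH:-/etc/munge/munge.key}"
if [ -e "$KEY_PATH" ]; then
    show_path_modes "$KEY_PATH"
else
    echo "$KEY_PATH not found (or not readable by this user)" >&2
fi
```

Any directory in the output that is writable by group or other is a candidate cause worth comparing against a working Rocky 8 node.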

@JamesDavidson13

Hi @rmarable-flaretx,

I was able to resolve the issue on Rocky 9.5 with ParallelCluster 3.11 by adjusting the permissions of the /etc directory.

I created a bash script containing the following command:

    sudo chmod 0755 /etc

Then, I updated the pcluster config-file.yml to include this script in both the HeadNode and SlurmQueues sections under CustomActions:

    CustomActions:
      OnNodeStart:
        Script:

For reference, here is the documentation:
https://github.com/aws/aws-parallelcluster/wiki/(3.9.0%E2%80%90current)-Cluster-creation-fails-on-Rocky-9.4
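
A slightly more defensive version of that script could look like the sketch below. It is an illustration, not the script from the wiki page: `ensure_mode`, `TARGET_DIR` and `TARGET_MODE` are hypothetical names, `stat -c` assumes GNU coreutils, and the script assumes it runs as root (which is why it calls `chmod` without `sudo`). It changes the mode only when the current one differs, and logs what it did so the result shows up in the node's bootstrap output.

```shell
#!/bin/bash
# Sketch of an OnNodeStart script that restores the expected mode on /etc
# before the ParallelCluster cookbooks restart munge.

# ensure_mode DIR MODE: chmod DIR to MODE only if its current mode differs.
# MODE is compared as printed by stat, so pass 755 rather than 0755.
ensure_mode() {
    local dir="$1" want="$2" have
    have="$(stat -c '%a' "$dir")" || return 1
    if [ "$have" = "$want" ]; then
        echo "$dir already has mode $want"
        return 0
    fi
    echo "changing $dir from mode $have to $want"
    chmod "$want" "$dir"
}

# Defaults match the fix described above; the variables exist only so the
# function can be tried on a scratch directory first.
ensure_mode "${TARGET_DIR:-/etc}" "${TARGET_MODE:-755}"
```

The script still has to be uploaded somewhere the nodes can fetch it and referenced from the `Script` key under `CustomActions` / `OnNodeStart` for both the head node and the queues, as described above.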

@rmarable-flaretx
Author

rmarable-flaretx commented Oct 24, 2024

Hi @JamesDavidson13, thanks for the feedback!

Changing the permissions on /etc using an OnNodeStart custom action did the trick.

@rmarable-flaretx
Author

Applying the suggested fixes outlined on https://github.com/aws/aws-parallelcluster/wiki/(3.9.0%E2%80%90current)-Cluster-creation-fails-on-Rocky-9.4 resolved this issue.
