Slurm accounting with Amazon Aurora: slurm_persist_conn_open_without_init: failed to open persistent connection #119

Open
juz4u2me opened this issue May 24, 2024 · 1 comment

Comments

@juz4u2me

Hi, I was following the guide at https://aws.amazon.com/blogs/hpc/leveraging-slurm-accounting-in-aws-parallelcluster/ and encountered the following error:

[execute] sacctmgr: error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:ip-172-25-7-120:6819: Connection refused
		  sacctmgr: error: Sending PersistInit msg: Connection refused

================================================================================
Error executing action `run` on resource 'execute[wait for slurm database]'
================================================================================

Mixlib::ShellOut::ShellCommandFailed
------------------------------------
Expected process to exit with [0], but received '1'
---- Begin output of /opt/slurm/bin/sacctmgr show clusters -Pn ----
STDOUT: 
STDERR: sacctmgr: error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:ip-172-25-7-120:6819: Connection refused
sacctmgr: error: Sending PersistInit msg: Connection refused
---- End output of /opt/slurm/bin/sacctmgr show clusters -Pn ----
Ran /opt/slurm/bin/sacctmgr show clusters -Pn returned 1

Resource Declaration:
---------------------
# In /etc/chef/local-mode-cache/cache/cookbooks/aws-parallelcluster-slurm/recipes/config/config_slurm_accounting.rb

 60: execute "wait for slurm database" do
 61:   command "#{node['cluster']['slurm']['install_dir']}/bin/sacctmgr show clusters -Pn"
 62:   retries node['cluster']['slurmdbd_response_retries']
 63:   retry_delay 10
 64: end unless on_docker?
 65: 
 66: bash "bootstrap slurm database" do
 67:   user 'root'
 68:   group 'root'
 69:   code <<-BOOTSTRAP
 70:     SACCTMGR_CMD=#{node['cluster']['slurm']['install_dir']}/bin/sacctmgr
 71:     CLUSTER_NAME=#{node['cluster']['stack_name']}
 72:     DEF_ACCOUNT=pcdefault
 73:     SLURM_USER=#{node['cluster']['slurm']['user']}
 74:     DEF_USER=#{node['cluster']['cluster_user']}
 75: 
 76:     # Add cluster to database if it is not present yet
 77:     [[ $($SACCTMGR_CMD show clusters -Pn cluster=$CLUSTER_NAME | grep $CLUSTER_NAME) ]] || \
 78:         $SACCTMGR_CMD -iQ add cluster $CLUSTER_NAME
 79: 
 80:     # Add account-cluster association to database if it is not present yet
 81:     [[ $($SACCTMGR_CMD list associations -Pn cluster=$CLUSTER_NAME account=$DEF_ACCOUNT format=account | grep $DEF_ACCOUNT) ]] || \
 82:         $SACCTMGR_CMD -iQ add account $DEF_ACCOUNT Cluster=$CLUSTER_NAME \
 83:             Description="ParallelCluster default account" Organization="none"
 84: 
 85:     # Add user-account associations to database if they are not present yet
 86:     [[ $($SACCTMGR_CMD list associations -Pn cluster=$CLUSTER_NAME account=$DEF_ACCOUNT user=$SLURM_USER format=user | grep $SLURM_USER) ]] || \
 87:         $SACCTMGR_CMD -iQ add user $SLURM_USER Account=$DEF_ACCOUNT AdminLevel=Admin
 88:     [[ $($SACCTMGR_CMD list associations -Pn cluster=$CLUSTER_NAME account=$DEF_ACCOUNT user=$DEF_USER format=user | grep $DEF_USER) ]] || \
 89:         $SACCTMGR_CMD -iQ add user $DEF_USER Account=$DEF_ACCOUNT AdminLevel=Admin
 90: 
 91:     # sacctmgr might throw errors if the DEF_ACCOUNT is not associated to a cluster already defined on the database.
 92:     # This is not important for the scope of this script, so we return 0.
 93:     exit 0
 94:   BOOTSTRAP
 95: end unless on_docker?

Compiled Resource:
------------------
# Declared in /etc/chef/local-mode-cache/cache/cookbooks/aws-parallelcluster-slurm/recipes/config/config_slurm_accounting.rb:60:in `from_file'

execute("wait for slurm database") do
  action [:run]
  default_guard_interpreter :execute
  command "/opt/slurm/bin/sacctmgr show clusters -Pn"
  declared_type :execute
  cookbook_name "aws-parallelcluster-slurm"
  recipe_name "config_slurm_accounting"
  retries 30
  retry_delay 10
end

System Info:
------------
chef_version=18.2.7
platform=amazon
platform_version=2
ruby=ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [x86_64-linux]
program_name=/bin/cinc-client
executable=/opt/cinc/bin/cinc-client
@mhuguesaws
Collaborator

The author, @mwvaughn, can probably help.
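
For anyone debugging the same trace: `Connection refused` on port 6819 (the default slurmdbd port) typically means nothing was accepting connections at that address, for example because slurmdbd did not start. The `wait for slurm database` resource in the trace simply reruns `sacctmgr show clusters -Pn` with `retries 30` and `retry_delay 10`, so it fails once those attempts run out. Below is a minimal Ruby sketch, assuming only the host and port shown in the log, that checks whether the port is reachable and polls on the same schedule. The helper names `port_open?` and `wait_for_port` are hypothetical and are not part of the ParallelCluster cookbook.

```ruby
require 'socket'

# Hypothetical helper: true if a TCP connection to host:port can be
# opened within `timeout` seconds, false on refusal, timeout, or DNS failure.
def port_open?(host, port, timeout: 2)
  Socket.tcp(host, port, connect_timeout: timeout) { true }
rescue SystemCallError, SocketError, IOError
  false
end

# Hypothetical helper: polls until the port accepts connections.
# Makes one initial attempt plus up to `retries` further attempts,
# sleeping `retry_delay` seconds between them. Returns true on success.
def wait_for_port(host, port, retries: 30, retry_delay: 10)
  (retries + 1).times do |attempt|
    return true if port_open?(host, port)
    sleep(retry_delay) if attempt < retries
  end
  false
end

# Example call, using the host and port from the error message above:
#   wait_for_port('ip-172-25-7-120', 6819)
```

If `port_open?` returns false when run on the head node itself, the problem is more likely with the slurmdbd service than with network access between nodes. This only tests TCP reachability; it does not verify that slurmdbd can talk to the Aurora database behind it.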
