Ubuntu Slurm Cgroups #5

Closed
griffji opened this issue Oct 20, 2021 · 8 comments

@griffji

griffji commented Oct 20, 2021

Hello,

I'm running your playbook on Ubuntu 20.04 LTS in AWS and I'm getting the following errors. Could you provide some guidance on how to resolve them?

TASK [slurm : Enable and start cgroup services if this is a worker_node] ****************************************************************************************************************
task path: /Users/jimmy.griffin/Desktop/AnsibleDevProject/slurm5/slurm/tasks/slurm-worker.yml:67
failed: [c1] (item=cgred) => {"ansible_loop_var": "item", "changed": false, "item": "cgred", "msg": "Could not find the requested service cgred: host"}
failed: [c2] (item=cgred) => {"ansible_loop_var": "item", "changed": false, "item": "cgred", "msg": "Could not find the requested service cgred: host"}
failed: [c1] (item=cgconfig) => {"ansible_loop_var": "item", "changed": false, "item": "cgconfig", "msg": "Could not find the requested service cgconfig: host"}
failed: [c2] (item=cgconfig) => {"ansible_loop_var": "item", "changed": false, "item": "cgconfig", "msg": "Could not find the requested service cgconfig: host"}
META: noop
META: noop
META: noop
META: noop
META: noop

TASK [slurm : Configure slurm submit hosts] *********************************************************************************************************************************************
task path: /Users/jimmy.griffin/Desktop/AnsibleDevProject/slurm5/slurm/tasks/main.yml:187
included: /Users/jimmy.griffin/Desktop/AnsibleDevProject/slurm5/slurm/tasks/slurm-submit.yml for head

TASK [slurm : Create /etc/slurm in RedHat based systems] ********************************************************************************************************************************
task path: /Users/jimmy.griffin/Desktop/AnsibleDevProject/slurm5/slurm/tasks/slurm-submit.yml:3
skipping: [head] => {"changed": false, "skip_reason": "Conditional result was False"}

TASK [slurm : Make a symlink /etc/slurm >> /etc/slurm-llnl on Debian based systems] *****************************************************************************************************
task path: /Users/jimmy.griffin/Desktop/AnsibleDevProject/slurm5/slurm/tasks/slurm-submit.yml:12
ok: [head] => {"changed": false, "dest": "/etc/slurm", "gid": 0, "group": "root", "mode": "0777", "owner": "root", "size": 15, "src": "/etc/slurm-llnl", "state": "link", "uid": 0}

TASK [slurm : Make a symlink /var/log/slurm >> /var/log/slurm-llnl on Debian based systems] *********************************************************************************************
task path: /Users/jimmy.griffin/Desktop/AnsibleDevProject/slurm5/slurm/tasks/slurm-submit.yml:21
ok: [head] => {"changed": false, "dest": "/var/log/slurm", "gid": 0, "group": "root", "mode": "0777", "owner": "root", "size": 19, "src": "/var/log/slurm-llnl", "state": "link", "uid": 0}

TASK [slurm : Deploy /etc/slurm/slurm.conf] *********************************************************************************************************************************************
task path: /Users/jimmy.griffin/Desktop/AnsibleDevProject/slurm5/slurm/tasks/slurm-submit.yml:30
ok: [head] => {"changed": false, "checksum": "d514bce8f38cb45baf7d0cf61222c56e6965bee5", "dest": "/etc/slurm/slurm.conf", "gid": 64030, "group": "slurm", "mode": "0644", "owner": "slurm", "path": "/etc/slurm/slurm.conf", "size": 4661, "state": "file", "uid": 64030}
META: role_complete for head
META: ran handlers
META: ran handlers

PLAY RECAP ******************************************************************************************************************************************************************************
c1 : ok=17 changed=0 unreachable=0 failed=1 skipped=13 rescued=0 ignored=0
c2 : ok=17 changed=0 unreachable=0 failed=1 skipped=13 rescued=0 ignored=0

@pescobar
Member

pescobar commented Oct 22, 2021

Thanks for your feedback.

I have just realised that those cgroup services are not provided on Ubuntu. I hadn't noticed because I don't install them in the CI tests and I do my own testing on CentOS.

I have pushed a new commit to the master branch which should fix this problem by skipping this step on Debian/Ubuntu systems. Can you test the latest code in master?
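
For readers following along, skipping a step by OS family is usually done with a `when:` condition on the task. The snippet below is only a sketch of that pattern, not the commit mentioned above: the task name is borrowed from the log in this issue, and the condition on `ansible_os_family` and the `ansible.builtin` module names (Ansible 2.10 or later) are assumptions.

```yaml
# Hypothetical sketch, not the role's actual task.
- name: Enable and start cgroup services if this is a worker_node
  ansible.builtin.service:
    name: "{{ item }}"
    state: started
    enabled: true
  loop:
    - cgred
    - cgconfig
  # Assumption: these services exist only on RedHat-family hosts,
  # so Debian/Ubuntu hosts skip the task instead of failing.
  when: ansible_os_family == "RedHat"
```

With a condition like this, Ubuntu hosts would report the task as skipped rather than failing with "Could not find the requested service".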

To be honest, I don't know how the Slurm cgroup limits will work on Ubuntu. I have never verified that the CPU/memory limits defined in Slurm are applied correctly on Ubuntu systems.

@griffji
Author

griffji commented Oct 22, 2021

Hello,

Thanks for getting back to me. The output from the latest run is below.

Maybe these would help:
https://github.com/mknoxnv/ubuntu-slurm
https://blog.llandsmeer.com/tech/2020/03/02/slurm-single-instance.html

Does this work with AlmaLinux and/or Rocky Linux? I was able to get it installed on Amazon Linux, but I needed to add packages to the OS itself. I can share the details shortly.

Thanks

TASK [slurm : Make a symlink /etc/slurm >> /etc/slurm-llnl on Debian based systems] *******************************************************************************************************************************
fatal: [head]: FAILED! => {"changed": false, "gid": 0, "group": "root", "mode": "0755", "msg": "src file does not exist, use \"force=yes\" if you really want to create the link: /etc/slurm-llnl", "owner": "root", "path": "/etc/slurm", "size": 4096, "src": "/etc/slurm-llnl", "state": "directory", "uid": 0}

TASK [slurm : Configure slurm master daemon] **********************************************************************************************************************************************************************
skipping: [c1]
skipping: [c2]

TASK [slurm : Configure slurm workers] ****************************************************************************************************************************************************************************
included: /Users/jimmy.griffin/Desktop/AnsibleDevProject/Slurm6/slurm/tasks/slurm-worker.yml for c1, c2

TASK [slurm : Install slurm worker packages] **********************************************************************************************************************************************************************
ok: [c2]
ok: [c1]

TASK [slurm : Create /etc/slurm in RedHat based systems] **********************************************************************************************************************************************************
skipping: [c1]
skipping: [c2]

TASK [slurm : Make a symlink /etc/slurm >> /etc/slurm-llnl on Debian based systems] *******************************************************************************************************************************
fatal: [c2]: FAILED! => {"changed": false, "gid": 0, "group": "root", "mode": "0755", "msg": "src file does not exist, use \"force=yes\" if you really want to create the link: /etc/slurm-llnl", "owner": "root", "path": "/etc/slurm", "size": 4096, "src": "/etc/slurm-llnl", "state": "directory", "uid": 0}
fatal: [c1]: FAILED! => {"changed": false, "gid": 0, "group": "root", "mode": "0755", "msg": "src file does not exist, use \"force=yes\" if you really want to create the link: /etc/slurm-llnl", "owner": "root", "path": "/etc/slurm", "size": 4096, "src": "/etc/slurm-llnl", "state": "directory", "uid": 0}

@griffji
Author

griffji commented Oct 22, 2021

FYI... Running on Amazon Linux

[ec2-user@head ~]$ uname -a
Linux head 4.14.248-189.473.amzn2.x86_64 #1 SMP Mon Sep 27 05:52:26 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
[ec2-user@head ~]$ sinfo -V
slurm 20.11.8
[ec2-user@head ~]$ sinfo -lNe
Fri Oct 22 13:04:50 2021
NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
c1 1 compute* idle 2 1:2:1 3942 0 1 (null) none
c2 1 compute* idle 2 1:2:1 3942 0 1 (null) none
[ec2-user@head ~]$

@pescobar
Member

If I understand correctly, the role works OK on Amazon Linux, right? Amazon Linux is based on RHEL (like CentOS), so it should work, but I haven't tested it myself.

I don't understand why you get an error in the task "Make a symlink /etc/slurm >> /etc/slurm-llnl on Debian based systems". The folder /etc/slurm-llnl is created when slurmdbd, slurmctld or slurm-client is installed on Ubuntu, and those packages are installed by the role. Can you verify whether the role installs any of those packages before you get the error? To debug it further I would need the complete output of the Ansible execution.

The only explanation I can think of is that your apt cache is out of date, so the packages are not installed and the folder /etc/slurm-llnl is never created.

@pescobar
Member

Can you check whether you still get an error in the task "Make a symlink /etc/slurm >> /etc/slurm-llnl on Debian based systems" when deploying on Ubuntu with the latest version in the master branch?

I have added a task to make sure that the apt cache is always updated.
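
As an illustration only, a task that refreshes the apt cache before any packages are installed might look like the sketch below. The task name and the `when:` condition are assumptions and may differ from what was actually added to the role.

```yaml
# Hypothetical sketch, not the role's actual task.
- name: Update the apt cache on Debian based systems
  ansible.builtin.apt:
    update_cache: true
  # Assumption: restrict the task to Debian/Ubuntu hosts.
  when: ansible_os_family == "Debian"
```

Placed ahead of the package installation tasks, this is roughly equivalent to running `apt-get update` on each host first.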

@griffji
Author

griffji commented Oct 22, 2021

Hello,
Regarding Amazon Linux, the following packages needed to be installed: python-devel and mysql-devel. In addition, there's no Ansible module for Amazon's "amazon-linux-extras install epel -y", so the command module is required to execute it for now. Also, the ec2-user account isn't being added to the slurm account to execute commands either. I'm more than happy to assist with this.

I will redeploy a few Ubuntu servers and get back to you. Also, on Ubuntu 20.04 LTS the Slurm version is 19.05, I believe, and configless deployment wasn't available until Slurm 20+.
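
A rough sketch of how those Amazon Linux steps could be expressed as Ansible tasks is shown below. It is not part of the role. The task names, the `when:` conditions and the `creates:` path (assumed to be the repo file that the EPEL install leaves behind) are all assumptions; only the package names and the `amazon-linux-extras` command come from the description above.

```yaml
# Hypothetical sketch, not part of the role.
- name: Enable EPEL on Amazon Linux via amazon-linux-extras
  ansible.builtin.command: amazon-linux-extras install epel -y
  args:
    # Assumption: this file exists once EPEL is enabled, which lets
    # Ansible skip the command on later runs.
    creates: /etc/yum.repos.d/epel.repo
  when: ansible_distribution == "Amazon"

- name: Install the extra packages needed on Amazon Linux
  ansible.builtin.yum:
    name:
      - python-devel
      - mysql-devel
    state: present
  when: ansible_distribution == "Amazon"
```

The `creates:` argument is what keeps the command task idempotent. Without it the task would report "changed" on every run.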

Much appreciated

-JG-

@griffji
Author

griffji commented Oct 22, 2021

Hello,

I had pulled down the wrong repo. That worked!

ubuntu@head:~$ sinfo -V
slurm-wlm 19.05.5
ubuntu@head:~$ sinfo -lNe
Fri Oct 22 19:24:11 2021
NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
c1 1 compute* unknown* 2 1:2:1 3928 0 1 (null) none
c2 1 compute* unknown* 2 1:2:1 3928 0 1 (null) none
ubuntu@head:~$

@pescobar
Member

I have published a new version, 0.0.8, which fixes the Ubuntu problems initially reported in this issue, so I am closing it as solved.

https://github.com/scicore-unibas-ch/ansible-role-slurm/releases/tag/0.0.8
https://galaxy.ansible.com/scicore/slurm
