Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

slurm_job_batch_script can't return string #250

Closed
Autumn-Roy opened this issue Dec 20, 2022 · 12 comments · Fixed by #258
Closed

slurm_job_batch_script can't return string #250

Autumn-Roy opened this issue Dec 20, 2022 · 12 comments · Fixed by #258

Comments

@Autumn-Roy
Copy link

Details

  • Slurm Version:
  • 21.08.4
  • Python Version:
  • Python 3.6.8
  • Cython Version:
  • Cython (0.29.30)
  • PySlurm Branch:
  • pyslurm (21.8.4.0)
  • Linux Distribution:
  • Linux version 3.10.0-1062.el7.x86_64

Issue

When I try to get the batch_script by job().slurm_job_batch_script(jobid), it only returns 0 or -1, and print the scripts one the screen. But in the doc, a str will return here.
image
So, any problem on my usage? Or, can I get the script by other way?
image

@tazend
Copy link
Member

tazend commented Dec 22, 2022

Hi @Autumn-Roy

in the current implementation, indeed a slurm function is used that simply outputs the script to stdout and just returns the error-code of that function (0 or -1). Of course, the script should actually just be returned as a string - which isn't possible right now.

Whenever I find the time, I can go ahead and implement it though.

@tazend
Copy link
Member

tazend commented Jan 19, 2023

Hi @Autumn-Roy

I implemented the functionality to get the script as a string.
Could you test it out on your side with this branch?
(The branch is for slurm 21.08 - in case your slurm version upgraded to 22.05 in the meantime, you can take this branch instead)

@Autumn-Roy
Copy link
Author

Hi, @tazend
Thanks for upgrading the pyslurm. I will test the branch on my side. When I finish, the result will reply to you. Thank you again.

@Autumn-Roy
Copy link
Author

@tazend
I tried to get the script as a string from new branch. But it occurs an error:
Traceback (most recent call last): File "pyslurm_test.py", line 51, in <module> a = pyslurm.job().slurm_job_batch_script(job_id) File "pyslurm/pyslurm.pyx", line 2353, in pyslurm.pyslurm.job.slurm_job_batch_script ValueError: Error contacting slurmctld while fetching batch script.

@tazend
Copy link
Member

tazend commented Jan 30, 2023

Hi @Autumn-Roy

mh, that is weird - I can unfortunately not replicate this error on my side.
Seems like a communication error with the slurmctld. Do you perhaps have access to the logs of the slurmctld and can check whether it reports any errors?
Does it work if you execute this?:

scontrol write batch_script <jobid> -

I have pushed a change to this branch, so you should now see the exact error message and error code after rebuilding pyslurm with this.

@Autumn-Roy
Copy link
Author

Hi @tazend

I can‘t get the logs of slurmctld. Then I try to use scontrol command, and it works. The detail error code was pasted here:
image

@tazend
Copy link
Member

tazend commented Jan 31, 2023

Hi @Autumn-Roy

I tested it again on another cluster - but wasn't able to replicate the error you get.
I can only suspect that there might be something broken in your pyslurm installation, since the code pyslurm and scontrol use to fetch the batch-script is basically the same. Do other pyslurm functions work? For example, does the following work?

import pyslurm
j = pyslurm.job().find_id(job_id)

If not yet tried, could you create a seperate python virtual environment to install pyslurm? Like this:

python3 -m venv pyslurm-env
source pyslurm-env/bin/activate

cd pyslurm-src
pip install -r test_requirements.yml
scripts/build.sh

... fetch script again ...

When it builds, it shows you the detected slurm version, and most importantly the directory to the slurm shared lib that pyslurm links to (for example /usr/lib64).
In contrast, which lib does the scontrol command point to? You can check by doing: ldd $(which scontrol).
(scontrol links to libslurmfull.so instead of libslurm.so like pyslurm does, so this difference is correct)

If pyslurm finds the lib in /usr/lib64 for example and scontrol shows /usr/lib64/slurm, then everything should be fine.

@Autumn-Roy
Copy link
Author

Hi @tazend

I tried the way you metioned. But the same error occured.
image
image

Temporarily, I use scontrol write batch_script <jobid> - in my script.

@tazend
Copy link
Member

tazend commented Feb 1, 2023

@Autumn-Roy

do other functions like:

import pyslurm
j = pyslurm.job().find_id(job_id)

work though?
Another thing: Is the slurm cluster you are using perhaps part of a federation? (https://slurm.schedmd.com/federation.html)
You should be able to check with scontrol show federation / sacctmgr show federation, and it may show multiple slurmctld addresses if in a federation.
If not in a federation, is there perhaps a multi-cluster setup? You can check with sacctmgr show clusters

@Autumn-Roy
Copy link
Author

@tazend

I check the slurm cluster, it's not in a federation and not a multi-cluster.
image

also, the other functions about pyslurm is okay:
image

Because you guess there maybe something broken in pyslurm installtion, so I repeat the installation and show below:

> git clone -b feature/retrieve-batch-script-21.08 https://github.com/tazend/pyslurm.git
> cd pyslurm/
> pip3 install .

I tried pip3 install -r test_requirements.txt. But the error Could not find a version that satisfies the requirement pytest==7.1.2 was occurred. So I use pip3 install . instead.

@tazend
Copy link
Member

tazend commented Feb 2, 2023

Hi @Autumn-Roy

I was now able to track down the cause of this error.
The problem is, that this code will only work with slurm versions starting from 21.08.8

The reason for this is that in 21.08.8 a critical CVE was fixed regarding credential abuse by a potential malicious actor.
Therefore, the slurm devs made some heavy structural changes to the code regarding authentication/communication with client and server.
The error stems from the fact that the code in pyslurm for fetching the script uses the updated struct definitions - thus any slurmctld below 21.08.8 rejects communication.

I have created this branch for now which reverts to the struct definitions pre-CVE release and it should work now for you.

For the main 21.08 branch however the original code which only works on releases post-cve will be backported. Ideally, your cluster administrators should update to the latest patch for 21.08 (to also fix the potential security issues)
Or you may need to manually comment out the offending struct members before building 21.08 after the code is merged (you can check here to see the two which I commented out.)

@Autumn-Roy
Copy link
Author

Hi @tazend
Thank you very much! The new branch works now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants