[BUG]: ClusterManager not working on PBS #419
Comments
Thanks! This looks like it might be an issue in ClusterManagers.jl: JuliaParallel/ClusterManagers.jl#179. What is your |
pbs_version = 20.0.1 |
Okay this might take a bit longer to solve. It turns out to be really hard to set up a local version of PBS for testing things. But I'm working on it! |
Basically what we need to do is modify these lines to fix ClusterManagers.jl:

    qsub_cmd = pipeline(`echo $(Base.shell_escape(cmd))`, (isPBS ?
        `qsub -N $jobname -wd $wd -j oe -k o -t 1-$np $queue` :
        `qsub -N $jobname -wd $wd -terse -j y -R y -t 1-$np -V $queue`))

It sounds like they haven't yet updated this. If you are proficient with qsub and know what flags need to be used here, you might be able to make a local modification of ClusterManagers.jl, and then switch PySR to that copy of ClusterManagers.jl with:

    cd ClusterManagers.jl
    julia --project=@pysr-0.16.3 -e 'using Pkg; pkg"dev ."'

This will get the PySR environment for 0.16.3 to use the local copy of ClusterManagers.jl. Then if you are able to update the |
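Before editing ClusterManagers.jl, it can help to print the command that would be submitted and compare each flag against `man qsub` on the cluster. The following is a minimal sketch: `build_qsub_cmd` is a hypothetical helper, not part of ClusterManagers.jl, and the flags are copied unchanged from the lines quoted above. It only builds the command and never runs it; which flags PBS 20 actually expects is not established here.

```julia
# Hypothetical helper (NOT part of ClusterManagers.jl): builds the qsub
# command as a Cmd object without executing it, so the flags can be inspected.
function build_qsub_cmd(jobname, wd, np; isPBS=true, queue=String[])
    if isPBS
        # Flags as quoted from ClusterManagers.jl for the PBS branch.
        `qsub -N $jobname -wd $wd -j oe -k o -t 1-$np $queue`
    else
        # Flags as quoted from ClusterManagers.jl for the non-PBS branch.
        `qsub -N $jobname -wd $wd -terse -j y -R y -t 1-$np -V $queue`
    end
end

# Example values (assumptions for illustration only).
cmd = build_qsub_cmd("julia-test", "/tmp", 4)

# Print the full command line that would be submitted.
println(cmd)

# The individual arguments are available in `cmd.exec`.
println(cmd.exec)
```

Once a flag is changed in the helper, the printed command can be tried by hand on a login node before making the same change in the local copy of ClusterManagers.jl.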
Thank you Miles for investigating this! I think I figured out the new PBS 20 flags and changed it accordingly. So I added these two lines to my submission shell script
but it doesn't look like it is picking up the local package. The julia version I am using is globally installed on the cluster. I can't recall, does the ClusterManagers.jl need to be in a specific folder? Do I need to set some path somewhere? |
Even if the Julia version is globally installed, you should have the environments appear in your local folder. If you open the file julia --project=@pysr-0.16.3 -e 'using Pkg; Pkg.develop(path="/path/to/clustermanagers.jl")' and give the full absolute path (to the location of your modified ClusterManagers.jl) there? |
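One way to check whether the environment is picking up the local package is to ask Julia where it would load ClusterManagers from. This is a small sketch using only `Pkg.status` and `Base.find_package`; it assumes Julia was started with the environment in question active, for example with `--project=@pysr-0.16.3`.

```julia
using Pkg

# Show the ClusterManagers entry of the active environment.
# A package added with `Pkg.develop` is listed together with its local path.
Pkg.status("ClusterManagers")

# Path of the source file that `using ClusterManagers` would load,
# or `nothing` if the package is not visible from this environment.
path = Base.find_package("ClusterManagers")

if path === nothing
    println("ClusterManagers not found in this environment")
else
    println("ClusterManagers would be loaded from: ", path)
end
```

If the printed path points into the package depot rather than the modified checkout, the `Pkg.develop` step did not apply to the environment that is active.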
Oh wait, sorry. I just realized you said in the original post that you are using PySR 0.14.1. So either (1) update to PySR 0.16.3 and go through the normal installation with |
okay so that part seems okay now, thanks! |
Hm, yeah the sysadmin might know best for that type of issue. How are you running things? You could also try running a parallel Julia command manually, just to see if it gives a more helpful error message. First, create an interactive job on the cluster that you can ssh into. SSH into it, start Julia, and run:

    import Distributed: pmap
    import ClusterManagers: addprocs_pbs

    num_workers = 10

    # Create the workers:
    procs = addprocs_pbs(num_workers)

    # Run a computation on each worker:
    pmap(worker_id -> worker_id^2, procs)

It should return a vector like |
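To separate scheduler problems from problems with Julia's distributed setup itself, the same computation can first be run with workers on the current machine only. This is a sketch that swaps `addprocs_pbs` for the standard `Distributed.addprocs`, so no qsub call is involved; the worker count of 2 is an arbitrary choice.

```julia
using Distributed

# Start two workers on the current machine (no scheduler involved).
procs = addprocs(2)

# Same computation as in the PBS example: square each worker id.
result = pmap(worker_id -> worker_id^2, procs)
println(result)

# Shut the local workers down again.
rmprocs(procs)
```

If this succeeds but the `addprocs_pbs` version fails, the problem is most likely in how the workers are submitted to or launched by PBS rather than in Julia's `Distributed` machinery.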
What happened?
When using the cluster manager on PBS, the code breaks. It seems to fail to start the workers because of incorrect qsub flags.
Version
0.14.1
Operating System
Linux
Package Manager
pip
Interface
Script (i.e., python my_script.py)
Relevant log output
Extra Info
Setting multithreading to False doesn't change anything.