“An error has occurred when calling silent_system2:”
#29
Comments
This seems to be an issue with Slurm's configuration. Could you try the following?
library(slurmR)
Slurm_lapply(1:10, function(x) runif(10), njobs = 4)
That is the bare minimum. Creating a cluster object may be more complicated. |
I'm also seeing this issue. It appears with the bare minimum example you just posted. When I set
I just set up slurm on my laptop for testing, so it certainly could be a problem with my configuration. But given that it all ran and the answers are right there as expected, it seems like Edit: I'm using |
Thanks, @ekernf01, I'll try to reproduce your error using Docker. I'm not sure what could be causing it. In the case of @thistleknot, I believe this is an issue with the setup of his cluster. I currently don't have access to a cluster that allows using ssh between nodes (which is what |
If it's helpful in setting up the container, I used this guide to set up my slurm. https://blog.llandsmeer.com/tech/2020/03/02/slurm-single-instance.html |
Can't stop thinking of futurama. |
I am experiencing the same issue. I built an Odroid XU4-based cluster (an XU4 front-end and 12 MC1s as the nodes). When I submit: job <- Slurm_EvalQ(slurmR::WhoAmI(), njobs = 20, plan = "submit") it says the job was submitted. Looking at slurmctld.log, I can see that jobs were submitted to the 12 nodes, and that the remaining 8 jobs were assigned as the first ones finished and subsequently completed. But when I enter "job" or "res <- Slurm_collect(job)", I get: Slurm accounting storage is disabled The same issue occurs with the minimal Slurm_lapply example above. Any suggestions will be greatly appreciated! The system is connected to an NFS server, but I am running R on the front-end, not on the server. |
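A minimal shell sketch for checking whether accounting is what the front-end is complaining about, assuming `sacct` is on the PATH there. The function name `slurm_accounting_status` is hypothetical and not part of Slurm or slurmR; the only behaviour it relies on is the "Slurm accounting storage is disabled" message quoted above.

```shell
# Hypothetical helper (not part of Slurm or slurmR): report whether
# `sacct` can answer at all. Prints one of: missing, disabled, ok, error.
slurm_accounting_status() {
  if ! command -v sacct >/dev/null 2>&1; then
    echo "missing"
    return 0
  fi
  # Capture stdout and stderr together, since the "disabled" message
  # quoted in this thread could arrive on either stream.
  if out=$(sacct 2>&1); then
    rc=0
  else
    rc=$?
  fi
  case "$out" in
    *"accounting storage is disabled"*) echo "disabled" ;;
    *)
      if [ "$rc" -eq 0 ]; then echo "ok"; else echo "error"; fi
      ;;
  esac
}

slurm_accounting_status
```

If this prints `disabled`, `scontrol show config | grep AccountingStorageType` should show how accounting is configured on the cluster. Whether that alone explains the `silent_system2` error is an assumption, not something this check proves.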
@edisto69 @ekernf01 @thistleknot I believe you may have found a bug. Your systems may still have an issue or two with the Slurm config (which I will check ASAP to see how to give it the right treatment), yet I would appreciate it if you could install this version instead, re-run your code, and report back whatever you see. To install this version, you can either use git:
git clone -b issue029 https://github.com/USCbiostats/slurmR.git
R CMD INSTALL slurmR
Or download the zip, unzip it, and then install, e.g.,
wget https://github.com/USCbiostats/slurmR/archive/refs/heads/issue029.zip
unzip issue029.zip
R CMD INSTALL slurmR-issue029
I appreciate your help! |
Thanks for following up! Now, when I run: library(slurmR) I get the more specific error message: Error: An error has occurred when calling |
I found one reason I was having an issue (by looking at slurmd.log). I have a single user on all nodes and on the front end, but they don't have a shared home folder...I'm trying to figure out how/if I can have the users share a home folder on the NFS share. |
Thank you @edisto69, I just pushed an update. Could you try to install it again? Thanks |
I am probably messing you up by changing things...I'm still working on getting R installed on the NFS server so all the nodes have access, but I have tried it a few times after a new R installation using: install.packages("devtools") And I get the generic error: Error: An error has occurred when calling 'silent_system2': I hope to have things configured by the end of the week, and I'll try it again. |
Sorry for spamming the thread...I am pretty sure that my configuration is good now. I just ran the rslurm::slurm_apply example, and got back the expected results. Running the minimal example that you gave above, I still get: Error: An error has occurred when calling 'silent_system2': But the slurmr-job directory now has no errors in the '02-output-' files, and has '03-answer-' files and 'X_0001.rds' to 'X_0004.rds' (now we have Futurama and the X-files...). |
Hey @edisto69, thanks for trying that. The issue is that you got the bugged version, not the patched one. You can either install the updated version like this:
wget https://github.com/USCbiostats/slurmR/archive/refs/heads/issue029.zip
unzip issue029.zip
R CMD INSTALL slurmR-issue029
Or, if you want to use devtools, like this:
devtools::install_github("USCBiostats/slurmR", ref = "issue029")
I'll now try to replicate the issue using docker. |
Well...it is different. I ran: library(slurmR) It now says that it cannot create the slurmr job file in the user's home directory (which is an NFS mount) because permission is denied, but I can access the directory from the terminal, and rslurm::slurm_map() has no issues setting up the job directory. For my slurm_map() scripts I have been using /home/user/work as my wd, where 'user' is a link to the NFS mount home directory, and 'work' is a link in that directory to a different NFS mount folder. Setting the same wd for the above script resulted in the same error. |
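Being able to `cd` into a directory from the terminal does not show that files can be created there, so a quick write probe may help narrow this down. The sketch below is only an illustration: `can_write_dir` is a hypothetical helper, and `/home/user/work` is the working directory mentioned above.

```shell
# Hypothetical helper: can the current user create a file in DIR?
# Prints "writable" or "not-writable"; a missing DIR counts as not-writable.
can_write_dir() {
  dir=$1
  if [ ! -d "$dir" ]; then
    echo "not-writable"
    return 0
  fi
  # Actually create a file rather than trusting the permission bits,
  # which can be misleading on an NFS mount.
  probe="$dir/.write-probe-$$"
  if ( : > "$probe" ) 2>/dev/null; then
    rm -f "$probe"
    echo "writable"
  else
    echo "not-writable"
  fi
}

can_write_dir /home/user/work
```

Running the same probe on the front-end and on each compute node (for instance from a small script launched with `srun`) would show whether the nodes see the mount with the same permissions as the front-end.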
Thank you very much, @edisto69, I really appreciate all the time you are giving me! I think it would be great if we could talk more at length to see what's going on. Would you be willing to have a conference call to talk about this? If so, feel free to email me at [email protected]. Regarding the docker image, @ekernf01, I was able to build one using an existing image with Slurm. It is available at https://hub.docker.com/repository/docker/uscbiostats/slurmr-dev, and the instructions (partial, though) are here. |
Has this problem been resolved? I just started getting this message occasionally (that is, not consistently) when submitting the same job:
Does the latest development version of slurmR fix this? Follow up: I got the development version of slurmR installed on the HPC, but I am still getting the same error ... any ideas? |
Unfortunately, I have the same problem as @kgoldfeld: `Submitted batch job 892035
Submitting job...Warning: The call to -sacct- failed. This is probably due to not having slurm accounting up and running. For more information, checkout this discussion: https://github.com/USCbiostats/slurmR/issues/29
Error in UseMethod("get_tmp_path") :
no applicable method for 'get_tmp_path' applied to an object of class "c('integer', 'numeric')"
Calls: Slurm_sapply ... wait_slurm.integer -> status -> status.default -> sacct_ -> get_tmp_path
In addition: Warning messages:
1: In normalizePath(file.path(tmp_path, job_name)) :
path[1]="/home/jobst/test/slurmr-job-9c9aa2b50d464": No such file or directory
2: `X` is not a list. The function will coerce it into one using `as.list`
Execution halted` Is there already a solution? That would be great! |
Hey @jobstdavid and @kgoldfeld (and others!), I just pushed what I think is a fix to the master branch. I'd appreciate you installing the package and giving it a try. |
@gvegayon - I installed the package on our HPC, and did some quick tests. It seems like things are working again - though I will keep you posted in case the errors reappear. Thanks so much for the fix. |
https://stackoverflow.com/questions/65402764/slurmr-trying-to-run-an-example-job-an-error-has-occurred-when-calling-silent
I set up a Slurm cluster and I can run srun -N4 hostname just fine.
I keep seeing "silent_system2" errors. I've installed slurmR using devtools::install_github("USCbiostats/slurmR")
I'm following the second example 3: https://github.com/USCbiostats/slurmR
here are my files
cat slurmR.R
cat rscript.slurm
cat slurmR.out