Controller hangs perpetually if a Slurm worker is killed via OOM #19
Comments
In your situation, I think the code appears to hang because […]. In the general case, to monitor workers, […]
We can do a bit better in this case. Without the change, the task is perpetually in a state of non-completion. So when it crashes, […]. With […]
Yep, this is definitely what is happening, and I can see a new job submitted every 20 seconds. If I set […]. However, the error file gets truncated whenever a new worker is submitted, so a user has to actively watch the file as I have done, because […]. Also, it would be really helpful if there were a user-friendly way to watch this log file from the R terminal, or perhaps a general controller log mechanism that the Slurm controller implements by reading the Slurm error log. As it stands, it's a multi-step process that requires a bit of Linux know-how to pull off, whereas I'm hoping for something that an R-only user can handle.
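A rough sketch of the kind of helper described here, assuming nothing beyond base R (the log path and polling interval are placeholders, and this is not part of the crew.cluster API): poll the Slurm error log from the R session so an R-only user never has to leave R.

```r
# Minimal sketch: tail a Slurm worker's error log from an R session.
# The path is hypothetical; it should match the log location the
# controller was configured with.
watch_slurm_log <- function(path, interval = 5) {
  n_seen <- 0L
  repeat {
    if (file.exists(path)) {
      lines <- readLines(path, warn = FALSE)
      if (length(lines) < n_seen) n_seen <- 0L  # log was truncated by a new worker
      if (length(lines) > n_seen) {
        cat(lines[(n_seen + 1L):length(lines)], sep = "\n")
        n_seen <- length(lines)
      }
    }
    Sys.sleep(interval)  # poll every few seconds
  }
}

# Hypothetical usage:
# watch_slurm_log("logs/crew_worker.err")
```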
Thanks @shikokuchuo. Will Targets benefit from this change automatically, or will it have to check for this type of error first and then explicitly terminate the worker? In terms of user friendliness, it would be nice to have an option on the […]
Very encouraging, thanks for implementing that.
Unfortunately this does not seem feasible, since the […]
I recommend creating a new log file for each new worker. In SGE, if you set the log to be a directory (with a trailing slash) then different workers have different files. In SLURM, I believe there is a way to do this with text patterns.
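A hedged sketch of what that could look like with crew.cluster: the argument names below are assumptions based on older releases of the package (check its current documentation), but the "%j" filename pattern is expanded by Slurm itself to the numeric job ID, so each worker writes to its own pair of files.

```r
# Sketch only: slurm_log_output / slurm_log_error are assumed argument names;
# consult the crew.cluster documentation for the current interface.
# Slurm expands "%j" to the job ID, giving one log file per worker.
library(crew.cluster)
controller <- crew_controller_slurm(
  name = "my_pipeline",
  slurm_log_output = "logs/crew_worker_%j.out",  # per-job standard output
  slurm_log_error  = "logs/crew_worker_%j.err"   # per-job standard error
)
```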
This is very dependent on the platform and gets outside of the things that […]
If there is a path forward, it should automatically carry forward to any […]
It might be possible to do this at the level of a controller as a whole, if there is a path forward.
Hmm. This would be of reasonable value to my workplace (where we hope to use Targets + Slurm a decent amount). If I could find the time, would you be interested in help with such a feature? I think ideally it would be a general […]
I have mixed feelings about a built-in log reader, but we can talk more if you find a robust way to read them no matter what the user sets for the log path arguments (or if necessary, a robust way to check if logs can be read). The method would fit best in the SLURM launcher, which is what varies from plug-in to plug-in. The controller that contains it is generic.
Prework
The bug arises from the crew.cluster package itself and not a user error, known limitation, or issue from another package that crew.cluster depends on.
Description
When you submit a task to the Slurm controller that uses more memory than Slurm allocated, the Slurm job itself fails with an out-of-memory (OOM) error, but the controller just hangs forever instead of reporting this fatal error.
Reproducible example
The following will run forever:
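The original example is not reproduced here; the snippet below is a stand-in sketch of the scenario rather than the reporter's reprex. It assumes the crew_controller_slurm() interface, and the memory argument name is an assumption.

```r
# Stand-in sketch of the scenario, not the original reprex.
library(crew.cluster)
controller <- crew_controller_slurm(
  name = "oom_demo",
  slurm_memory_gigabytes_per_cpu = 1  # assumed argument name: deliberately small allocation
)
controller$start()
# The task tries to allocate far more memory than the Slurm job was granted,
# so the worker process is killed by the OOM killer mid-task.
controller$push(
  name = "oom_task",
  command = matrix(0, nrow = 1e5, ncol = 1e5)  # roughly 80 GB of doubles
)
controller$wait()  # hangs: the controller never learns the worker was killed
```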
Expected result
In the above case, I expect the worker to be submitted and the command to be attempted, as currently happens. However, at the point where the worker is killed, I would expect crew (or mirai?) to detect this and report it as a pipeline failure.
Diagnostic information
Session Info: