`Scheduler`: Allow terminating job if submission script is invalid #5849
Merged: sphuber merged 2 commits into aiidateam:main from sphuber:feature/2955/scheduler-submit-output-parsing on Dec 20, 2022
Currently, when a `CalcJob` is launched, the job is submitted to the scheduler by calling `Scheduler.submit_from_script`. This call is expected either to succeed and return the job id, or to raise an exception if the submission failed. If an exception is raised, the exponential backoff retry mechanism kicks in and retries the submission until the maximum number of retries is reached, at which point the job is paused.

This behavior is useful if the cause of the exception is transient, for example the scheduler being overloaded or the connection to the remote computer failing. But for problems that will always fail, it just causes the job to get stuck once it is paused. There should be a way for scheduler plugins to detect these terminal problems, such as an invalid submission script, and communicate that the calculation job should simply be terminated.

The signature of the `Scheduler._parse_submit_output` abstract method is updated to allow returning an `ExitCode` instead of the job id. If an exit code is returned, the engine immediately terminates the calculation job and assigns the exit code to the node. This allows scheduler plugins to detect specific problems known to that particular scheduler that are guaranteed to be unrecoverable, and prevents the jobs from unnecessarily going through the exponential backoff mechanism and eventually getting stuck.
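To illustrate the intended control flow, here is a minimal, self-contained sketch. It does not use the actual AiiDA classes: `ExitCode`, `SchedulerError`, `ERROR_INVALID_SUBMIT_SCRIPT`, `parse_submit_output` and `submit_with_backoff` are hypothetical stand-ins, the status number is illustrative, and the matched strings are assumptions modeled on Slurm-style `sbatch` output. The parser returns a job id on success, returns an exit code for a problem assumed to be unrecoverable, and raises for anything else so that the retry loop can try again.

```python
import re
import time
from typing import Callable, NamedTuple, Optional, Tuple, Union


class ExitCode(NamedTuple):
    """Hypothetical stand-in for the engine's exit code type."""
    status: int
    message: str


class SchedulerError(Exception):
    """Hypothetical stand-in for a (possibly transient) submission failure."""


# Illustrative exit code; the number and message are assumptions.
ERROR_INVALID_SUBMIT_SCRIPT = ExitCode(131, 'the scheduler rejected the submission script')


def parse_submit_output(retval: int, stdout: str, stderr: str) -> Union[str, ExitCode]:
    """Return the job id, or an ``ExitCode`` if the failure is unrecoverable."""
    # Assumed example of a terminal problem: retrying cannot fix it.
    if 'invalid account' in stderr.lower():
        return ERROR_INVALID_SUBMIT_SCRIPT

    # Any other failure may be transient, so raise and let the caller retry.
    if retval != 0:
        raise SchedulerError(f'submission failed with return value {retval}: {stderr}')

    match = re.search(r'Submitted batch job (\d+)', stdout)
    if match is None:
        raise SchedulerError(f'could not parse a job id from: {stdout!r}')
    return match.group(1)


def submit_with_backoff(
    submit: Callable[[], Tuple[int, str, str]],
    max_attempts: int = 5,
    initial_interval: float = 1.0,
) -> Optional[Union[str, ExitCode]]:
    """Retry ``submit`` with a doubling interval; stop at once on a job id or exit code."""
    interval = initial_interval
    for attempt in range(max_attempts):
        try:
            # A returned ExitCode is not an exception, so it bypasses the retries.
            return parse_submit_output(*submit())
        except SchedulerError:
            if attempt < max_attempts - 1:
                time.sleep(interval)
                interval *= 2
    # Retries exhausted: this is where the job would be paused.
    return None
```

With this shape, a submission that fails with the terminal error is attempted only once and yields the exit code, whereas a failure that raises goes through all retries before the job is paused.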
sphuber force-pushed the feature/2955/scheduler-submit-output-parsing branch from c7fdf0f to 5451809 (December 20, 2022 15:35)
Fantastic, let me have a look
ltalirz reviewed Dec 20, 2022
just a quick comment/question before I go line-by-line
ltalirz approved these changes Dec 20, 2022
thanks @sphuber, the rest looks good to me