Scheduler: Allow terminating job if submission script is invalid #5849

Merged

Conversation

sphuber
Contributor

@sphuber sphuber commented Dec 20, 2022

Currently, when a `CalcJob` is launched, the job is submitted to the
scheduler by calling `Scheduler.submit_from_script`. This call is
expected either to succeed and return the job id, or to raise an
exception if the submission failed. If an exception is raised, the
exponential backoff retry mechanism kicks in and retries the submission
until the maximum number of retries is reached, at which point the job
is paused.
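
As a rough illustration of the retry behavior described above, here is a minimal sketch of exponential backoff. It is not the engine's actual implementation: `submit_with_backoff`, its parameters, and the default values are hypothetical names chosen for this example.

```python
import time


def submit_with_backoff(submit, max_attempts=5, initial_interval=1.0, sleep=time.sleep):
    """Call ``submit()`` until it succeeds or the attempts run out.

    Hypothetical helper: the wait time doubles after every failed attempt.
    """
    interval = initial_interval
    for attempt in range(1, max_attempts + 1):
        try:
            return submit()
        except Exception:
            if attempt == max_attempts:
                # Retries exhausted: in the behavior described above,
                # this is the point at which the job would be paused.
                raise
            sleep(interval)
            interval *= 2
```

Note that every exception is treated as potentially transient here, which is exactly why a permanently invalid submission script ends up going through all retries.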

This behavior is useful if the cause of the exception is transient, for
example the scheduler being overloaded or the connection to the remote
computer failing. But for problems that can never succeed on retry,
this just leaves the job stuck once it gets paused.
There should be a way for scheduler plugins to detect these terminal
problems, such as an invalid submission script, and communicate that the
calculation job should simply be terminated.

The signature of the `Scheduler._parse_submit_output` abstract method is
updated to allow returning an `ExitCode` instead of the job id. If an
exit code is returned, the engine will immediately terminate the
calculation job and assign the exit code to the node. This now allows
scheduler plugins to parse specific problems known to that particular
scheduler that are guaranteed to be unrecoverable and prevent the jobs
from unnecessarily going through the exponential backoff mechanism and
eventually getting stuck.
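
A minimal sketch of what a plugin-side parser could look like under this change. To stay self-contained it defines a stand-in `ExitCode` rather than importing the real one, and it is written as a free function instead of a `Scheduler` method; the status value `100`, the matched error text, and the assumed `Submitted batch job <id>` output format are illustrative assumptions, not taken from this PR.

```python
from typing import NamedTuple, Union


class ExitCode(NamedTuple):
    """Stand-in for the engine's exit code type (assumption for this sketch)."""

    status: int
    message: str


def parse_submit_output(retval: int, stdout: str, stderr: str) -> Union[str, ExitCode]:
    """Return the job id on success, or an ExitCode for a terminal failure."""
    if retval != 0:
        # Hypothetical unrecoverable error: retrying can never succeed,
        # so signal termination instead of raising.
        if "invalid account" in stderr.lower():
            return ExitCode(100, "submission script is invalid: unknown account")
        # Anything else may be transient: raise so the retry mechanism applies.
        raise RuntimeError(f"submission failed with retval {retval}: {stderr.strip()}")
    # Assumed output format: "Submitted batch job <id>".
    return stdout.strip().split()[-1]
```

The design point is the three-way split: a job id for success, an exception for failures worth retrying, and an exit code for failures that should end the job immediately.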
@sphuber sphuber force-pushed the feature/2955/scheduler-submit-output-parsing branch from c7fdf0f to 5451809 Compare December 20, 2022 15:35
@sphuber sphuber requested a review from ltalirz December 20, 2022 15:36
@sphuber
Contributor Author

sphuber commented Dec 20, 2022

@ltalirz this is a lead-up to #5850, which will address a long-standing feature request: detect a SLURM submission script that is invalid because of an invalid account setting, and abort the job.

@ltalirz
Member

ltalirz commented Dec 20, 2022

Fantastic, let me have a look

Member

@ltalirz ltalirz left a comment

just a quick comment/question before I go line-by-line

aiida/engine/daemon/execmanager.py (resolved)
Member

@ltalirz ltalirz left a comment

thanks @sphuber , the rest looks good to me

@sphuber sphuber merged commit 9309678 into aiidateam:main Dec 20, 2022
@sphuber sphuber deleted the feature/2955/scheduler-submit-output-parsing branch December 20, 2022 16:25