job-exec: need ability to control exceptions for jobs on disconnected brokers #3965

Open
grondo opened this issue Nov 16, 2021 · 0 comments
grondo commented Nov 16, 2021

As mentioned in #3906, we need a method to control how and when job exceptions are raised for jobs running on brokers that go offline in the system instance.

In a normal instance, the execution system will monitor for brokers going offline and will immediately raise a job exception for any jobs with job shells on offline ranks. However, this will not work for the system instance, where we want to support brokers restarting without losing running jobs (#3801), so we'll need a way to control this behavior.

Of course, we'll still want to kill jobs on crashed nodes in a timely manner, so we'll need a strategy for distinguishing a crashed node from a broker that is merely restarting.
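
One possible shape for such a control, sketched in plain Python purely for discussion: a configurable grace period, where a value of zero keeps the current "raise immediately" behavior and a larger value gives a restarting broker time to come back before its jobs get an exception. This is not the job-exec implementation and uses no Flux APIs; `OfflinePolicy`, `OfflineTracker`, `grace_period`, and the method names are all hypothetical, and a timeout is only one of several strategies that could separate a crash from a restart.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class OfflinePolicy:
    """Hypothetical knob: seconds a rank may stay offline before its jobs fail."""
    grace_period: float = 0.0  # 0 means "raise immediately", as in a normal instance


@dataclass
class OfflineTracker:
    """Hypothetical tracker mapping offline ranks to the jobs that should get an exception."""
    policy: OfflinePolicy
    jobs_by_rank: Dict[int, Set[int]] = field(default_factory=dict)
    offline_since: Dict[int, float] = field(default_factory=dict)

    def rank_offline(self, rank: int, now: float) -> List[int]:
        # Record the first time this rank was seen offline, then check for expiry.
        self.offline_since.setdefault(rank, now)
        return self.expired(now)

    def rank_online(self, rank: int) -> None:
        # The broker came back within the grace period: forget the outage.
        self.offline_since.pop(rank, None)

    def expired(self, now: float) -> List[int]:
        # Return job ids on ranks offline for at least grace_period seconds.
        # Each rank is reported once, then dropped from tracking.
        doomed: Set[int] = set()
        for rank, since in list(self.offline_since.items()):
            if now - since >= self.policy.grace_period:
                doomed |= self.jobs_by_rank.get(rank, set())
                del self.offline_since[rank]
        return sorted(doomed)
```

In this sketch a system instance would set a nonzero `grace_period` and call `expired()` from a periodic timer, while a normal instance would leave it at zero so that `rank_offline()` returns the affected jobs right away.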
