job-exec: need ability to control exceptions for jobs on disconnected brokers #3965

grondo · 2021-11-16T15:40:43Z

As mentioned in #3906, we need a method to control how and when job exceptions are raised for jobs running on brokers that go offline in the system instance.

In a normal instance, the execution system monitors for brokers going offline and immediately raises a job exception for any job with job shells on offline ranks. However, this will not work for the system instance, where we want to support brokers restarting without losing running jobs (#3801), so we'll need a way to control this behavior.

Of course, we'll still want to kill jobs on crashed nodes in a timely manner, so we'll need a strategy to distinguish a crashed node from a broker that is only restarting.
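
One possible strategy, sketched here only to make the discussion concrete, is a configurable grace period: a rank going offline starts a timer, a reconnect cancels it, and a job exception is raised only once the timer expires. This is not existing job-exec code or a Flux API; `OfflinePolicy`, `grace_period`, and `jobs_to_fail` are hypothetical names, and a grace period of 0 is assumed to reproduce the current "raise immediately" behavior of a normal instance.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class OfflinePolicy:
    """Hypothetical policy: delay job exceptions for a grace period."""

    grace_period: float  # seconds to wait for a reconnect; 0 means immediate
    offline_since: Dict[int, float] = field(default_factory=dict)

    def broker_offline(self, rank: int, now: float) -> None:
        # Keep the first time the rank was seen offline.
        self.offline_since.setdefault(rank, now)

    def broker_online(self, rank: int) -> None:
        # The broker reconnected; drop its pending deadline.
        self.offline_since.pop(rank, None)

    def expired_ranks(self, now: float) -> Set[int]:
        # Ranks that have been offline for at least the grace period.
        return {
            rank
            for rank, since in self.offline_since.items()
            if now - since >= self.grace_period
        }


def jobs_to_fail(
    policy: OfflinePolicy, job_ranks: Dict[int, Set[int]], now: float
) -> List[int]:
    """Return ids of jobs with a shell on a rank whose grace period expired."""
    expired = policy.expired_ranks(now)
    return sorted(jobid for jobid, ranks in job_ranks.items() if ranks & expired)
```

With a 30 second grace period, a job on a rank that goes offline at t=0 is not selected at t=10, is selected at t=30, and is spared if the broker reconnects first. A timer alone cannot tell a slow restart from a crash, so a real solution would likely need an additional signal, such as whether the job shells on that rank are still alive.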
