As mentioned in #3906, we need a method to control how and when job exceptions are raised for jobs running on brokers that go offline in the system instance.
In a normal instance, the execution system will monitor for brokers going offline and will immediately raise a job exception for any jobs with job shells on offline ranks. However, this will not work for the system instance, where we want to support brokers restarting without losing running jobs (#3801), so we'll need a way to control this behavior.
Of course, we'll still want to kill jobs on crashed nodes in a timely manner, so we'll need a strategy for distinguishing a crashed node from a broker that is only restarting.
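To make the tradeoff concrete, here is a minimal sketch of one possible strategy: a configurable grace period before a job exception is raised for an offline rank. This is only an illustration, not a proposed implementation. The names `OfflinePolicy`, `grace_period`, and `jobs_to_fail` are hypothetical and do not correspond to any existing API. Setting the grace period to 0 reproduces the current "raise immediately" behavior of a normal instance, while a larger value gives a restarting broker time to come back before its jobs are failed.

```python
from dataclasses import dataclass, field


@dataclass
class OfflinePolicy:
    """Hypothetical policy that delays job exceptions for offline ranks."""

    # Seconds to wait for a broker to return; 0 means raise immediately.
    grace_period: float
    # Maps rank -> time at which the rank was first seen offline.
    offline_since: dict = field(default_factory=dict)

    def rank_offline(self, rank, now):
        # Keep the earliest offline time if the rank is reported twice.
        self.offline_since.setdefault(rank, now)

    def rank_online(self, rank):
        # The broker came back (e.g. it restarted), so stop tracking it.
        self.offline_since.pop(rank, None)

    def expired_ranks(self, now):
        # Ranks that have been offline for at least the grace period.
        return {
            rank
            for rank, since in self.offline_since.items()
            if now - since >= self.grace_period
        }


def jobs_to_fail(policy, job_ranks, now):
    """Return sorted ids of jobs with a shell on a rank past its grace period.

    job_ranks maps a job id to the ranks where that job has job shells.
    """
    expired = policy.expired_ranks(now)
    return sorted(job for job, ranks in job_ranks.items() if expired & set(ranks))
```

With a 30 second grace period, a broker that goes offline and returns within that window never triggers an exception, while a rank that stays offline past the window does. A crashed node would still lose its jobs, only later than it does today, so the grace period bounds how "timely" the cleanup can be.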