What would you like to be added:
Use foreground deletion when deleting child jobs at jobset/pkg/controllers/jobset_controller.go (line 517 in 42fd647).
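At the API level, the request comes down to sending `propagationPolicy: Foreground` instead of `Background` in the delete options for the child job. The following is a minimal sketch of that request body using only the Go standard library. The `deleteOptions` type and the `foregroundDeleteBody` helper are hypothetical names for illustration, not code from the JobSet repository; the controller itself would presumably set the equivalent option through its Kubernetes client.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// deleteOptions is a local, illustrative mirror of the few Kubernetes
// meta/v1 DeleteOptions fields that matter here. It is not the real
// client type.
type deleteOptions struct {
	Kind              string `json:"kind"`
	APIVersion        string `json:"apiVersion"`
	PropagationPolicy string `json:"propagationPolicy"`
}

// foregroundDeleteBody (hypothetical helper) builds the JSON body for a
// DELETE request that asks the API server for foreground cascading
// deletion: the owner object stays visible until its dependents are gone.
func foregroundDeleteBody() (string, error) {
	opts := deleteOptions{
		Kind:              "DeleteOptions",
		APIVersion:        "v1",
		PropagationPolicy: "Foreground",
	}
	b, err := json.Marshal(opts)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	body, err := foregroundDeleteBody()
	if err != nil {
		panic(err)
	}
	fmt.Println(body)
}
```

With controller-runtime, the same policy can be passed as a delete option via `client.PropagationPolicy(metav1.DeletePropagationForeground)`; whether that is the right shape for the call site referenced above is left to the implementation.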
Why is this needed:
The current background propagation policy deletes the job object without waiting for its pods to be cleaned up first. With the recreate failure policy, JobSet therefore creates the replacement job immediately, so pods from the failed job and the new job may exist at the same time. In most cases the new pods will target the same nodes as the pods of the failed job, so their scheduling will block.
This behavior is especially problematic when the exclusive placement policy is used. The major issue is that follower pods are rejected until the leader (index 0) is scheduled, and the job controller handles those repeated follower pod creation failures by placing the recreation retries in exponential backoff (starting at 1 second, up to a maximum of 1 minute).
With foreground deletion, creation of the new job, and therefore of its follower pods, is blocked until the pods being replaced are cleaned up. The leader of the new job will then likely schedule faster, which avoids the many follower pod creation failures that can lead to long backoffs.