Use foreground deletion when deleting child jobs #392

Closed
Tracked by #350
ahg-g opened this issue Jan 31, 2024 · 1 comment · Fixed by #393

ahg-g commented Jan 31, 2024

What would you like to be added:

Use foreground deletion when deleting child jobs. The propagation policy is currently set to background at:

backgroundPolicy := metav1.DeletePropagationBackground
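
To make the requested change concrete, here is a minimal, standard-library-only sketch of the `DeleteOptions` body that a delete request carries to the Kubernetes API under each policy. The type and function names (`DeletionPropagation`, `deleteOptions`, `deleteOptionsBody`) are local stand-ins invented for illustration, not the JobSet code; in the Go client the corresponding constant is `metav1.DeletePropagationForeground`.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// DeletionPropagation is a local stand-in (assumption) mirroring the string
// values the Kubernetes API accepts for propagationPolicy.
type DeletionPropagation string

const (
	PropagationBackground DeletionPropagation = "Background"
	PropagationForeground DeletionPropagation = "Foreground"
)

// deleteOptions is a hypothetical, trimmed-down version of the DeleteOptions
// body; only the fields relevant to this issue are included.
type deleteOptions struct {
	Kind              string              `json:"kind"`
	APIVersion        string              `json:"apiVersion"`
	PropagationPolicy DeletionPropagation `json:"propagationPolicy"`
}

// deleteOptionsBody renders the JSON body for a delete request that uses the
// given propagation policy.
func deleteOptionsBody(policy DeletionPropagation) (string, error) {
	b, err := json.Marshal(deleteOptions{
		Kind:              "DeleteOptions",
		APIVersion:        "v1",
		PropagationPolicy: policy,
	})
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	// Current behavior described in this issue, followed by the proposed one.
	for _, p := range []DeletionPropagation{PropagationBackground, PropagationForeground} {
		body, err := deleteOptionsBody(p)
		if err != nil {
			panic(err)
		}
		fmt.Println(body)
	}
}
```

The only difference between the two requests is the `propagationPolicy` value, which is why the change amounts to swapping one constant.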

Why is this needed:
The current background policy deletes the job object without waiting for its pods to be cleaned up first. With the recreate failure policy, JobSet therefore creates a replacement job immediately, so pods from the failed job and the new job may exist at the same time. In most cases the new pods will target the same nodes as the failed job's pods, so their scheduling will block.

This behavior is especially problematic when the exclusive placement policy is used. The major issue is that follower pods are rejected until the leader (index 0) is scheduled, and the job controller handles those repeated follower pod creation failures by placing the retries in exponential backoff (starting at 1 second, up to a maximum of 1 minute).

With foreground deletion, creation of the new job, and therefore of its follower pods, is blocked until the pods they replace are cleaned up. The leader of the new job will then likely schedule faster, avoiding the many follower pod creation failures that may lead to long backoffs.
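
The gating effect can be sketched with a toy model. Everything here (`policy`, `canCreateReplacement`, the pod counts) is hypothetical and only models the ordering argument above: with foreground deletion the failed job object stays around until its pods are gone, so the replacement cannot be created earlier.

```go
package main

import "fmt"

// policy is a local stand-in (assumption) for the deletion propagation policy.
type policy string

const (
	background policy = "Background"
	foreground policy = "Foreground"
)

// canCreateReplacement reports whether, in this toy model, the replacement job
// may be created while oldPodsRemaining pods of the failed job still exist.
func canCreateReplacement(p policy, oldPodsRemaining int) bool {
	if p == foreground {
		// The failed job object persists until its pods are cleaned up,
		// so the replacement has to wait for all of them.
		return oldPodsRemaining == 0
	}
	// Background: the job object is gone right away, regardless of its pods.
	return true
}

func main() {
	for _, p := range []policy{background, foreground} {
		// Simulate the failed job's pods draining one per step.
		for remaining := 3; remaining >= 0; remaining-- {
			if canCreateReplacement(p, remaining) {
				fmt.Printf("%s: replacement created with %d old pod(s) still present\n", p, remaining)
				break
			}
		}
	}
}
```

In the background case the replacement overlaps with all of the old pods, which is the situation that leads to blocked scheduling and follower backoffs; in the foreground case the overlap is zero.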
