Use foreground deletion when deleting child jobs #392

ahg-g · 2024-01-31T18:58:48Z

What would you like to be added:

Use Foreground deletion when deleting child jobs at

jobset/pkg/controllers/jobset_controller.go

Line 517 in 42fd647

backgroundPolicy := metav1.DeletePropagationBackground

Why is this needed:
The current background policy deletes the job object without waiting for its pods to be cleaned up first. With the recreate failure policy, JobSet therefore creates the replacement job immediately, so pods from the failed job and the new job may exist at the same time. In most cases the new pods will target the same nodes as the failed job's pods, so scheduling of the new pods will block.

This behavior is especially problematic when the exclusive placement policy is used. The main issue is that follower pods are rejected until the leader (index 0) has been scheduled, and the job controller handles these repeated follower pod creation failures by retrying with exponential backoff (starting at 1 second, up to a maximum of 1 minute).

With foreground deletion, creation of the new job, and therefore of its follower pods, is blocked until the pods they replace are cleaned up. The leader of the new job will then likely schedule faster, which avoids the many follower pod creation failures that can lead to long backoffs.
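The difference between the two policies can be sketched with a toy model. This is not JobSet or Kubernetes code: the `cluster` type, `deleteJob`, and `recreate` are hypothetical names, and the model only tracks how many pods of each job are still alive at the moment the job object is gone and a replacement may be created. In the real controller the change amounts to passing `metav1.DeletePropagationForeground` instead of `metav1.DeletePropagationBackground`.

```go
package main

import "fmt"

type policy string

const (
	background policy = "Background"
	foreground policy = "Foreground"
)

// cluster is a toy model: job name -> number of live pods.
type cluster struct {
	pods map[string]int
}

// deleteJob removes the job object and returns how many of its pods are
// still alive once the job object itself is gone.
func (c *cluster) deleteJob(name string, p policy) int {
	if p == foreground {
		// Foreground: dependents are removed before the owner disappears.
		delete(c.pods, name)
	}
	// Background: the owner disappears now; its pods linger until the
	// garbage collector catches up, so they are still counted here.
	return c.pods[name]
}

// recreate deletes the failed job, creates its replacement, and returns the
// number of old pods that coexist with the new job's pods.
func recreate(c *cluster, failed, replacement string, size int, p policy) int {
	leftover := c.deleteJob(failed, p)
	c.pods[replacement] = size
	return leftover
}

func main() {
	for _, p := range []policy{background, foreground} {
		c := &cluster{pods: map[string]int{"job-a": 4}}
		overlap := recreate(c, "job-a", "job-a-retry", 4, p)
		fmt.Printf("%s: %d old pods still present when the new job is created\n", p, overlap)
	}
}
```

In this model the background policy leaves all four old pods in place when the replacement job appears, while the foreground policy leaves none, which is the overlap the issue is trying to eliminate.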

ahg-g · 2024-01-31T18:58:58Z

/assign @danielvegamyhre

k8s-ci-robot assigned danielvegamyhre Jan 31, 2024

danielvegamyhre mentioned this issue Feb 1, 2024

Migrate from background to foreground cascading deletion policy #393

Merged

k8s-ci-robot closed this as completed in #393 Feb 2, 2024

danielvegamyhre mentioned this issue Feb 9, 2024

☂️ Requirements for v0.4.0 release #350

Closed

7 tasks

danielvegamyhre mentioned this issue Mar 1, 2024

[TPU Provisioner] Bump node pool deletion check interval GoogleCloudPlatform/ai-on-gke#269

Merged

nstogner mentioned this issue Aug 29, 2024

ForegroundDeletion of Jobs is not always enforced before recreation #665

Closed
