Suspend a running Job without requeueing #1252

vicentefb · 2023-10-24T21:24:23Z

What type of PR is this?

/kind feature

What this PR does / why we need it:

Some systems need the ability to "terminate" arbitrary jobs (because it's running for too long or other policies) without deleting the Job object, so it can be debugged.

Which issue(s) this PR fixes:

Fixes #1091

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Support for a mechanism to suspend a running Job without requeueing

linux-foundation-easycla · 2023-10-24T21:24:26Z

The committers listed above are authorized under a signed CLA.

✅ login: vicentefb / name: Vicente Ferrara (8f3ef1b)

netlify · 2023-10-24T21:24:29Z

✅ Deploy Preview for kubernetes-sigs-kueue canceled.

Name	Link
🔨 Latest commit	`8f3ef1b`
🔍 Latest deploy log	https://app.netlify.com/sites/kubernetes-sigs-kueue/deploys/65693f9f91ab040008cab7c8

k8s-ci-robot · 2023-10-24T21:24:31Z

Welcome @vicentefb!

It looks like this is your first PR to kubernetes-sigs/kueue 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/kueue has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

k8s-ci-robot · 2023-10-24T21:24:32Z

Hi @vicentefb. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

apis/kueue/v1beta1/workload_types.go

pkg/controller/jobframework/reconciler.go

pkg/workload/workload.go

apis/kueue/v1beta1/workload_types.go

tenzen-y · 2023-10-25T11:57:52Z

/ok-to-test

tenzen-y · 2023-10-25T13:49:03Z

@vicentefb @andrewsykim I think we should write a KEP before moving this forward since this is a slightly big change.

cc: @alculquicondor

andrewsykim · 2023-10-25T14:38:07Z

@tenzen-y we had some back and forth in #1091 about the API design. I wouldn't consider it a big change since it's only introducing a single (optional) field, but I'm fairly new to the code base and I might be missing something.

Is there specific concern or issue you want covered in more details that would warrant a KEP? Happy to write one but I didn't feel this change was that big. Also see discussion in #1091 for more context

tenzen-y · 2023-10-25T14:45:40Z

@tenzen-y we had some back and forth in #1091 about the API design. I wouldn't consider it a big change since it's only introducing a single (optional) field, but I'm fairly new to the code base and I might be missing something.

Is there specific concern or issue you want covered in more details that would warrant a KEP? Happy to write one but I didn't feel this change was that big. Also see discussion in #1091 for more context

Let me check the discussion on that issue.

apis/kueue/v1beta1/workload_types.go

alculquicondor

I think we are also missing a change in pkg/controller/jobframework/reconciler.go so that when we find a job that has:

Workload.spec.queueingPolicy: Never
Job.suspend: true
Workload.status.admission not nil and Workload.status.Condition[Admitted]=true)

Then, set Workload.status.Condition[Evicted]=true

tenzen-y

I'm wondering if we should be able to modify queueingPolicy via Job like suspend
since I'm wondering if we shouldn't provide features that must manually modify workloads.

Manually modifying workloads looks slightly have risks.

Actually, we used to provide all features so that users can be able to propagate via Job to workloads.

@alculquicondor @kerthcet @mimowo WDYT?

apis/kueue/v1beta1/workload_types.go

tenzen-y · 2023-10-27T20:04:55Z

apis/kueue/v1beta1/workload_types.go

@@ -59,6 +59,15 @@ type WorkloadSpec struct {
 	// +kubebuilder:default=""
 	// +kubebuilder:validation:Enum=kueue.x-k8s.io/workloadpriorityclass;scheduling.k8s.io/priorityclass;""
 	PriorityClassSource string `json:"priorityClassSource,omitempty"`
+
+	// QueueingPolicy that will determine if a job needs to be requeued or not after being suspended


Suggested change

// QueueingPolicy that will determine if a job needs to be requeued or not after being suspended

// ReQueueingPolicy that will determine if a job needs to be requeued or not after being suspended

Also, QueuingPolicy sounds like the queuing method can be changed regardless of whether the job is suspended or not.

So, we should mention as an API name that this policy will apply to the job after jobs are admitted at least once.

WDYT? @alculquicondor @kerthcet

So, we should mention as an API name that this policy will apply to the job after jobs are admitted at least once.

I think in most cases the use-case will be suspending already admitted jobs, but you can have a job that is not yet admitted (due to no available resources) and then have queueing disabled

Also, QueuingPolicy sounds like the queuing method can be changed regardless of whether the job is suspended or not.

Yes, this is true as of the current implementation. If you disable for suspended job, it will be suspended forever. If you disable for a running job, it will NOT suspend the job but disable re-queueing if manually suspended

but you can have a job that is not yet admitted (due to no available resources) and then have queueing disabled

Currently, we can disable queuing by removing the queue name label from the job.
So I think that it will cause confusion for users to support the same feature in multiple ways.

I found the QueueingPolicy=Never might be useful on the workflow use case: #1091 (comment).

However, if we introduce job readiness gates (kubernetes/kubernetes#121681), we have similar features on both the kueue side and the job-controller side.

NVM

Maybe QueueingPolicy=Never and Job readiness gates can co-exist.

I think so, we still need a representation in the Workload API.

Yes, I imagined the following steps:

The external controller adds a gate to the job.

jobframework reconciler creates workload with QueueingPolicy=Never.

The external controller deletes a gate to the job.

jobframework reconciler updates workload with QueueingPolicy=ALways.

So, I agree with the QueueingPolicy={Always, Never} API instead of my suggestion, ReQueuingPolicy={Never}.

alculquicondor · 2023-10-27T20:43:48Z

Manually modifying workloads looks slightly have risks.

Actually, we used to provide all features so that users can be able to propagate via Job to workloads.

For now, this is just meant to be used by administrators, which will have access to the Workload object.

Other than that, it's important that we start working on a CLI #487, which would allow administrators to make these kind of changes safely.

pkg/controller/core/workload_controller.go

pkg/controller/jobframework/reconciler.go

andrewsykim

Please squash / rebase commits, otherwise LGTM :)

test/integration/controller/jobs/job/job_controller_test.go

alculquicondor

/approve
I'll leave the LGTM to @andrewsykim
/hold
for review comments

Please squash

apis/kueue/v1beta1/workload_types.go

vicentefb · 2023-11-30T22:27:44Z

/retest

tenzen-y

Otherwise LGTM

/approve

pkg/util/testing/wrappers.go

k8s-ci-robot · 2023-11-30T23:36:29Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, tenzen-y, vicentefb

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [alculquicondor,tenzen-y]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

vicentefb · 2023-11-30T23:50:18Z

/retest

tenzen-y · 2023-12-01T01:09:14Z

Flaky test happened: #1389

apis/kueue/v1beta1/workload_types.go

removes evicted condition status update, removes job_controller unit tests, adds pkg job unit tests updated workload_controller log added StrictFIFO queueing strategy in e2e test while creating a second job when the first one is suspended changed behaviour in reconciler to allow kueue suspend a job directly from the workload signal, updated integration and job test, still missing to update the field name changing implementation to use active and not suspending the job manually updated yaml file addressed comments, removed log lines, used step 6 to handle workload eviction, fixed Active field to be a pointer, fixed e2e tests to use on Consistently method, still missing to make one unit test case work updated unit tests and integration tests with new implementation to evict a workload based on spec.active field added crd charts yaml added workloadspec.go file updated e2e and unit tests, missing to rebase commits added zz generated file addressed comments, moved e2e test to happen inside integration tests nit comments addressed nit comments addressed final nit comments addressed and changed Eviction constant to be WorkloadEvictedByDeactivation deleted by accident a comment, reverted updated workload types comment

andrewsykim

Please update release notes!

Thanks @vicentefb

/lgtm

k8s-ci-robot · 2023-12-01T02:20:27Z

LGTM label has been added.

Git tree hash: 8dbf7e4995e772630f663b245e3c1011a8a1195c

tenzen-y · 2023-12-01T02:23:37Z

/hold cancel

tenzen-y · 2023-12-01T08:17:51Z

/release-note-edit

Support for a mechanism to suspend a running Job without requeueing

tenzen-y · 2024-02-12T20:56:13Z

/kind api-change

removes evicted condition status update, removes job_controller unit tests, adds pkg job unit tests updated workload_controller log added StrictFIFO queueing strategy in e2e test while creating a second job when the first one is suspended changed behaviour in reconciler to allow kueue suspend a job directly from the workload signal, updated integration and job test, still missing to update the field name changing implementation to use active and not suspending the job manually updated yaml file addressed comments, removed log lines, used step 6 to handle workload eviction, fixed Active field to be a pointer, fixed e2e tests to use on Consistently method, still missing to make one unit test case work updated unit tests and integration tests with new implementation to evict a workload based on spec.active field added crd charts yaml added workloadspec.go file updated e2e and unit tests, missing to rebase commits added zz generated file addressed comments, moved e2e test to happen inside integration tests nit comments addressed nit comments addressed final nit comments addressed and changed Eviction constant to be WorkloadEvictedByDeactivation deleted by accident a comment, reverted updated workload types comment

k8s-ci-robot added the cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. label Oct 24, 2023

k8s-ci-robot requested review from denkensk and trasc October 24, 2023 21:24

k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Oct 24, 2023

k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Oct 24, 2023

andrewsykim reviewed Oct 24, 2023

View reviewed changes

apis/kueue/v1beta1/workload_types.go Outdated Show resolved Hide resolved

apis/kueue/v1beta1/workload_types.go Outdated Show resolved Hide resolved

pkg/controller/jobframework/reconciler.go Outdated Show resolved Hide resolved

pkg/workload/workload.go Outdated Show resolved Hide resolved

k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Oct 24, 2023

andrewsykim reviewed Oct 24, 2023

View reviewed changes

pkg/workload/workload.go Outdated Show resolved Hide resolved

andrewsykim reviewed Oct 24, 2023

View reviewed changes

apis/kueue/v1beta1/workload_types.go Outdated Show resolved Hide resolved

k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Oct 25, 2023

andrewsykim mentioned this pull request Oct 25, 2023

A mechanism to suspend a running Job without requeueing #1091

Closed

3 tasks

alculquicondor reviewed Oct 26, 2023

View reviewed changes

apis/kueue/v1beta1/workload_types.go Outdated Show resolved Hide resolved

alculquicondor reviewed Oct 26, 2023

View reviewed changes

tenzen-y reviewed Oct 27, 2023

View reviewed changes

apis/kueue/v1beta1/workload_types.go Outdated Show resolved Hide resolved

apis/kueue/v1beta1/workload_types.go Outdated Show resolved Hide resolved

tenzen-y reviewed Oct 27, 2023

View reviewed changes

k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 28, 2023

andrewsykim reviewed Nov 30, 2023

View reviewed changes

pkg/controller/core/workload_controller.go Outdated Show resolved Hide resolved

pkg/controller/jobframework/reconciler.go Outdated Show resolved Hide resolved

pkg/controller/jobframework/reconciler.go Outdated Show resolved Hide resolved

vicentefb requested a review from andrewsykim November 30, 2023 03:04

andrewsykim reviewed Nov 30, 2023

View reviewed changes

test/integration/controller/jobs/job/job_controller_test.go Outdated Show resolved Hide resolved

alculquicondor reviewed Nov 30, 2023

View reviewed changes

apis/kueue/v1beta1/workload_types.go Outdated Show resolved Hide resolved

k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Nov 30, 2023

vicentefb force-pushed the SuspendableJobWithoutRequeuing branch from 10b35fb to d3e0d18 Compare November 30, 2023 22:19

vicentefb requested a review from andrewsykim November 30, 2023 22:47

tenzen-y reviewed Nov 30, 2023

View reviewed changes

pkg/util/testing/wrappers.go Show resolved Hide resolved

vicentefb force-pushed the SuspendableJobWithoutRequeuing branch from d3e0d18 to 7341b38 Compare November 30, 2023 23:40

andrewsykim reviewed Dec 1, 2023

View reviewed changes

apis/kueue/v1beta1/workload_types.go Outdated Show resolved Hide resolved

vicentefb force-pushed the SuspendableJobWithoutRequeuing branch from 7341b38 to 8f3ef1b Compare December 1, 2023 02:06

andrewsykim reviewed Dec 1, 2023

View reviewed changes

k8s-ci-robot assigned andrewsykim Dec 1, 2023

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 1, 2023

k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 1, 2023

k8s-ci-robot merged commit 262c366 into kubernetes-sigs:main Dec 1, 2023
14 checks passed

k8s-ci-robot added this to the v0.6 milestone Dec 1, 2023

k8s-ci-robot added the kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API label Feb 12, 2024

tenzen-y mentioned this pull request Mar 28, 2024

Move logic to evict on Workload deactivation to the Workload controller #1841

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Suspend a running Job without requeueing #1252

Suspend a running Job without requeueing #1252

vicentefb commented Oct 24, 2023 •

edited by k8s-ci-robot

Loading

linux-foundation-easycla bot commented Oct 24, 2023 •

edited

Loading

netlify bot commented Oct 24, 2023 •

edited

Loading

k8s-ci-robot commented Oct 24, 2023

k8s-ci-robot commented Oct 24, 2023

tenzen-y commented Oct 25, 2023

tenzen-y commented Oct 25, 2023

andrewsykim commented Oct 25, 2023

tenzen-y commented Oct 25, 2023

alculquicondor left a comment

tenzen-y left a comment

tenzen-y Oct 27, 2023

andrewsykim Oct 27, 2023 •

edited

Loading

andrewsykim Oct 27, 2023

tenzen-y Oct 31, 2023

tenzen-y Nov 2, 2023

tenzen-y Nov 2, 2023

alculquicondor Nov 2, 2023

tenzen-y Nov 2, 2023

tenzen-y Nov 2, 2023

alculquicondor commented Oct 27, 2023

andrewsykim left a comment

alculquicondor left a comment

vicentefb commented Nov 30, 2023

tenzen-y left a comment

k8s-ci-robot commented Nov 30, 2023

vicentefb commented Nov 30, 2023

tenzen-y commented Dec 1, 2023

andrewsykim left a comment

k8s-ci-robot commented Dec 1, 2023

tenzen-y commented Dec 1, 2023

tenzen-y commented Dec 1, 2023

tenzen-y commented Feb 12, 2024

	// QueueingPolicy that will determine if a job needs to be requeued or not after being suspended
	// ReQueueingPolicy that will determine if a job needs to be requeued or not after being suspended

Suspend a running Job without requeueing #1252

Suspend a running Job without requeueing #1252

Conversation

vicentefb commented Oct 24, 2023 • edited by k8s-ci-robot Loading

What type of PR is this?

What this PR does / why we need it:

Which issue(s) this PR fixes:

Special notes for your reviewer:

Does this PR introduce a user-facing change?

linux-foundation-easycla bot commented Oct 24, 2023 • edited Loading

netlify bot commented Oct 24, 2023 • edited Loading

✅ Deploy Preview for kubernetes-sigs-kueue canceled.

k8s-ci-robot commented Oct 24, 2023

k8s-ci-robot commented Oct 24, 2023

tenzen-y commented Oct 25, 2023

tenzen-y commented Oct 25, 2023

andrewsykim commented Oct 25, 2023

tenzen-y commented Oct 25, 2023

alculquicondor left a comment

Choose a reason for hiding this comment

tenzen-y left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

andrewsykim Oct 27, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alculquicondor commented Oct 27, 2023

andrewsykim left a comment

Choose a reason for hiding this comment

alculquicondor left a comment

Choose a reason for hiding this comment

vicentefb commented Nov 30, 2023

tenzen-y left a comment

Choose a reason for hiding this comment

k8s-ci-robot commented Nov 30, 2023

vicentefb commented Nov 30, 2023

tenzen-y commented Dec 1, 2023

andrewsykim left a comment

Choose a reason for hiding this comment

k8s-ci-robot commented Dec 1, 2023

tenzen-y commented Dec 1, 2023

tenzen-y commented Dec 1, 2023

tenzen-y commented Feb 12, 2024

vicentefb commented Oct 24, 2023 •

edited by k8s-ci-robot

Loading

linux-foundation-easycla bot commented Oct 24, 2023 •

edited

Loading

netlify bot commented Oct 24, 2023 •

edited

Loading

andrewsykim Oct 27, 2023 •

edited

Loading