Suspend a running Job without requeueing #1252
✅ Deploy Preview for kubernetes-sigs-kueue canceled.
Welcome @vicentefb!
Hi @vicentefb. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/ok-to-test
@vicentefb @andrewsykim I think we should write a KEP before moving this forward, since this is a fairly big change. cc: @alculquicondor
@tenzen-y we had some back and forth in #1091 about the API design. I wouldn't consider it a big change, since it only introduces a single (optional) field, but I'm fairly new to the code base and might be missing something. Is there a specific concern or issue you want covered in more detail that would warrant a KEP? Happy to write one, but I didn't feel this change was that big. See the discussion in #1091 for more context.
Let me check the discussion on that issue.
I think we are also missing a change in pkg/controller/jobframework/reconciler.go so that when we find a job that has:
- Workload.spec.queueingPolicy: Never
- Job.suspend: true
- Workload.status.admission not nil and Workload.status.Condition[Admitted]=true
then we set Workload.status.Condition[Evicted]=true.
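The check described above could look roughly like the following. This is only a sketch under stated assumptions: `Workload`, `Job`, `shouldEvict`, and `markEvicted` are simplified stand-ins invented for illustration, not the real Kueue API types or the actual code in pkg/controller/jobframework/reconciler.go.

```go
package main

// QueueingPolicy is a stand-in for the field proposed in this PR.
type QueueingPolicy string

const (
	QueueingPolicyAlways QueueingPolicy = "Always"
	QueueingPolicyNever  QueueingPolicy = "Never"
)

// Workload is a hypothetical, flattened view of the fields the check reads.
type Workload struct {
	QueueingPolicy QueueingPolicy  // stands in for Workload.spec.queueingPolicy
	HasAdmission   bool            // true when Workload.status.admission is not nil
	Conditions     map[string]bool // condition type -> whether its status is True
}

// Job is a hypothetical view of the job that carries only the suspend flag.
type Job struct {
	Suspend bool
}

// shouldEvict reports whether all three conditions from the comment hold.
func shouldEvict(job Job, wl Workload) bool {
	return wl.QueueingPolicy == QueueingPolicyNever &&
		job.Suspend &&
		wl.HasAdmission && wl.Conditions["Admitted"]
}

// markEvicted sets the Evicted condition when shouldEvict holds.
// It returns true only if it changed the workload.
func markEvicted(job Job, wl *Workload) bool {
	if !shouldEvict(job, *wl) || wl.Conditions["Evicted"] {
		return false
	}
	wl.Conditions["Evicted"] = true
	return true
}
```

Returning whether anything changed lets a reconciler skip the status update when the condition is already set.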
I'm wondering whether we should be able to modify queueingPolicy via the Job, like suspend, since I don't think we should provide features that require manually modifying workloads.
Manually modifying workloads looks somewhat risky.
So far, we have provided every feature in a way that lets users propagate settings from the Job to the workload.
@alculquicondor @kerthcet @mimowo WDYT?
apis/kueue/v1beta1/workload_types.go
@@ -59,6 +59,15 @@ type WorkloadSpec struct {
// +kubebuilder:default=""
// +kubebuilder:validation:Enum=kueue.x-k8s.io/workloadpriorityclass;scheduling.k8s.io/priorityclass;""
PriorityClassSource string `json:"priorityClassSource,omitempty"`

// QueueingPolicy that will determine if a job needs to be requeued or not after being suspended
// QueueingPolicy that will determine if a job needs to be requeued or not after being suspended
// ReQueueingPolicy that will determine if a job needs to be requeued or not after being suspended
Also, QueuingPolicy sounds like the queuing method can be changed regardless of whether the job is suspended or not.
So, we should mention as an API name that this policy will apply to the job after jobs are admitted at least once.
WDYT? @alculquicondor @kerthcet
So, we should mention as an API name that this policy will apply to the job after jobs are admitted at least once.
I think in most cases the use-case will be suspending already admitted jobs, but you can have a job that is not yet admitted (due to no available resources) and then have queueing disabled
Also, QueuingPolicy sounds like the queuing method can be changed regardless of whether the job is suspended or not.
Yes, this is true as of the current implementation. If you disable queueing for a suspended job, the job stays suspended forever. If you disable it for a running job, that does NOT suspend the job, but it does disable re-queueing if the job is later suspended manually.
but you can have a job that is not yet admitted (due to no available resources) and then have queueing disabled
Currently, we can disable queuing by removing the queue name label from the job.
So I think supporting the same feature in multiple ways will confuse users.
I found that QueueingPolicy=Never might be useful for the workflow use case: #1091 (comment).
However, if we introduce job readiness gates (kubernetes/kubernetes#121681), we will have similar features on both the Kueue side and the job-controller side.
NVM
Maybe QueueingPolicy=Never and Job readiness gates can co-exist.
I think so, we still need a representation in the Workload API.
Yes, I imagined the following steps:
- The external controller adds a gate to the job.
- jobframework reconciler creates workload with QueueingPolicy=Never.
- The external controller removes the gate from the job.
- jobframework reconciler updates workload with QueueingPolicy=Always.
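The steps above can be sketched as a single mapping from the job's gates to the workload's policy. This is a minimal illustration, assuming a hypothetical `Gates` field on the job (job readiness gates are only a proposal in kubernetes/kubernetes#121681) and stand-in `Job` and `Workload` types; `syncQueueingPolicy` is an invented name, not an existing jobframework function.

```go
package main

// QueueingPolicy is a stand-in for the field discussed in this thread.
type QueueingPolicy string

const (
	QueueingPolicyAlways QueueingPolicy = "Always"
	QueueingPolicyNever  QueueingPolicy = "Never"
)

// Job is a hypothetical job view. Gates is an assumed field that models
// the proposed job readiness gates; it does not exist on batch/v1 Job.
type Job struct {
	Gates []string
}

// Workload is a hypothetical workload view with only the policy field.
type Workload struct {
	QueueingPolicy QueueingPolicy
}

// syncQueueingPolicy mirrors the steps from the comment: while the job has
// at least one gate the workload is kept out of the queue, and once all
// gates are removed the workload becomes queueable again.
func syncQueueingPolicy(job Job, wl *Workload) {
	if len(job.Gates) > 0 {
		wl.QueueingPolicy = QueueingPolicyNever
		return
	}
	wl.QueueingPolicy = QueueingPolicyAlways
}
```

Deriving the policy from the job on every reconcile keeps the Job as the single source of truth, so nobody has to edit the workload by hand.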
So, I agree with the QueueingPolicy={Always, Never} API instead of my suggestion, ReQueuingPolicy={Never}.
For now, this is just meant to be used by administrators, who will have access to the Workload object. Other than that, it's important that we start working on a CLI (#487), which would allow administrators to make these kinds of changes safely.
Please squash / rebase commits, otherwise LGTM :)
/approve
I'll leave the LGTM to @andrewsykim
/hold
for review comments
Please squash
10b35fb to d3e0d18
/retest
Otherwise LGTM
/approve
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: alculquicondor, tenzen-y, vicentefb The full list of commands accepted by this bot can be found here. The pull request process is described here
d3e0d18 to 7341b38
/retest
Flaky test happened: #1389
- removes evicted condition status update, removes job_controller unit tests, adds pkg job unit tests
- updated workload_controller log
- added StrictFIFO queueing strategy in e2e test while creating a second job when the first one is suspended
- changed behaviour in reconciler to allow kueue suspend a job directly from the workload signal, updated integration and job test, still missing to update the field name
- changing implementation to use active and not suspending the job manually
- updated yaml file
- addressed comments, removed log lines, used step 6 to handle workload eviction, fixed Active field to be a pointer, fixed e2e tests to use on Consistently method, still missing to make one unit test case work
- updated unit tests and integration tests with new implementation to evict a workload based on spec.active field
- added crd charts yaml
- added workloadspec.go file
- updated e2e and unit tests, missing to rebase commits
- added zz generated file
- addressed comments, moved e2e test to happen inside integration tests
- nit comments addressed
- nit comments addressed
- final nit comments addressed and changed Eviction constant to be WorkloadEvictedByDeactivation
- deleted by accident a comment, reverted
- updated workload types comment
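The commit summary above mentions that the final version uses an Active pointer field on the workload spec and evicts the workload based on spec.active, with a WorkloadEvictedByDeactivation reason. A rough sketch of that idea follows. Everything here is an assumption made for illustration: the types are stand-ins rather than the real Kueue API, the string value of the constant is a guess, treating an unset Active as "active" is an assumed default, and `isActive` and `evictIfInactive` are invented helper names.

```go
package main

// WorkloadEvictedByDeactivation is named in the commit summary; the string
// value used here is an assumption for illustration only.
const WorkloadEvictedByDeactivation = "InactiveWorkload"

// WorkloadSpec is a stand-in for the real spec. Active is a pointer so an
// unset field can be told apart from an explicit false.
type WorkloadSpec struct {
	Active *bool
}

// Condition is a simplified stand-in for a status condition.
type Condition struct {
	Type   string
	Status bool
	Reason string
}

// Workload is a stand-in that carries only the spec and its conditions.
type Workload struct {
	Spec       WorkloadSpec
	Conditions []Condition
}

// isActive treats a nil Active as active (an assumed default).
func isActive(wl *Workload) bool {
	return wl.Spec.Active == nil || *wl.Spec.Active
}

// evictIfInactive appends an Evicted condition when the workload has been
// deactivated and is not already evicted. It returns true if it made a change.
func evictIfInactive(wl *Workload) bool {
	if isActive(wl) {
		return false
	}
	for _, c := range wl.Conditions {
		if c.Type == "Evicted" && c.Status {
			return false
		}
	}
	wl.Conditions = append(wl.Conditions, Condition{
		Type:   "Evicted",
		Status: true,
		Reason: WorkloadEvictedByDeactivation,
	})
	return true
}
```

With this shape, an administrator deactivates a workload by setting the field to false, and the controller rather than the user is responsible for suspending the job.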
7341b38 to 8f3ef1b
LGTM label has been added. Git tree hash: 8dbf7e4995e772630f663b245e3c1011a8a1195c
/hold cancel
/release-note-edit
/kind api-change
What type of PR is this?
/kind feature
What this PR does / why we need it:
Some systems need the ability to "terminate" arbitrary jobs (because they have been running for too long, or because of other policies) without deleting the Job object, so that the job can still be debugged.
Which issue(s) this PR fixes:
Fixes #1091
Special notes for your reviewer:
Does this PR introduce a user-facing change?