Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix preemption blocked by earlier pending Workload #1866

Merged

Conversation

alculquicondor
Copy link
Contributor

@alculquicondor alculquicondor commented Mar 18, 2024

What type of PR is this?

/kind bug

What this PR does / why we need it:

The previous check didn't allow any preemptions if the workload has any common resources with pending workloads that were sorted first. In other words, a preemption only was permitted for the first workload in the list for each cohort.

By definition, preempting workloads don't fit in the cohort. Here I propose that, by cohort, we just allow one preemption per cycle if there wasn't already an admission in the same cohort. This could be enhanced with a more comprehensive approach.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

I propose a follow up: #1867

However, the fix in this PR is a better candidate for backport.

Does this PR introduce a user-facing change?

Fix preemption to reclaim quota that is blocked by an earlier pending Workload from another ClusterQueue in the same cohort.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. labels Mar 18, 2024
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Mar 18, 2024
@alculquicondor
Copy link
Contributor Author

/assign @yaroslava-serdiuk

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Mar 18, 2024
Copy link

netlify bot commented Mar 18, 2024

Deploy Preview for kubernetes-sigs-kueue canceled.

Name Link
🔨 Latest commit c2eeb3c
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-kueue/deploys/65f9c0a61402c200089ceebe

@@ -241,6 +245,9 @@ func (s *Scheduler) schedule(ctx context.Context) {
e.inadmissibleMsg += fmt.Sprintf(". Pending the preemption of %d workload(s)", preempted)
e.requeueReason = queue.RequeueReasonPendingPreemption
}
if cq.Cohort != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if the cohort is nil? I mean the CQ is standalone. The workload can still preempt another workload in this CQ, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We only try to admit one workload per CQ in a cycle (because we just take the heads). So we don't need to prevent any other admissions in this case.

@@ -262,6 +269,9 @@ func (s *Scheduler) schedule(ctx context.Context) {
if err := s.admit(ctx, e, cq.AdmissionChecks); err != nil {
e.inadmissibleMsg = fmt.Sprintf("Failed to admit workload: %v", err)
}
if cq.Cohort != nil {
cycleCohortsSkipPreemption.Insert(cq.Cohort.Name)

This comment was marked as resolved.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because if there is a workload admitted during the cycle, then the preemption calculations are no longer valid.

@@ -716,6 +717,61 @@ func TestSchedule(t *testing.T) {
"eng-alpha/borrower": *utiltesting.MakeAdmission("eng-alpha").Assignment(corev1.ResourceCPU, "on-demand", "60").Obj(),
},
},
"multiple CQs need preemption": {
focus: true,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a leftover, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Deleted.

Request(corev1.ResourceCPU, "1").
Obj()
gomega.Expect(k8sClient.Create(ctx, preemptorBetaWl)).To(gomega.Succeed())
util.ExpectWorkloadsToBePreempted(ctx, k8sClient, useAllAlphaWl)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

useAllAlphaWl has higher priority than preemptorBetaWl, why it's preempted?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because it's going over the nominal quota.

util.ExpectWorkloadsToBePreempted(ctx, k8sClient, useAllAlphaWl)
util.FinishEvictionForWorkloads(ctx, k8sClient, useAllAlphaWl)
util.ExpectWorkloadsToBeAdmitted(ctx, k8sClient, preemptorBetaWl)
//util.ExpectPendingWorkloadsMetric(alphaCQ, 2, 0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

leftover

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, fixed

@yaroslava-serdiuk
Copy link
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 19, 2024
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: 8ee36029553220709ce07c2c91e45dbc2f428d9d

Change-Id: I69584dd95c57539e163067a4fb93cdb32fc57461
@alculquicondor
Copy link
Contributor Author

/cherry-pick release-0.6

@k8s-infra-cherrypick-robot
Copy link
Contributor

@alculquicondor: once the present PR merges, I will cherry-pick it on top of release-0.6 in a new PR and assign it to you.

In response to this:

/cherry-pick release-0.6

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot merged commit ba46285 into kubernetes-sigs:main Mar 19, 2024
14 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v0.7 milestone Mar 19, 2024
@k8s-infra-cherrypick-robot
Copy link
Contributor

@alculquicondor: new pull request created: #1868

In response to this:

/cherry-pick release-0.6

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

vsoch pushed a commit to researchapps/kueue that referenced this pull request Apr 18, 2024
)

Change-Id: I69584dd95c57539e163067a4fb93cdb32fc57461
kannon92 pushed a commit to openshift-kannon92/kubernetes-sigs-kueue that referenced this pull request Nov 19, 2024
)

Change-Id: I69584dd95c57539e163067a4fb93cdb32fc57461
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants