[workload] WaitForPodsReady: Requeue at the back of the queue after timeout #689

Merged (2 commits) on Apr 27, 2023

@trasc trasc commented Apr 7, 2023

What type of PR is this?

/kind feature

What this PR does / why we need it:

A new workload condition, Evicted, is added. It is set when:

  1. The workload gets preempted
  2. The workload hits the PodsReady timeout

On a PodsReady timeout, the condition's transition timestamp is used in scheduler sorting, so the workload is moved to the end of the queue.

Which issue(s) this PR fixes:

Fixes #599

Special notes for your reviewer:

The preemption-based eviction will be used in solving #510.

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/feature Categorizes issue or PR as related to a new feature. labels Apr 7, 2023
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Apr 7, 2023

netlify bot commented Apr 7, 2023

Deploy Preview for kubernetes-sigs-kueue canceled.

Name Link
🔨 Latest commit 468cc41
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-kueue/deploys/644a71841bc8490007698835

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Apr 7, 2023
@k8s-ci-robot

Hi @trasc. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Apr 7, 2023
@trasc trasc force-pushed the evicted_to_end_of_queue branch 4 times, most recently from 2285aad to 6464e21 Compare April 11, 2023 06:46
trasc commented Apr 11, 2023

/unhold
/cc @alculquicondor
/cc @mwielgus

@trasc trasc marked this pull request as ready for review April 11, 2023 06:50
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 11, 2023
@k8s-ci-robot k8s-ci-robot requested a review from ahg-g April 11, 2023 06:50
@kerthcet

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 11, 2023
@alculquicondor alculquicondor added this to the v0.4 milestone Apr 11, 2023
@alculquicondor alculquicondor left a comment

cc @mimowo

pkg/controller/core/workload_controller.go (outdated, resolved)
pkg/controller/core/workload_controller.go (outdated, resolved)
pkg/controller/core/workload_controller.go (resolved)
pkg/controller/core/workload_controller.go (outdated, resolved)
pkg/queue/cluster_queue_impl.go (resolved)
pkg/scheduler/preemption/preemption.go (outdated, resolved)
pkg/workload/workload.go (outdated, resolved)
pkg/workload/workload.go (outdated, resolved)
pkg/workload/workload_test.go (resolved)
@trasc trasc force-pushed the evicted_to_end_of_queue branch 3 times, most recently from 68edd48 to 3e4fa75 Compare April 19, 2023 12:24
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 20, 2023
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 25, 2023
@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Apr 25, 2023

- Move the workloads evicted due to pods ready timeout to the end of the queue. #689

## Production Readiness

uhm... this was a mistake... they should all be ###

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 25, 2023
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 26, 2023
trasc commented Apr 26, 2023

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 26, 2023
@trasc trasc force-pushed the evicted_to_end_of_queue branch 2 times, most recently from c3d9155 to 749d43c Compare April 26, 2023 05:31
@alculquicondor

/hold cancel
/lgtm
/label tide/merge-method-squash

I thought we had configuration that prevented merge commits.

@k8s-ci-robot k8s-ci-robot added tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Apr 26, 2023
@k8s-ci-robot k8s-ci-robot removed lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Apr 26, 2023
@@ -127,6 +127,17 @@ func (r *WorkloadReconciler) Reconcile(ctx context.Context, req ctrl.Request) (c
ctx = ctrl.LoggerInto(ctx, log)
log.V(2).Info("Reconciling Workload")

// if a pods ready timeout eviction is ongoing.
if evictionCond := apimeta.FindStatusCondition(wl.Status.Conditions, kueue.WorkloadEvicted); evictionCond != nil && evictionCond.Status == metav1.ConditionTrue &&
evictionCond.Reason == kueue.WorkloadEvictedByPodsReadyTimeout &&

Maybe I said this before, but I think it shouldn't be conditional on the reason.
But we can standardize with preemption in a follow-up, based on the outcome of #510.

Leaving this comment for future reference.

pkg/controller/core/workload_controller.go (resolved)
The new condition is set when a workload is preempted or
its pods ready timeout expires.

In case of pods ready timeout, the condition's transition
timestamp will be used in ordering the workloads in the
scheduling queues.
@alculquicondor

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 27, 2023
@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, trasc

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit a036cd0 into kubernetes-sigs:main Apr 27, 2023
@trasc trasc deleted the evicted_to_end_of_queue branch April 28, 2023 13:00