
WaitForPodsReady: Reset the requeueState while reconciling #1838

Conversation

tenzen-y
Member

What type of PR is this?

/kind bug

What this PR does / why we need it:

As @alculquicondor mentioned in #1821 (comment), a mutating webhook can't update both the spec and the status of an object in a single webhook call.

So, we must reset the requeueState during reconciliation instead of in the webhook.
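A minimal sketch of the idea, not the exact diff (the function name and placement are illustrative; the real change lives in pkg/controller/core/workload_controller.go):

// Clear stale requeue bookkeeping when a Workload is reactivated, so the
// backoff counter starts fresh on the next eviction cycle.
func resetRequeueStateIfReactivated(wl *kueue.Workload) bool {
	// Assumption for illustration: a reactivated Workload is one with
	// spec.active back to true that still carries a RequeueState.
	if ptr.Deref(wl.Spec.Active, true) && wl.Status.RequeueState != nil {
		wl.Status.RequeueState = nil
		return true // the caller persists this with a single status update
	}
	return false
}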

Which issue(s) this PR fixes:

Fixes #1821

Special notes for your reviewer:

Does this PR introduce a user-facing change?

WaitForPodsReady: Fix a bug where the requeueState was not reset.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Mar 13, 2024
@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Mar 13, 2024

netlify bot commented Mar 13, 2024

Deploy Preview for kubernetes-sigs-kueue canceled.

🔨 Latest commit: 688f223
🔍 Latest deploy log: https://app.netlify.com/sites/kubernetes-sigs-kueue/deploys/65f3242a7a18de0008579f8d

@tenzen-y
Member Author

/cherry-pick release-0.6

@k8s-infra-cherrypick-robot
Contributor

@tenzen-y: once the present PR merges, I will cherry-pick it on top of release-0.6 in a new PR and assign it to you.

In response to this:

/cherry-pick release-0.6

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@tenzen-y tenzen-y force-pushed the reset-requeueState-while-reconciling branch from ffba87c to 975b5f2 on March 13, 2024 23:31
@@ -248,7 +248,7 @@ var _ = ginkgo.Describe("SchedulerWithWaitForPodsReady", func() {
// To avoid flakiness, we don't verify if the workload has a QuotaReserved=false with pending reason here.
})

ginkgo.It("Should re-admit a timed out workload and deactivate a workload exceeded the re-queue count limit", func() {
ginkgo.It("Should re-admit a timed out workload and deactivate a workload exceeded the re-queue count limit. After that re-activating a workload", func() {
Contributor

nit: s/activating/activate

may we consider a separate test for this, assuming that setting up the state isn't too difficult?

Member Author
@tenzen-y tenzen-y Mar 14, 2024

It is possible to divide this test into two tests, but we would need to start another manager to configure backoffLimitCount=1 or 0, like this:

requeuingBackoffLimitCount = ptr.To[int32](2)

That would increase integration test time, so to avoid it, I implemented this as a single test.
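To illustrate the cost, a split would require a second manager setup per configuration; a hypothetical sketch (the context name and limit value are assumptions):

ginkgo.Context("with backoffLimitCount=1", func() {
	ginkgo.BeforeEach(func() {
		// Each distinct WaitForPodsReady configuration needs its own
		// manager, duplicating the setup quoted later in this thread.
		requeuingBackoffLimitCount = ptr.To[int32](1)
	})
	// ...
})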

}, util.Timeout, util.Interval).Should(gomega.Succeed(), "Reactivate inactive Workload")
gomega.Eventually(func(g gomega.Gomega) {
g.Expect(k8sClient.Get(ctx, client.ObjectKeyFromObject(prodWl), prodWl)).Should(gomega.Succeed())
g.Expect(prodWl.Status.RequeueState).Should(gomega.BeNil())
Contributor

should we verify at the start of this sequence that the RequeueState is in a certain configuration (non-nil, or something more specific), in case the previous part of the test changes?

Not relevant if this test is split out and this state is set explicitly.

Member Author

We already verify that here:

util.ExpectWorkloadToHaveRequeueCount(ctx, k8sClient, client.ObjectKeyFromObject(prodWl), ptr.To[int32](2))

Does that make sense?
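For reference, a rough sketch of what a helper like util.ExpectWorkloadToHaveRequeueCount presumably asserts (the body below is an assumption, not the actual util code), which is why it also covers the non-nil check:

func ExpectWorkloadToHaveRequeueCount(ctx context.Context, c client.Client, key client.ObjectKey, count *int32) {
	gomega.EventuallyWithOffset(1, func(g gomega.Gomega) {
		var wl kueue.Workload
		g.Expect(c.Get(ctx, key, &wl)).Should(gomega.Succeed())
		// A non-nil RequeueState with the expected count is exactly the
		// precondition the reviewer asked about.
		g.Expect(wl.Status.RequeueState).ShouldNot(gomega.BeNil())
		g.Expect(wl.Status.RequeueState.Count).Should(gomega.Equal(count))
	}, Timeout, Interval).Should(gomega.Succeed())
}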

Comment on lines +283 to +291
apimeta.SetStatusCondition(&prodWl.Status.Conditions, metav1.Condition{
Type: kueue.WorkloadEvicted,
Status: metav1.ConditionTrue,
Reason: kueue.WorkloadEvictedByDeactivation,
Message: "evicted by Test",
})
g.Expect(k8sClient.Status().Update(ctx, prodWl)).Should(gomega.Succeed())
Contributor

wouldn't the controller add this already?

Member Author

The jobs controller does add this condition here:

workload.SetEvictedCondition(wl, kueue.WorkloadEvictedByDeactivation, "The workload is deactivated")

However, we don't register the jobs controller with the manager in this test suite.

cfg := &config.Configuration{
WaitForPodsReady: &config.WaitForPodsReady{
Enable: true,
BlockAdmission: &blockAdmission,
Timeout: &metav1.Duration{Duration: value},
RequeuingStrategy: &config.RequeuingStrategy{
Timestamp: ptr.To(requeuingTimestamp),
BackoffLimitCount: requeuingBackoffLimitCount,
},
},
}
mgr.GetScheme().Default(cfg)
err := indexer.Setup(ctx, mgr.GetFieldIndexer())
gomega.Expect(err).NotTo(gomega.HaveOccurred())
cCache := cache.New(mgr.GetClient(), cache.WithPodsReadyTracking(cfg.WaitForPodsReady.Enable && cfg.WaitForPodsReady.BlockAdmission != nil && *cfg.WaitForPodsReady.BlockAdmission))
queues := queue.NewManager(
mgr.GetClient(), cCache,
queue.WithPodsReadyRequeuingTimestamp(requeuingTimestamp),
)
failedCtrl, err := core.SetupControllers(mgr, queues, cCache, cfg)
gomega.Expect(err).ToNot(gomega.HaveOccurred(), "controller", failedCtrl)
failedWebhook, err := webhooks.Setup(mgr)
gomega.Expect(err).ToNot(gomega.HaveOccurred(), "webhook", failedWebhook)
err = workloadjob.SetupIndexes(ctx, mgr.GetFieldIndexer())
gomega.Expect(err).NotTo(gomega.HaveOccurred())
sched := scheduler.New(
queues, cCache, mgr.GetClient(), mgr.GetEventRecorderFor(constants.AdmissionName),
scheduler.WithPodsReadyRequeuingTimestamp(requeuingTimestamp),
)
err = sched.Start(ctx)
gomega.Expect(err).NotTo(gomega.HaveOccurred())

Member Author

We just register the indexers:

err = workloadjob.SetupIndexes(ctx, mgr.GetFieldIndexer())
gomega.Expect(err).NotTo(gomega.HaveOccurred())

Contributor

Interesting...
We should probably move that piece of code to the Workload controller in a follow-up. It doesn't belong in the job reconciler, as it is completely independent of the job. Otherwise, anyone who wants to write a custom integration would need to implement it themselves.
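A hypothetical shape for that follow-up (an assumption, not code from this PR), where the Workload reconciler sets the condition itself:

// In the Workload reconciler: deactivated workloads get the Evicted
// condition without each job integration having to implement it.
if !ptr.Deref(wl.Spec.Active, true) && !apimeta.IsStatusConditionTrue(wl.Status.Conditions, kueue.WorkloadEvicted) {
	workload.SetEvictedCondition(wl, kueue.WorkloadEvictedByDeactivation, "The workload is deactivated")
	// ... followed by a status update ...
}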

Member Author
@tenzen-y tenzen-y Mar 14, 2024

It makes sense, but we need to remember why we put this here.
Let me open an issue for this.

Contributor

Can you add a TODO to remove this status update and link to #1841?

Member Author
@tenzen-y tenzen-y Mar 14, 2024

Sure. Thank you for raising it!
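For illustration, the requested TODO could read roughly like this next to the status update in the test (the wording is an assumption):

// TODO: Remove this manual condition and status update once the Evicted
// condition is set by the Workload controller instead of the job
// reconciler. Tracked in kubernetes-sigs/kueue#1841.
apimeta.SetStatusCondition(&prodWl.Status.Conditions, metav1.Condition{ /* ... as above ... */ })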

Contributor
@alculquicondor alculquicondor left a comment

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 14, 2024
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 255b9d366fb6ad738bc6ce44f50e054bf1017507

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, tenzen-y

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [alculquicondor,tenzen-y]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@alculquicondor
Contributor

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 14, 2024
@tenzen-y tenzen-y force-pushed the reset-requeueState-while-reconciling branch from 975b5f2 to 105c016 on March 14, 2024 16:15
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 14, 2024
@tenzen-y tenzen-y force-pushed the reset-requeueState-while-reconciling branch from 105c016 to 70ac2d3 on March 14, 2024 16:16
@alculquicondor
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 14, 2024
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 1d5ed08e2f7d9626e4d9f6cc1bcae6b51e5bad99

@tenzen-y
Member Author

/hold

@tenzen-y tenzen-y force-pushed the reset-requeueState-while-reconciling branch from 70ac2d3 to 62bb01f on March 14, 2024 16:17
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 14, 2024
@tenzen-y tenzen-y force-pushed the reset-requeueState-while-reconciling branch from 62bb01f to 688f223 on March 14, 2024 16:22
@tenzen-y
Member Author

@alculquicondor Could you give lgtm again?

@tenzen-y
Member Author

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 14, 2024
@alculquicondor
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 14, 2024
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 0d8aaed7c14f0efb389c215977f5b98d47beb0fd

@tenzen-y
Member Author

/test pull-kueue-test-integration-main
due to #1829

@k8s-ci-robot k8s-ci-robot merged commit 7ccc556 into kubernetes-sigs:main Mar 14, 2024
14 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v0.7 milestone Mar 14, 2024
@k8s-infra-cherrypick-robot
Contributor

@tenzen-y: #1838 failed to apply on top of branch "release-0.6":

Applying: WaitForPodsReady: Reset the requeueState while reconciling instead of webhook
Using index info to reconstruct a base tree...
M	charts/kueue/templates/webhook/webhook.yaml
M	pkg/controller/core/workload_controller.go
M	pkg/webhooks/workload_webhook.go
M	pkg/webhooks/workload_webhook_test.go
M	test/integration/scheduler/podsready/scheduler_test.go
M	test/integration/webhook/workload_test.go
Falling back to patching base and 3-way merge...
Auto-merging test/integration/webhook/workload_test.go
CONFLICT (content): Merge conflict in test/integration/webhook/workload_test.go
Auto-merging test/integration/scheduler/podsready/scheduler_test.go
Auto-merging pkg/webhooks/workload_webhook_test.go
Auto-merging pkg/webhooks/workload_webhook.go
Auto-merging pkg/controller/core/workload_controller.go
Auto-merging charts/kueue/templates/webhook/webhook.yaml
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 WaitForPodsReady: Reset the requeueState while reconciling instead of webhook
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

/cherry-pick release-0.6

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@tenzen-y tenzen-y deleted the reset-requeueState-while-reconciling branch March 14, 2024 16:45
@tenzen-y
Member Author

I'm preparing the cherry-pick PR.
