Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optional garbage collection of finished workloads #2686

Open
wants to merge 12 commits into
base: main
Choose a base branch
from

Conversation

mwysokin
Copy link
Contributor

@mwysokin mwysokin commented Jul 24, 2024

What type of PR is this?

/kind feature

What this PR does / why we need it:

This PR adds an optional and configurable garbage collection of finished Workloads. If the object retention policy is configured for Workloads a completed workload will be either re-queued for later deletion (if the policy is configured but its retention period hasn't elapsed) or deleted (both for existing Workloads that might've been created by the previous version or instance of kueue but now are expired according to the retention policy or for newly created Workloads when their retention period elapses).

It decreases both kueue's memory footprint as well as etcd's storage.

Which issue(s) this PR fixes:

Fixes #1618

Special notes for your reviewer:

This PR comes with a couple of configuration scenarios:

  • if the object retention policy is not configured for Workloads (default) or the policy is explicitly set to null all Workloads are kept indefinitely like in the previous versions of kueue,
  • if the object retention policy is configured to 0s it means that there's effectively no retention and all completed Workloads are going to be immediately delated,
  • if the object retention policy is set to any other value than 0s kueue will delete Workloads when their retention period elapses (both for existing and newly created Workloads).
Known issue

The Workload deletion triggers a Workload delete event to be emitted. It's needed because in response to that event kueue performs its internal cleanup (e.g. caches and queues). However it seems that emitting that event automatically causes a Job related to the Workload to be suspended. In this scenario it's confusing and unnecessary because the job is already completed. This is how it chronologically looks:

{'level': 'Level(-2)', 'caller': 'core/workload_controller.go:154', 'msg': 'Reconciling Workload'}
{'level': 'Level(-2)', 'caller': 'core/workload_controller.go:169', 'msg': 'Deleting workload because it has finished and the retention period has elapsed'}
{'level': 'debug, 'caller': 'recorder/recorder.go:104', 'msg': 'Deleted workload job-stress-job-1-pqsxf-dd783'}
{'level': 'Level(-2)', 'caller': 'core/workload_controller.go:627', 'msg': 'Workload delete event'}
{'level': 'Level(-2)', 'caller': 'core/clusterqueue_controller.go:330', 'msg': 'Got generic event'}
{'level': 'Level(-2)', 'caller': 'jobframework/reconciler.go:313', 'msg': 'Reconciling Job'}
{'level': 'Level(-2)', 'caller': 'jobframework/reconciler.go:606', 'msg': 'job with no matching workload, suspending'}

The author of this PR commits to creating a fix/improvement for the showcased scenario. It's up to the reviewers to decide whether the fix should be either: part of this PR or its own subsequent issue/PR and whether this PR should be blocked or not until the fix is done.

Does this PR introduce a user-facing change?

Introduce ObjectRetentionPolicies.FinishedWorkloadRetention flag, which allows to configure a retention policy for Workloads, Workloads will be deleted when their retention period elapses, by default this new behavior is disabled

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jul 24, 2024
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: mwysokin
Once this PR has been reviewed and has the lgtm label, please assign alculquicondor for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot requested review from mimowo and trasc July 24, 2024 12:12
@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jul 24, 2024
@k8s-ci-robot
Copy link
Contributor

Hi @mwysokin. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jul 24, 2024
Copy link

netlify bot commented Jul 24, 2024

Deploy Preview for kubernetes-sigs-kueue canceled.

Name Link
🔨 Latest commit 0b9933f
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-kueue/deploys/66b4c01c7ad9d90008a181a2

Copy link
Contributor

@mimowo mimowo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall lgtm. I'm not certain about is the finalizer - it might be worth adding an integration test for it.

I think the issue with suspending a Job when deleting a workload is pre-existing, and can be handled separately from this feature (feel free to open a separate one).

apis/config/v1beta1/configuration_types.go Show resolved Hide resolved
pkg/controller/core/workload_controller.go Outdated Show resolved Hide resolved
pkg/controller/core/workload_controller.go Outdated Show resolved Hide resolved
pkg/controller/core/workload_controller_test.go Outdated Show resolved Hide resolved
pkg/controller/core/workload_controller.go Show resolved Hide resolved
Obj(),
wantError: nil,
},
"should delete the workload because the retention period has elapsed, object retention configured": {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a test case for a workload with finalizer? I would like to also consider an integration test for it.

Copy link
Contributor Author

@mwysokin mwysokin Jul 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure thing 👍 It'd greatly help me to understand the case if you could respond to my comment above first about cleaning up Workloads with a finalizer 🙏

@trasc
Copy link
Contributor

trasc commented Jul 24, 2024

I think the issue with suspending a Job when deleting a workload is pre-existing, and can be handled separately from this feature (feel free to open a separate one).

In my opinion it's better to be done with this PR, it should just be a matter of moving

log.V(2).Info("job with no matching workload, suspending")
within
if _, _, finished := job.Finished(); !finished {
var msg string
if w == nil {
msg = "Missing Workload; unable to restore pod templates"
} else {
msg = "No matching Workload; restoring pod templates according to existent Workload"
}
if err := r.stopJob(ctx, job, w, StopReasonNoMatchingWorkload, msg); err != nil {
return nil, fmt.Errorf("stopping job with no matching workload: %w", err)
}
}
.

(I don't think the job is really suspended the logs are lying :) )

@mwysokin
Copy link
Contributor Author

mwysokin commented Jul 25, 2024

Thank you @mimowo @tenzen-y @trasc for your comments 🙏
The plan with this PR is following:

  • create a small KEP for it (KEP 1618: Optional garbage collection of finished Workloads #2742),
  • add logic/a test case for an expired Workload with a finalizer once I understand that case correctly (26d083e),
  • either fix the suspension or remove the confusing message about removing a Job that was completed if it's a small enough change so that it doesn't pollute the PR.

@tenzen-y
Copy link
Member

Thank you @mimowo @tenzen-y @trasc for your comments 🙏 The plan with this PR is following:

  • create a small KEP for it,
  • add logic/a test case for an expired Workload with a finalizer once I understand that case correctly,
  • either fix the suspension or remove the confusing message about removing a Job that was completed if it's a small enough change so that it doesn't pollute the PR.

It would better to open a dedicated PR for the KEP :)

@mimowo
Copy link
Contributor

mimowo commented Jul 25, 2024

Sounds like a great plan.

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Aug 7, 2024
@mwysokin mwysokin requested a review from mimowo August 8, 2024 12:55
@mwysokin mwysokin requested a review from tenzen-y August 8, 2024 12:55
@mwysokin
Copy link
Contributor Author

mwysokin commented Aug 8, 2024

It took me a while but I applied all requested updates. Thank you so much for your time spent on the initial review 🙏 May I please ask for a little bit more of your time for a second round? 🙏

@alculquicondor
Copy link
Contributor

/ok-to-test
/cc

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Aug 14, 2024
@mwysokin
Copy link
Contributor Author

I'll take a look at the failing test on Friday. I'm pretty sure I run them last week and they worked fine. Maybe one of the main merges broke something. I'll investigate.

@mwysokin
Copy link
Contributor Author

FYI I just wanted to let everyone know that I didn't abandon the work. I've been on vacation last week and I'll be back in September. Sorry for the delay 🙇

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm leaving the API discussion to #2742

@@ -193,4 +193,7 @@ func SetDefaults_Configuration(cfg *Configuration) {
if fs := cfg.FairSharing; fs != nil && fs.Enable && len(fs.PreemptionStrategies) == 0 {
fs.PreemptionStrategies = []PreemptionStrategy{LessThanOrEqualToFinalShare, LessThanInitialShare}
}
if cfg.ObjectRetentionPolicies == nil {
cfg.ObjectRetentionPolicies = &ObjectRetentionPolicies{}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are not adding anything relevant, let's not add any code to "defaults"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a test

log.Error(err, "Failed to delete workload from the API server")
return ctrl.Result{}, fmt.Errorf("deleting workflow from the API server: %w", err)
}
r.recorder.Eventf(&wl, corev1.EventTypeNormal, "Deleted", "Deleted finished workload due to elapsed retention: %v", workload.Key(&wl))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
r.recorder.Eventf(&wl, corev1.EventTypeNormal, "Deleted", "Deleted finished workload due to elapsed retention: %v", workload.Key(&wl))
r.recorder.Eventf(&wl, corev1.EventTypeNormal, "Deleted", "Deleted finished workload due to elapsed retention")

Because the event is already targeting the workload, we don't need to add it's name again in the message.

log.Error(err, "Failed to delete workload from the API server")
return ctrl.Result{}, fmt.Errorf("deleting workflow from the API server: %w", err)
}
r.recorder.Eventf(&wl, corev1.EventTypeNormal, "Deleted", "Deleted finished workload due to elapsed retention: %v", workload.Key(&wl))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we are missing a return for this happy path.

r.recorder.Eventf(&wl, corev1.EventTypeNormal, "Deleted", "Deleted finished workload due to elapsed retention: %v", workload.Key(&wl))
} else {
remainingTime := expirationTime.Sub(now)
log.V(2).Info("Requeueing workload for deletion after retention period", "remainingTime", remainingTime)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
log.V(2).Info("Requeueing workload for deletion after retention period", "remainingTime", remainingTime)
log.V(3).Info("Requeueing workload for deletion after retention period", "remainingTime", remainingTime)

Or even 4.

expirationTime := finishedCond.LastTransitionTime.Add(r.objectRetention.Duration)
if now.After(expirationTime) {
log.V(2).Info("Deleting workload because it has finished and the retention period has elapsed", "retention", r.objectRetention.Duration)
if err := r.client.Delete(ctx, &wl); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should probably ignore a NotFound error.

Obj(),
wantError: nil,
},
"shouldn't try to delete the workload (no event emitted) because it is already being deleted by kubernetes, object retention configured": {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if it has an owner reference?

ginkgo.By("workload should be deleted after the retention period")
gomega.Eventually(func() error {
return k8sClient.Get(ctx, client.ObjectKeyFromObject(wl), &createdWorkload)
}, 2*time.Second, time.Second).ShouldNot(gomega.Succeed())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use the existing timeout constants.


func managerSetupWithWorkloadRetentionEnabled(ctx context.Context, mgr manager.Manager) {
objectRetention := &config.ObjectRetentionPolicies{FinishedWorkloadRetention: &v1.Duration{
Duration: 3 * time.Second,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a bit too long, given that this influences how long the test runs for.

@mwysokin
Copy link
Contributor Author

I'm really sorry it's been dragging. I'm coming back to it starting Monday.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 19, 2024
@k8s-ci-robot
Copy link
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot
Copy link
Contributor

@mwysokin: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-kueue-build-image-main 0b9933f link true /test pull-kueue-build-image-main
pull-kueue-test-multikueue-e2e-main 0b9933f link true /test pull-kueue-test-multikueue-e2e-main
pull-kueue-verify-main 0b9933f link true /test pull-kueue-verify-main
pull-kueue-test-unit-main 0b9933f link true /test pull-kueue-test-unit-main
pull-kueue-test-e2e-main-1-30 0b9933f link true /test pull-kueue-test-e2e-main-1-30
pull-kueue-test-integration-main 0b9933f link true /test pull-kueue-test-integration-main
pull-kueue-test-e2e-main-1-31 0b9933f link true /test pull-kueue-test-e2e-main-1-31
pull-kueue-test-scheduling-perf-main 0b9933f link true /test pull-kueue-test-scheduling-perf-main
pull-kueue-test-e2e-main-1-29 0b9933f link true /test pull-kueue-test-e2e-main-1-29
pull-kueue-test-tas-e2e-main 0b9933f link true /test pull-kueue-test-tas-e2e-main

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Optional garbage collection of finished Workloads
6 participants