[multikueue] Remove remote objects synchronously when reachable. #2347
✅ Deploy Preview for kubernetes-sigs-kueue canceled.
Force-pushed from 85ee548 to 8c87ac6.
/assign @mimowo
The implementation lgtm. As for testing, I would like to see a test case for the baseline scenario.
@@ -59,6 +61,7 @@ type wlReconciler struct {
	clusters *clustersReconciler
	origin string
	workerLostTimeout time.Duration
	deletedWlCache *utilmaps.SyncMap[string, *kueue.Workload]
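The new `deletedWlCache` field is, presumably, read and written from controller callbacks that can run concurrently, so it needs a synchronized map. The sketch below is a minimal, hypothetical stand-in for such a generic map; the method names (`Add`, `Get`, `Delete`, `Len`) and the trimmed `Workload` type are assumptions for illustration, not the actual `utilmaps.SyncMap` or `kueue.Workload` API.

```go
package main

import (
	"fmt"
	"sync"
)

// SyncMap is a hypothetical, minimal mutex-guarded generic map; the real
// utilmaps.SyncMap may expose different method names.
type SyncMap[K comparable, V any] struct {
	mu sync.RWMutex
	m  map[K]V
}

// NewSyncMap returns an empty map that is safe for concurrent use.
func NewSyncMap[K comparable, V any](size int) *SyncMap[K, V] {
	return &SyncMap[K, V]{m: make(map[K]V, size)}
}

// Add stores v under k, replacing any previous value.
func (s *SyncMap[K, V]) Add(k K, v V) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.m[k] = v
}

// Get returns the value stored under k and whether it was present.
func (s *SyncMap[K, V]) Get(k K) (V, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	v, ok := s.m[k]
	return v, ok
}

// Delete removes k; deleting a missing key is a no-op.
func (s *SyncMap[K, V]) Delete(k K) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.m, k)
}

// Len reports the number of entries, handy for spotting leaked keys in tests.
func (s *SyncMap[K, V]) Len() int {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return len(s.m)
}

// Workload is a hypothetical placeholder for kueue.Workload.
type Workload struct{ Name string }

func main() {
	cache := NewSyncMap[string, *Workload](0)
	cache.Add("ns/wl1", &Workload{Name: "wl1"})
	wl, ok := cache.Get("ns/wl1")
	fmt.Println(wl.Name, ok) // wl1 true
	cache.Delete("ns/wl1")
	fmt.Println(cache.Len()) // 0
}
```

A read/write mutex keeps lookups cheap while still serializing writes; a `Len` check like the one above is one way a unit test could assert that no entry is leaked.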
I'm wondering whether this could leak a key-value pair: if MultiKueueGC removes the workload in the meantime, the synchronous delete could fail and leave the entry behind. Should we also delete the entry when performing MultiKueueGC?
No. In the reconciler, the group will be empty, so the object deletion is skipped and the workload is deleted from the cache.
Hm, but I think there could be a race condition when the group contains a worker but, in the meantime, the workload is removed by GC. In that case, the delete request in the new code would probably fail because the object does not exist, and thus it would not delete the key. Can you double-check?
NotFound errors are ignored during object deletion; other errors trigger a new reconcile that will eventually remove the cache entry.
(I have extended the unit tests a bit.)
Thanks for checking and for updating the tests; the extra check is useful.
/lgtm
/approve
@alculquicondor @tenzen-y I think it would be good to include this in the 0.7.x line. WDYT?
LGTM label has been added. Git tree hash: 83604b2ac7412b89720fa3ca5896a150db3a7714
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: mimowo, trasc.
ok with me to cherry-pick.
SGTM
/cherry-pick release-0.7
@trasc: new pull request created: #2360
/release-note-edit
…ernetes-sigs#2347)
* [multikueue] Remove remote objects synchronously when reachable.
* Extend unit test coverage.
* Review remarks
What type of PR is this?
/kind feature
What this PR does / why we need it:
Remove the MultiKueue remote objects synchronously, when the worker clusters are reachable, based on the workload content cached by the event filters.
Which issue(s) this PR fixes:
Fixes #2320
Special notes for your reviewer:
Does this PR introduce a user-facing change?