Update AllocatableResourceGeneration #2984

Merged
merged 1 commit into from
Sep 5, 2024


gabesaba
Contributor

@gabesaba gabesaba commented Sep 4, 2024

What type of PR is this?

/kind cleanup
/kind bug

What this PR does / why we need it:

We update AllocatableResourceGeneration to be compatible with HierarchicalCohorts (#79). We delete this number from Cohorts, and only store it in the ClusterQueue. After this change, if an update occurs in any part of the tree, we bump the ClusterQueues' numbers when running the resource update.

The previous implementation would be complex with HierarchicalCohorts: we would have to traverse up to the root to see whether any of the generations had increased. Additionally, it was non-monotonic and contained a bug: if a ClusterQueue was deleted, the Cohort's generation could decrease, even though the available resources of the Cohort had changed.
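
The non-monotonic behaviour described above can be reproduced in a small sketch. It assumes, purely for illustration, that the old Cohort generation was derived by summing its members' generations; the description does not spell out the exact derivation, and the `queue` type and `aggregatedGeneration` helper are hypothetical names, not Kueue code.

```go
package main

import "fmt"

// queue is a hypothetical stand-in for a ClusterQueue; only the
// generation counter is modelled.
type queue struct {
	name       string
	generation int64
}

// aggregatedGeneration models the ASSUMED old scheme, in which the Cohort's
// generation is derived from its current members (here: a plain sum).
func aggregatedGeneration(members []*queue) int64 {
	var total int64
	for _, q := range members {
		total += q.generation
	}
	return total
}

func main() {
	a := &queue{name: "a", generation: 3}
	b := &queue{name: "b", generation: 5}

	before := aggregatedGeneration([]*queue{a, b})
	// Deleting b changes the Cohort's resources, yet the derived number drops.
	after := aggregatedGeneration([]*queue{a})

	fmt.Printf("before=%d after=%d\n", before, after)
}
```

Under this assumption, a consumer that only asks "is my stored generation lower than the current one?" would miss the deletion, because the number went down rather than up.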

Does this PR introduce a user-facing change?

Calculate AllocatableResourceGeneration more accurately. This fixes a bug where a workload might not have the Flavors it was assigned in a previous scheduling cycle invalidated, when the resources in the Cohort had changed. This bug could occur when other ClusterQueues were deleted from the Cohort.
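
To make the release note concrete, here is a minimal sketch of how a monotonic generation counter can be used to invalidate flavors remembered from an earlier scheduling cycle. The `lastAssignment` type, its fields, and the `isStale` helper are hypothetical names chosen for illustration; they are not taken from this PR.

```go
package main

import "fmt"

// lastAssignment is a hypothetical record of what a workload was given in a
// previous scheduling cycle, together with the ClusterQueue generation that
// was current at that time.
type lastAssignment struct {
	Flavor                 string
	ClusterQueueGeneration int64
}

// isStale reports whether the remembered flavor should be discarded. This
// check is only safe if the generation never decreases (an assumption that
// matches the monotonic counter described in this PR).
func isStale(last *lastAssignment, currentGeneration int64) bool {
	return last != nil && last.ClusterQueueGeneration < currentGeneration
}

func main() {
	last := &lastAssignment{Flavor: "on-demand", ClusterQueueGeneration: 4}

	fmt.Println(isStale(last, 4)) // no resource update since the assignment
	fmt.Println(isStale(last, 5)) // a resource update bumped the generation
}
```

With a counter that can decrease, the second check could return false after a real change in available resources, which is the situation the release note describes.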

@k8s-ci-robot k8s-ci-robot added the release-note-none, kind/cleanup, kind/feature, and cncf-cla: yes labels Sep 4, 2024
@k8s-ci-robot k8s-ci-robot added the size/L label Sep 4, 2024

netlify bot commented Sep 4, 2024

Deploy Preview for kubernetes-sigs-kueue ready!

Name Link
🔨 Latest commit d138152
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-kueue/deploys/66d893bc3b67960008bd14e0
😎 Deploy Preview https://deploy-preview-2984--kubernetes-sigs-kueue.netlify.app

@gabesaba
Contributor Author

gabesaba commented Sep 4, 2024

/assign @alculquicondor

@gabesaba
Contributor Author

gabesaba commented Sep 4, 2024

/retest

@mimowo
Contributor

mimowo commented Sep 5, 2024

Additionally, it was non-monotonic and contained a bug: if a ClusterQueue was deleted, the Cohort's generation could decrease, despite the available resources of the Cohort having had changed.

Good point! Given this fixes a bug - should we add a release note? Also, can we add a test which demonstrates the fix (can be in follow up)?

@@ -154,6 +154,7 @@ func (r ResourceNode) calculateLendable() map[corev1.ResourceName]int64 {
 }

 func updateClusterQueueResourceNode(cq *clusterQueue) {
+	cq.AllocatableResourceGeneration += 1
Contributor

@mimowo mimowo Sep 5, 2024


Note to myself and potentially other reviewers: I see, so when updating any CQ (see here) we either call updateClusterQueueResourceNode directly, or call it for each child via updateCohortResourceNode. This way, any CQ update entails a generation update for all CQs in the cohort.
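
The propagation described in this comment can be sketched as follows. Only `updateClusterQueueResourceNode` and its increment come from the diff; the flat `cohort` type, the `Members` field, and the signature of `updateCohortResourceNode` are assumptions made to keep the example self-contained, and the actual resource recalculation is omitted.

```go
package main

import "fmt"

// clusterQueue keeps only the field relevant to this PR.
type clusterQueue struct {
	Name                          string
	AllocatableResourceGeneration int64
}

// cohort is a simplified, flat stand-in (assumption): it only lists its
// child ClusterQueues.
type cohort struct {
	Name    string
	Members []*clusterQueue
}

// updateClusterQueueResourceNode mirrors the line added in the diff: every
// resource update bumps the ClusterQueue's generation.
func updateClusterQueueResourceNode(cq *clusterQueue) {
	cq.AllocatableResourceGeneration += 1
}

// updateCohortResourceNode (signature assumed) refreshes every child CQ,
// so each of them receives a bump.
func updateCohortResourceNode(c *cohort) {
	for _, cq := range c.Members {
		updateClusterQueueResourceNode(cq)
	}
}

func main() {
	a := &clusterQueue{Name: "a"}
	b := &clusterQueue{Name: "b"}
	standalone := &clusterQueue{Name: "standalone"}
	team := &cohort{Name: "team", Members: []*clusterQueue{a, b}}

	// An update anywhere in the cohort goes through the cohort-level update.
	updateCohortResourceNode(team)
	// A CQ that belongs to no cohort is updated directly.
	updateClusterQueueResourceNode(standalone)

	fmt.Println(a.AllocatableResourceGeneration,
		b.AllocatableResourceGeneration,
		standalone.AllocatableResourceGeneration)
}
```

Because the bump rides on calls that already happen during a resource update, the sketch adds no extra traversal of its own.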

@mimowo
Contributor

mimowo commented Sep 5, 2024

LGTM, but I would like to better understand the impact of the change on the end user. Please add a release note if there is any. Then the PR might be categorized as a bugfix, and we may consider cherry-picking.

@gabesaba
Contributor Author

gabesaba commented Sep 5, 2024

Additionally, it was non-monotonic and contained a bug: if a ClusterQueue was deleted, the Cohort's generation could decrease, despite the available resources of the Cohort having had changed.

Good point! Given this fixes a bug - should we add a release note? Also, can we add a test which demonstrates the fix (can be in follow up)?

Discussed offline. I will add a release note and classify this as a bug.

I don't think the test is worth the effort. It would have to be an integration test: a unit test could only have checked that the old field fails to increase, and we are deleting that field. Additionally, I think we're well covered by testing the AllocatableResourceGeneration in the cache/snapshot packages.

@k8s-ci-robot k8s-ci-robot added the release-note label and removed the release-note-none label Sep 5, 2024
@gabesaba
Contributor Author

gabesaba commented Sep 5, 2024

/kind bug

@k8s-ci-robot k8s-ci-robot added the kind/bug label Sep 5, 2024
@mimowo
Contributor

mimowo commented Sep 5, 2024

We have also discussed the alternative approach of keeping the Generation only in the root of the tree (the root cohort, or the ClusterQueue itself). However, this would be more involved, as (1) some CQs are not part of any cohort, and (2) it would need non-trivial updates with HierarchicalCohorts when the root is deleted or added.

Finally, the implemented approach does not increase computational complexity as it injects the bumps to already existing function invocations.

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm label Sep 5, 2024
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 3908dbea0bddb06325d2d74bf71e834b5ee3aa15

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: gabesaba, mimowo


@k8s-ci-robot k8s-ci-robot added the approved label Sep 5, 2024
@k8s-ci-robot k8s-ci-robot merged commit 4c50cbc into kubernetes-sigs:main Sep 5, 2024
16 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v0.9 milestone Sep 5, 2024
@gabesaba gabesaba deleted the resource_generation branch September 5, 2024 09:17
kannon92 pushed a commit to openshift-kannon92/kubernetes-sigs-kueue that referenced this pull request Nov 19, 2024
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files.
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA.
kind/bug Categorizes issue or PR as related to a bug.
kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt.
kind/feature Categorizes issue or PR as related to a new feature.
lgtm "Looks good to me", indicates that a PR is ready to be merged.
release-note Denotes a PR that will be considered when it comes time to generate release notes.
size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

4 participants