[scheduler] Support preemption of pods using ReadWriteOncePod PVCs #114051

Merged
merged 1 commit into from
Feb 13, 2023

Conversation

chrishenzie
Member

@chrishenzie chrishenzie commented Nov 21, 2022

What type of PR is this?

/kind feature

What this PR does / why we need it:

Adds support for preemption of pods using ReadWriteOncePod PVCs. This is a required feature for beta graduation.

Which issue(s) this PR fixes:

Fixes #103132

Does this PR introduce a user-facing change?

Adds scheduler preemption support for pods using `ReadWriteOncePod` PVCs

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

- [KEP]: https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/2485-read-write-once-pod-pv-access-mode

Testing

This feature can be tested in the following ways:

# Unit tests.
go test ./pkg/scheduler/framework/... ./pkg/scheduler/internal/...

# Integration tests.
go test -v -run TestUnschedulablePodBecomesSchedulable/scheduled_pod_uses_read-write-once-pod_pvc ./test/integration/scheduler/filters
go test -v ./test/integration/scheduler/preemption/ -run TestReadWriteOncePodPreemption

# And E2E tests.
make WHAT=test/e2e/e2e.test
_output/local/bin/linux/amd64/e2e.test --kubeconfig=${HOME}/.kube/config -ginkgo.focus='Feature:ReadWriteOncePod'

/sig storage
/sig scheduling
/cc @msau42
/cc @jsafrane
/cc @alculquicondor

@k8s-ci-robot k8s-ci-robot added the release-note-none Denotes a PR that doesn't merit a release note. label Nov 21, 2022
@k8s-ci-robot k8s-ci-robot requested a review from msau42 November 21, 2022 20:38
@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. sig/storage Categorizes an issue or PR as relevant to SIG Storage. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. labels Nov 21, 2022
@k8s-ci-robot
Contributor

Please note that we're already in Test Freeze for the release-1.26 branch. This means every merged PR will be automatically fast-forwarded via the periodic ci-fast-forward job to the release branch of the upcoming v1.26.0 release.

Fast forwards are scheduled to happen every 6 hours, whereas the most recent run was: Mon Nov 21 15:27:42 UTC 2022.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Nov 21, 2022
@chrishenzie chrishenzie changed the title Rwop preemption [scheduler] Support preemption of pods using ReadWriteOncePod PVCs Nov 21, 2022
@k8s-ci-robot k8s-ci-robot added area/test sig/testing Categorizes an issue or PR as relevant to SIG Testing. labels Nov 21, 2022
@chrishenzie
Member Author

Thinking out loud: building the conflictingPods map in cycle state requires scanning all pods for conflicts, which is O(N), where N is the number of pods.

We may be able to do better by repurposing the PVCRefCounts cache into a mapping from PVC name to the pods using that PVC; call it PVCRefs. Then, when we calculate the cycle state, we can build conflictingPods by looking up in PVCRefs the names of the ReadWriteOncePod PVCs used by the pod being scheduled. This solution would be O(M), where M is the number of ReadWriteOncePod volumes used by the pod, which is most likely smaller than N.
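
A minimal Go sketch of the lookup idea above, for illustration only. The type and method names (`pvcRefs`, `add`, `remove`, `conflictingPods`) and the "namespace/name" string keys are assumptions made for this example; they are not the scheduler cache's actual types.

```go
package main

// podKey and pvcKey are hypothetical "namespace/name" identifiers.
type podKey string
type pvcKey string

// pvcRefs maps each PVC to the set of pods that currently reference it.
// It stands in for the proposed PVCRefs cache.
type pvcRefs map[pvcKey]map[podKey]struct{}

// add records that pod references pvc.
func (r pvcRefs) add(pvc pvcKey, pod podKey) {
	if r[pvc] == nil {
		r[pvc] = make(map[podKey]struct{})
	}
	r[pvc][pod] = struct{}{}
}

// remove drops the reference from pod to pvc, if present.
func (r pvcRefs) remove(pvc pvcKey, pod podKey) {
	delete(r[pvc], pod) // deleting from a nil map is a no-op
	if len(r[pvc]) == 0 {
		delete(r, pvc)
	}
}

// conflictingPods returns every pod that already references one of the
// given ReadWriteOncePod PVCs. It does one map lookup per PVC instead of
// scanning all pods.
func (r pvcRefs) conflictingPods(rwopPVCs []pvcKey) map[podKey]struct{} {
	conflicts := make(map[podKey]struct{})
	for _, pvc := range rwopPVCs {
		for pod := range r[pvc] {
			conflicts[pod] = struct{}{}
		}
	}
	return conflicts
}
```

With this shape, the work per scheduling cycle is proportional to the number of ReadWriteOncePod PVCs on the incoming pod plus the pods that reference them, rather than to the total number of pods.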

@chrishenzie chrishenzie force-pushed the rwop-preemption branch 3 times, most recently from c431e7b to 9d97867 Compare November 22, 2022 00:25
Member

@alculquicondor alculquicondor left a comment

I didn't review tests yet.

@chrishenzie
Member Author

I restructured the PR to follow the approach described in #114051 (comment). Please see each commit's description for more context.

At a high level, I've changed the caches to store references to the specific pods using each PVC. In VolumeRestrictions, we query this cache with the pod-to-be-scheduled's ReadWriteOncePod PVCs to build a cache of conflicting pods.

@chrishenzie
Member Author

chrishenzie commented Nov 23, 2022

/milestone v1.27

@k8s-ci-robot
Contributor

@chrishenzie: You must be a member of the kubernetes/milestone-maintainers GitHub team to set the milestone. If you believe you should be able to issue the /milestone command, please contact your Milestone Maintainers Team and have them propose you as an additional delegate for this responsibility.

In response to this:

/milestone 1.27

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@chrishenzie chrishenzie force-pushed the rwop-preemption branch 2 times, most recently from cf95111 to bac3076 Compare November 23, 2022 20:06
@chrishenzie
Member Author

Preemption of lower-priority pods doesn't appear to work in a multi-node environment, though it does work on a single node. Looking into this.

Member

@kerthcet kerthcet left a comment

Looks like this is ready to merge now. Anything left? @alculquicondor

PVCs using the ReadWriteOncePod access mode can only be referenced by a
single pod. When a pod is scheduled that uses a ReadWriteOncePod PVC,
return "Unschedulable" if the PVC is already in-use in the cluster.

To support preemption, the "VolumeRestrictions" scheduler plugin
computes cycle state during the PreFilter phase. This cycle state
contains the number of references to the ReadWriteOncePod PVCs used by
the pod-to-be-scheduled.

During scheduler simulation (AddPod and RemovePod), we increment and
decrement these reference counts in the cycle state when the simulated
pod uses any of these ReadWriteOncePod PVCs.

In the Filter phase, the scheduler checks if there are any PVC reference
conflicts, and returns "Unschedulable" if there is a conflict.

This is a required feature for the ReadWriteOncePod beta. For more context, see:
https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/2485-read-write-once-pod-pv-access-mode#beta
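
The PreFilter / AddPod / RemovePod / Filter flow described in the commit message can be sketched roughly as below. This is a simplified illustration under stated assumptions, not the plugin's code: `simPod`, `rwopState`, `preFilter`, and `hasConflict` are hypothetical names, pods are reduced to a name plus a list of ReadWriteOncePod PVC names, and the real scheduler framework types are not used.

```go
package main

// simPod is a hypothetical, simplified pod: a name plus the names of the
// ReadWriteOncePod PVCs it uses.
type simPod struct {
	name     string
	rwopPVCs []string
}

// rwopState stands in for the per-cycle state. It tracks reference counts
// only for the ReadWriteOncePod PVCs used by the pod being scheduled.
type rwopState struct {
	refCounts map[string]int
}

// preFilter builds the state for the incoming pod and counts references
// held by pods that are already running.
func preFilter(incoming simPod, existing []simPod) *rwopState {
	s := &rwopState{refCounts: make(map[string]int)}
	for _, pvc := range incoming.rwopPVCs {
		s.refCounts[pvc] = 0
	}
	for _, p := range existing {
		s.addPod(p)
	}
	return s
}

// update adjusts counts only for PVCs the incoming pod cares about.
func (s *rwopState) update(p simPod, delta int) {
	for _, pvc := range p.rwopPVCs {
		if _, tracked := s.refCounts[pvc]; tracked {
			s.refCounts[pvc] += delta
		}
	}
}

// addPod and removePod mirror the simulation hooks used during preemption.
func (s *rwopState) addPod(p simPod)    { s.update(p, 1) }
func (s *rwopState) removePod(p simPod) { s.update(p, -1) }

// hasConflict plays the role of the Filter check: any remaining reference
// means the incoming pod would be reported as Unschedulable.
func (s *rwopState) hasConflict() bool {
	for _, n := range s.refCounts {
		if n > 0 {
			return true
		}
	}
	return false
}
```

In this sketch, simulating removal of a lower-priority pod with `removePod` clears the conflict, which is what lets preemption discover that evicting the pod would make the incoming pod schedulable.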
t.Errorf("Unexpected PreFilter status (-want, +got): %s", diff)
}
// If PreFilter fails, then Filter will not run.
if test.preFilterWantStatus.IsSuccess() {
Member

FYI @kidddddddddddddddddddddd regarding #114898

Member

@alculquicondor alculquicondor left a comment

/approve
/hold cancel
/lgtm

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 31, 2023
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 31, 2023
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: f308a56055752323b5ae6ae2fe129430c4da8eaa

@alculquicondor
Member

/assign @aojea
for test/

@chrishenzie
Member Author

PRR approved, this is ready for merge. See #114494 for e2e test changes and feature gate bump to beta.

@kerthcet
Member

kerthcet commented Feb 3, 2023

/lgtm
ping @aojea for tests. We have other PRs blocked by this one.

@aojea
Member

aojea commented Feb 13, 2023

> /lgtm ping @aojea for tests. We have other PRs blocked by this one.

/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, aojea, chrishenzie

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 13, 2023
@k8s-ci-robot k8s-ci-robot merged commit b8b18ec into kubernetes:master Feb 13, 2023
@k8s-ci-robot k8s-ci-robot added this to the v1.27 milestone Feb 13, 2023
@chrishenzie chrishenzie deleted the rwop-preemption branch March 21, 2023 00:12
Labels
- `approved`: Indicates a PR has been approved by an approver from all required OWNERS files.
- `area/test`
- `cncf-cla: yes`: Indicates the PR's author has signed the CNCF CLA.
- `kind/feature`: Categorizes issue or PR as related to a new feature.
- `lgtm`: "Looks good to me", indicates that a PR is ready to be merged.
- `priority/important-longterm`: Important over the long term, but may not be staffed and/or may need multiple releases to complete.
- `release-note`: Denotes a PR that will be considered when it comes time to generate release notes.
- `sig/scheduling`: Categorizes an issue or PR as relevant to SIG Scheduling.
- `sig/storage`: Categorizes an issue or PR as relevant to SIG Storage.
- `sig/testing`: Categorizes an issue or PR as relevant to SIG Testing.
- `size/XL`: Denotes a PR that changes 500-999 lines, ignoring generated files.
- `triage/accepted`: Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Enforce ReadWriteOncePod PVC access mode during scheduling
5 participants