KEP-1472: storage capacity tracking: GA #3229

Merged: 3 commits, Mar 2, 2022
2 changes: 2 additions & 0 deletions keps/prod-readiness/sig-storage/1472.yaml
@@ -1,3 +1,5 @@
kep-number: 1472
beta:
  approver: "@wojtek-t"
+stable:
+  approver: "@wojtek-t"
49 changes: 27 additions & 22 deletions keps/sig-storage/1472-storage-capacity-tracking/README.md
@@ -77,10 +77,10 @@ Items marked with (R) are required *prior to targeting to a milestone / release*
 - [X] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
 - [X] (R) Graduation criteria is in place
 - [X] (R) Production readiness review completed
-- [ ] Production readiness review approved
-- [ ] "Implementation History" section is up-to-date for milestone
-- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
-- [ ] Supporting documentation e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
+- [X] Production readiness review approved
+- [X] "Implementation History" section is up-to-date for milestone
+- [X] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
+- [X] Supporting documentation e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

<!--
**Note:** This checklist is iterative and should be reviewed and updated every time this enhancement is being considered for a milestone.
@@ -806,7 +806,7 @@ checks for events that describe the problem.
 - 5 installs
 - More rigorous forms of testing e.g., downgrade tests and scalability tests
 - Allowing time for feedback
-- Integration with [Cluster Autoscaler](https://github.com/kubernetes/autoscaler)
+- Design for support in [Cluster Autoscaler](https://github.com/kubernetes/autoscaler)
@pohly (Contributor Author), Feb 24, 2022:

This is a slightly relaxed criterion: kubernetes/autoscaler#3887 shows that the current in-tree API is sufficient to enable autoscaling; the PR just hasn't been merged yet because SIG Autoscaling wanted more time to investigate how this can be made simpler for users.

The recommendation from the SIG Autoscaling meeting on 2022-02-21 was to not wait for that.

Contributor:

Can you add a link to this PR in the KEP to show this is in progress?

Member:

Can you update this section with a summary of the design?

Contributor Author:

Will do. It works essentially as alluded to in that section: the labeling of generated nodes must be modified to distinguish them from real ones, and manually created CSIStorageCapacity objects then provide the information about those future nodes.
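
As an illustration of the approach described in this reply, a manually created `CSIStorageCapacity` object for nodes that do not exist yet might look like the sketch below. The object name, namespace, storage class, label key, and sizes are all hypothetical; only the field names come from the `storage.k8s.io/v1` API that the KEP lists for Kubernetes 1.24.

```yaml
# Hypothetical manifest: every name, label, and size below is illustrative.
apiVersion: storage.k8s.io/v1        # v1 in Kubernetes 1.24; v1beta1 from 1.21
kind: CSIStorageCapacity
metadata:
  name: example-future-node-capacity
  namespace: kube-system             # assumed namespace
storageClassName: example-local-storage
# Match only the nodes generated by the autoscaler for its simulation.
# The label key is an assumption; it must not be present on real nodes.
nodeTopology:
  matchLabels:
    example.com/autoscaler-template-node: "true"
# Capacity the future nodes are expected to offer for this storage class.
capacity: 100Gi
maximumVolumeSize: 100Gi
```

The label under `nodeTopology` would have to be one that is attached only to the generated nodes, so that the object never matches a real node whose capacity is already published by the driver.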


### Upgrade / Downgrade Strategy

@@ -842,15 +842,14 @@ enhancement:
### Feature enablement and rollback

* **How can this feature be enabled / disabled in a live cluster?**
-  - [X] Feature gate
-    - Feature gate name: CSIStorageCapacity
-    - Components depending on the feature gate:
-      - apiserver
+  - [X] CSIDriver.StorageCapacity field can be modified
+    - Components depending on the field:
       - kube-scheduler

* **Does enabling the feature change any default behavior?**

-Enabling it only in kube-scheduler and api-server and not any of the
+Enabling it only in kube-scheduler and api-server by updating
+to a Kubernetes version where it is enabled and not in any of the
running CSI drivers causes no changes. Everything continues as
before because no `CSIStorageCapacity` objects are created and
kube-scheduler does not wait for any.
@@ -861,12 +860,19 @@ enhancement:

* **Can the feature be disabled once it has been enabled (i.e. can we rollback
the enablement)?**
-Yes.

-In Kubernetes 1.19 and 1.20, registration of the
-`CSIStorageCapacity` type was controlled by the feature gate. In
-1.21, the type will always be enabled in the v1beta1 API
-group. Depending on the combination of Kubernetes release and
+Yes, by disabling it in the CSI driver deployment:
+`CSIDriver.StorageCapacity=false` causes kube-scheduler to ignore storage
+capacity for the driver. In addition, external-provisioner can be deployed so
+that it does not publish capacity information (`--enable-capacity=false`).
+
+Downgrading to a previous Kubernetes release may also disable the feature or
+allow disabling it via a feature gate: In Kubernetes 1.19 and 1.20,
+registration of the `CSIStorageCapacity` type was controlled by the feature
+gate. In 1.21, the type will always be enabled in the v1beta1 API group. In
+1.24, the type is always enabled in the v1 API unconditionally.
+
+Depending on the combination of Kubernetes release and
feature gate, the type will be disabled. However, any existing
objects will still remain in the etcd database, they just won't be
visible.
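
The two rollback switches named in this hunk could be set roughly as sketched below. The driver name and container name are hypothetical, and the second document is only a fragment of a pod spec with all other arguments omitted; `storageCapacity` and `--enable-capacity=false` are taken from the KEP text.

```yaml
# Hypothetical CSIDriver object; only spec.storageCapacity matters here.
apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
  name: example.csi.driver.io   # assumed driver name
spec:
  # false: kube-scheduler ignores storage capacity for this driver.
  storageCapacity: false
---
# Fragment of an external-provisioner sidecar container (not a full manifest).
containers:
  - name: csi-provisioner       # assumed container name
    args:
      # Stop publishing CSIStorageCapacity objects for the driver.
      - --enable-capacity=false
```

According to the implementation history in this KEP, `CSIDriver.Spec.StorageCapacity` became mutable in Kubernetes 1.23, so on that release or later the first change can be applied to an existing object.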
@@ -934,7 +940,7 @@ consumption, increased latency), specifically

* **Were upgrade and rollback tested? Was upgrade->downgrade->upgrade path tested?**

-Not yet, but will be done manually before transition to beta.
+This was done manually before transition to beta.
Member:

Any findings? Can you describe the environment in which it was run?

Contributor Author:

No surprises. I used a kubeadm-based cluster in VMs. I've extended the text.


* **Is the rollout accompanied by any deprecations and/or removals of features,
APIs, fields of API types, flags, etc.?**
@@ -951,18 +957,16 @@ scheduling workloads onto nodes, but not while those run.
That a CSI driver provides storage capacity information can be seen in the
following metric data that will be provided by external-provisioner instances:
 - total number of `CSIStorageCapacity` objects that the external-provisioner
-  is currently meant to manage for the driver
+  is currently meant to manage for the driver: `csistoragecapacities_desired_goal`
 - number of such objects that currently exist and can be kept because
-  they have a topology/storage class pair that is still valid
+  they have a topology/storage class pair that is still valid: `csistoragecapacities_desired_current`
 - number of such objects that currently exist and need to be deleted
-  because they have an outdated topology/storage class pair
-- work queue length for creating, updating or deleting objects
+  because they have an outdated topology/storage class pair: `csistoragecapacities_obsolete`
+- work queue length for creating, updating or deleting objects: `csistoragecapacity` work queue

The CSI driver name will be used as label. When using distributed
provisioning, the node name will be used as additional label.

-TODO: mention the exact metrics names once they are implemented.
-
* **What are the SLIs (Service Level Indicators) an operator can use to
determine the health of the service?**
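
A minimal sketch of how the external-provisioner metrics named in this hunk might be watched, written as a Prometheus alerting rule file. It assumes the metrics are scraped under exactly the names given in the KEP and that the series carry identical label sets; the group name, alert names, durations, and severity are illustrative and not part of the KEP.

```yaml
groups:
  - name: csi-storage-capacity            # hypothetical group name
    rules:
      # Fewer objects exist than the provisioner is meant to manage.
      - alert: CSIStorageCapacityObjectsMissing
        expr: csistoragecapacities_desired_goal - csistoragecapacities_desired_current > 0
        for: 15m                           # assumed grace period
        labels:
          severity: warning
        annotations:
          summary: external-provisioner has not created all desired CSIStorageCapacity objects
      # Objects with an outdated topology/storage class pair linger.
      - alert: CSIStorageCapacityObsoleteObjects
        expr: csistoragecapacities_obsolete > 0
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: obsolete CSIStorageCapacity objects have not been deleted
```

Both conditions are expected to occur briefly during normal operation, for example while topology changes, which is why the sketch only fires after they persist.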

@@ -1100,6 +1104,7 @@ to `CSIStorageCapacity` objects.
 - Kubernetes 1.19: alpha
 - Kubernetes 1.21: beta
 - Kubernetes 1.23: `CSIDriver.Spec.StorageCapacity` became mutable.
+- Kubernetes 1.24: GA

## Drawbacks

16 changes: 9 additions & 7 deletions keps/sig-storage/1472-storage-capacity-tracking/kep.yaml
@@ -17,21 +17,23 @@ approvers:
- "@msau42"
prr-approvers:
- "@wojtek-t"
-stage: beta
+stage: stable
see-also:
- "https://docs.google.com/document/d/1WtX2lRJjZ03RBdzQIZY3IOvmoYiF5JxDX35-SsCIAfg"
-latest-milestone: "v1.21"
+latest-milestone: "v1.24"
milestone:
alpha: "v1.19"
beta: "v1.21"
-stable: "v1.23"
+stable: "v1.24"
feature-gates:
- name: CSIStorageCapacity
components:
- kube-apiserver
- kube-scheduler
-disable-supported: true
+disable-supported: false

# The following PRR answers are required at beta release
-#metrics:
-# - my_feature_metric
+metrics:
Member:

Great!

+  - csistoragecapacities_desired_goal
+  - csistoragecapacities_desired_current
+  - csistoragecapacities_obsolete
+  - csistoragecapacity work queue