- Release Signoff Checklist
- Summary
- Glossary
- Motivation
- Proposal
- Design Details
- Kubernetes Changes, Access Mode
- CSI Specification Changes, Volume Capabilities
- Test Plan
- Graduation Criteria
- Upgrade / Downgrade Strategy
- Version Skew Strategy
- API Server Version N / Scheduler Version N / Kubelet Version N-1 or N-2
- API Server Version N / Scheduler Version N-1 / Kubelet Version N-1 or N-2
- API Understands ReadWriteOncePod, CSI Sidecars Do Not
- CSI Controller Service Understands New CSI Access Modes, CSI Node Service Does Not
- API Server Has Feature Enabled, Scheduler and Kubelet Do Not
- API Server Has Feature Enabled, Scheduler Does not, Kubelet Does
- API Server Has Feature Enabled, Scheduler Does, Kubelet Does Not
- Production Readiness Review Questionnaire
- Implementation History
- Drawbacks
- Alternatives
- Infrastructure Needed (Optional)
Items marked with (R) are required prior to targeting to a milestone / release.
- (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
- (R) KEP approvers have approved the KEP status as implementable
- (R) Design details are appropriately documented
- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
- (R) Graduation criteria is in place
- (R) Production readiness review completed
- (R) Production readiness review approved
- "Implementation History" section is up-to-date for milestone
- User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
This KEP introduces a new ReadWriteOncePod access mode for PersistentVolumes that restricts access to a single pod on a single node. This access mode differs from the existing ReadWriteOnce (RWO) access mode, which restricts access to a single node, but allows simultaneous access from many pods on that node.
Additionally, this KEP outlines required changes to the CSI spec, drivers, and sidecars in order to support this new access mode while maintaining backwards compatibility.
- Node
- A virtual or physical machine in a Kubernetes cluster that runs pods
- PersistentVolume
- A piece of storage in the cluster that has been provisioned by an administrator or dynamically provisioned using StorageClasses
- Access mode
- A description of how a PersistentVolume can be accessed
- ReadWriteOnce (RWO)
- An access mode that restricts PersistentVolume access to a single node
- ReadWriteOncePod (RWOP)
- A new access mode that restricts PersistentVolume access to a single pod on a single node
- CSI
- The Container Storage Interface, a specification for storage provider plugins to integrate with cluster orchestrators (like Kubernetes)
Kubernetes does not have an access mode for PersistentVolumes that allows users to restrict access to a single pod on a single node. This can cause problems for certain workloads. For example, if you had a workload (using ReadWriteOnce) performing an update of a storage device and the workload scaled to more than one Pod, you could encounter issues if the second pod landed on the same node and started simultaneously modifying the device.
For sensitive workloads, users have to work around the lack of a single-workload access mode in other ways (for example, scheduling only a single pod on a node and using ReadWriteOnce), which can lead to inefficient use of resources in their cluster.
See #30085 and #26567 for issues related to this.
In the CSI spec there are conflicting definitions of the SINGLE_NODE_WRITER access mode. By definition, SINGLE_NODE_WRITER means "Can only be published once as read/write on a single node, at any given time." The problem is how this access mode is used during NodePublishVolume, which is typically where volume mounting is performed.
The CSI spec defines that when NodePublishVolume is called a second time for a volume with a non-MULTI_NODE access mode and with a different target path, the plugin should return FAILED_PRECONDITION. For CSI plugins that strictly adhere to the spec, this guarantees that a volume can only be mounted to a single target path, which means SINGLE_NODE_WRITER restricts access to a single pod on a single node. This behavior conflicts with the original definition. Due to this conflict, we do not have an access mode that represents multiple writers on the same node.
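To make the conflict concrete, the following is a minimal Go sketch of how a plugin that follows the NodePublishVolume text literally might behave. The strictNode type, its published map, and the errFailedPrecondition value are hypothetical names used only for illustration; a real plugin would return a gRPC status rather than a plain Go error.

```go
package main

import (
	"errors"
	"fmt"
)

// errFailedPrecondition stands in for the gRPC FAILED_PRECONDITION code
// (an assumption for this sketch; real plugins return a gRPC status).
var errFailedPrecondition = errors.New("FAILED_PRECONDITION")

// strictNode is a hypothetical node plugin that follows the spec literally.
type strictNode struct {
	// published maps a volume ID to the single target path it is mounted at.
	published map[string]string
}

// nodePublishVolume models NodePublishVolume for a non-MULTI_NODE volume.
func (n *strictNode) nodePublishVolume(volumeID, targetPath string) error {
	if existing, ok := n.published[volumeID]; ok {
		if existing == targetPath {
			return nil // repeating the same call is idempotent
		}
		// A second, different target path is rejected.
		return errFailedPrecondition
	}
	n.published[volumeID] = targetPath
	return nil
}

func main() {
	n := &strictNode{published: map[string]string{}}
	fmt.Println(n.nodePublishVolume("vol-1", "/pods/a/mount"))
	fmt.Println(n.nodePublishVolume("vol-1", "/pods/b/mount"))
}
```

Because each pod gets its own target path, the second call corresponds to a second pod on the same node, which is why a strictly conforming plugin effectively allows only one pod per volume.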
- Outline expected behavior of the ReadWriteOncePod access mode
- Provide a high level design for ReadWriteOncePod access mode support
- Define API changes needed to support this access mode
- Outline changes needed in CSI spec and sidecars to support the ReadWriteOncePod access mode
- Outline changes needed in CSI spec and sidecars to continue supporting the ReadWriteOnce access mode
See the version skew strategy section below for additional scenarios.
This scenario asserts a ReadWriteOncePod can only be bind mounted into a single pod on a single node.
- User creates a PVC with ReadWriteOncePod access mode
- User creates pod 1 using this PVC, scheduled on node 1
- User creates pod 2 using this PVC, also scheduled on node 1
- User observes pod 2 fails to start because the referenced PVC is in-use by another pod on the same node
Additionally, for attachment:
- User creates pod 3 using this PVC, scheduled on node 2
- User observes pod 3 fails to start because the referenced PVC is attached to another node
This scenario asserts the existing ReadWriteOnce behavior is preserved for old CSI drivers. The exact behavior may differ across CSI drivers since not all drivers conform to the CSI spec, but it should be consistent with how it behaved before.
- User creates a PVC with ReadWriteOnce access mode
- User creates pod 1 using this PVC, scheduled on node 1
- User observes pod running
In Kubernetes, we should add a new ReadWriteOncePod persistent volume access mode to PersistentVolumes and PersistentVolumeClaims. This change will require adding a feature gate to the kube-apiserver, kube-scheduler, and kubelet. Validation logic will need updating to accept this access mode type if the feature gate is enabled.
// can be mounted in read/write mode to exactly 1 pod
ReadWriteOncePod PersistentVolumeAccessMode = "ReadWriteOncePod"
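The feature-gated validation described above can be sketched roughly as follows. This is an illustration only: validateAccessModes and its gateEnabled parameter are hypothetical, and the real validation code in kube-apiserver is structured differently.

```go
package main

import "fmt"

type PersistentVolumeAccessMode string

const (
	ReadWriteOnce    PersistentVolumeAccessMode = "ReadWriteOnce"
	ReadOnlyMany     PersistentVolumeAccessMode = "ReadOnlyMany"
	ReadWriteMany    PersistentVolumeAccessMode = "ReadWriteMany"
	ReadWriteOncePod PersistentVolumeAccessMode = "ReadWriteOncePod"
)

// validateAccessModes is a hypothetical helper: it accepts ReadWriteOncePod
// only when the feature gate is enabled.
func validateAccessModes(modes []PersistentVolumeAccessMode, gateEnabled bool) error {
	for _, m := range modes {
		switch m {
		case ReadWriteOnce, ReadOnlyMany, ReadWriteMany:
			// Always accepted.
		case ReadWriteOncePod:
			if !gateEnabled {
				return fmt.Errorf("unsupported access mode %q: feature gate disabled", m)
			}
		default:
			return fmt.Errorf("unsupported access mode %q", m)
		}
	}
	return nil
}

func main() {
	modes := []PersistentVolumeAccessMode{ReadWriteOncePod}
	fmt.Println(validateAccessModes(modes, true))
	fmt.Println(validateAccessModes(modes, false))
}
```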
This access mode will be enforced in two places:
First is at the time a pod is scheduled. When scheduling a pod, if another pod is found using the same PVC and the PVC uses ReadWriteOncePod, then scheduling will fail and the pod will be considered UnschedulableAndUnresolvable.
In order to determine if a pod using a ReadWriteOncePod PVC can be scheduled, we need to enumerate all pods and check if any are already consuming this PVC. This logic will take place as part of the PreFilter extension point in the volume restrictions plugin.
The node info cache will be extended to map the PVC name to a reference count for the PVC. In the PreFilter extension point, if the pod's PVC is using ReadWriteOncePod, we will query this map for each node checking for references to the scheduled pod's PVC. If one is found the pod will fail scheduling and be marked UnschedulableAndUnresolvable.
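A minimal sketch of this lookup, assuming a simplified nodeInfo type with a pvcRefCounts map. Both names, and the "namespace/name" key format, are assumptions for illustration and not the scheduler's actual types.

```go
package main

import "fmt"

// nodeInfo is a simplified stand-in for the scheduler's per-node cache entry.
type nodeInfo struct {
	name string
	// pvcRefCounts maps a PVC key (assumed here to be "namespace/name")
	// to the number of pods on this node that reference it.
	pvcRefCounts map[string]int
}

// rwopConflict reports whether any node already runs a pod using the PVC.
func rwopConflict(nodes []nodeInfo, pvcKey string) bool {
	for _, n := range nodes {
		if n.pvcRefCounts[pvcKey] > 0 {
			return true
		}
	}
	return false
}

func main() {
	nodes := []nodeInfo{
		{name: "node-1", pvcRefCounts: map[string]int{"default/data": 1}},
		{name: "node-2", pvcRefCounts: map[string]int{}},
	}
	fmt.Println(rwopConflict(nodes, "default/data"))  // PVC already in use
	fmt.Println(rwopConflict(nodes, "default/other")) // PVC is free
}
```

When rwopConflict returns true for a ReadWriteOncePod PVC, the pod would be marked UnschedulableAndUnresolvable.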
Support for pod preemption is enforced in beta.
When a pod (A) with a ReadWriteOncePod PVC is scheduled, if another pod (B) is found using the same PVC and pod (A) has higher priority, the scheduler will return an "Unschedulable" status and attempt to preempt pod (B).
The implementation works as follows:
In the PreFilter phase of the volume restrictions scheduler plugin, we will build a cache of the ReadWriteOncePod PVCs for the pod-to-be-scheduled and the number of conflicting PVC references (pods already using any of these PVCs). This cache will be saved as part of the scheduler's cycleState and forwarded to the following step. During AddPod and RemovePod, if there is a conflict we will add to or subtract from the number of conflicting references. During the Filter phase, if the cache contains a non-zero number of conflicting references, return "Unschedulable". If the pod has a PVC that cannot be found, return "UnschedulableAndUnresolvable".
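The phases above can be sketched in Go as follows. All types and function names here (preFilterState, preFilter, filter, pod, status) are simplified, hypothetical stand-ins for the scheduler framework's interfaces, and the sketch assumes the missing-PVC check happens while the state is built.

```go
package main

import "fmt"

type status string

const (
	success                      status = "Success"
	unschedulable                status = "Unschedulable"
	unschedulableAndUnresolvable status = "UnschedulableAndUnresolvable"
)

type pod struct {
	name string
	pvcs []string
}

// preFilterState is a stand-in for the data saved in cycleState.
type preFilterState struct {
	rwopPVCs        map[string]bool // ReadWriteOncePod PVCs of the incoming pod
	conflictingRefs int             // existing pods' references to those PVCs
}

// conflicts counts how many of p's PVCs collide with the incoming pod's.
func (s *preFilterState) conflicts(p pod) int {
	n := 0
	for _, c := range p.pvcs {
		if s.rwopPVCs[c] {
			n++
		}
	}
	return n
}

func (s *preFilterState) addPod(p pod)    { s.conflictingRefs += s.conflicts(p) }
func (s *preFilterState) removePod(p pod) { s.conflictingRefs -= s.conflicts(p) }

// preFilter builds the state. pvcIsRWOP maps each known PVC to whether it
// uses ReadWriteOncePod; a missing key models a PVC that cannot be found.
func preFilter(incoming pod, pvcIsRWOP map[string]bool, existing []pod) (*preFilterState, status) {
	s := &preFilterState{rwopPVCs: map[string]bool{}}
	for _, name := range incoming.pvcs {
		isRWOP, found := pvcIsRWOP[name]
		if !found {
			return nil, unschedulableAndUnresolvable
		}
		if isRWOP {
			s.rwopPVCs[name] = true
		}
	}
	for _, p := range existing {
		s.addPod(p)
	}
	return s, success
}

// filter rejects the pod while conflicting references remain.
func filter(s *preFilterState) status {
	if s.conflictingRefs > 0 {
		return unschedulable
	}
	return success
}

func main() {
	pvcs := map[string]bool{"data": true}
	a := pod{name: "a", pvcs: []string{"data"}}
	b := pod{name: "b", pvcs: []string{"data"}}
	s, _ := preFilter(a, pvcs, []pod{b})
	fmt.Println(filter(s)) // pod b still holds the PVC
	s.removePod(b)         // simulate preempting pod b
	fmt.Println(filter(s))
}
```

Returning "Unschedulable" rather than "UnschedulableAndUnresolvable" is what lets the scheduler consider preemption: removing the conflicting pod brings the count back to zero.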
As an additional precaution this will also be enforced at the time a volume is mounted for filesystem devices, and at the time a volume is mapped for block devices. During the mount operation, kubelet will check the actual state of the world cache to determine if the volume is already in-use by another pod. If it is, kubelet will fail mounting with an appropriate error message.
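The kubelet-side check might look roughly like the sketch below. The actualStateOfWorld type and its mount method are simplified assumptions for illustration; kubelet's real cache and mount path are considerably more involved.

```go
package main

import "fmt"

// actualStateOfWorld is a simplified stand-in for kubelet's cache of
// volumes that are currently mounted on the node.
type actualStateOfWorld struct {
	// mountedBy maps a volume name to the set of pod UIDs that mounted it.
	mountedBy map[string]map[string]bool
}

// mount records a mount, refusing it when the volume uses ReadWriteOncePod
// and a different pod already has the volume mounted.
func (a *actualStateOfWorld) mount(volume, podUID string, rwop bool) error {
	if rwop {
		for uid := range a.mountedBy[volume] {
			if uid != podUID {
				return fmt.Errorf("volume %q uses ReadWriteOncePod and is already in use by pod %q", volume, uid)
			}
		}
	}
	if a.mountedBy[volume] == nil {
		a.mountedBy[volume] = map[string]bool{}
	}
	a.mountedBy[volume][podUID] = true
	return nil
}

func main() {
	asw := &actualStateOfWorld{mountedBy: map[string]map[string]bool{}}
	fmt.Println(asw.mount("pv-1", "pod-a", true))
	fmt.Println(asw.mount("pv-1", "pod-b", true))
}
```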
In the CSI spec we should add two new access modes that explicitly state the number of writers on a single node.
// Can only be published once as read/write at a single workload on
// a single node, at any given time.
SINGLE_NODE_SINGLE_WRITER = 6;
// Can be published as read/write at multiple workloads on a
// single node simultaneously.
SINGLE_NODE_MULTI_WRITER = 7;
These access modes are modeled after the existing MULTI_NODE_SINGLE_WRITER and MULTI_NODE_MULTI_WRITER access modes. This distinction is made because the SINGLE_NODE_WRITER volume capability has conflicting definitions (see the motivation section for context).
In order to preserve backwards compatibility, we must be careful about how to map between Kubernetes access modes and the new CSI access modes. The way we control this is by maintaining different mappings based on the CSI driver's capabilities.
Both the controller and node services should have capability bits that represent that they support the new access modes:
// Indicates the SP supports the SINGLE_NODE_SINGLE_WRITER and/or
// SINGLE_NODE_MULTI_WRITER access modes.
// These access modes are intended to replace the
// SINGLE_NODE_WRITER access mode to clarify the number of writers
// for a volume on a single node. Plugins MUST accept and allow
// use of the SINGLE_NODE_WRITER access mode when either
// SINGLE_NODE_SINGLE_WRITER and/or SINGLE_NODE_MULTI_WRITER are
// supported, in order to permit older COs to continue working.
SINGLE_NODE_MULTI_WRITER = 13;
Although it controls support for two access modes, SINGLE_NODE_MULTI_WRITER is chosen as the capability name because it represents the access mode that is unsupported.
For ReadWriteOncePod, if the CSI driver supports the SINGLE_NODE_MULTI_WRITER capability, then ReadWriteOncePod will map to SINGLE_NODE_SINGLE_WRITER. If it does not, then ReadWriteOncePod will map to SINGLE_NODE_WRITER. This mapping is chosen because we can safely rely on Kubernetes to enforce the access mode outside of the CSI driver. It also has the advantage of enabling existing CSI drivers to start using ReadWriteOncePod.
For ReadWriteOnce, if the CSI driver supports the SINGLE_NODE_MULTI_WRITER capability, then ReadWriteOnce will map to SINGLE_NODE_MULTI_WRITER. If it does not, then ReadWriteOnce will map to SINGLE_NODE_WRITER, which is the existing behavior.
Put more succinctly:
| | Driver Supports SINGLE_NODE_MULTI_WRITER Capability | Driver Does Not Support SINGLE_NODE_MULTI_WRITER Capability |
|---|---|---|
| ReadWriteOncePod | SINGLE_NODE_SINGLE_WRITER | SINGLE_NODE_WRITER |
| ReadWriteOnce | SINGLE_NODE_MULTI_WRITER | SINGLE_NODE_WRITER (Existing behavior) |
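The mapping in the table can be expressed as a small Go function. This is a sketch under stated assumptions: csiAccessMode and its boolean parameter are hypothetical names, access modes are modeled as plain strings rather than the CSI protobuf enum, and only the two access modes discussed in this section are handled.

```go
package main

import "fmt"

const (
	readWriteOnce    = "ReadWriteOnce"
	readWriteOncePod = "ReadWriteOncePod"
)

// csiAccessMode maps a Kubernetes access mode to a CSI access mode name,
// depending on whether the driver advertises the SINGLE_NODE_MULTI_WRITER
// capability.
func csiAccessMode(k8sMode string, driverSupportsSNMW bool) (string, error) {
	switch k8sMode {
	case readWriteOncePod:
		if driverSupportsSNMW {
			return "SINGLE_NODE_SINGLE_WRITER", nil
		}
		return "SINGLE_NODE_WRITER", nil
	case readWriteOnce:
		if driverSupportsSNMW {
			return "SINGLE_NODE_MULTI_WRITER", nil
		}
		return "SINGLE_NODE_WRITER", nil // existing behavior
	default:
		return "", fmt.Errorf("access mode %q is not covered by this sketch", k8sMode)
	}
}

func main() {
	for _, mode := range []string{readWriteOncePod, readWriteOnce} {
		for _, supported := range []bool{true, false} {
			m, _ := csiAccessMode(mode, supported)
			fmt.Printf("%s (capability=%t) -> %s\n", mode, supported, m)
		}
	}
}
```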
CSI clients that will need updating are kubelet, external-provisioner, external-attacher, and external-resizer.
[X] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.
None. New tests will be added for the transition to beta to support scheduler changes.
In alpha, the following unit tests were updated. See kubernetes/kubernetes#102028 and kubernetes/kubernetes#103082 for more context.
- k8s.io/kubernetes/pkg/apis/core/helper: 09-22-2022 - 26.2
- k8s.io/kubernetes/pkg/apis/core/v1/helper: 09-22-2022 - 56.9
- k8s.io/kubernetes/pkg/apis/core/validation: 09-22-2022 - 82.3
- k8s.io/kubernetes/pkg/controller/volume/persistentvolume: 09-22-2022 - 79.4
- k8s.io/kubernetes/pkg/kubelet/volumemanager/cache: 09-22-2022 - 66.3
- k8s.io/kubernetes/pkg/volume/csi/csi_client.go: 09-22-2022 - 76.2
- k8s.io/kubernetes/pkg/scheduler/apis/config/v1beta2: 09-22-2022 - 76.8
- k8s.io/kubernetes/pkg/scheduler/framework/plugins/volumerestrictions: 09-22-2022 - 85
- k8s.io/kubernetes/pkg/scheduler/framework: 09-22-2022 - 77.1
In beta, there will be additional unit test coverage for k8s.io/kubernetes/pkg/scheduler/framework/plugins/volumerestrictions to cover preemption logic.
Integration tests for scheduler plugin behavior are available here:
For alpha, to test this feature end to end, we will need to check the following cases:
- A ReadWriteOncePod volume will succeed mounting when consumed by a single pod on a node
- A ReadWriteOncePod volume will fail to mount when consumed by a second pod on the same node
- A ReadWriteOncePod volume will fail to attach when consumed by a second pod on a different node
For testing the mapping for ReadWriteOnce, we will update the CSI hostpath driver to support the new volume capability access modes and cut a release. The existing Kubernetes end to end tests will be updated to use this version which will test the K8s to CSI access mode mapping behavior because most storage end to end tests rely on the ReadWriteOnce access mode, which now maps to the SINGLE_NODE_MULTI_WRITER CSI access mode.
For beta, we will want to cover the additional cases for preemption:
- A high-priority pod requesting a ReadWriteOncePod volume that's already in-use will result in the preemption of the pod previously using the volume
- A low-priority (or no priority) pod requesting a ReadWriteOncePod volume that's already in-use will result in it being UnschedulableAndUnresolvable
E2E tests for alpha and beta behavior can be found here:
To test the validation logic of the PersistentVolumeSpec, we need to check the following cases:
- Validation succeeds when feature gate is enabled and PersistentVolume is created with ReadWriteOncePod access mode
- Validation fails when feature gate is disabled and PersistentVolume is created with ReadWriteOncePod access mode
- Validation succeeds when feature gate is enabled and PersistentVolumeClaim is created with ReadWriteOncePod access mode
- Validation fails when feature gate is disabled and PersistentVolumeClaim is created with ReadWriteOncePod access mode
To test mount behavior, we need to check the following cases:
- Mounting a volume with ReadWriteOncePod succeeds if the volume isn't already mounted
- Mounting a volume with ReadWriteOncePod fails if the volume is already mounted
Existing unit tests should cover this scenario.
This test involves asserting the behavior in the above table. The volume capability access mode for ReadWriteOnce will depend on the capabilities of the CSI driver. A test asserting this behavior will be needed in both Kubernetes as well as in CSI sidecars.
- CSI spec supports SINGLE_NODE_*_WRITER access modes
- Kubernetes supports ReadWriteOncePod access mode, has unit test coverage, has updated CSI spec
- CSI sidecars support SINGLE_NODE_*_WRITER access modes and have unit test coverage
- Scheduler enforces ReadWriteOncePod access mode by marking pods as Unschedulable, preemption logic added
- ReadWriteOncePod access mode has end to end test coverage
- Hostpath CSI driver supports SINGLE_NODE_*_WRITER access modes, relevant end to end tests updated to use this driver
- Kubernetes API and CSI spec changes are stable
- CSI drivers support SINGLE_NODE_*_WRITER access modes
In order to upgrade a cluster to use this feature, the user will need to restart the kube-apiserver, kube-scheduler, and kubelet with the ReadWriteOncePod feature gate enabled. Additionally they will need to update their CSI drivers and sidecars to versions that depend on the new Kubernetes API and CSI spec.
When downgrading a cluster to disable this feature, the user will need to disable the ReadWriteOncePod feature gate in kube-apiserver, kube-scheduler, and kubelet. They may also roll back their CSI sidecars if they are encountering errors.
When disabling this feature gate, any existing volumes with the ReadWriteOncePod access mode will continue to exist, but can only be deleted. An alternative is to allow these volumes to be treated as ReadWriteOnce, however that would violate the intent of the user and so it is not recommended.
If a user downgrades their CSI drivers or sidecars, any existing volumes using ReadWriteOnce should continue working (switching from SINGLE_NODE_MULTI_WRITER to SINGLE_NODE_WRITER). This behavior is ultimately up to each CSI driver, but they should be designed with this backwards compatibility in mind.
When starting two pods with both using the same PVC with ReadWriteOncePod, one pod will successfully start, but the other will not be scheduled due to the ReadWriteOncePod access mode conflict.
When starting the same two pods but also setting pod.spec.nodeName to the same node, kubelet will not enforce the access mode and will proceed with starting both pods.
For older kubelets, ReadWriteOncePod will map to access mode UNKNOWN. How this access mode is used will vary across CSI drivers. By definition, the CSI spec says "If ANY of the specified volume capabilities are not supported by the SP, the call MUST return the appropriate gRPC error code", see the volume_capabilities field in CreateVolumeRequest. However, not all CSI drivers strictly adhere to this spec. For example, the EBS CSI driver will error when supplied an unsupported access mode. Other drivers like the mock CSI driver won't check the supplied access modes, meaning UNKNOWN is valid.
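The difference between a strict and a lenient driver can be illustrated with the following sketch. The function names and the accessMode type are hypothetical, and neither function reproduces the code of any real driver; they only model the two behaviors described above.

```go
package main

import "fmt"

type accessMode string

const (
	unknown          accessMode = "UNKNOWN"
	singleNodeWriter accessMode = "SINGLE_NODE_WRITER"
)

// strictCreateVolume models a driver that rejects any requested access
// mode it does not explicitly support, as the CSI spec requires.
func strictCreateVolume(requested []accessMode, supported map[accessMode]bool) error {
	for _, m := range requested {
		if !supported[m] {
			return fmt.Errorf("unsupported access mode %s", m)
		}
	}
	return nil
}

// lenientCreateVolume models a driver that never inspects the requested
// access modes, so UNKNOWN is silently accepted.
func lenientCreateVolume(requested []accessMode) error {
	return nil
}

func main() {
	supported := map[accessMode]bool{singleNodeWriter: true}
	req := []accessMode{unknown}
	fmt.Println(strictCreateVolume(req, supported))
	fmt.Println(lenientCreateVolume(req))
}
```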
When creating a pod using ReadWriteOncePod, the scheduler will not enforce this access mode during scheduling. It will be possible for two pods using the same PVC with this access mode to be assigned the same node.
Same as the above case, with an older kubelet ReadWriteOncePod will map to access mode UNKNOWN. How this access mode is used will vary across CSI drivers.
Both the CSI attacher and the CSI resizer will error if they do not understand ReadWriteOncePod and this access mode is used on a PV.
The CSI provisioner will map ReadWriteOncePod to a nil access mode. How this access mode is used will vary across CSI drivers.
If the CSI driver running the controller service understands the new access modes, then volumes will be provisioned and attached using these access modes (if ReadWriteOncePod or ReadWriteOnce are used). If the CSI driver running the node service does not understand these access modes, the behavior will depend on the CSI driver and how it treats unknown access modes. The recommendation is to upgrade the CSI drivers for the controller and node services together.
In this scenario, the kube-scheduler will not enforce the ReadWriteOncePod access mode and proceed to schedule pods sharing the same ReadWriteOncePod PVCs.
If you have two pods sharing the same ReadWriteOncePod PVC and they land on separate nodes, the volume will only be able to attach to a single node. The other pod will be stuck because the volume is already attached elsewhere.
However, if both pods land on the same node, kubelet will not enforce the access mode and allow both pods to mount the same ReadWriteOncePod volume.
In this scenario, the kube-scheduler will not enforce the ReadWriteOncePod access mode and proceed to schedule pods sharing the same ReadWriteOncePod PVCs.
If you have two pods sharing the same ReadWriteOncePod PVC and they land on separate nodes, the volume will only be able to attach to a single node. The other pod will be stuck because the volume is already attached elsewhere.
If both pods land on the same node, kubelet will enforce the access mode and only allow one pod to mount the volume.
In this scenario, the kube-scheduler will enforce the ReadWriteOncePod access mode and ensure only a single pod may use a ReadWriteOncePod PVC.
If you have two pods sharing the same ReadWriteOncePod PVC and they both have spec.nodeName set, then scheduling will be bypassed. See the above scenario on the expected behavior.
- Feature gate (also fill in values in kep.yaml)
- Feature gate name: ReadWriteOncePod
- Components depending on the feature gate:
- kube-apiserver
- kube-scheduler
- kubelet
No.
When the feature gate is disabled, existing ReadWriteOncePod volumes will continue working. The only allowed operation will be the deletion of ReadWriteOncePod volumes.
Any existing ReadWriteOncePod and ReadWriteOnce volumes will continue working. Upon re-enabling of the feature gate, users can begin creating ReadWriteOncePod volumes again.
There will be unit test coverage for API validation and mount behavior with the feature gate enabled and disabled. There will also be end to end test coverage for mount behavior (if the feature gate is enabled).
Rolling out this feature involves enabling the ReadWriteOncePod feature gate across kube-apiserver, kube-scheduler, kubelet, and updating CSI driver and sidecar versions. The order in which these are performed does not matter.
The only way this rollout can fail is if a user does not update all components, in which case the feature will not work. See the above section on version skews for behavior in this scenario.
Rolling out this feature does not impact any running workloads.
If pods using ReadWriteOncePod PVCs fail to schedule, you may see an increase in scheduler_unschedulable_pods{plugin="VolumeRestrictions"}.
For enforcement in kubelet, if there are issues you may see changes in metrics for "volume_mount" operations. For example, an increase in storage_operation_duration_seconds_bucket{operation_name="volume_mount"} for larger buckets may indicate issues with mount.
Manual tests were performed to test the whole end to end flow.
Starting with the upgrade path:
- Unsuccessfully create workloads using ReadWriteOncePod PVCs prior to upgrade
- Perform the upgrade in two stages:
- First, update CSI sidecars
- Second, enable feature flag
- Successfully create workloads (1) and (2) using ReadWriteOncePod PVCs (1) and (2).
- Unsuccessfully create workload (3) using ReadWriteOncePod PVC (2) (already in-use)
- Observe the workloads and PVCs are healthy
For the downgrade path:
- Perform the downgrade in two stages:
- First, disable feature flag
- Second, downgrade CSI sidecars
- Observe the workloads and PVCs are still healthy
- Successfully delete workload (1) and ReadWriteOncePod PVC (1)
And re-upgrading the feature again:
- Perform the upgrade in two stages:
- First, update CSI sidecars
- Second, enable feature flag
- Successfully create workload (1) using ReadWriteOncePod PVC (1)
- Unsuccessfully create workload (3) using ReadWriteOncePod PVC (2) (already in-use)
- Observe the workloads and PVCs are still healthy
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
No.
An operator can query for PersistentVolumeClaims and PersistentVolumes in the cluster with the ReadWriteOncePod access mode. If any exist then the feature is in use.
- Other
- Details:
- Create two Pods using the same PersistentVolumeClaim with the ReadWriteOncePod access mode
- (If cluster access available) A PersistentVolume should be created with .status.phase=Bound
- A PersistentVolumeClaim should be created with .status.phase=Bound and have ExternalProvisioning, Provisioning, and ProvisioningSucceeded events
- (If cluster access available) A VolumeAttachment should be created with .status.attached=True
- One Pod should have a SuccessfulAttachVolume event and its Ready status condition set to True
- The other Pod should have a PodScheduled status condition set to False with reason "Unschedulable" and FailedScheduling events
- The successful Pod should be able to access the volume at the provided mount path
Defining an SLO for this metric is difficult because a pod may be "Unschedulable" due to user error; they mistakenly scheduled a workload that has volume conflicts. Additionally, this metric captures volume conflicts for some legacy in-tree drivers that do not support the ReadWriteOncePod feature but are part of the same scheduler plugin.
Any unexpected increases in scheduler_unschedulable_pods{plugin="VolumeRestrictions"} should be investigated by checking the status of pods failing scheduling.
If there are failures during attach, detach, mount, or unmount, you may see an increase in the storage_operation_duration_seconds metric exported by kubelet.
You may also see an increase in the csi_sidecar_operations_seconds_bucket metric exported by CSI sidecars if there are issues performing CSI operations.
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
- Metrics
- Metric name: scheduler_unschedulable_pods{plugin="VolumeRestrictions"}
- [Optional] Aggregation method:
- Components exposing the metric:
- kube-scheduler
Are there any missing metrics that would be useful to have to improve observability of this feature?
No.
This feature depends on the cluster having CSI drivers and sidecars that use CSI spec v1.5.0 at minimum.
- [CSI drivers and sidecars]
- Usage description:
- Impact of its outage on the feature: Inability to perform CSI storage operations on ReadWriteOncePod PVCs and PVs (for example, provisioning volumes)
- Impact of its degraded performance or high-error rates on the feature: Increase in latency performing CSI storage operations (due to repeated retries)
No.
No, it will introduce a new "ReadWriteOncePod" value for the PersistentVolumeAccessMode type, added to the internal and v1 APIs.
No.
No.
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
No.
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
No, the solution will involve using the same ActualStateOfWorld cache in kubelet.
Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
No.
Existing ReadWriteOncePod volumes will continue working, however users will not be able to make any changes to them.
None.
- Delete any unhealthy pods / PVCs using ReadWriteOncePod
- Disable the feature gate (target the API server first to prevent creation of new PVCs)
- Downgrade CSI sidecars and drivers if you're seeing elevated errors there
- 3/10/2021: Implementation started
When it comes to handling ReadWriteOnce, an alternative that was considered was not introducing a SINGLE_NODE_MULTI_WRITER access mode in the CSI spec and continuing to use SINGLE_NODE_WRITER. This solution was ruled out because the SINGLE_NODE_WRITER access mode has conflicting definitions, and since we're introducing a SINGLE_NODE_SINGLE_WRITER access mode we should also address this issue to reduce confusion for developers.
None.