KEP-2485: ReadWriteOncePod PersistentVolume AccessMode

Release Signoff Checklist
Summary
Glossary
Motivation
Proposal
Design Details
Production Readiness Review Questionnaire
Implementation History
Drawbacks
Alternatives
Infrastructure Needed (Optional)

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

Summary

This KEP introduces a new ReadWriteOncePod access mode for PersistentVolumes that restricts access to a single pod on a single node. This access mode differs from the existing ReadWriteOnce (RWO) access mode, which restricts access to a single node, but allows simultaneous access from many pods on that node.

Additionally, this KEP outlines required changes to the CSI spec, drivers, and sidecars in order to support this new access mode while maintaining backwards compatibility.

Glossary

Node
- A virtual or physical machine in a Kubernetes cluster that runs pods
PersistentVolume
- A piece of storage in the cluster that has been provisioned by an administrator or dynamically provisioned using StorageClasses
Access mode
- A description of how a PersistentVolume can be accessed
ReadWriteOnce (RWO)
- An access mode that restricts PersistentVolume access to a single node
ReadWriteOncePod (RWOP)
- A new access mode that restricts PersistentVolume access to a single pod on a single node
CSI
- The Container Storage Interface, a specification for storage provider plugins to integrate with cluster orchestrators (like Kubernetes)

Motivation

Kubernetes Changes

Kubernetes does not have an access mode for PersistentVolumes that allows users to restrict access to a single pod on a single node. This can cause problems for certain workloads. For example, if you had a workload (using ReadWriteOnce) performing an update of a storage device and the workload scaled to more than one Pod, you could encounter issues if the second pod landed on the same node and started simultaneously modifying the device.

For sensitive workloads, users have to work around the lack of a single-workload access mode in other ways (for example, scheduling only a single pod on a node and using ReadWriteOnce), which can lead to inefficient use of resources in their cluster.

See #30085 and #26567 for issues related to this.

CSI Specification Changes

In the CSI spec there are conflicting definitions of the SINGLE_NODE_WRITER access mode. By definition, SINGLE_NODE_WRITER means "Can only be published once as read/write on a single node, at any given time." The problem is how this access mode is used during NodePublishVolume, which is typically where volume mounting is performed.

The CSI spec defines that when NodePublishVolume is called a second time for a volume with a non-MULTI_NODE access mode and with a different target path, the plugin should return FAILED_PRECONDITION. For CSI plugins that strictly adhere to the spec, this guarantees that a volume can only be mounted to a single target path, which means SINGLE_NODE_WRITER restricts access to a single pod on a single node. This behavior conflicts with the original definition. Due to this conflict, we do not have an access mode that represents multiple writers on the same node.

Goals

Outline expected behavior of the ReadWriteOncePod access mode
Provide a high level design for ReadWriteOncePod access mode support
Define API changes needed to support this access mode
Outline changes needed in CSI spec and sidecars to support the ReadWriteOncePod access mode
Outline changes needed in CSI spec and sidecars to continue supporting the ReadWriteOnce access mode

Non-Goals

Proposal

User Stories (Optional)

See the version skew strategy section below for additional scenarios.

ReadWriteOncePod PVC Used Twice Fails for Second Consumer

This scenario asserts that a ReadWriteOncePod volume can only be bind mounted into a single pod on a single node.

User creates a PVC with ReadWriteOncePod access mode
User creates pod 1 using this PVC, scheduled on node 1
User creates pod 2 using this PVC, also scheduled on node 1
User observes pod 2 fails to start because the referenced PVC is in-use by another pod on the same node

Additionally, for attachment:

User creates pod 3 using this PVC, scheduled on node 2
User observes pod 3 fails to start because the referenced PVC is attached to another node

ReadWriteOnce PVC Continues to Succeed with New Kubernetes, Old CSI Driver

This scenario asserts the existing ReadWriteOnce behavior is preserved for old CSI drivers. The exact behavior may differ across CSI drivers since not all drivers conform to the CSI spec, but it should be consistent with how it behaved before.

User creates a PVC with ReadWriteOnce access mode
User creates pod 1 using this PVC, scheduled on node 1
User observes pod running

Notes/Constraints/Caveats (Optional)

Risks and Mitigations

Design Details

Kubernetes Changes, Access Mode

In Kubernetes, we should add a new ReadWriteOncePod persistent volume access mode to PersistentVolumes and PersistentVolumeClaims. This change will require adding a feature gate to the kube-apiserver, kube-scheduler, and kubelet. Validation logic will need updating to accept this access mode type if the feature gate is enabled.

       // can be mounted in read/write mode to exactly 1 pod
       ReadWriteOncePod PersistentVolumeAccessMode = "ReadWriteOncePod"
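
To illustrate the feature-gated validation described above, here is a minimal sketch. The helper name `validateAccessModes` and the `rwopEnabled` parameter are assumptions made for this example; they are not the actual kube-apiserver validation code.

```go
package main

import (
	"errors"
	"fmt"
)

// PersistentVolumeAccessMode mirrors the Kubernetes type of the same name.
type PersistentVolumeAccessMode string

const (
	ReadWriteOnce    PersistentVolumeAccessMode = "ReadWriteOnce"
	ReadOnlyMany     PersistentVolumeAccessMode = "ReadOnlyMany"
	ReadWriteMany    PersistentVolumeAccessMode = "ReadWriteMany"
	ReadWriteOncePod PersistentVolumeAccessMode = "ReadWriteOncePod"
)

// validateAccessModes is a hypothetical helper. rwopEnabled stands in for
// the state of the ReadWriteOncePod feature gate.
func validateAccessModes(modes []PersistentVolumeAccessMode, rwopEnabled bool) error {
	if len(modes) == 0 {
		return errors.New("at least one access mode is required")
	}
	for _, m := range modes {
		switch m {
		case ReadWriteOnce, ReadOnlyMany, ReadWriteMany:
			// Always accepted.
		case ReadWriteOncePod:
			if !rwopEnabled {
				return fmt.Errorf("unsupported access mode %q: feature gate is disabled", m)
			}
		default:
			return fmt.Errorf("unsupported access mode %q", m)
		}
	}
	return nil
}

func main() {
	modes := []PersistentVolumeAccessMode{ReadWriteOncePod}
	fmt.Println(validateAccessModes(modes, true) == nil)  // true
	fmt.Println(validateAccessModes(modes, false) == nil) // false
}
```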

This access mode will be enforced in two places:

Scheduler Enforcement

Alpha

First is at the time a pod is scheduled. When scheduling a pod, if another pod is found using the same PVC and the PVC uses ReadWriteOncePod, then scheduling will fail and the pod will be considered UnschedulableAndUnresolvable.

In order to determine if a pod using a ReadWriteOncePod PVC can be scheduled, we need to enumerate all pods and check if any are already consuming this PVC. This logic will take place as part of the PreFilter extension point in the volume restrictions plugin.

The node info cache will be extended to map the PVC name to a reference count for the PVC. In the PreFilter extension point, if the pod's PVC is using ReadWriteOncePod, we will query this map for each node checking for references to the scheduled pod's PVC. If one is found the pod will fail scheduling and be marked UnschedulableAndUnresolvable.
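
The reference-count check described above could look roughly like the following sketch. The `nodeInfo` type, the `"namespace/name"` key format, and the function names are simplifications assumed for illustration; the real scheduler cache and plugin interfaces are more involved.

```go
package main

import "fmt"

// Code loosely mirrors the scheduler framework's status codes.
type Code int

const (
	Success Code = iota
	UnschedulableAndUnresolvable
)

// nodeInfo is a simplified stand-in for the scheduler's per-node cache entry.
type nodeInfo struct {
	// pvcRefCounts maps a PVC key ("namespace/name", an assumption of this
	// sketch) to the number of pods on the node that use the PVC.
	pvcRefCounts map[string]int
}

func newNodeInfo() *nodeInfo {
	return &nodeInfo{pvcRefCounts: map[string]int{}}
}

// addPod records the PVCs of a pod that was placed on this node.
func (n *nodeInfo) addPod(pvcKeys []string) {
	for _, key := range pvcKeys {
		n.pvcRefCounts[key]++
	}
}

// removePod drops the references held by a pod that left this node.
func (n *nodeInfo) removePod(pvcKeys []string) {
	for _, key := range pvcKeys {
		if n.pvcRefCounts[key] <= 1 {
			delete(n.pvcRefCounts, key)
		} else {
			n.pvcRefCounts[key]--
		}
	}
}

// preFilter rejects the pod if any node already references one of its
// ReadWriteOncePod PVCs.
func preFilter(nodes []*nodeInfo, rwopPVCKeys []string) Code {
	for _, n := range nodes {
		for _, key := range rwopPVCKeys {
			if n.pvcRefCounts[key] > 0 {
				return UnschedulableAndUnresolvable
			}
		}
	}
	return Success
}

func main() {
	node1, node2 := newNodeInfo(), newNodeInfo()
	nodes := []*nodeInfo{node1, node2}
	claim := []string{"default/data"}

	node1.addPod(claim) // pod 1 already uses the PVC on node 1
	fmt.Println(preFilter(nodes, claim) == UnschedulableAndUnresolvable) // true

	node1.removePod(claim) // pod 1 is deleted
	fmt.Println(preFilter(nodes, claim) == Success) // true
}
```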

Beta

Support for pod preemption is added in beta.

When a pod (A) with a ReadWriteOncePod PVC is scheduled, if another pod (B) is found using the same PVC and pod (A) has higher priority, the scheduler will return an "Unschedulable" status and attempt to preempt pod (B).

The implementation works as follows:

In the PreFilter phase of the volume restrictions scheduler plugin, we will build a cache of the ReadWriteOncePod PVCs for the pod-to-be-scheduled and the number of conflicting PVC references (pods already using any of these PVCs). This cache will be saved as part of the scheduler's cycleState and forwarded to the following step.
During AddPod and RemovePod, if there is a conflict we will add to or subtract from the number of conflicting references.
During the Filter phase, if the cache contains a non-zero number of conflicting references, then return "Unschedulable". If the pod has a PVC that cannot be found, return "UnschedulableAndUnresolvable".
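
The three phases above can be sketched as follows. This is a simplified model, not the plugin's actual code: the `pod` and `preFilterState` types, the `pvcIsRWOP` lookup map, and the function names are assumptions for illustration, and the real plugin works against the scheduler framework's cycleState and listers.

```go
package main

import "fmt"

// Code loosely mirrors the scheduler framework's status codes.
type Code int

const (
	Success Code = iota
	Unschedulable
	UnschedulableAndUnresolvable
)

// pod is a minimal stand-in: a name plus the keys of the PVCs it uses.
type pod struct {
	name string
	pvcs []string
}

// preFilterState is the per-cycle cache built in PreFilter.
type preFilterState struct {
	rwopPVCs        map[string]bool // ReadWriteOncePod PVCs of the incoming pod
	conflictingRefs int             // references to those PVCs held by other pods
	missingPVC      bool            // the incoming pod names a PVC that was not found
}

// conflicts counts how many of p's PVCs are in the incoming pod's RWOP set.
func (s *preFilterState) conflicts(p pod) int {
	n := 0
	for _, key := range p.pvcs {
		if s.rwopPVCs[key] {
			n++
		}
	}
	return n
}

// preFilter builds the cache. pvcIsRWOP maps each known PVC key to whether
// it uses ReadWriteOncePod; a key that is absent means the PVC was not found.
func preFilter(incoming pod, pvcIsRWOP map[string]bool, running []pod) *preFilterState {
	s := &preFilterState{rwopPVCs: map[string]bool{}}
	for _, key := range incoming.pvcs {
		isRWOP, found := pvcIsRWOP[key]
		if !found {
			s.missingPVC = true
			continue
		}
		if isRWOP {
			s.rwopPVCs[key] = true
		}
	}
	for _, p := range running {
		s.conflictingRefs += s.conflicts(p)
	}
	return s
}

// addPod and removePod adjust the count while the scheduler simulates
// adding or evicting pods (for example during preemption).
func (s *preFilterState) addPod(p pod)    { s.conflictingRefs += s.conflicts(p) }
func (s *preFilterState) removePod(p pod) { s.conflictingRefs -= s.conflicts(p) }

// filter turns the cached state into a scheduling status.
func filter(s *preFilterState) Code {
	if s.missingPVC {
		return UnschedulableAndUnresolvable
	}
	if s.conflictingRefs != 0 {
		return Unschedulable
	}
	return Success
}

func main() {
	podA := pod{name: "a", pvcs: []string{"default/data"}}
	podB := pod{name: "b", pvcs: []string{"default/data"}}
	state := preFilter(podA, map[string]bool{"default/data": true}, []pod{podB})

	fmt.Println(filter(state) == Unschedulable) // true: pod B holds the PVC
	state.removePod(podB)                       // simulate preempting pod B
	fmt.Println(filter(state) == Success)       // true: no conflicts remain
}
```

Returning "Unschedulable" rather than "UnschedulableAndUnresolvable" for a conflict is what allows preemption to be attempted for the incoming pod.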

Mount Enforcement

As an additional precaution this will also be enforced at the time a volume is mounted for filesystem devices, and at the time a volume is mapped for block devices. During the mount operation, kubelet will check the actual state of the world cache to determine if the volume is already in-use by another pod. If it is, kubelet will fail mounting with an appropriate error message.
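
A minimal sketch of this kubelet-side check follows. The `actualStateOfWorld` type here is a deliberately small stand-in (a map from volume name to the pods that have it mounted), and the method names are assumptions; kubelet's real cache tracks far more state.

```go
package main

import "fmt"

// actualStateOfWorld is a simplified stand-in for kubelet's cache:
// volume name -> set of pod UIDs that currently have the volume mounted.
type actualStateOfWorld struct {
	mountedPods map[string]map[string]bool
}

func newActualStateOfWorld() *actualStateOfWorld {
	return &actualStateOfWorld{mountedPods: map[string]map[string]bool{}}
}

// checkRWOP fails if a different pod already has the volume mounted.
func (a *actualStateOfWorld) checkRWOP(volume, podUID string) error {
	for other := range a.mountedPods[volume] {
		if other != podUID {
			return fmt.Errorf("volume %q uses the ReadWriteOncePod access mode and is already in use by another pod", volume)
		}
	}
	return nil
}

// mount records the mount, enforcing the single-pod rule for RWOP volumes.
func (a *actualStateOfWorld) mount(volume, podUID string, rwop bool) error {
	if rwop {
		if err := a.checkRWOP(volume, podUID); err != nil {
			return err
		}
	}
	if a.mountedPods[volume] == nil {
		a.mountedPods[volume] = map[string]bool{}
	}
	a.mountedPods[volume][podUID] = true
	return nil
}

func main() {
	asw := newActualStateOfWorld()
	fmt.Println(asw.mount("pv-1", "pod-a", true)) // first pod mounts successfully
	fmt.Println(asw.mount("pv-1", "pod-b", true)) // second pod gets an error
}
```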

CSI Specification Changes, Volume Capabilities

In the CSI spec we should add two new access modes that explicitly state the number of writers on a single node.

      // Can only be published once as read/write at a single workload on
      // a single node, at any given time.
      SINGLE_NODE_SINGLE_WRITER = 6;

      // Can be published as read/write at multiple workloads on a
      // single node simultaneously.
      SINGLE_NODE_MULTI_WRITER = 7;

These access modes are modeled after the existing MULTI_NODE_SINGLE_WRITER and MULTI_NODE_MULTI_WRITER access modes. The reason for making this distinction is because the SINGLE_NODE_WRITER volume capability has conflicting definitions (see the motivation section for context).

In order to preserve backwards compatibility, we must be careful about how to map between Kubernetes access modes and the new CSI access modes. The way we control this is by maintaining different mappings based on the CSI driver's capabilities.

Both the controller and node services should have capability bits that represent that they support the new access modes:

      // Indicates the SP supports the SINGLE_NODE_SINGLE_WRITER and/or
      // SINGLE_NODE_MULTI_WRITER access modes.
      // These access modes are intended to replace the
      // SINGLE_NODE_WRITER access mode to clarify the number of writers
      // for a volume on a single node. Plugins MUST accept and allow
      // use of the SINGLE_NODE_WRITER access mode when either
      // SINGLE_NODE_SINGLE_WRITER and/or SINGLE_NODE_MULTI_WRITER are
      // supported, in order to permit older COs to continue working.
      SINGLE_NODE_MULTI_WRITER = 13;

Although it controls support for two access modes, SINGLE_NODE_MULTI_WRITER is chosen as the capability name because it represents the access mode that is unsupported.

For ReadWriteOncePod, if the CSI driver supports the SINGLE_NODE_MULTI_WRITER capability, then ReadWriteOncePod will map to SINGLE_NODE_SINGLE_WRITER. If it does not, then ReadWriteOncePod will map to SINGLE_NODE_WRITER. This mapping is chosen because we can safely rely on Kubernetes to enforce the access mode outside of the CSI driver. It also has the advantage of enabling existing CSI drivers to start using ReadWriteOncePod.

For ReadWriteOnce, if the CSI driver supports the SINGLE_NODE_MULTI_WRITER capability, then ReadWriteOnce will map to SINGLE_NODE_MULTI_WRITER. If it does not, then ReadWriteOnce will map to SINGLE_NODE_WRITER, which is the existing behavior.

Put more succinctly:

| Kubernetes Access Mode | Driver Supports `SINGLE_NODE_MULTI_WRITER` Capability | Driver Does Not Support `SINGLE_NODE_MULTI_WRITER` Capability |
| --- | --- | --- |
| ReadWriteOncePod | SINGLE_NODE_SINGLE_WRITER | SINGLE_NODE_WRITER |
| ReadWriteOnce | SINGLE_NODE_MULTI_WRITER | SINGLE_NODE_WRITER (Existing behavior) |
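
The mapping in the table can be expressed as a small function. This is a sketch under stated assumptions: the type names, `toCSIAccessMode`, and the boolean capability parameter are hypothetical, and only the two access modes covered by the table are handled.

```go
package main

import "fmt"

type k8sAccessMode string
type csiAccessMode string

const (
	ReadWriteOnce    k8sAccessMode = "ReadWriteOnce"
	ReadWriteOncePod k8sAccessMode = "ReadWriteOncePod"

	SingleNodeWriter       csiAccessMode = "SINGLE_NODE_WRITER"
	SingleNodeSingleWriter csiAccessMode = "SINGLE_NODE_SINGLE_WRITER"
	SingleNodeMultiWriter  csiAccessMode = "SINGLE_NODE_MULTI_WRITER"
)

// toCSIAccessMode is a hypothetical helper implementing the table above.
// supportsSNMW reports whether the driver advertises the
// SINGLE_NODE_MULTI_WRITER capability.
func toCSIAccessMode(mode k8sAccessMode, supportsSNMW bool) (csiAccessMode, error) {
	switch mode {
	case ReadWriteOncePod:
		if supportsSNMW {
			return SingleNodeSingleWriter, nil
		}
		return SingleNodeWriter, nil
	case ReadWriteOnce:
		if supportsSNMW {
			return SingleNodeMultiWriter, nil
		}
		return SingleNodeWriter, nil // existing behavior
	}
	return "", fmt.Errorf("access mode %q is outside the scope of this sketch", mode)
}

func main() {
	for _, mode := range []k8sAccessMode{ReadWriteOncePod, ReadWriteOnce} {
		for _, supported := range []bool{true, false} {
			csiMode, _ := toCSIAccessMode(mode, supported)
			fmt.Printf("%s (capability=%t) -> %s\n", mode, supported, csiMode)
		}
	}
}
```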

CSI clients that will need updating are kubelet, external-provisioner, external-attacher, and external-resizer.

Test Plan

[X] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

Prerequisite testing updates

None. New tests will be added for the transition to beta to support scheduler changes.

Unit tests

In alpha, the following unit tests were updated. See kubernetes/kubernetes#102028 and kubernetes/kubernetes#103082 for more context.

k8s.io/kubernetes/pkg/apis/core/helper: 09-22-2022 - 26.2
k8s.io/kubernetes/pkg/apis/core/v1/helper: 09-22-2022 - 56.9
k8s.io/kubernetes/pkg/apis/core/validation: 09-22-2022 - 82.3
k8s.io/kubernetes/pkg/controller/volume/persistentvolume: 09-22-2022 - 79.4
k8s.io/kubernetes/pkg/kubelet/volumemanager/cache: 09-22-2022 - 66.3
k8s.io/kubernetes/pkg/volume/csi/csi_client.go: 09-22-2022 - 76.2
k8s.io/kubernetes/pkg/scheduler/apis/config/v1beta2: 09-22-2022 - 76.8
k8s.io/kubernetes/pkg/scheduler/framework/plugins/volumerestrictions: 09-22-2022 - 85
k8s.io/kubernetes/pkg/scheduler/framework: 09-22-2022 - 77.1

In beta, there will be additional unit test coverage for k8s.io/kubernetes/pkg/scheduler/framework/plugins/volumerestrictions to cover preemption logic.

Integration tests

Integration tests for scheduler plugin behavior are available here:

test/integration/scheduler/filters/filters_test.go : testgrid

e2e tests

For alpha, to test this feature end to end, we will need to check the following cases:

A ReadWriteOncePod volume will succeed mounting when consumed by a single pod on a node
A ReadWriteOncePod volume will fail to mount when consumed by a second pod on the same node
A ReadWriteOncePod volume will fail to attach when consumed by a second pod on a different node

For testing the mapping for ReadWriteOnce, we will update the CSI hostpath driver to support the new volume capability access modes and cut a release. The existing Kubernetes end to end tests will be updated to use this version. This tests the K8s to CSI access mode mapping behavior, because most storage end to end tests rely on the ReadWriteOnce access mode, which now maps to the SINGLE_NODE_MULTI_WRITER CSI access mode.

For beta, we will want to cover the additional cases for preemption:

A high-priority pod requesting a ReadWriteOncePod volume that's already in-use will result in the preemption of the pod previously using the volume
A low-priority (or no priority) pod requesting a ReadWriteOncePod volume that's already in-use will result in it being UnschedulableAndUnresolvable

E2E tests for alpha and beta behavior can be found here:

test/e2e/storage/testsuites/readwriteoncepod.go : k8s-triage

Validation of PersistentVolumeSpec Object

To test the validation logic of the PersistentVolumeSpec, we need to check the following cases:

Validation succeeds when feature gate is enabled and PersistentVolume is created with ReadWriteOncePod access mode
Validation fails when feature gate is disabled and PersistentVolume is created with ReadWriteOncePod access mode
Validation succeeds when feature gate is enabled and PersistentVolumeClaim is created with ReadWriteOncePod access mode
Validation fails when feature gate is disabled and PersistentVolumeClaim is created with ReadWriteOncePod access mode

Mounting and Mapping with ReadWriteOncePod

To test mount behavior, we need to check the following cases:

Mounting a volume with ReadWriteOncePod succeeds if the volume isn't already mounted
Mounting a volume with ReadWriteOncePod fails if the volume is already mounted

Mounting and Mapping with ReadWriteOnce

Existing unit tests should cover this scenario.

Mapping Kubernetes Access Modes to CSI Volume Capability Access Modes

This test involves asserting the behavior in the above table. The volume capability access mode for ReadWriteOnce will depend on the capabilities of the CSI driver. A test asserting this behavior will be needed in both Kubernetes as well as in CSI sidecars.

End to End Tests

Graduation Criteria

Alpha

CSI spec supports SINGLE_NODE_*_WRITER access modes
Kubernetes supports ReadWriteOncePod access mode, has unit test coverage, has updated CSI spec
CSI sidecars support SINGLE_NODE_*_WRITER access modes and have unit test coverage

Beta

Scheduler enforces ReadWriteOncePod access mode by marking pods as Unschedulable, preemption logic added
ReadWriteOncePod access mode has end to end test coverage
Hostpath CSI driver supports SINGLE_NODE_*_WRITER access modes, relevant end to end tests updated to use this driver

GA

Kubernetes API and CSI spec changes are stable
CSI drivers support SINGLE_NODE_*_WRITER access modes

Upgrade / Downgrade Strategy

In order to upgrade a cluster to use this feature, the user will need to restart the kube-apiserver, kube-scheduler, and kubelet with the ReadWriteOncePod feature gate enabled. Additionally they will need to update their CSI drivers and sidecars to versions that depend on the new Kubernetes API and CSI spec.

When downgrading a cluster to disable this feature, the user will need to disable the ReadWriteOncePod feature gate in kube-apiserver, kube-scheduler, and kubelet. They may also roll back their CSI sidecars if they are encountering errors.

When disabling this feature gate, any existing volumes with the ReadWriteOncePod access mode will continue to exist, but can only be deleted. An alternative is to allow these volumes to be treated as ReadWriteOnce, however that would violate the intent of the user and so it is not recommended.

If a user downgrades their CSI drivers or sidecars, any existing volumes using ReadWriteOnce should continue working (switching from SINGLE_NODE_MULTI_WRITER to SINGLE_NODE_WRITER). This behavior is ultimately up to each CSI driver, but they should be designed with this backwards compatibility in mind.

Version Skew Strategy

API Server Version N / Scheduler Version N / Kubelet Version N-1 or N-2

When starting two pods with both using the same PVC with ReadWriteOncePod, one pod will successfully start, but the other will not be scheduled due to the ReadWriteOncePod access mode conflict.

When starting the same two pods but also setting pod.spec.nodeName to the same node, kubelet will not enforce the access mode and will proceed with starting both pods.

For older kubelets, ReadWriteOncePod will map to access mode UNKNOWN. How this access mode is used will vary across CSI drivers. By definition, the CSI spec says "If ANY of the specified volume capabilities are not supported by the SP, the call MUST return the appropriate gRPC error code", see the volume_capabilities field in CreateVolumeRequest. However, not all CSI drivers strictly adhere to this spec. For example, the EBS CSI driver will error when supplied an unsupported access mode. Other drivers like the mock CSI driver won't check the supplied access modes, meaning UNKNOWN is valid.
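
To make the UNKNOWN case concrete, the sketch below models a mapping function written before ReadWriteOncePod existed, plus two hypothetical drivers: one that validates the access mode and one that ignores it. All names here are assumptions for illustration; this is not code from kubelet or from any real CSI driver.

```go
package main

import (
	"errors"
	"fmt"
)

// legacyToCSIAccessMode models a mapping that predates ReadWriteOncePod:
// any access mode it does not recognize falls through to UNKNOWN.
func legacyToCSIAccessMode(mode string) string {
	switch mode {
	case "ReadWriteOnce":
		return "SINGLE_NODE_WRITER"
	case "ReadOnlyMany":
		return "MULTI_NODE_READER_ONLY"
	case "ReadWriteMany":
		return "MULTI_NODE_MULTI_WRITER"
	}
	return "UNKNOWN"
}

// driver is a hypothetical CSI driver. validatesAccessMode distinguishes
// drivers that check the supplied access mode from drivers that do not.
type driver struct {
	validatesAccessMode bool
	supported           map[string]bool
}

// publish accepts or rejects a request based on the driver's behavior.
func (d driver) publish(csiMode string) error {
	if d.validatesAccessMode && !d.supported[csiMode] {
		return errors.New("unsupported access mode " + csiMode)
	}
	return nil
}

func main() {
	mode := legacyToCSIAccessMode("ReadWriteOncePod")
	fmt.Println(mode) // UNKNOWN

	strict := driver{validatesAccessMode: true, supported: map[string]bool{"SINGLE_NODE_WRITER": true}}
	lenient := driver{validatesAccessMode: false}

	fmt.Println(strict.publish(mode) != nil)  // true: the strict driver rejects UNKNOWN
	fmt.Println(lenient.publish(mode) == nil) // true: the lenient driver accepts it
}
```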

API Server Version N / Scheduler Version N-1 / Kubelet Version N-1 or N-2

When creating a pod using ReadWriteOncePod, the scheduler will not enforce this access mode during scheduling. It will be possible for two pods using the same PVC with this access mode to be assigned the same node.

Same as the above case, with an older kubelet ReadWriteOncePod will map to access mode UNKNOWN. How this access mode is used will vary across CSI drivers.

API Understands ReadWriteOncePod, CSI Sidecars Do Not

Both the CSI attacher and the CSI resizer will error if they do not understand ReadWriteOncePod and this access mode is used on a PV.

The CSI provisioner will map ReadWriteOncePod to a nil access mode. How this access mode is used will vary across CSI drivers.

CSI Controller Service Understands New CSI Access Modes, CSI Node Service Does Not

If the CSI driver running the controller service understands the new access modes, then volumes will be provisioned and attached using these access modes (if ReadWriteOncePod or ReadWriteOnce are used). If the CSI driver running the node service does not understand these access modes, the behavior will depend on the CSI driver and how it treats unknown access modes. The recommendation is to upgrade the CSI drivers for the controller and node services together.

API Server Has Feature Enabled, Scheduler and Kubelet Do Not

In this scenario, the kube-scheduler will not enforce the ReadWriteOncePod access mode and proceed to schedule pods sharing the same ReadWriteOncePod PVCs.

If you have two pods sharing the same ReadWriteOncePod PVC and they land on separate nodes, the volume will only be able to attach to a single node. The other pod will be stuck because the volume is already attached elsewhere.

However, if both pods land on the same node, kubelet will not enforce the access mode and allow both pods to mount the same ReadWriteOncePod volume.

API Server Has Feature Enabled, Scheduler Does not, Kubelet Does

In this scenario, the kube-scheduler will not enforce the ReadWriteOncePod access mode and proceed to schedule pods sharing the same ReadWriteOncePod PVCs.

If both pods land on the same node, kubelet will enforce the access mode and only allow one pod to mount the volume.

API Server Has Feature Enabled, Scheduler Does, Kubelet Does Not

In this scenario, the kube-scheduler will enforce the ReadWriteOncePod access mode and ensure only a single pod may use a ReadWriteOncePod PVC.

If you have two pods sharing the same ReadWriteOncePod PVC and they both have spec.nodeName set, then scheduling will be bypassed. See the scenario where neither the scheduler nor kubelet has the feature enabled for the expected behavior.
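
The three feature-gate skew scenarios above can be summarized as a small decision function for the second of two pods sharing a ReadWriteOncePod PVC. This is only a restatement of the prose under simplifying assumptions; the function name, parameters, and result strings are hypothetical.

```go
package main

import "fmt"

// secondPodOutcome describes what happens to the second of two pods that
// share a ReadWriteOncePod PVC, given which components enforce the mode.
// nodeNameSet means spec.nodeName is set, which bypasses the scheduler.
func secondPodOutcome(schedulerEnforces, kubeletEnforces, nodeNameSet, sameNode bool) string {
	if schedulerEnforces && !nodeNameSet {
		return "not scheduled"
	}
	if !sameNode {
		return "stuck: volume is attached to another node"
	}
	if kubeletEnforces {
		return "mount rejected by kubelet"
	}
	return "mounts the volume (access mode not enforced)"
}

func main() {
	// Scheduler and kubelet do not enforce; both pods land on the same node.
	fmt.Println(secondPodOutcome(false, false, false, true))
	// Scheduler does not enforce, kubelet does; same node.
	fmt.Println(secondPodOutcome(false, true, false, true))
	// Scheduler enforces, kubelet does not; spec.nodeName bypasses scheduling.
	fmt.Println(secondPodOutcome(true, false, true, true))
}
```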

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?

Feature gate (also fill in values in kep.yaml)
- Feature gate name: ReadWriteOncePod
- Components depending on the feature gate:
  - kube-apiserver
  - kube-scheduler
  - kubelet

Does enabling the feature change any default behavior?

No.

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

When the feature gate is disabled, existing ReadWriteOncePod volumes will continue working. The only allowed operation will be the deletion of ReadWriteOncePod volumes.

What happens if we reenable the feature if it was previously rolled back?

Any existing ReadWriteOncePod and ReadWriteOnce volumes will continue working. Upon re-enabling of the feature gate, users can begin creating ReadWriteOncePod volumes again.

Are there any tests for feature enablement/disablement?

There will be unit test coverage for API validation and mount behavior with the feature gate enabled and disabled. There will also be end to end test coverage for mount behavior (if the feature gate is enabled).

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?

Rolling out this feature involves enabling the ReadWriteOncePod feature gate across kube-apiserver, kube-scheduler, kubelet, and updating CSI driver and sidecar versions. The order in which these are performed does not matter.

The only way this rollout can fail is if a user does not update all components, in which case the feature will not work. See the above section on version skews for behavior in this scenario.

Rolling out this feature does not impact any running workloads.

What specific metrics should inform a rollback?

If pods using ReadWriteOncePod PVCs fail to schedule, you may see an increase in scheduler_unschedulable_pods{plugin="VolumeRestrictions"}.

For enforcement in kubelet, if there are issues you may see changes in metrics for "volume_mount" operations. For example, an increase in storage_operation_duration_seconds_bucket{operation_name="volume_mount"} for larger buckets may indicate issues with mount.

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

Manual tests were performed to test the whole end to end flow.

Starting with the upgrade path:

Unsuccessfully create workloads using ReadWriteOncePod PVCs prior to upgrade
Perform the upgrade in two stages:
- First, update CSI sidecars
- Second, enable feature flag
Successfully create workloads (1) and (2) using ReadWriteOncePod PVCs (1) and (2).
Unsuccessfully create workload (3) using ReadWriteOncePod PVC (2) (already in-use)
Observe the workloads and PVCs are healthy

For the downgrade path:

Perform the downgrade in two stages:
- First, disable feature flag
- Second, downgrade CSI sidecars
Observe the workloads and PVCs are still healthy
Successfully delete workload (1) and ReadWriteOncePod PVC (1)

For the re-upgrade path:

Perform the upgrade in two stages:
- First, update CSI sidecars
- Second, enable feature flag
Successfully create workload (1) using ReadWriteOncePod PVC (1)
Unsuccessfully create workload (3) using ReadWriteOncePod PVC (2) (already in-use)
Observe the workloads and PVCs are still healthy

Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

No.

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?

An operator can query for PersistentVolumeClaims and PersistentVolumes in the cluster with the ReadWriteOncePod access mode. If any exist then the feature is in use.
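
As a rough sketch of such a query, the function below filters a list of claims for the ReadWriteOncePod access mode. The `claim` struct is a simplified stand-in assumed for this example; in practice the list would come from the Kubernetes API (for example via a client library or kubectl output).

```go
package main

import "fmt"

// claim is a minimal stand-in for a PersistentVolumeClaim.
type claim struct {
	Namespace   string
	Name        string
	AccessModes []string
}

// claimsUsingRWOP returns "namespace/name" for every claim that requests
// the ReadWriteOncePod access mode.
func claimsUsingRWOP(claims []claim) []string {
	var inUse []string
	for _, c := range claims {
		for _, mode := range c.AccessModes {
			if mode == "ReadWriteOncePod" {
				inUse = append(inUse, c.Namespace+"/"+c.Name)
				break
			}
		}
	}
	return inUse
}

func main() {
	claims := []claim{
		{Namespace: "default", Name: "db", AccessModes: []string{"ReadWriteOncePod"}},
		{Namespace: "default", Name: "logs", AccessModes: []string{"ReadWriteOnce"}},
	}
	found := claimsUsingRWOP(claims)
	fmt.Println(found)          // [default/db]
	fmt.Println(len(found) > 0) // true: the feature is in use
}
```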

How can someone using this feature know that it is working for their instance?

Other
- Details:
  - Create two Pods using the same PersistentVolumeClaim with the ReadWriteOncePod access mode
  - (If cluster access available) A PersistentVolume should be created with .status.phase=Bound
  - A PersistentVolumeClaim should be created with .status.phase=Bound and have ExternalProvisioning, Provisioning, and ProvisioningSucceeded events
  - (If cluster access available) A VolumeAttachment should be created with .status.attached=True
  - One Pod should have a SuccessfulAttachVolume event and its Ready status condition set to True
  - The other Pod should have a PodScheduled status condition set to False with reason "Unschedulable" and FailedScheduling events
  - The successful Pod should be able to access the volume at the provided mount path

What are the reasonable SLOs (Service Level Objectives) for the enhancement?

Defining an SLO for the scheduler_unschedulable_pods{plugin="VolumeRestrictions"} metric is difficult because a pod may be "Unschedulable" due to user error, for example when a user mistakenly schedules a workload that has volume conflicts. Additionally, this metric captures volume conflicts for some legacy in-tree drivers that do not support the ReadWriteOncePod feature but are part of the same scheduler plugin.

Any unexpected increases in scheduler_unschedulable_pods{plugin="VolumeRestrictions"} should be investigated by checking the status of pods failing scheduling.

If there are failures during attach, detach, mount, or unmount, you may see an increase in the storage_operation_duration_seconds metric exported by kubelet.

You may also see an increase in the csi_sidecar_operations_seconds_bucket metric exported by CSI sidecars if there are issues performing CSI operations.

What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

Metrics
- Metric name: scheduler_unschedulable_pods{plugin="VolumeRestrictions"}
- [Optional] Aggregation method:
- Components exposing the metric:
  - kube-scheduler

Are there any missing metrics that would be useful to have to improve observability of this feature?

No.

Dependencies

Does this feature depend on any specific services running in the cluster?

This feature depends on the cluster having CSI drivers and sidecars that use CSI spec v1.5.0 at minimum.

[CSI drivers and sidecars]
- Usage description:
  - Impact of its outage on the feature: Inability to perform CSI storage operations on ReadWriteOncePod PVCs and PVs (for example, provisioning volumes)
  - Impact of its degraded performance or high-error rates on the feature: Increase in latency performing CSI storage operations (due to repeated retries)

Scalability

Will enabling / using this feature result in any new API calls?

No.

Will enabling / using this feature result in introducing new API types?

No, it will introduce a new "ReadWriteOncePod" value for the PersistentVolumeAccessMode type, added to the internal and v1 APIs.

Will enabling / using this feature result in any new calls to the cloud provider?

No.

Will enabling / using this feature result in increasing size or count of the existing API objects?

No.

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

No.

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?

No, the solution will involve using the same ActualStateOfWorld cache in kubelet.

Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

No.

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?

Existing ReadWriteOncePod volumes will continue working, however users will not be able to make any changes to them.

What are other known failure modes?

None.

What steps should be taken if SLOs are not being met to determine the problem?

Delete any unhealthy pods / PVCs using ReadWriteOncePod
Disable the feature gate (target the API server first to prevent creation of new PVCs)
Downgrade CSI sidecars and drivers if you're seeing elevated errors there

Implementation History

3/10/2021: Implementation started

Drawbacks

Alternatives

When it comes to handling ReadWriteOnce, an alternative that was considered was not introducing a SINGLE_NODE_MULTI_WRITER access mode in the CSI spec and continuing to use SINGLE_NODE_WRITER. This solution was ruled out because the SINGLE_NODE_WRITER access mode has conflicting definitions, and since we're introducing a SINGLE_NODE_SINGLE_WRITER access mode we should also address this issue to reduce confusion for developers.

Infrastructure Needed (Optional)

None.
