diff --git a/keps/sig-etcd/4578-feature-gate/README.md b/keps/sig-etcd/4578-feature-gate/README.md
new file mode 100644
index 000000000000..86df78c4b8d0
--- /dev/null
+++ b/keps/sig-etcd/4578-feature-gate/README.md
@@ -0,0 +1,383 @@
+# KEP-4578: Feature Gate in etcd
+
+
+- [Release Signoff Checklist](#release-signoff-checklist)
+- [Summary](#summary)
+- [Motivation](#motivation)
+  - [Goals](#goals)
+  - [Non-Goals](#non-goals)
+- [Proposal](#proposal)
+  - [User Stories (Optional)](#user-stories-optional)
+    - [Story 1](#story-1)
+    - [Story 2](#story-2)
+    - [Story 3](#story-3)
+    - [Story 4](#story-4)
+  - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
+    - [Is feature enablement state a server level or cluster level property?](#is-feature-enablement-state-a-server-level-or-cluster-level-property)
+    - [Should we use feature gate for bug fixes?](#should-we-use-feature-gate-for-bug-fixes)
+    - [Could the lifecycle of a feature change in patch versions?](#could-the-lifecycle-of-a-feature-change-in-patch-versions)
+  - [Risks and Mitigations](#risks-and-mitigations)
+- [Design Details](#design-details)
+  - [New APIs](#new-apis)
+  - [Feature Gate](#feature-gate)
+  - [Cluster Level Feature Enablement](#cluster-level-feature-enablement)
+  - [Feature Stages](#feature-stages)
+  - [Test Plan](#test-plan)
+    - [Prerequisite testing updates](#prerequisite-testing-updates)
+    - [Unit tests](#unit-tests)
+    - [Integration tests](#integration-tests)
+    - [e2e tests](#e2e-tests)
+  - [Graduation Criteria](#graduation-criteria)
+    - [Milestone 1](#milestone-1)
+    - [Milestone 2](#milestone-2)
+    - [Milestone 3](#milestone-3)
+  - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
+  - [Version Skew Strategy](#version-skew-strategy)
+- [Implementation History](#implementation-history)
+- [Drawbacks](#drawbacks)
+- [Alternatives](#alternatives)
+
+
+## Release Signoff Checklist
+
+
+
+Items marked with (R) are required *prior to targeting to a milestone / release*.
+
+- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
+- [ ] (R) KEP approvers have approved the KEP status as `implementable`
+- [ ] (R) Design details are appropriately documented
+- [ ] (R) Test plan is in place
+- [ ] (R) Graduation criteria is in place
+- [ ] "Implementation History" section is up-to-date for milestone
+- [ ] User-facing documentation has been created in [etcd-io/website], for publication to [etcd.io]
+- [ ] Supporting documentation, e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
+
+
+
+[etcd.io]: https://etcd.io/
+[kubernetes/enhancements]: https://git.k8s.io/enhancements
+[kubernetes/kubernetes]: https://git.k8s.io/kubernetes
+[etcd-io/website]: https://github.com/etcd-io/website
+
+## Summary
+
+We are introducing a new `--feature-gates` flag in etcd, and the underlying framework to gate future feature enhancements to etcd.
+
+## Motivation
+
+Currently, new enhancements to etcd are typically added as [an experimental feature](https://github.com/etcd-io/etcd/blob/main/Documentation/contributor-guide/features.md#adding-a-new-feature), with a configuration flag prefixed with "experimental", e.g. `--experimental-feature-name`.
+
+When it is time to [graduate an experimental feature to stable](https://github.com/etcd-io/etcd/blob/main/Documentation/contributor-guide/features.md#graduating-an-experimental-feature-to-stable), a new stable feature flag, identical to the experimental feature flag but without the `--experimental` prefix, is added to replace the old feature flag, which is a breaking change.
+
+We are proposing to add [feature gates](https://kubernetes.io/docs/reference/command-line-tools-reference/feature-gates/) in a similar way to Kubernetes, so that users can turn any feature on or off using the `--feature-gates` flag.
+
+Benefits of this enhancement include:
+
+* graduating a feature is no longer a breaking change that requires new flags.
+
+* a well-established feature lifecycle convention borrowed from Kubernetes.
+
+* a unified place to store all the feature enablement state in the code base, which is simpler and easy to refer to when guarding the implementation code.
+
+### Goals
+
+* Introduce the `--feature-gates` flag in the etcd code base.
+
+* Introduce APIs/methods to query the enablement state of a feature.
+
+* Establish lifecycle stages of a feature, and clear criteria for lifecycle progression.
+
+### Non-Goals
+
+* Migrate all existing `--experimental` features to feature gates.
+
+* Use feature gates as a mechanism to fix bugs.
+
+## Proposal
+
+### User Stories (Optional)
+
+#### Story 1
+
+It would be easier for developers to add new features without worrying too much about breaking etcd, because the feature would be Alpha first and disabled by default, and new changes would be mostly gated behind the feature gate.
+
+It would be a smoother process to move the feature to the next stage without the need to introduce new flags.
+
+The developer also would not need to worry about feature compatibility when there is a mixed version cluster.
+
+#### Story 2
+
+Users of an etcd cluster would be able to find out which features are enabled in the cluster and decide how to use them downstream.
+
+#### Story 3
+
+In an HA cluster, users should be able to enable/disable a feature without bringing down the whole cluster, by restarting the servers with the new feature gate flag one by one.
+
+#### Story 4
+
+During cluster upgrade/downgrade, feature changes across versions should have predictable behavior in this mixed version scenario.
+
+### Notes/Constraints/Caveats (Optional)
+
+#### Is feature enablement state a server level or cluster level property?
+
+There are features like `ExperimentalEnableLeaseCheckpoint` (enables leader to send regular checkpoints to other members to prevent reset of remaining TTL on leader change). If different server nodes have different enablement values for `ExperimentalEnableLeaseCheckpoint`, the results would be inconsistent and confusing, because they would depend on which node is the leader.
+
+There are also features like `ExperimentalEnableDistributedTracing` (enables distributed tracing using OpenTelemetry protocol), which could work fine at the local server level.
+
+We are proposing to have separate APIs to query if a feature is enabled for the local server, and for the whole cluster.
+
+#### Should we use feature gate for bug fixes?
+
+There are some use cases like `ProgressNotify`, for which some bugs are found and later fixed in a patch version. The client would need to know if the etcd version in use contains that fix to decide whether or not to use that feature.
+
+The question is: should a new feature gate be added to signal the bug fix?
+
+We think the answer is NO:
+* the new feature would need to be enabled by default to always apply the bug fix for new releases.
+* it changes the API, which is not desirable in patch version releases.
+
+The proper way of handling cases like `ProgressNotify` is:
+1. the feature should be gated by the feature gate from the beginning.
+1. the feature should be disabled by default until it is widely tested in practice.
+1. when the bug is found, the feature should ideally be at a lifecycle stage in which it is disabled by default. If not, the admin should disable it with the `--feature-gates` flag.
+1. when the client upgrades etcd to the patch version with the fix, the admin could enable it with the `--feature-gates` flag.
+
+#### Could the lifecycle of a feature change in patch versions?
+
+Kubernetes has a minor release every 3 months, while the cadence of etcd minor releases is much less frequent.
+The question is: do we have to wait for years before graduating a new feature?
+
+We think we should still stick to the common practice of not changing the lifecycle of a feature in patch versions, because:
+* changing the lifecycle of a feature is an API change. According to the [etcd Operations Guide](https://etcd.io/docs/v3.5/op-guide/versioning/), only new minor versions may add additional features to the API.
+* bugs in etcd could be hard to detect, and the reliability and robustness of etcd are more important than speed. A long history of testing a new feature through practical adoption is beneficial.
+
+With the feature gate in place, we could consider increasing the etcd release cadence because it would be easier to add new features and less risky to release them.
+
+### Risks and Mitigations
+
+
+
+## Design Details
+
+### New APIs
+
+A new `--feature-gates` flag would be added to start the etcd server, with a format like `--feature-gates=featureA=true,featureB=false`.
+
+New gRPC and HTTP endpoints would be added to query if a feature is enabled for the server or cluster.
+
+For gRPC, a new RPC would be added to the `Maintenance` service.
+
+```proto
+service Maintenance {
+  ...
+  rpc FeatureGateStatus(FeatureGateStatusRequest) returns (FeatureGateStatusResponse) {
+    option (google.api.http) = {
+      post: "/v3/maintenance/featuregate"
+      body: "*"
+    };
+  }
+  ...
+}
+message Feature {
+  string name = 1;
+  bool enabled = 2;
+}
+
+message FeatureGateStatusRequest {
+  // if true, query if the features are enabled for the single server.
+  // otherwise, query if the features are enabled for the cluster.
+  bool isServerFeature = 1;
+  repeated string features = 2; // return all the enabled features if empty.
+}
+
+message FeatureGateStatusResponse {
+  ResponseHeader header = 1;
+  bool isServerFeature = 2;
+  repeated Feature features = 3;
+}
+```
+
+The HTTP endpoints could look like:
+* `{server_endpoint}/featuregate?feature={featureName}` returns true if the feature is enabled for the cluster
+* `{server_endpoint}/featuregate?feature={featureName}&isServerFeature=true` returns true if the feature is enabled for the server
+
+### Feature Gate
+
+We will use the new `k8s.io/component-base/featuregate.VersionedSpecs` (introduced in [kep-4330](https://github.com/kubernetes/enhancements/tree/master/keps/sig-architecture/4330-compatibility-versions)) to register and track the features along with their lifecycle at different release versions.
+
+```go
+map[Feature]VersionedSpecs {
+	featureA: VersionedSpecs{
+		{Version: mustParseVersion("1.27"), Default: false, PreRelease: Beta},
+		{Version: mustParseVersion("1.28"), Default: true, PreRelease: GA},
+	},
+	featureB: VersionedSpecs{
+		{Version: mustParseVersion("1.28"), Default: false, PreRelease: Alpha},
+	},
+	featureC: VersionedSpecs{
+		{Version: mustParseVersion("1.28"), Default: false, PreRelease: Beta},
+	},
+	featureD: VersionedSpecs{
+		{Version: mustParseVersion("1.26"), Default: false, PreRelease: Alpha},
+		{Version: mustParseVersion("1.28"), Default: true, PreRelease: Deprecated},
+	},
+}
+```
+
+The feature gates for a server can only be set with the `--feature-gates` flag during startup. We do not support dynamically changing the feature gates when the server is running.
+
+The `ServerConfig` struct will have a new immutable field `FeatureGate k8s.io/component-base/featuregate.FeatureGate`, whose interface includes `Enabled(key Feature) bool`, and it would be piped through to wherever the feature gate is needed.
+
+```go
+type ServerConfig struct {
+	...
+	FeatureGate featuregate.FeatureGate
+	...
+}
+```
+
+(A global singleton would be convenient, but it would be hard to track where it is mutated by different component flags (etcd server or proxy server), or in different tests.)
+
+### Cluster Level Feature Enablement
+
+To determine if a feature is enabled in the whole cluster, each member would need to know if the feature is enabled for all cluster members. We plan to store the information in the [`member.Attributes`](https://github.com/etcd-io/etcd/blob/e37a67e40b3f5ff8ef81f9de6e7f475f17fda32b/server/etcdserver/api/membership/member.go#L38), and it would be saved in the `members` bucket in the backend through raft.
+
+```proto
+message Attributes {
+  option (versionpb.etcd_version_msg) = "3.5";
+
+  string name = 1;
+  repeated string client_urls = 2;
+  repeated string enabled_features = 3;
+}
+```
+When an etcd server starts, the attributes would be [published through raft](https://github.com/etcd-io/etcd/blob/e37a67e40b3f5ff8ef81f9de6e7f475f17fda32b/server/etcdserver/server.go#L1745) and stored in all members. Whenever a new member joins or an existing member restarts, its feature gate attributes would be automatically updated in the startup process.
+
+We choose to store the `enabled_features` through raft instead of dynamically querying the `/featuregate` endpoint because
+1. it saves network bandwidth.
+1. we want the cluster feature enablement status to be as stable as possible, only subject to change when the admin intends it, as in the case of adding/removing/upgrading/downgrading/restarting servers. In the case of unpredictable loss of a member server, the feature gate of the cluster should not change.
+
+In `membership.RaftCluster`, we will add a new aggregate method `FeatureEnabled(key Feature) bool` to query cluster level enablement of a feature.
+
+```go
+// A feature is enabled for the whole cluster only if it is enabled for all the members.
+// In case of a conflict, the feature would be disabled, and would not bring down the whole cluster because conflict is expected during upgrade/downgrade.
+func (*RaftCluster) FeatureEnabled(key Feature) bool {}
+```
+
+### Feature Stages
+
+Following the convention of Kubernetes, a feature can go through a lifecycle of Alpha -> Beta -> GA -> Deprecated.
+
+| Feature Stage | Properties | Graduation Criteria |
+| --- | --- | --- |
+| Alpha |