diff --git a/keps/sig-etcd/4578-server-feature-gate/README.md b/keps/sig-etcd/4578-server-feature-gate/README.md new file mode 100644 index 000000000000..c64db7bec784 --- /dev/null +++ b/keps/sig-etcd/4578-server-feature-gate/README.md @@ -0,0 +1,326 @@ +# KEP-4578: Server Feature Gate in etcd + + +- [Release Signoff Checklist](#release-signoff-checklist) +- [Summary](#summary) +- [Motivation](#motivation) + - [Goals](#goals) + - [Non-Goals](#non-goals) +- [Proposal](#proposal) + - [User Stories (Optional)](#user-stories-optional) + - [Story 1](#story-1) + - [Story 2](#story-2) + - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional) + - [Is feature enablement state a server level or cluster level property?](#is-feature-enablement-state-a-server-level-or-cluster-level-property) + - [Should we use feature gate for bug fixes?](#should-we-use-feature-gate-for-bug-fixes) + - [Could the lifecycle of a feature change in patch versions?](#could-the-lifecycle-of-a-feature-change-in-patch-versions) + - [Risks and Mitigations](#risks-and-mitigations) +- [Design Details](#design-details) + - [Feature Gate](#feature-gate) + - [Ways to Query Feature Gate State](#ways-to-query-feature-gate-state) + - [Feature Stages](#feature-stages) + - [Path to Migrate Existing Experimental Features](#path-to-migrate-existing-experimental-features) + - [Test Plan](#test-plan) + - [Prerequisite testing updates](#prerequisite-testing-updates) + - [Unit tests](#unit-tests) + - [Integration tests](#integration-tests) + - [e2e tests](#e2e-tests) + - [Graduation Criteria](#graduation-criteria) + - [Milestone 1](#milestone-1) + - [Milestone 2](#milestone-2) + - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy) + - [Version Skew Strategy](#version-skew-strategy) +- [Implementation History](#implementation-history) +- [Drawbacks](#drawbacks) +- [Alternatives](#alternatives) + + +## Release Signoff Checklist + + + +Items marked with (R) are required *prior to targeting to a milestone / release*. + +- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR) +- [ ] (R) KEP approvers have approved the KEP status as `implementable` +- [ ] (R) Design details are appropriately documented +- [ ] (R) Test plan is in place +- [ ] (R) Graduation criteria is in place +- [ ] "Implementation History" section is up-to-date for milestone +- [ ] User-facing documentation has been created in [etcd-io/website], for publication to [etcd.io] +- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes + + + +[etcd.io]: https://etcd.io/ +[kubernetes/enhancements]: https://git.k8s.io/enhancements +[kubernetes/kubernetes]: https://git.k8s.io/kubernetes +[etcd-io/website]: https://github.com/etcd-io/website + +## Summary + +We are introducing the Kubernetes style feature gate framework into etcd to gate future feature enhancement behind a sequence of feature lifecyles. Users would be able to turn features on or off and query feature enablement of an etcd server in a consistent way. + +## Motivation + +Currently any new enhancements to the etcd are typically added as [an experimental feature](https://github.com/etcd-io/etcd/blob/main/Documentation/contributor-guide/features.md#adding-a-new-feature), with a configuration flag prefixed with "experimental", e.g. --experimental-feature-name. + +When it is time to [graduate an experimental feature to stable](https://github.com/etcd-io/etcd/blob/main/Documentation/contributor-guide/features.md#graduating-an-experimental-feature-to-stable), a new stable feature flag identical to the experimental feature flag but without the --experimental prefix is added to replace the old feature flag, which is a breaking change. + +The following problems exist with the current `--experimental` flag mechanism: +* hard to track when and why feature was introduced, and what stage it is in. +* no clear path to graduate flags, as removal of experimental prefix is a breaking change in command line. +* unclear, undocumented feature flag graduation criteria. So very often features get stuck or forgotten in experimental phase. +* no method for clients to query and make decisions based on feature enablement in a cluster. +* each `--experimental` flag will have to be piped through the code based separately. It is cumbersome, nothing stops them being modified somewhere along the path, and some code changes might slip through and not be guarded by the flag. + +We are proposing to introduce Kubernetes style [feature gates](https://kubernetes.io/docs/reference/command-line-tools-reference/feature-gates/) into etcd to address these issues. + +### Goals + +* Introduce a single mechanism in the code base for enabling/disabling features with code changes guarded behind the mechanism without risking breaking the rest of the server if the feature is disabled. + +* Introduce a way for a client to find out what features are enabled in the server. + +* Codify lifecycle stages of a feature, and clear tracking and documentation of feature lifecycle progression. + +### Non-Goals + +* Handle the feature gate enablement on the cluster level. It would be designed in a separate KEP. + +* Migrate all existing `--experimental` features to feature gate. + * While feature gate provides a [path to deprecate existing `--experimental` features]((#path-to-migrate-existing-experimental-features)), and we will migrate several `--experimental` during the feature gate implementation as examples, the completion of the KEP does not require migrating all existing `--experimental` features. It can be done as followup tasks. + +* Use feature gate as a mechanism to fix bugs. + +## Proposal + +### User Stories (Optional) + +#### Story 1 + +A developer adds a new feature, and finds a bug after its Alpha release. + +Now the developer can take their time to fix the bug, because they can be confident all the new changes related to the feature are guarded behind the feature gate, Alpha features are disabled by default and should be adpoted by a small set of users. + +On the user side, how would the users know if they are using the buggy feature if they do not remember which flags were used to start the cluster? Before feature gate, they need to search through the long logs to find the flag value when the server initially started, if the log still exists. Now there is a way to find which features are enabled on a server directly, and restart the servers with the buggy feature enabled. + +#### Story 2 + +A developer added a new feature, and after a few releases, they are ready to graduate the feature. + +Before feature gate, to avoid breaking changes, they would need to add a stable flag `--my-feature-flag` along with `--experimental-my-feature-flag`. Now there are two duplicate flags doing exactly the same thing. +Now they need to wait till the next minor release to remove `--experimental-my-feature-flag`. If they forget, we are stuck with the two flags until someone bothers to look into verifying the two flags are identical and eventually removing `--experimental-my-feature-flag`. + +Now, the developer just need to change stage of the feature to the next stage - single line change. + +### Notes/Constraints/Caveats (Optional) + +#### Is feature enablement state a server level or cluster level property? + +There are simple features like `ExperimentalEnableDistributedTracing` (enables distributed tracing using OpenTelemetry protocol), which could work fine at the local server level. + +There are also features like `ExperimentalEnableLeaseCheckpoint` (enables leader to send regular checkpoints to other members to prevent reset of remaining TTL on leader change), if different server nodes have different enablement values for `ExperimentalEnableLeaseCheckpoint`, the results would be inconsistent depending on which one is the leader and confusing. + +There are also more critial features like changing the Apply logic of a raft message. All members have to agree on whether to use the new logic or old logic, otherwise it could break correctness. + +So feature gate should be able to handle both server level and cluster level features. Since cluster level feature consensus is itself a complicated problem. We will defer it to its own KEP, and only focus on the server level feature gate in this KEP. + +#### Should we use feature gate for bug fixes? + +There are some use cases when some bugs are found and later fixed in a patch version. The client would need to know if the etcd version in use contains that fix to decide whether or not to use that feature. + +The question is: should a new feature gate be added to signal the bug fix? + +We think the answer is generally NO: +* the new feature would need to be enabled by default to always apply the bug fix for new releases. +* it changes the API which is not desirable in patch version releases. + +The proper way of handling these bug fixes should be: +1. the feature should be gated by the feature gate from the beginning. +1. the feature should be disabled by default until it is widely tested in practice. +1. when the bug is found, the feature should ideally be at a lifecycle in which it is disabled by default. If not, the admin should disable it by the `--feature-gates` flag. +1. when the client upgrades etcd to the patch version with the fix, the admin could enable it by the `--feature-gates` flag. + +But if the bug fix is very critical, it could be discussed in a case-by-case manner. + +#### Could the lifecycle of a feature change in patch versions? + +Kubernetes have a minor release every 3 months, while the cadence of etcd minor releases is much less frequent. The question is: do we have to wait for years before graduating a new feature? + +We think we should still stick to the common practice of not changing the lifecycle of a feature in patch versions. Because: +* changing the lifecycle of a feature is an API change. According to the [etcd Operations Guide](https://etcd.io/docs/v3.5/op-guide/versioning/), only new minor versions may add additional features to the API. +* bugs in etcd could be hard to detect, and the reliability and robustness of etcd is more important than speed. A long history of testing through practical adoption a new feature is beneficial. + +With the feature gate in place, we could consider increasing the etcd release cadence because it would be easier to add new features and less risky to release new features. + +### Risks and Mitigations + +* As discussed before, we are deferring cluster level feature support to a follow-up KEP. + +## Design Details + +### Feature Gate + +We will create a new `featuregate` module under `pkg/` (mostly copying code from [`k8s.io/component-base/featuregate`](https://github.com/kubernetes/component-base/tree/master/featuregate)) with the following interface: +```go +// FeatureGate indicates whether a given feature is enabled or not +type FeatureGate interface { + // Enabled returns true if the key is enabled. + Enabled(key Feature) bool + // KnownFeatures returns a slice of strings describing the FeatureGate's known features. + KnownFeatures() []string + // DeepCopy returns a deep copy of the FeatureGate object, such that gates can be + // set on the copy without mutating the original. This is useful for validating + // config against potential feature gate changes before committing those changes. + DeepCopy() MutableFeatureGate +} +``` + +We will use the new `VersionedSpecs`(introduced in [kep-4330](https://github.com/kubernetes/enhancements/tree/master/keps/sig-architecture/4330-compatibility-versions)) to register and track the features along with their lifecycle at different release versions. + +```go +defaultServerFeatureGates := map[Feature]VersionedSpecs { + featureA: VersionedSpecs{ + {Version: mustParseVersion("3.6"), Default: false, PreRelease: Beta}, + {Version: mustParseVersion("3.7"), Default: true, PreRelease: GA}, + }, + featureB: VersionedSpecs{ + {Version: mustParseVersion("3.7"), Default: false, PreRelease: Alpha}, + }, +} +``` + +A new `--feature-gates` command line argument would be added to start the etcd server, with format like `--feature-gates=featureA=true,featureB=false`. The flag can also be set in the `config-file`. The flag can only be set during startup. We do not support dynamically changing the feature gates when the server is running. + +The `ServerConfig` struct will have a new `featuregate.FeatureGate`(immutable) field. `EtcdServer` would have an interface of `FeatureEnabled(key Feature) bool`, and it can be referenced throughout the server. + +```go +type ServerConfig struct { + ... + // ServerFeatureGate server level feature gate + ServerFeatureGate featuregate.FeatureGate + ... +} + +func (s *EtcdServer) FeatureEnabled(key Feature) bool +``` + +### Ways to Query Feature Gate State + +New Prometheus gauge metrics would be added to monitor and query if a feature is enabled for the server. + +The metrics could look like: +* `etcd_feature_enabled{name="featureName",stage="Alpha"} 1` if the feature is enabled. + +### Feature Stages + +Following the convention of Kubernetes, a feature can go through a lifecycle of Alpha → Beta → GA → Deprecated. + +| Feature Stage | Properties | Graduation Criteria | +| --- | --- | --- | +| Alpha |