
[WIP] 📖 External etcd cluster lifecycle support #7525

Closed

Conversation

g-gaston
Contributor

@g-gaston g-gaston commented Nov 9, 2022

What this PR does / why we need it:
This proposal continues the work started by #4659

Tracking TODOs

  • Expand on the interaction between the KCP, Cluster, and EtcdadmCluster controllers, especially for upgrade scenarios.
  • Document support for Bottlerocket

Fixes #7399

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign vincepri for approval by writing /assign @vincepri in a comment. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Nov 9, 2022
@k8s-ci-robot
Contributor

Hi @g-gaston. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

- A [`runcmd` section](https://cloudinit.readthedocs.io/en/latest/topics/modules.html#runcmd) containing the etcdadm commands, along with any user specified commands.
The controller then saves this cloud-init script as a Secret and records the Secret's name in the EtcdadmConfig.Status.DataSecretName field.
- Since the EtcdadmConfig.Status.DataSecretName field gets set as per the [bootstrap provider specifications](https://cluster-api.sigs.k8s.io/developer/providers/bootstrap.html), the infrastructure providers will use the data from this Secret to initialize the Machines with the cloud-init script.
> TODO: the current implementation already supports [Bottlerocket](https://github.com/bottlerocket-os/bottlerocket) as well; document it.
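As an aside on the bootstrap contract quoted above, the sketch below illustrates, under assumptions and not as the actual etcdadm bootstrap provider code, how a controller could persist the rendered cloud-init as a Secret and surface it via `Status.DataSecretName`. The function name `storeBootstrapData` and the local status type are hypothetical; only the contract itself comes from the excerpt.

```go
package etcdbootstrap

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// EtcdadmConfigStatus stands in for the real status type of the etcdadm
// bootstrap provider; only DataSecretName is taken from the excerpt above.
type EtcdadmConfigStatus struct {
	Ready          bool
	DataSecretName *string
}

// storeBootstrapData (hypothetical) saves the rendered cloud-init script,
// including the etcdadm runcmd entries, as a Secret and records the Secret's
// name on the status so infrastructure providers can pick it up.
func storeBootstrapData(ctx context.Context, c client.Client, namespace, name string, cloudInit []byte, status *EtcdadmConfigStatus) error {
	secret := &corev1.Secret{
		ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: namespace},
		// Bootstrap data Secrets conventionally store the payload under the "value" key.
		Data: map[string][]byte{"value": cloudInit},
	}
	if err := c.Create(ctx, secret); err != nil {
		return err
	}
	status.DataSecretName = &secret.Name
	status.Ready = true
	return nil
}
```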
Contributor Author

This might trigger its own conversation due to the significant differences between Bottlerocket and other OSes that support cloud-init.

It doesn't need to be part of this proposal but, since we will follow up with the implementation, it's probably worth noting. I'll just document it here and we'll go from there.

@g-gaston g-gaston changed the title 📖 External etcd cluster lifecycle support 📖 [WIP] External etcd cluster lifecycle support Nov 9, 2022
@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 9, 2022
@g-gaston g-gaston changed the title 📖 [WIP] External etcd cluster lifecycle support [WIP] 📖 External etcd cluster lifecycle support Nov 9, 2022
@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. and removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Nov 9, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 7, 2023
@g-gaston
Contributor Author

g-gaston commented Feb 8, 2023

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 8, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 9, 2023
@g-gaston
Contributor Author

g-gaston commented May 9, 2023

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 9, 2023
A provider that creates and manages an etcd cluster to be used by a single workload cluster for the [external etcd topology](https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/ha-topology/#external-etcd-topology).

### Etcdadm based etcd provider
This provider is an implementation of the Etcd provider that uses [etcdadm](https://github.com/kubernetes-sigs/etcdadm) to create and manage an external etcd cluster.
Member

The last release was in 2021; is this still maintained?

Contributor Author

Yeah, it is. I myself got a couple of PRs in after that 2021 release. I suspect no one has asked for a release and that's why it hasn't been cut. But I can reach out to get one if we move ahead with this.

- An etcd provider should support etcd clusters of any size (odd numbers only), including single-member etcd clusters for testing purposes. However, the documentation should recommend a cluster size between 3 and 7 members for production use cases.
- Provide a first implementation of this pluggable etcd provider using [etcdadm](https://github.com/kubernetes-sigs/etcdadm) that integrates with the Kubeadm Control Plane provider.
- The etcdadm based provider will utilize the existing Machine objects to represent etcd members for convenience instead of adding a new machine type for etcd.
- Support the following etcd cluster management actions: scale up and scale down, etcd member replacement and etcd version upgrades.
Member

Should we add that we need some code reuse between this effort and KCP's current etcd capabilities? Ultimately we should try to maximize the shared code between the external etcd provider and the stacked one.

Contributor Author

Ah good point. TBH I would prefer to leave that out of the design and move that responsibility to the implementation phase. It's also something we can iterate on without affecting the design.

However if you feel strongly about this, what about:

Suggested change
- Support the following etcd cluster management actions: scale up and scale down, etcd member replacement and etcd version upgrades.
- Support the following etcd cluster management actions: scale up and scale down, etcd member replacement and etcd version upgrades.
- Minimize code duplication with KCP's current etcd capabilities by sharing code with the external etcd providers.

- There will be a 1:1 mapping between the external etcd cluster and the workload cluster.
- An etcd provider should support etcd clusters of any size (odd numbers only), including single-member etcd clusters for testing purposes. However, the documentation should recommend a cluster size between 3 and 7 members for production use cases.
- Provide a first implementation of this pluggable etcd provider using [etcdadm](https://github.com/kubernetes-sigs/etcdadm) that integrates with the Kubeadm Control Plane provider.
- The etcdadm based provider will utilize the existing Machine objects to represent etcd members for convenience instead of adding a new machine type for etcd.
Member

There is currently a limitation in that the Machine controller always expects a Machine to become a Kubernetes Node; this proposal would allow a set of Machines to exist without a corresponding Node. We might want to add a paragraph here that states the above, clarifying that Machines that won't become Nodes will only be used for etcd, which is a supporting pillar.

Contributor Author

Good point, will add this.

The following new type/CRDs will be added for the etcdadm based provider
```go
// API Group: etcd.cluster.x-k8s.io OR etcdplane.cluster.x-k8s.io
type EtcdadmCluster struct {
	// ... (remaining fields not shown in this excerpt)
}
```
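The excerpt above is truncated. A hedged reconstruction, based only on the fields quoted later in this review (the `EtcdadmConfigSpec` field on lines +124 to +125 and the `Initialized`/`CreationComplete` status fields on lines +138 to +142), might look like the following; `Replicas` and the `etcdbp` import path are assumptions, not part of the proposal text.

```go
package etcdcluster

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

	// Placeholder import path: the etcdadm bootstrap provider API package is
	// referenced as etcdbp in the proposal, but its path is not shown here.
	etcdbp "example.com/etcdadm-bootstrap-provider/api/v1beta1"
)

type EtcdadmCluster struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   EtcdadmClusterSpec   `json:"spec,omitempty"`
	Status EtcdadmClusterStatus `json:"status,omitempty"`
}

type EtcdadmClusterSpec struct {
	// Replicas is an assumption: the desired number of etcd members (odd numbers only).
	// +optional
	Replicas *int32 `json:"replicas,omitempty"`

	// EtcdadmConfigSpec is quoted verbatim from the review excerpt below.
	// +optional
	EtcdadmConfigSpec etcdbp.EtcdadmConfigSpec `json:"etcdadmConfigSpec"`
}

type EtcdadmClusterStatus struct {
	// Quoted from the review excerpt below; a later comment suggests replacing
	// these booleans with conditions.
	// +optional
	Initialized bool `json:"initialized"`

	// +optional
	CreationComplete bool `json:"ready"`
}
```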
Member

Could we avoid relying on etcdadm here and just call it etcd? Reason being, if in the future we'd like to swap out the implementation detail, we should be able to do so.

Contributor Author

Oh, that's an interesting point. However, wouldn't that almost guarantee another provider, similar to if we ever wanted to swap kubeadm for a different tool?

Member

It might be better to replace etcdadm based provider with etcdadm based etcd provider to avoid confusion. IMO calling etcdadm here would be fine because that's one of the reference implementations of the etcd provider.

Comment on lines +124 to +125
// +optional
EtcdadmConfigSpec etcdbp.EtcdadmConfigSpec `json:"etcdadmConfigSpec"`
Member

This imports a config that we have no control over. Could we instead offer a translation layer between what we want to expose to Cluster API users and the etcdadm config, keeping etcdadm as an implementation detail?

Contributor Author

Ah, I see what you are saying. That sounds like a paradigm change, which I'm not opposed to, but it does change the high-level approach slightly. If I'm getting this right, you are suggesting making this API almost like a generic interface for an etcd provider, as opposed to the API for a concrete implementation using etcdadm (which is what this design tries to do). This design follows the same idea as the CP provider and the KCP, where the KCP API is specific to kubeadm.

What's the benefit you see in doing this? Or is it more that you have a concern with etcdadm?
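To make the idea being discussed more concrete, here is a purely illustrative sketch of a translation layer: a small, provider-neutral spec exposed to Cluster API users that the controller converts into etcdadm-specific options internally. Every type, field, and function name below is a hypothetical example, not part of the proposal or of etcdadm.

```go
package etcdcluster

// EtcdClusterSpec is a hypothetical, provider-neutral surface that could be
// exposed to Cluster API users instead of the raw etcdadm config.
type EtcdClusterSpec struct {
	// Version is the etcd version to run.
	Version string `json:"version"`

	// ExtraArgs are additional flags passed to the etcd members.
	// +optional
	ExtraArgs map[string]string `json:"extraArgs,omitempty"`
}

// etcdadmOptions stands in for the etcdadm-specific configuration; a real
// implementation would map onto the actual etcdadm bootstrap provider types.
type etcdadmOptions struct {
	Version   string
	ExtraArgs map[string]string
}

// toEtcdadmOptions keeps etcdadm as an implementation detail behind the
// neutral spec.
func toEtcdadmOptions(spec EtcdClusterSpec) etcdadmOptions {
	return etcdadmOptions{
		Version:   spec.Version,
		ExtraArgs: spec.ExtraArgs,
	}
}
```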

Comment on lines +138 to +142
// +optional
Initialized bool `json:"initialized"`

// +optional
CreationComplete bool `json:"ready"`
Member

These should probably be conditions?

Contributor Author

Yeah, I would agree. Let me take a note and revisit this.
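For reference, a minimal sketch of what a conditions-based status could look like, following the usual Cluster API conditions convention; the condition names are assumptions, not part of the proposal.

```go
package etcdcluster

import (
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

// Hypothetical condition types replacing the Initialized/CreationComplete booleans.
const (
	// EtcdInitializedCondition would report that the first etcd member has been created.
	EtcdInitializedCondition clusterv1.ConditionType = "EtcdInitialized"

	// EtcdClusterReadyCondition would report that all desired etcd members are up and healthy.
	EtcdClusterReadyCondition clusterv1.ConditionType = "EtcdClusterReady"
)

// EtcdadmClusterStatus, revised to use conditions instead of booleans.
type EtcdadmClusterStatus struct {
	// Conditions defines the current state of the EtcdadmCluster.
	// +optional
	Conditions clusterv1.Conditions `json:"conditions,omitempty"`
}
```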

@g-gaston
Contributor Author

Thanks for the review @vincepri!

@k8s-ci-robot
Contributor

@g-gaston: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-cluster-api-verify-main | 71e9661 | link | true | /test pull-cluster-api-verify-main |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@fabriziopandini
Member

@g-gaston what is the plan for this PR?
Since it has not been active for some time now, I would suggest closing it and reviving the effort as soon as someone has the bandwidth / there is critical mass to move this forward.

@g-gaston
Contributor Author

@g-gaston what is the plan for this PR? Since it has not been active for some time now, I would suggest closing it and reviving the effort as soon as someone has the bandwidth / there is critical mass to move this forward.

Yeah, agreed.

I still have interest in moving this forward but I don't have the bandwidth to make enough progress right now :(. If someone is interested in collaborating, please reach out. If not, I'll reopen this once I or someone else on the eks-a side starts working on it again.

@g-gaston g-gaston closed this Jan 16, 2024