
[WIP] 📖 External etcd cluster lifecycle support #7525

Closed

Conversation

g-gaston
Contributor

@g-gaston g-gaston commented Nov 9, 2022

What this PR does / why we need it:
This proposal continues the work started by #4659

Tracking TODOs

  • Expand on the interaction between the KCP, Cluster, and EtcdadmCluster controllers, especially for upgrade scenarios.
  • Document support for Bottlerocket

Fixes #7399

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign vincepri for approval by writing /assign @vincepri in a comment. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Nov 9, 2022
@k8s-ci-robot
Contributor

Hi @g-gaston. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

- A [`runcmd` section](https://cloudinit.readthedocs.io/en/latest/topics/modules.html#runcmd) containing the etcdadm commands, along with any user specified commands.
The controller then saves this cloud-init script as a Secret and records the Secret's name in the EtcdadmConfig.Status.DataSecretName field.
- Since the EtcdadmConfig.Status.DataSecretName field gets set as per the [bootstrap provider specifications](https://cluster-api.sigs.k8s.io/developer/providers/bootstrap.html), the infrastructure providers will use the data from this Secret to initialize the Machines with the cloud-init script.
> TODO: the current implementation already supports [Bottlerocket](https://github.com/bottlerocket-os/bottlerocket) as well; document it.
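As an aside on the bootstrap contract quoted above, the sketch below illustrates, under assumptions and not as the actual etcdadm bootstrap provider code, how a controller could persist the rendered cloud-init as a Secret and surface it via `Status.DataSecretName`. The function name `storeBootstrapData` and the local status type are hypothetical; only the contract itself comes from the excerpt.

```go
package etcdbootstrap

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// EtcdadmConfigStatus stands in for the real status type of the etcdadm
// bootstrap provider; only DataSecretName is taken from the excerpt above.
type EtcdadmConfigStatus struct {
	Ready          bool
	DataSecretName *string
}

// storeBootstrapData (hypothetical) saves the rendered cloud-init script,
// including the etcdadm runcmd entries, as a Secret and records the Secret's
// name on the status so infrastructure providers can pick it up.
func storeBootstrapData(ctx context.Context, c client.Client, namespace, name string, cloudInit []byte, status *EtcdadmConfigStatus) error {
	secret := &corev1.Secret{
		ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: namespace},
		// Bootstrap data Secrets conventionally store the payload under the "value" key.
		Data: map[string][]byte{"value": cloudInit},
	}
	if err := c.Create(ctx, secret); err != nil {
		return err
	}
	status.DataSecretName = &secret.Name
	status.Ready = true
	return nil
}
```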
Contributor Author

This might trigger its own conversation due to the significant differences between Bottlerocket and other OSes that support cloud-init.

It doesn't need to be part of this proposal but, since we will follow up with the implementation, it's probably worth noting. I'll just document it here and we'll go from there.

@g-gaston g-gaston changed the title 📖 External etcd cluster lifecycle support 📖 [WIP] External etcd cluster lifecycle support Nov 9, 2022
@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 9, 2022
@g-gaston g-gaston changed the title 📖 [WIP] External etcd cluster lifecycle support [WIP] 📖 External etcd cluster lifecycle support Nov 9, 2022
@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. and removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Nov 9, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 7, 2023
@g-gaston
Contributor Author

g-gaston commented Feb 8, 2023

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 8, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 9, 2023
@g-gaston
Contributor Author

g-gaston commented May 9, 2023

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 9, 2023
A provider that creates and manages an etcd cluster to be used by a single workload cluster for the [external etcd topology](https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/ha-topology/#external-etcd-topology).

### Etcdadm based etcd provider
This provider is an implementation of the Etcd provider that uses [etcdadm](https://github.com/kubernetes-sigs/etcdadm) to create and manage an external etcd cluster.
Member

The last release was in 2021; is this still maintained?

Contributor Author

Yeah, it is. I myself got a couple of PRs in after that 2021 release. I suspect no one has asked for a release and that's why it hasn't been cut. But I can reach out to get one if we move ahead with this.

- An etcd provider should support etcd clusters of any size (odd numbers only), including single-member etcd clusters for testing purposes. However, the documentation should recommend a cluster size between 3 and 7 members for production use cases.
- Provide a first implementation of this pluggable etcd provider using [etcdadm](https://github.com/kubernetes-sigs/etcdadm) that integrates with the Kubeadm Control Plane provider.
- The etcdadm based provider will utilize the existing Machine objects to represent etcd members for convenience instead of adding a new machine type for etcd.
- Support the following etcd cluster management actions: scale up and scale down, etcd member replacement and etcd version upgrades.
Member

Should we add that we need some code reuse between this effort and KCP's current etcd capabilities? Ultimately we should try to maximize the shared code between the external etcd provider and the stacked one.

Contributor Author

Ah good point. TBH I would prefer to leave that out of the design and move that responsibility to the implementation phase. It's also something we can iterate on without affecting the design.

However if you feel strongly about this, what about:

Suggested change
- Support the following etcd cluster management actions: scale up and scale down, etcd member replacement and etcd version upgrades.
- Support the following etcd cluster management actions: scale up and scale down, etcd member replacement and etcd version upgrades.
- Minimize code duplication with KCP's current etcd capabilities by sharing code with the external etcd providers.

- There will be a 1:1 mapping between the external etcd cluster and the workload cluster.
- An etcd provider should support etcd clusters of any size (odd numbers only), including single-member etcd clusters for testing purposes. However, the documentation should recommend a cluster size between 3 and 7 members for production use cases.
- Provide a first implementation of this pluggable etcd provider using [etcdadm](https://github.com/kubernetes-sigs/etcdadm) that integrates with the Kubeadm Control Plane provider.
- The etcdadm based provider will utilize the existing Machine objects to represent etcd members for convenience instead of adding a new machine type for etcd.
Member

There is currently a limitation in that the Machine controller always expects a Machine to become a Kubernetes Node; this proposal would allow a set of Machines to exist without a corresponding Node. We might want to add a paragraph here that states the above, clarifying that Machines that won't become Nodes will only be used for etcd, which is a supporting pillar.

Contributor Author

Good point, will add this.

The following new type/CRDs will be added for the etcdadm based provider
```go
// API Group: etcd.cluster.x-k8s.io OR etcdplane.cluster.x-k8s.io
type EtcdadmCluster struct {
	// ... (remaining fields not shown in this excerpt)
}
```
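The excerpt above is truncated. A hedged reconstruction, based only on the fields quoted later in this review (the `EtcdadmConfigSpec` field on lines +124 to +125 and the `Initialized`/`CreationComplete` status fields on lines +138 to +142), might look like the following; `Replicas` and the `etcdbp` import path are assumptions, not part of the proposal text.

```go
package etcdcluster

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

	// Placeholder import path: the etcdadm bootstrap provider API package is
	// referenced as etcdbp in the proposal, but its path is not shown here.
	etcdbp "example.com/etcdadm-bootstrap-provider/api/v1beta1"
)

type EtcdadmCluster struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   EtcdadmClusterSpec   `json:"spec,omitempty"`
	Status EtcdadmClusterStatus `json:"status,omitempty"`
}

type EtcdadmClusterSpec struct {
	// Replicas is an assumption: the desired number of etcd members (odd numbers only).
	// +optional
	Replicas *int32 `json:"replicas,omitempty"`

	// EtcdadmConfigSpec is quoted verbatim from the review excerpt below.
	// +optional
	EtcdadmConfigSpec etcdbp.EtcdadmConfigSpec `json:"etcdadmConfigSpec"`
}

type EtcdadmClusterStatus struct {
	// Quoted from the review excerpt below; a later comment suggests replacing
	// these booleans with conditions.
	// +optional
	Initialized bool `json:"initialized"`

	// +optional
	CreationComplete bool `json:"ready"`
}
```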
Member

Could we avoid relying on etcdadm here and just call it etcd? Reason being, if in the future we'd like to swap out the implementation detail, we should be able to do so.

Contributor Author

Oh, that's an interesting point. However, wouldn't that almost guarantee another provider, similar to if we ever wanted to swap kubeadm for a different tool?

Member

It might be better to replace etcdadm based provider with etcdadm based etcd provider to avoid confusion. IMO calling etcdadm here would be fine because that's one of the reference implementations of the etcd provider.

Comment on lines +124 to +125
// +optional
EtcdadmConfigSpec etcdbp.EtcdadmConfigSpec `json:"etcdadmConfigSpec"`
Member

This imports a config that we have no control over. Could we instead offer a translation layer between what we want to expose to Cluster API users and the etcdadm config, keeping etcdadm as an implementation detail?

Contributor Author

Ah, I see what you are saying. That sounds like a paradigm change, which I'm not opposed to, but it does change the high-level approach slightly. If I'm getting this right, you are suggesting making this API almost like a generic interface for an etcd provider, as opposed to the API for a concrete implementation using etcdadm (which is what this design tries to do). This design follows the same idea as the CP provider and the KCP, where the KCP API is specific to kubeadm.

What's the benefit you see in doing this? Or is it more that you have a concern with etcdadm?
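To make the idea being discussed more concrete, here is a purely illustrative sketch of a translation layer: a small, provider-neutral spec exposed to Cluster API users that the controller converts into etcdadm-specific options internally. Every type, field, and function name below is a hypothetical example, not part of the proposal or of etcdadm.

```go
package etcdcluster

// EtcdClusterSpec is a hypothetical, provider-neutral surface that could be
// exposed to Cluster API users instead of the raw etcdadm config.
type EtcdClusterSpec struct {
	// Version is the etcd version to run.
	Version string `json:"version"`

	// ExtraArgs are additional flags passed to the etcd members.
	// +optional
	ExtraArgs map[string]string `json:"extraArgs,omitempty"`
}

// etcdadmOptions stands in for the etcdadm-specific configuration; a real
// implementation would map onto the actual etcdadm bootstrap provider types.
type etcdadmOptions struct {
	Version   string
	ExtraArgs map[string]string
}

// toEtcdadmOptions keeps etcdadm as an implementation detail behind the
// neutral spec.
func toEtcdadmOptions(spec EtcdClusterSpec) etcdadmOptions {
	return etcdadmOptions{
		Version:   spec.Version,
		ExtraArgs: spec.ExtraArgs,
	}
}
```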

Comment on lines +138 to +142
// +optional
Initialized bool `json:"initialized"`

// +optional
CreationComplete bool `json:"ready"`
Member

These should probably be conditions?

Contributor Author

Yeah, I would agree. Let me take a note and revisit this.
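For reference, a minimal sketch of what a conditions-based status could look like, following the usual Cluster API conditions convention; the condition names are assumptions, not part of the proposal.

```go
package etcdcluster

import (
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

// Hypothetical condition types replacing the Initialized/CreationComplete booleans.
const (
	// EtcdInitializedCondition would report that the first etcd member has been created.
	EtcdInitializedCondition clusterv1.ConditionType = "EtcdInitialized"

	// EtcdClusterReadyCondition would report that all desired etcd members are up and healthy.
	EtcdClusterReadyCondition clusterv1.ConditionType = "EtcdClusterReady"
)

// EtcdadmClusterStatus, revised to use conditions instead of booleans.
type EtcdadmClusterStatus struct {
	// Conditions defines the current state of the EtcdadmCluster.
	// +optional
	Conditions clusterv1.Conditions `json:"conditions,omitempty"`
}
```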

@g-gaston
Contributor Author

Thanks for the review @vincepri!

@k8s-ci-robot
Contributor

@g-gaston: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-cluster-api-verify-main | 71e9661 | link | true | /test pull-cluster-api-verify-main |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@fabriziopandini
Member

@g-gaston what is the plan for this PR?
Since it has not been active for some time now, I would suggest closing it and reviving the effort as soon as someone has the bandwidth / there is critical mass to move this forward.

@g-gaston
Contributor Author

@g-gaston what is the plan for this PR? Since it has not been active for some time now, I would suggest closing it and reviving the effort as soon as someone has the bandwidth / there is critical mass to move this forward.

Yeah, agreed.

I still have interest in moving this forward but I don't have the bandwidth to make enough progress right now :(. If someone is interested in collaborating, please reach out. If not, I'll reopen this once I or someone else on the eks-a side starts working on it again.

@g-gaston g-gaston closed this Jan 16, 2024