
📖 CAEP to add support for managed external etcd clusters in CAPI #4659

Closed
wants to merge 6 commits

Conversation

mrajashree
Contributor

What this PR does / why we need it:
This PR adds a KEP to support managed external etcd clusters within CAPI.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
#4658

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label May 24, 2021
@k8s-ci-robot
Contributor

Hi @mrajashree. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label May 24, 2021
@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label May 24, 2021
@mrajashree mrajashree changed the title 📖 Proposal to add an etcd bootstrap provider 📖 KEP to add support for managed external etcd clusters in CAPI May 25, 2021
@CecileRobertMichon
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 25, 2021
Member

@gyuho gyuho left a comment


I added some comments as someone new to CAPI, but those shouldn't be a blocker. The overall logic makes sense.

@mrajashree mrajashree changed the title 📖 KEP to add support for managed external etcd clusters in CAPI 📖 CAEP to add support for managed external etcd clusters in CAPI May 25, 2021
@mrajashree mrajashree force-pushed the kep branch 3 times, most recently from 833b8d9 to 327061c Compare May 26, 2021 22:36
@mrajashree mrajashree force-pushed the kep branch 2 times, most recently from 060ab54 to 54a1a75 Compare May 27, 2021 19:06
```go
InitMachineAddress string `json:"initMachineAddress"`

// +optional
Initialized bool `json:"initialized"`
```
Member


I'm not sure I get the difference between this flag and the EtcdInitializedCondition. Is it possible to dedup?

- Once the upgrade/machine replacement is completed, the etcd controller will again update the EtcdadmCluster.Status.Endpoints field. The control plane controller will then roll out new Machines that use the updated etcd endpoints for the apiserver.

###### Periodic etcd member healthchecks
- The etcdadm cluster controller will perform healthchecks on the etcd members periodically, at a predetermined interval. The controller will perform client healthchecks by making HTTP GET requests to the `<etcd member address>:2379/health` endpoint of each etcd member. It cannot perform peer-to-peer healthchecks, since the controller is external to the etcd cluster and there won't be any agents running on the etcd members.
Member


This inverts the current pattern, in which MHC is responsible for checking machine health.
I'm OK with this, but I would like to hear other opinions on it...


#### Changes needed in docker machine controller
- Each machine infrastructure provider sets a providerID on every InfraMachine.
- Infrastructure providers like CAPA and CAPV set this provider ID based on the instance metadata.
Member


If I'm not wrong, this is a responsibility of the CPI controllers.

#### Changes needed in docker machine controller
- Each machine infrastructure provider sets a providerID on every InfraMachine.
- Infrastructure providers like CAPA and CAPV set this provider ID based on the instance metadata.
- CAPD, on the other hand, first calls `kubectl patch` to add the provider ID on the node, and only then sets the provider ID value on the DockerMachine resource. The problem is that in the case of an etcd-only cluster, the machine is not registered as a Kubernetes node, and cluster provisioning does not progress until this provider ID is set on the Machine. As a solution, CAPD will check whether the bootstrap provider is etcdadm and, if so, skip the `kubectl patch` step and set the provider ID on the DockerMachine directly. These changes are required in the [docker machine controller](https://github.com/mrajashree/cluster-api/pull/2/files#diff-1923dd8291a9406e3c144763f526bd9797a2449a030f5768178b8d06d13c795bR307) for CAPD.
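For illustration, the merge-patch body that the `kubectl patch` step applies to set the node's provider ID can be built with the standard library alone. The helper name and the example provider ID value are hypothetical; only the `spec.providerID` path comes from the Node API.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// buildProviderIDPatch constructs the JSON merge-patch that sets
// spec.providerID on a Node, i.e. the body that
// `kubectl patch node <name> --type merge -p <body>` would send.
// The helper name is hypothetical.
func buildProviderIDPatch(providerID string) ([]byte, error) {
	patch := map[string]interface{}{
		"spec": map[string]interface{}{
			"providerID": providerID,
		},
	}
	return json.Marshal(patch)
}

func main() {
	body, err := buildProviderIDPatch("docker:////my-cluster-etcd-0")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body)) // {"spec":{"providerID":"docker:////my-cluster-etcd-0"}}
}
```

In an etcd-only cluster there is no Node object to patch, which is why the proposal has CAPD set the provider ID on the DockerMachine directly instead.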
Member


I'm -1 on this change.
Each provider should be sandboxed and rely only on information available on CAPI resources (vs checking combinations with other providers). If with this proposal we are introducing the concept of a machine without a node, let's find a way to make this a first-class construct in CAPI core. This can eventually be done in staged approaches, avoiding breaking changes in the beginning, but in the end we should give users a way to discriminate between the different types of machines.

Comment on lines +443 to +444
```go
EtcdBootstrapProviderType = ProviderType("EtcdBootstrapProvider")
EtcdProvider              = ProviderType("EtcdProvider")
```
Member


This impacts the operator work as well... At the very least, let's state as a future goal the intent to update the operator proposal to accommodate these providers too.

The etcdadm-based provider requires two separate provider types; other etcd providers added later might not need two, in which case they can use just the EtcdProvider type.
With these changes, a user should be able to install the etcd provider by running
```
clusterctl init --etcdbootstrap etcdadm-bootstrap-provider --etcdprovider etcdadm-controller
```
Member


Note: adding new provider types has a ripple effect on several clusterctl commands, not only init (e.g. upgrade, generate provider, etc.)

@k8s-ci-robot
Contributor

k8s-ci-robot commented Dec 8, 2021

@mrajashree: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Required | Rerun command |
| --- | --- | --- | --- |
| pull-cluster-api-test-main-mink8s | 7c8b65f | | `/test pull-cluster-api-test-main-mink8s` |
| pull-cluster-api-e2e-main | 7c8b65f | true | `/test pull-cluster-api-e2e-main` |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@dntosas
Contributor

dntosas commented Jan 25, 2022

hey @mrajashree ^^

is there any update regarding this KEP?

@mrajashree
Contributor Author

@dntosas I'll update the KEP soon

@vincepri
Member

@mrajashree Are we still trying to pursue this proposal? What's its status?

@mrajashree
Contributor Author

@vincepri yes I'd like to continue with this, I'll update this by next week

@vincepri
Member

vincepri commented Jun 7, 2022

@mrajashree Any updates on this proposal?

@vincepri
Member

vincepri commented Jun 8, 2022

@g-gaston mentioned during the 06/08 meeting that they'd be taking over this proposal and collaborate with @mrajashree on next steps

@linux-foundation-easycla

linux-foundation-easycla bot commented Jul 23, 2022

CLA Missing ID CLA Not Signed

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. and removed cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jul 23, 2022
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign justinsb for approval by writing /assign @justinsb in a comment. For more information see: The Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@fabriziopandini
Member

/close
due to lack of activity; we can re-open if someone is interested in picking up the work

@k8s-ci-robot
Contributor

@fabriziopandini: Closed this PR.

In response to this:

/close
due to lack of activity; we can re-open if someone is interested in picking up the work

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@g-gaston
Contributor

g-gaston commented Oct 3, 2022

@fabriziopandini I'm interested in picking this up

@fabriziopandini
Member

fabriziopandini commented Oct 4, 2022

Good to hear! What about restarting with a new issue stating the goal of this effort, plus a new proposal PR? I think this discussion could benefit from a fresh start.
Otherwise, I can just re-open; let me know.

@g-gaston
Contributor

g-gaston commented Oct 4, 2022

Yeah, sounds good. I'll write up an issue and open a new PR referencing this one.

Labels
cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
9 participants