
📖 CAEP to add support for managed external etcd clusters in CAPI #4659

Closed
wants to merge 6 commits

Conversation

mrajashree
Contributor

What this PR does / why we need it:
This PR adds a KEP to support managed external etcd clusters within CAPI.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
#4658

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label May 24, 2021
@k8s-ci-robot
Contributor

Hi @mrajashree. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label May 24, 2021
@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label May 24, 2021
@mrajashree mrajashree changed the title 📖 Proposal to add an etcd bootstrap provider 📖 KEP to add support for managed external etcd clusters in CAPI May 25, 2021
@CecileRobertMichon
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 25, 2021
Member

@gyuho gyuho left a comment


I added some comments as someone new to CAPI, but those shouldn't be a blocker. The overall logic makes sense.

@mrajashree mrajashree changed the title 📖 KEP to add support for managed external etcd clusters in CAPI 📖 CAEP to add support for managed external etcd clusters in CAPI May 25, 2021
@mrajashree mrajashree force-pushed the kep branch 3 times, most recently from 833b8d9 to 327061c Compare May 26, 2021 22:36
@mrajashree mrajashree force-pushed the kep branch 2 times, most recently from 060ab54 to 54a1a75 Compare May 27, 2021 19:06
```go
InitMachineAddress string `json:"initMachineAddress"`

// +optional
Initialized bool `json:"initialized"`
```
Member


I'm not sure I get the difference between this flag and the EtcdInitializedCondition. Is it possible to dedup?

- Once the upgrade/machine replacement is completed, the etcd controller will again update the EtcdadmCluster.Status.Endpoints field. The control plane controller will then roll out new Machines that use the updated etcd endpoints for the apiserver.

###### Periodic etcd member healthchecks
- The etcdadm cluster controller will perform healthchecks on the etcd members periodically, at a predetermined interval. The controller will perform client healthchecks by making HTTP GET requests to the `<etcd member address>:2379/health` endpoint of each etcd member. It cannot perform peer-to-peer healthchecks, since the controller is external to the etcd cluster and there won't be any agents running on the etcd members.
Member


This inverts the current pattern, in which MHC is responsible for checking machine health.
I'm OK with this, but I would like to hear other opinions on it...


#### Changes needed in docker machine controller
- Each machine infrastructure provider sets a providerID on every InfraMachine.
- Infrastructure providers like CAPA and CAPV set this provider ID based on the instance metadata.
Member


If I'm not wrong, this is a responsibility of the CPI controllers.

#### Changes needed in docker machine controller
- Each machine infrastructure provider sets a providerID on every InfraMachine.
- Infrastructure providers like CAPA and CAPV set this provider ID based on the instance metadata.
- CAPD, on the other hand, first calls `kubectl patch` to add the provider ID on the node, and only then sets the provider ID value on the DockerMachine resource. The problem is that in the case of an etcd-only cluster, the machine is not registered as a Kubernetes node, and cluster provisioning does not progress until this provider ID is set on the Machine. As a solution, CAPD will check whether the bootstrap provider is etcdadm and, if so, skip the `kubectl patch` step and set the provider ID on the DockerMachine directly. These changes are required in the [docker machine controller](https://github.com/mrajashree/cluster-api/pull/2/files#diff-1923dd8291a9406e3c144763f526bd9797a2449a030f5768178b8d06d13c795bR307) for CAPD.
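For illustration, the merge-patch body that the `kubectl patch` step applies to set the node's provider ID can be built with the standard library alone. The helper name and the example provider ID value are hypothetical; only the `spec.providerID` path comes from the Node API.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// buildProviderIDPatch constructs the JSON merge-patch that sets
// spec.providerID on a Node, i.e. the body that
// `kubectl patch node <name> --type merge -p <body>` would send.
// The helper name is hypothetical.
func buildProviderIDPatch(providerID string) ([]byte, error) {
	patch := map[string]interface{}{
		"spec": map[string]interface{}{
			"providerID": providerID,
		},
	}
	return json.Marshal(patch)
}

func main() {
	body, err := buildProviderIDPatch("docker:////my-cluster-etcd-0")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body)) // {"spec":{"providerID":"docker:////my-cluster-etcd-0"}}
}
```

In an etcd-only cluster there is no Node object to patch, which is why the proposal has CAPD set the provider ID on the DockerMachine directly instead.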
Member


I'm -1 on this change.
Each provider should be sandboxed and rely only on information available on CAPI resources (vs checking combinations with other providers). If with this proposal we are introducing the concept of a machine without a node, let's find a way to make this a first-class construct in CAPI core. This can eventually be done in staged approaches, avoiding breaking changes in the beginning, but in the end we should give users a way to discriminate between the different types of machines.

Comment on lines +443 to +444
```go
EtcdBootstrapProviderType = ProviderType("EtcdBootstrapProvider")
EtcdProvider              = ProviderType("EtcdProvider")
```
Member


This impacts the operator work as well... At the very least, let's state as a future goal the intent to update the operator proposal to accommodate these providers too.

The etcdadm-based provider requires two separate provider types; other etcd providers added later might not need two, in which case they can use just the EtcdProvider type.
With these changes, a user should be able to install the etcd provider by running
```
clusterctl init --etcdbootstrap etcdadm-bootstrap-provider --etcdprovider etcdadm-controller
```
Member


Note: adding new provider types has a ripple effect on several clusterctl commands, not only init (e.g. upgrade, generate provider, etc.)

@k8s-ci-robot
Contributor

k8s-ci-robot commented Dec 8, 2021

@mrajashree: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Required | Rerun command |
| --- | --- | --- | --- |
| pull-cluster-api-test-main-mink8s | 7c8b65f | | `/test pull-cluster-api-test-main-mink8s` |
| pull-cluster-api-e2e-main | 7c8b65f | true | `/test pull-cluster-api-e2e-main` |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@dntosas
Contributor

dntosas commented Jan 25, 2022

hey @mrajashree ^^

is there any update regarding this KEP?

@mrajashree
Contributor Author

@dntosas I'll update the KEP soon

@vincepri
Member

@mrajashree Are we still trying to pursue this proposal? What's its status?

@mrajashree
Contributor Author

@vincepri yes I'd like to continue with this, I'll update this by next week

@vincepri
Member

vincepri commented Jun 7, 2022

@mrajashree Any updates on this proposal?

@vincepri
Member

vincepri commented Jun 8, 2022

@g-gaston mentioned during the 06/08 meeting that they'd be taking over this proposal and collaborate with @mrajashree on next steps

@linux-foundation-easycla

linux-foundation-easycla bot commented Jul 23, 2022

CLA Missing ID CLA Not Signed

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. and removed cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jul 23, 2022
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign justinsb for approval by writing /assign @justinsb in a comment. For more information see: The Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@fabriziopandini
Member

/close
due to lack of activity; we can re-open if someone is interested in picking up the work

@k8s-ci-robot
Contributor

@fabriziopandini: Closed this PR.

In response to this:

/close
due to lack of activity; we can re-open if someone is interested in picking up the work

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@g-gaston
Contributor

g-gaston commented Oct 3, 2022

@fabriziopandini I'm interested in picking this up

@fabriziopandini
Member

fabriziopandini commented Oct 4, 2022

Good to hear! What about restarting with a new issue stating the goal of this effort, plus a new proposal PR? I think this discussion could benefit from a fresh start.
Otherwise, I can just re-open; let me know.

@g-gaston
Contributor

g-gaston commented Oct 4, 2022

Yeah, sounds good. I'll write up an issue and open a new PR referencing this one.

Labels
cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
9 participants