
ETCD snapshot/restore support #7796

Closed
musaprg opened this issue Dec 22, 2022 · 13 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. kind/proposal Issues or PRs related to proposals. priority/backlog Higher priority than priority/awaiting-more-evidence. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@musaprg
Member

musaprg commented Dec 22, 2022

User Story

As an operator, I'd like to maintain the etcd snapshot/restore functionalities with Cluster API (KubeadmControlPlane).

Detailed Description

Etcd snapshot and restore capabilities are usually crucial for administrators. We can achieve them by using community-provided operators (e.g., etcd-operator, which is already archived though...) or etcdctl directly. However, restore should sometimes be considered part of the cluster lifecycle, since it requires stopping and starting kube-apiserver before and after restoring. It would be nice to provide etcd snapshot/restore functionality on the CAPI side so that we can maintain it easily.
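For context, the manual procedure looks roughly like this (a sketch assuming a kubeadm-style control plane with static pod manifests and the default PKI paths; adjust endpoints and certificate locations for your environment):

```shell
# Take a snapshot from a running etcd member.
ETCDCTL_API=3 etcdctl snapshot save /var/backups/etcd-snapshot.db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Restore: stop kube-apiserver (and etcd) first, e.g. by moving the static
# pod manifests out of the way, then restore into a fresh data directory.
# (Newer etcd versions perform the restore with `etcdutl snapshot restore`.)
mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/
mv /etc/kubernetes/manifests/etcd.yaml /tmp/
ETCDCTL_API=3 etcdctl snapshot restore /var/backups/etcd-snapshot.db \
  --data-dir=/var/lib/etcd-restored

# Point the etcd manifest at the restored data dir, then bring both back.
mv /tmp/etcd.yaml /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/
```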

(I couldn't find any discussions related to this except for #7399, so I filed this topic as a new issue. Please let me know if there are any places where we already have this kind of discussion.)

Anything else you would like to add:

(TBD)

Related Issues/PRs

/kind feature

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Dec 22, 2022
@killianmuldoon
Contributor

Thanks @musaprg - this is a really interesting idea IMO. I think it could go a long way toward helping out in disaster recovery scenarios.

A couple of questions:

  1. Do you see this as only for workload clusters? Or for the management cluster?
  2. Do you think standalone etcd is a necessary prerequisite for this?
  3. Would this require some kind of external storage in order to host the snapshot?

It would be great to get a doc together with answers to some of these questions, along with an overall problem statement and a start on implementation details, so that people interested in working on this feature can get involved.

@musaprg
Member Author

musaprg commented Dec 22, 2022

@killianmuldoon Thank you for asking. I'd like to prepare more detailed documents for this topic, but first, let me briefly share my opinions on the questions.

  1. Do you see this as only for workload clusters? Or for the management cluster?

I'm currently considering only workload clusters.

  2. Do you think standalone etcd is a necessary prerequisite for this?

I'm not sure I understand the meaning of "standalone etcd" correctly, but I'm assuming it means etcd nodes that CAPI knows about. IMO it doesn't matter how the etcd nodes are running as long as they are accessible from the management cluster.

  3. Would this require some kind of external storage to host the snapshot?

There could be several options, but I don't think it's strictly required. IMO the possible destinations are the following:

  1. Object Storage (e.g., Amazon S3), which is external storage.
  2. Persistent Volume on the management cluster, which is an in-cluster resource.
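As a rough sketch of option 2, a Job on the management cluster could run etcdctl against the workload cluster's etcd and write the snapshot to a PersistentVolume (the image, PVC, Secret, and endpoint names below are all hypothetical):

```shell
# Hypothetical: a Job mounting a backup PVC and the etcd client certs,
# taking a snapshot of the workload cluster's etcd.
kubectl apply -f - <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: etcd-snapshot
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: snapshot
        image: example.com/etcdctl:latest   # hypothetical image with etcdctl
        command:
        - etcdctl
        - --endpoints=https://workload-etcd.example.com:2379
        - --cacert=/certs/ca.crt
        - --cert=/certs/client.crt
        - --key=/certs/client.key
        - snapshot
        - save
        - /backup/snapshot.db
        volumeMounts:
        - {name: backup, mountPath: /backup}
        - {name: certs, mountPath: /certs}
      volumes:
      - name: backup
        persistentVolumeClaim: {claimName: etcd-backup}   # hypothetical PVC
      - name: certs
        secret: {secretName: workload-etcd-client-certs}  # hypothetical Secret
EOF
```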

@musaprg
Member Author

musaprg commented Dec 22, 2022

I don't think it would be good to depend on something outside the Kubernetes ecosystem, so a persistent volume would be the better candidate for the destination.

@fabriziopandini
Member

/triage accepted
I agree with @killianmuldoon that this is interesting. My impression is that this requires a small proposal defining scope and limitations, mostly because CAPI assumes bootstrap/control-plane providers are pluggable (and they are responsible for both etcd and the control plane components).

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Dec 27, 2022
@musaprg
Member Author

musaprg commented Dec 28, 2022

/assign

@enxebre
Member

enxebre commented Jan 2, 2023

As an operator, I'd like to maintain the etcd snapshot/restore functionalities with Cluster API (KubeadmControlPlane).

It'd be interesting to explore having this ability decoupled from KubeadmControlPlane, or composable, so that more control plane implementations could share it.

@Heiko-san

Heiko-san commented Mar 27, 2023

Same problem here... It would be nice if cluster-api provided a native feature for etcd snapshots (which in my opinion is absolutely crucial! I can't believe this was never thought of...).

We actually tried to implement it ourselves, but weren't very successful, since the etcd pod lacks basically everything (we can do an etcdctl snapshot save in the pod, but can't obtain the file since kubectl cp complains about missing tar ...).

I couldn't find any helpful documentation on the topic, either. Is there anything existing? There has to be a way to get snapshots from those pods, I guess...

@kfox1111

Yeah, that broke a while back. I've been working around it by snapshotting to a host mount, then pulling it off the host. Less than ideal...
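Roughly like this (a sketch; the node name is a placeholder, and it assumes the kubeadm etcd static pod mounts its data dir from the host with the default PKI paths):

```shell
# Take the snapshot into the etcd data dir, which the static pod mounts
# from the host, so the file shows up on the node's filesystem.
kubectl -n kube-system exec etcd-<node-name> -- etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  snapshot save /var/lib/etcd/snap.db

# Then pull it off the host, e.g. over SSH.
scp <node-name>:/var/lib/etcd/snap.db ./snap.db
```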

@musaprg
Member Author

musaprg commented Mar 28, 2023

We actually tried to implement it ourselves, but weren't very successful, since the etcd pod lacks basically everything (we can do an etcdctl snapshot save in the pod, but can't obtain the file since kubectl cp complains about missing tar ...).

We currently create a separate pod, apart from the etcd pods, that uses an image with the required utilities (etcdctl, aws-cli, etc.) to create and upload snapshots to S3-compatible storage. That is much easier than pulling the snapshot out of the etcd pod's local storage, but yes, I think it would be nice if CAPI's etcd snapshot support didn't require any external storage.
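Roughly, the helper pod approach looks like this (a sketch; the pod name, endpoint, certificate paths, and bucket are all hypothetical):

```shell
# Hypothetical helper pod with etcdctl and aws-cli installed: snapshot etcd,
# then upload the file to S3-compatible storage.
kubectl -n kube-system exec etcd-backup-helper -- sh -c '
  ETCDCTL_API=3 etcdctl snapshot save /tmp/snap.db \
    --endpoints=https://etcd.kube-system.svc:2379 \
    --cacert=/certs/ca.crt --cert=/certs/client.crt --key=/certs/client.key
  aws --endpoint-url "$S3_ENDPOINT" s3 cp /tmp/snap.db \
    s3://etcd-backups/snap-$(date +%Y%m%d%H%M%S).db
'
```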

@fabriziopandini
Member

fabriziopandini commented Mar 28, 2023

Adding kind/proposal because I think we should figure out if and how to make this work with different bootstrap/control-plane providers (who are ultimately responsible for defining how etcd and api server are run) and/or if and how to make this work with different types of storage (which can or cannot be related to the infrastructure provider in use).
In other words, considering the CAPI provider model, the issue is not making this work, but making it work in a generic way, or defining a clear contract around the use cases it supports (and eventually how to overcome those limitations in follow-up iterations).

/kind proposal

@k8s-ci-robot k8s-ci-robot added the kind/proposal Issues or PRs related to proposals. label Mar 28, 2023
@musaprg musaprg removed their assignment Aug 17, 2023
@fabriziopandini
Member

/priority backlog

@k8s-ci-robot k8s-ci-robot added the priority/backlog Higher priority than priority/awaiting-more-evidence. label Apr 11, 2024
@fabriziopandini
Member

The Cluster API project currently lacks enough active contributors to adequately respond to all issues and PRs.

Also, most probably this should fall under SIG etcd / the ongoing discussion about a community-maintained etcd-operator.

/close

@k8s-ci-robot
Contributor

@fabriziopandini: Closing this issue.

In response to this:

The Cluster API project currently lacks enough active contributors to adequately respond to all issues and PRs.

Also most probably this should fall under SIG etcd / the ongoing discussion about a community maintained etcd-operator

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
