
document one should restart all system components after restoring etcd #24911

Merged — 1 commit merged into kubernetes:master on Dec 7, 2020

Conversation

@roycaihw (Member) commented Nov 5, 2020

xref kubernetes/kubernetes#95958 (comment)

Restoring etcd without restarting kube-apiserver will cause the new data in the caches to fight against the old data in etcd. After restoring etcd, one should at least restart all system components -- this refreshes the watch cache in kube-apiserver and the informer caches in controllers. Assuming the controllers are level-based with no weird side effects, things should reconcile and work. We cannot promise that things built on top of Kubernetes with side effects will work 100%, because restoring etcd is going back in time. Copying etcd experts here.
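To make the stale-cache problem concrete, here is a minimal sketch (not the Kubernetes source; the kubeconfig location is whatever your environment provides) of how a controller caches cluster state in memory through a client-go shared informer. The cache is filled by one initial list and then kept current by a watch; after an etcd restore it can hold objects, and resourceVersion bookmarks, newer than what etcd now contains, which is why the process needs a restart to force a fresh relist.

```go
// A minimal sketch of a controller populating an in-memory cache of
// cluster state via a shared informer.
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumes a kubeconfig at the default location ($HOME/.kube/config).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// List once, then watch from the returned resourceVersion.
	factory := informers.NewSharedInformerFactory(client, 10*time.Minute)
	podLister := factory.Core().V1().Pods().Lister()

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	factory.WaitForCacheSync(stop)

	// This snapshot lives in process memory. If etcd is rolled back to an
	// older snapshot underneath a running component, this cache (and its
	// resourceVersion bookmark) is "from the future" relative to etcd --
	// hence the restart-and-relist after a restore.
	pods, err := podLister.List(labels.Everything())
	if err != nil {
		panic(err)
	}
	fmt.Printf("cached %d pods\n", len(pods))
}
```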

fixes kubernetes/kubernetes#95958.

/cc @jingyih @jpbetz @wojtek-t
cc @weinong

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. language/en Issues or PRs related to English language size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. sig/docs Categorizes an issue or PR as relevant to SIG Docs. labels Nov 5, 2020
@netlify netlify bot commented Nov 5, 2020

✔️ Deploy preview for kubernetes-io-master-staging ready!

🔨 Explore the source changes: c617542

🔍 Inspect the deploy logs: https://app.netlify.com/sites/kubernetes-io-master-staging/deploys/5fca83e03893e10008568a6c

😎 Browse the preview: https://deploy-preview-24911--kubernetes-io-master-staging.netlify.app

@@ -200,6 +200,8 @@ If the access URLs of the restored cluster is changed from the previous cluster,

If the majority of etcd members have permanently failed, the etcd cluster is considered failed. In this scenario, Kubernetes cannot make any changes to its current state. Although the scheduled pods might continue to run, no new pods can be scheduled. In such cases, recover the etcd cluster and potentially reconfigure Kubernetes API server to fix the issue.

You should restart all Kubernetes system components after restoring the etcd cluster.
Contributor:

This sounds important, so:

Suggested change
You should restart all Kubernetes system components after restoring the etcd cluster.
{{< note >}}
You should restart all Kubernetes control plane components after restoring the
etcd cluster, so as to be confident that the cluster is not relying on stale cache data.
{{< /note >}}

(do you also need to restart all nodes? I'm assuming not)

Member Author (@roycaihw):

kubelet also watches the apiserver and keeps some local cache I think, so I would suggest restarting them just to be safe. @jingyih do you have any experience restoring etcd in Kubernetes? If so, could you share your opinion on this?

Member (@wojtek-t):

In general, we don't recommend restoring etcd when any apiserver is running. So what we recommend is:

  • stop all kube-apiserver
  • restore state in etcd
  • restart all kube-apiserver

Given that the restore takes a bit of time, the critical components will lose the leader lock and restart themselves (we enable leader election by default, no matter how many replicas are running).
We should recommend restarting any components to ensure that they don't rely on stale data, but in practice they will relist on their own once all kube-apiservers have been back up for some time.
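As an illustration of that restart-themselves behavior, here is a minimal sketch of the client-go leader election pattern, assuming a Lease lock in kube-system; the lock name "demo-controller" is illustrative, and the timeouts shown match kube-controller-manager's defaults. While every kube-apiserver is down for the restore, the lease cannot be renewed, OnStoppedLeading fires, and the component exits so that its supervisor restarts it with empty caches.

```go
// A minimal sketch of client-go leader election; not how any particular
// component wires it up, just the pattern.
package main

import (
	"context"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	id, _ := os.Hostname()
	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Name: "demo-controller", Namespace: "kube-system"},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: id},
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 15 * time.Second,
		RenewDeadline: 10 * time.Second,
		RetryPeriod:   2 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				<-ctx.Done() // the controller's reconcile loops run here
			},
			OnStoppedLeading: func() {
				// The lease could not be renewed -- e.g. every
				// kube-apiserver is down for the restore. Exit and
				// let the supervisor restart us with empty caches.
				os.Exit(1)
			},
		},
	})
}
```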

Contributor:

Totally agree with @wojtek-t 's comment here. Restoring an etcd cluster with multiple etcd servers is an offline process (i.e., it cannot be performed in a rolling fashion). It is better to stop the apiservers first.
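Put together, the stop/restore/restart ordering might look like the following rough sketch. It assumes a control-plane node whose components run as systemd units named kube-apiserver and etcd (many clusters instead run kube-apiserver as a static pod, in which case you would move its manifest out of the static-pod directory); the snapshot path and data directory are placeholders.

```go
// A rough orchestration sketch of the recommended sequence. Unit names,
// the snapshot path, and the data directory are placeholders; adapt them
// to how your control plane is actually managed.
package main

import (
	"log"
	"os/exec"
)

// run executes a command and aborts on failure, printing its output.
func run(name string, args ...string) {
	out, err := exec.Command(name, args...).CombinedOutput()
	if err != nil {
		log.Fatalf("%s %v failed: %v\n%s", name, args, err, out)
	}
}

func main() {
	// 1. Stop *all* kube-apiserver instances before touching etcd.
	run("systemctl", "stop", "kube-apiserver")

	// 2. Restore state in all etcd instances from the snapshot.
	run("systemctl", "stop", "etcd")
	run("etcdctl", "snapshot", "restore", "/var/backups/etcd-snapshot.db",
		"--data-dir", "/var/lib/etcd-restored")
	// Point etcd at the restored data directory before starting it again.
	run("systemctl", "start", "etcd")

	// 3. Restart all kube-apiserver instances (and ideally the other
	// control plane components, so none of them serve stale caches).
	run("systemctl", "start", "kube-apiserver")
}
```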

Member Author (@roycaihw):

ack. Will update soon. Thanks for the suggestions!

Member Author (@roycaihw):

Sorry I lost track of this PR. I updated the paragraph accordingly. Please take a look

Contributor (@nate-double-u) commented Dec 4, 2020:

I don't want to hijack the thread, but based on the changes made, should we consider using the {{ caution }} tag here now? (where do we draw the line between {{ note }} and {{ caution }}?)

Contributor (@kbhawkey) commented Dec 7, 2020:

Possibly update note to caution (follow on PR)? The text changes look good. Thanks!

@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Dec 1, 2020
Member (@wojtek-t) left a comment:

Some minor nits - other than that lgtm.

recommend:

- stop *all* kube-apiserver
- restore state in etcd
Member (@wojtek-t):

nit: s/etcd/all etcd instances/

In general, we don't recommend restoring etcd when any apiserver is running. We
recommend:

- stop *all* kube-apiserver
Member (@wojtek-t):

nit: pluralize (i.e. kube-apiservers) or maybe kube-apiserver instances?

Same below

- restore state in etcd
- restart all kube-apiserver

We also recommend recommend restarting any components to ensure that they don't
Member (@wojtek-t):

nit: s/recommend recommend/recommend

@roycaihw (Member Author) commented Dec 3, 2020

@wojtek-t Updated. Thanks for reviewing!

@wojtek-t (Member) commented Dec 4, 2020

/lgtm

/assign @sftim
@sftim - can you please help with reviewing this small change?

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 4, 2020
@k8s-ci-robot (Contributor):

LGTM label has been added.

Git tree hash: 66bbaa7ee9db18f5d093f30f57a9f28abc494084

@kbhawkey (Contributor) commented Dec 4, 2020

👀


We also recommend restarting any components to ensure that they don't
rely on some stale data. Note that in practice, given that the restore takes a
bit of time, the critical components will loose leader lock and they will
Contributor:

grammar nits:

Suggested change
bit of time, the critical components will loose leader lock and they will
bit of time; the critical components will lose leader lock and will

@@ -200,6 +200,20 @@ If the access URLs of the restored cluster is changed from the previous cluster,

If the majority of etcd members have permanently failed, the etcd cluster is considered failed. In this scenario, Kubernetes cannot make any changes to its current state. Although the scheduled pods might continue to run, no new pods can be scheduled. In such cases, recover the etcd cluster and potentially reconfigure Kubernetes API server to fix the issue.

{{< note >}}
In general, we don't recommend restoring etcd when any apiserver is running. We
Contributor:

Hi @roycaihw , Possible edit for replacing lines 204-205:
If any API servers are running in your cluster, you should not attempt to restore instances of etcd.
Instead, follow these steps to restore etcd:

- restore state in all etcd instances
- restart all kube-apiserver instances

We also recommend restarting any components to ensure that they don't
Contributor:

nit: Which components would you recommend? It seems a bit vague.

We also recommend restarting any components to ensure that they don't
rely on some stale data. Note that in practice, given that the restore takes a
bit of time, the critical components will loose leader lock and they will
restart themselves.
Contributor:

What do you think about:
During the restoration, critical components lose leader lock and restart themselves. OR
While the restoration occurs, critical components lose leader lock and restart themselves.

Member Author (@roycaihw):

Updated. Please take a look.

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 4, 2020
@wojtek-t (Member) commented Dec 7, 2020

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 7, 2020
@k8s-ci-robot (Contributor):

LGTM label has been added.

Git tree hash: 08d4745953040a36a6b092d6b617af46f4f8e2f6

@kbhawkey (Contributor) commented Dec 7, 2020

Thanks @roycaihw
/lgtm

@kbhawkey (Contributor) commented Dec 7, 2020

/approve

@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kbhawkey

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 7, 2020
@k8s-ci-robot k8s-ci-robot merged commit b905af1 into kubernetes:master Dec 7, 2020