document one should restart all system components after restoring etcd #24911
Conversation
@@ -200,6 +200,8 @@ If the access URLs of the restored cluster is changed from the previous cluster,

If the majority of etcd members have permanently failed, the etcd cluster is considered failed. In this scenario, Kubernetes cannot make any changes to its current state. Although the scheduled pods might continue to run, no new pods can be scheduled. In such cases, recover the etcd cluster and potentially reconfigure Kubernetes API server to fix the issue.

You should restart all Kubernetes system components after restoring the etcd cluster.
This sounds important, so:
You should restart all Kubernetes system components after restoring the etcd cluster.
{{< note >}}
You should restart all Kubernetes control plane components after restoring the
etcd cluster, so as to be confident that the cluster is not relying on stale cache data.
{{< /note >}}
(do you also need to restart all nodes? I'm assuming not)
kubelet also watches the apiserver and keeps some local cache, I think, so I would suggest restarting the kubelets as well, just to be safe. @jingyih do you have any experience restoring etcd in Kubernetes? If so, could you share your opinion on this?
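(For reference, a minimal sketch of restarting kubelets after a restore, assuming systemd-managed nodes with a unit named kubelet; adjust this to however your nodes actually run the kubelet.)

```shell
# Run on every node. Restarting the kubelet makes it re-list and re-watch
# objects from the API server instead of trusting any pre-restore local cache.
sudo systemctl restart kubelet
sudo systemctl is-active kubelet   # confirm the kubelet came back up
```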
In general, we don't recommend restoring etcd when any apiserver is running. So what we recommend is:
- stop all kube-apiserver
- restore state in etcd
- restart all kube-apiserver
Given that the restore takes a bit of time, the critical components will lose the leader lock and restart themselves (we enable leader election by default, no matter how many of them are running).
We should recommend restarting any components to ensure that they don't rely on stale data, but in practice they will relist on their own once all kube-apiservers have been back up for some time.
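To make that sequence concrete, here is a rough shell sketch for a kubeadm-style control plane where kube-apiserver runs as a static Pod; the manifest path, data directory, and snapshot file below are illustrative assumptions, not something prescribed by this PR.

```shell
# 1. Stop all kube-apiserver instances (repeat on each control plane node).
#    With static Pods, moving the manifest out of the manifests directory stops the Pod.
sudo mv /etc/kubernetes/manifests/kube-apiserver.yaml /etc/kubernetes/kube-apiserver.yaml.stopped

# 2. Restore the etcd state from a snapshot into a fresh data directory
#    (run for each etcd member).
ETCDCTL_API=3 etcdctl snapshot restore /backup/snapshot.db \
  --data-dir /var/lib/etcd-from-backup
# ...then point the etcd manifest or unit at the restored data directory and start etcd...

# 3. Restart all kube-apiserver instances by putting the manifest back.
sudo mv /etc/kubernetes/kube-apiserver.yaml.stopped /etc/kubernetes/manifests/kube-apiserver.yaml
```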
Totally agree with @wojtek-t's comment here. Restoring an etcd cluster with multiple etcd servers is an offline process (i.e., it cannot be performed in a rolling fashion). It is better to stop the apiservers first.
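To illustrate that offline restore for a cluster with multiple etcd members, each member is typically restored from the same snapshot with its own identity before any member is started again; the member names, hosts, and cluster token below are placeholders.

```shell
# Run once per member, on that member's host, while the whole etcd cluster is down.
ETCDCTL_API=3 etcdctl snapshot restore /backup/snapshot.db \
  --name m1 \
  --data-dir /var/lib/etcd-from-backup \
  --initial-cluster m1=https://host1:2380,m2=https://host2:2380,m3=https://host3:2380 \
  --initial-cluster-token restored-cluster \
  --initial-advertise-peer-urls https://host1:2380
# Repeat with --name m2 / m3 and their peer URLs, then start all members together.
```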
ack. Will update soon. Thanks for the suggestions!
Sorry I lost track of this PR. I updated the paragraph accordingly. Please take a look
I don't want to hijack the thread, but based on the changes made, should we consider using the {{ caution }} tag here now? (Where do we draw the line between {{ note }} and {{ caution }}?)
Possibly update the note to a caution (follow-on PR)? The text changes look good. Thanks!
force-pushed from 94713c8 to 19ee292
Some minor nits - other than that lgtm.
recommend:

- stop *all* kube-apiserver
- restore state in etcd
nit: s/etcd/all etcd instances/
In general, we don't recommend restoring etcd when any apiserver is running. We
recommend:

- stop *all* kube-apiserver
nit: pluralize (i.e. kube-apiservers) or maybe kube-apiserver instances?
Same below
- restore state in etcd
- restart all kube-apiserver

We also recommend recommend restarting any components to ensure that they don't
nit: s/recommend recommend/recommend
force-pushed from 19ee292 to a5dedf9
@wojtek-t Updated. Thanks for reviewing!
LGTM label has been added. Git tree hash: 66bbaa7ee9db18f5d093f30f57a9f28abc494084
We also recommend restarting any components to ensure that they don't
rely on some stale data. Note that in practice, given that the restore takes a
bit of time, the critical components will loose leader lock and they will
grammar nits:
bit of time, the critical components will loose leader lock and they will
bit of time; the critical components will lose leader lock and will
@@ -200,6 +200,20 @@ If the access URLs of the restored cluster is changed from the previous cluster,

If the majority of etcd members have permanently failed, the etcd cluster is considered failed. In this scenario, Kubernetes cannot make any changes to its current state. Although the scheduled pods might continue to run, no new pods can be scheduled. In such cases, recover the etcd cluster and potentially reconfigure Kubernetes API server to fix the issue.

{{< note >}}
In general, we don't recommend restoring etcd when any apiserver is running. We
Hi @roycaihw, possible edit for replacing lines 204-205:
If any API servers are running in your cluster, you should not attempt to restore instances of etcd.
Instead, follow these steps to restore etcd:
- restore state in all etcd instances
- restart all kube-apiserver instances

We also recommend restarting any components to ensure that they don't
nit: Which components would you recommend? It seems a bit vague.
We also recommend restarting any components to ensure that they don't
rely on some stale data. Note that in practice, given that the restore takes a
bit of time, the critical components will loose leader lock and they will
restart themselves.
What do you think about:
During the restoration, critical components lose leader lock and restart themselves. OR
While the restoration occurs, critical components lose leader lock and restart themselves.
Updated. Please take a look.
force-pushed from a5dedf9 to c617542
/lgtm
LGTM label has been added. Git tree hash: 08d4745953040a36a6b092d6b617af46f4f8e2f6
Thanks @roycaihw
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: kbhawkey
xref kubernetes/kubernetes#95958 (comment)
Restoring etcd without restarting kube-apiserver will cause new data in the caches to fight against old data in etcd. After restoring etcd, one should at least restart all system components; this will refresh the watch cache in kube-apiserver and the informer caches in controllers. Assuming the controllers are level-based with no weird side effects, things should reconcile and work. We cannot promise that things built on top of Kubernetes with side effects will work 100%, because restoring etcd is going back in time. Copying etcd experts here.
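As a hedged illustration of what "restart all system components" could look like on a kubeadm-style cluster (static-Pod control plane and systemd-managed kubelets; the directory layout and unit name are assumptions), one way to force the watch and informer caches to be rebuilt is:

```shell
# On each control plane node: move the static Pod manifests away and back so the
# kubelet recreates kube-apiserver, kube-controller-manager and kube-scheduler
# with empty watch/informer caches.
sudo mkdir -p /etc/kubernetes/manifests-stopped
sudo mv /etc/kubernetes/manifests/kube-{apiserver,controller-manager,scheduler}.yaml \
  /etc/kubernetes/manifests-stopped/
sleep 30   # give the kubelet time to tear the Pods down
sudo mv /etc/kubernetes/manifests-stopped/*.yaml /etc/kubernetes/manifests/

# On every node: restart the kubelet so its local caches are rebuilt as well.
sudo systemctl restart kubelet
```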
fixes kubernetes/kubernetes#95958.
/cc @jingyih @jpbetz @wojtek-t
cc @weinong