etcdserver request timed out, when marking a node as master by adding a label and a taint #937
Comments
I think we can add some retries here.
@neolit123 PTAL
@xlgao-zju thanks for the report.
if it only happens sometimes, i wonder if it's a networking issue. /cc @detiber @stealthybox
Yes, I think it may be a networking issue too. But I checked etcd, and it worked well. So I do not know how to avoid this. I think adding some retries would help.
@xlgao-zju I'm a bit hesitant to add retries without an explicit reason to do so, since it could mask other potential issues. In this case, if the control plane node has some type of intermittent connectivity issue with etcd, there will be other cascading issues with the cluster. When invoking kubeadm init, are you using a config file or passing in any additional arguments? If you are using the default config, or a config that doesn't specify an external etcd cluster, then I would have additional concerns: connectivity issues to the etcd static pod would be even more worrying, since it is deployed locally on the host.
@detiber I used the config file. I created a three-node etcd cluster before I issued kubeadm init. Feel free to ping me if you need more evidence.
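For reference, a kubeadm 1.9-era config pointing at an external etcd cluster looks roughly like the sketch below. This is only an illustration of the setup being described: the endpoints and certificate paths are placeholders, not values taken from this report.

```sh
# Hypothetical v1alpha1 MasterConfiguration with an external three-node etcd;
# endpoints and certificate paths are illustrative only.
cat <<EOF > kubeadm-config.yaml
apiVersion: kubeadm.k8s.io/v1alpha1
kind: MasterConfiguration
etcd:
  endpoints:
  - https://10.0.0.11:2379
  - https://10.0.0.12:2379
  - https://10.0.0.13:2379
  caFile: /etc/etcd/pki/ca.crt
  certFile: /etc/etcd/pki/client.crt
  keyFile: /etc/etcd/pki/client.key
EOF

kubeadm init --config=kubeadm-config.yaml
```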
I came across this issue again today. Our team creates k8s clusters hundreds of times every day, so we run into some rare issues. :) This is the etcd log during that period:
-- Logs begin at Tue 2018-06-26 20:09:29 CST, end at Tue 2018-06-26 16:23:45 CST. --
Jun 26 12:14:07 iZwz9b0r3wwhsq4dee7ut5Z etcd[9543]: adff8e2c78fa38a1 [term: 2] received a MsgVote message with higher term from f7babc917426a5a8 [term: 3]
Jun 26 12:14:07 iZwz9b0r3wwhsq4dee7ut5Z etcd[9543]: adff8e2c78fa38a1 became follower at term 3
Jun 26 12:14:07 iZwz9b0r3wwhsq4dee7ut5Z etcd[9543]: adff8e2c78fa38a1 [logterm: 2, index: 1290, vote: 0] cast MsgVote for f7babc917426a5a8 [logterm: 2, index: 1290] at term 3
Jun 26 12:14:07 iZwz9b0r3wwhsq4dee7ut5Z etcd[9543]: raft.node: adff8e2c78fa38a1 lost leader a680a06346fef120 at term 3
Jun 26 12:14:07 iZwz9b0r3wwhsq4dee7ut5Z etcd[9543]: raft.node: adff8e2c78fa38a1 elected leader f7babc917426a5a8 at term 3
Jun 26 12:14:25 iZwz9b0r3wwhsq4dee7ut5Z etcd[9543]: read-only range request "key:\"/registry/clusterrolebindings\" range_end:\"/registry/clusterrolebindingt\" count_only:true " with result "range_response_count:0 size:7" took too long (803.326661ms) to execute
Jun 26 12:14:25 iZwz9b0r3wwhsq4dee7ut5Z etcd[9543]: read-only range request "key:\"/registry/namespaces/kube-system\" " with result "range_response_count:1 size:178" took too long (187.157103ms) to execute
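The "took too long" warnings above typically indicate slow writes or an overloaded member rather than a networking problem. As a rough cross-check (the endpoint and TLS handling below are assumptions; a secured cluster needs the proper --cacert/--cert/--key flags), etcd's health and WAL fsync latency can be inspected directly:

```sh
# Basic member health check (add --cacert/--cert/--key for a TLS-secured cluster).
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 endpoint health

# WAL fsync latency histogram: sustained high buckets suggest slow storage.
# The metrics endpoint may also require client certificates depending on setup.
curl -sk https://127.0.0.1:2379/metrics | grep etcd_disk_wal_fsync_duration_seconds
```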
@xlgao-zju what type of storage is backing these clusters? If etcd isn't backed by fast enough storage, it could potentially account for the issue you are seeing.
@detiber I use a cloud disk on Alibaba Cloud. It seems the cloud disk is kind of slow... Any suggestions to avoid this issue?
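One way to confirm whether the cloud disk is the bottleneck is the fio recipe commonly used to benchmark disks for etcd, which measures fdatasync latency on the directory backing etcd's data. The target directory here is an assumption; a commonly quoted rule of thumb is that 99th-percentile fdatasync latency should stay well under 10ms.

```sh
# Benchmark fdatasync latency on the disk that backs etcd's data dir
# (directory is an assumption; point it at the actual etcd volume).
mkdir -p /var/lib/etcd/fio-test
fio --name=etcd-disk-check \
    --directory=/var/lib/etcd/fio-test \
    --rw=write --ioengine=sync --fdatasync=1 \
    --size=22m --bs=2300
rm -rf /var/lib/etcd/fio-test
```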
@xlgao-zju in the past for ci/testing envs I have used a memory-backed tmpfs to work around slow disk access. It doesn't solve the problem for longer-lived envs, but works in a pinch for short-lived envs.
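A minimal sketch of that workaround, assuming the default kubeadm etcd data dir of /var/lib/etcd; everything in the tmpfs is lost on reboot, so it is only suitable for throwaway clusters:

```sh
# CI/short-lived clusters only: etcd data lives in RAM and vanishes on reboot.
mkdir -p /var/lib/etcd
mount -t tmpfs -o size=512m tmpfs /var/lib/etcd
```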
@detiber Yes, tmpfs works fine for short-lived envs. Is there any solution that can fix this issue (except changing the type of cloud disk :P)?
I'm also running into this while running kubespray with kubeadm enabled. I have 3 separate etcd nodes.
/opt/bin/kubeadm init --config=/etc/kubernetes/kubeadm-config.v1alpha2.yaml --ignore-preflight-errors=all -v256
-- SNIP --
[markmaster] Marking the node k8s-acc-api01 as master by adding the label "node-role.kubernetes.io/master=''"
-- REPEATS OVER AND OVER --
I0828 11:45:14.174411 26618 round_trippers.go:386] curl -k -v -XGET -H "Accept: application/json, */*" -H "User-Agent: kubeadm/v1.11.2 (linux/amd64) kubernetes/bb9ffb1" 'https://k8s-api.acc.verwilst:6443/api/v1/nodes/k8s-acc-api01'
Using Kubernetes/kubeadm 1.11.2 on CoreOS. |
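For reference, the markmaster step that is looping here is roughly equivalent to labelling and tainting the node by hand (kubeadm itself patches the Node object; the node name below is taken from the log above). This can be retried manually once the API server responds again:

```sh
# Roughly what kubeadm's markmaster phase applies to the node; illustration only.
kubectl label node k8s-acc-api01 node-role.kubernetes.io/master=
kubectl taint node k8s-acc-api01 node-role.kubernetes.io/master:NoSchedule
```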
I think this issue is not related to the originally reported error:
Regarding retries: there are (and were in 1.9.7) retries built into the code. However, there is what I believe to be a bug: if the connection to the API server returns an error, including a timeout error, there will be no retry. I think the client response handling needs to be smarter about the errors it checks for so that it can retry where appropriate. I believe this line is the culprit behind not getting retries where we would expect them.
Closing as stale.
Is this issue fixed? Why is it closed? I still have the issue in 1.13.
Is this a BUG REPORT or FEATURE REQUEST?
BUG REPORT

Versions
kubeadm version (use kubeadm version): 1.9.7
Environment:
Kubernetes version (use kubectl version): 1.9.7
Kernel (use uname -a): 3.10.0-693.2.2.el7.x86_64

What happened?
Etcdserver request timed out, when marking a node as master by adding a label and a taint.
The logs are:

What you expected to happen?
kubeadm init succeeded.

How to reproduce it (as minimally and precisely as possible)?
Issue kubeadm init, and it happens sometimes...

Anything else we need to know?
The etcd works well...