
Node unable to rejoin after failure #593

Closed

matt0x6F opened this issue Aug 28, 2018 · 3 comments

@matt0x6F

For some context, I am using the MySQL operator by PressLabs on Kubernetes, which utilizes this application. My Kubernetes nodes are preemptible, which means they can occasionally die (usually once a day).

I'm observing an interesting behavior with a cluster of three orchestrator nodes. They all work well until one of the nodes dies; when a replacement comes up, it looks like the other two ignore it.

Here are some logs. First, the Kubernetes events for the restarted orchestrator pod:

2018-08-27 18:18:27.000 CDT Successfully pulled image "quay.io/presslabs/orchestrator:v3.0.11-r21"
2018-08-27 18:18:27.000 CDT Created container 
2018-08-27 18:18:27.000 CDT Started container
2018-08-27 18:18:37.000 CDT Readiness probe failed: HTTP probe failed with statuscode: 500

The failing healthcheck goes on perpetually.
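For reference, the probe can be reproduced from inside the orchestrator container with something like the sketch below. The /api/raft-health endpoint is the one showing up in the logs further down; the port (orchestrator's default ListenAddress of :3000) and the raft-state/raft-leader endpoints are assumptions on my part:

  # Reproduce what the readiness probe sees (port 3000 is assumed, the default ListenAddress).
  curl -i http://127.0.0.1:3000/api/raft-health
  # If these endpoints exist in this build, they show what the node thinks its raft role/leader is:
  curl -s http://127.0.0.1:3000/api/raft-state
  curl -s http://127.0.0.1:3000/api/raft-leader

As long as the node cannot join the raft quorum, /api/raft-health presumably keeps returning 500, which matches the perpetually failing probe.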

This is emitted from the node that restarted:

I  [martini] Completed 500 Internal Server Error in 7.805308ms
I  [martini] Started GET /api/raft-health for 10.8.33.1:48672
E  2018/08/27 23:18:36 [INFO] raft: Node at 10.8.33.10:10008 [Candidate] entering Candidate state
E  2018/08/27 23:18:36 [WARN] raft: Election timeout reached, restarting election
E  2018/08/27 23:18:35 [DEBUG] raft: Vote granted from 10.8.33.10:10008. Tally: 1
E  2018/08/27 23:18:35 [DEBUG] raft: Votes needed: 2
E  2018/08/27 23:18:35 [WARN] raft: Remote peer 10.8.31.3:10008 does not have local node 10.8.33.10:10008 as a peer
E  2018/08/27 23:18:35 [WARN] raft: Remote peer 10.8.32.3:10008 does not have local node 10.8.33.10:10008 as a peer
E  2018/08/27 23:18:34 [INFO] raft: Node at 10.8.33.10:10008 [Candidate] entering Candidate state
E  2018/08/27 23:18:34 [WARN] raft: Election timeout reached, restarting election
E  2018/08/27 23:18:32 [DEBUG] raft: Vote granted from 10.8.33.10:10008. Tally: 1
E  2018/08/27 23:18:32 [DEBUG] raft: Votes needed: 2
E  2018/08/27 23:18:32 [WARN] raft: Remote peer 10.8.32.3:10008 does not have local node 10.8.33.10:10008 as a peer
E  2018/08/27 23:18:32 [WARN] raft: Remote peer 10.8.31.3:10008 does not have local node 10.8.33.10:10008 as a peer
E  2018/08/27 23:18:30 [INFO] raft: Node at 10.8.33.10:10008 [Candidate] entering Candidate state
E  2018/08/27 23:18:30 [WARN] raft: Heartbeat timeout from "" reached, starting election
E  2018/08/27 23:18:29 [INFO] raft: Node at 10.8.33.10:10008 [Follower] entering Follower state (Leader: "")
E  2018/08/27 23:18:29 [INFO] raft: Restored from snapshot 15915-17741-1535409376687
E  2018-08-27 23:18:27 FATAL 2018-08-27 23:18:27 ERROR failed to open raft store: lookup mysql-operator-orchestrator-1.mysql-operator-orchestrator-headless on 10.11.240.10:53: no such host
E  2018-08-27 23:18:27 ERROR failed to open raft store: lookup mysql-operator-orchestrator-1.mysql-operator-orchestrator-headless on 10.11.240.10:53: no such host
E  2018-08-27 23:18:27 ERROR lookup mysql-operator-orchestrator-1.mysql-operator-orchestrator-headless on 10.11.240.10:53: no such host
E  2018-08-27 23:18:27 ERROR lookup mysql-operator-orchestrator-1.mysql-operator-orchestrator-headless on 10.11.240.10:53: no such host
E  2018-08-27 23:18:27 ERROR lookup mysql-operator-orchestrator-1.mysql-operator-orchestrator-headless on 10.11.240.10:53: no such host

This is emitted from the other nodes:

E  2018/08/28 00:48:48 [DEBUG] raft: Votes needed: 2
E  2018/08/28 00:48:48 [WARN] raft: Remote peer 10.8.32.3:10008 does not have local node 10.8.33.10:10008 as a peer
E  2018/08/28 00:48:48 [INFO] raft: Node at 10.8.33.10:10008 [Candidate] entering Candidate state
E  2018/08/28 00:48:48 [WARN] raft: Election timeout reached, restarting election
E  2018/08/28 00:48:48 [WARN] raft: Rejecting vote request from 10.8.33.10:10008 since we have a leader: 10.8.32.3:10008
E  2018/08/28 00:48:48 [DEBUG] raft: Failed to contact 10.8.30.6:10008 in 1h34m3.397022732s
E  2018/08/28 00:48:48 [DEBUG] raft: Failed to contact 10.8.30.6:10008 in 1h34m2.919864839s
I  [martini] Started GET /api/lb-check for 10.8.31.1:60068
E  2018/08/28 00:48:48 [WARN] raft: Rejecting vote request from 10.8.33.10:10008 since we have a leader: 10.8.32.3:10008
I  k8s.io update kube-system:cluster-autoscaler cluster-autoscaler {"@type":"type.googleapis.com/google.cloud.audit.AuditLog","status":{},"authenticationInfo":{"principalEmail":"cluster-autoscaler"},"requestMetadata":{"callerIp":"::1"},"serviceName":"k8s.io","methodName":"io.k8s.core.v1.endpoints.update","authorizationInfo":[{"resource":"core/v1/namespaces/kube-sys… k8s.io update kube-system:cluster-autoscaler cluster-autoscaler 
E  2018/08/28 00:48:47 [WARN] raft: Rejecting vote request from 10.8.33.10:10008 since we have a leader: 10.8.32.3:10008
I  2018-08-28T00:48:47,943449832+00:00 requests.cpu needs updating. Is: '', want: '100m'.
E  Error from server (NotFound): daemonsets.extensions "fluentd-gcp-v3.0.0" not found
I  2018-08-28T00:48:47,791328941+00:00 fluentd-gcp-scaling-policy not found in namespace kube-system, using defaults.
E  Error from server (NotFound): scalingpolicies.scalingpolicy.kope.io "fluentd-gcp-scaling-policy" not found
E  2018/08/28 00:48:47 [DEBUG] raft: Votes needed: 2
E  2018/08/28 00:48:47 [WARN] raft: Remote peer 10.8.31.3:10008 does not have local node 10.8.33.10:10008 as a peer
E  2018/08/28 00:48:47 [INFO] raft: Node at 10.8.33.10:10008 [Candidate] entering Candidate state

It seems like a node should be able to rejoin after failure, even if it's using a different IP address.

@matt0x6F
Author

I've also opened this issue with PressLabs, as I'm not really sure where the failure is occurring.

@shlomi-noach
Collaborator

@Mattouille at this time the raft setup is not as dynamic as one might wish it to be. To completely replace a box in the raft setup, one must:

  • Update the config file on all hosts to reflect the identities of all hosts (thus, if you have a 3-node setup, each of the 3 nodes should list all 3 nodes); a sketch of such a config follows after this list.
  • Restart orchestrator on all nodes where the config has changed (i.e. everywhere). This can be a rolling restart. You may (and will) lose leadership for a few seconds.
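
For illustration, a minimal sketch of what that config might look like, using the raft addresses from your logs. The key names (RaftEnabled, RaftDataDir, RaftBind, RaftNodes) are orchestrator's raft settings; the exact values, the data dir, and how the PressLabs operator actually renders this file are assumptions on my part:

  {
    "RaftEnabled": true,
    "RaftDataDir": "/var/lib/orchestrator",
    "RaftBind": "10.8.33.10",
    "RaftNodes": ["10.8.31.3", "10.8.32.3", "10.8.33.10"]
  }

RaftBind is the local node's own address and differs per host, while RaftNodes must be identical on all hosts and must include the replacement node's new address; once every host agrees on the full peer list, restart each orchestrator in turn.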

Please let me know if this works.

@matt0x6F
Author

That's consistent with what I observed. I suspect the MySQL operator for Kubernetes by PressLabs is actually supposed to be handling this, and isn't. I'll close this issue and reopen it if need be.
