For some context, I am using the MySQL Operator by PressLabs on Kubernetes, which uses this application. My Kubernetes nodes are preemptible, which means they can occasionally die (usually once a day).
I'm observing an interesting behavior with a cluster of three orchestrator nodes. They all work well until one of them dies; when a replacement comes up, the other two appear to ignore it.
Here are some orchestrator logs:
2018-08-27 18:18:27.000 CDT Successfully pulled image "quay.io/presslabs/orchestrator:v3.0.11-r21"
2018-08-27 18:18:27.000 CDT Created container
2018-08-27 18:18:27.000 CDT Started container
2018-08-27 18:18:37.000 CDT Readiness probe failed: HTTP probe failed with statuscode: 500
The failing health check continues indefinitely.
This is emitted from the node that restarted:
I [martini] Completed 500 Internal Server Error in 7.805308ms
I [martini] Started GET /api/raft-health for 10.8.33.1:48672
E 2018/08/27 23:18:36 [INFO] raft: Node at 10.8.33.10:10008 [Candidate] entering Candidate state
E 2018/08/27 23:18:36 [WARN] raft: Election timeout reached, restarting election
E 2018/08/27 23:18:35 [DEBUG] raft: Vote granted from 10.8.33.10:10008. Tally: 1
E 2018/08/27 23:18:35 [DEBUG] raft: Votes needed: 2
E 2018/08/27 23:18:35 [WARN] raft: Remote peer 10.8.31.3:10008 does not have local node 10.8.33.10:10008 as a peer
E 2018/08/27 23:18:35 [WARN] raft: Remote peer 10.8.32.3:10008 does not have local node 10.8.33.10:10008 as a peer
E 2018/08/27 23:18:34 [INFO] raft: Node at 10.8.33.10:10008 [Candidate] entering Candidate state
E 2018/08/27 23:18:34 [WARN] raft: Election timeout reached, restarting election
E 2018/08/27 23:18:32 [DEBUG] raft: Vote granted from 10.8.33.10:10008. Tally: 1
E 2018/08/27 23:18:32 [DEBUG] raft: Votes needed: 2
E 2018/08/27 23:18:32 [WARN] raft: Remote peer 10.8.32.3:10008 does not have local node 10.8.33.10:10008 as a peer
E 2018/08/27 23:18:32 [WARN] raft: Remote peer 10.8.31.3:10008 does not have local node 10.8.33.10:10008 as a peer
E 2018/08/27 23:18:30 [INFO] raft: Node at 10.8.33.10:10008 [Candidate] entering Candidate state
E 2018/08/27 23:18:30 [WARN] raft: Heartbeat timeout from "" reached, starting election
E 2018/08/27 23:18:29 [INFO] raft: Node at 10.8.33.10:10008 [Follower] entering Follower state (Leader: "")
E 2018/08/27 23:18:29 [INFO] raft: Restored from snapshot 15915-17741-1535409376687
E 2018-08-27 23:18:27 FATAL 2018-08-27 23:18:27 ERROR failed to open raft store: lookup mysql-operator-orchestrator-1.mysql-operator-orchestrator-headless on 10.11.240.10:53: no such host
E 2018-08-27 23:18:27 ERROR failed to open raft store: lookup mysql-operator-orchestrator-1.mysql-operator-orchestrator-headless on 10.11.240.10:53: no such host
E 2018-08-27 23:18:27 ERROR lookup mysql-operator-orchestrator-1.mysql-operator-orchestrator-headless on 10.11.240.10:53: no such host
E 2018-08-27 23:18:27 ERROR lookup mysql-operator-orchestrator-1.mysql-operator-orchestrator-headless on 10.11.240.10:53: no such host
E 2018-08-27 23:18:27 ERROR lookup mysql-operator-orchestrator-1.mysql-operator-orchestrator-headless on 10.11.240.10:53: no such host
This is emitted from the other nodes:
E 2018/08/28 00:48:48 [DEBUG] raft: Votes needed: 2
E 2018/08/28 00:48:48 [WARN] raft: Remote peer 10.8.32.3:10008 does not have local node 10.8.33.10:10008 as a peer
E 2018/08/28 00:48:48 [INFO] raft: Node at 10.8.33.10:10008 [Candidate] entering Candidate state
E 2018/08/28 00:48:48 [WARN] raft: Election timeout reached, restarting election
E 2018/08/28 00:48:48 [WARN] raft: Rejecting vote request from 10.8.33.10:10008 since we have a leader: 10.8.32.3:10008
E 2018/08/28 00:48:48 [DEBUG] raft: Failed to contact 10.8.30.6:10008 in 1h34m3.397022732s
E 2018/08/28 00:48:48 [DEBUG] raft: Failed to contact 10.8.30.6:10008 in 1h34m2.919864839s
I [martini] Started GET /api/lb-check for 10.8.31.1:60068
E 2018/08/28 00:48:48 [WARN] raft: Rejecting vote request from 10.8.33.10:10008 since we have a leader: 10.8.32.3:10008
I k8s.io update kube-system:cluster-autoscaler cluster-autoscaler {"@type":"type.googleapis.com/google.cloud.audit.AuditLog","status":{},"authenticationInfo":{"principalEmail":"cluster-autoscaler"},"requestMetadata":{"callerIp":"::1"},"serviceName":"k8s.io","methodName":"io.k8s.core.v1.endpoints.update","authorizationInfo":[{"resource":"core/v1/namespaces/kube-sys… k8s.io update kube-system:cluster-autoscaler cluster-autoscaler
E 2018/08/28 00:48:47 [WARN] raft: Rejecting vote request from 10.8.33.10:10008 since we have a leader: 10.8.32.3:10008
I 2018-08-28T00:48:47,943449832+00:00 requests.cpu needs updating. Is: '', want: '100m'.
E Error from server (NotFound): daemonsets.extensions "fluentd-gcp-v3.0.0" not found
I 2018-08-28T00:48:47,791328941+00:00 fluentd-gcp-scaling-policy not found in namespace kube-system, using defaults.
E Error from server (NotFound): scalingpolicies.scalingpolicy.kope.io "fluentd-gcp-scaling-policy" not found
E 2018/08/28 00:48:47 [DEBUG] raft: Votes needed: 2
E 2018/08/28 00:48:47 [WARN] raft: Remote peer 10.8.31.3:10008 does not have local node 10.8.33.10:10008 as a peer
E 2018/08/28 00:48:47 [INFO] raft: Node at 10.8.33.10:10008 [Candidate] entering Candidate state
It seems like a node should be able to rejoin after failure, even if it's using a different IP address.
@Mattouille, at this time the raft setup is not as dynamic as one might wish it to be. To completely replace a box in the raft setup, one must:
1. Update the config file on all hosts to reflect the identities of all hosts (thus, in a 3-node setup, each of the 3 nodes should list all 3 nodes; see the config sketch below).
2. Restart orchestrator on all nodes where the config has changed (i.e. everywhere). This can be a rolling restart. You may (and will) lose leadership for a few seconds.
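To make that concrete, here is a minimal sketch of what the raft-related portion of orchestrator.conf.json could look like for a 3-node setup like this one. The addresses are taken from the logs above, RaftBind is the local node's own address and differs per host, and the data directory is a placeholder; adjust all of these to your deployment.

    {
      "RaftEnabled": true,
      "RaftDataDir": "/var/lib/orchestrator",
      "RaftBind": "10.8.33.10",
      "DefaultRaftPort": 10008,
      "RaftNodes": ["10.8.31.3", "10.8.32.3", "10.8.33.10"]
    }

When a replacement node comes up on a new IP, every peer's RaftNodes list still points at the old address, which is consistent with the "does not have local node ... as a peer" warnings above; they persist until the config is updated and orchestrator is restarted everywhere.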
That's consistent with what I observed. I suspect the MySQL Operator for Kubernetes by PressLabs is actually supposed to handle this and isn't. I'll close this issue and reopen it if need be.
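For what it's worth, the logs above suggest the operator points RaftNodes at the StatefulSet's headless-service names rather than pod IPs (hence the lookup of mysql-operator-orchestrator-1.mysql-operator-orchestrator-headless). Assuming that is the case, a hostname-based config would be a sketch along these lines; note that the -0 and -2 names are assumed from the usual StatefulSet naming pattern and do not appear in the logs.

    {
      "RaftEnabled": true,
      "RaftBind": "mysql-operator-orchestrator-1.mysql-operator-orchestrator-headless",
      "DefaultRaftPort": 10008,
      "RaftNodes": [
        "mysql-operator-orchestrator-0.mysql-operator-orchestrator-headless",
        "mysql-operator-orchestrator-1.mysql-operator-orchestrator-headless",
        "mysql-operator-orchestrator-2.mysql-operator-orchestrator-headless"
      ]
    }

With stable names like these, a pod coming back on a different IP should not require reconfiguring the peers, but the FATAL "failed to open raft store: ... no such host" above shows the setup still depends on those names resolving at startup.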