Node unable to rejoin after failure #107
I opened this issue on GitHub Orchestrator to correlate, as I'm not really sure where the actual failure is occurring here.
Based on the response I got from @shlomi-noach, it seems that this is actually an operator issue. He's described the steps for recovery, which mimic my experience. Let me know your thoughts.
Also, I was actually using a Helm export of the manifests (mainly because I didn't want to run Tiller). It seems that may not have been enough to get things set up properly (although things seemed to work). I have not seen this issue replicated since, so I'll keep you informed.
This is actually still happening, albeit more rarely. I'll start up a dev cluster and see if I can get the operator to notice the failure and rectify it. It should be noted that I regularly kill instances in my cluster, so my situation may be a bit extreme :)
May I know how to do a rolling restart of the orchestrator?
Hi @tuapuikia, you can do a rolling restart by setting a new annotation on the orchestrator statefulset. @Mattouille, we know about those problems with orchestrator and will try to fix them in the next version; sorry for the late response, we are focused on rewriting the operator with kubebuilder. Also, if you have a fix for this issue, we can discuss it on Gitter.
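For reference, a minimal sketch of that approach, assuming the statefulset is named orchestrator and using a made-up annotation key: changing any annotation on the pod template makes Kubernetes roll the pods one by one, restarting orchestrator without changing its configuration.
# Strategic-merge patch sketch for the orchestrator statefulset; the
# annotation key and timestamp are placeholders, not an operator convention.
spec:
  template:
    metadata:
      annotations:
        example.com/restarted-at: "2019-01-01T00:00:00Z"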
Thank you for the reply.
It's worth noting that there is another option, one that does not require a rolling restart upon node replacement: using … Also related is a discussion in vitessio/vitess#3665.
@shlomi-noach I've set … and I'm seeing this error:
@calind could you please provide the following details:
@shlomi-noach here are the details. orchestrator.conf.json ({{ .Env.HOSTNAME }} gets replaced accordingly with orchestrator-0...2):
{
"BackendDB": "sqlite",
"Debug": false,
"ListenAddress": ":3000",
"MySQLTopologyCredentialsConfigFile": "/etc/orchestrator/orc-topology.cnf",
"RaftBind": "{{ .Env.HOSTNAME }}.orchestrator-headless",
"RaftAdvertise": "{{ .Env.HOSTNAME }}.orchestrator-headless",
"RaftDataDir": "/var/lib/orchestrator",
"RaftEnabled": true,
"RaftNodes": [
"orchestrator-0.orchestrator-headless",
"orchestrator-1.orchestrator-headless",
"orchestrator-2.orchestrator-headless"
],
"SQLite3DataFile": "/var/lib/orchestrator/orc.db"
}
Initial peer IPs:
After killing orchestrator-2.orchestrator-headless (the master):
Error logs after killing the master:
@calind if I'm reading this right, your … Then, if a box goes down and another takes its place, it would have a different … Makes sense?
Yes, it does for the … But the problem I see is with … That way, when orchestrator-2 changes its IP, it would be accepted by orchestrator-0 and orchestrator-1 as a cluster member. Another approach would be to have a shared "RaftID" and consider …
Sorry, I'm not sure I understand what the "yes" implies.
I'm wondering whether, given some time, this resolves itself?
The latest Consul code does the same. Unfortunately it also removes support for pre-defined cluster IPs and otherwise breaks other things. I don't plan to upgrade to that as yet.
yes, the mechanics for RaftBind/RaftAdvertise make sense
It doesn't recover after some time. It seems that the peer list gets set in stone in https://github.com/github/orchestrator/blob/eb7a3b642f6e0aa83a4257ae62441571cc14a292/go/raft/raft.go#L138
True, the list is set in stone.
I think (sorry if I'm wrong) that this is a continued misunderstanding about how …
The setup I'm referring to is on Kubernetes. There are 3 orchestrator pods managed by a statefulset and its corresponding headless service. Initial state: …
As far as I understood, the correct raft configuration for this would be (no need for …):
{
...
"RaftBind": "{{ .Env.HOSTNAME }}.orchestrator-headless",
"RaftEnabled": true,
"RaftNodes": [
"orchestrator-0.orchestrator-headless",
"orchestrator-1.orchestrator-headless",
"orchestrator-2.orchestrator-headless"
]
}
@shlomi-noach is this correct?
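For context, the orchestrator-N.orchestrator-headless names above come from a headless Service in front of the statefulset. A minimal sketch of such a Service follows; the app label and the raft port (orchestrator defaults to 10008) are assumptions here, not taken from the operator's actual manifests.
# Headless Service (clusterIP: None) backing the statefulset; it gives each pod
# a stable DNS name of the form <pod-name>.orchestrator-headless, but that name
# resolves to the pod's current IP, which changes when the pod is replaced.
apiVersion: v1
kind: Service
metadata:
  name: orchestrator-headless
spec:
  clusterIP: None
  selector:
    app: orchestrator
  ports:
    - name: raft
      port: 10008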
@calind sorry for the late response. What happens when a pod goes down and a new one takes its place? Say …
@shlomi-noach the new pod gets a new IP. It's not clear what I should put in …
I have submitted a fix for this issue 😊
Fixes #107
This commit makes a Service for each pod by using the unique statefulset pod name label. These Services ensure that there is a cluster IP reserved for each pod, and the raft setup uses these cluster IPs. Orchestrator proxies/routes traffic to its leader, so the main service can be used as the entry point and all traffic will be routed to the leader.
See: https://github.com/github/orchestrator/blob/master/docs/configuration-raft.md
See: presslabs/docker-orchestrator#8
Signed-off-by: Kevin Hellemun <[email protected]>
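For anyone following along, a rough sketch of that approach is below. It is illustrative rather than the actual manifest from the commit: the Service name, port, and one-Service-per-replica layout are assumptions. The key idea is that Kubernetes labels every statefulset pod with statefulset.kubernetes.io/pod-name, so a Service can select exactly one pod and give it a cluster IP that survives pod rescheduling; those stable IPs are what raft then uses.
# Per-pod Service sketch (one such Service per statefulset replica).
# The selector matches a single pod via the statefulset.kubernetes.io/pod-name
# label that Kubernetes sets automatically; the Service's cluster IP stays the
# same when the pod is recreated, so it can be listed as a raft peer address.
apiVersion: v1
kind: Service
metadata:
  name: orchestrator-0-svc
spec:
  selector:
    statefulset.kubernetes.io/pod-name: orchestrator-0
  ports:
    - name: raft
      port: 10008
      targetPort: 10008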
Will there be an RC (like …)?
Signed-off-by: Anthony Yeh <[email protected]>
For some context, I am using the MySQL operator by Presslabs on Kubernetes, which utilizes this application. My Kubernetes nodes are preemptible, which means they can occasionally die (usually once a day).
I'm observing an interesting behavior: I have a cluster of three orchestrators, and they all work really well until one of the nodes dies; then, when a new one comes up, it looks like the other two ignore it.
Here are some orchestrator logs:
The failing healthcheck goes on perpetually
This is emitted from the node that restarted:
This is emitted from the other nodes:
It seems like a node should be able to rejoin after failure, even if it's using a different IP address.