Immediate reaction to nodes with disconnected sessions triggers reshuffling while processes are running and healthy #15
Comments
Could you post the log files corresponding to such an event?
The relevant part of the log, I think, is this:
[2013-04-30 04:00:14,616][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperClusterState] [node-05_1] Retrieving new cluster state
Sorry, the entire log has since been lost. I hope this is of some relevance.
Strange. It looks like some nodes were intermittently appearing in and disappearing from the cluster for quite some time. Did you monitor CPU and Java heap size on the nodes while these issues were happening? What was going on there? Could it be that the cluster was simply overloaded?
True, it was overloaded, but not to the extent of the process itself dying or OOMing. This might be the reason why the disconnections are happening. But the issue still remains: a disconnected ZKClient means no ephemeral node and an immediate reshuffle (which makes the load problem worse). I think waiting before initiating recovery, or retrying the check multiple times, would be an awesome feature nevertheless. What do you think?
I think the real problem here is cluster overload. Disappearing nodes are just a symptom, and the ZooKeeper discovery service is just the messenger. This is how it works: ZooKeeper detects that a node has been unresponsive for 60 seconds and kills its session; ZooKeeper tells the discovery service that this node disappeared; and the discovery service passes the message upstream, telling the rest of the system that the node disappeared, which in turn causes rebalancing, etc. You can increase the ZooKeeper session timeout from the current 60-second default to something longer using
Hi,
I am using the ZK plugin with both publishing options. It has been nice so far, but for some unknown reason the session disconnects every now and then. This causes the ephemeral node corresponding to the node that lost its session to be lost immediately, which in turn causes the cluster to reshuffle, even though the process that lost its session is alive and well and would recreate the ephemeral node as soon as it re-established its session with the server.
Is it possible to change the master so that it does not immediately count this as a detected fault, and instead waits for a certain time, checks again, and starts recovery only after the check has failed n times?
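To make the request concrete, here is a minimal sketch of the kind of "wait, then re-check n times" logic I have in mind. It is written in Python only for brevity and is not the plugin's code: `confirm_node_failure`, `GraceCheckResult`, `node_is_present`, `max_attempts`, and `wait_seconds` are all hypothetical names, and the presence check is assumed to be something like "does the ephemeral node exist again?".

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class GraceCheckResult:
    failed: bool   # True if the node should now be treated as gone
    attempts: int  # number of presence checks actually performed


def confirm_node_failure(
    node_is_present: Callable[[], bool],  # hypothetical check, e.g. "ephemeral node exists"
    max_attempts: int = 3,                # hypothetical setting: the "n" in "n times"
    wait_seconds: float = 5.0,            # hypothetical setting: delay before each re-check
    sleep: Callable[[float], None] = time.sleep,
) -> GraceCheckResult:
    """Re-check a node that was reported missing before declaring it failed."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        # Give the node time to re-establish its session and recreate its ephemeral node.
        sleep(wait_seconds)
        if node_is_present():
            # The node came back within the grace period: no recovery needed.
            return GraceCheckResult(failed=False, attempts=attempt)
    # Every re-check failed: only now should recovery (reshuffling) start.
    return GraceCheckResult(failed=True, attempts=max_attempts)


if __name__ == "__main__":
    # Simulated node that is missing on the first re-check and back on the second.
    answers = iter([False, True])
    result = confirm_node_failure(lambda: next(answers), sleep=lambda _s: None)
    print(result)
```

The trade-off is that a node that is really dead is only acted on after roughly `max_attempts * wait_seconds`, so both values would need to be configurable.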