Immediate reaction to nodes with disconnected sessions triggers reshuffling while processes are running and healthy #15

Open
mahdeto opened this issue May 1, 2013 · 5 comments

Comments

@mahdeto

mahdeto commented May 1, 2013

Hi,

I am using the ZK plugin with both publishing options. It has worked nicely so far, but for some unknown reason the session disconnects every now and then. This causes the ephemeral node corresponding to the node that lost its session to be lost immediately, which in turn causes the cluster to reshuffle, even though the process that lost its session is alive and well and would immediately recreate the node after re-establishing its session to the server.

Is it possible to change the master so that it does not immediately count this as a detected fault? Instead, it would wait for a certain time, run the check again, and start recovery only after the check has failed n times.
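To make the request concrete, here is a minimal sketch of the "fail n times before recovery" idea. It is not code from the plugin: the class name, the `is_alive` callback and the default threshold are all hypothetical and only illustrate the proposed behavior.

```python
from typing import Callable, Dict


class ConsecutiveFailureDetector:
    """Report a node as failed only after `threshold` failed checks in a row.

    Hypothetical helper for illustration; not part of the ZooKeeper plugin.
    """

    def __init__(self, is_alive: Callable[[str], bool], threshold: int = 3) -> None:
        if threshold < 1:
            raise ValueError("threshold must be at least 1")
        self._is_alive = is_alive
        self._threshold = threshold
        self._failures: Dict[str, int] = {}  # node id -> consecutive failed checks

    def check(self, node_id: str) -> bool:
        """Run one check; return True only when recovery should start."""
        if self._is_alive(node_id):
            # A single successful check resets the counter for this node.
            self._failures.pop(node_id, None)
            return False
        count = self._failures.get(node_id, 0) + 1
        self._failures[node_id] = count
        return count >= self._threshold


# Simulated check results: one blip, a recovery, then three failures in a row.
responses = iter([False, True, False, False, False])
detector = ConsecutiveFailureDetector(lambda _node: next(responses), threshold=3)
decisions = [detector.check("node-05_1") for _ in range(5)]
print(decisions)
```

With a threshold of 3, the single blip at the start does not trigger recovery; only the third consecutive failure does.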

@imotov
Contributor

imotov commented May 1, 2013

Could you post log files corresponding to such event?

@mahdeto
Author

mahdeto commented May 2, 2013

The relevant part of the log, I think, is this:

[2013-04-30 04:00:14,616][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperClusterState] [node-05_1] Retrieving new cluster state
[2013-04-30 04:00:14,784][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperClusterState] [node-05_1] Retrieving new cluster state
[2013-04-30 04:00:14,846][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperClusterState] [node-05_1] Retrieving new cluster state
[2013-04-30 04:00:14,911][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperClusterState] [node-05_1] Retrieving new cluster state
[2013-04-30 04:00:15,071][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperClusterState] [node-05_1] Retrieving new cluster state
[2013-04-30 04:00:15,199][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperClusterState] [node-05_1] Retrieving new cluster state
[2013-04-30 08:10:05,926][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperClusterState] [node-05_1] Retrieving new cluster state
[2013-04-30 13:23:45,309][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperClusterState] [node-05_1] Retrieving new cluster state
[2013-04-30 17:12:28,013][INFO ][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Master is gone
[2013-04-30 17:12:28,013][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Electing master
[2013-04-30 17:12:28,110][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Found master: PtKWjrq4QymCGPKy2LrWFQ
[2013-04-30 17:12:28,110][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperClusterState] [node-05_1] Retrieving new cluster state
[2013-04-30 17:12:28,151][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperClusterState] [node-05_1] Retrieving new cluster state
[2013-04-30 17:12:28,154][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperClusterState] [node-05_1] Retrieving new cluster state
[2013-04-30 17:12:28,194][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperClusterState] [node-05_1] Retrieving new cluster state
[2013-04-30 17:12:28,195][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperClusterState] [node-05_1] Retrieving new cluster state
[2013-04-30 17:14:49,571][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Session Disconnected
[2013-04-30 17:17:10,455][TRACE][com.sonian.elasticsearch.zookeeper.client.ZooKeeperClientService] [node-05_1] Checking if ZooKeeper session should be restarted
[2013-04-30 17:17:10,456][INFO ][com.sonian.elasticsearch.zookeeper.client.ZooKeeperClientService] [node-05_1] Restarting ZooKeeper discovery
[2013-04-30 17:17:10,456][TRACE][com.sonian.elasticsearch.zookeeper.client.ZooKeeperClientService] [node-05_1] Stopping ZooKeeper
[2013-04-30 17:17:10,456][DEBUG][com.sonian.elasticsearch.zookeeper.client.ZooKeeperClientService] [node-05_1] Closing zooKeeper
[2013-04-30 17:17:10,456][TRACE][com.sonian.elasticsearch.zookeeper.client.ZooKeeperClientService] [node-05_1] Starting ZooKeeper
[2013-04-30 17:17:10,456][INFO ][org.apache.zookeeper.ZooKeeper] Initiating client connection, connectString=node-01:2181,node-02:2181,node-03:2181 sessionTimeout=60000 watcher=com.sonian.elasticsearch.zookeeper.client.ZooKeeperClientService$1@5257932b
[2013-04-30 17:17:11,137][TRACE][com.sonian.elasticsearch.zookeeper.client.ZooKeeperClientService] [node-05_1] Started ZooKeeper
[2013-04-30 17:17:11,138][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Restarting ZK Discovery
[2013-04-30 17:17:11,138][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Creating root nodes in ZooKeeper
[2013-04-30 17:17:11,141][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Registering in ZooKeeper
[2013-04-30 17:17:11,148][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Electing master
[2013-04-30 17:17:11,148][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Found master: PtKWjrq4QymCGPKy2LrWFQ
[2013-04-30 17:17:11,148][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperClusterState] [node-05_1] Retrieving new cluster state
[2013-04-30 17:17:12,604][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Session Connected
[2013-04-30 17:17:12,604][INFO ][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Master is gone
[2013-04-30 17:17:12,604][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Electing master
[2013-04-30 17:17:12,606][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Elected as master (N0Xv9y6ZSy663481l6NjGw)
[2013-04-30 17:17:12,606][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperClusterState] [node-05_1] Retrieving new cluster state
[2013-04-30 17:17:12,748][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Updating node list
[2013-04-30 17:19:38,669][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperClusterState] [node-05_1] Publishing new cluster state version [922]
[2013-04-30 17:19:39,814][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Session Disconnected
[2013-04-30 17:19:39,814][TRACE][com.sonian.elasticsearch.zookeeper.client.ZooKeeperClientService] [node-05_1] Checking if ZooKeeper session should be restarted
[2013-04-30 17:19:39,814][INFO ][com.sonian.elasticsearch.zookeeper.client.ZooKeeperClientService] [node-05_1] Restarting ZooKeeper discovery
[2013-04-30 17:19:39,814][TRACE][com.sonian.elasticsearch.zookeeper.client.ZooKeeperClientService] [node-05_1] Stopping ZooKeeper
[2013-04-30 17:19:39,814][DEBUG][com.sonian.elasticsearch.zookeeper.client.ZooKeeperClientService] [node-05_1] Closing zooKeeper
[2013-04-30 17:19:39,814][TRACE][com.sonian.elasticsearch.zookeeper.client.ZooKeeperClientService] [node-05_1] Starting ZooKeeper
[2013-04-30 17:19:39,814][INFO ][org.apache.zookeeper.ZooKeeper] Initiating client connection, connectString=node-01:2181,node-02:2181,node-03:2181 sessionTimeout=60000 watcher=com.sonian.elasticsearch.zookeeper.client.ZooKeeperClientService$1@4ea4d7a6
[2013-04-30 17:19:40,145][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Session Connected
[2013-04-30 17:19:40,145][TRACE][com.sonian.elasticsearch.zookeeper.client.ZooKeeperClientService] [node-05_1] Started ZooKeeper
[2013-04-30 17:19:40,145][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Restarting ZK Discovery
[2013-04-30 17:19:40,145][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Creating root nodes in ZooKeeper
[2013-04-30 17:19:40,147][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Registering in ZooKeeper
[2013-04-30 17:19:40,155][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Electing master
[2013-04-30 17:19:40,156][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Found master: kMgo52H1SLy_fvhOdUdQhA
[2013-04-30 17:21:47,028][WARN ][com.sonian.elasticsearch.zookeeper.client.ZooKeeperClientService] [node-05_1] Session Expired Exception
[2013-04-30 17:21:47,028][WARN ][com.sonian.elasticsearch.zookeeper.client.ZooKeeperClientService] [node-05_1] Session Expired Exception
[2013-04-30 17:21:47,156][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Session Disconnected
[2013-04-30 17:21:47,210][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Restarting ZK Discovery
[2013-04-30 17:21:47,210][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperClusterState] [node-05_1] Retrieving new cluster state
[2013-04-30 17:21:47,210][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Creating root nodes in ZooKeeper
[2013-04-30 17:22:45,011][INFO ][org.apache.zookeeper.ZooKeeper] Initiating client connection, connectString=node-01:2181,node-02:2181,node-03:2181 sessionTimeout=60000 watcher=com.sonian.elasticsearch.zookeeper.client.ZooKeeperClientService$1@6de1dadb
[2013-04-30 17:22:45,158][DEBUG][com.sonian.elasticsearch.zookeeper.client.ZooKeeperClientService] [node-05_1] Closing zooKeeper
[2013-04-30 17:22:51,978][INFO ][org.apache.zookeeper.ZooKeeper] Initiating client connection, connectString=node-01:2181,node-02:2181,node-03:2181 sessionTimeout=60000 watcher=com.sonian.elasticsearch.zookeeper.client.ZooKeeperClientService$1@639d0e0b
[2013-04-30 17:22:51,983][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Session Connected
[2013-04-30 17:22:51,984][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Creating root nodes in ZooKeeper
[2013-04-30 17:22:51,989][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Registering in ZooKeeper
[2013-04-30 17:22:52,002][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Electing master
[2013-04-30 17:22:52,015][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Elected as master (a37moRArQBOHy56j92xhsw)
[2013-04-30 17:22:52,015][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperClusterState] [node-05_1] Retrieving new cluster state
[2013-04-30 17:22:52,215][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Updating node list
[2013-04-30 17:22:52,272][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Current nodes: [[a37moRArQBOHy56j92xhsw]], new nodes: [[a37moRArQBOHy56j92xhsw]], deleted: [[]], added[[]]
[2013-04-30 17:23:12,872][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Updating node list
[2013-04-30 17:23:12,873][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Current nodes: [[a37moRArQBOHy56j92xhsw]], new nodes: [[l2E0_dQzThOumbUA7ipjKA, a37moRArQBOHy56j92xhsw]], deleted: [[]], added[[l2E0_dQzThOumbUA7ipjKA]]
[2013-04-30 17:23:12,881][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperClusterState] [node-05_1] Publishing new cluster state version [2]
[2013-04-30 17:23:55,949][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Updating node list
[2013-04-30 17:23:55,953][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Current nodes: [[l2E0_dQzThOumbUA7ipjKA, a37moRArQBOHy56j92xhsw]], new nodes: [[l2E0_dQzThOumbUA7ipjKA, a37moRArQBOHy56j92xhsw, kMgo52H1SLy_fvhOdUdQhA]], deleted: [[]], added[[kMgo52H1SLy_fvhOdUdQhA]]
[2013-04-30 17:23:55,959][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperClusterState] [node-05_1] Publishing new cluster state version [3]
[2013-04-30 17:24:16,011][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Updating node list
[2013-04-30 17:24:16,013][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Current nodes: [[l2E0_dQzThOumbUA7ipjKA, a37moRArQBOHy56j92xhsw, kMgo52H1SLy_fvhOdUdQhA]], new nodes: [[a37moRArQBOHy56j92xhsw, kMgo52H1SLy_fvhOdUdQhA]], deleted: [[l2E0_dQzThOumbUA7ipjKA]], added[[]]
[2013-04-30 17:24:16,014][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperClusterState] [node-05_1] Publishing new cluster state version [4]
[2013-04-30 17:25:24,182][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Updating node list
[2013-04-30 17:25:24,183][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Current nodes: [[a37moRArQBOHy56j92xhsw, kMgo52H1SLy_fvhOdUdQhA]], new nodes: [[l2E0_dQzThOumbUA7ipjKA, a37moRArQBOHy56j92xhsw, kMgo52H1SLy_fvhOdUdQhA]], deleted: [[]], added[[l2E0_dQzThOumbUA7ipjKA]]
[2013-04-30 17:25:24,190][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperClusterState] [node-05_1] Publishing new cluster state version [5]
[2013-04-30 17:26:26,011][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Updating node list
[2013-04-30 17:26:26,013][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Current nodes: [[l2E0_dQzThOumbUA7ipjKA, a37moRArQBOHy56j92xhsw, kMgo52H1SLy_fvhOdUdQhA]], new nodes: [[a37moRArQBOHy56j92xhsw, kMgo52H1SLy_fvhOdUdQhA]], deleted: [[l2E0_dQzThOumbUA7ipjKA]], added[[]]
[2013-04-30 17:26:26,014][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperClusterState] [node-05_1] Publishing new cluster state version [6]
[2013-04-30 17:27:30,806][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Updating node list
[2013-04-30 17:27:30,808][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Current nodes: [[a37moRArQBOHy56j92xhsw, kMgo52H1SLy_fvhOdUdQhA]], new nodes: [[l2E0_dQzThOumbUA7ipjKA, a37moRArQBOHy56j92xhsw, kMgo52H1SLy_fvhOdUdQhA]], deleted: [[]], added[[l2E0_dQzThOumbUA7ipjKA]]
[2013-04-30 17:27:30,813][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperClusterState] [node-05_1] Publishing new cluster state version [7]
[2013-04-30 17:28:04,290][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Updating node list
[2013-04-30 17:28:04,291][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Current nodes: [[l2E0_dQzThOumbUA7ipjKA, a37moRArQBOHy56j92xhsw, kMgo52H1SLy_fvhOdUdQhA]], new nodes: [[l2E0_dQzThOumbUA7ipjKA, esVkbqnZSNiIOyWIrJ6Vfg, a37moRArQBOHy56j92xhsw, kMgo52H1SLy_fvhOdUdQhA]], deleted: [[]], added[[esVkbqnZSNiIOyWIrJ6Vfg]]
[2013-04-30 17:28:04,296][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperClusterState] [node-05_1] Publishing new cluster state version [8]
[2013-04-30 17:28:32,012][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Updating node list
[2013-04-30 17:28:32,014][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Current nodes: [[l2E0_dQzThOumbUA7ipjKA, esVkbqnZSNiIOyWIrJ6Vfg, a37moRArQBOHy56j92xhsw, kMgo52H1SLy_fvhOdUdQhA]], new nodes: [[esVkbqnZSNiIOyWIrJ6Vfg, a37moRArQBOHy56j92xhsw, kMgo52H1SLy_fvhOdUdQhA]], deleted: [[l2E0_dQzThOumbUA7ipjKA]], added[[]]
[2013-04-30 17:28:32,015][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperClusterState] [node-05_1] Publishing new cluster state version [9]
[2013-04-30 17:29:06,005][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Updating node list
[2013-04-30 17:29:06,006][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Current nodes: [[esVkbqnZSNiIOyWIrJ6Vfg, a37moRArQBOHy56j92xhsw, kMgo52H1SLy_fvhOdUdQhA]], new nodes: [[a37moRArQBOHy56j92xhsw, kMgo52H1SLy_fvhOdUdQhA]], deleted: [[esVkbqnZSNiIOyWIrJ6Vfg]], added[[]]
[2013-04-30 17:29:06,008][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperClusterState] [node-05_1] Publishing new cluster state version [10]
[2013-04-30 17:32:05,826][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Updating node list
[2013-04-30 17:32:05,827][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Current nodes: [[a37moRArQBOHy56j92xhsw, kMgo52H1SLy_fvhOdUdQhA]], new nodes: [[NeSmgy0TSKO3pttCkn8Qlg, a37moRArQBOHy56j92xhsw, kMgo52H1SLy_fvhOdUdQhA]], deleted: [[]], added[[NeSmgy0TSKO3pttCkn8Qlg]]
[2013-04-30 17:32:05,832][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperClusterState] [node-05_1] Publishing new cluster state version [11]
[2013-04-30 17:32:52,443][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Updating node list
[2013-04-30 17:32:52,444][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Current nodes: [[NeSmgy0TSKO3pttCkn8Qlg, a37moRArQBOHy56j92xhsw, kMgo52H1SLy_fvhOdUdQhA]], new nodes: [[NeSmgy0TSKO3pttCkn8Qlg, a37moRArQBOHy56j92xhsw, kMgo52H1SLy_fvhOdUdQhA, h-LYwWQHSbidqqa6X-2XYQ]], deleted: [[]], added[[h-LYwWQHSbidqqa6X-2XYQ]]
[2013-04-30 17:33:08,012][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Updating node list
[2013-04-30 17:33:22,456][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperClusterState] [node-05_1] Publishing new cluster state version [12]
[2013-04-30 17:33:22,464][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Current nodes: [[NeSmgy0TSKO3pttCkn8Qlg, a37moRArQBOHy56j92xhsw, kMgo52H1SLy_fvhOdUdQhA, h-LYwWQHSbidqqa6X-2XYQ]], new nodes: [[a37moRArQBOHy56j92xhsw, kMgo52H1SLy_fvhOdUdQhA, h-LYwWQHSbidqqa6X-2XYQ]], deleted: [[NeSmgy0TSKO3pttCkn8Qlg]], added[[]]
[2013-04-30 17:33:22,466][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperClusterState] [node-05_1] Publishing new cluster state version [13]
[2013-04-30 17:33:54,005][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Updating node list
[2013-04-30 17:33:54,006][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Current nodes: [[a37moRArQBOHy56j92xhsw, kMgo52H1SLy_fvhOdUdQhA, h-LYwWQHSbidqqa6X-2XYQ]], new nodes: [[a37moRArQBOHy56j92xhsw, kMgo52H1SLy_fvhOdUdQhA]], deleted: [[h-LYwWQHSbidqqa6X-2XYQ]], added[[]]
[2013-04-30 17:33:54,007][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperClusterState] [node-05_1] Publishing new cluster state version [14]
[2013-04-30 17:34:04,662][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Stopping zooKeeper client
[2013-04-30 17:34:04,662][DEBUG][com.sonian.elasticsearch.zookeeper.client.ZooKeeperClientService] [node-05_1] Closing zooKeeper
[2013-04-30 17:34:04,664][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Stopped zooKeeper client

Sorry, the entire log has been lost since then. I hope this excerpt is of some relevance.

@imotov
Contributor

imotov commented May 2, 2013

Strange. It looks like some nodes were appearing and disappearing intermittently from the cluster for quite some time. Did you monitor CPU and java heap size on the nodes while these issues were happening? What was going on there? Could it be the case that the cluster was simply overloaded?
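One way to quantify the intermittent appearing and disappearing is to count membership changes per node from the "Current nodes" trace lines. The sketch below assumes the `deleted: [[...]], added[[...]]` format seen in the log above and uses four lines copied from it; the function name is made up for illustration.

```python
import re
from collections import Counter

# Four lines copied verbatim from the log above.
LOG_LINES = [
    "[2013-04-30 17:24:16,013][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Current nodes: [[l2E0_dQzThOumbUA7ipjKA, a37moRArQBOHy56j92xhsw, kMgo52H1SLy_fvhOdUdQhA]], new nodes: [[a37moRArQBOHy56j92xhsw, kMgo52H1SLy_fvhOdUdQhA]], deleted: [[l2E0_dQzThOumbUA7ipjKA]], added[[]]",
    "[2013-04-30 17:25:24,183][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Current nodes: [[a37moRArQBOHy56j92xhsw, kMgo52H1SLy_fvhOdUdQhA]], new nodes: [[l2E0_dQzThOumbUA7ipjKA, a37moRArQBOHy56j92xhsw, kMgo52H1SLy_fvhOdUdQhA]], deleted: [[]], added[[l2E0_dQzThOumbUA7ipjKA]]",
    "[2013-04-30 17:26:26,013][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Current nodes: [[l2E0_dQzThOumbUA7ipjKA, a37moRArQBOHy56j92xhsw, kMgo52H1SLy_fvhOdUdQhA]], new nodes: [[a37moRArQBOHy56j92xhsw, kMgo52H1SLy_fvhOdUdQhA]], deleted: [[l2E0_dQzThOumbUA7ipjKA]], added[[]]",
    "[2013-04-30 17:27:30,808][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Current nodes: [[a37moRArQBOHy56j92xhsw, kMgo52H1SLy_fvhOdUdQhA]], new nodes: [[l2E0_dQzThOumbUA7ipjKA, a37moRArQBOHy56j92xhsw, kMgo52H1SLy_fvhOdUdQhA]], deleted: [[]], added[[l2E0_dQzThOumbUA7ipjKA]]",
]

# Captures the comma-separated node ids inside the deleted and added lists.
CHANGE = re.compile(r"deleted: \[\[(.*?)\]\], added\[\[(.*?)\]\]")


def count_membership_changes(lines):
    """Return (removed, added) counters keyed by node id."""
    removed, added = Counter(), Counter()
    for line in lines:
        match = CHANGE.search(line)
        if match is None:
            continue  # not a node-list diff line
        for counter, group in ((removed, match.group(1)), (added, match.group(2))):
            for node_id in filter(None, (part.strip() for part in group.split(","))):
                counter[node_id] += 1
    return removed, added


removed, added = count_membership_changes(LOG_LINES)
print(removed.most_common(), added.most_common())
```

A node that shows up repeatedly in both counters within a few minutes is flapping rather than failing for good.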

@mahdeto
Author

mahdeto commented May 2, 2013

True, it was overloaded, but not to the extent of the process itself dying or OOMing. This might be why the disconnections are happening. But the issue still remains: a disconnected ZKClient means no ephemeral node and an immediate reshuffle (which makes the load problem worse).

I think waiting before initiating recovery, or retrying the check multiple times, would be an awesome feature nevertheless. What do you think?
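As a rough illustration of the "wait before initiating recovery" variant, the sketch below replays a list of membership events and only confirms a removal if the node stays away longer than a grace period. Everything here is hypothetical (function name, event tuples, node names); the 68-second gap mirrors the time between a node being deleted at 17:24:16 and added back at 17:25:24 in the log above.

```python
from typing import Dict, List, Tuple

Event = Tuple[float, str, str]  # (time in seconds, "removed" or "added", node id)


def confirmed_removals(events: List[Event], grace_seconds: float,
                       now: float) -> List[Tuple[float, str]]:
    """Return (confirmation time, node id) for removals not undone within the grace period."""
    pending: Dict[str, float] = {}  # node id -> time it disappeared
    confirmed: List[Tuple[float, str]] = []
    for timestamp, kind, node_id in sorted(events):
        if kind == "removed":
            pending[node_id] = timestamp
        elif kind == "added" and node_id in pending:
            gone_at = pending.pop(node_id)
            if timestamp - gone_at > grace_seconds:
                # The node came back too late; recovery would already have started.
                confirmed.append((gone_at + grace_seconds, node_id))
    # Nodes still missing at `now` whose grace period has run out.
    for node_id, gone_at in pending.items():
        if now - gone_at > grace_seconds:
            confirmed.append((gone_at + grace_seconds, node_id))
    return sorted(confirmed)


events = [
    (0.0, "removed", "node-a"),    # session lost
    (68.0, "added", "node-a"),     # node re-registers 68 seconds later
    (100.0, "removed", "node-b"),  # node that never returns
]
with_grace = confirmed_removals(events, grace_seconds=90.0, now=300.0)
immediate = confirmed_removals(events, grace_seconds=0.0, now=300.0)
print(with_grace, immediate)
```

With a 90-second grace period only the node that never returns triggers recovery, while a grace period of zero (the current immediate reaction) triggers it for both. The trade-off is that a genuinely dead node is also detected 90 seconds later.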

@imotov
Contributor

imotov commented May 2, 2013

I think the real problem here is cluster overload. Disappearing nodes are just a symptom, and the zookeeper discovery service is just a messenger. This is how it works: zookeeper detects that a node has been unresponsive for 60 sec and kills its session; zookeeper tells the discovery service that this node disappeared; and the discovery service passes the message upstream, telling the rest of the system that the node disappeared, which in turn causes rebalancing, etc. You can increase the zookeeper session timeout from the current 60-second default to something longer using the sonian.elasticsearch.zookeeper.client.session.timeout setting, but I would suggest fixing the real issue: the overloaded cluster.
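One caveat when raising the timeout: the ZooKeeper server bounds the session timeout a client may negotiate between minSessionTimeout and maxSessionTimeout, which default to 2 and 20 times the server's tickTime. The sketch below only illustrates that clamping; the function name is made up, and the tickTime of 2000 ms is an assumption, not a value taken from this cluster.

```python
from typing import Optional


def negotiated_session_timeout_ms(requested_ms: int, tick_time_ms: int,
                                  min_ms: Optional[int] = None,
                                  max_ms: Optional[int] = None) -> int:
    """Clamp the client's requested timeout to the server-side bounds."""
    lower = 2 * tick_time_ms if min_ms is None else min_ms    # default minSessionTimeout
    upper = 20 * tick_time_ms if max_ms is None else max_ms   # default maxSessionTimeout
    return max(lower, min(requested_ms, upper))


TICK_TIME_MS = 2000  # assumed server tickTime, for illustration only

# sessionTimeout=60000 is what the client requests in the log above.
default_bounds = negotiated_session_timeout_ms(60000, TICK_TIME_MS)
# Requesting more only helps if the server's upper bound is raised as well.
raised_bounds = negotiated_session_timeout_ms(120000, TICK_TIME_MS, max_ms=180000)
print(default_bounds, raised_bounds)
```

So if the effective timeout seems shorter than the configured one, the server-side bounds are worth checking too.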
