On my cluster, I hit an issue when I lost one of my historical nodes (1 of 4).
Here are the logs I found on my coordinator:
2019-09-07T00:25:50,545 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.helper.DruidCoordinatorLogger - Server[ip-172-29-180-80.eu-west-1.compute.internal:8083, historical, _default_tier] has 0 left to load, 0 left to drop, 0 bytes queued, 34,768,169,338 bytes served.
2019-09-07T00:25:50,545 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.helper.DruidCoordinatorLogger - Server[ip-172-29-181-111.eu-west-1.compute.internal:8083, historical, _default_tier] has 0 left to load, 0 left to drop, 0 bytes queued, 34,963,965,123 bytes served.
2019-09-07T00:25:50,545 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.helper.DruidCoordinatorLogger - Server[ip-172-29-182-88.eu-west-1.compute.internal:8083, historical, _default_tier] has 0 left to load, 0 left to drop, 0 bytes queued, 35,060,517,554 bytes served.
2019-09-07T00:25:50,545 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.helper.DruidCoordinatorLogger - Server[ip-172-29-182-97.eu-west-1.compute.internal:8083, historical, _default_tier] has 0 left to load, 0 left to drop, 0 bytes queued, 35,112,561,267 bytes served.
2019-09-07T00:25:55,571 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.helper.DruidCoordinatorSegmentInfoLoader - Starting coordination. Getting available segments.
2019-09-07T00:25:55,580 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.helper.DruidCoordinatorSegmentInfoLoader - Found [8,252] available segments.
2019-09-07T00:25:55,595 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.ReplicationThrottler - [_default_tier]: Replicant create queue is empty.
2019-09-07T00:25:55,655 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.helper.DruidCoordinatorBalancer - Found 4 active servers, 0 decommissioning servers
2019-09-07T00:25:55,655 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.helper.DruidCoordinatorBalancer - Processing 4 segments for moving from decommissioning servers
2019-09-07T00:25:55,655 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.helper.DruidCoordinatorBalancer - All servers to move segments from are empty, ending run.
2019-09-07T00:25:55,655 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.helper.DruidCoordinatorBalancer - Processing 5 segments for balancing between active servers
2019-09-07T00:25:55,672 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.helper.DruidCoordinatorBalancer - No good moves found in tier [_default_tier]
2019-09-07T00:25:58,954 INFO [ServerInventoryView-0] org.apache.druid.client.BatchServerInventoryView - Server Disappeared[DruidServerMetadata{name='ip-172-29-181-111.eu-west-1.compute.internal:8083', hostAndPort='ip-172-29-181-111.eu-west-1.compute.internal:8083', hostAndTlsPort='null', maxSize=50000000000, tier='_default_tier', type=historical, priority=0}]
2019-09-07T00:25:58,958 INFO [NodeTypeWatcher[HISTORICAL]] org.apache.druid.curator.discovery.CuratorDruidNodeDiscoveryProvider$NodeTypeWatcher - Node[ip-172-29-181-111.eu-west-1.compute.internal:8083:DiscoveryDruidNode{druidNode=DruidNode{serviceName='druid/historical', host='ip-172-29-181-111.eu-west-1.compute.internal', bindOnHost=false, port=-1, plaintextPort=8083, enablePlaintextPort=true, tlsPort=-1, enableTlsPort=false}, nodeType='HISTORICAL', services={dataNodeService=DataNodeService{tier='_default_tier', maxSize=50000000000, type=historical, priority=0}, lookupNodeService=LookupNodeService{lookupTier='__default'}}}] disappeared.
2019-09-07T00:26:03,876 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.CostBalancerStrategy - [_default_tier]: Initial Total Cost: [58981658.349756], Normalization: [15286538.012713], Initial Normalized Cost: [3.858405]
2019-09-07T00:26:03,876 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.helper.DruidCoordinatorBalancer - [_default_tier]: Segments Moved: [0] Segments Let Alone: [5]
2019-09-07T00:26:03,876 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.helper.DruidCoordinatorLogger - [_default_tier] : Assigned 0 segments among 4 servers
2019-09-07T00:26:03,876 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.helper.DruidCoordinatorLogger - [_default_tier] : Dropped 0 segments among 4 servers
2019-09-07T00:26:03,877 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.helper.DruidCoordinatorLogger - [_default_tier] : Moved 0 segment(s)
2019-09-07T00:26:03,877 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.helper.DruidCoordinatorLogger - [_default_tier] : Let alone 5 segment(s)
2019-09-07T00:26:03,877 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.helper.DruidCoordinatorLogger - Load Queues:
2019-09-07T00:26:03,877 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.helper.DruidCoordinatorLogger - Server[ip-172-29-180-80.eu-west-1.compute.internal:8083, historical, _default_tier] has 0 left to load, 0 left to drop, 0 bytes queued, 34,768,169,338 bytes served.
2019-09-07T00:26:03,877 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.helper.DruidCoordinatorLogger - Server[ip-172-29-181-111.eu-west-1.compute.internal:8083, historical, _default_tier] has 0 left to load, 0 left to drop, 0 bytes queued, 34,963,965,123 bytes served.
2019-09-07T00:26:03,877 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.helper.DruidCoordinatorLogger - Server[ip-172-29-182-88.eu-west-1.compute.internal:8083, historical, _default_tier] has 0 left to load, 0 left to drop, 0 bytes queued, 35,060,517,554 bytes served.
2019-09-07T00:26:03,877 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.helper.DruidCoordinatorLogger - Server[ip-172-29-182-97.eu-west-1.compute.internal:8083, historical, _default_tier] has 0 left to load, 0 left to drop, 0 bytes queued, 35,112,561,267 bytes served.
2019-09-07T00:26:08,905 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.helper.DruidCoordinatorSegmentInfoLoader - Starting coordination. Getting available segments.
2019-09-07T00:26:08,916 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.helper.DruidCoordinatorSegmentInfoLoader - Found [8,252] available segments.
2019-09-07T00:26:08,928 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.DruidCoordinator - Removing listener for server[ip-172-29-181-111.eu-west-1.compute.internal:8083] which is no longer there.
2019-09-07T00:26:08,932 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.ReplicationThrottler - [_default_tier]: Replicant create queue is empty.
2019-09-07T00:26:08,935 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.rules.LoadRule - Assigning 'replica' for segment [qa_enriched_sale_2019-09-06T00:00:00.000Z_2019-09-07T00:00:00.000Z_2019-09-06T18:02:52.322Z] to server [ip-172-29-182-88.eu-west-1.compute.internal:8083] in tier [_default_tier]
2019-09-07T00:26:08,946 ERROR [Coordinator-Exec--0] org.apache.druid.server.coordinator.DruidCoordinator - Caught exception, ignoring so that schedule keeps going.: {class=org.apache.druid.server.coordinator.DruidCoordinator, exceptionType=class java.util.concurrent.RejectedExecutionException, exceptionMessage=Task java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@5b74b020 rejected from java.util.concurrent.ScheduledThreadPoolExecutor@55b0122a[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 21038]}
java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@5b74b020 rejected from java.util.concurrent.ScheduledThreadPoolExecutor@55b0122a[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 21038]
at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2063) ~[?:1.8.0_201]
at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:830) ~[?:1.8.0_201]
at java.util.concurrent.ScheduledThreadPoolExecutor.delayedExecute(ScheduledThreadPoolExecutor.java:326) ~[?:1.8.0_201]
at java.util.concurrent.ScheduledThreadPoolExecutor.schedule(ScheduledThreadPoolExecutor.java:533) ~[?:1.8.0_201]
at java.util.concurrent.ScheduledThreadPoolExecutor.submit(ScheduledThreadPoolExecutor.java:632) ~[?:1.8.0_201]
at org.apache.druid.server.coordinator.CuratorLoadQueuePeon.loadSegment(CuratorLoadQueuePeon.java:181) ~[druid-server-0.15.0-incubating.jar:0.15.0-incubating]
at org.apache.druid.server.coordinator.rules.LoadRule.assignReplicasForTier(LoadRule.java:313) ~[druid-server-0.15.0-incubating.jar:0.15.0-incubating]
at org.apache.druid.server.coordinator.rules.LoadRule.assignReplicas(LoadRule.java:240) ~[druid-server-0.15.0-incubating.jar:0.15.0-incubating]
at org.apache.druid.server.coordinator.rules.LoadRule.assign(LoadRule.java:102) ~[druid-server-0.15.0-incubating.jar:0.15.0-incubating]
at org.apache.druid.server.coordinator.rules.LoadRule.run(LoadRule.java:77) ~[druid-server-0.15.0-incubating.jar:0.15.0-incubating]
at org.apache.druid.server.coordinator.helper.DruidCoordinatorRuleRunner.run(DruidCoordinatorRuleRunner.java:122) ~[druid-server-0.15.0-incubating.jar:0.15.0-incubating]
at org.apache.druid.server.coordinator.DruidCoordinator$CoordinatorRunnable.run(DruidCoordinator.java:716) [druid-server-0.15.0-incubating.jar:0.15.0-incubating]
at org.apache.druid.server.coordinator.DruidCoordinator$2.call(DruidCoordinator.java:601) [druid-server-0.15.0-incubating.jar:0.15.0-incubating]
at org.apache.druid.server.coordinator.DruidCoordinator$2.call(DruidCoordinator.java:594) [druid-server-0.15.0-incubating.jar:0.15.0-incubating]
at org.apache.druid.java.util.common.concurrent.ScheduledExecutors$2.run(ScheduledExecutors.java:92) [druid-core-0.15.0-incubating.jar:0.15.0-incubating]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_201]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_201]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_201]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [?:1.8.0_201]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_201]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_201]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_201]
2019-09-07T00:26:13,967 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.helper.DruidCoordinatorSegmentInfoLoader - Starting coordination. Getting available segments.
2019-09-07T00:26:13,977 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.helper.DruidCoordinatorSegmentInfoLoader - Found [8,252] available segments.
2019-09-07T00:26:13,999 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.ReplicationThrottler - [_default_tier]: Replicant create queue still has 1 segments. Lifetime[15]. Segments [qa_enriched_sale_2019-09-06T00:00:00.000Z_2019-09-07T00:00:00.000Z_2019-09-06T18:02:52.322Z ON ip-172-29-182-88.eu-west-1.compute.internal:8083]
2019-09-07T00:26:13,999 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.rules.LoadRule - Loading in progress, skipping drop until loading is complete
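The RejectedExecutionException above is the standard behavior of a ScheduledThreadPoolExecutor once it has shut down: any later schedule() call is refused by the default AbortPolicy. A minimal standalone sketch (plain JDK code, not Druid internals) reproducing the same failure mode seen in the stack trace:

```java
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class TerminatedPoolDemo {
    /** Returns true if schedule() is rejected after shutdown. */
    static boolean scheduleAfterShutdownIsRejected() {
        ScheduledThreadPoolExecutor pool = new ScheduledThreadPoolExecutor(1);
        // Shut the pool down first, mimicking the peon executor that was
        // terminated when the historical node disappeared.
        pool.shutdown();
        try {
            pool.schedule(() -> { }, 0, TimeUnit.MILLISECONDS);
            return false; // task was accepted (does not happen after shutdown)
        } catch (RejectedExecutionException e) {
            return true;  // default AbortPolicy rejects, as in the stack trace
        }
    }

    public static void main(String[] args) {
        System.out.println("rejected=" + scheduleAfterShutdownIsRejected());
    }
}
```

This matches the `[Terminated, pool size = 0, ...]` state in the error message: the coordinator kept handing work to a peon whose executor was already gone.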
Then, a few minutes later, the coordinator throws:
2019-09-07T00:29:04,110 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.helper.DruidCoordinatorSegmentInfoLoader - Starting coordination. Getting available segments.
2019-09-07T00:29:04,120 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.helper.DruidCoordinatorSegmentInfoLoader - Found [8,252] available segments.
2019-09-07T00:29:04,136 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.ReplicationThrottler - [_default_tier]: Replicant create queue still has 1 segments. Lifetime[0]. Segments [qa_enriched_sale_2019-09-06T00:00:00.000Z_2019-09-07T00:00:00.000Z_2019-09-06T18:02:52.322Z ON ip-172-29-182-88.eu-west-1.compute.internal:8083]
2019-09-07T00:29:04,136 ERROR [Coordinator-Exec--0] org.apache.druid.server.coordinator.ReplicationThrottler - [_default_tier]: Replicant create queue stuck after 15+ runs!: {class=org.apache.druid.server.coordinator.ReplicationThrottler, segments=[qa_enriched_sale_2019-09-06T00:00:00.000Z_2019-09-07T00:00:00.000Z_2019-09-06T18:02:52.322Z ON ip-172-29-182-88.eu-west-1.compute.internal:8083]}
2019-09-07T00:29:04,137 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.rules.LoadRule - Loading in progress, skipping drop until loading is complete
First, I re-added a historical node, but it never appeared in the cluster (neither in the UI nor in the coordinator logs).
Then I restarted the coordinator, and everything went back to normal (4 historicals found, segments started to rebalance).
Direct effect of the issue:
The number of unavailable segments kept increasing while the coordinator was stuck on this issue, so new data were not available for queries.
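The replicant queue appears to stay stuck because the failed schedule() leaves the bookkeeping entry in place. A hypothetical sketch of the general defensive pattern (the class and method names here are invented for illustration, this is not the actual Druid fix) is to catch the rejection at the submission site and roll back the queue entry:

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// GuardedPeon, loadSegment and queued are illustrative names, not Druid API.
public class GuardedPeon {
    private final ScheduledExecutorService exec;
    private final Queue<String> segmentsToLoad = new ConcurrentLinkedQueue<>();

    GuardedPeon(ScheduledExecutorService exec) {
        this.exec = exec;
    }

    /** Returns true if the load was scheduled, false if the executor is gone. */
    boolean loadSegment(String segmentId) {
        segmentsToLoad.add(segmentId);
        try {
            exec.schedule(() -> segmentsToLoad.remove(segmentId),
                          0, TimeUnit.MILLISECONDS);
            return true;
        } catch (RejectedExecutionException e) {
            // Executor already terminated (its server disappeared): undo the
            // bookkeeping instead of leaving the replicant entry stuck forever.
            segmentsToLoad.remove(segmentId);
            return false;
        }
    }

    int queued() {
        return segmentsToLoad.size();
    }

    public static void main(String[] args) {
        ScheduledExecutorService dead = Executors.newSingleThreadScheduledExecutor();
        dead.shutdown(); // simulate the disappeared historical's peon executor
        GuardedPeon peon = new GuardedPeon(dead);
        boolean ok = peon.loadSegment("example_segment");
        System.out.println("scheduled=" + ok + " queued=" + peon.queued());
    }
}
```

With a rollback like this, the caller can pick another server instead of the ReplicationThrottler reporting the same segment stuck after 15+ runs.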
Affected Version
Druid 0.15.0-incubating