
Coordinator get stuck after an historical lost #8523

Closed
giom-l opened this issue Sep 12, 2019 · 3 comments

giom-l commented Sep 12, 2019

Affected Version

Druid 0.15.0-incubating

Cluster description

  • deployed on AWS
  • 1 broker, 1 router, 1 middleManager, 1 coordinator, 1 overlord and 4 historicals
  • configuration items can be provided if necessary

Issue description

On my cluster, I had an issue when I lost one of my historical nodes (1 of 4).
Here are the logs I found on my coordinator:

2019-09-07T00:25:50,545 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.helper.DruidCoordinatorLogger - Server[ip-172-29-180-80.eu-west-1.compute.internal:8083, historical, _default_tier] has 0 left to load, 0 left to drop, 0 bytes queued, 34,768,169,338 bytes served.
2019-09-07T00:25:50,545 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.helper.DruidCoordinatorLogger - Server[ip-172-29-181-111.eu-west-1.compute.internal:8083, historical, _default_tier] has 0 left to load, 0 left to drop, 0 bytes queued, 34,963,965,123 bytes served.
2019-09-07T00:25:50,545 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.helper.DruidCoordinatorLogger - Server[ip-172-29-182-88.eu-west-1.compute.internal:8083, historical, _default_tier] has 0 left to load, 0 left to drop, 0 bytes queued, 35,060,517,554 bytes served.
2019-09-07T00:25:50,545 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.helper.DruidCoordinatorLogger - Server[ip-172-29-182-97.eu-west-1.compute.internal:8083, historical, _default_tier] has 0 left to load, 0 left to drop, 0 bytes queued, 35,112,561,267 bytes served.
2019-09-07T00:25:55,571 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.helper.DruidCoordinatorSegmentInfoLoader - Starting coordination. Getting available segments.
2019-09-07T00:25:55,580 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.helper.DruidCoordinatorSegmentInfoLoader - Found [8,252] available segments.
2019-09-07T00:25:55,595 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.ReplicationThrottler - [_default_tier]: Replicant create queue is empty.
2019-09-07T00:25:55,655 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.helper.DruidCoordinatorBalancer - Found 4 active servers, 0 decommissioning servers
2019-09-07T00:25:55,655 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.helper.DruidCoordinatorBalancer - Processing 4 segments for moving from decommissioning servers
2019-09-07T00:25:55,655 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.helper.DruidCoordinatorBalancer - All servers to move segments from are empty, ending run.
2019-09-07T00:25:55,655 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.helper.DruidCoordinatorBalancer - Processing 5 segments for balancing between active servers
2019-09-07T00:25:55,672 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.helper.DruidCoordinatorBalancer - No good moves found in tier [_default_tier]
**2019-09-07T00:25:58,954 INFO [ServerInventoryView-0] org.apache.druid.client.BatchServerInventoryView - Server Disappeared[DruidServerMetadata{name='ip-172-29-181-111.eu-west-1.compute.internal:8083', hostAndPort='ip-172-29-181-111.eu-west-1.compute.internal:8083', hostAndTlsPort='null', maxSize=50000000000, tier='_default_tier', type=historical, priority=0}]
2019-09-07T00:25:58,958 INFO [NodeTypeWatcher[HISTORICAL]] org.apache.druid.curator.discovery.CuratorDruidNodeDiscoveryProvider$NodeTypeWatcher - Node[ip-172-29-181-111.eu-west-1.compute.internal:8083:DiscoveryDruidNode{druidNode=DruidNode{serviceName='druid/historical', host='ip-172-29-181-111.eu-west-1.compute.internal', bindOnHost=false, port=-1, plaintextPort=8083, enablePlaintextPort=true, tlsPort=-1, enableTlsPort=false}, nodeType='HISTORICAL', services={dataNodeService=DataNodeService{tier='_default_tier', maxSize=50000000000, type=historical, priority=0}, lookupNodeService=LookupNodeService{lookupTier='__default'}}}] disappeared.**
2019-09-07T00:26:03,876 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.CostBalancerStrategy - [_default_tier]: Initial Total Cost: [58981658.349756], Normalization: [15286538.012713], Initial Normalized Cost: [3.858405]
2019-09-07T00:26:03,876 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.helper.DruidCoordinatorBalancer - [_default_tier]: Segments Moved: [0] Segments Let Alone: [5]
2019-09-07T00:26:03,876 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.helper.DruidCoordinatorLogger - [_default_tier] : Assigned 0 segments among 4 servers
2019-09-07T00:26:03,876 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.helper.DruidCoordinatorLogger - [_default_tier] : Assigned 0 segments among 4 servers
2019-09-07T00:26:03,876 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.helper.DruidCoordinatorLogger - [_default_tier] : Dropped 0 segments among 4 servers
2019-09-07T00:26:03,877 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.helper.DruidCoordinatorLogger - [_default_tier] : Moved 0 segment(s)
2019-09-07T00:26:03,877 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.helper.DruidCoordinatorLogger - [_default_tier] : Let alone 5 segment(s)
2019-09-07T00:26:03,877 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.helper.DruidCoordinatorLogger - Load Queues:
2019-09-07T00:26:03,877 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.helper.DruidCoordinatorLogger - Server[ip-172-29-180-80.eu-west-1.compute.internal:8083, historical, _default_tier] has 0 left to load, 0 left to drop, 0 bytes queued, 34,768,169,338 bytes served.
2019-09-07T00:26:03,877 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.helper.DruidCoordinatorLogger - Server[ip-172-29-181-111.eu-west-1.compute.internal:8083, historical, _default_tier] has 0 left to load, 0 left to drop, 0 bytes queued, 34,963,965,123 bytes served.
2019-09-07T00:26:03,877 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.helper.DruidCoordinatorLogger - Server[ip-172-29-182-88.eu-west-1.compute.internal:8083, historical, _default_tier] has 0 left to load, 0 left to drop, 0 bytes queued, 35,060,517,554 bytes served.
2019-09-07T00:26:03,877 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.helper.DruidCoordinatorLogger - Server[ip-172-29-182-97.eu-west-1.compute.internal:8083, historical, _default_tier] has 0 left to load, 0 left to drop, 0 bytes queued, 35,112,561,267 bytes served.
2019-09-07T00:26:08,905 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.helper.DruidCoordinatorSegmentInfoLoader - Starting coordination. Getting available segments.
2019-09-07T00:26:08,916 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.helper.DruidCoordinatorSegmentInfoLoader - Found [8,252] available segments.
**2019-09-07T00:26:08,928 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.DruidCoordinator - Removing listener for server[ip-172-29-181-111.eu-west-1.compute.internal:8083] which is no longer there.**
2019-09-07T00:26:08,932 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.ReplicationThrottler - [_default_tier]: Replicant create queue is empty.
2019-09-07T00:26:08,935 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.rules.LoadRule - Assigning 'replica' for segment [qa_enriched_sale_2019-09-06T00:00:00.000Z_2019-09-07T00:00:00.000Z_2019-09-06T18:02:52.322Z] to server [ip-172-29-182-88.eu-west-1.compute.internal:8083] in tier [_default_tier]
2019-09-07T00:26:08,946 ERROR [Coordinator-Exec--0] org.apache.druid.server.coordinator.DruidCoordinator - Caught exception, ignoring so that schedule keeps going.: {class=org.apache.druid.server.coordinator.DruidCoordinator, exceptionType=class java.util.concurrent.RejectedExecutionException, exceptionMessage=Task java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@5b74b020 rejected from java.util.concurrent.ScheduledThreadPoolExecutor@55b0122a[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 21038]}
java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@5b74b020 rejected from java.util.concurrent.ScheduledThreadPoolExecutor@55b0122a[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 21038]
at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2063) ~[?:1.8.0_201]
at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:830) ~[?:1.8.0_201]
at java.util.concurrent.ScheduledThreadPoolExecutor.delayedExecute(ScheduledThreadPoolExecutor.java:326) ~[?:1.8.0_201]
at java.util.concurrent.ScheduledThreadPoolExecutor.schedule(ScheduledThreadPoolExecutor.java:533) ~[?:1.8.0_201]
at java.util.concurrent.ScheduledThreadPoolExecutor.submit(ScheduledThreadPoolExecutor.java:632) ~[?:1.8.0_201]
at org.apache.druid.server.coordinator.CuratorLoadQueuePeon.loadSegment(CuratorLoadQueuePeon.java:181) ~[druid-server-0.15.0-incubating.jar:0.15.0-incubating]
at org.apache.druid.server.coordinator.rules.LoadRule.assignReplicasForTier(LoadRule.java:313) ~[druid-server-0.15.0-incubating.jar:0.15.0-incubating]
at org.apache.druid.server.coordinator.rules.LoadRule.assignReplicas(LoadRule.java:240) ~[druid-server-0.15.0-incubating.jar:0.15.0-incubating]
at org.apache.druid.server.coordinator.rules.LoadRule.assign(LoadRule.java:102) ~[druid-server-0.15.0-incubating.jar:0.15.0-incubating]
at org.apache.druid.server.coordinator.rules.LoadRule.run(LoadRule.java:77) ~[druid-server-0.15.0-incubating.jar:0.15.0-incubating]
at org.apache.druid.server.coordinator.helper.DruidCoordinatorRuleRunner.run(DruidCoordinatorRuleRunner.java:122) ~[druid-server-0.15.0-incubating.jar:0.15.0-incubating]
at org.apache.druid.server.coordinator.DruidCoordinator$CoordinatorRunnable.run(DruidCoordinator.java:716) [druid-server-0.15.0-incubating.jar:0.15.0-incubating]
at org.apache.druid.server.coordinator.DruidCoordinator$2.call(DruidCoordinator.java:601) [druid-server-0.15.0-incubating.jar:0.15.0-incubating]
at org.apache.druid.server.coordinator.DruidCoordinator$2.call(DruidCoordinator.java:594) [druid-server-0.15.0-incubating.jar:0.15.0-incubating]
at org.apache.druid.java.util.common.concurrent.ScheduledExecutors$2.run(ScheduledExecutors.java:92) [druid-core-0.15.0-incubating.jar:0.15.0-incubating]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_201]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_201]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_201]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [?:1.8.0_201]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_201]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_201]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_201]
2019-09-07T00:26:13,967 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.helper.DruidCoordinatorSegmentInfoLoader - Starting coordination. Getting available segments.
2019-09-07T00:26:13,977 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.helper.DruidCoordinatorSegmentInfoLoader - Found [8,252] available segments.
2019-09-07T00:26:13,999 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.ReplicationThrottler - [_default_tier]: Replicant create queue still has 1 segments. Lifetime[15]. Segments [qa_enriched_sale_2019-09-06T00:00:00.000Z_2019-09-07T00:00:00.000Z_2019-09-06T18:02:52.322Z ON ip-172-29-182-88.eu-west-1.compute.internal:8083]
2019-09-07T00:26:13,999 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.rules.LoadRule - Loading in progress, skipping drop until loading is complete
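Judging from the stack trace, `LoadRule.assignReplicasForTier` calls `CuratorLoadQueuePeon.loadSegment` on the peon of the disappeared server, whose internal `ScheduledThreadPoolExecutor` has already been terminated. Submitting any task to a terminated executor throws `RejectedExecutionException` under the default `AbortPolicy`, which matches the error above. A minimal, self-contained sketch of that JDK behavior (class name is hypothetical, not Druid code):

```java
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class TerminatedExecutorDemo {
    public static void main(String[] args) {
        // A peon-like scheduler with a single worker thread.
        ScheduledThreadPoolExecutor exec = new ScheduledThreadPoolExecutor(1);

        // When the server disappears, the peon's executor is shut down.
        exec.shutdownNow();

        try {
            // A later coordinator run still tries to schedule a load on it...
            exec.schedule(() -> System.out.println("loading segment"), 1, TimeUnit.SECONDS);
        } catch (RejectedExecutionException e) {
            // ...and the default AbortPolicy rejects the task.
            System.out.println("rejected");
        }
    }
}
```

If the coordinator keeps the stale peon in its map, every subsequent run hits the same rejection, which would explain the replicant queue staying stuck until a coordinator restart rebuilds the peons.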

Then, a few minutes later, the coordinator throws:

2019-09-07T00:29:04,110 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.helper.DruidCoordinatorSegmentInfoLoader - Starting coordination. Getting available segments.
2019-09-07T00:29:04,120 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.helper.DruidCoordinatorSegmentInfoLoader - Found [8,252] available segments.
2019-09-07T00:29:04,136 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.ReplicationThrottler - [_default_tier]: Replicant create queue still has 1 segments. Lifetime[0]. Segments [qa_enriched_sale_2019-09-06T00:00:00.000Z_2019-09-07T00:00:00.000Z_2019-09-06T18:02:52.322Z ON ip-172-29-182-88.eu-west-1.compute.internal:8083]
2019-09-07T00:29:04,136 ERROR [Coordinator-Exec--0] org.apache.druid.server.coordinator.ReplicationThrottler - [_default_tier]: Replicant create queue stuck after 15+ runs!: {class=org.apache.druid.server.coordinator.ReplicationThrottler, segments=[qa_enriched_sale_2019-09-06T00:00:00.000Z_2019-09-07T00:00:00.000Z_2019-09-06T18:02:52.322Z ON ip-172-29-182-88.eu-west-1.compute.internal:8083]}
2019-09-07T00:29:04,137 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.rules.LoadRule - Loading in progress, skipping drop until loading is complete

First, I re-added one historical node, but it didn't appear in the cluster (neither in the UI nor in the coordinator logs).
Then I restarted the coordinator, and everything went back to normal (4 historicals found, segments started to rebalance).

Direct effect of the issue:
The number of unavailable segments kept increasing while the coordinator was stuck on this issue, so new data was not available for queries.
[Screenshot: graph showing the number of unavailable segments increasing]


giom-l commented Sep 13, 2019

Just remembered something!
I think I'm encountering the same issue as here: #8137

That issue was fixed in Druid 0.15.1.

Can someone confirm my point?


stale bot commented Jun 20, 2020

This issue has been marked as stale due to 280 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If this issue is still relevant, please simply write any comment. Even if closed, you can still revive the issue at any time or discuss it on the [email protected] list. Thank you for your contributions.

@stale stale bot added the stale label Jun 20, 2020

stale bot commented Jul 18, 2020

This issue has been closed due to lack of activity. If you think that is incorrect, or the issue requires additional review, you can revive the issue at any time.

@stale stale bot closed this as completed Jul 18, 2020