When a massive balance/load/replication is happening on the Historicals, the Coordinator does ONLY that (load/drop/replicate segments), ignoring the MiddleManagers. The Coordinator logs look like:
May 19 10:07:00 druid-master-2 java[18008]: 2021-05-19T10:07:00,045 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.rules.LoadRule - Assigning 'replica' for segment [xxx] to server [yyy] in tier [zzz]
May 19 10:07:00 druid-master-2 java[18008]: 2021-05-19T10:07:00,046 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.rules.LoadRule - Loading in progress, skipping drop until loading is complete
The tasks on the MiddleManagers then enter a permanent loop, printing a line similar to the following many times (hundreds?) until the task is marked as failed due to timeout:
2021-05-19T09:41:49,584 INFO [coordinator_handoff_scheduled_0] org.apache.druid.segment.realtime.plumber.CoordinatorBasedSegmentHandoffNotifier - Still waiting for Handoff for Segments : [[SegmentDescriptor{interval=2021-05-19T00:00:00.000Z/2021-05-20T00:00:00.000Z, version='2021-05-19T00:00:00.782Z', partitionNumber=10}]]
I've seen this several times: when we added another replica to a big datasource (around 2M segments, 35 TB); right now, while migrating from one DC to another (drop everything from the old servers, load everything onto the new ones, around 70 TB of data); and whenever some sort of BIG balancing happens on the Historicals, for instance when you add a new server because an existing one was almost full, so the new one needs to load a lot while the others drop.
There is a "workaround".
When we see lots of failing tasks, we check if this is the cause.
Once confirmed, we restart the current leader Coordinator, forcing a failover to the other one.
For a while, the new Coordinator will "ACK" the hand-off of the tasks and they will succeed.
Moments later, it will start the "really long balancing/dropping/loading" and begin ignoring the MiddleManagers again.
The "workaround": schedule a forced failover in crontab every 30 minutes. That way, every 30 minutes a new Coordinator will "take the lead" and ACK hand-offs for a while before going full obsessive over the Historicals.
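For reference, the cron entry looks roughly like this. It's only a sketch: it assumes the Coordinators run as a systemd service named druid-coordinator (the service name is our assumption, adjust to your deployment) and that restarting the leader is enough to make the standby take over.

```shell
# /etc/crontab on the leader Coordinator host (service name is hypothetical).
# Every 30 minutes, restart the Coordinator so leadership fails over to the standby.
*/30 * * * * root systemctl restart druid-coordinator
```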
I've seen this issue since at least 0.18. I have a thread open on the Google Groups mailing list trying to figure out whether it has happened to anyone else, in case we're doing something wrong, but it really looks like a bug.
Affected Version
0.20.1