When a massive balance/load/replication is happening on the Historicals, the Coordinator does ONLY that (load/drop/replicate segments), ignoring the MiddleManagers. The Coordinator logs look like:
May 19 10:07:00 druid-master-2 java[18008]: 2021-05-19T10:07:00,045 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.rules.LoadRule - Assigning 'replica' for segment [xxx] to server [yyy] in tier [zzz]
May 19 10:07:00 druid-master-2 java[18008]: 2021-05-19T10:07:00,046 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.rules.LoadRule - Loading in progress, skipping drop until loading is complete
The tasks on the MiddleManagers then enter a permanent loop, printing a line similar to the following many times (hundreds?) until the task is marked as failed due to timeout:
2021-05-19T09:41:49,584 INFO [coordinator_handoff_scheduled_0] org.apache.druid.segment.realtime.plumber.CoordinatorBasedSegmentHandoffNotifier - Still waiting for Handoff for Segments : [[SegmentDescriptor{interval=2021-05-19T00:00:00.000Z/2021-05-20T00:00:00.000Z, version='2021-05-19T00:00:00.782Z', partitionNumber=10}]]
I've seen this several times: when we added another replica to a big datasource (around 2M segments, 35 TB); right now, while migrating from one DC to another (drop everything from the old servers, load everything onto the new ones, around 70 TB of data); and whenever some sort of BIG balancing happens on the Historicals, for instance when you add a new server because an existing one was almost full, so the new one needs to load a lot while the others drop.
There is a "workaround".
When we see lots of failing tasks, we check if this is the cause.
Once confirmed, we restart the current leader Coordinator, forcing a failover to the other one.
For a while, the new Coordinator will "ACK" the hand-off of the tasks and they will succeed.
Moments later, it will start the "really long balancing/dropping/loading" and begin ignoring the MiddleManagers again.
The "workaround": schedule a forced failover in crontab every 30 minutes. That way, every 30 minutes a new Coordinator will "take the lead" and ACK hand-offs for a while before going full obsessive over the Historicals.
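For reference, the cron entry looks roughly like this. It's only a sketch: it assumes the Coordinators run as a systemd service named druid-coordinator (the service name is our assumption, adjust to your deployment) and that restarting the leader is enough to make the standby take over.

```shell
# /etc/crontab on the leader Coordinator host (service name is hypothetical).
# Every 30 minutes, restart the Coordinator so leadership fails over to the standby.
*/30 * * * * root systemctl restart druid-coordinator
```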
I've seen this issue since at least 0.18. I have a thread open on the Google Groups mailing list trying to figure out whether it has happened to anyone else, in case we're doing something wrong, but it really looks like a bug.
Affected Version
0.20.1