M3 mirrored placement algorithm should support concurrent replaces for instances #2850
Comments
cc @ryanhall07 -- @robskillington mentioned you as a potential good POC on the Chronosphere side for this. I'm working on implementing a fix now, but would love any input on the approach.

Also cc @prateek

@andrewmains12 -- checking in: is this still an issue after #2858?
Problem: This fixes issue #2850. The mirrored placement requires all shards to be available before it will process an instance replace. This makes it impossible to run a second replace after an earlier one until *all* shards in the placement are cutover, even when the scopes of the replaces do not overlap, i.e. the replace pairs own disjoint sets of shards. In practice this significantly slows down consecutive replaces and increases the risk of data loss, because the longest supported aggregation tile is 1 hour.

Solution: When processing a replace, require only the leaving instance and its peers to have their shards available. The peer instances are the instances that own the same shardset, which includes mirror peers (when the replication factor is >= 2) and any pending replaces where the specified leaving node is either the replacement or the replaced instance. New tests assert this use case.
Closing as being worked on in PR #3117.

Yes, it is. #2858 allows specifying a custom placement algorithm that could be implemented outside of this OSS code base, but it's better to fix the existing mirrored placement algorithm, as I'm doing in #3117.
Currently, the M3 mirrored placement algorithm blocks concurrent replace operations, even when the two operations are independent.
Say you have a placement like:
with i1, i2 in a shardset pair and i3, i4 in a shardset pair.
If you try to do 2 consecutive replaces, the second replace will block until the first replace finishes, e.g.:
Since replaces can take a long time to complete for long tile sizes (e.g. 1 hour), this is not ideal.
The reason this ends up blocking is that we call `MarkAllShardsAvailable` before doing the replacement (code), which means that all shards in the placement have to pass the `IsCutoverFn` and `IsCutoffFn` checks. For hour tiles, this means the second replace has to wait an additional hour or so to go through.

The point of this call (iiuc) is to make sure that the placement is in a clean state before doing any shard movement. That is, if you have shards that are already moving between nodes, you shouldn't perform a replace on those nodes.
Potential Fixes
We may be able to fix this by limiting the instances we mark available to those affected by the replace, i.e. the leaving (replaced) instances. This will allow replaces that operate on independent shardsets to proceed concurrently.