M3 mirrored placement algorithm should support concurrent replaces for instances #2850
Comments
cc @ryanhall07 -- @robskillington mentioned you as a potential good POC on the Chronosphere side for this. I'm working on implementing a fix now, but would love any input on the approach.

Also cc @prateek

@andrewmains12 -- checking in: is this still an issue after #2858?
Problem: This fixes issue #2850. The mirrored placement requires all shards to be available before it will process an instance replace. This makes it impossible to run a second replace after an earlier one until *all* shards in the placement are cutover, even when the scopes of the replaces do not overlap, i.e. the replace pairs own disjoint sets of shards. In practice this significantly slows down consecutive replaces and increases the risk of data loss, because the longest supported aggregation tile is 1 hour.

Solution: When processing a replace, require only the leaving instance and its peers to have their shards available. The peer instances are the instances that own the same shardset, which includes mirror peers (when the replication factor is >= 2) and any pending replaces where the specified leaving node is either the replacement or the replaced instance. New tests assert this use case.
Closing as being worked on in PR #3117.

Yes, it is. #2858 allows specifying a custom placement algorithm that could be implemented outside of this OSS code base, but it's better to fix the existing mirrored placement algorithm, as I'm doing in #3117.
Currently, the M3 mirrored placement algorithm blocks concurrent replace operations, even when the two operations are independent.
Say you have a placement like:
with i1, i2 in a shardset pair and i3, i4 in a shardset pair.
If you try to do 2 consecutive replaces, the second replace will block until the first replace finishes, e.g.:
Since replaces can take a long time to complete for long tile sizes (e.g. 1 hour), this is not ideal.
The reason this ends up blocking is that we call `MarkAllShardsAvailable` before doing the replacement (code), which means that all shards in the placement have to pass the `IsCutoverFn` and `IsCutoffFn` checks. For hour tiles, this means the second replace has to wait an additional hour or so to go through.

The point of this call (iiuc) is to make sure that the placement is in a clean state before doing any shard movement. That is, if you have shards that are already moving between nodes, you shouldn't perform a replace on those nodes.
Potential Fixes
We may be able to fix this by limiting the instances we mark available to those affected by the replace, i.e. the leaving (replaced) instances. This will allow replaces that operate on independent shardsets to proceed concurrently.