Make snapshot deletes not block the repository during data blob deletes #86514
Conversation
@Override
public void onFailure(Exception e) {
    submitUnbatchedTask(
Pulled this branch out of removeSnapshotDeletionFromClusterState to make the code a little less confusing than it already is with the listenersHandler.
assertThat(Strings.toString(result), result.blobs(), equalTo(0L));
assertThat(Strings.toString(result), result.bytes(), equalTo(0L));
startCleaner();
if (result.bytes() > 0L || result.blobs() > 0L) {
I believe the only reason we did not see this earlier is timing. Previously, a delete was a hard barrier that would always clean up all dangling blobs from previous operations. Now, operations may create dangling blobs while a delete is running, and that delete won't pick those up. Theoretically, this failure mode was possible before the changes here as far as I can see.
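For illustration, the adjusted expectation can be modeled roughly like this in plain Java (`CleanupResult`, `Cleaner`, and `runCleanup` are hypothetical stand-ins, not the actual test code):

```java
// Minimal model of the adjusted expectation; names are hypothetical, not Elasticsearch test code.
public class CleanupConvergenceExample {

    public record CleanupResult(long blobs, long bytes) {}

    public interface Cleaner {
        CleanupResult runCleanup();
    }

    /**
     * A first cleanup pass may legitimately find blobs leaked by operations that ran while a
     * delete was in progress. After that pass, a second cleanup must find nothing left to delete.
     */
    public static void assertCleanupConverges(Cleaner cleaner) {
        CleanupResult first = cleaner.runCleanup();
        if (first.blobs() > 0L || first.bytes() > 0L) {
            CleanupResult second = cleaner.runCleanup();
            if (second.blobs() != 0L || second.bytes() != 0L) {
                throw new AssertionError("cleanup did not converge: " + second);
            }
        }
    }
}
```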
Hmm. What we're really trying to say here is that explicit repo cleanup is unnecessary because other operations also clean things up properly. With this change we're kind of saying the opposite. Can we make those failures clean up after themselves more robustly? If not could we at least do a delete before the no-op cleanup (after acquiring all the permits) instead of the first cleanup?
Making this more robust might be interesting, but I wonder if it's worth it in the real world. Technically it wouldn't be so hard to make a data node fire off a bunch of deletes for the stuff it just wrote in case it runs into an abort (a rough sketch of the idea follows below). I'm not sure the added complexity is worth it given how little data this tends to leak in practice, but it sure would make aborts cleaner in theory. I think I like the idea :) should I do that in a separate PR maybe?

> If not could we at least do a delete before the no-op cleanup (after acquiring all the permits) instead of the first cleanup?

It's tricky, right? We'd have to do a delete that touches all the indices that could have something leaked in them, which I guess (unless we're doing something really tricky) means running a delete that touches a set of snapshots such that all indices in the repo are touched. It wouldn't be endlessly hard to do, but is it really worth it, given that we have a bunch of focused tests around deletes and not leaking blobs elsewhere?
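For what it's worth, a rough sketch of that data-node-side abort cleanup could look like this (plain Java, entirely hypothetical names; not a proposal for the actual BlobStoreRepository code):

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: a shard upload tracks the blobs it has written so far and, if the
// snapshot is aborted, issues best-effort deletes for exactly those blobs.
public class AbortAwareShardUpload {

    public interface BlobStore {
        void write(String name, byte[] data);
        void deleteIgnoringErrors(Set<String> names);
    }

    private final BlobStore blobStore;
    private final Set<String> writtenBlobs = ConcurrentHashMap.newKeySet();

    public AbortAwareShardUpload(BlobStore blobStore) {
        this.blobStore = blobStore;
    }

    public void writeBlob(String name, byte[] data) {
        blobStore.write(name, data);
        writtenBlobs.add(name); // remember what we wrote in case we get aborted later
    }

    public void onAbort() {
        // Best effort: remove the blobs this upload created so an abort leaks nothing.
        blobStore.deleteIgnoringErrors(Set.copyOf(writtenBlobs));
    }
}
```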
@@ -270,7 +268,7 @@ public void testRepositoryConflict() throws Exception {
     final String snapshot1 = "test-snap1";
     client().admin().cluster().prepareCreateSnapshot(repo, snapshot1).setWaitForCompletion(true).get();
     String blockedNode = internalCluster().getMasterName();
-    ((MockRepository) internalCluster().getInstance(RepositoriesService.class, blockedNode).repository(repo)).blockOnDataFiles();
+    blockMasterOnWriteIndexFile(repo);
Blocking data files doesn't block CS updates any longer.
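For context, the blocking idea the updated test relies on can be sketched like this (a plain-Java stand-in with hypothetical names, not MockRepository's actual API): hold the writer only when it touches the root-level `index-N` metadata blob, since blocking data blobs no longer stalls the cluster state update the test wants to observe.

```java
import java.util.concurrent.CountDownLatch;

// Hypothetical test-fixture sketch, not MockRepository itself.
public class BlockingRepositoryStub {

    private final CountDownLatch blocker = new CountDownLatch(1);
    private volatile boolean blockOnIndexBlob = false;

    public void blockOnWriteIndexFile() {
        blockOnIndexBlob = true;
    }

    public void unblock() {
        blocker.countDown();
    }

    public void writeBlob(String blobName, byte[] data) throws InterruptedException {
        if (blockOnIndexBlob && blobName.startsWith("index-")) {
            blocker.await(); // hold the repository metadata write until the test releases it
        }
        // ... actually write the blob ...
    }
}
```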
Pinging @elastic/es-distributed (Team:Distributed)
Hi @original-brownbear, I've created a changelog YAML for you.
General approach seems ok to me (subject to the usual concerns about losing track of blobs and having to fall back to listing things). I left a few small comments.
@Override
public void onMetaUpdated(RepositoryData updatedRepoData) {
    removeSnapshotDeletionFromClusterState(
I worry that completely dropping the deletion from the cluster state here before completion will lead to (a higher probability of) shard-level blob leaks in situations like master failovers. As things stand today I think the new master would retry the cleanups of these shards, whereas with this change we have to do another delete that touches the same indices.
It's true that this leads to a little more leaking on the shard level, statistically speaking, but I'm not so worried about it. It's quite unlikely for an index to run into a partially failed delete and then not be touched again for a long time. The only scenario in which this would realistically happen is when a delete fails for the last regular backup of a shard/index that has moved to frozen and thus still has a snapshot around.
In that case, though, the contents of the frozen-tier backing snapshot are probably the same as those of the last backup anyway, so I'm not sure it matters much.
Master fail-overs are unlikely, and I can't really see a common case where we'd have a massive diff between snapshots in a shard and then never touch it again, so this seems not too important?
Also note that this brings us back to how things worked pre-7.6 in a sense (where we didn't track the repository generation in the CS and thus the second delete after fail-over wouldn't do anything), and in my recollection we hardly had any blobs leak past 7.4 back then (we did collect some data on Cloud about that).
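To make the callback split under discussion concrete, here is a minimal sketch of the two-phase shape (hypothetical names; the actual SnapshotDeleteListener interface in this PR may differ in detail):

```java
// Hypothetical two-phase listener: the delete leaves the cluster state as soon as the
// updated repository metadata (RepositoryData) has been written, while blob deletes continue async.
public interface TwoPhaseDeleteListener<R> {

    // Called once the updated repository metadata is safely written. At this point the
    // deletion entry can be removed from the cluster state and other operations may proceed.
    void onMetaUpdated(R updatedRepoData);

    // Called once all stale data blobs have actually been deleted (or the attempt failed).
    void onDone();

    void onFailure(Exception e);
}
```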
I share David's concerns, but on the other hand I think this change has benefits that outweigh the (apparently unlikely) inconvenience of leaking blobs, so I'm +1 on this as long as we keep an eye on it on Cloud.
LGTM
Thanks David and Tanguy!
…#88065) As a result of #86514 we now remove the snapshot delete from the cluster state before cleaning up stale shard-level blobs, so we have to block the master earlier, on deleting the root-level index-N blob, to get the desired cluster state in which the delete and the snapshot creation are both present at the same time. Closes #88061
For large snapshot clusters the final cleanup step that removes previous shard metadata might have to delete tens of thousands of blobs. This can take many seconds, during which the repository is needlessly blocked from running subsequent finalizations or deletes. Similar to how the final cleanup step for deletes was made async and taken out of the finalization loop in #86514, this PR makes the final cleanup step for finalization async as well.
Snapshot deletes need not block other operations beyond the updates to the repository metadata at the beginning of the delete operation. All subsequent blob deletes can safely run async and concurrently with other operations.
This is the simplest implementation of this change that I could find. It is not optimal, since concurrent deletes are not guarded against trying to delete the same blobs twice, but I believe this is not a big issue in practice.
For one, we batch overlapping deletes into single operations, so we will only try to redundantly delete blobs leaked by previous operations that are part of indices still referenced (which will generally be a very limited number of blobs) and indices that went out of scope. Indices that went out of scope are deleted by listing out blobs and deleting them in turn, which means we likely won't attempt all that many redundant deletes even if the same index were touched by concurrent delete operations, and even if we did, the additional memory use would be bounded.
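As a minimal, self-contained model of that flow (plain Java, with an executor standing in for the snapshot thread pool; none of these names are the real Elasticsearch APIs):

```java
import java.util.Collection;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical model: the repository is only "blocked" while the new metadata is written;
// the potentially slow data-blob deletes are forked off and run concurrently with later operations.
public class AsyncSnapshotDelete {

    public interface Repository {
        long writeUpdatedMetadata(Collection<String> snapshotsToDelete); // fast, must be serialized
        void deleteStaleBlobs(long newGeneration);                       // slow, safe to run async
    }

    private final Repository repository;
    private final ExecutorService snapshotPool = Executors.newFixedThreadPool(2);

    public AsyncSnapshotDelete(Repository repository) {
        this.repository = repository;
    }

    public void deleteSnapshots(Collection<String> snapshotsToDelete, Runnable onMetaUpdated, Runnable onDone) {
        // Step 1: update the repository metadata; once this returns, the delete no longer
        // needs to block other finalizations or deletes.
        long newGeneration = repository.writeUpdatedMetadata(snapshotsToDelete);
        onMetaUpdated.run();

        // Step 2: clean up the now-unreferenced data blobs in the background. Redundant
        // deletes of already-removed blobs are tolerated rather than coordinated away.
        snapshotPool.execute(() -> {
            repository.deleteStaleBlobs(newGeneration);
            onDone.run();
        });
    }
}
```

The real change is of course expressed in terms of BlobStoreRepository, cluster state updates, and listeners rather than a bare executor, but the ordering is the same: repository metadata first, blob deletes in the background.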