
Make snapshot deletes not block the repository during data blob deletes #86514

Merged

Conversation

@original-brownbear (Member) commented May 6, 2022

Snapshot deletes need not block other operations beyond the updates to the repository metadata at the beginning of the delete operation. All subsequent blob deletes can safely run asynchronously and concurrently with other operations.

This is the simplest possible implementation of this change that I could find. It's not the most optimal, since concurrent deletes are not guarded against trying to delete the same blobs twice. I believe this is not a big issue in practice though.
For one, we batch overlapping deletes into single operations, so we will only try to redundantly delete blobs leaked by previous operations that are part of indices still referenced (which will generally be a very limited number of blobs, I believe) and indices that went out of scope. Indices that went out of scope are deleted by listing out their blobs and deleting them in turn, which means that we likely won't be attempting all that many redundant deletes even if the same index were touched by concurrent delete operations, and even if we did, the additional memory use would be bounded.
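A minimal, hypothetical sketch of the two-phase shape described above (class, method, and field names are invented; this is not the actual Elasticsearch code). Only the repository-metadata update happens while the repository is busy; the data blob deletes run afterwards on a background pool, concurrently with other operations.

import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch only, not the actual Elasticsearch code.
public final class TwoPhaseDeleteSketch {

    private final ExecutorService snapshotPool = Executors.newFixedThreadPool(2);

    public CompletableFuture<Void> deleteSnapshot(String snapshotId) {
        // Phase 1 (blocks other repository operations): write the new root metadata
        // (the index-N blob) that no longer references the deleted snapshot.
        List<String> staleBlobs = writeUpdatedRepositoryMetadata(snapshotId);

        // Phase 2 (non-blocking): remove the now-unreferenced data blobs asynchronously,
        // concurrently with whatever repository operation starts next.
        return CompletableFuture.runAsync(() -> staleBlobs.forEach(this::deleteBlobQuietly), snapshotPool);
    }

    private List<String> writeUpdatedRepositoryMetadata(String snapshotId) {
        return List.of(); // placeholder
    }

    private void deleteBlobQuietly(String blobName) {
        // placeholder; a failure here only leaves a blob to be picked up by a later delete
    }
}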

@original-brownbear (Member Author) commented:
Jenkins run
elasticsearch-ci/part-2 (unrelated)


@Override
public void onFailure(Exception e) {
submitUnbatchedTask(
original-brownbear (Member Author) commented:
Pulled this branch out of removeSnapshotDeletionFromClusterState to make the code a little less confusing than it already is with the listenersHandler.

assertThat(Strings.toString(result), result.blobs(), equalTo(0L));
assertThat(Strings.toString(result), result.bytes(), equalTo(0L));
startCleaner();
if (result.bytes() > 0L || result.blobs() > 0L) {
original-brownbear (Member Author) commented:

I believe the only reason we did not see this earlier is timing. Previously a delete would be a hard barrier that would always clean up all dangling blobs from previous operations. Now, operations may create dangling blobs while a delete is running and deletes won't pick those up. Theoretically, this failure mode was possible prior to the changes here as far as I can see.

DaveCTurner (Contributor) commented:

Hmm. What we're really trying to say here is that explicit repo cleanup is unnecessary because other operations also clean things up properly. With this change we're kind of saying the opposite. Can we make those failures clean up after themselves more robustly? If not could we at least do a delete before the no-op cleanup (after acquiring all the permits) instead of the first cleanup?

original-brownbear (Member Author) commented:

Making this more robust might be interesting, but I wonder if it's worth it in the real world. Technically it wouldn't be so hard to make a data node fire off a bunch of deletes for the stuff it just wrote in case it runs into an abort, I guess. I'm not sure the added complexity is worth it given how little data this tends to leak in practice, but it sure would make aborts cleaner in theory. I think I like the idea :) should I do that in a separate PR maybe?

If not could we at least do a delete before the no-op cleanup (after acquiring all the permits) instead of the first cleanup?

It's tricky right? We'd have to do a delete that touches all the indices that could have something leaked in them. Which I guess (unless we're doing something really tricky) means we'd have to run a delete that touches a set of snapshots such that all indices in the repo are touched? It shouldn't be endlessly hard to do that but is it really worth it, given that we have a bunch of focused tests around deletes and not leaking blobs elsewhere?
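A hypothetical sketch of the "clean up after an abort" idea floated above (a possible follow-up, not part of this PR; all names are invented for illustration): a data node remembers the blobs it has uploaded for a shard snapshot so far and, if the snapshot is aborted, fires off best-effort deletes for exactly those blobs.

import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch only, not the actual Elasticsearch code.
public final class AbortCleanupSketch {

    private final List<String> uploadedBlobs = new ArrayList<>();

    public void onBlobUploaded(String blobName) {
        uploadedBlobs.add(blobName);
    }

    public void onAbort() {
        for (String blob : uploadedBlobs) {
            try {
                deleteBlob(blob);
            } catch (RuntimeException e) {
                // best effort: a failure here just means the blob stays leaked until a
                // later delete or an explicit repository cleanup touches this shard
            }
        }
        uploadedBlobs.clear();
    }

    private void deleteBlob(String blobName) {
        // placeholder for the actual repository delete call
    }
}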

@@ -270,7 +268,7 @@ public void testRepositoryConflict() throws Exception {
     final String snapshot1 = "test-snap1";
     client().admin().cluster().prepareCreateSnapshot(repo, snapshot1).setWaitForCompletion(true).get();
     String blockedNode = internalCluster().getMasterName();
-    ((MockRepository) internalCluster().getInstance(RepositoriesService.class, blockedNode).repository(repo)).blockOnDataFiles();
+    blockMasterOnWriteIndexFile(repo);
original-brownbear (Member Author) commented:

Blocking data files doesn't block CS updates any longer.

@original-brownbear marked this pull request as ready for review May 31, 2022 09:11
@elasticmachine added the Team:Distributed (Obsolete) label May 31, 2022
@elasticmachine (Collaborator) commented:
Pinging @elastic/es-distributed (Team:Distributed)

@elasticsearchmachine (Collaborator) commented:
Hi @original-brownbear, I've created a changelog YAML for you.

@DaveCTurner (Contributor) left a comment:

General approach seems ok to me (subject to the usual concerns about losing track of blobs and having to fall back to listing things). I left a few small comments.


@Override
public void onMetaUpdated(RepositoryData updatedRepoData) {
removeSnapshotDeletionFromClusterState(
DaveCTurner (Contributor) commented:

I worry that completely dropping the deletion from the cluster state here before completion will lead to (a higher probability of) shard-level blob leaks in situations like master failovers. As things stand today I think the new master would retry the cleanups of these shards, whereas with this change we have to do another delete that touches the same indices.

original-brownbear (Member Author) commented:

It's true that this leads to a little more leaking at the shard level, statistically speaking, but I'm not so worried about it. If you think about it, it's quite unlikely for an index to run into a partially failed delete and then not be touched again for a long time. The only scenario in which this would realistically happen is when a delete fails for the last regular backup of a shard/index that has moved to frozen and thus still has a snapshot around.
In that case though, the contents of the frozen-tier backing snapshot are probably the same as those of the last backup anyway, so I don't think it matters much.
Master fail-overs are unlikely, and I can't really see a common case where we'd have a massive diff between snapshots in a shard and then never touch it again, so this seems not too important.
Also note that this brings us back to how things worked pre-7.6 in a sense (where we didn't track the repository generation in the cluster state and thus the second delete after a fail-over wouldn't do anything), and in my recollection we hardly had any blobs leak past 7.4 back then (we did collect some data on Cloud about that).

Tanguy (Member) commented:

I share David's concerns but on the other hand I think this change has benefits that outweigh the (apparently unlikely) inconvenience of leaking blobs, so I'm +1 on this as long as we keep an eye on this on Cloud

@DaveCTurner (Contributor) left a comment:

LGTM

@original-brownbear (Member Author) commented:
Thanks David and Tanguy!

@original-brownbear merged commit 55acdfa into elastic:master Jun 16, 2022
@original-brownbear deleted the 2-step-snapshot-delete branch June 16, 2022 10:38
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Jun 27, 2022
As a result of elastic#86514 we now remove the snapshot delete from the cluster state
before cleaning up stale shard-level blobs, so we have to block the master earlier,
on deleting the root-level index-N blob, to get the desired cluster state where the
delete and the snapshot creation are both in the cluster state at the same time.

closes elastic#88061
original-brownbear added a commit that referenced this pull request Jun 28, 2022
…#88065)

As a result of #86514 we now remove the snapshot delete from the cluster state
before cleaning up stale shard-level blobs, so we have to block the master earlier,
on deleting the root-level index-N blob, to get the desired cluster state where the
delete and the snapshot creation are both in the cluster state at the same time.

closes #88061
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Aug 24, 2022
For large snapshot clusters the final cleanup step that removes previous shard
metadata might have to delete tens of thousands of blobs. This can take many
seconds, during which the repository is needlessly blocked from running subsequent
finalizations or deletes.
Similar to how the final cleanup step was made async and taken out of the
finalization loop for deletes in elastic#86514, this PR moves the final cleanup
step for finalization async as well.
original-brownbear added a commit that referenced this pull request Aug 25, 2022
For large snapshot clusters the final cleanup step that removes previous shard
metadata might have to delete tens of thousands of blobs. This can take many
seconds, during which the repository is needlessly blocked from running subsequent
finalizations or deletes.
Similar to how the final cleanup step was made async and taken out of the
finalization loop for deletes in #86514, this PR moves the final cleanup step for
finalization async as well.
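A hypothetical sketch of the pattern described in the commit message above (invented names, not the actual Elasticsearch code): the finalization result is handed to its listener first, which releases the repository for the next operation, and only then is the potentially large stale-metadata cleanup scheduled on a background executor.

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Consumer;

// Hypothetical sketch only, not the actual Elasticsearch code.
public final class AsyncFinalizationCleanupSketch {

    private final ExecutorService background = Executors.newSingleThreadExecutor();

    public void finalizeSnapshot(Consumer<String> listener, List<String> staleShardMetadata) {
        String newRepositoryGeneration = writeNewRootMetadata();

        // Complete the finalization immediately so follow-up finalizations or deletes
        // are not blocked behind the cleanup of tens of thousands of blobs.
        listener.accept(newRepositoryGeneration);

        // Clean up previous shard-level metadata off the critical path.
        background.execute(() -> staleShardMetadata.forEach(this::deleteBlobQuietly));
    }

    private String writeNewRootMetadata() {
        return "index-N"; // placeholder
    }

    private void deleteBlobQuietly(String blobName) {
        // placeholder; failures are tolerated and picked up by later operations
    }
}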
@original-brownbear restored the 2-step-snapshot-delete branch April 18, 2023 20:36
Labels
:Distributed Coordination/Snapshot/Restore, >enhancement, Team:Distributed (Obsolete), v8.4.0