
Parallelize stale index deletion #100316

Merged

Conversation

DaveCTurner
Contributor

After deleting a snapshot today we clean up all the now-dangling indices
sequentially, which can be rather slow. With this commit we parallelize
the work across the whole `SNAPSHOT` pool on the master node.

Closes #61513

Co-authored-by: Piyush Daftary <[email protected]>
@DaveCTurner DaveCTurner added >enhancement :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs v8.12.0 labels Oct 5, 2023
@elasticsearchmachine elasticsearchmachine added the Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. label Oct 5, 2023
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@elasticsearchmachine
Collaborator

Hi @DaveCTurner, I've created a changelog YAML for you.

@DaveCTurner
Contributor Author

@elasticmachine please run elasticsearch-ci/part-1

@DaveCTurner DaveCTurner requested a review from ywangd October 5, 2023 05:45
@DaveCTurner
Contributor Author

I'm slightly concerned about the lack of backpressure in this area: we could in theory end up with an ever-increasing pile of delete work in the queue. Previously that would eventually have blocked all the snapshot threads, but with this change even that doesn't happen any more. I've raised this point for discussion with the team.

@DaveCTurner
Contributor Author

DaveCTurner commented Oct 5, 2023

One possible solution would be to preserve today's behaviour of always blocking at least one SNAPSHOT thread for each cleanup operation, so that falling behind would still eventually end up with all the SNAPSHOT threads processing deletes, slowing down subsequent snapshots. Obviously we don't just want to pause that thread, but we could use it to eagerly steal work from the queue:

diff --git a/server/src/main/java/org/elasticsearch/repositories/blobstore/BlobStoreRepository.java b/server/src/main/java/org/elasticsearch/repositories/blobstore/BlobStoreRepository.java
index fa699a76fadd..f7f9ed22efe4 100644
--- a/server/src/main/java/org/elasticsearch/repositories/blobstore/BlobStoreRepository.java
+++ b/server/src/main/java/org/elasticsearch/repositories/blobstore/BlobStoreRepository.java
@@ -147,6 +147,7 @@ import java.util.concurrent.ConcurrentHashMap;
 import java.util.concurrent.Executor;
 import java.util.concurrent.LinkedBlockingQueue;
 import java.util.concurrent.TimeUnit;
+import java.util.concurrent.atomic.AtomicBoolean;
 import java.util.concurrent.atomic.AtomicLong;
 import java.util.concurrent.atomic.AtomicReference;
 import java.util.function.Consumer;
@@ -1194,6 +1195,28 @@ public abstract class BlobStoreRepository extends AbstractLifecycleComponent imp
                 }));
             }
         }
+
+        // dedicate a single SNAPSHOT thread for this work, so that if we fall too far behind with deletes then eventually we stop taking
+        // snapshots too
+        threadPool.executor(ThreadPool.Names.SNAPSHOT).execute(new AbstractRunnable() {
+            @Override
+            protected void doRun() {
+                final AtomicBoolean isDone = new AtomicBoolean(true);
+                final Releasable ref = () -> isDone.set(true);
+                ActionListener<Releasable> nextTask;
+                while ((nextTask = staleBlobDeleteRunner.takeNextTask()) != null) {
+                    isDone.set(false);
+                    nextTask.onResponse(ref);
+                    assert isDone.get();
+                }
+            }
+
+            @Override
+            public void onFailure(Exception e) {
+                logger.error("unexpected failure while processing deletes on dedicated snapshot thread", e);
+                assert false : e;
+            }
+        });
     }

     /**

@DaveCTurner
Contributor Author

@elasticmachine please run elasticsearch-ci/part-2

Member

@ywangd ywangd left a comment

This PR has surprisingly rich context and interesting technical details. Thanks for the chance to review. My comments are mostly for my own education. Thanks!

Comment on lines +1176 to +1194
final var survivingIndexIds = newRepoData.getIndices().values().stream().map(IndexId::getId).collect(Collectors.toSet());
for (final var indexEntry : foundIndices.entrySet()) {
    final var indexSnId = indexEntry.getKey();
    if (survivingIndexIds.contains(indexSnId)) {
        continue;
    }
    staleBlobDeleteRunner.enqueueTask(listeners.acquire(ref -> {
        try (ref) {
            logger.debug("[{}] Found stale index [{}]. Cleaning it up", metadata.name(), indexSnId);
            final var deleteResult = indexEntry.getValue().delete(OperationPurpose.SNAPSHOT);
            blobsDeleted.addAndGet(deleteResult.blobsDeleted());
            bytesDeleted.addAndGet(deleteResult.bytesDeleted());
            logger.debug("[{}] Cleaned up stale index [{}]", metadata.name(), indexSnId);
        } catch (IOException e) {
            logger.warn(() -> format("""
                [%s] index %s is no longer part of any snapshot in the repository, \
                but failed to clean up its index folder""", metadata.name(), indexSnId), e);
        }
    }));
Member

Do we want to keep this and the code above in their own separate methods, as they are now? Fewer nesting levels could be helpful.

Contributor Author

I inlined these because we ended up having to pass really quite a lot of parameters in, and it wasn't really even a coherent set of parameters so much as just "the things needed to run this method". It's still less than one screenful (on my screen anyway) and nicely exposes the execution pattern, so tbh I prefer it as it is now.

Contributor Author

This is actually kind of a code smell throughout this class (and the snapshot codebase more generally). I have been working on a refactoring that should help simplify things in this area and will open a followup in the next few days.

@@ -295,4 +297,58 @@ public void testRepositoryConflict() throws Exception {
        logger.info("--> wait until snapshot deletion is finished");
        assertAcked(future.actionGet());
    }

    public void testLeakedStaleIndicesAreDeletedBySubsequentDelete() throws Exception {
Member

This is a cool test and I learnt a few tips and tricks from it. But it does not test the new parallelization change. Do we care?

Contributor Author

This test is largely from the original contributor, but I think it's a reasonable test to write. I've played around with a few ideas for testing the new threading more precisely but it seems pretty tricky, and kinda doesn't matter so much as long as we do actually do the work somehow. I think I had an idea for a test tho.

Contributor Author

Ok my idea seems to work, see testCleanupStaleBlobsConcurrency added in b56455c (sorry for the force-push)

Comment on lines 178 to 180
if (isDone.get() == false) {
    logger.error("runSyncTasksEagerly() was called on a queue [{}] containing an async task: [{}]", taskRunnerName, task);
    assert false;
Member

I think this is best-effort, right? A misbehaving task can still fork but release the reference?

Contributor Author

Yes indeed but that doesn't matter to us really. The ref is just so that the AbstractThrottledTaskRunner can track the relevant activities to completion. If we did fork an untracked task then presumably we're tracking it somewhere else.
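To illustrate that tracking model, here is a minimal sketch in plain Java (the class and method names are hypothetical, not the real AbstractThrottledTaskRunner): the runner only counts a task while the ref it handed out remains unreleased, so a task that releases its ref and then forks simply drops out of the runner's accounting and must be tracked elsewhere.

import java.util.concurrent.atomic.AtomicInteger;

class TrackedTaskRunnerSketch {
    private final AtomicInteger runningTasks = new AtomicInteger();

    // Hand a ref to each task; the task counts as "running" until the ref is closed.
    AutoCloseable acquireRef() {
        runningTasks.incrementAndGet();
        return runningTasks::decrementAndGet; // closing the ref ends the tracking
    }

    // Only work still holding an unreleased ref is the runner's responsibility.
    int runningTaskCount() {
        return runningTasks.get();
    }
}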

Comment on lines 1208 to 1227
threadPool.executor(ThreadPool.Names.SNAPSHOT).execute(new AbstractRunnable() {
    @Override
    protected void doRun() {
        staleBlobDeleteRunner.runSyncTasksEagerly();
    }

    @Override
    public void onFailure(Exception e) {
        logger.error("unexpected failure while processing deletes on dedicated snapshot thread", e);
        assert false : e;
    }

    @Override
    public void onRejection(Exception e) {
        if (e instanceof EsRejectedExecutionException esre && esre.isExecutorShutdown()) {
            return;
        }
        super.onRejection(e);
    }
});
Member

Can we have this managed inside the task runner? It would be helpful for reuse.

Contributor Author

Yes that's a good point. With only one caller it's hard to know where to put the abstraction boundary, but I can see value in allowing callers to pass in an Executor. They can always use EsExecutors.DIRECT_EXECUTOR_SERVICE if they really don't want to fork.
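As a usage sketch of that Executor-parameter idea (a fragment for illustration only, assuming the runSyncTasksEagerly(Executor) signature shown later in this review):

// Fork the eager draining onto a SNAPSHOT thread, as this PR does:
staleBlobDeleteRunner.runSyncTasksEagerly(threadPool.executor(ThreadPool.Names.SNAPSHOT));

// Or drain on the calling thread, for a caller that really doesn't want to fork:
staleBlobDeleteRunner.runSyncTasksEagerly(EsExecutors.DIRECT_EXECUTOR_SERVICE);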

Comment on lines +188 to +189
        assertTrue(queue.isEmpty());
        assertNoRunningTasks(taskRunner);
Member

For my learning:

  1. Why do we need to assert the queue is empty here? We already asserted it has size 0 four lines above. I don't see how the queue size can increase after all the tasks are enqueued.
  2. In the existing assertNoRunningTasks, why is it necessary to spawn a batch of Runnables before verifying that the runningTasks size is 0? We already verified that the queue is empty, so it seems to me that just verifying runningTasks would be sufficient. Also, running the extra Runnables does not guarantee that they actually exercise every thread in the pool, so I'm not sure of the purpose here.

Contributor Author

Really we're just doing the same as the other tests here, verifying that the task runner is completely finished at the end of the test.

In assertNoRunningTasks() we need to wait for all the tasks spawned by the AbstractThrottledTaskRunner to completely finish before we can be sure the running task count is zero. That's because we call runningTasks.decrementAndGet() after completing each task, so when e.g. executedCountDown completes the count will not have reached zero yet. In order to make sure that the thread pool is completely free from all its current work, we enqueue N barrier tasks, which must therefore all be waiting on the barrier at the same time.

Put differently, if you remove that and run the tests in a loop then you should see them fail occasionally.

I'll add a comment.
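A minimal standalone sketch of that flushing idea, using only java.util.concurrent (the helper name and shape are illustrative, not the test's actual code): submitting one barrier task per pool thread means the barrier can only trip once every thread has finished whatever it was running before.

import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.ExecutorService;

// Blocks until every thread in a fixed-size pool has finished its current work:
// all poolSize barrier tasks must be waiting on the barrier at the same time.
static void flushThreadPool(ExecutorService pool, int poolSize) throws Exception {
    final CyclicBarrier barrier = new CyclicBarrier(poolSize + 1); // +1 for the caller
    for (int i = 0; i < poolSize; i++) {
        pool.execute(() -> {
            try {
                barrier.await();
            } catch (Exception e) {
                throw new AssertionError(e);
            }
        });
    }
    barrier.await(); // returns only once every pool thread is parked on the barrier
}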

Member

Sorry for dragging on this point, but I think this thread-flushing behaviour is not necessary for this test, because the releasable is used in a try-with-resources block and should be released before executedCountDown finishes. The behaviour is needed for testEnqueueSpawnsNewTasksUpToMax because that test explicitly closes the releasable after the countDown. If I flip the order, it also runs successfully 10K times without the thread-flushing logic.

Contributor Author

Ah hm I see what you mean now. Yes I think you're right we could simplify this if we made sure to release all the task refs before allowing that test to proceed.

@DaveCTurner DaveCTurner force-pushed the 2023/10/05/stale-index-deletion-speedup branch from 0aef608 to b56455c on October 5, 2023 15:47
Member

@ywangd ywangd left a comment

LGTM

The new concurrency test is pretty awesome 👍

* Run a single task on the given executor which eagerly pulls tasks from the queue and executes them. This must only be used if the
* tasks in the queue are all synchronous, i.e. they release their ref before returning from {@code onResponse()}.
*/
public void runSyncTasksEagerly(Executor executor) {
Member

Why not have an overloaded version of runSyncTasksEagerly() that just uses the TaskRunner's own executor for the eager run as well?

Contributor Author

@DaveCTurner DaveCTurner Oct 9, 2023

Yes, that could work too. I'm going to wait and see on that idea though; I'd rather do one or the other, and the choice depends on whether we have other users that want to do something else or whether everyone just forks another task on AbstractThrottledTaskRunner#executor.
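For comparison, the overload variant would be a trivial wrapper; a hypothetical sketch (not part of this PR as merged), assuming the runner keeps its own executor in a field named executor:

// Hypothetical convenience overload: always fork the eager run on the runner's own executor.
public void runSyncTasksEagerly() {
    runSyncTasksEagerly(executor);
}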

Contributor

@henningandersen henningandersen left a comment

LGTM.

@DaveCTurner DaveCTurner merged commit cadcb9b into elastic:main Oct 9, 2023
@DaveCTurner DaveCTurner deleted the 2023/10/05/stale-index-deletion-speedup branch October 9, 2023 13:07
@DaveCTurner
Contributor Author

Thanks both!

@DaveCTurner DaveCTurner restored the 2023/10/05/stale-index-deletion-speedup branch June 17, 2024 06:17