Introduce SNAPSHOT_META Threadpool for Fetching Repository Metadata #73172

original-brownbear · 2021-05-17T16:19:02Z

Adds a new snapshot meta thread pool that is used to speed up the get snapshots API by loading `SnapshotInfo` in parallel. This pool is also used to load `RepositoryData`.
A follow-up would extend the use of this pool to the snapshot status API and make it run in parallel as well.

elasticmachine · 2021-05-17T16:19:06Z

Pinging @elastic/es-distributed (Team:Distributed)

original-brownbear · 2021-05-17T17:22:00Z

server/src/main/java/org/elasticsearch/threadpool/ThreadPool.java

@@ -189,6 +191,8 @@ public ThreadPool(final Settings settings, final ExecutorBuilder<?>... customBui
        builders.put(Names.REFRESH, new ScalingExecutorBuilder(Names.REFRESH, 1, halfProcMaxAt10, TimeValue.timeValueMinutes(5)));
        builders.put(Names.WARMER, new ScalingExecutorBuilder(Names.WARMER, 1, halfProcMaxAt5, TimeValue.timeValueMinutes(5)));
        builders.put(Names.SNAPSHOT, new ScalingExecutorBuilder(Names.SNAPSHOT, 1, halfProcMaxAt5, TimeValue.timeValueMinutes(5)));
+        builders.put(Names.SNAPSHOT_META, new ScalingExecutorBuilder(Names.SNAPSHOT_META, 1, 2 * allocatedProcessors,


I only used 5 min here because we're using it everywhere else for now. It seems like this should really be less.
Twice the allocated processors seemed like a reasonable guess here, since this is pretty much exclusively going to run on master nodes, which don't have endless cores available. This might be a little low in some cases, though. We could also go for an absolute value here, I guess, but in the end these threads will also be doing some heavy deserialisation work here and there, so going way beyond the CPU count is questionable. OTOH, depending on the repo implementation, an upper bound may make sense too, due to e.g. S3 SDK connection limits.

=> Happy to hear opinions here or discuss this :)

original-brownbear · 2021-05-17T17:23:14Z

...n/java/org/elasticsearch/action/admin/cluster/snapshots/get/TransportGetSnapshotsAction.java

+        // put snapshot info downloads into a task queue instead of pushing them all into the queue to not completely monopolize the
+        // snapshot meta pool for a single request
+        final int workers = Math.min(threadPool.info(ThreadPool.Names.SNAPSHOT_META).getMax(), snapshotIdsToIterate.size());
+        final Executor executor = threadPool.executor(ThreadPool.Names.SNAPSHOT_META);


Used the same logic here that we use for file uploads and, I believe, also for pre-warming, where we build this kind of fake work-stealing pool. This could be DRYed up in a follow-up since we have that same logic in a few places now.
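For readers unfamiliar with the pattern, here is a minimal sketch of the bounded-worker idea described above, assuming plain `java.util.concurrent` types rather than the Elasticsearch `ThreadPool`. The names `BoundedWorkersSketch`, `loadAll`, and `runOne` are hypothetical and are not taken from this PR; the real implementation reports results through listeners instead of blocking.

```java
import java.util.Collection;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Function;

final class BoundedWorkersSketch {

    /**
     * Applies loader to every item using at most maxWorkers concurrent tasks.
     * Items wait in a private queue instead of the executor's queue, so one
     * request never occupies more than maxWorkers slots of a shared pool.
     */
    static <T, R> Collection<R> loadAll(List<T> items, int maxWorkers, Executor executor, Function<T, R> loader) {
        if (maxWorkers < 1) {
            throw new IllegalArgumentException("maxWorkers must be >= 1");
        }
        BlockingQueue<T> queue = new LinkedBlockingQueue<>(items);
        Collection<R> results = new ConcurrentLinkedQueue<>();
        CountDownLatch remaining = new CountDownLatch(items.size());
        int workers = Math.min(maxWorkers, items.size());
        for (int i = 0; i < workers; i++) {
            executor.execute(() -> runOne(queue, results, remaining, executor, loader));
        }
        try {
            remaining.await(); // blocking is for the demo only
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for loads", e);
        }
        return results;
    }

    /** Handles one item, then re-submits itself so other requests can interleave. */
    private static <T, R> void runOne(BlockingQueue<T> queue, Collection<R> results, CountDownLatch remaining,
                                      Executor executor, Function<T, R> loader) {
        T next = queue.poll();
        if (next == null) {
            return; // queue drained, this worker retires
        }
        try {
            results.add(loader.apply(next));
        } finally {
            // Re-submit before counting down, so nothing is submitted after the caller is released.
            executor.execute(() -> runOne(queue, results, remaining, executor, loader));
            remaining.countDown();
        }
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            Collection<String> infos = loadAll(List.of("snap-1", "snap-2", "snap-3"), 2, pool, id -> "info:" + id);
            System.out.println(infos.size() + " snapshot infos loaded");
        } finally {
            pool.shutdown();
        }
    }
}
```

In this sketch a failing `loader` call is simply dropped; real code would presumably route the failure to a listener.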

tlrx

LGTM, I only left minor comments.

tlrx · 2021-05-18T08:56:55Z

...n/java/org/elasticsearch/action/admin/cluster/snapshots/get/TransportGetSnapshotsAction.java

+    }
+
+    private void getOneSnapshotInfo(boolean ignoreUnavailable,
+                                    Repository repository,


I wonder if we should retrieve the `Repository` from the `RepositoriesService` for each `SnapshotInfo` to load, so that if the repository is gone the `RepositoryMissingException` is easier to propagate through listeners (and the grouped listener, which clears the queue etc.). Otherwise I think a `RepositoryMissingException` might be thrown and caught at a higher level while we keep fetching snapshot info here.

We could I guess but it's not going to be a big cleanup/win since this situation is somewhat broken to begin with.
The behavior of the repository after close isn't well defined currently. Depending on the repo implementation the requests can either start failing or in case of FsRepository will just keep going actually because close is a noop there.
Might be worth just fixing that in general at some point?

I agree but I still think it is worth not mixing thrown exceptions and listeners here.

oh 🤦 now I get your comment. Sorry, I completely misread it for no good reason :( => Fix coming right up.

I pushed fb55daa to address this (and random formatting noise) now :) I went with this instead of looking up the repo in the loop, because the latter would be caught and suppressed by `ignoreUnavailable`, which I found weird (albeit practically irrelevant).

fb55daa looks good, thanks! And sorry if I wasn't clear at first :)
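To illustrate the point about not mixing thrown exceptions and listeners, here is a small sketch. It does not reproduce fb55daa. `Listener` is a hypothetical stand-in for a callback such as `ActionListener`, and `lookupRepository` and `getSnapshots` are made-up names. The idea shown is only that a lookup which may throw is wrapped once, up front, so that callers see every failure through the listener.

```java
import java.util.Map;

final class ListenerFailureSketch {

    /** Hypothetical callback type, standing in for something like ActionListener. */
    interface Listener<T> {
        void onResponse(T result);

        void onFailure(Exception e);
    }

    /** Records whichever callback fired; used by the demo below. */
    static final class RecordingListener implements Listener<String> {
        String response;
        Exception failure;

        @Override
        public void onResponse(String result) {
            response = result;
        }

        @Override
        public void onFailure(Exception e) {
            failure = e;
        }
    }

    /** Throws when the repository is unknown, as a registry lookup might. */
    static String lookupRepository(Map<String, String> repositories, String name) {
        String repository = repositories.get(name);
        if (repository == null) {
            throw new IllegalArgumentException("[" + name + "] missing");
        }
        return repository;
    }

    /** Resolves the repository once and reports a failed lookup via the listener. */
    static void getSnapshots(Map<String, String> repositories, String name, Listener<String> listener) {
        final String repository;
        try {
            repository = lookupRepository(repositories, name);
        } catch (Exception e) {
            listener.onFailure(e); // never escapes as a thrown exception
            return;
        }
        listener.onResponse("snapshots of " + name + " (" + repository + ")");
    }

    public static void main(String[] args) {
        RecordingListener listener = new RecordingListener();
        getSnapshots(Map.of("repo-a", "fs"), "repo-b", listener);
        System.out.println("failure delivered to listener: " + (listener.failure != null));
    }
}
```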

tlrx · 2021-05-18T09:00:53Z

...n/java/org/elasticsearch/action/admin/cluster/snapshots/get/TransportGetSnapshotsAction.java

+        }
+    }
+
+    private void getOneSnapshotInfo(boolean ignoreUnavailable,


nit: can we add a bit of Javadoc?

tlrx · 2021-05-18T09:24:08Z

server/src/main/java/org/elasticsearch/threadpool/ThreadPool.java

@@ -189,6 +191,8 @@ public ThreadPool(final Settings settings, final ExecutorBuilder<?>... customBui
        builders.put(Names.REFRESH, new ScalingExecutorBuilder(Names.REFRESH, 1, halfProcMaxAt10, TimeValue.timeValueMinutes(5)));
        builders.put(Names.WARMER, new ScalingExecutorBuilder(Names.WARMER, 1, halfProcMaxAt5, TimeValue.timeValueMinutes(5)));
        builders.put(Names.SNAPSHOT, new ScalingExecutorBuilder(Names.SNAPSHOT, 1, halfProcMaxAt5, TimeValue.timeValueMinutes(5)));
+        builders.put(Names.SNAPSHOT_META, new ScalingExecutorBuilder(Names.SNAPSHOT_META, 1, 2 * allocatedProcessors,


I have the feeling that having a fixed default upper bound would make it easier to reason about the snapshot-related thread pools. Something like `halfProcMaxAt5` maybe, so that we neither exceed the number of CPUs nor exceed the default connection pool size by too much.
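To make the two candidate bounds concrete, here is a standalone sketch comparing them. The diff does not show how `halfProcMaxAt5` is computed, so the rounding and the lower bound of 1 are assumptions. `PoolSizingSketch` and its methods are illustrative names, not the `ThreadPool` helpers themselves.

```java
final class PoolSizingSketch {

    /** Clamps value into the range [min, max]. */
    static int boundedBy(int value, int min, int max) {
        return Math.min(max, Math.max(min, value));
    }

    /** Assumed shape of halfProcMaxAt5: half the processors, rounded up, between 1 and 5. */
    static int halfProcMaxAt5(int allocatedProcessors) {
        return boundedBy((allocatedProcessors + 1) / 2, 1, 5);
    }

    /** The bound used in the diff: 2 * allocatedProcessors, with no fixed cap. */
    static int twiceAllocatedProcessors(int allocatedProcessors) {
        return 2 * allocatedProcessors;
    }

    public static void main(String[] args) {
        for (int procs : new int[] {1, 2, 4, 8, 32}) {
            System.out.printf("procs=%d halfProcMaxAt5=%d twice=%d%n",
                procs, halfProcMaxAt5(procs), twiceAllocatedProcessors(procs));
        }
    }
}
```

Under these assumptions a 32-core node would get at most 5 threads with the capped bound and 64 with the uncapped one, which is the gap the comment above is concerned with.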

tlrx · 2021-05-18T09:24:41Z

server/src/test/java/org/elasticsearch/threadpool/ScalingThreadPoolTests.java

@@ -96,6 +96,7 @@ private int expectedSize(final String threadPoolName, final int numberOfProcesso
        sizes.put(ThreadPool.Names.REFRESH, ThreadPool::halfAllocatedProcessorsMaxTen);
        sizes.put(ThreadPool.Names.WARMER, ThreadPool::halfAllocatedProcessorsMaxFive);
        sizes.put(ThreadPool.Names.SNAPSHOT, ThreadPool::halfAllocatedProcessorsMaxFive);
+        sizes.put(ThreadPool.Names.SNAPSHOT_META, ThreadPool::twiceAllocatedProcessors);


Can we document this thread pool?

DaveCTurner

One minor question inline.

Should we also move TransportSnapshotsStatusAction and/or TransportMountSearchableSnapshotAction and/or restore operations from GENERIC to SNAPSHOT_META too?

DaveCTurner · 2021-05-18T10:50:06Z

...n/java/org/elasticsearch/action/admin/cluster/snapshots/get/TransportGetSnapshotsAction.java

+                                    BlockingQueue<SnapshotId> queue,
+                                    Collection<SnapshotInfo> snapshotInfos,
+                                    CancellableTask task,
+                                    Executor executor,


This is always the same executor, just wondering why not get it from the threadPool each time rather than passing it in.

Right, not sure why I did it this way initially, fixed in 24cf149

original-brownbear · 2021-05-18T11:29:29Z

> Should we also move TransportSnapshotsStatusAction and/or TransportMountSearchableSnapshotAction and/or restore operations from GENERIC to SNAPSHOT_META too?

The snapshot status I was going to move to the pool as well and parallelize on it in a follow-up. I don't have a strong opinion either way when it comes to the mount action. Though maybe it would be nice to get it off the generic pool as well in case of massively concurrent calls actually. I'd also do that in a separate PR though as I just found an edge case bug in its threading anyway (PR incoming today, maybe just discuss it there?).

Update: see #73196 for the mentioned bug

DaveCTurner

LGTM

original-brownbear · 2021-05-18T12:40:30Z

Thanks David and Tanguy!

Backport of the recently introduced snapshot pagination and scalability improvements listed below. Merged as a single backport because the snapshot status API logic had diverged massively between `master` and `7.x`. With the work in the PRs below, the logic in `master` and `7.x` is closely aligned once again. #72842 #73172 #73199 #73570 #73952 #74236 #74451 (this last one is only partly applicable, as it was mainly a change to `master` to align the `master` and `7.x` branches)

original-brownbear added >enhancement, :Distributed Coordination/Snapshot/Restore, v8.0.0, v7.14.0 labels May 17, 2021

elasticmachine added the Team:Distributed (Obsolete) label May 17, 2021

original-brownbear requested review from tlrx and DaveCTurner May 17, 2021 17:45

tlrx approved these changes May 18, 2021

original-brownbear added 4 commits May 18, 2021 11:39

Merge remote-tracking branch 'elastic/master' into snapshot-metadata-read-threadpool

f513d2a

CR: docs

b4ba77e

Merge remote-tracking branch 'elastic/master' into snapshot-metadata-read-threadpool

28eea50

fix repo missing case

fb55daa

DaveCTurner reviewed May 18, 2021

original-brownbear added 2 commits May 18, 2021 13:23

don't pass executor around

24cf149

typos

bde33f6

original-brownbear requested a review from DaveCTurner May 18, 2021 11:29

fix test

9f465ac

original-brownbear mentioned this pull request May 18, 2021

Fix Edge-Case Threading Bug in TransportMountSearchableSnapshotAction #73196

Merged

DaveCTurner approved these changes May 18, 2021

original-brownbear merged commit da24285 into elastic:master May 18, 2021

original-brownbear deleted the snapshot-metadata-read-threadpool branch May 18, 2021 12:40

original-brownbear added the backport pending label May 18, 2021

original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request May 18, 2021

Fix UpdateThreadPoolSettingsTests

2d55c28

Small and obvious oversight from elastic#73172

original-brownbear mentioned this pull request May 18, 2021

Fix UpdateThreadPoolSettingsTests #73199

Merged

original-brownbear added a commit that referenced this pull request May 18, 2021

Fix UpdateThreadPoolSettingsTests (#73199)

06fc62f

Small and obvious oversight from #73172

original-brownbear mentioned this pull request Jun 21, 2021

Improve Snapshot Repository Scalability #74350

Closed

16 tasks

original-brownbear mentioned this pull request Jun 29, 2021

Snapshot Pagination and Scalability Improvements Backport to 7.x #74676

Merged

original-brownbear removed the backport pending label Jun 29, 2021

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021

original-brownbear restored the snapshot-metadata-read-threadpool branch April 18, 2023 20:50
