Prevent excessive thread creation in disk cache garbage collection #24099
@@ -36,7 +36,7 @@ public final class DiskCacheGarbageCollectorIdleTask implements IdleTask {
   private final DiskCacheGarbageCollector gc;

   private static final ExecutorService executorService =
-      Executors.newCachedThreadPool(
+      Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors(),
Since this workload is probably IO-bound, could you try how it performs if you use newThreadPerTaskExecutor instead (so that it uses virtual threads)?
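For readers unfamiliar with the suggestion, here is a minimal sketch of deleting files with one virtual thread per task. It assumes Java 21 or newer; the class name, the helper methods, and the `disk-cache-gc-` thread name prefix are hypothetical and are not taken from the Bazel change.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadFactory;

final class VirtualThreadDeleteSketch {

  /** Creates {@code count} empty files in a fresh temporary directory (demo helper). */
  static List<Path> createTempFiles(int count) {
    try {
      Path dir = Files.createTempDirectory("gc-sketch");
      List<Path> files = new ArrayList<>();
      for (int i = 0; i < count; i++) {
        files.add(Files.createTempFile(dir, "entry", ".tmp"));
      }
      return files;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Deletes each file on its own virtual thread and returns how many were deleted. */
  static int deleteAll(List<Path> files) {
    // Hypothetical thread name prefix; Thread.ofVirtual() requires Java 21 or newer.
    ThreadFactory factory = Thread.ofVirtual().name("disk-cache-gc-", 0).factory();
    ExecutorService executor = Executors.newThreadPerTaskExecutor(factory);
    try {
      List<Future<Boolean>> results = new ArrayList<>();
      for (Path file : files) {
        // One task, and therefore one virtual thread, per file.
        Callable<Boolean> task = () -> Files.deleteIfExists(file);
        results.add(executor.submit(task));
      }
      int deleted = 0;
      for (Future<Boolean> result : results) {
        if (result.get()) {
          deleted++;
        }
      }
      return deleted;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while deleting", e);
    } catch (ExecutionException e) {
      throw new IllegalStateException("deletion failed", e.getCause());
    } finally {
      executor.shutdown();
    }
  }

  public static void main(String[] args) {
    List<Path> files = createTempFiles(100);
    System.out.println("deleted " + deleteAll(files) + " files");
  }
}
```

The number of virtual threads is unbounded here, but they are multiplexed onto a small set of carrier threads, so the number of platform threads stays limited.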
Doesn't the ForkJoinPool used by virtual threads default to the number of cpus? I think I'd prefer something like newFixedThreadPool(min(4, availableProcessors()), ...) so that we still get some amount of parallel I/O on single-cpu systems.
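A small sketch of the bounded-pool idea, under stated assumptions: the class and method names are hypothetical, and the tasks are no-ops standing in for file deletions. Note that `min(4, cpus)` as literally written caps the pool at four threads and yields a single thread on a single-CPU machine, whereas `max(4, cpus)` is the expression that guarantees at least four. Both are shown, because the comment does not make clear which one is intended.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class BoundedPoolSketch {

  /** min(4, cpus): never more than 4 threads, but only 1 on a single-CPU machine. */
  static int cappedSize(int cpus) {
    return Math.min(4, cpus);
  }

  /** max(4, cpus): at least 4 threads, even on a single-CPU machine. */
  static int flooredSize(int cpus) {
    return Math.max(4, cpus);
  }

  /** Runs no-op tasks on a fixed pool and returns how many distinct workers ran them. */
  static int distinctWorkers(int poolSize, int taskCount) {
    ExecutorService pool = Executors.newFixedThreadPool(poolSize);
    Set<String> workerNames = ConcurrentHashMap.newKeySet();
    try {
      List<Future<?>> pending = new ArrayList<>();
      for (int i = 0; i < taskCount; i++) {
        pending.add(pool.submit(() -> {
          workerNames.add(Thread.currentThread().getName());
        }));
      }
      for (Future<?> future : pending) {
        future.get();
      }
      return workerNames.size();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted", e);
    } catch (ExecutionException e) {
      throw new IllegalStateException("task failed", e.getCause());
    } finally {
      pool.shutdown();
    }
  }

  public static void main(String[] args) {
    int cpus = Runtime.getRuntime().availableProcessors();
    System.out.println("min(4, cpus) = " + cappedSize(cpus));
    System.out.println("max(4, cpus) = " + flooredSize(cpus));
    // Unlike a cached pool, a fixed pool never grows past poolSize threads,
    // no matter how many tasks are queued.
    System.out.println("workers used: " + distinctWorkers(flooredSize(cpus), 10_000));
  }
}
```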
Added performance info to CollectionStats: d1a8354
This is the performance on a MacBook Pro M1 Max for the different options:
- newFixedThreadPool(Runtime.getRuntime().availableProcessors(), ...:
241027 16:43:49.855:I 471 [com.google.devtools.build.lib.remote.disk.DiskCacheGarbageCollectorIdleTask.run] Disk cache garbage collection started
241027 16:45:02.959:I 471 [com.google.devtools.build.lib.remote.disk.DiskCacheGarbageCollectorIdleTask.run] Deleted 505571 of 505571 files, reclaimed 19.0 GiB of 19.0 GiB in 13.09 seconds (38626 files/s, 1484 MB/s)
- unbounded virtual threads: Executors.newThreadPerTaskExecutor(Thread.ofVirtual()...:
241027 18:16:24.806:I 472 [com.google.devtools.build.lib.remote.disk.DiskCacheGarbageCollectorIdleTask.run] Disk cache garbage collection started
241027 18:17:38.658:I 472 [com.google.devtools.build.lib.remote.disk.DiskCacheGarbageCollectorIdleTask.run] Deleted 505572 of 505572 files, reclaimed 19.0 GiB of 19.0 GiB in 13.83 seconds (36546 files/s, 1404 MB/s)
And this is the performance on a Linux machine where the thread error was not happening:
- original newCachedThreadPool:
241027 18:10:55.309:I 569 [com.google.devtools.build.lib.remote.disk.DiskCacheGarbageCollectorIdleTask.run] Disk cache garbage collection started
241027 18:10:59.136:I 569 [com.google.devtools.build.lib.remote.disk.DiskCacheGarbageCollectorIdleTask.run] Deleted 505556 of 505556 files, reclaimed 18.9 GiB of 18.9 GiB in 3.81 seconds (132622 files/s, 5090 MB/s)
- newFixedThreadPool(Runtime.getRuntime().availableProcessors(), ...:
241027 19:54:50.313:I 561 [com.google.devtools.build.lib.remote.disk.DiskCacheGarbageCollectorIdleTask.run] Disk cache garbage collection started
241027 19:54:53.935:I 561 [com.google.devtools.build.lib.remote.disk.DiskCacheGarbageCollectorIdleTask.run] Deleted 505555 of 505555 files, reclaimed 18.9 GiB of 18.9 GiB in 3.61 seconds (140121 files/s, 5378 MB/s)
- unbounded virtual threads: Executors.newThreadPerTaskExecutor(Thread.ofVirtual()...:
241027 18:54:57.419:I 560 [com.google.devtools.build.lib.remote.disk.DiskCacheGarbageCollectorIdleTask.run] Disk cache garbage collection started
241027 18:55:01.136:I 560 [com.google.devtools.build.lib.remote.disk.DiskCacheGarbageCollectorIdleTask.run] Deleted 505556 of 505556 files, reclaimed 18.9 GiB of 18.9 GiB in 3.70 seconds (136637 files/s, 5244 MB/s)
Ok to use the Executors.newThreadPerTaskExecutor(Thread.ofVirtual()... option? d9afaa0
I mean, the numbers suggest that virtual threads have slightly worse performance, so considering that this will have to be cherry-picked into 7.4.1, let's be conservative and stick with non-virtual threads for now?
Also, please do use min(4, cpus) as the pool size; in the past I've seen very bad performance on similar filesystem-bound thread pools when running on single or dual core machines.
    } catch (InterruptedException e) {
      logger.atInfo().withCause(e).log("Disk cache garbage collection interrupted");
    } catch (Throwable e) {
Throwable should not be caught, as Errors aren't guaranteed to be recoverable. If this change avoids the OOME, wouldn't that be sufficient?
+1, I don't think we should catch Throwable here.
The catch was not for recovering, but for easier diagnostics. Without it, the log will not show the second line in:
241027 14:46:16.906:I 480 [com.google.devtools.build.lib.remote.disk.DiskCacheGarbageCollectorIdleTask.run] Disk cache garbage collection started
241027 14:47:12.220:WT 480 [com.google.devtools.build.lib.remote.disk.DiskCacheGarbageCollectorIdleTask.run] Disk cache garbage collection failed
if there is some other unexpected error in the future.
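One possible middle ground between the two positions, sketched under assumptions: log every failure for diagnostics, but rethrow `Error`s instead of swallowing them. This is not what the pull request does. The class and method names are hypothetical, and `java.util.logging` stands in for the logger used in the diff.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

final class GcDiagnosticsSketch {
  private static final Logger logger = Logger.getLogger(GcDiagnosticsSketch.class.getName());

  /**
   * Runs the task and reports the outcome. Runtime exceptions are logged and
   * swallowed; Errors are logged for diagnostics and then rethrown, because
   * the JVM may not be in a recoverable state.
   */
  static String runLogged(Runnable gcTask) {
    logger.info("Disk cache garbage collection started");
    try {
      gcTask.run();
      return "completed";
    } catch (RuntimeException e) {
      logger.log(Level.WARNING, "Disk cache garbage collection failed", e);
      return "failed";
    } catch (Error e) {
      // Logging here is best effort: after an OutOfMemoryError it may itself fail.
      logger.log(Level.SEVERE, "Disk cache garbage collection failed", e);
      throw e;
    }
  }

  public static void main(String[] args) {
    System.out.println(runLogged(() -> {}));
    System.out.println(runLogged(() -> {
      throw new IllegalStateException("simulated failure");
    }));
  }
}
```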
Fixes bazelbuild#24098

With this change the disk cache garbage collection works correctly:
```
241027 09:06:34.732:I 681 [com.google.devtools.build.lib.remote.disk.DiskCacheGarbageCollectorIdleTask.run] Disk cache garbage collection started
241027 09:07:06.123:I 681 [com.google.devtools.build.lib.remote.disk.DiskCacheGarbageCollectorIdleTask.run] Deleted 190243 of 446229 files, reclaimed 5.4 GiB of 15.4 GiB
```

Closes bazelbuild#24099.

PiperOrigin-RevId: 690652512
Change-Id: Ie8d1fa6b2afb0bd5bd85fdb6835871023a64ad24
…ction (#24114)

Fixes #24098

With this change the disk cache garbage collection works correctly:
```
241027 09:06:34.732:I 681 [com.google.devtools.build.lib.remote.disk.DiskCacheGarbageCollectorIdleTask.run] Disk cache garbage collection started
241027 09:07:06.123:I 681 [com.google.devtools.build.lib.remote.disk.DiskCacheGarbageCollectorIdleTask.run] Deleted 190243 of 446229 files, reclaimed 5.4 GiB of 15.4 GiB
```

Closes #24099.

PiperOrigin-RevId: 690652512
Change-Id: Ie8d1fa6b2afb0bd5bd85fdb6835871023a64ad24

Commit 3746583

Co-authored-by: Roman Salvador
…ction (#24113)

Fixes #24098

With this change the disk cache garbage collection works correctly:
```
241027 09:06:34.732:I 681 [com.google.devtools.build.lib.remote.disk.DiskCacheGarbageCollectorIdleTask.run] Disk cache garbage collection started
241027 09:07:06.123:I 681 [com.google.devtools.build.lib.remote.disk.DiskCacheGarbageCollectorIdleTask.run] Deleted 190243 of 446229 files, reclaimed 5.4 GiB of 15.4 GiB
```

Closes #24099.

PiperOrigin-RevId: 690652512
Change-Id: Ie8d1fa6b2afb0bd5bd85fdb6835871023a64ad24

Commit 3746583

Co-authored-by: Roman Salvador
Fixes #24098
With this change the disk cache garbage collection works correctly: