Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent excessive thread creation in disk cache garbage collection #24099

Closed

Conversation

rsalvador
Copy link
Contributor

Fixes #24098

With this change the disk cache garbage collection works correctly:

241027 09:06:34.732:I 681 [com.google.devtools.build.lib.remote.disk.DiskCacheGarbageCollectorIdleTask.run] Disk cache garbage collection started
241027 09:07:06.123:I 681 [com.google.devtools.build.lib.remote.disk.DiskCacheGarbageCollectorIdleTask.run] Deleted 190243 of 446229 files, reclaimed 5.4 GiB of 15.4 GiB

@rsalvador rsalvador requested a review from a team as a code owner October 27, 2024 10:28
@github-actions github-actions bot added team-Remote-Exec Issues and PRs for the Execution (Remote) team awaiting-review PR is awaiting review from an assigned reviewer labels Oct 27, 2024
@@ -36,7 +36,7 @@ public final class DiskCacheGarbageCollectorIdleTask implements IdleTask {
private final DiskCacheGarbageCollector gc;

private static final ExecutorService executorService =
Executors.newCachedThreadPool(
Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors(),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this workload is probably IO-bound, could you try how it performs if you use newThreadPerTaskExecutor instead (so that it uses virtual threads)?

@tjgq

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't the ForkJoinPool used by virtual threads default to the number of cpus? I think I'd prefer something like newFixedThreadPool(min(4, availableProcessors()), ...) so that we still get some amount of parallel I/O on single-cpu systems.

Copy link
Contributor Author

@rsalvador rsalvador Oct 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added performance info to CollectionStats: d1a8354

This is the performance on a MacBook Pro M1 Max for the different options:

  1. newFixedThreadPool(Runtime.getRuntime().availableProcessors(), ...:
241027 16:43:49.855:I 471 [com.google.devtools.build.lib.remote.disk.DiskCacheGarbageCollectorIdleTask.run] Disk cache garbage collection started
241027 16:45:02.959:I 471 [com.google.devtools.build.lib.remote.disk.DiskCacheGarbageCollectorIdleTask.run] Deleted 505571 of 505571 files, reclaimed 19.0 GiB of 19.0 GiB in 13.09 seconds (38626 files/s, 1484 MB/s)
  1. unbounded virtual threads: Executors.newThreadPerTaskExecutor(Thread.ofVirtual()...:
241027 18:16:24.806:I 472 [com.google.devtools.build.lib.remote.disk.DiskCacheGarbageCollectorIdleTask.run] Disk cache garbage collection started
241027 18:17:38.658:I 472 [com.google.devtools.build.lib.remote.disk.DiskCacheGarbageCollectorIdleTask.run] Deleted 505572 of 505572 files, reclaimed 19.0 GiB of 19.0 GiB in 13.83 seconds (36546 files/s, 1404 MB/s)

and this the performance on a linux machine where the thread error was not happening:

  1. original newCachedThreadPool:
241027 18:10:55.309:I 569 [com.google.devtools.build.lib.remote.disk.DiskCacheGarbageCollectorIdleTask.run] Disk cache garbage collection started
241027 18:10:59.136:I 569 [com.google.devtools.build.lib.remote.disk.DiskCacheGarbageCollectorIdleTask.run] Deleted 505556 of 505556 files, reclaimed 18.9 GiB of 18.9 GiB in 3.81 seconds (132622 files/s, 5090 MB/s)
  1. newFixedThreadPool(Runtime.getRuntime().availableProcessors(), ...:
241027 19:54:50.313:I 561 [com.google.devtools.build.lib.remote.disk.DiskCacheGarbageCollectorIdleTask.run] Disk cache garbage collection started
241027 19:54:53.935:I 561 [com.google.devtools.build.lib.remote.disk.DiskCacheGarbageCollectorIdleTask.run] Deleted 505555 of 505555 files, reclaimed 18.9 GiB of 18.9 GiB in 3.61 seconds (140121 files/s, 5378 MB/s)
  1. unbounded virtual threads: Executors.newThreadPerTaskExecutor(Thread.ofVirtual()...:
241027 18:54:57.419:I 560 [com.google.devtools.build.lib.remote.disk.DiskCacheGarbageCollectorIdleTask.run] Disk cache garbage collection started
241027 18:55:01.136:I 560 [com.google.devtools.build.lib.remote.disk.DiskCacheGarbageCollectorIdleTask.run] Deleted 505556 of 505556 files, reclaimed 18.9 GiB of 18.9 GiB in 3.70 seconds (136637 files/s, 5244 MB/s)

Ok to use the Executors.newThreadPerTaskExecutor(Thread.ofVirtual()... option?: d9afaa0

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean, the numbers suggest that virtual threads have slightly worse performance, so considering that this will have to be cherry-picked into 7.4.1, let's be conservative and stick with non-virtual threads for now?

Also, please do use min(4, cpus) as the pool size; in the past I've seen very bad performance on similar filesystem-bound thread pools when running on single or dual core machines.

} catch (InterruptedException e) {
logger.atInfo().withCause(e).log("Disk cache garbage collection interrupted");
} catch (Throwable e) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Throwable should not be caught as Errors aren't guaranteed to be recoverable. If this change avoids the OOME, wouldn't that be sufficient?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, I don't think we should catch Throwable here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The catch was not for recovering, but for easier diagnostics. Without it will not show the second line in:

241027 14:46:16.906:I 480 [com.google.devtools.build.lib.remote.disk.DiskCacheGarbageCollectorIdleTask.run] Disk cache garbage collection started
241027 14:47:12.220:WT 480 [com.google.devtools.build.lib.remote.disk.DiskCacheGarbageCollectorIdleTask.run] Disk cache garbage collection failed

if there is some other unexpected error in the future.

@@ -36,7 +36,7 @@ public final class DiskCacheGarbageCollectorIdleTask implements IdleTask {
private final DiskCacheGarbageCollector gc;

private static final ExecutorService executorService =
Executors.newCachedThreadPool(
Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't the ForkJoinPool used by virtual threads default to the number of cpus? I think I'd prefer something like newFixedThreadPool(min(4, availableProcessors()), ...) so that we still get some amount of parallel I/O on single-cpu systems.

} catch (InterruptedException e) {
logger.atInfo().withCause(e).log("Disk cache garbage collection interrupted");
} catch (Throwable e) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, I don't think we should catch Throwable here.

@rsalvador rsalvador requested review from tjgq and fmeum October 27, 2024 20:07
@tjgq tjgq added awaiting-PR-merge PR has been approved by a reviewer and is ready to be merge internally and removed awaiting-review PR is awaiting review from an assigned reviewer labels Oct 28, 2024
@github-actions github-actions bot removed the awaiting-PR-merge PR has been approved by a reviewer and is ready to be merge internally label Oct 28, 2024
bazel-io pushed a commit to bazel-io/bazel that referenced this pull request Oct 28, 2024
Fixes bazelbuild#24098

With this change the disk cache garbage collection works correctly:
```
241027 09:06:34.732:I 681 [com.google.devtools.build.lib.remote.disk.DiskCacheGarbageCollectorIdleTask.run] Disk cache garbage collection started
241027 09:07:06.123:I 681 [com.google.devtools.build.lib.remote.disk.DiskCacheGarbageCollectorIdleTask.run] Deleted 190243 of 446229 files, reclaimed 5.4 GiB of 15.4 GiB
```

Closes bazelbuild#24099.

PiperOrigin-RevId: 690652512
Change-Id: Ie8d1fa6b2afb0bd5bd85fdb6835871023a64ad24
bazel-io pushed a commit to bazel-io/bazel that referenced this pull request Oct 28, 2024
Fixes bazelbuild#24098

With this change the disk cache garbage collection works correctly:
```
241027 09:06:34.732:I 681 [com.google.devtools.build.lib.remote.disk.DiskCacheGarbageCollectorIdleTask.run] Disk cache garbage collection started
241027 09:07:06.123:I 681 [com.google.devtools.build.lib.remote.disk.DiskCacheGarbageCollectorIdleTask.run] Deleted 190243 of 446229 files, reclaimed 5.4 GiB of 15.4 GiB
```

Closes bazelbuild#24099.

PiperOrigin-RevId: 690652512
Change-Id: Ie8d1fa6b2afb0bd5bd85fdb6835871023a64ad24
iancha1992 pushed a commit that referenced this pull request Oct 28, 2024
…ction (#24114)

Fixes #24098

With this change the disk cache garbage collection works correctly:
```
241027 09:06:34.732:I 681 [com.google.devtools.build.lib.remote.disk.DiskCacheGarbageCollectorIdleTask.run] Disk cache garbage collection started
241027 09:07:06.123:I 681 [com.google.devtools.build.lib.remote.disk.DiskCacheGarbageCollectorIdleTask.run] Deleted 190243 of 446229 files, reclaimed 5.4 GiB of 15.4 GiB
```

Closes #24099.

PiperOrigin-RevId: 690652512
Change-Id: Ie8d1fa6b2afb0bd5bd85fdb6835871023a64ad24

Commit
3746583

Co-authored-by: Roman Salvador <[email protected]>
github-merge-queue bot pushed a commit that referenced this pull request Oct 28, 2024
…ction (#24113)

Fixes #24098

With this change the disk cache garbage collection works correctly:
```
241027 09:06:34.732:I 681 [com.google.devtools.build.lib.remote.disk.DiskCacheGarbageCollectorIdleTask.run] Disk cache garbage collection started
241027 09:07:06.123:I 681 [com.google.devtools.build.lib.remote.disk.DiskCacheGarbageCollectorIdleTask.run] Deleted 190243 of 446229 files, reclaimed 5.4 GiB of 15.4 GiB
```

Closes #24099.

PiperOrigin-RevId: 690652512
Change-Id: Ie8d1fa6b2afb0bd5bd85fdb6835871023a64ad24

Commit
3746583

Co-authored-by: Roman Salvador <[email protected]>
ramil-bitrise pushed a commit to bitrise-io/bazel that referenced this pull request Dec 18, 2024
Fixes bazelbuild#24098

With this change the disk cache garbage collection works correctly:
```
241027 09:06:34.732:I 681 [com.google.devtools.build.lib.remote.disk.DiskCacheGarbageCollectorIdleTask.run] Disk cache garbage collection started
241027 09:07:06.123:I 681 [com.google.devtools.build.lib.remote.disk.DiskCacheGarbageCollectorIdleTask.run] Deleted 190243 of 446229 files, reclaimed 5.4 GiB of 15.4 GiB
```

Closes bazelbuild#24099.

PiperOrigin-RevId: 690652512
Change-Id: Ie8d1fa6b2afb0bd5bd85fdb6835871023a64ad24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
team-Remote-Exec Issues and PRs for the Execution (Remote) team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Disk cache garbage collection fails due to thread creation error
3 participants