Asynchronous sandbox cleanup leads to inode exhaustion with defaulted --experimental_sandbox_async_tree_delete_idle_threads=4 #20965
Comments
Cc @oquenchil
We experienced the same thing with JS tests with big runfiles trees and a /dev/shm sandbox base.
Experiencing the same issue.
@bazel-io flag
@sgowroji This sounds more like an issue for Team-Local-Exec, as it isn't language-specific but is related to sandboxing in general. I mentioned @oquenchil since he was involved in the flag "flip".
@bazel-io fork 7.1.0
@iancha1992 This may need to be picked into a 7.0.2 depending on assessed severity.
How many inodes are we talking about? I can't tell from the error which number refers to inodes. Depending on how widespread this is, we have two options: if few users are affected, they can pass `--experimental_sandbox_async_tree_delete_idle_threads=0` themselves. The long-term solution would be to keep track of how many inodes are free and block before planting any more symlinks. I won't rush that solution for 7.0.2 though.
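As a rough illustration of that long-term idea (not Bazel's actual implementation, which would live in the Java sandboxing code), a shell-level sketch; the sandbox base path and threshold are assumptions for illustration only:

```sh
SANDBOX_BASE=/dev/shm      # assumption: the tmpfs sandbox base from this report
MIN_FREE_INODES=1000000    # arbitrary threshold, for illustration only

# Free inodes on the filesystem holding the sandbox base.
free_inodes() { df --output=iavail "$SANDBOX_BASE" | tail -n 1 | tr -d ' '; }

# Block before planting more symlinks while the async deleter catches up.
while [ "$(free_inodes)" -lt "$MIN_FREE_INODES" ]; do
  sleep 1
done
# ... create the sandbox symlink forest here ...
```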
The number of inodes on /dev/shm is 95M. Before removing the sandbox directory, there were 1700 free (it's not 0 because Bazel cleans some up before I check the number). After manually removing the sandbox directory, there are 95M free.
It takes around 65 seconds to run out of inodes when the build is running. Once the build fails, if I just leave Bazel alone to clean up the sandbox itself, it takes around 554 seconds to clean up the sandbox with `--experimental_sandbox_async_tree_delete_idle_threads` not mentioned in bazelrc (so defaulted to 4 at version 7.0.1).
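For anyone trying to reproduce these measurements, a minimal sketch; the /dev/shm path comes from this report, while the exact name of the directory Bazel creates under it is an assumption to verify locally:

```sh
# Watch free inodes (the IFree column) on the tmpfs sandbox base while the build runs:
watch -n 5 'df -i /dev/shm'

# Manual cleanup after a failed build; check the actual directory name Bazel created first:
# rm -rf /dev/shm/bazel-sandbox.*
```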
In general it seems off to me that we generate inodes based on jobs (typically CPU count) but don't scale cleanup based on CPUs. So the more powerful your machine, the easier it is to have creation outpace deletion, right?
Agreed. With tmpfs, there would be a CPU bottleneck from in-memory operations. So the rate of symlink creation during the build will scale up to the CPU and kernel limits, but the rate of symlink deletion during the build is capped at a single thread: according to the documentation, the asynchronous deletion pool only grows beyond one thread while the server is idle.
The reason you give sounds logical. A single deletion thread cannot keep up. Something else I was wondering: do you have `reuse_sandbox_directories` enabled? I still don't have a sense of how much of an edge case your setup is, to evaluate whether the flag should be unflipped immediately or affected users should just turn it off themselves. I can start working on a proper fix today, but I don't want this to be blocking the release. Would this affect most JS users?
@oquenchil The 7.0.2 release has already happened, so there is no risk of blocking it. :-)
I do not have `reuse_sandbox_directories` enabled.
We also have the reuse flag enabled. (As an aside, I am a little confused by the sandbox reuse. It works per mnemonic, but all tests share one mnemonic, so it seems easy to ping-pong a sandbox between different runfiles trees, potentially with little overlap. I'd imagine it works better for build actions, where you have different mnemonics and a large toolchain gets consistently reused for every action with a given mnemonic.)
Coincidentally, I realized this yesterday evening and I'm implementing a fix now. There is another issue in the current code: the second parameter here should be the sandbox base. Otherwise, in cases like yours where you pass a different filesystem as the sandbox base, the stashing becomes copying. Once I get these two changes in, execution should be much faster, in addition to fixing the inode exhaustion problem.
Any update? Should this still be a release blocker for 7.1?
Almost done. Is this the last blocker?
Nope, there are still a number of them, I'm just doing a round of review.
Because the sandbox_base can be set via the configuration flag `--sandbox_base`, it might be that the stashes end up in a different filesystem than the actual sandbox, which makes `reuse_sandbox_directories` inefficient.

Apart from making sure that the stash is under `sandboxBase`, this change refactors the logic about which directories to delete from `sandboxBase` and when. The stash will persist between incremental builds but after a bazel server restart it will be deleted.

This change also refactors the tests since we had three different tests which weren't testing anything different. The change also includes more profiling spans related to reuse_sandbox_directories.

Works towards fixing #20965.

PiperOrigin-RevId: 607023622
Change-Id: I59f9d2066e966dd60dd11f7d82815d0c89f2ce23
A fix for this issue has been included in Bazel 7.1.0 RC1. Please test out the release candidate and report any issues as soon as possible. Thanks!
I tested a full rebuild of my project with Bazel 7.1.0 RC1. The result is that the build is actually able to complete now without specifying `--experimental_sandbox_async_tree_delete_idle_threads=0`.
Glad to hear it, Jeremiah. Could you please comment on whether the build/tests got any faster? With the reuse changes it should have gotten much faster, so it'd be disappointing to hear it didn't.
You can build/test with profiling enabled and then get the time spent from the resulting profile.
At least there I'd expect to see an improvement. More info about profiling: https://bazel.build/advanced/performance/json-trace-profile
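A sketch of one way to do that measurement with Bazel's standard profiling support; the profile path and target pattern are placeholders:

```sh
# Write a JSON trace profile while building/testing:
bazel test //... --profile=/tmp/prof.gz

# Print a summary of where the time went:
bazel analyze-profile /tmp/prof.gz

# For per-span detail (e.g. the sandbox reuse spans added in the fix),
# load /tmp/prof.gz into https://ui.perfetto.dev or chrome://tracing.
```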
I ran a few full rebuilds of my project:
I'm not sure what units Bazel's profiling uses, but I pasted the output at your request. As far as I can tell, it shaved a few minutes off of my build. That's pretty good, considering I have a 1167-second critical path consisting of a clang compiler build + emscripten cache build. So I'm sure it helps incremental builds even more. Nice work!
Thank you, Jeremiah. That's still quite a bit of overhead, but we'll add more improvements.
Description of the bug:
It seems that Bazel does not clean up the sandbox directories fast enough, which eventually leads to running out of inodes when performing a build.
Which category does this issue belong to?
Local Execution
What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.
I'm not sure what the simplest way to reproduce this bug is. It might be because I am running a 96-core machine with a /dev/shm (tmpfs) sandbox base and lots of symlinks from C++ compilation. The documentation hints that deletion threads are limited to 1 during the build, which is probably the problem.
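A hedged sketch of this kind of setup; the target pattern is a placeholder rather than the reporter's actual project, and --jobs normally just defaults to the core count:

```sh
# .bazelrc: tmpfs sandbox base, with the Bazel 7.0.1 default of
# --experimental_sandbox_async_tree_delete_idle_threads=4 left in place.
#   build --sandbox_base=/dev/shm

# A large C++ build on a 96-core machine plants symlink forests far faster
# than the single in-build deletion thread can remove them:
bazel build //... --jobs=96
```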
Which operating system are you running Bazel on?
Linux 6.7.0-arch3-1 #1 SMP PREEMPT_DYNAMIC Sat, 13 Jan 2024 14:37:14 +0000 x86_64 GNU/Linux
What is the output of `bazel info release`?
release 7.0.1
If `bazel info release` returns `development version` or `(@non-git)`, tell us how you built Bazel.
No response
What's the output of `git remote get-url origin; git rev-parse master; git rev-parse HEAD`?
No response
Is this a regression? If yes, please try to identify the Bazel commit where the bug was introduced.
Probably d169556, which changed the default of experimental_sandbox_async_tree_delete_idle_threads to 4. I did not actually run a bisection, but I will note that 7.0.0-pre.20231011.2 did not have this problem and 7.0.1 does.
Have you found anything relevant by searching the web?
#19266 (comment)
Any other information, logs, or outputs that you want to share?
It fails with an error message such as:
Inodes after failure:
Inodes after manually removing sandbox directory:
I am using `--sandbox_base=/dev/shm`, which is mounted as tmpfs.
After the build fails, the sandbox directory is slowly cleaned up, unless Bazel is shut down, which is what originally alerted me to the cause.
It seems that this was probably caused by d169556, which enabled asynchronous sandbox cleanup by changing the default of experimental_sandbox_async_tree_delete_idle_threads to 4.
I confirmed that adding --experimental_sandbox_async_tree_delete_idle_threads=0 fixes the issue for me.
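For reference, a sketch of that workaround as `.bazelrc` lines; whether you attach them to `build` or another command prefix depends on your setup:

```
# .bazelrc
# Revert to synchronous sandbox tree deletion (the behavior before d169556):
build --experimental_sandbox_async_tree_delete_idle_threads=0

# The tmpfs sandbox base used in this report, for context:
build --sandbox_base=/dev/shm
```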