Checking `--disk_cache` cache is slow for huge tree artifacts #17804
Comments
It might be worth experimenting with
We'll give that a try and see what performance is like. Thanks for pointing out that flag!
Why?
Maybe we can! When I wrote that comment, I was under the impression that Loom becoming available in Q2 is not very probable, but from what I heard from you today, it may be available in time? It depends on when in Q2 Loom will be available. @tbaing said the milestone that cannot be done if this step is slow is expected sometime mid-Q3, which means that (building in some safety margin) this should be fixed by ~end of Q2. Is that not too early to depend on Loom? (I'd love it if it wasn't, but not depending on Loom is the safe choice)
My line of reasoning is that I would rather not introduce another threadpool into Bazel, which comes with its own complications, when we have the proper solution on the horizon.
(I first misread the word as "dreadpool", which is Freudian, I guess) How likely do you think it is that Loom will actually be available in Q2? My mental assessment is that the chances are about even, but if you think there is a good chance that it will land, I think it's fine to put down Loom as a dependency; this would actually be a good way to introduce a low-risk but useful use case for Loom to get our toes wet. Thinking again, even if Loom doesn't work out, adding a threadpool doesn't seem to be a particularly invasive change.
I would re-evaluate this issue ~early May and see where we are. |
@tbaing is the pain tolerable until then? |
I reproduced this yesterday. A threadpool or Loom won't help because, from the profile, there are too many concurrent Merkle tree computations already. I would suggest trying Bazel 6.1 because it has #17120, which limits the number of concurrent Merkle tree computations.
Interesting! Does this mean that the Merkle tree computation on large tree artifacts is in fact parallelized already? From the code of
MerkleTree computation for one spawn is done sequentially, but for this build (and usually for other builds as well), we have many actions running concurrently, each constructing its own Merkle tree.
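The throttling approach referenced above (#17120) can be sketched as a semaphore that bounds how many Merkle tree computations run at once. This is a minimal illustration, not Bazel's actual code: `MerkleTreeThrottle` and its API are hypothetical names.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.Semaphore;

// Hypothetical sketch: cap the number of Merkle tree computations that may
// run at once, so that many concurrently executing actions do not all hash
// their inputs at the same time and peg every core.
final class MerkleTreeThrottle {
  private final Semaphore permits;

  MerkleTreeThrottle(int maxConcurrent) {
    this.permits = new Semaphore(maxConcurrent);
  }

  // Runs the computation, blocking while maxConcurrent others are in flight.
  <T> T run(Callable<T> computation) throws Exception {
    permits.acquire();
    try {
      return computation.call();
    } finally {
      permits.release();
    }
  }
}
```

With a bound like this, total CPU use stays controlled even when hundreds of actions become runnable at once after a BUILD file change.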
Ah, I see. I thought that was cached already, but apparently not without
We have a user who ran into this issue. After upgrading from Bazel 6.4 to 7.0, actions with large input trees (30k files) would spend a lot of time on
The user does have
I also noticed a set of related changes: #17120 and #20558; not sure if those would help at all.
Note: because this is not an obvious failure, it's also not trivial for the user to leverage Bazelisk bisect to narrow down the culprit commit between the two releases. Having a manual bisect mode (like git-bisect) could be helpful in these cases.
Duplicate of #21379.
Description of the bug:
When an action does not hit the local action cache and it has an input that is a huge (hundreds of thousands of files and gigabytes of data) tree artifact, it takes a lot of time to check whether the action is in `--disk_cache`. CPU use is pegged at 100% in these cases. The CPU time seems to be spent in `MerkleTree.build(ActionInput)`. Its implementation seems to be single-threaded, so it's not a big surprise that it's slow. The report I heard said that this happens when a BUILD file is changed, which is consistent with losing the in-memory metadata cache in Skyframe, making it necessary to reconstruct the metadata from the data on disk.
I can imagine further clever optimizations, but doing the Merkle tree computation in parallel would be the first step. It doesn't look too complicated, although managing another thread pool would be annoying and we can't quite wait until Loom comes online.
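The parallelization suggested here can be sketched with fork/join: each directory hashes its children as independent subtasks and combines the child digests in a fixed order. This is a hedged illustration only; `TreeNode` and `ParallelDigest` are hypothetical names, and Bazel's real `MerkleTree.build(ActionInput)` has a different shape.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Hypothetical input tree: a node is either a file (content != null)
// or a directory (children != null).
final class TreeNode {
  final byte[] content;
  final List<TreeNode> children;

  TreeNode(byte[] content, List<TreeNode> children) {
    this.content = content;
    this.children = children;
  }
}

// Sketch of a parallel bottom-up Merkle digest using the common
// ForkJoinPool instead of a dedicated thread pool.
final class ParallelDigest extends RecursiveTask<byte[]> {
  private final TreeNode node;

  ParallelDigest(TreeNode node) {
    this.node = node;
  }

  @Override
  protected byte[] compute() {
    try {
      MessageDigest md = MessageDigest.getInstance("SHA-256");
      if (node.content != null) {
        return md.digest(node.content); // leaf: hash the file content
      }
      // Directory: fork one subtask per child, then fold the child
      // digests in a fixed order so the result is deterministic.
      List<ParallelDigest> tasks = new ArrayList<>();
      for (TreeNode child : node.children) {
        ParallelDigest task = new ParallelDigest(child);
        task.fork();
        tasks.add(task);
      }
      for (ParallelDigest task : tasks) {
        md.update(task.join());
      }
      return md.digest();
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException(e); // SHA-256 is always present
    }
  }

  static byte[] digest(TreeNode root) {
    return ForkJoinPool.commonPool().invoke(new ParallelDigest(root));
  }
}
```

Using `ForkJoinPool.commonPool()` sidesteps the "managing another thread pool" concern at the cost of sharing the pool with other fork/join work; a Loom-based version could later replace the fork/join tasks with virtual threads.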
What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.
(reproduction is internal to Google, cannot copy-paste it here)
Which operating system are you running Bazel on?
Linux
What is the output of `bazel info release`?
unknown
If `bazel info release` returns `development version` or `(@non-git)`, tell us how you built Bazel.
No response
What's the output of `git remote get-url origin; git rev-parse master; git rev-parse HEAD`?
No response
Have you found anything relevant by searching the web?
No response
Any other information, logs, or outputs that you want to share?
No response