Performance regression from 6.4.0 -> 7 #21828
Comments
Understood, I will try that. From that discussion, should I potentially disable the Merkle tree cache altogether? Also related: I am still observing memory leaks and OOMs on 6.4.0, as tracked in #16913. I have been unable to get stable builds on 7.x to determine whether the memory leak is resolved, due to the Merkle tree caching changes.
In my experience, the cache is only helpful if your build is dominated by large tree artifacts (i.e., …).
We are very mixed; some areas of the codebase have small files and trees (golang, python, java) and others have large directories from node_modules, managed by rules_js. I'll try with and without the cache to see if there is a benefit. How much does `--experimental_remote_merkle_tree_cache_size` affect heap usage? Is there a fixed memory cost per cache entry? I'm cautious about bumping it too high and causing OOMs.
Unfortunately, there's no straightforward connection between the flag value and the amount of heap consumed; each cache entry corresponds to a depset node, and has size proportional to the number of direct elements of that node (roughly …).
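A minimal sketch of how one might raise the cache size while also giving the Bazel server JVM more headroom; the `8g` heap, the `10000` entry count, and the `//...` target pattern are illustrative placeholders, not recommendations:

```sh
# Startup options go before the command; changing them restarts the Bazel server.
# 8g and 10000 are illustrative values only.
bazel --host_jvm_args=-Xmx8g build //... \
  --experimental_remote_merkle_tree_cache \
  --experimental_remote_merkle_tree_cache_size=10000

# One way to gauge the effect on the server heap after a build.
bazel info used-heap-size
```

Since entry size depends on depset fan-out rather than on the flag value alone, watching the heap after a few representative builds is likely more informative than any fixed rule of thumb.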
I can confirm that …
Yes, I believe an unintentional regression was introduced in the lead-up to 7.x, causing a catastrophic slowdown when the cache size is too small to fit the largest depset in the build (in 6.x, a cache too small might have been slower, but not catastrophically so). As long as the size is large enough, I haven't seen evidence that 7.x is slower than 6.x. Unfortunately, it's difficult to revert the culprit (it's actually a combination of two different changes) because some later work has come to depend on it, and per the discussion in #21378, we'd rather spend time rearchitecting the Merkle tree cache than addressing performance issues with the current implementation.
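For reference, the two mitigations implied above look roughly like this on the command line; the size value and the `//...` target pattern are again illustrative:

```sh
# Option 1: run with the Merkle tree cache disabled.
bazel build //... --noexperimental_remote_merkle_tree_cache

# Option 2: keep the cache, but size it large enough to hold the largest
# depset in the build, avoiding the pathological slowdown on 7.x.
bazel build //... \
  --experimental_remote_merkle_tree_cache \
  --experimental_remote_merkle_tree_cache_size=10000
```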
This makes complete sense, and now I understand why 6.x was sufficient, even if less optimal than it could be. Thank you for explaining. I will run some experiments on my end on 6.x and 7.x to evaluate the effectiveness of …
@joeljeske I'm curious whether you've had a chance to test build performance with and without the Merkle tree cache. Knowing whether it helps in your case would be valuable input into our future plans.
Hey @tjgq, thanks for reaching out. I have done some spot checks with the following flow: after a `bazel clean`, I perform a build with remote execution where all remote actions are expected to be cached. Do you know if this flow is sufficient to evaluate the effect of flipping `--experimental_remote_merkle_tree_cache`, or do I need to ensure actions are not cached remotely?
It's fine if the actions are cached remotely, since in order to check for a cache hit we still need to construct the Merkle trees.
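A rough sketch of that comparison flow, assuming a remote executor endpoint and `--profile` output to compare the two runs; the endpoint and the `//...` target pattern are placeholders:

```sh
# Run 1: Merkle tree cache enabled.
bazel clean
bazel build //... \
  --remote_executor=grpcs://remote.example.com \
  --experimental_remote_merkle_tree_cache \
  --profile=/tmp/profile-cache-on.gz

# Run 2: Merkle tree cache disabled.
bazel clean
bazel build //... \
  --remote_executor=grpcs://remote.example.com \
  --noexperimental_remote_merkle_tree_cache \
  --profile=/tmp/profile-cache-off.gz

# Compare the two runs, e.g. with analyze-profile or by loading the
# profiles into chrome://tracing.
bazel analyze-profile /tmp/profile-cache-on.gz
bazel analyze-profile /tmp/profile-cache-off.gz
```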
Excellent. So far, I have not observed any significant performance regression when running with this flag flipped off. I will try flipping it for a portion of my fleet and watch for any regression. With this off, should I be setting …?
My performance issues in 7.x are resolved by flipping on …. I will close this now. I eagerly await any future optimizations you may make to the Merkle tree calculation process (#21378) 😄
Description of the bug:
In bumping Bazel to 7.x from 6.4.0, I've encountered significant performance regressions during my build. At the start of the build, Bazel appears to progress quickly and consumes all available `--jobs` with remote actions. Within ~45 min, Bazel is only able to execute 0 or 1 action remotely, and the rest are stuck in [Prepa] for a long time. I do not observe this behavior on 6.4.0, but I can consistently reproduce it on 7.0.1 and 7.1.1.
Which category does this issue belong to?
Remote Execution
What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.
No response
Which operating system are you running Bazel on?
Ubuntu 20.04
What is the output of `bazel info release`?
release 7.1.1
If `bazel info release` returns `development version` or `(@non-git)`, tell us how you built Bazel.
No response
What's the output of `git remote get-url origin; git rev-parse HEAD`?
No response
Is this a regression? If yes, please try to identify the Bazel commit where the bug was introduced.
No response
Have you found anything relevant by searching the web?
Potentially related to #21626
Any other information, logs, or outputs that you want to share?
Some relevant flags
Thread state: threads.txt
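Not necessarily how the attached dump was produced, but one common way to capture the Bazel server's thread state while actions appear stuck is to dump the server JVM with `jstack`:

```sh
# Dump the Java threads of the running Bazel server.
jstack "$(bazel info server_pid)" > threads.txt
```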