Make Bazel more responsive and use less memory when --jobs is high #17120
Conversation
Force-pushed from 8ff3784 to d75826b (Compare)
FYI, a workaround for 2 is to use the Merkle cache. We've seen it reduce the memory needed from over 64 GB down to 17 GB.
Thanks for bringing this up! We are exploring using virtual threads to eliminate this flag. For 2) and 3), we are saving memory for common cases but sacrificing performance for rare cases, which sounds reasonable but probably needs more data, e.g. what's the build time regression for a cold build?
I also tried this option, but it didn't yield significant improvements for us (both in terms of memory usage and CPU time). Great to hear it works for you, though!
👍
I wasn't able to measure any regression in wall time, because in the case of a cold build the build time is mostly dominated by execution taking place on the remote side. So it's wasting more CPU cycles on the client, but only in cases where it would otherwise have been idling. Let me see what I can do to address the remaining test failures.
Just so that point isn't forgotten: It might be interesting to collect measurements for dynamic execution (local + remote) as that will have the client do actual work in addition to orchestrating the remote executor.
I would really appreciate it if anyone using dynamic execution is willing to test this. We never use it on our end, because remote execution is inherently faster for us (due to us using bb_clientd). I thus have no frame of reference. Sorry!
Force-pushed from d75826b to 537f650 (Compare)
I hate flags, but to be on the safe side, can we add an experimental flag for 2 and 3 and disable it by default? It also lets others have a chance to benchmark the impact.
There is also internal code relying on 3; making it conditional will help when importing.
src/test/java/com/google/devtools/build/lib/remote/RemoteExecutionServiceTest.java
Force-pushed from 537f650 to 8cba3d8 (Compare)
When using Bazel in combination with a larger remote execution cluster, it's not uncommon to call it with something like --jobs=512. We have observed that this is currently problematic for a couple of reasons:

1. It causes Bazel to launch 512 local threads, each being responsible for running one action remotely. All of these local threads may spend a lot of time in buildRemoteAction(), generating input roots in the form of Merkle trees. As the local system tends to have fewer than 512 CPUs, all of these threads will unnecessarily compete with each other. One practical downside of that is that interrupting Bazel using ^C takes a very long time, as it first wants to complete the computation of all 512 Merkle trees. Let's put a semaphore in place, limiting the number of concurrent Merkle tree computations to the number of CPU cores available (a sketch of this gating follows the description).

2. Related to the above, Bazel will end up keeping 512 Merkle trees in memory throughout all stages of execution. This makes sense, as we may get cache misses, requiring us to upload the input root afterwards. Or the execution of a remote action may fail, requiring us to upload the input root. That said, generally speaking these cases are fairly uncommon. Most builds have relatively high cache hit rates, and execution retries only happen rarely. It is therefore not worth keeping these Merkle trees in memory constantly. We only need them when computing the action digest for GetActionResult(), and while uploading the input root into the CAS.

3. AbstractSpawnStrategy.getInputMapping() has some smartness to memoize its results. This makes a lot of sense for local execution, where the input mapping is used in a couple of places. For remote caching/execution it is not evident that this is a good idea: assuming you end up having a remote cache hit, you don't need it. Let's make the memoization optional, only using it in cases where we do local execution (which may also happen when you get a cache miss when doing remote caching).

Similar changes against Bazel 5.x have allowed me to successfully do builds of a large monorepo using --jobs=512 with the default heap size limits, whereas I would previously see occasional OOM behaviour even when providing --host_jvm_args=-Xmx64g.
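To make point 1 concrete, here is a minimal sketch of that gating, assuming hypothetical names (MerkleTreeThrottle, computeMerkleTree); it is not the PR's actual code, just an illustration of an interruptible, CPU-count-sized semaphore:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.Semaphore;

// Illustrative sketch only; class and method names are hypothetical.
final class MerkleTreeThrottle {
  // One permit per local CPU core, as proposed above.
  private final Semaphore gate =
      new Semaphore(Runtime.getRuntime().availableProcessors());

  <T> T withThrottle(Callable<T> computeMerkleTree) throws Exception {
    // acquire() (unlike acquireUninterruptibly()) throws InterruptedException,
    // so threads still waiting for a permit abort promptly on ^C.
    gate.acquire();
    try {
      return computeMerkleTree.call();
    } finally {
      gate.release();
    }
  }
}
```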
Force-pushed from 8cba3d8 to a66590d (Compare)
Sure! I have just added an experimental flag for this.
Thanks! I am importing the PR and making changes to internal code. If something goes wrong internally, I will report back.
Interestingly, I made this related change yesterday: 3d29b2e. Side note: IMHO it would have been better to split this PR into three separate PRs (no action required).
Could this be cherry-picked to Bazel 6.1.0?
@bazel-io fork 6.1.0
Sure. I believe this is a safe change, and it brings lots of value to remote execution. I would also like folks to be able to play around with the new flag sooner and maybe report back their experience/benchmarks here.
Closes bazelbuild#17120.

PiperOrigin-RevId: 500990181
Change-Id: I6d1ba03470b79424ce2e1c2e83abd8fa779dd268
…17398)

Closes #17120.

PiperOrigin-RevId: 500990181
Change-Id: I6d1ba03470b79424ce2e1c2e83abd8fa779dd268

Co-authored-by: Ed Schouten <[email protected]>
Co-authored-by: kshyanashree <[email protected]>
Closes #17120.

PiperOrigin-RevId: 500990181
Change-Id: I6d1ba03470b79424ce2e1c2e83abd8fa779dd268
Similar to #19924 (comment), I filed #20478.
merkleTree = buildInputMerkleTree(spawn, context, toolSignature);
}

// Note: still inside the semaphore-guarded region; this performs remote
// calls to upload missing inputs to the CAS.
remoteExecutionCache.ensureInputsPresent(
This looks wrong because this makes a remote call, and this is now limited to N parallel remote calls. This should have been outside the semaphore.
My understanding was that this limits the number of Merkle trees in flight, preventing OOMs.
@EdSchouten was this the intention?
I'm specifically talking about the ensureInputsPresent call, which performs remote calls and is now limited.
Exactly, it has to be limited to make sure we don't OOM because memory is filled up with Merkle trees. Yes, this does mean that concurrency of outgoing network calls is limited as well.
One way to make this less problematic would be to implement buildInputMerkleTree() in such a way that it streams the directories it creates, so that FindMissingBlobs() can be called in batches (a rough sketch follows below). But because buildInputMerkleTree() is not built like that, we currently don't have a lot of options here.
I guess I never observed this to be problematic for performance, because Buildbarn's FindMissingBlobs() implementation is sufficiently fast. It typically completes in ~2 milliseconds, even if 50k digests are provided.
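For what it's worth, a rough sketch of that batching idea under stated assumptions: the Digest record and CasClient interface below are stand-ins for the REAPI types, not Bazel's actual classes.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Stand-ins for the REAPI types; illustrative only.
record Digest(String hash, long sizeBytes) {}

interface CasClient {
  Set<Digest> findMissingBlobs(List<Digest> batch) throws Exception;
}

final class BatchedFindMissing {
  /** Queries the CAS in fixed-size batches instead of one huge call. */
  static Set<Digest> findMissingInBatches(
      CasClient cas, List<Digest> digests, int batchSize) throws Exception {
    Set<Digest> missing = new HashSet<>();
    for (int i = 0; i < digests.size(); i += batchSize) {
      List<Digest> batch =
          digests.subList(i, Math.min(i + batchSize, digests.size()));
      missing.addAll(cas.findMissingBlobs(batch));
    }
    return missing;
  }
}
```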
Limiting this by the # of cores is nonsensical when concerned with memory exhaustion - I might as well limit by the number 42 - you're no more or less likely to create memory pressure with this based on the number of cores on a host.
We should limit by a heap pool size and charge by the weight of the Merkle tree; if it turns out that portions of that tree are shared across Merkle trees (not sure if that optimization exists), then it should be the net weight of all unique Merkle tree components (sketch below).
Measuring this in milliseconds means that you're blocked on it - for a remote that responds in 1 ms, the throughput cap is N * 1000, decided arbitrarily. Utilizing the network also means that there's a potential for async resource exhaustion there that will impact the runtime (but has nothing to do with remediating memory). The case where this was discovered was pathologically small input trees with --jobs=10000 and sub-second execution times, where bandwidth consumption, queued request processing on the channel, and the like could drastically increase the time spent in the critical section, through no fault of any remote service.
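A sketch of what charging against a byte budget could look like, using a plain java.util.concurrent.Semaphore whose permits represent KiB of estimated Merkle tree weight; the 1 GiB budget and all names here are arbitrary illustrations, not anything from the PR:

```java
import java.util.concurrent.Semaphore;

// Illustrative only: gate work on estimated memory weight, not thread count.
final class MerkleTreeBudget {
  private static final int BUDGET_KIB = 1 << 20; // ~1 GiB budget, in KiB permits
  private final Semaphore budget = new Semaphore(BUDGET_KIB);

  void runCharged(long estimatedBytes, Runnable work) throws InterruptedException {
    // Charge at least one permit; cap at the full budget so an oversized
    // tree can still run (alone) instead of deadlocking.
    int permits = (int) Math.min(BUDGET_KIB, (estimatedBytes >> 10) + 1);
    budget.acquire(permits); // blocks until enough budget is free
    try {
      work.run();
    } finally {
      budget.release(permits);
    }
  }
}
```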
Limiting it by the number of cores is a better heuristic than limiting it by the --jobs flag, the reason being that the amount of RAM a system has tends to be proportional to its number of cores. The RAM on my system is not necessarily proportional to the number of worker threads my build cluster running on {AWS,GCP,...} has.
I'm fine with whatever the limit is, as long as there is a sane one. Ulf mentioned that this code should not have been covered by a semaphore. This is something I object to, given the fact that the Merkle tree computation code is not written in such a way that it can return partial Merkle trees.
I am working on async execution (running actions in virtual threads; check --experimental_async_execution in HEAD), which might allow us to remove this limiting because the size of the underlying thread pool equals the number of cores (see the sketch below).
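For context, a toy sketch of that model with Java 21 virtual threads; runOneAction() is a placeholder, not Bazel code. The point is that 512 jobs no longer require 512 OS threads, since virtual threads are multiplexed onto a carrier pool that defaults to the CPU count:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class VirtualThreadDemo {
  static void runOneAction() { /* placeholder for running one remote action */ }

  public static void main(String[] args) {
    // Each action gets a cheap virtual thread; the carrier thread pool
    // defaults to the number of CPU cores.
    try (ExecutorService actions = Executors.newVirtualThreadPerTaskExecutor()) {
      for (int i = 0; i < 512; i++) {
        actions.submit(VirtualThreadDemo::runOneAction);
      }
    } // close() waits for submitted tasks to finish
  }
}
```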
When using Bazel in combination with a larger remote execution cluster, it's not uncommon to call it with something like --jobs=512. We have observed that this is currently problematic for a couple of reasons:

1. It causes Bazel to launch 512 local threads, each being responsible for running one action remotely. All of these local threads may spend a lot of time in buildRemoteAction(), generating input roots in the form of Merkle trees. As the local system tends to have fewer than 512 CPUs, all of these threads will unnecessarily compete with each other. One practical downside of that is that interrupting Bazel using ^C takes a very long time, as it first wants to complete the computation of all 512 Merkle trees. Let's put a semaphore in place, limiting the number of concurrent Merkle tree computations to the number of CPU cores available. By making the semaphore interruptible, Bazel will immediately stop processing the `512 - nCPU` actions that were waiting for the semaphore.

2. Related to the above, Bazel will end up keeping 512 Merkle trees in memory throughout all stages of execution. This makes sense, as we may get cache misses, requiring us to upload the input root afterwards. Or the execution of a remote action may fail, requiring us to upload the input root. That said, generally speaking these cases are fairly uncommon. Most builds have relatively high cache hit rates, and execution retries only happen rarely. It is therefore not worth keeping these Merkle trees in memory constantly. We only need them when computing the action digest for GetActionResult(), and while uploading the input root into the CAS.

3. AbstractSpawnStrategy.getInputMapping() has some smartness to memoize its results. This makes a lot of sense for local execution, where the input mapping is used in a couple of places. For remote caching/execution it is not evident that this is a good idea: assuming you end up having a remote cache hit, you don't need it. Let's make the memoization optional, only using it in cases where we do local execution (which may also happen when you get a cache miss when doing remote caching without remote execution). A sketch of this conditional memoization follows.

Similar changes against Bazel 5.x have allowed me to successfully do builds of a large monorepo using --jobs=512 with the default heap size limits, whereas I would previously see occasional OOM behaviour even when providing --host_jvm_args=-Xmx64g.
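To illustrate point 3, a simplified sketch of conditional memoization; the willAccessRepeatedly parameter mirrors the shape of the change described above, but the types and names here are stand-ins rather than Bazel's real signatures:

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Illustrative, single-threaded sketch; not Bazel's actual implementation.
final class InputMappingHolder {
  private SortedMap<String, String> memoized; // exec path -> input, simplified

  SortedMap<String, String> getInputMapping(boolean willAccessRepeatedly) {
    if (memoized != null) {
      return memoized;
    }
    SortedMap<String, String> mapping = computeInputMapping();
    if (willAccessRepeatedly) {
      memoized = mapping; // keep it only when local execution will reuse it
    }
    return mapping;
  }

  private SortedMap<String, String> computeInputMapping() {
    return new TreeMap<>(); // placeholder for the real input traversal
  }
}
```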