-
Notifications
You must be signed in to change notification settings - Fork 4.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make Bazel more responsive and use less memory when --jobs is high #17120
Closed
EdSchouten
wants to merge
1
commit into
bazelbuild:master
from
EdSchouten:eschouten/20230103-memory-usage
Closed
Changes from all commits
Commits
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks wrong because this makes a remote call, and this is now limited to N parallel remote calls. This should have been outside the semaphore.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding was that this limits the number of merkle trees in flight preventing OOMs.
@EdSchouten was this the intention?
also cc @tjgq (and leaving a reference to #21378 here)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm specifically talking about the
ensureInputsPresent
call, which performs remote calls, and is now limited.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exactly, it has to be limited to make sure we don't run OOM because memory is filled up with Merkle trees. Yes, this does mean that concurrency of outgoing network calls is limited as well.
One way to make this less problematic would be to implement buildInputMerkleTree() in such a way that it streams directories that it creates, and that FindMissingBlobs() is called in batches. But because buildInputMerkleTree() is not built like that, we currently don't have a lot of options here.
I guess I never observed this to be problematic for performance, because Buildbarn's FindMissingBlobs() implementation is sufficiently fast. It typically completes in ~2 milliseconds, even if 50k digests are provided.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Limiting this by the # of cores is nonsensical when concerned with memory exhaustion - I might as well limit by the number 42 - you're no more or less likely to create memory pressure with this based on the number of cores on a host.
We should limit by a heap pool size, charge by the weight of the merkle tree, and if it turns out that portions of that tree are shared across merkle trees (not sure if that optimization exists), then it should be the net weight of all unique merkle tree components.
Measuring this in milliseconds means that you're blocked on it - for a remote that responds in 1ms, the throughput cap is N * 1000, decided arbitrarily. Utilizing the network also means that there's a potential for async resource exhaustion there that will impact the runtime (but has nothing to do with remediating memory). The case where this was discovered was pathologically small input trees, --jobs=10000 with sub-second execution times, where bandwidth consumption, queued request processing on the channel, and the like could drastically increase the time in the critical section, through no fault of any remote service.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Limiting it by the number of cores is a better heuristic than limiting it by the
--jobs
flag. The reason being that the amount of RAM a system has tends to be proportional to the number of cores. The RAM on my system is not necessarily proportional to the number of worker threads my build cluster running on {AWS,GCP,...} has.I'm fine with whatever limit it is. As long as there is a sane one. Ulf mentioned that this code should not have been covered by a semaphore. This is something I object to, give the fact that the Merkle tree computation code is not written in such a way that it can return partial Merkle trees.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am working on async execution (running actions in virtual thread, check
--experimental_async_execution
in HEAD) which might allow us to remove this limiting because the size of the underlying thread pool equals to the number of cores.