V3 idea: Loosen the restriction of action input as Merkle tree #141
Thanks for filing this Fredrik!
We could also allow the `DirectoryNode` messages within a `Directory` to overlap - essentially allowing aliasing at any part of the tree, not just the root. I'm not sure if that's actually more useful than repeating `input_root_digest`, but it has the benefit of not needing proto field changes, only a relaxation of which trees a server chooses to accept.
I've been pondering this off and on for the last few weeks as well, and overall I still like the sound of it, but I think it'll need prototyping to demonstrate it would actually be a big performance win. E.g. for Bazel, this would probably be good if it ultimately produces a Merkle tree with a few overlapping nodes at the root, but would probably break down if it produced a Merkle tree with O(deps) overlapping nodes at the root, each of which is a linear chain of nodes down to one or two files - at that point the root node would be better off being a linear list of files. It may also be necessary to allow skipping levels, if the most expensive cost is actually rebuilding N levels of Merkle tree nodes for each action.
Nice and simple suggestion, Eric! A simple implementation would be:

```python
files = SortedMap()              # path string -> Digest
visitedDirectories = HashSet()   # digests of Directory messages already expanded

def visitDirectory(rootPath, rootDigest):
    # Expand each distinct Directory only once; overlapping (aliased)
    # subtrees are deduplicated by digest.
    if visitedDirectories.add(rootDigest):  # add() returns True only when newly inserted
        directoryMessage = cas.getDirectory(rootDigest)
        for subPath, subDigest in directoryMessage.files:
            # The same path must never map to two different digests.
            files.putAndFailIfDifferent(rootPath / subPath, subDigest)
        for subPath, subDigest in directoryMessage.directories:
            visitDirectory(rootPath / subPath, subDigest)

visitDirectory("/", inputRootDigest)
# Hash a canonical serialization of the sorted path -> digest map.
deterministicInputDigest = sha256sum(files)
```

Two observations:
1. Your suggestion still allows using the v2 proto for transport, so it should be fairly simple to add.
2. Regarding the performance win, my early alias implementation saves a lot of time: bazelbuild/bazel#10875 (comment). I also added an in-memory action cache and in-memory CAS to Bazel, which removes most of the cache handling time.
Just out of curiosity, is this a consequence of Bazel not 'pushing down' the concept of a tree deep enough? I guess that if Bazel didn't provide us the inputs as a …
@EdSchouten one benefit comes down to reducing tree merging - for any given action, Bazel today has to merge the dependency trees of its direct dependencies, which will be partially but not completely overlapping. So it has to walk the ~full trees and recalculate a bunch of hashes for every single action, vs. just re-hashing, say, one or two low-level nodes. So we're exploring ways to reduce that cost.

As a rough analogy, you can think of this like the difference between adding items to a heap one by one (O(n log n)) versus collecting all the items and then heapifying in one shot (O(n)). In the case of Bazel, executors merging a whole bunch of mostly-redundant trees at once at execution time should be cheaper than Bazel having to update a new canonical tree for each dependency in turn. And then on top of this, both moving the tree-merging behind the cache lookup (skipped in 90%+ of cases) and into a remote system (distributed, not bottlenecked on the builder resources) should make it an easy win on net - deferring the merge costs far enough that they mostly don't get paid at all, reducing the net cost when they do get executed, and distributing that work.

I'm not sure of the exact optimal algorithm, since you wouldn't want this to degenerate to "list everything linearly in a single root node", but e.g. "wait until you have N (10? 100?) redundant entries and then merge them together down one level" would presumably be much cheaper for Bazel than full canonicalization, while keeping ~all the benefits of a recursive Merkle tree in practice. @moroten also has a prototype in this bug that showed significant wins, but I didn't fully grok how it was structuring inputs so I'll leave it to him to describe :). Possibly more closely aligned with the existing nested sets in Bazel, however they're structured today?
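To make the deferred-merging idea concrete, here's a minimal sketch, assuming a hypothetical `merge_one_level` callback that folds a batch of subtree digests into a single directory node one level up (the threshold value is purely illustrative):

```python
MERGE_THRESHOLD = 100  # the "N" above - a tuning knob (10? 100?)

class LazyInputRoot:
    """Keep an input root as a flat list of possibly-overlapping subtree
    digests; canonicalize only one level once the list grows too large."""

    def __init__(self, merge_one_level):
        # merge_one_level: hypothetical callback folding a list of subtree
        # digests into a single directory digest one level up.
        self.merge_one_level = merge_one_level
        self.roots = []

    def add_subtree(self, digest):
        self.roots.append(digest)
        if len(self.roots) > MERGE_THRESHOLD:
            # Merge down one level instead of fully re-canonicalizing,
            # keeping most of the recursive Merkle structure intact.
            self.roots = [self.merge_one_level(self.roots)]
```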
The implementation I've done in Bazel, https://github.com/moroten/bazel/blob/c01401f38ec9ed7122a53ee9d4cc673b30ba590b/src/main/java/com/google/devtools/build/lib/remote/FastInputKeyCache.java#L51-L91 (Add fast cache check for remote execution) on https://github.com/moroten/bazel/tree/optimize-remote-execution, views a depset as a directed acyclic graph and memoizes the digest of each depset globally. The same input file might exist in multiple copies in the DAG at this stage, but that's okay. (The depset implementation has recently changed to avoid accessing the internals, so there is no clean way to do this at the moment.) Bazel then stores the "depset digest to Merkle tree" mapping globally.

For now, I hijack the action cache by using the input tree digest and stdout fields. This creates an extra lookup round trip, but saves computational time on the Bazel client side. It would be better to have built-in support for this mechanism in the ReAPI instead, basically sending the depsets.

In the Bazel case, partitioning the input tree according to depsets will create a simple and reasonable tree structure. It will more or less follow the build rule partitioning. There must be similar data structures in other clients as well to handle transitive input files.
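As a rough sketch of the memoization described above (the `direct`/`transitive` node shape and the `depset_digest` name are assumptions for illustration, not Bazel's actual internals):

```python
import hashlib

memo = {}  # one digest per depset node, even if it is reachable many ways

def depset_digest(node):
    """Memoized digest over the depset DAG; each node is hashed at most once."""
    key = id(node)  # assumes depset nodes are stable, interned objects
    if key not in memo:
        h = hashlib.sha256()
        for path, file_digest in node.direct:  # direct (path, digest) entries
            h.update(path.encode())
            h.update(file_digest)
        for child in node.transitive:          # child depsets, in order
            h.update(depset_digest(child))
        memo[key] = h.digest()
    return memo[key]
```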
To reduce the workload of the remote execution clients, the requirement of constructing Merkle trees is removed. Instead, the client can choose whatever structure it would like for its input file tree. A client might choose to replicate its internal dependency graph. The downside is that different clients, or even different versions of the same client, might not get cache hits on effectively the same action. However, sharing the cache between different clients is not a realistic use case, and clients are most likely stable in their serialization between versions. Fixes bazelbuild#141.
Sorry for jumping late into this conversation, but I want to propose an alternate hashing scheme for file sets that I think addresses many of the concerns here. Basically, define the hash of a set of files to be the XOR of the hashes of the (filepath, contents) tuple for each file in the tree. This is really cheap to merge for any non-overlapping sets of files (even if they're intermixed in various subdirectories), and adding a file is now constant time, rather than linear in the depth of the tree. Interestingly, removing a set of files is also a cheap operation, as long as you know that all the files were previously in the set. For a client like Bazel working with depsets, there are interesting options available where you cache hashes for each depset, then use graph-walking operations during a merge to figure out which graph nodes are duplicates on both sides of the merge, and simply subtract them back off.
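A minimal sketch of the scheme, assuming SHA-256 as the per-file hash and a length prefix to keep the (filepath, contents) encoding unambiguous:

```python
import hashlib

def file_hash(path, contents):
    # Hash the (filepath, contents) tuple for a single file.
    encoded_path = path.encode()
    h = hashlib.sha256()
    h.update(len(encoded_path).to_bytes(4, "big"))  # unambiguous framing
    h.update(encoded_path)
    h.update(contents)
    return int.from_bytes(h.digest(), "big")

def set_hash(files):
    # XOR of per-file hashes: order-independent, constant time to add a file.
    acc = 0
    for path, contents in files:
        acc ^= file_hash(path, contents)
    return acc

a = set_hash([("src/main.c", b"int main() {}")])
b = set_hash([("include/util.h", b"#pragma once")])
merged = a ^ b           # merging non-overlapping sets is a single XOR
without_b = merged ^ b   # removing a known-present set is the same XOR
assert without_b == a
```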
@HALtheWise Interesting XOR hash suggestion! When checking the cache, only the root hash is needed, which could benefit from the faster calculation. One could still keep the requirement of uploading the Merkle tree structure when executing an action, like in v2. (My goal has been to optimize for the case of a 99.99% cache hit rate.)
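In flow terms, the split being described might look roughly like this (all names here are hypothetical; only the shape of the control flow matters):

```python
def run_action(inputs, cheap_set_digest, action_cache, build_merkle_tree, execute):
    # Cheap, order-independent digest (e.g. the XOR scheme above) used only
    # for the action-cache lookup; no Merkle tree is built on this path.
    key = cheap_set_digest(inputs)
    result = action_cache.get(key)
    if result is not None:
        return result  # the ~99.99% cache-hit path stops here
    # Only on a miss do we pay for full Merkle tree construction, which v2
    # still requires for actual execution.
    tree = build_merkle_tree(inputs)
    result = execute(tree)
    action_cache.put(key, result)
    return result
```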
Ooh, that's an interesting direction! I could definitely see it as a path to cheaper cache hits. Relatedly, how would this work for trees that embed any other data (is_executable, node_properties, etc.)? In most cases I'd expect the set of paths to be unique enough, but that's not guaranteed by the spec, and I can certainly imagine cases where there'd be collisions (e.g. forgetting to set the executable bit, then correcting it and re-running the same action). I think (filepath, contents) is not quite sufficient as a result, though I could imagine replacing that with a struct for all the necessary metadata of a file.

At which point I guess the hypothesis is that it's easier for build tools to keep track of or update mutually non-intersecting sets of files+metadata, relative to some common root, than it is for them to keep track of recursively subdivided file sets of the same? I'll admit I don't know Bazel's fileset use well enough to quite understand that - I could see it in the simple case of "add the outputs of action X to this prebuilt fileset", but I'm not sure how it works for "unify the visible files from these N subtrees", as I assume it works for merging the inputs to an action.
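One way to fold that metadata in, sketched under the assumption that a small fixed struct per file suffices (the field set shown is illustrative, not what the spec mandates):

```python
import hashlib

def file_entry_hash(path, content_digest, is_executable, node_properties=()):
    # Hash a struct of per-file metadata, not just (filepath, contents),
    # so that e.g. flipping the executable bit changes the set hash.
    h = hashlib.sha256()
    for field in (path, content_digest, "1" if is_executable else "0", *node_properties):
        data = field.encode() if isinstance(field, str) else field
        h.update(len(data).to_bytes(4, "big"))  # length prefix keeps the encoding unambiguous
        h.update(data)
    return int.from_bytes(h.digest(), "big")
```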
I have observed that Bazel can spend a lot of CPU resources calculating Merkle tree digests. This has been discussed in bazelbuild/bazel#10875 and in "Extend the Action Cache with alias digests".
The key point is that the single input Merkle tree only needs to be resolved on a cache miss, which should be rare, so the client should be allowed to check for a cache hit using something else.
One idea was to create an alias cache entry whose digest the client would be able to calculate in any suitable way. The problem is that the alias has to be uploaded by the clients - a trusted CI machine or an untrusted developer machine - rather than by the remote execution server side. Therefore, using action cache aliases makes the system vulnerable to cache poisoning.
Instead, @EricBurnett suggests loosening the restriction on the input to describe partial trees:
What would be a good design?

- Extend `message Directory` to include more extra roots, not just subdirectories?
- Should `Action.input_root_digest` be repeated?
- Any other design ideas, or any ideas to solve the problem in a totally different way?