
Skipping intermediate steps for highly cached builds #321

Open · matts1 opened this issue Dec 6, 2024 · 5 comments

matts1 commented Dec 6, 2024

Consider a simple C++ binary. It generates two actions:
  • Compile (main.cc -> main.o)
  • Link (main.o -> main)

When you attempt to build main with remote execution enabled, the client needs to calculate the action digest of the link action. However, to determine that digest, you need the digests of all of the link action's input files (main.o). To retrieve the digest of main.o, you must first look up the result of the compile action via its own action digest. So the critical path for a fully remotely cached build looks like this (sketched in code after the list):

  1. Calculate the digest of main.cc locally
  2. Calculate the action digest of the compile action
  3. Send ActionCache.GetActionResult(compile_action_digest) and wait for the response
  4. Read the response to get the digest of main.o, and use that to calculate the digest of the link action
  5. Send ActionCache.GetActionResult(link_action_digest) and wait for the response
  6. Read the digest of the main binary from the result, then read it from the CAS
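
To make the serialization concrete, here is a minimal Python sketch of the steps above. The cache stub with GetActionResult and cas_read methods, and the result object exposing output_files, are purely illustrative; none of these names come from the real REAPI:

```python
# Illustrative only: each GetActionResult response is needed before the
# next action digest can even be computed, so the round trips serialize.
def fully_cached_build(cache, main_cc_digest, compile_cmd, link_cmd, hash_fn):
    # Steps 1-2: local work, no network.
    compile_action_digest = hash_fn((compile_cmd, [main_cc_digest]))
    # Step 3: first round trip; we must block on the response.
    compile_result = cache.GetActionResult(compile_action_digest)
    # Step 4: only now do we know main.o's digest.
    main_o_digest = compile_result.output_files["main.o"]
    link_action_digest = hash_fn((link_cmd, [main_o_digest]))
    # Step 5: second round trip, strictly after the first.
    link_result = cache.GetActionResult(link_action_digest)
    # Step 6: fetch the final binary's bytes from the CAS.
    return cache.cas_read(link_result.output_files["main"])
```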

In general, a fully cached build needs to execute O(#targets) GetActionResult requests, and the depth of the action graph determines the maximum number of requests that must be serialized.

However, we can make a simple observation: it is a sufficient (but not necessary) condition for two actions to generate the same output that they run the same commands over the same set of transitive inputs; more specifically, the same set of transitive source inputs (where a source file is any file that is not generated).

We can now define two digest functions for an action (a runnable sketch follows the definitions):

  • action_digest(action) = hash((action.cmd, [input.digest for input in action.inputs])) (this is the existing one)
  • transitive_source_action_digest(action) = hash((action.cmd, [transitive_source_action_digest(input.generating_action) if input.is_generated else input.digest for input in action.inputs]))
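
As a concrete rendering of these two definitions, here is a small Python sketch. The Action model with .cmd, .inputs, .digest, .is_generated, and .generating_action attributes mirrors the pseudocode above rather than any real REAPI message:

```python
import hashlib

def _hash(*parts: bytes) -> bytes:
    h = hashlib.sha256()
    for p in parts:
        h.update(p)
    return h.digest()

def action_digest(action) -> bytes:
    # Existing scheme: command plus the digests of the direct inputs.
    # Generated inputs must already be resolved (built, or looked up in
    # the remote cache) before this can be computed.
    return _hash(action.cmd.encode(), *[inp.digest for inp in action.inputs])

def transitive_source_action_digest(action) -> bytes:
    # Proposed scheme: recurse into generating actions instead of using
    # their output digests, so only source-file digests and commands are
    # needed. Computable fully locally with no network calls. (A real
    # implementation would memoize per action to keep this linear.)
    parts = []
    for inp in action.inputs:
        if inp.is_generated:
            parts.append(transitive_source_action_digest(inp.generating_action))
        else:
            parts.append(inp.digest)
    return _hash(action.cmd.encode(), *parts)
```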

The action digest is extremely accurate but expensive to calculate, as computing it requires hitting the remote cache many times. The transitive source action digest, on the other hand, isn't particularly accurate, but it is cheap to calculate because it can be computed fully locally with no network calls.

We can combine the best of both worlds, however, by simply using both. My proposal is:

  • We add a field transitive_source_action_digest to UpdateActionResultRequest
  • We either map transitive_source_action_digest(action) to action_digest(action), or attempt to inline it and map it directly to ActionResult.
  • We add a field transitive_source_action_digest to GetActionResultRequest
    • Could use existing field action_digest instead
  • Allow GetActionResultRequest to query by either the transitive_source_action_digest field or the action_digest field (see the lookup sketch below)
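
Under these assumptions, a client-side lookup might look like the following sketch, reusing the digest functions sketched earlier; the ac stub and its key= keyword are illustrative, not proposed wire-level API:

```python
# Hypothetical client logic against a server that indexes the same
# ActionResult under both digests.
def lookup(ac, action):
    # Try the cheap key first: it is computable locally, with no prior
    # round trips to resolve generated inputs.
    result = ac.GetActionResult(key=transitive_source_action_digest(action))
    if result is not None:
        return result
    # Fall back to the precise key, which may require earlier cache
    # lookups (or builds) to learn the digests of generated inputs.
    return ac.GetActionResult(key=action_digest(action))
```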

With this mechanism, suppose you were building a large binary containing many targets. You would now do the following, in parallel (you could do it sequentially, but it would likely be slower):

  • Bottom-up, calculate the transitive source action digests. As each one completes, start issuing GetActionResult requests for it top-down
  • Bottom-up, calculate the regular action digests (resolving generated inputs as you go), eventually producing the action digest of the final target

In the case that the binary was already built at that commit by another user, this would ensure that you can build Chrome in O(1) round trips, rather than the current behavior, which is somewhere between O(depth) and O(n), depending on a variety of factors such as network performance and network parallelism.

This optimization could be implemented without any changes to RBE by simply executing two UpdateActionResult requests each time you execute an action (sketched below), but native support in RBE would improve performance and halve the action cache storage required. I also mention it here for visibility, so that various build systems will hopefully see the idea.
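
A rough sketch of that client-only workaround, again using an illustrative ac stub rather than real REAPI client code:

```python
# Write the same ActionResult under two keys via two standard
# UpdateActionResult calls; no server changes needed. The duplicate
# entry is the action cache storage that native support would avoid.
def record_result(ac, action, action_result):
    ac.UpdateActionResult(key=action_digest(action),
                          result=action_result)
    ac.UpdateActionResult(key=transitive_source_action_digest(action),
                          result=action_result)
```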

sluongng (Contributor) commented Dec 6, 2024

It sounds to me like you want to add transitive_source_action_digest to the Action message instead of to all the *Request messages.

Then you will have 2 action cache entries pointing to the same ActionResult message:

  1. The traditional Action with input_root_digest
  2. The new Action with transitive_source_action_digest (with or without a partial input_root_digest).

Which sounds OK until you consider that the location of the output could matter.

For example, we may have 2 link actions (one for a test, one for a binary) that both consume lib.o but place it in different directories inside the input tree 🤔. And we want to accurately differentiate the caches of the 2 link actions based on where lib.o is placed inside the input tree (i.e. under /src or under /test).

So I think a better solution would be to add an ActionNode under Directory, which can indicate a "lazy" version of FileNode or DirectoryNode. That or modifying {File,Directory}Node to support referencing an action digest for laziness. 🤔

In the recent remote-apis meeting, some folks also expressed the desire to change FileNode into something more flexible (i.e. point it to a list of file chunk digests instead of the file digest to support large files).

sluongng (Contributor) commented Dec 6, 2024

The tricky part, of course, is that an action could have multiple file/directory outputs. So we would want a way to specify which outputs we are interested in here (or perhaps all of them?).

mostynb (Contributor) commented Dec 9, 2024

I work on a similar setup (unfortunately not open source, so I can't refer to it directly), and I have some questions:

  1. How does this interact with remote execution? Would it require that server implementations abandon the common security feature that only remote execution workers can populate the action cache?
  2. As I think @sluongng hinted, how do you determine which outputs to reference in the "transitive inputs" cache lookup, and what should the client download? Taking the chromium build as an example, how would you represent "data_deps"? And how does this relate to bazel's "build without the bytes" implementation?

sluongng (Contributor) commented Dec 9, 2024

For RBE: I think it is worth mentioning that there are some overlaps between this proposal and the Action Graph API that folks have been wanting to add:

Instead of letting the client compute the final action digest recursively, you can push the whole graph to the server and let it handle digest computation, caching, and execution. That should save all the additional network round trips described in the problem statement here.

bergsieker (Collaborator) commented
I agree that lack of higher-order caching within the API is a problem. I think it's most interesting in the case of either built-from-source tooling within a larger build, or build orchestrator models that bundle together multiple, independent builds. In both cases, being able to have a single cache entry for the output built from a whole subtree would be great.

To extend Son's comments, I think what you'd need is not simply a list of transitive sources, but a tree of (Action, Relevant Outputs) tuples. That allows both capturing the full input signature of the action (which includes the source files, the command, environment variables, toolchain, etc.) and discarding the irrelevant outputs (e.g., if an Action produces both a .o and a log file, downstream Actions probably only care about the .o and not the log file).

It's reasonable to experiment with this using platform properties. For a given Action, take all Actions that it depends on and strip out any irrelevant outputs, then insert those Actions into the platform properties under a well-known key. Repeat for each level of the tree until you get to the target that you're interested in, then call GetActionResult (or UpdateActionResult as appropriate) with your Action (that includes all the recursive action dependencies).
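
A loose sketch of that experiment, with an invented property key and JSON serialization purely for illustration (the real encoding and key name would need to be agreed upon):

```python
import json

def recursive_action(action, relevant_outputs):
    # Build an Action-like dict whose platform properties embed the
    # pruned tree of dependency actions, bottom-up.
    deps = [recursive_action(dep, relevant_outputs)
            for dep in action.dependencies]
    return {
        "cmd": action.cmd,
        # Keep only the outputs that downstream actions actually consume
        # (e.g. the .o file, not the log file).
        "relevant_outputs": sorted(relevant_outputs(action)),
        # Hypothetical well-known platform property carrying the
        # recursive dependency tree.
        "platform_properties": {
            "x-recursive-action-deps": json.dumps(deps, sort_keys=True),
        },
    }
```

The digest of this structure then serves as the key for the GetActionResult or UpdateActionResult call at the level of the tree you care about.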

While this will fundamentally work, I think you'll run into some practical issues:

  • Per Mostyn, there are security implications to allowing clients to call UpdateActionResult directly. [1] So you either need to significantly weaken security, or find a way for the server to verify that the provided Action tree really does result in the specified outputs. That said, many people are perfectly happy using local execution + remote caching as a performance improvement, populating the cache from postsubmit builders and allowing reads from developers/presubmit/postsubmit.

  • You'll need to decide what to do with intermediate outputs. In the common case you're probably only interested in the final output, but as soon as anything goes awry users will want to download everything so they can debug.

  • If you want a verifiable cache entry, you're essentially looking at a graph execution that can accept either the actual graph or the recursive Action structure and handle execution at each node, as well as caching at each layer. This would definitely require support in the API, but it's something that we've repeatedly considered and discarded. (Not because it's a bad idea per se, but because we couldn't get consensus on whether it's a realistic idea or what it should look like.)

I'd take issue with the assertion that native API support is required for performance/space reasons. Making an additional call to UpdateActionResult should not cause a performance issue; it can be batched with the current call, or done asynchronously if you only care about eventual optimization of the cache. The actual binary outputs are already stored in content-addressed storage; the additional storage from new keys pointing to the same outputs should be negligible.

[1] Actually, there are significant security implications to letting workers call UpdateActionResult directly as well, because the worker itself could be compromised by the Action that it runs. At least some server implementations only let the server itself set cache entries. With this setup, a malicious action could still provide a bad result for its own key but can't write any other keys.
