[Experimental] Project Skymeld: Merging Analysis and Execution Phases of Skyframe #14057
A quick progress update:
Things that still need work/verification (non-exhaustive):
|
If I understood correctly at your BazelCon talk (thanks!), 6.0 will include this, guarded by an experimental flag. Can you share the flag(s) to set to enable this? The command-line reference for 6.0 isn't up yet, but I'd like to give it a try. I found https://bazel.build/reference/command-line-reference#flag--experimental_skymeld_ui - but the documentation suggests this only updates the terminal output. Or does this flag have the undocumented side-effect of also enabling skymeld? |
|
Hi everyone,
Exciting news! Skymeld has reached a certain level of maturity that we're now comfortable calling for its dogfooding in Bazel!
How to dogfood
Measuring impacts
Skymeld is expected to improve the end-to-end wall time of multi-target builds with remote execution. All the wall time wins should come from the analysis phase time. It's best if you already have your own mechanism to track build performance. Using bazel-bench to benchmark your builds also provides a good estimate. To ensure a controlled environment, we recommend using a dedicated machine for this. We're also very interested in any performance wins or issues (e.g. OOM) that you encounter.
Bug Reporting
Please file a bug with:
Excited to fix all the incoming bugs. Thanks in advance, adventurous dogfooders!
Edit 2023.07.24: remove the reference to |
This is unnecessary. We don't have a use case now where this flag is enabled and the skymeld flag isn't. #14057 PiperOrigin-RevId: 539598912 Change-Id: I5e2bda47085c606728b3a4a19d38ee0afa214812
|
I did a small benchmark today against the BuildBuddy code base.
Building
> hyperfine --prepare 'bazel clean --async' \
--warmup 1 \
'bazel build --config=x -k --remote_instance_name="$RANDOM" server' \
'bazel build --config=x -k --remote_instance_name="$RANDOM" --config=skymeld server'
Benchmark 1: bazel build --config=x -k --remote_instance_name="$RANDOM" server
Time (mean ± σ): 241.279 s ± 42.673 s [User: 0.134 s, System: 0.149 s]
Range (min … max): 169.694 s … 318.005 s 10 runs
Benchmark 2: bazel build --config=x -k --remote_instance_name="$RANDOM" --config=skymeld server
Time (mean ± σ): 213.551 s ± 75.335 s [User: 0.118 s, System: 0.129 s]
Range (min … max): 148.260 s … 400.258 s 10 runs
Summary
bazel build --config=x -k --remote_instance_name="$RANDOM" --config=skymeld server ran
1.13 ± 0.45 times faster than bazel build --config=x -k --remote_instance_name="$RANDOM" server
Building
> hyperfine --prepare 'bazel clean --async' \
--warmup 1 \
'bazel build --config=x -k server' \
'bazel build --config=x -k --config=skymeld server'
Benchmark 1: bazel build --config=x -k server
Time (mean ± σ): 19.282 s ± 0.473 s [User: 0.014 s, System: 0.023 s]
Range (min … max): 18.656 s … 20.218 s 10 runs
Benchmark 2: bazel build --config=x -k --config=skymeld server
Time (mean ± σ): 17.732 s ± 0.407 s [User: 0.014 s, System: 0.023 s]
Range (min … max): 17.118 s … 18.626 s 10 runs
Summary
bazel build --config=x -k --config=skymeld server ran
1.09 ± 0.04 times faster than bazel build --config=x -k server
And building + testing all targets (FOSS and Enterprise) with remote cache
> hyperfine --prepare 'bazel clean --async' \
--warmup 1 \
'bazel build --config=x -k //...' \
'bazel build --config=x -k --config=skymeld //...'
Benchmark 1: bazel build --config=x -k //...
Time (mean ± σ): 27.113 s ± 1.020 s [User: 0.014 s, System: 0.020 s]
Range (min … max): 25.938 s … 28.833 s 10 runs
Benchmark 2: bazel build --config=x -k --config=skymeld //...
Time (mean ± σ): 25.349 s ± 1.155 s [User: 0.014 s, System: 0.021 s]
Range (min … max): 23.567 s … 27.314 s 10 runs
Summary
bazel build --config=x -k --config=skymeld //... ran
1.07 ± 0.06 times faster than bazel build --config=x -k //...
where
So overall, the perf gain is about 7% to 13% when Skymeld is enabled. I have been using it on my personal setup for the past week and have not noticed any issues. Great job @joeleba! |
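For anyone who wants to check the "7% to 13%" summary, here is a minimal Python sketch that recomputes the ratios from the means and standard deviations printed above. The `speedup` helper and the benchmark labels are hypothetical names for illustration, and the uncertainty uses first-order propagation of independent relative errors, which is an assumption on my side that happens to reproduce the reported ± values.

```python
import math

# (mean, stddev) in seconds, copied from the hyperfine output above.
# Each entry is (baseline, with --config=skymeld). Labels are illustrative only.
RUNS = {
    "benchmark 1 (server, random remote instance name)": ((241.279, 42.673), (213.551, 75.335)),
    "benchmark 2 (server)": ((19.282, 0.473), (17.732, 0.407)),
    "benchmark 3 (//...)": ((27.113, 1.020), (25.349, 1.155)),
}


def speedup(baseline, candidate):
    """Return (ratio, uncertainty) of the baseline mean over the candidate mean."""
    (m_base, s_base), (m_cand, s_cand) = baseline, candidate
    ratio = m_base / m_cand
    # Assumption: first-order propagation of independent relative errors.
    sigma = ratio * math.sqrt((s_base / m_base) ** 2 + (s_cand / m_cand) ** 2)
    return ratio, sigma


for name, (baseline, candidate) in RUNS.items():
    ratio, sigma = speedup(baseline, candidate)
    print(f"{name}: {ratio:.2f} ± {sigma:.2f} times faster")
```

Note that in the first benchmark the uncertainty (± 0.45) is much larger than the measured gain, so that particular number is better read as "no regression observed" than as a firm 13% win.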
That's great! Thanks for the benchmark @sluongng ! |
@joeleba One thing I am very excited about with Skymeld is that it makes a
Having this is important because it would let us drop the requirement to run all repository rules locally and instead run them remotely with RBE. The user's workspace would not need to download any of the external dependencies to start a build; it would instead delegate the download and caching to the RBE service. Repository rules would then be the same as build rules, with additional network dependencies. Do you think this would be easy or hard to achieve once we stabilize Skymeld? |
@sluongng Just note that fully running a repository rule remotely would require some kind of standalone Starlark interpreter to run the full logic. From what I have seen, Buck2's anonymous targets, when applied to the typical situations repo rules solve in Bazel, would look more like a |
@fmeum you could already emulate the action execution (i.e. download the archive, unpack it, and populate it with BUILD files through some sort of patching). However, you could not feed the downloaded content and generated BUILD files back to Skyframe as build targets. AFAIK, this ability is currently unique to repository rules, as the Skyframe graph is mostly fixed after the Analysis phase. So the ask here is not to reimplement
The Starlark interpreter could sit on the user's laptop, and missing Starlark files could be fetched lazily from the RBE cache, similar to Build without Bytes. The pain point I am trying to solve here is: in a large repo/workspace with tens to hundreds of thousands of external dependencies, asking users to download all of that locally before starting a remote build is not feasible. As we begin to blur the boundary between the Analysis phase and the Execution phase, it seems like we should be able to run and cache most of the heavy parts of the Analysis phase remotely, by allowing Execution rules to return typical Analysis rules' results and feed them back into the Skyframe graph. |
Totally understand this pain point.
Repository rules and build rules are fundamentally different, and those differences won't go away just because Skymeld is enabled. But to solve the pain point, we don't have to merge repository rules and build rules. What we really need is to somehow bring remote caching to repository rules. @Wyverald will work on a true repository cache design; it's probably a good idea to also consider how it could work efficiently with remote execution. FYI @coeuvre, our remote execution expert. |
@sluongng Sorry, I'm not super familiar with how external dependencies are handled in Bazel. In particular, it's unclear to me how Skymeld could help this. My guess is "Skymeld would allow actions which don't require external deps to be run while the fetching is ongoing". Is that what you meant? |
I could totally compromise on the solutions as long as the pain point is solved 🤗
I was thinking that The most critical part here is (c).
So when I run |
As you said, part (c) is very much non-trivial. It essentially asks the action graph and the configured target graph to be intertwined. This is likely to be a multi-year project; for reference, Skymeld has taken multiple years despite having arguably a less ambitious goal (it doesn't shuffle the "phases" around). Even part (b) is not that trivial. It's basically saying that a build rule should be able to run a repo rule. It's easy to conceptualize a "build" version of And part (a) is also opening the floodgates... Bazel prides itself on hermeticity and reproducibility of builds, and that's what enables a lot of the caching. Repo rules are a clearly demarcated API to introduce nonhermeticity. If you blurred the lines, it'd be unclear when we could cache a build rule.
It's not very clear to me what "bringing remote cache to repository rules" means. I could imagine a couple of interpretations:
|
Yes, this is what I meant.
Thanks for pointing this out! I agree with the conclusion; it is indeed beyond the scope of the "true repo cache" design. I think the key reason is that the current Bazel remote cache works for Bazel-generated build outputs, whereas external repos are not build outputs: they are sources prepared before the build. |
I agree with the analysis. Much appreciated. Here is my understanding so far: we agreed on the problem that the analysis phase could be costly to run locally. In general, there are 2 approaches to solving this:
I was hoping that Skymeld would make (2) more feasible, but it seems like we are already on the road of exploring (1). Both approaches would still depend on the final critical component: Starlark loading during the analysis phase must happen locally, where Bazel's JVM runs. Hence the need to lazily load Starlark files and other needed dependencies. There are 2 ways to go about it: a. Lazily load within Bazel, similar to Build without Bytes today. I guess I will wait to see how (b) is going to work out with #12823 (comment) |
Bazel@HEAD + Downstream: https://buildkite.com/bazel/bazel-at-head-plus-downstream/builds/3355#_ The above run didn't perform worse than the latest [automated daily run](https://buildkite.com/bazel/bazel-at-head-plus-downstream/builds/3353#_) #14057 PiperOrigin-RevId: 570950844 Change-Id: I0c38839cb2b9e979265bbe3bff903a54162da4f4
#14057 PiperOrigin-RevId: 572493968 Change-Id: Ifbb1840eb54cea6777e142408b784703382eb4f0
This is coming in Bazel 7.0, if nothing changes 😄 |
Note to the consumers of BEP metrics: please be aware of the new meanings of bazel/src/main/java/com/google/devtools/build/lib/buildeventstream/proto/build_event_stream.proto Lines 1018 to 1035 in 6935927
|
The new code documents:
If the sum is always >= the wall time, does that mean the wall time is completely covered by the analysis and execution phases alone? How does this relate to bazel/src/main/java/com/google/devtools/build/lib/profiler/ProfilePhase.java Lines 20 to 32 in f596485
Just looking at these two files, I would have assumed the wall time includes all phases, which would then mean that |
Thanks for spotting this. The wall time indeed covers more than only the analysis and execution phases and it's technically possible that For context: the comment was mainly to highlight that:
These aren't super meaningful, but they sort of fit the reality where we don't have the hard divide between analysis/execution anymore. It's the overlapping of the phases that makes it possible that |
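To make the overlap point concrete, here is a small Python sketch with made-up timestamps. It only shows why two per-phase durations can add up to more than the time the two phases jointly cover once they overlap. It does not model the other build phases, and the names (`union_length`, `analysis`, `execution`) are hypothetical, not the BEP field names.

```python
def union_length(intervals):
    """Total time covered by at least one of the (start, end) intervals."""
    covered = 0.0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # Disjoint from the running interval: close it and start a new one.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping: extend the running interval.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return covered


# Hypothetical phase boundaries in seconds since the start of the build.
analysis = (0.0, 6.0)    # analysis is active from t=0 to t=6
execution = (4.0, 10.0)  # execution starts at t=4, before analysis ends

phase_sum = (analysis[1] - analysis[0]) + (execution[1] - execution[0])
covered = union_length([analysis, execution])

print(f"sum of phase times: {phase_sum:.1f}s")  # 6 + 6 = 12
print(f"time jointly covered: {covered:.1f}s")  # 10, since t=4..6 is counted once
```

Without overlap the two numbers would be equal; every second in which both phases are active is counted twice in the sum.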
Context: bazelbuild/bazel#14057 (comment) PiperOrigin-RevId: 584831759
Context: #14057 (comment) PiperOrigin-RevId: 584831759 Change-Id: I87359df6551f4221a6e506c1f458ccbeb9b798f2
Relevant
I just wrote a proposal for this - feel free to give feedback: https://docs.google.com/document/d/1OsEHpsJXXMC9SFAmAh20S42Dbmgdj4cNyYAsFOHMibo/edit |
Context: bazelbuild/bazel#14057 (comment) (cherry picked from commit 1b71de1)
Description of the problem / feature request:
In a regular build, Bazel loads and analyzes the target patterns with Skyframe to form the ActionGraph; this is the Loading-and-Analysis phase. It then performs some setup outside Skyframe before it starts executing the actions, which is the Execution phase.
Our hypothesis: by allowing the loading/analysis and execution phases to interleave, we could improve build performance, especially for multi-target builds.
By removing the barrier between the phases, we allow targets that have finished analysis to start execution immediately. There tend to be many dormant threads towards the end of the analysis phase, and we could use those resources for the execution phase.
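A toy Python model of the hypothesis, under strong simplifying assumptions: the targets are independent, there are always enough threads, and the per-target times are invented. It is not how Skyframe schedules work; it only shows where the barrier costs time.

```python
# (analysis_seconds, execution_seconds) per top-level target; hypothetical numbers.
TARGETS = [(1.0, 5.0), (4.0, 1.0)]


def wall_time_with_barrier(targets):
    """Execution of any target starts only after all targets are analyzed."""
    analysis_done = max(a for a, _ in targets)
    return analysis_done + max(e for _, e in targets)


def wall_time_interleaved(targets):
    """Each target starts executing as soon as its own analysis finishes."""
    return max(a + e for a, e in targets)


print(wall_time_with_barrier(TARGETS))  # 4 + 5 = 9.0
print(wall_time_interleaved(TARGETS))   # max(1 + 5, 4 + 1) = 6.0
```

In this model the quickly analyzed but slow-to-execute target no longer waits for the slowest analysis, which is why the gain is expected mainly for multi-target builds.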
More details to follow.