Avoid action conflicts due to different configuration hashes #14236
Comments
I'm a bit lost in the conversation history here and at #13587. I didn't fully follow Case 3. For example, can you elaborate on "This has the unwanted consequence that E can no longer be a dependency of D"? And what is the exact goal? All of @sdtwigg's ongoing work should eliminate action conflicts more fundamentally. Is the primary concern of this issue action conflicts, duplicated action performance issues, or both? |
#14023 (and @sdtwigg's solution) takes care of the "directory name fragment" problem, while this issue takes care of some other issues discussed in #13587. Currently this is a real issue for us (see below), while we have #13587 as a sufficient workaround for #14023. Note also that other namespace aspects were discussed in #13587, like the IDE/debug issue, that are not covered in either of these issues. This means solving both of these issues is required but not necessarily sufficient to deprecate the "output directory name" transition.
Sure. Since D (or the edge to D) needs to reset build-setting values to avoid the above conflict, it cannot depend on any target that depends on these build settings (E above). The build-setting dependency can arise both through use of native functions like cc_common.compile and through the use of custom build settings. This means that the graph has to be rewritten such that E is not a dependency of D (for instance, moved to B and C in the example above).
The goal with this issue is to remove the need for current and upcoming workarounds (technical debt). Workarounds are chosen case by case depending on the situation, but are typically:
|
That makes more sense, thank you. So in the above example, the sole purpose of the build setting is for B/C to set it and for E to consume it? And you desire both of
I support this goal. But there's an intrinsic tension between efficiency (non-duplication, efficient graphs) and correctness (no action conflicts). We can't confidently know if D's actions depend on the build setting without having fine-grained insight into whether its actions consume anything coming from E. The current Starlark APIs don't provide enough isolation to make that easy: D's rule implementation runs a bunch of Starlark that presumably includes E somehow (otherwise why would E be a dep?) and also creates a bunch of actions. If any of these are intertwined we get implicit dependencies that require D's outputs to be different. If you really know for a fact this isn't a concern for your rules, we can consider some kind of special tagging. But that won't apply generally, and that puts more responsibility on you as rule designer to ensure these tags are correct and don't fall out of date. There's also the stripped output path approach, which I've long been interested in: don't worry about these issues, but before sending the work to the executor strip all config paths completely. The net effect is actions won't be executed redundantly. But the build graph itself will still retain redundancies, so it's only a partial solution. Whether that's enough by itself of course depends on specific use cases. |
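To make the stripped output path idea above concrete, here is a minimal Python sketch. It is a toy model, not Bazel's implementation: the `bazel-out/<config>/...` layout, the `cfg` placeholder, the `ST-1111`/`ST-2222` directory names, and the helper names `strip_config` and `unique_executions` are all assumptions made for illustration.

```python
import re

# Assumed layout: bazel-out/<config segment>/<rest>. The placeholder name
# "cfg" is an illustrative choice, not a claim about Bazel's behavior.
_CONFIG_SEGMENT = re.compile(r"bazel-out/[^/\s]+/")


def strip_config(text):
    """Replace every configuration segment in a path or command line."""
    return _CONFIG_SEGMENT.sub("bazel-out/cfg/", text)


def unique_executions(command_lines):
    """Group command lines that become identical once paths are stripped."""
    groups = {}
    for cmd in command_lines:
        groups.setdefault(strip_config(cmd), []).append(cmd)
    return groups


# Hypothetical command lines: D does not read the build setting, E does.
commands = [
    "gcc -c D.c -o bazel-out/k8-fastbuild-ST-1111/bin/D/D.o",
    "gcc -c D.c -o bazel-out/k8-fastbuild-ST-2222/bin/D/D.o",
    "gcc -c E.c -DMODE=b -o bazel-out/k8-fastbuild-ST-1111/bin/E/E.o",
    "gcc -c E.c -DMODE=c -o bazel-out/k8-fastbuild-ST-2222/bin/E/E.o",
]
groups = unique_executions(commands)
```

In this model the two D compiles collapse into one execution while the two E compiles stay distinct because their flags differ. The graph still contains both copies of D, which matches the "partial solution" caveat above.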
The lens that I have found most clear for thinking about this is thinking about how to handle a single rule with multiple actions, some of which depend on the build configuration while others don't. For example, a (custom) C++ rule that wanted to compile only once but link in several different configurations for different targets. In this view, an ideal API would @gregestren correctly points out that a naive implementation of this breaks correctness if the rule author misses some subtle way that the configuration value can affect the action parameters. At minimum, the rule user can always pass For our use case (@torgil can you confirm for yours) I think we can avoid this problem by instead just requiring that all instances of an action that generate a given file in a given output root must be identical, and failing the build if that is not the case. That way, if someone tries to get tricky with select statements, it just fails. This proposed rule (all actions generating a given file in a single build must be identical) is looser than the current rule in Bazel master (only one action may generate a file) but stricter than that in 4.x (hahaha, go for it, hopefully the outputs are equivalent). It still leaves the onus on rule authors to determine what files are "supposed" to depend on what configuration values, but that's acceptable for us. |
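The proposed rule (all actions generating a given file in a single build must be identical) can be sketched as a small registry. This is only an illustration in Python under stated assumptions: `ActionRegistry`, `ActionConflictError`, and the string action keys are hypothetical stand-ins, not Bazel classes or real action hashes.

```python
class ActionConflictError(Exception):
    """Raised when two different actions claim the same output file."""


class ActionRegistry:
    """Toy registry: one output path may be declared many times, but only
    by actions with the same key (hypothetical stand-in for an action hash)."""

    def __init__(self):
        self._key_by_output = {}

    def register(self, output_path, action_key):
        """Return True if the action is new, False if it is shared."""
        existing = self._key_by_output.get(output_path)
        if existing is None:
            self._key_by_output[output_path] = action_key
            return True
        if existing != action_key:
            raise ActionConflictError(
                f"{output_path}: action {action_key!r} conflicts with {existing!r}"
            )
        return False  # identical action: share it, run it once


registry = ActionRegistry()
# Two differently configured instances of D declare the same compile action.
first = registry.register("bin/D/D.o", "compile:D.c:-O2")
second = registry.register("bin/D/D.o", "compile:D.c:-O2")
```

A third registration for `bin/D/D.o` with a different key, for example one produced by a tricky `select`, would raise `ActionConflictError` instead of silently letting one configuration win.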
Yes, but it can also be the case that E sends build-setting based information up to B/C to consume. It can (depending on use-case) also make sense for E to avoid name-collisions in the local file path, eg
A common situation leading here is that variation tends to appear in lower-level nodes (E). D may be a library with no need for variation itself, but it calls functions that in turn have configuration-specific functionality depending on OS, platform, hardware, or other factors. B and C may want to control this environment/configuration through a build setting. The object files and static library D produces are independent of the build setting, but when B or C link binaries they need to link against the correct (configuration-dependent) libraries given by E. |
With "output directory name" transition (no hashes), this fails with action conflicts today.
Yes. This will produce an action conflict the rule author needs to take care of. Can you elaborate on how a user could break correctness with tricky select statements?
In Case 3 above, D has two identical actions generating the same file. It worked on master as of Friday. The "output directory name" patch didn't add anything to enforce this. |
Which change from 4.x to master do you mean? Removal of the This is similar to action sharing, although my reading is you're saying something slightly different. |
I may be wrong about how this works, and apologies for taking so many words to explain this, but what I think you're referring to with case 3 is where all configuration settings on the rule are identical, in which case the analysis-time rule graph only has a single node, and I've been referring to that as a "single rule" and "single action". I'm instead trying to refer to a situation where multiple differently-configured rules produce actions with identical parameters, because this particular action in the rule doesn't depend on the configuration values that differ. This was possible to accomplish in 4.x using
I hadn't heard of this before, but it seems like it might be exactly what I'm looking for here. Multiple differently-configured versions of a rule should be able to declare that a specific output file does not depend on the configuration, in which case the output root should be configuration-blind and any actions producing that file should be "shared" between the differently-configured rule instances and only executed once (with an error if the action key is different). |
In Bazel 4.x, we were using
# This rule is written assuming that its output files will always be
# bit-for-bit identical regardless of the active configuration
generate_code(
    srcs = select({"@platforms//os:linux": ["a.in"], ...})
)
I could have sworn that I saw this happen once without Bazel raising an error, and instead just picking one configuration of the rule to "win", although it's possible that even in 4.x there were safety guards against action conflicts like this and I'm misremembering. If those guards are in place, I think they should still be sufficient going forward to make sure that no two different-key actions in the same build produce the same file, which is the only way correctness issues arise. I'm not aware of any way to produce this sort of conflict in 5.x prereleases today. |
Alright. This is exactly what I'm trying to solve with
This should also be hit by above config hash issue. It would simplify deduplication of actions in cases where a rule creates both platform dependent and platform independent actions. The alternates I've considered are less appealing:
Is there another way to deal with this today? I don't see this replace |
I looked a little further into the Case 2 action conflict and noticed it failed because the compile action(s) for D2.c weren't shareable. The following two patches seem to fix Case 2:
With these patches, aquery says there is one compile action for D2.c and one link action for libD2.a. Building A2 seems to have a glitch in that it shows multiple versions of the same file in the command line output:
Actions seem to be shared (building A1 has 13 processes, 8 linux-sandbox). Patches are here (based on master today): @gregestren Do we know why Cpp actions are not shareable? Except for the glitch, are there any other problems with the above solution for Case 2? |
C++ actions aren't shareable because of include scanning: finding dependencies by reading It's all annoyingly complicated but the main takeaway is it's harder to know at analysis time if actions are really the same when the above means an action may actually change later in the build. We've talked about ways of loosening this restriction but it's a challenging issue. |
For cases where the new directory name should propagate all the way through the dependency tree (third party code), what are your use cases that aren't solved by #14023 combined with using a transition so you can always depend on the I need to do more thinking about how to handle third-party rules like rust, maybe rules should have a default attribute to specify config keys to ignore in all actions/files they generate? |
Yes. This is similar to what we do today but it's non-ideal:
Good point. In this case Bazel should have the knowledge to drop those build-settings in the third party target configuration without the need to explicitly reset them in a transition since they're not included in the third party code dependency tree. |
Could two actions that have identical action hashes after the analysis phase end up as different actions in the execution phase?
Is it a problem that two actions that are different in the analysis phase end up the same after include scanning?
I've updated my branch above with a new core option |
@sdtwigg is working on the overall action conflict issue: let's wait for that to land and then re-address this. |
Rebased output directory name revert can be found here: https://github.com/torgil/bazel/commits/output-directory-name-option Edit: The patch from #13587 is still needed (included in master above) as the fix for #12731 somehow triggered changes in "affected by starlark transition" for different paths in some graphs. These changes ended up in the action-hash through ST-hashes appended to "internal/_middlemen" path-names which resulted in action conflicts. |
I've now created #16844 on this issue since
Is it possible to solve this within the 6.0 cycle? Edit: question/suggestion about initial configuration removed, need to rethink that a couple of iterations |
Made a pull request on the --experimental_shareable_cpp_compile_actions flag (see above): #16863 |
@eric-skydio @gregestren What about an option to omit the "output directory name" completely and let rules resolve path conflicts at package level (eg bazel-out/external/... or bazel-out/allconfig/external/... if too many scripts break) ? |
@torgil this is complicated and long-standing because the issue is fundamentally complicated. And it's not a single issue: it's a related collection of issues with varying levels of support and safety challenges. We don't want to make choices that compromise correctness. It's extremely easy to do that with some of these ideas. We're obligated to find a balance that doesn't make it easy for users to run builds with bad results (by "bad", I don't mean action conflicts, I mean builds that say they're successful but produce wrong output).
Maybe we could make shared C++ actions work. But we should really understand why they're not shareable today and how such a flag affects that. We haven't discussed in these threads, for example, how Skyframe (Bazel's core execution engine) might get confused and build the wrong targets because of that. That's the kind of problem we need some answer for to promote a path like that. I realize this is directly pertinent to you. All theory aside, you want builds that work and are fast. And that's compromised right now. I'm totally on board with trying to help with that. And I'm open to more hacky solutions if they're practically helpful for lack of better solutions now. But we really have to be cautious, clear, and methodical about exactly what problems we're solving with any solution and what side effects it has. Examples of related but distinct problems:
We're looking at all of these and making at least some progress. To be fair, I don't think we've communicated that well: it's mostly been ad hoc conversations with whoever happens to be part of those conversations. And there's been lots of behind-the-scenes work that hasn't yet manifested into user-accessible Bazel features. Here are some examples of progress:
It's also possible we can think of creative new ways to define transitions that might avoid some of the propagation issues you and others experience. As stated by @eric-skydio and others upthread, non-C++ actions should be perfectly shareable as long as they're the same actions (I think there's one other class of non-shareable actions, but most are shareable). And for problems like non-functional actions, a better Starlark API for writing actions and containing their inputs could address that. That's likely the only practical approach toward that problem. We had some great chats at this last BazelCon. As one followup I'd like to promote a clearer common view of the problems and how we imagine addressing them. I think we need a better interface than GitHub issues because they quickly get unfocused and out-of-date. And it's hard to reason precisely about all the challenges without really precise common understandings of the problems. |
Oh, I forgot: 7f51c8b could help with rules that really have no configuration needs. That's limited to solving these problems with more transitions, and would need a Starlark hook to be generally usable. |
@gregestren Great job with the In testing on our repo with the 6.0 release candidates, it initially seems to work really well but exposed one other tiny bug that needs to get fixed for us to see all the benefits (#16911). Without that fixed, I can't find a way to unify the configuration of a generated code file that's depended on by both an executable tool and an output target. |
@gregestren Thank you for your detailed response. I totally respect the complexity regarding these issues (and I have seen most of them), especially for the general case and I appreciate your patience towards a sound solution. I've wrapped my head into these issues numerous times and the problem domain seems to get more complex for each round (probably ~80% of the times I need to fire up jdb on bazel is due to these issues). Of your enumerated issues I prefer to get action conflicts over silent code duplication (to later detect you have built a random tool multiple times with flags intended for target code testing). It forces you to learn about your build and deal with it accordingly. I understand that this cannot be forced on the "casual user" as a default behavior. The "allconfig" suggestion above was meant to address the problem with different naming needs for different actions within a single target, not as a configuration transition (which you still need for other reasons, like third-party dependencies).
If the transitions are affecting a starlark rule build setting both A, B and D (but not E) will have a path to it in the dependency graph. Is it possible to use that information?
This is certainly a big step forward and it also removes all "affected by starlark transition" issues. Good job!
Would this run into the same issues as the "host" config, where transitions on dependencies silently get dropped? |
Thanks for your patience! Could something like ctexplain help? The idea behind that is a tool you could run over any build that reports precise information on where transitions happen and where they're wasteful. Such a tool could help us quickly target a build's exact waste profile which could help us more precisely evaluate and propose fixes. Part of what makes these issues hard, IMO, is the impact of these issues really varies by build. So it's hard to know which approaches are best without concrete data. I wrote a functional v1 of ctexplain a while back. But it only partially landed in @bazelbuild/bazel (more PRs need merging). I'd love to integrate it more deeply but haven't had the personal bandwidth to push this forward.
If I'm reading you correctly, that's extremely valuable information. Yes, we can work with that. The challenge is how big that subpath may be. If it's really short (one node to its direct dep), I can imagine some pretty effective solutions. If it's arbitrarily long, we run into the tension between performance (extra computation needed to analyze a big subgraph) vs. maintainability (users can manually tag these edges to avoid that computation, but that can be a big user burden across dependencies, edges, different projects, etc.). My auto-trimming prototype I mentioned above is an attempt to automatically solve that when the computation cost is acceptable.
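As a rough illustration of using "which targets have a path to the build setting" to trim configuration, here is a hedged Python sketch. It is not the auto-trimming prototype mentioned above: the graph, the extra leaf `F`, the setting name `mode`, and the helpers `needs_setting` and `trimmed_config` are hypothetical, and the graph is assumed to be acyclic.

```python
def needs_setting(deps, readers):
    """Return targets that read the setting or transitively depend on a reader."""
    memo = {}

    def visit(target):
        if target not in memo:
            memo[target] = target in readers or any(visit(d) for d in deps[target])
        return memo[target]

    for target in deps:
        visit(target)
    return {t for t, needed in memo.items() if needed}


def trimmed_config(target, config, needed):
    """Drop the setting from targets whose subgraph never reads it."""
    return config if target in needed else {}


# Hypothetical graph: only E reads the setting; F is an unrelated leaf under D.
deps = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E", "F"], "E": [], "F": []}
needed = needs_setting(deps, readers={"E"})
f_config = trimmed_config("F", {"mode": "b"}, needed)
```

The cost concern raised above shows up here as the full traversal in `needs_setting`: the answer for a node depends on its whole subgraph, so a long path between the transition and the reader means more analysis work.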
Yes. The intention of this config is that once you're in it, that's it. It's terminal. No more transitions. |
Yes. Interesting. I ran the analyze function below on the examples in the description using a Bazel from Dec 8 master branch.
ctexplain is a wrapper function for ctexplain.py to find its deps. It also needed a few source changes for my Python version, e.g. the usual "make random python script work" stuff. Output for case 1 (no transition):
The shared action above is a file write on "Dl.c". Output for case 2 (output directory name transition):
Do you mean adding an edge transition to set the "baseline" value here?
Are these PRs available somewhere? |
@gregestren I pushed some small updates I needed to run ctexplain here: #17109 |
I think the ctexplain followups are
|
Just chiming in quickly to point out that this limitation prevents us from using this to solve our issues with duplicating code generation work. In particular, we have code generation actions that don't depend on build configuration and we'd like to avoid running many copies of. Our current workaround is to manually starlark-transition every config value that gets set anywhere in our build tree to a sane default value for the codegen targets, creating the Unfortunately, since the codegen targets depend on tools that need to be built in |
@eric-skydio naively I'd say the codegen tools are configuration-dependent, since some of their inputs are configuration-dependent. So what you're saying is that the differences don't matter? i.e. if |
Sorry, just noticed this. None of the tools nor their outputs depend on the target configuration, but they do depend on their own exec configuration being set correctly. This isn't the exact use case, but as an example, imagine that we have a C++ executable In general, there's lots of ways that a tool can require transitions to work correctly between it and its dependencies, at minimum any |
Or to put it differently:
graph LR
multi[":multi_target"] -- transition --> x86
multi[":multi_target"] -- transition --> arm
x86[":target (x86)"] -- srcs --> gen[":generated_code (arch-agnostic)"]
arm[":target (arm)"] -- srcs --> gen
gen -- tool --> tool[":tool (x86)"]
tool -- dep --> lib[":lib (x86)"]
It's important that Our current solution of explicitly transitioning to The thing we can't do currently (and this might require a separate issue) is handle cases where some actions in a rule depend on (a piece of) the configuration but others don't. This is particularly common for C++, where you might want to build an object file once then link it against several different versions of its dependencies depending on the build configuration, and there's potentially a lot of speed improvement to be had from deduplicating that compile action. Solving this is a hard problem, unfortunately. |
@gregestren Should we consider this completed? |
Sure. This has become a Frankenstein bug: covers lots of interesting themes but with lost focus. Let's continue open-ended discussion on paths & efficiency at https://github.com/bazelbuild/bazel/discussions. And keep specific actionables in focused new issues. Apologies if this comment misses someone's ongoing actionable concern: please remind us. As an aside, I'd still like to flip |
This "Frankenstein bug" has come back to haunt me for another lap. Could you point me to relevant discussions? @gregestren What was the problem with allowing devs to set "output directory name" and let them control their own namespace while protected by action conflicts? I'm not too excited about the new path-mapping track as it adds complexity, is limited in scope and doesn't address the design issue. |
Unless I'm misunderstanding this will increase complexity and risk of action conflicts because it shifts the burden to devs to do it right. How do we ensure two transitions that set distinct configs don't set the same "output directory name" that gets consumed by some common library which then crashes? |
Casual users will not use the "output directory name" feature; once you use it, it is because you have analyzed your build and gotten into the details, possibly after transitions and system-level multi-configuration builds. I see/use action conflicts as a feature: they force you to investigate the "why" and go from there, as the conflict has detected either a configuration issue or a namespace issue. Your example with the "common library" is a perfect (and common) example. If it's not detected by build system tests, the "crash" here is an action conflict, and that will lead to a resolution according to the above, a new build system test, and maybe the need for a transition on the edge to the common library. The opposite side of the coin is to not detect these issues, like configuration leaking down to dependencies it shouldn't reach, or unnecessarily duplicating builds multiple times. This can even go undetected unless you hit a cache or disk space issue. It's non-trivial for a developer in a big project to even know about these consequences when adding a build setting, transition or command line option. |
I sympathize with this idea: that action conflicts can be a meaningful signal and not an obtuse failure. I also sympathize that there's a tradeoff: auto-avoiding conflicts has its own efficiency consequences that are hard to trace. Not just efficiency but even correctness if a build just fails because of an OOM. My biggest concern about this approach is connecting these failures with the people who are best informed and qualified to address them. This can get really hard when building projects with external dependencies that cross team and org boundaries. We've seen this sort of thing a lot at Google: even qualified language maintainers struggled to understand where conflicts happen or what to do about them when random project XYZ fails. Reducing the possibility of action conflicts seems to have helped with that repo. Of course not all repos are the same. I'm genuinely curious how your perspective differs. And we of course have to be careful setting precedents that would apply to other orgs & repos with their own distributions of expertise. |
Description of the problem / feature request:
Bazel errors out if the same action (with identical action hash) is created with different configuration hash. Configuration hash should be irrelevant when considering if two actions are identical.
The graph in the minimal example below shows the issue, where B and C set different values of a build setting that controls the output of E:
A -> B -> D -> E
A -> C -> D -> E
Case 1: Using bazel with current ST-folders on above graph yields duplicated actions for target D:
Case 2: To avoid duplicated actions in D, we can control the "output directory name" through transitions and thus remove the ST-hash. E avoids action conflicts by considering the build setting value in its output path. This produces the error in the topic.
Note that this works for Bazel 4.x, but this functionality was dropped in af0e20f. Later versions need a revert of that commit to work.
Case 3: Today we need, as a workaround, to have the same value of the conflicting build setting in all paths to D. In this example we reset the value to the default value at the input of D. This has the unwanted consequence that E can no longer be a dependency of D.
Both versions of E are gone and both B and C link to an unintended default version.
To work around Case 3, we have to rearrange the dependency graph in a non-optimal way, possibly with more targets than desired.
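The effect of Case 1 and Case 3 on the graph can be reproduced with a small Python model. This is a sketch only: the build setting name `mode`, its values `b`, `c`, and `default`, and the function `configured_targets` are hypothetical, and for simplicity the model applies a setting at the target that sets it rather than on an edge.

```python
# Dependency graph from the description: A -> B -> D -> E and A -> C -> D -> E.
DEPS = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"], "E": []}

# Hypothetical build setting "mode": B and C each set their own value.
SETS_MODE = {"B": "b", "C": "c"}


def configured_targets(root, reset_at=()):
    """Return the set of (target, mode) pairs reachable from root.

    reset_at lists targets whose incoming configuration is reset to
    "default", modeling the Case 3 workaround.
    """
    seen = set()
    stack = [(root, "default")]
    while stack:
        target, mode = stack.pop()
        if target in reset_at:
            mode = "default"
        mode = SETS_MODE.get(target, mode)
        if (target, mode) in seen:
            continue
        seen.add((target, mode))
        stack.extend((dep, mode) for dep in DEPS[target])
    return seen


case1 = configured_targets("A")                   # no reset: D and E are duplicated
case3 = configured_targets("A", reset_at=("D",))  # reset on the way into D
```

In this model `case1` contains D and E twice (once per value of `mode`), while `case3` contains a single D and a single E, both in the default configuration. The single default E is the "unintended default version" described above.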
Feature requests: what underlying problem are you trying to solve with this feature?
To allow functionality as explained in the example without duplicated actions or action conflicts.
Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.
Start with an empty directory and run the following script inside it. For usage, see description above.
config_hash_action_conflict_example_setup.sh.txt
What operating system are you running Bazel on?
Ubuntu 20.04
What's the output of bazel info release?
development version
If bazel info release returns "development version" or "(@non-git)", tell us how you built Bazel.
Used current HEAD from master branch on github
$ git checkout e8a066e
$ git revert af0e20f
$ # fix conflicts
$ git revert --continue
What's the output of git remote get-url origin ; git rev-parse master ; git rev-parse HEAD ?
$ git remote get-url bazelbuild; git rev-parse bazelbuild/master ; git rev-parse HEAD ; git rev-parse HEAD~1
https://github.com/bazelbuild/bazel.git
e8a066e
4d2f7a74ead44aecf80f1b0b271f2b9fa2816d01
e8a066e
Have you found anything relevant by searching the web?
Any other information, logs, or outputs that you want to share?
Similar to #14023, solution to this issue should make #13587 obsolete.