Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incremental incorrectness on transitions + build setting defaults #15732

Closed
gregestren opened this issue Jun 24, 2022 · 8 comments
Closed

Incremental incorrectness on transitions + build setting defaults #15732

gregestren opened this issue Jun 24, 2022 · 8 comments
Assignees
Labels
P1 I'll work on this now. (Assignee required) team-Configurability platforms, toolchains, cquery, select(), config transitions type: bug

Comments

@gregestren
Copy link
Contributor

Description of the bug:

When a configuration transition depends on a user-defined flag, the transition doesn't update when the flag's default value updates.

This produces incorrect output for incremental builds (build a target, change the flag default, build the target again).

Workarounds:

  • Force re-analysis by arbitrarily changing some other flag at the command line
  • bazel shutdown to clear the transition cache

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

https://gist.github.com/gregestren/3c6d67a3f23ea2a0b0baac34d9917dfb

Which operating system are you running Bazel on?

Linux

What is the output of bazel info release?

release 5.2.0

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

No response

What's the output of git remote get-url origin; git rev-parse master; git rev-parse HEAD ?

n/a

Have you found anything relevant by searching the web?

Forked from #15653 (comment)

Any other information, logs, or outputs that you want to share?

No response

@gregestren
Copy link
Contributor Author

I think this is the culprit:

// For rule self-transitions, the transition instance encapsulates both the transition logic
// and attributes of the target it's attached to. This is important: the same transition in
// the same configuration applied to distinct targets may produce different outputs. See
// StarlarkRuleTransitionProvider.FunctionPatchTransition for details.
// TODO(bazel-team): the transition code (i.e. StarlarkTransitionFunction) hashes on identity.
// Check that unnecessary copies of the transition function don't dilute this cache. Quick
// experimentation shows the # of such instances is very small. But it's unclear how strong
// of an interning contract there is.
this.transition = transition;
this.fromOptions = fromOptions;
this.hashCode = Objects.hash(transition, fromOptions);

Transition cache keys include a) the transition logic, b) the attributes of the target the transition is attached to, and c) the input flag values. I believe the problem is input flag defaults appear as null here regardless of their actual default. The transition logic extracts their actual defaults when needed. But if the transition logic doesn't run because of caching, this never happens.

@gregestren
Copy link
Contributor Author

Alternatively: the input flag doesn't appear in fromOptions at all because it was never set on the command line.

@gregestren gregestren added team-Configurability platforms, toolchains, cquery, select(), config transitions P1 I'll work on this now. (Assignee required) labels Jun 24, 2022
@gregestren gregestren self-assigned this Jun 24, 2022
@sdtwigg
Copy link
Contributor

sdtwigg commented Jun 24, 2022

I took a brief look b/c saw this email:

I can think of 2-3 fixes thus far:

  1. Add the defaults to the cache key (or, similarly, the buildSettingsPackages parameter). This seems easiest and safest. Need to verify hashability of the added parameter.

  2. Move addDefaultStarlarkOptions to before the caching. This means fromOptions will have the proper defaults instead of null. HOWEVER, also will need to move the call validate (or some subset that handles nulling out option defaults) after the cache hit. This is in case the defaults change, want to make sure the nulling out of the default options is done properly. See

    StarlarkTransition.addDefaultStarlarkOptions(fromOptions, transition, buildSettingPackages);

  3. Just make transition application a SkyFunction. There are some performance implications related to restarting though....

@gregestren
Copy link
Contributor Author

Also:

Map<PackageValue.Key, PackageValue> buildSettingPackages =
StarlarkTransition.getBuildSettingPackages(env, transition);

where the transition loads its input flags and their defaults.

Currently, this executes only after a cache miss. So, as you say, we somehow need to integrate it into the cache checking logic.

Note that if we just move it before caching - so it executes unconditionally - this might slow down speed for transition-heavy builds. Performance optimization is exactly why we added the transition cache in the first place. Even a few Skyrame calls is more overhead than nothing. I'm not sure if it makes a practical difference.

Re 3: my original approach indeed used a Skyframe function: I still have the prototype code - StarlarkTransitionFunction.java - hanging around in an unmerged client. I can't remember why that wasn't ultimately done but my guesses are a) performance and b) the soft value caching used by the current approach:

starlarkTransitionCache = Caffeine.newBuilder().softValues().build();

which is presumably more memory efficient?

@sdtwigg
Copy link
Contributor

sdtwigg commented Jun 24, 2022

I think I disagree on 471-472 executing only under a cache miss. It looks like this executes unconditionally.

Map<PackageValue.Key, PackageValue> buildSettingPackages =
StarlarkTransition.getBuildSettingPackages(env, transition);

applyTransitionWithSkyframe is called first, unconditionally, via various callers (resolveGenericTransition and collectTransitionKeys).
In the case of a Starlark transition, it then populates buildSettingsPackages and then finally calls applyStarlarkTransition.

In applyStarlarkTransition lines 496-503 handle creating the cache key and looking for a cache hit.

StarlarkTransitionCacheKey cacheKey = new StarlarkTransitionCacheKey(transition, fromOptions);
StarlarkTransitionCacheValue cachedResult = starlarkTransitionCache.getIfPresent(cacheKey);
if (cachedResult != null) {
if (cachedResult.nonErrorEvents != null) {
cachedResult.nonErrorEvents.replayOn(eventHandler);
}
return cachedResult.result;
}

Then 504+ handle a cache miss (including reading buildSettingsPackages to apply defaults).

So it seems all the work of actually building up buildSettingsPackages is always done, regardless of cache hit or miss. This makes Option 1 a tempting quick fix. Also note then there is a nice symmetry between the cache key and the inputs to applyStarlarkTransition (which gives a better visual assurance the cache key is correct).

PS: I do agree that Option 3 is worth exploring; I just fear it may take time to think through and experiment with converting transition application to a SkyFunction. In particular, I am concerned with the performance impact of restarts.

PPS: I put Option 2 in the list but it seems so fiddly and potentially error prone that I am retracting it as a suggested option.

@gregestren
Copy link
Contributor Author

Oh you're right! All the better!

@fmeum
Copy link
Collaborator

fmeum commented Jun 25, 2022

Please ignore if you already know this: Skyframe recently got the capability to maintain state through restarts. So unless the restarts would happen during the execution of a Starlark function, they should be much less costly than in the past.

copybara-service bot pushed a commit that referenced this issue Jul 23, 2022
Two main goals of this CL:
* Properly include 'metadata about Starlark build settings' in the transition application cache.
* Better isolate where and why the transition infrastructure reads BUILD file information. This unifies previously duplicated logic in SkyframeExecutor.

First, the transition application cache, on a miss, would also read defaults, type, aliasing information, etc. from BUILD files but failed to include any of this info in the cache key. This lead to failure to re-evaluate a transition when that build setting information in BUILD files changes.

This CL fixes that by introducing a new type StarlarkBuildSettingsDetailsValue, which contains all that information for a given set of Starlark build settings. That value is now included in the cache key.

Second, rather than having ConfiguredTargetFunction (via machinery in StarlarkTransition) itself do the Skyframe calls and subsequent computations necessary to compute StarlarkBuildSettingsDetailsValue, a new SkyFunction StarlarkBuildSettingsDetailsFunction was created to do so.

This provides a few benefits:
1. ConfiguredTargetFunction now is only registering an edge onto a single StarlarkBuildSettingsDetailsFunction, rather than all the PackageValue needed to resolve the StarlarkBuildSettings.

StarlarkBuildSettingsDetailsValue more narrowly describes how sensitive the transition is to change. This means irrelevant changes to a package involved in build setting resolution will not cause as much invalidation.

This information is now cached so StarlarkTransition (called via ConfiguredTargetFunction) can more quickly compute the cache key. (Beforehand, it was always unconditionally following any build setting alias chains).

2. SkyframeExecutor can just evaluate StarlarkBuildSettingsDetailsValue rather than having a slightly different implementation getBuildSettingsPackages. (The new getStarlarkBuildSettingsDetailsValue functions that must be synced are much simpler and easier to keep in sync.)

As part of this unification, the error messages are now consistent regardless of whether the problematic build settings are asked for when transitioning the top-level target or an underlying target.

3. The split between reading Packages to compute StarlarkBuildSettingsDetailsValue and then later consuming it in StarlarkTransition should be easier to read and maintain.

PS: As part of the refactor, significant issues were found in PrepareAnalysisPhaseFunction. Warnings have been added as comments and upcoming work will be deleting that file entirely.

This fixes caching issue described at #15732

PiperOrigin-RevId: 462741847
Change-Id: I517a9a841735e6cff26faea5373f50a164334f86
@sdtwigg
Copy link
Contributor

sdtwigg commented Oct 11, 2022 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
P1 I'll work on this now. (Assignee required) team-Configurability platforms, toolchains, cquery, select(), config transitions type: bug
Projects
None yet
Development

No branches or pull requests

3 participants