Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bazel query silently thrashes the analysis cache #10902

Closed
kastiglione opened this issue Mar 4, 2020 · 23 comments
Closed

Bazel query silently thrashes the analysis cache #10902

kastiglione opened this issue Mar 4, 2020 · 23 comments
Assignees
Labels
not stale Issues or PRs that are inactive but not considered stale P2 We'll consider working on this in future. (Assignee optional) team-Configurability platforms, toolchains, cquery, select(), config transitions type: bug

Comments

@kastiglione
Copy link
Contributor

kastiglione commented Mar 4, 2020

Description of the problem / feature request:

bazel query invalidates the analysis cache for subsequent builds, and neither the output of bazel build nor bazel query indicate the problem.

The best solution would be to call bazel query in a way that uses and preserves the current analysis cache. A secondary solution is to have one or both commands log about the problem.

bazel build does output a message to say that the analysis cache is being discarded – when build flags are changed. However bazel query is silently causing discarding the analysis cache, so we didn't know this was happening.

Chrome traces were showing over 10s of overhead at the start of each build, and we didn't know why. Removing our use of bazel query between builds solved this problem.

Feature requests: what underlying problem are you trying to solve with this feature?

Allow bazel query to reuse and preserve the analysis cache from the recent bazel build, if applicable.

Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

  1. Do a build
  2. Change a file
  3. Do the same build again

In the second build, the output should say:

INFO: Analyzed target //package:target (0 packages loaded, 0 targets configured).

Now, to see the bug:

  1. Do a build
  2. Do a bazel query
  3. Change a file
  4. Do the same build again

In the second build, the output will show that bazel has had to load >0 packages and configure >0 targets.

What operating system are you running Bazel on?

macOS

What's the output of bazel info release?

release 2.1.0

@irengrig irengrig added P1 I'll work on this now. (Assignee required) team-Core Skyframe, bazel query, BEP, options parsing, bazelrc untriaged labels Mar 6, 2020
@haxorz
Copy link
Contributor

haxorz commented Mar 6, 2020

@kastiglione Please provide more information. You're certainly on the right track that bazel flags are the culprit, but haven't told us anything about your flag usage. I think bazel's INFO log suffices to fully debug this. If you can't upload the full file for sensitivity reasons, please read over it and copy/paste the relevant parts. Someone on the Bazel team can help you decipher the log if you need help.

@irengrig Please find a different assignee. I don't have time to look at this.

@haxorz haxorz assigned irengrig and unassigned haxorz Mar 6, 2020
@irengrig
Copy link
Contributor

irengrig commented Mar 6, 2020

Removing P1, we should indeed get more information.

/cc @lberki can you help investigate the problem?

@irengrig irengrig added more data needed and removed P1 I'll work on this now. (Assignee required) labels Mar 6, 2020
@gregestren
Copy link
Contributor

@kastiglione - is it possible that you're setting flags in a .bazelrc that apply to build but not query?

@kastiglione
Copy link
Contributor Author

kastiglione commented Mar 17, 2020

@gregestren yes, that's a possible cause. After posting this, we learned about #10961. In particular, we had one flag that had to be moved to the common settings, and this was the cause of thrashing with some bazel run usage we had. See #10961 (comment)

I'll retry query and see if that issue was the cause/solution to this issue.

@kastiglione
Copy link
Contributor Author

I just did a build, then query, then build again, and I no longer see re-analysis. This was solved by moving --incompatible_allow_tags_propagation from build to common in our .bazelrc.

It seems to me that this case should result in a user visible warning.

@janakdr janakdr added team-Configurability platforms, toolchains, cquery, select(), config transitions and removed more data needed team-Core Skyframe, bazel query, BEP, options parsing, bazelrc untriaged labels Apr 20, 2020
@janakdr
Copy link
Contributor

janakdr commented Apr 20, 2020

This issue is not specific to query. Bazel is supposed to print "... changed, discarding analysis cache". However, it doesn't appear to do that for options in StarlarkSemanticOptions. Can easily be tested by doing for opt in "" "no"; do bazel build --${opt}debug_depset_depth --nobuild //target ; done.

@gregestren
Copy link
Contributor

StarlarkSemanticOptions isn't in a build's "configuration", hence it not getting caught by the message-printing code.

Recognizing and changing that class' impact is much more awkward than "normal" flags. Next step for this should be to summarize exactly which code paths recognize these changes and how they can link to the message generating code.

@gregestren gregestren added the P3 We're not considering working on this, but happy to review a PR. (No assignee) label Apr 20, 2020
@moroten
Copy link
Contributor

moroten commented Apr 25, 2020

Annoyingly, RemoteOptions are not supported by query and cannot be moved from build to common. So when using remote execution, a bazel query discard the analysis cache. See also

public Iterable<Class<? extends OptionsBase>> getCommandOptions(Command command) {
return ImmutableList.of("build", "test", "fetch").contains(command.name())
? ImmutableList.of(RemoteOptions.class, AuthAndTLSOptions.class)
: ImmutableList.of();
}

@ulfjack
Copy link
Contributor

ulfjack commented Apr 25, 2020

I don't think that's accurate. Just because there are options specified on the build command doesn't mean that they cause a discard of the analysis phase. Let us know if you can repro an analysis phase discard due to remote options changes.

@moroten
Copy link
Contributor

moroten commented Apr 26, 2020

@ulfjack Below is a minimalistic example. Notice that I had INFO: 0 processes before the query, but got INFO: 1 process: 1 remote cache hit after the query. The observed behaviour is the same as if the analysis cache has been discarded, but it might be something else.

I might be wrong, but my suspicion is

void setRemoteOutputsMode(RemoteOutputsMode remoteOutputsMode) {
PrecomputedValue.REMOTE_OUTPUTS_MODE.set(injectable(), remoteOutputsMode);
}
private void setRemoteExecutionEnabled(boolean enabled) {
PrecomputedValue.REMOTE_EXECUTION_ENABLED.set(injectable(), enabled);
}
called from
/**
* Initializes and syncs the graph with the given options, readying it for the next evaluation.
*/
public void sync(

$ cat <<EOF >BUILD.bazel
cc_binary(
    name = "foo",
    srcs = ["foo.c"],
)
EOF

$ cat <<EOF >foo.c
int main() {
    return 0;
}
EOF

$ bazel build -s --remote_instance_name=developer --remote_download_toplevel --remote_executor=grpc://remote-execution:12345 --genrule_strategy=remote,sandboxed,local --spawn_strategy=remote,sandboxed,local //:foo
SUBCOMMAND: # //:foo [action 'Compiling foo.c', configuration: a4e1f2caf6a8f50757c8ef3548a6910ec759e8fbd1f331986e9a8d0f83ca072c]
SUBCOMMAND: # //:foo [action 'Linking foo', configuration: a4e1f2caf6a8f50757c8ef3548a6910ec759e8fbd1f331986e9a8d0f83ca072c]
INFO: 2 processes: 1 remote cache hit, 1 remote.

$ bazel build -s --remote_instance_name=developer --remote_download_toplevel --remote_executor=grpc://remote-execution:12345 --genrule_strategy=remote,sandboxed,local --spawn_strategy=remote,sandboxed,local //:foo
INFO: 0 processes.

$ bazel query //:foo
Loading: 0 packages loaded
//:foo

$ bazel build -s --remote_instance_name=developer --remote_download_toplevel --remote_executor=grpc://remote-execution:12345 --genrule_strategy=remote,sandboxed,local --spawn_strategy=remote,sandboxed,local //:foo
SUBCOMMAND: # //:foo [action 'Compiling foo.c', configuration: a4e1f2caf6a8f50757c8ef3548a6910ec759e8fbd1f331986e9a8d0f83ca072c]
INFO: 1 process: 1 remote cache hit.

$ bazel build -s --remote_instance_name=developer --remote_download_toplevel --remote_executor=grpc://remote-execution:12345 --genrule_strategy=remote,sandboxed,local --spawn_strategy=remote,sandboxed,local //:foo
INFO: 0 processes.

@ulfjack
Copy link
Contributor

ulfjack commented Apr 27, 2020

@buchgr, your commit d480c5f broke Skyframe caching of actions across query invocations, even for builds that don't use the functionality (as long as they use remote execution).

I think it's the injection of the remote default exec properties. The --remote_download_outputs defaults to ALL, and ALL is injected if the flag isn't set or remote execution is disabled.

@ulfjack
Copy link
Contributor

ulfjack commented Apr 29, 2020

Just to clarify: I was able to reproduce the issue brought up by @moroten. I also audited the code for remote options, and it's only two options that are intentionally injected into the Skyframe dependency graph, and one of them (remote_download_outputs) appears to be benign.

My understanding is that it intends to cause action re-execution on changes to the command-line-specified remote exec properties. Unfortunately, because they're only injected on build commands and not on query, that causes a skyframe invalidation of all actions (query implicitly clears the injected nodes if nothing is specified otherwise). There are a few options for what to do, but none are very appealing.

ulfjack added a commit to ulfjack/bazel that referenced this issue Jun 2, 2020
Top-level output files can be symlinks to non-top-level output files, such as
in the case of a cc_binary linking against a .so built in the same build.

I reuse the existing mayInsensitivelyPropagateInputs() method, which was
originally introduced for action rewinding, but seems to have exactly the
intended semantics here as well.

Unfortunately, this requires a small change to the BlazeModule API to pass in
the full analysis result (so we can access the action graph, so we can read
the aformentioned flag).

I considered using execution-phase information on whether an output is a
symlink. However, at that point it's too late! The current design only allows
pulling an output file *when its generating action runs*. It would be better if
this was changed to pull output files independently of action execution. That
would avoid a lot of problems with the current design, such as bazelbuild#10902.

Fixes bazelbuild#11532.

Change-Id: Iaf1e48895311fcf52d9e1802d53598288788a921
@ulfjack
Copy link
Contributor

ulfjack commented Jun 4, 2020

I think build-without-the-bytes should be rewritten to separate download of outputs from action execution. In the current implementation, outputs can only be downloaded when an action executes, which requires the injection of these flags into skyframe in order to trigger action re-execution (and which causes this exact bug).

@ulfjack
Copy link
Contributor

ulfjack commented Jun 4, 2020

This bug should be assigned to team-remote-exec or whoever is responsible for build-without-the-bytes.

ulfjack added a commit to ulfjack/bazel that referenced this issue Aug 6, 2020
Top-level output files can be symlinks to non-top-level output files, such as
in the case of a cc_binary linking against a .so built in the same build.

I reuse the existing mayInsensitivelyPropagateInputs() method, which was
originally introduced for action rewinding, but seems to have exactly the
intended semantics here as well.

Unfortunately, this requires a small change to the BlazeModule API to pass in
the full analysis result (so we can access the action graph, so we can read
the aformentioned flag).

I considered using execution-phase information on whether an output is a
symlink. However, at that point it's too late! The current design only allows
pulling an output file *when its generating action runs*. It would be better if
this was changed to pull output files independently of action execution. That
would avoid a lot of problems with the current design, such as bazelbuild#10902.

Fixes bazelbuild#11532.

Change-Id: Iaf1e48895311fcf52d9e1802d53598288788a921
ulfjack added a commit to ulfjack/bazel that referenced this issue Aug 19, 2020
Top-level output files can be symlinks to non-top-level output files, such as
in the case of a cc_binary linking against a .so built in the same build.

I reuse the existing mayInsensitivelyPropagateInputs() method, which was
originally introduced for action rewinding, but seems to have exactly the
intended semantics here as well.

Unfortunately, this requires a small change to the BlazeModule API to pass in
the full analysis result (so we can access the action graph, so we can read
the aformentioned flag).

I considered using execution-phase information on whether an output is a
symlink. However, at that point it's too late! The current design only allows
pulling an output file *when its generating action runs*. It would be better if
this was changed to pull output files independently of action execution. That
would avoid a lot of problems with the current design, such as bazelbuild#10902.

Fixes bazelbuild#11532.

Change-Id: Iaf1e48895311fcf52d9e1802d53598288788a921
ulfjack added a commit to ulfjack/bazel that referenced this issue Sep 16, 2020
Top-level output files can be symlinks to non-top-level output files, such as
in the case of a cc_binary linking against a .so built in the same build.

I reuse the existing mayInsensitivelyPropagateInputs() method, which was
originally introduced for action rewinding, but seems to have exactly the
intended semantics here as well.

Unfortunately, this requires a small change to the BlazeModule API to pass in
the full analysis result (so we can access the action graph, so we can read
the aformentioned flag).

I considered using execution-phase information on whether an output is a
symlink. However, at that point it's too late! The current design only allows
pulling an output file *when its generating action runs*. It would be better if
this was changed to pull output files independently of action execution. That
would avoid a lot of problems with the current design, such as bazelbuild#10902.

Fixes bazelbuild#11532.

Change-Id: Iaf1e48895311fcf52d9e1802d53598288788a921
bazel-io pushed a commit that referenced this issue Sep 21, 2020
Top-level output files can be symlinks to non-top-level output files, such as
in the case of a cc_binary linking against a .so built in the same build.

I reuse the existing mayInsensitivelyPropagateInputs() method, which was
originally introduced for action rewinding, but seems to have exactly the
intended semantics here as well.

Unfortunately, this requires a small change to the BlazeModule API to pass in
the full analysis result (so we can access the action graph, so we can read
the aformentioned flag).

I considered using execution-phase information on whether an output is a
symlink. However, at that point it's too late! The current design only allows
pulling an output file *when its generating action runs*. It would be better if
this was changed to pull output files independently of action execution. That
would avoid a lot of problems with the current design, such as #10902.

Fixes #11532.

Change-Id: Iaf1e48895311fcf52d9e1802d53598288788a921

Closes #11536.

Change-Id: Ie1bf49a8d08f0b2422426ecd95fe79b3686f8427
PiperOrigin-RevId: 332939828
Yannic pushed a commit to Yannic/bazel that referenced this issue Oct 5, 2020
Top-level output files can be symlinks to non-top-level output files, such as
in the case of a cc_binary linking against a .so built in the same build.

I reuse the existing mayInsensitivelyPropagateInputs() method, which was
originally introduced for action rewinding, but seems to have exactly the
intended semantics here as well.

Unfortunately, this requires a small change to the BlazeModule API to pass in
the full analysis result (so we can access the action graph, so we can read
the aformentioned flag).

I considered using execution-phase information on whether an output is a
symlink. However, at that point it's too late! The current design only allows
pulling an output file *when its generating action runs*. It would be better if
this was changed to pull output files independently of action execution. That
would avoid a lot of problems with the current design, such as bazelbuild#10902.

Fixes bazelbuild#11532.

Change-Id: Iaf1e48895311fcf52d9e1802d53598288788a921

Closes bazelbuild#11536.

Change-Id: Ie1bf49a8d08f0b2422426ecd95fe79b3686f8427
PiperOrigin-RevId: 332939828
@meisterT
Copy link
Member

cc @coeuvre

@janakdr
Copy link
Contributor

janakdr commented Oct 20, 2020

It's possible I incidentally fixed this with f6f8dfe. Can someone check? There is still one injection of remote options (whether remote execution is enabled), so it may not have been enough.

@olschofkah
Copy link

I also ran into this issue. Taking the approach that @kastiglione suggested worked for me with moving applicable cmd line options in the .bazelrc files from build to common. If it helps solve the issue, this behavior started happening for me specifically with version 2.2.0. It would be a big improvement if query just logged that it had to discard the analysis due to options specified.

@coeuvre
Copy link
Member

coeuvre commented Dec 8, 2020

I think build-without-the-bytes should be rewritten to separate download of outputs from action execution. In the current implementation, outputs can only be downloaded when an action executes, which requires the injection of these flags into skyframe in order to trigger action re-execution (and which causes this exact bug).

@ulfjack I am looking into separating download from execution. Any suggestions for the implementation details?

@bjacklyn
Copy link

I was also just bitten by this and had to change some of the flags in the .bazelrc to be under common rather than under build so that they are shared with query.

Has the section of code that causes this problem been identified -- could we at least quickly add a warning to check your .bazelrc flags there if a fix isn't coming soon?

It turns out I'm also affected by the missing --repo_env param for query which causes this same problem with the thrashed analysis cache - #11943 - although there seems to be a (highly discouraged) workaround for that one.

@gregestren
Copy link
Contributor

gregestren commented Feb 11, 2021

More communication from the tool is perfectly reasonable.

I don't know how hard the implementation would be. The main reason is different flags invalidate the graph in different places. Some of the --incompatible flags mentioned are injected into the build graph at

PrecomputedValue.STARLARK_SEMANTICS.set(injectable(), starlarkSemantics);

So checking for changes there and reporting changes could work... for those flags.

Other flags are injected in other places. I suspect remote-related flags have their own hooks. So there's a bit of whack-a-mole to capture everything.

There's an important subtlety here that it's not necessarily just analysis being invalidated, but the entire graph (including package semantics, which are an even deeper dependency).

So I don't know... is the whack-a-mole approach reasonable for this?

@github-actions
Copy link

Thank you for contributing to the Bazel repository! This issue has been marked as stale since it has not had any activity in the last 2+ years. It will be closed in the next 14 days unless any other activity occurs or one of the following labels is added: "not stale", "awaiting-bazeler". Please reach out to the triage team (@bazelbuild/triage) if you think this issue is still relevant or you are interested in getting the issue resolved.

@github-actions github-actions bot added the stale Issues or PRs that are stale (no activity for 30 days) label Apr 26, 2023
@coeuvre
Copy link
Member

coeuvre commented Apr 26, 2023

This is not stale. Actually, with my recent internal work, this might be resolved. I will get back to this soon once those changes are submitted.

@coeuvre coeuvre added P2 We'll consider working on this in future. (Assignee optional) and removed P3 We're not considering working on this, but happy to review a PR. (No assignee) labels Apr 26, 2023
@coeuvre coeuvre self-assigned this Apr 26, 2023
@sgowroji sgowroji added not stale Issues or PRs that are inactive but not considered stale and removed stale Issues or PRs that are stale (no activity for 30 days) labels Apr 26, 2023
@coeuvre
Copy link
Member

coeuvre commented Jul 5, 2023

Fixed by d426b3d.

@coeuvre coeuvre closed this as completed Jul 5, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
not stale Issues or PRs that are inactive but not considered stale P2 We'll consider working on this in future. (Assignee optional) team-Configurability platforms, toolchains, cquery, select(), config transitions type: bug
Projects
None yet
Development

No branches or pull requests