Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

remote_coverage_tools/Main cannot locate runfiles directory when building with --nobuild_runfile_links #20577

Closed
UebelAndre opened this issue Dec 17, 2023 · 28 comments
Labels
team-Core Skyframe, bazel query, BEP, options parsing, bazelrc type: bug untriaged

Comments

@UebelAndre
Copy link
Contributor

UebelAndre commented Dec 17, 2023

Description of the bug:

Whenever --nobuild_runfiles_links is enabled I encounter a failure where the remote_coverage_tools/Main tool errors and fails to find runfiles. The logging is confusing as searching appears as though it should be inhibited.

https://buildkite.com/bazel/rules-rust-rustlang/builds/10120#018c7929-4564-4947-9fa6-ea73aa7e0e26

+ JAVA_RUNFILES=
+ exec bazel-out/k8-opt-exec-ST-13d3ddad9198/bin/external/remote_coverage_tools/Main --coverage_dir=/var/lib/buildkite-agent/.cache/bazel/_bazel_buildkite-agent/ec321eb2cc2d0f8f91b676b6d4c66c29/sandbox/linux-sandbox/3234/execroot/rules_rust/bazel-out/k8-fastbuild/testlogs/test/rust_test_suite/tests_suite_tests/integrated_test_b/_coverage --output_file=/var/lib/buildkite-agent/.cache/bazel/_bazel_buildkite-agent/ec321eb2cc2d0f8f91b676b6d4c66c29/sandbox/linux-sandbox/3234/execroot/rules_rust/bazel-out/k8-fastbuild/testlogs/test/rust_test_suite/tests_suite_tests/integrated_test_b/coverage.dat --filter_sources=/usr/bin/.+ --filter_sources=/usr/lib/.+ --filter_sources=/usr/include.+ --filter_sources=/Applications/.+ --source_file_manifest=/var/lib/buildkite-agent/.cache/bazel/_bazel_buildkite-agent/ec321eb2cc2d0f8f91b676b6d4c66c29/sandbox/linux-sandbox/3234/execroot/rules_rust/bazel-out/k8-fastbuild/bin/test/rust_test_suite/tests_suite_tests/integrated_test_b.instrumented_files
/var/lib/buildkite-agent/.cache/bazel/_bazel_buildkite-agent/ec321eb2cc2d0f8f91b676b6d4c66c29/sandbox/linux-sandbox/3234/execroot/rules_rust/bazel-out/k8-opt-exec-ST-13d3ddad9198/bin/external/remote_coverage_tools/Main: Cannot locate runfiles directory. (Set $JAVA_RUNFILES to inhibit searching.)

This issue is reproduced on bazelbuild/rules_rust#2340

Which category does this issue belong to?

No response

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

This issue is reproduced on bazelbuild/rules_rust#2340

Which operating system are you running Bazel on?

Linux, MacOS

What is the output of bazel info release?

7.0.0

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

No response

What's the output of git remote get-url origin; git rev-parse master; git rev-parse HEAD ?

No response

Is this a regression? If yes, please try to identify the Bazel commit where the bug was introduced.

No response

Have you found anything relevant by searching the web?

No response

Any other information, logs, or outputs that you want to share?

No response

@UebelAndre
Copy link
Contributor Author

cc @scentini @krasimirgg

@UebelAndre
Copy link
Contributor Author

Might relate to #4685

@UebelAndre
Copy link
Contributor Author

@sgowroji this bug is reproduced without remote execution. Is that tag correct?

@tjgq
Copy link
Contributor

tjgq commented Dec 18, 2023

Does the bug go away if you use --noenable_runfiles instead of --nobuild_runfile_links?

This is admittedly very confusing, but --noenable_runfiles means "no runfile symlinks are created ever", while --nobuild_runfilke_links means "no runfile symlinks are created when the binary is built, but they'll be created just in time before the binary is executed (as a build action, bazel test or bazel run)". However, here you're invoking the binary directly outside of Bazel, so Bazel can't create them just in time.

Alternatively, you might be able to hack around it by editing the Java binary wrapper script to say RUNFILES_MANIFEST_ONLY=1 (see here - the placeholder is replaced at build time depending on the state of --enable_runfiles).

@tjgq tjgq added team-Core Skyframe, bazel query, BEP, options parsing, bazelrc and removed team-Remote-Exec Issues and PRs for the Execution (Remote) team labels Dec 18, 2023
@tjgq
Copy link
Contributor

tjgq commented Dec 18, 2023

(Note: assigning to the core team because they're the best owner for runfiles-related issues. I'm not actually convinced there's a bug, this might just be an unfortunate consequence of how things work.)

@tjgq
Copy link
Contributor

tjgq commented Dec 18, 2023

However, here you're invoking the binary directly outside of Bazel

On a second look, I'm not sure this statement is true. But I'm still interested in the result of running with --noenable_runfiles, as well as the value of RUNFILES_MANIFEST_ONLY in the wrapper script.

@UebelAndre
Copy link
Contributor Author

Does the bug go away if you use --noenable_runfiles instead of --nobuild_runfile_links?

This is fairly different behavior but I can add it to my PR and show you the logs

--nobuild_runfilke_links means "no runfile symlinks are created when the binary is built, but they'll be created just in time before the binary is executed (as a build action, bazel test or bazel run)". However, here you're invoking the binary directly outside of Bazel, so Bazel can't create them just in time.

I don't believe I'm doing such a thing. As far as I can tell everything that's running with the exception of the statically linked coverage binary from _collect_cc_coverage.

@UebelAndre
Copy link
Contributor Author

UebelAndre commented Dec 18, 2023

Does the bug go away if you use --noenable_runfiles instead of --nobuild_runfile_links?

This is fairly different behavior but I can add it to my PR and show you the logs

It seems like --noenable_runfiles makes this issue go away but many other tests (expectedly) fail https://buildkite.com/bazel/rules-rust-rustlang/builds/10135#018c7dd1-2133-4a46-adab-babcdde9cf43. However, --noenable_runfiles is quite a different flag and not the desired behavior here. I'm not sure what you mean by:

here you're invoking the binary directly outside of Bazel

but it seems like the way coverage is setup should be to take this java tool as a binary so runfiles are kept.

@UebelAndre
Copy link
Contributor Author

I'm not actually convinced there's a bug, this might just be an unfortunate consequence of how things work.

Sounds like "It's not a bug, it's a feature" 😄. Arguable though lol

@tjgq
Copy link
Contributor

tjgq commented Dec 18, 2023

It seems like --noenable_runfiles makes this issue go away but many other tests (expectedly) fail

Ok, that validates the theory that the problem is that the Main binary is under the assumption that the runfiles tree will exist at the time it runs, because it was built with --nobuild_runfile_links. If you build it with --noenable_runfiles, it makes no such assumption. (I understand that setting --noenable_runfiles is not a viable workaround.)

I'm not sure what you mean by here you're invoking the binary directly outside of Bazel but it seems like the way coverage is setup should be to take this java tool as a binary so runfiles are kept.

Yeah, I'm now convinced that there's an actual bug here. Judging from the logs, it looks like the following is happening:

  • bazel coverage invokes collect_coverage.sh
  • collect_coverage.sh assembles an LCOV_MERGER_CMD, then invokes it
  • LCOV_MERGER_CMD runs the (previously built?) .../remote_coverage_tools/Main binary
  • The binary is unable to find its runfiles symlink tree, because it was neither created at build time nor just-in-time

So I expect there's some missing logic in the coverage command to ensure that the runfiles symlink tree for .../remote_coverage_tools/Main has been created before the command runs, if --nobuild_runfile_links is set. (For comparison, the local spawn runner does this by calling RunfilesTreeUpdater#updateRunfiles before executing the spawn; here it's a bit more involved because the runfiles aren't for the binary that is directly invoked by Bazel, but for a binary that it transitively invokes.)

@tjgq tjgq changed the title remote_coverage_tools/Main: Cannot locate runfiles directory remote_coverage_tools/Main cannot locate runfiles directory when building with --nobuild_runfile_links Dec 18, 2023
@UebelAndre
Copy link
Contributor Author

This would be a great one to get fixed since --nobuild_runfiles_links is a common flag in all large python code bases I'm familiar with but now to introduce rust either breaks coverage or explodes build times. Neither of which are acceptable trades.

@UebelAndre
Copy link
Contributor Author

I also see this when I attempt to generate coverage reports using llvm-cov from a clang cc toolchain :(

@fmeum
Copy link
Collaborator

fmeum commented Jan 9, 2024

A fairly simple solution for this would be to replace the LCOV merger with a Graal native image, which as far as I know is the setup Google is using internally. This is already done for Turbine now, it would "just" require splitting remote_coverage_tools into one repo per architecture.

cc @hvadehra

@UebelAndre
Copy link
Contributor Author

Is that something that can be done for Bazel 7.1.0? 🙏

@UebelAndre
Copy link
Contributor Author

Any updates here? This is still impacting rules_rust.

@UebelAndre
Copy link
Contributor Author

Ping here @tjgq @scentini @hvadehra

@hvadehra
Copy link
Member

hvadehra commented May 7, 2024

@c-mita WDYT about the idea in #20577 (comment)?

@UebelAndre
Copy link
Contributor Author

Friendly ping @c-mita

@Ryang20718
Copy link

also interested. friendly ping @c-mita :)

@Ryang20718
Copy link

We currently are trying to upgrade to bazel 7 with a rust + python codebase. Allowing the building of runfile links significantly slows down build time. is there a workaround in the meantime to allow --nobuild_runfile_links but still generate coverage files?

@fmeum
Copy link
Collaborator

fmeum commented Jun 10, 2024

I just discovered that this has been fixed by 404d8d9. I sent #22676 to verify this in tests.

@UebelAndre
Copy link
Contributor Author

Is this fix gonna be in Bazel 7.3.0?

@fmeum
Copy link
Collaborator

fmeum commented Jun 12, 2024

This commit is part of a series of cleanups that look challenging to cherry-pick. @lberki Can you assess how much work that would be?

@Ryang20718
Copy link

really hoping that this can go in to 7.3.0! we're still stuck on bazel 6.5 because of this. friendly ping @lberki

fmeum added a commit to fmeum/bazel that referenced this issue Jul 5, 2024
Fixes bazelbuild#20577

Closes bazelbuild#22676.

PiperOrigin-RevId: 649065192
Change-Id: I0a57f580c1cb3a03184950e3e5ea24e9ef4962b4
@Ryang20718
Copy link

Friendly ping @lberki @sgowroji wondering whether this c800608 fix will be cherrypicked into the next release?

@tomrenn
Copy link

tomrenn commented Aug 12, 2024

I've also come across this issue and wondering if it's possible to backport to 7.X?

@iancha1992
Copy link
Member

@tomrenn we can include this in 7.4.0

@iancha1992
Copy link
Member

@bazel-io fork 7.4.0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
team-Core Skyframe, bazel query, BEP, options parsing, bazelrc type: bug untriaged
Projects
None yet
Development

Successfully merging a pull request may close this issue.

9 participants