Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

baseline_coverage.dat missing with BES + BwtB #17001

Closed
keith opened this issue Dec 12, 2022 · 14 comments
Closed

baseline_coverage.dat missing with BES + BwtB #17001

keith opened this issue Dec 12, 2022 · 14 comments
Assignees
Labels
P2 We'll consider working on this in future. (Assignee optional) team-Remote-Exec Issues and PRs for the Execution (Remote) team type: bug

Comments

@keith
Copy link
Member

keith commented Dec 12, 2022

Description of the bug:

We're seeing a log like:

(13:23:40) WARNING: Uploading BEP referenced local files: /private/var/tmp/_bazel_iosci/6462d53b34f6666019a2973c03de6978/execroot/lyftios/bazel-out/darwin_arm64-fastbuild/testlogs/Modules/.../baseline_coverage.dat (No such file or directory)
java.io.FileNotFoundException: /private/var/tmp/_bazel_iosci/6462d53b34f6666019a2973c03de6978/execroot/lyftios/bazel-out/darwin_arm64-fastbuild/testlogs/Modules/.../baseline_coverage.dat (No such file or directory)
	at com.google.devtools.build.lib.unix.NativePosixFiles.stat(Native Method)
	at com.google.devtools.build.lib.unix.UnixFileSystem.statInternal(UnixFileSystem.java:209)
	at com.google.devtools.build.lib.unix.UnixFileSystem.stat(UnixFileSystem.java:199)
	at com.google.devtools.build.lib.vfs.PathTransformingDelegateFileSystem.stat(PathTransformingDelegateFileSystem.java:252)
	at com.google.devtools.build.lib.remote.RemoteActionFileSystem.stat(RemoteActionFileSystem.java:460)
	at com.google.devtools.build.lib.remote.RemoteActionFileSystem.getFileSize(RemoteActionFileSystem.java:392)
	at com.google.devtools.build.lib.vfs.Path.getFileSize(Path.java:559)
	at com.google.devtools.build.lib.remote.util.DigestUtil.compute(DigestUtil.java:61)
	at com.google.devtools.build.lib.remote.ByteStreamBuildEventArtifactUploader.readPathMetadata(ByteStreamBuildEventArtifactUploader.java:186)

For every single test target when we run them on CI with BES uploads and --remote_download_toplevel --experimental_remote_build_event_upload=minimal

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

I'm not sure, it's hard to debug with cache hits vs not

Which operating system are you running Bazel on?

macOS

What is the output of bazel info release?

release 6.0.0rc4

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

No response

What's the output of git remote get-url origin; git rev-parse master; git rev-parse HEAD ?

No response

Have you found anything relevant by searching the web?

No response

Any other information, logs, or outputs that you want to share?

No response

@brentleyjones
Copy link
Contributor

@coeuvre @tjgq

@sgowroji sgowroji added type: bug untriaged team-Remote-Exec Issues and PRs for the Execution (Remote) team labels Dec 13, 2022
@coeuvre
Copy link
Member

coeuvre commented Dec 13, 2022

Probably the test action hit the cache so some files are not downloaded. Might related #16556.

Can you please provide an repro? Not exactly sure how to trigger this error.

@coeuvre coeuvre added P2 We'll consider working on this in future. (Assignee optional) more data needed labels Dec 13, 2022
@thirtyseven
Copy link
Contributor

thirtyseven commented Apr 3, 2023

I am also getting this warning on Bazel 6.1.0, but it doesn't seem to matter what values I specify for --remote_download_toplevel or --experimental_remote_build_event_upload=minimal. What's odder is that after the build completes, the baseline_coverage.dat file that Bazel complains it couldn't find is present in the output tree.

@UebelAndre
Copy link
Contributor

I'm seeing this spam in rules_rust builds on Bazel 6.2.1 for the following pull-request bazelbuild/rules_rust#2001

@coeuvre does this provide you a sufficient repro?

https://buildkite.com/bazel/rules-rust-rustlang/builds/8715#0188aaec-43c1-4ad1-a9fe-62cdbf7d03e3

(14:48:09) WARNING: Uploading BEP referenced local files: /var/lib/buildkite-agent/.cache/bazel/_bazel_buildkite-agent/ec321eb2cc2d0f8f91b676b6d4c66c29/execroot/rules_rust/bazel-out/k8-fastbuild/testlogs/test/unit/proc_macro/extern_flag_passed_when_compiling_macro_2018/baseline_coverage.dat (No such file or directory)
java.io.FileNotFoundException: /var/lib/buildkite-agent/.cache/bazel/_bazel_buildkite-agent/ec321eb2cc2d0f8f91b676b6d4c66c29/execroot/rules_rust/bazel-out/k8-fastbuild/testlogs/test/unit/proc_macro/extern_flag_passed_when_compiling_macro_2018/baseline_coverage.dat (No such file or directory)
	at com.google.devtools.build.lib.unix.NativePosixFiles.stat(Native Method)
	at com.google.devtools.build.lib.unix.UnixFileSystem.statInternal(UnixFileSystem.java:210)
	at com.google.devtools.build.lib.unix.UnixFileSystem.stat(UnixFileSystem.java:200)
	at com.google.devtools.build.lib.unix.UnixFileSystem.getFileSize(UnixFileSystem.java:391)
	at com.google.devtools.build.lib.vfs.Path.getFileSize(Path.java:559)
	at com.google.devtools.build.lib.remote.util.DigestUtil.compute(DigestUtil.java:61)
	at com.google.devtools.build.lib.remote.ByteStreamBuildEventArtifactUploader.readPathMetadata(ByteStreamBuildEventArtifactUploader.java:194)
	at com.google.devtools.build.lib.remote.ByteStreamBuildEventArtifactUploader.lambda$upload$7(ByteStreamBuildEventArtifactUploader.java:330)
	at io.reactivex.rxjava3.internal.operators.flowable.FlowableMap$MapSubscriber.onNext(FlowableMap.java:64)
	at io.reactivex.rxjava3.internal.operators.flowable.FlowableFromIterable$IteratorSubscription.fastPath(FlowableFromIterable.java:185)
	at io.reactivex.rxjava3.internal.operators.flowable.FlowableFromIterable$BaseRangeSubscription.request(FlowableFromIterable.java:129)
	at io.reactivex.rxjava3.internal.subscribers.BasicFuseableSubscriber.request(BasicFuseableSubscriber.java:153)
	at io.reactivex.rxjava3.internal.jdk8.FlowableCollectWithCollectorSingle$CollectorSingleObserver.onSubscribe(FlowableCollectWithCollectorSingle.java:102)
	at io.reactivex.rxjava3.internal.subscribers.BasicFuseableSubscriber.onSubscribe(BasicFuseableSubscriber.java:67)
	at io.reactivex.rxjava3.internal.operators.flowable.FlowableFromIterable.subscribe(FlowableFromIterable.java:69)
	at io.reactivex.rxjava3.internal.operators.flowable.FlowableFromIterable.subscribeActual(FlowableFromIterable.java:47)
	at io.reactivex.rxjava3.core.Flowable.subscribe(Flowable.java:15917)
	at io.reactivex.rxjava3.internal.operators.flowable.FlowableMap.subscribeActual(FlowableMap.java:38)
	at io.reactivex.rxjava3.core.Flowable.subscribe(Flowable.java:15917)
	at io.reactivex.rxjava3.internal.jdk8.FlowableCollectWithCollectorSingle.subscribeActual(FlowableCollectWithCollectorSingle.java:71)
	at io.reactivex.rxjava3.core.Single.subscribe(Single.java:4855)
	at io.reactivex.rxjava3.internal.operators.single.SingleFlatMap.subscribeActual(SingleFlatMap.java:37)
	at io.reactivex.rxjava3.core.Single.subscribe(Single.java:4855)
	at io.reactivex.rxjava3.internal.operators.single.SingleFlatMap.subscribeActual(SingleFlatMap.java:37)
	at io.reactivex.rxjava3.core.Single.subscribe(Single.java:4855)
	at io.reactivex.rxjava3.internal.operators.single.SingleMap.subscribeActual(SingleMap.java:35)
	at io.reactivex.rxjava3.core.Single.subscribe(Single.java:4855)
	at io.reactivex.rxjava3.internal.operators.single.SingleUsing.subscribeActual(SingleUsing.java:83)
	at io.reactivex.rxjava3.core.Single.subscribe(Single.java:4855)
	at io.reactivex.rxjava3.internal.operators.single.SingleSubscribeOn$SubscribeOnObserver.run(SingleSubscribeOn.java:89)
	at io.reactivex.rxjava3.internal.schedulers.ScheduledDirectTask.call(ScheduledDirectTask.java:38)
	at io.reactivex.rxjava3.internal.schedulers.ScheduledDirectTask.call(ScheduledDirectTask.java:25)
	at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.base/java.lang.Thread.run(Unknown Source)

@UebelAndre
Copy link
Contributor

@scentini do you know what would be needed to gain some traction here? These warnings are exploding logs in my projects.

@coeuvre
Copy link
Member

coeuvre commented Jun 23, 2023

Sorry for the delay. I will look into it.

@coeuvre coeuvre added P1 I'll work on this now. (Assignee required) and removed P2 We'll consider working on this in future. (Assignee optional) more data needed labels Jun 23, 2023
@coeuvre
Copy link
Member

coeuvre commented Jun 27, 2023

I can reproduce this issue using rules_rust with Bazel 6.2.1, but can't reproduce it with Bazel@HEAD. The root cause is similar to the issue described in #17798. I will work on a fix for 6.3.0.

@coeuvre
Copy link
Member

coeuvre commented Jun 27, 2023

And this issue exists without using BwoB. It's just that the file baseline_coverage.dat was not generated at the time when the BEP event was emitted.

@coeuvre
Copy link
Member

coeuvre commented Jun 27, 2023

cc @c-mita in case you are already aware of a fix for this issue.

@c-mita
Copy link
Member

c-mita commented Jun 27, 2023

I'm not aware of anything regarding this.

Does it complain about missing baseline_coverage.dat files for all transitive targets? Or just the top-level test targets?

@coeuvre
Copy link
Member

coeuvre commented Jun 27, 2023

Just the top-level targets. The missing baseline_coverage.dat files are generated afterwards. My suspicion is there are missing deps between the BaselineCoverageAction and CompletionFunction so that the TargetCompletedEvent was emitted without waiting for baseline_coverage.dat to be generated. I am running a bisection to look which commit fixed this.

@coeuvre
Copy link
Member

coeuvre commented Jun 27, 2023

Bisection points me to 30cf799 which forces skyframe to use FJP. And if I rerun the repro with flag --experimental_use_fork_join_pool using 6.2.1, guess what, the issue disappeared.

cc @meisterT

@coeuvre
Copy link
Member

coeuvre commented Jun 27, 2023

In conclusion, the issue is baseline_coverage.dat hasn't been generated when TargetCompleteEvent was emitted which references the file. Then the uploader tried to read it from local filesystem but failed. The baseline_coverage.dat was generated afterwards.

The root cause is:

  1. baseline_coverage.dat is a toplevel artifact and is generated by BaselineCoverageAction.
  2. TargetCompleteEvent is created by CompletionFunction which is executed after a test target had finished. The has a reference to baseline_coverage.dat but the function doesn't check whether the file has been generated.
  3. Since there is no connection between BaselineCoverageAction and CompletionFunction currently, there is no guarantee for the existence of baseline_coverage.dat when TargetCompleteEvent is emitted.

The fact that the issue cannot be reproduced at HEAD is because of flag --experimental_use_fork_join_pool. I guess FJP silently changed how skyframe schedule actions and accidentally make all the BaselineCoverageAction run before CompletionFunction.

To mitigate the issue for 6.x users, we can simply cherry-pick commit 30cf799 into 6.3.0. A proper fix requires us somehow ask CompletionFunction wait for the referenced baseline_coverage.dat to be generated before emit the TargetCompleteEvent.

@coeuvre coeuvre added P2 We'll consider working on this in future. (Assignee optional) and removed P1 I'll work on this now. (Assignee required) labels Jun 27, 2023
@tjgq
Copy link
Contributor

tjgq commented Aug 29, 2024

We believe this has been fixed at head through a combination of recent changes; please reopen otherwise.

@tjgq tjgq closed this as completed Aug 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
P2 We'll consider working on this in future. (Assignee optional) team-Remote-Exec Issues and PRs for the Execution (Remote) team type: bug
Projects
None yet
Development

No branches or pull requests

9 participants