bazel build //X:all records target completion events in the BEP stream with file:// URIs, but sometimes doesn't write files to disk #20737
Comments
I can't reproduce this issue with simple …
This is a bit of a shot in the dark, but does setting …
That's a pretty good shot in the dark! It does indeed make the issue go away (multiple successful builds with Bazel 7, with no failures).
@bazel-io fork 7.0.1
I added this issue as a soft release blocker for 7.0.1, since the fix is still unclear and there is a workaround. We can maybe have a 7.0.2 if this doesn't get fixed in time.
@bazel-io fork 7.1.0
@froydnj We looked into this issue today but still cannot reproduce it. Can you try to work out a minimal repro, or provide the full set of flags you are using? Also, were you building incrementally or from a clean state?
@nya3jp does this always happen for …
The …
That might actually be relevant! The code path to fetch the outputs of aspects is slightly different depending on the value of …

@joeleba mind taking a look at this to see if anything stands out? In particular, I'm worried that …

Alternatively, I could imagine that if the analysis of the aspect that promotes an artifact to a top-level artifact finishes after the artifact is built, the artifact will not get downloaded.
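To make that hypothesized ordering concrete, here is a minimal sketch; it is purely illustrative, and none of the names below correspond to actual Bazel code.

# Hypothetical model of the suspected ordering problem; the names are
# invented only to make the two orderings concrete.
toplevel = set()      # artifacts currently registered for download
downloaded = set()    # artifacts actually fetched to disk

def action_completed(artifact):
    # The download decision is made once, when the action (or cache hit) completes.
    if artifact in toplevel:
        downloaded.add(artifact)

def promote_to_toplevel(artifact):
    # e.g. analysis of an aspect, or of :top in the repro below, finishes late
    toplevel.add(artifact)

# Ordering A: registration happens before the action completes -> file on disk.
promote_to_toplevel("actual_file")
action_completed("actual_file")
assert "actual_file" in downloaded

# Ordering B: the action completes (a fast cache hit) before the artifact is
# promoted to top-level -> nothing ever downloads it, yet BEP reports success.
toplevel.clear(); downloaded.clear()
action_completed("actual_file")
promote_to_toplevel("actual_file")
assert "actual_file" not in downloaded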
The second alternative it is. Here is a reproduction. As expected, it requires …

cat > BUILD <<'EOF'
filegroup(name = "top", srcs = [":actual", "//slow"])
genrule(name = "proxy", srcs = [":actual"], outs = ["proxy_file"], cmd = "cp $< $@")
genrule(name = "actual", srcs = [], outs = ["actual_file"], cmd = "echo ACTUAL > $@")
EOF
mkdir -p slow
mkfifo slow/BUILD # slow/BUILD is a FIFO: reading it blocks until something writes to it, stalling loading/analysis of //slow
bazel build :proxy --disk_cache=/tmp/dc # Prime the disk cache
bazel clean # Clean the output tree so that the disk cache is hit again and actual_file is deleted
bazel build :top :proxy --disk_cache=/tmp/dc &
# Wait until :proxy builds; most principled is reading the BEP
# (I just used manual waiting for a few seconds, which is too brittle for an actual test)
echo "filegroup(name = 'slow', visibility = ['//visibility:public'])" > slow/BUILD # Unblock analyzing :top
wait %1 # Wait for the second Bazel invocation
# Behold the absence of bazel-bin/actual_file
From tracing the code with the debugger, it seems like your theory is correct, Lukacs. Each time a top-level target is analyzed, we add its output artifacts to a Trie of …

Naive question: is it feasible to just attempt to download every artifact in this Trie at the end of the build? That would take away the importance of when an artifact is registered as "top-level".
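For readers unfamiliar with the structure being referenced, here is a toy version of such a path trie (not the class Bazel actually uses): it registers top-level output paths and answers prefix-membership queries.

# Illustrative path trie; Bazel's real implementation differs.
class PathTrie:
    def __init__(self):
        self.children = {}
        self.terminal = False  # a registered top-level output path ends here

    def add(self, path):
        node = self
        for segment in path.split("/"):
            node = node.children.setdefault(segment, PathTrie())
        node.terminal = True

    def covers(self, path):
        # True if some registered path is a prefix of (or equal to) `path`.
        node = self
        for segment in path.split("/"):
            if node.terminal:
                return True
            node = node.children.get(segment)
            if node is None:
                return False
        return node.terminal

trie = PathTrie()
trie.add("bazel-bin/pkg")                        # e.g. a directory output
print(trie.covers("bazel-bin/pkg/actual_file"))  # True
print(trie.covers("bazel-bin/other/file"))       # False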
That'll be for @coeuvre to decide -- that's definitely the simplest solution, although it incurs a performance hit, since the download would start potentially much later than the moment Bazel realizes that a file needs to be downloaded.
I think the idea here is that we still download action outputs during action execution if we can, but we need a final pass at the end for anything that was only marked for download after the respective action finished. We could also consider storing top-level outputs that came from an aspect separately, so we only have to iterate those at the end, instead of the entire …
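A rough sketch of that two-phase idea, with invented names rather than Bazel's real API: download eagerly when the output is already known to be top-level at action completion, then sweep up anything promoted afterwards in one final pass.

# Illustrative two-phase downloader; invented names, not Bazel's API.
toplevel = set()
completed = set()     # outputs whose generating action has already finished
downloaded = set()

def download(artifact):
    downloaded.add(artifact)

def action_completed(artifact):
    completed.add(artifact)
    if artifact in toplevel:
        download(artifact)            # eager path: already marked top-level

def promote_to_toplevel(artifact):
    toplevel.add(artifact)

def end_of_build_pass():
    # Final sweep: fetch anything that became top-level only after its action
    # had already completed (e.g. via a late-analyzed aspect).
    for artifact in (toplevel & completed) - downloaded:
        download(artifact)

action_completed("actual_file")       # fast disk-cache hit
promote_to_toplevel("actual_file")    # aspect / :top analysis finishes later
end_of_build_pass()
assert "actual_file" in downloaded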
We tried a similar idea before and encountered #17798.
No, this could still trigger #17798.
Are we still targeting this for 7.2.0? (first RC next Monday, the 13th)
Yes, I am working on the fix now.
…ent. Fixes bazelbuild#20737. PiperOrigin-RevId: 632430888 Change-Id: I1f2e1034c73162e925e25eb57b31d6a32b30fad2
A fix for this issue has been included in Bazel 7.2.0 RC1. Please test out the release candidate and report any issues as soon as possible.
Description of the bug:
We use --build_event_json_file to watch for target completion events and run some post-processing on the produced files in tandem with Bazel running. With Bazel 6.3.2, this worked fine, though we did run into #17798, and therefore had to add some polling to wait for the files to be written to disk. This polling has a built-in (per-target) timeout of 3 minutes, which we ~never see hit (post-processing generally completes within seconds).

When attempting to upgrade to Bazel 7.0.0, we are seeing this timeout regularly getting hit, despite remote cache hits and an otherwise quiescent system. That is, we are seeing target completion events in the BEP stream, but for a small number of those targets, the relevant files never appear on disk (at least within the 3-minute timeout; bazel build finishes within ~40s, so the machine is not particularly overloaded). There does not seem to be any commonality in which files are missing between different bazel build runs; a given target might not be downloaded in one run and be downloaded in another. This behavior occurs with both fully-cached and mostly-cached builds.

The same behavior occurs when building with //X:...

We do occasionally see successful builds with Bazel 7, but they are rare (~1%, maybe less). It is possible, of course, that raising the timeout would work, but given that we ~never hit the timeout with Bazel 6 and ~always hit the timeout with Bazel 7, increasing the timeout would represent a significant performance hit.

We have checked the BEP streams between Bazel 6 and Bazel 7 (fully cached builds for both) and there are only trivial differences. We took gRPC logs from our remote cache (fully cached builds for both) and again there are only trivial differences.

We are using --remote_download_toplevel with both versions. This issue happens with remote caching enabled and remote execution disabled, which is our CI configuration (we know...); we have a separate configuration with remote caching and remote execution enabled that does not appear to exhibit the problem.

We do not see this behavior if we manually expand //X:all on the command line, i.e. listing out all the relevant targets that //X:all would have built.

cc @sushain97 @ngroman @xieyuxi-stripe @coeuvre
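For context, here is a stripped-down sketch of the kind of BEP-driven polling workaround described above. The --build_event_json_file flag is real; the event field names, the sequential (non-streaming) processing, and the file layout are simplifying assumptions, and the 3-minute timeout mirrors the one mentioned in the report.

# Sketch of a BEP watcher that polls for files referenced by targetCompleted
# events. Field names ("targetCompleted", "importantOutput", "uri") are from
# memory and should be verified against your Bazel version's BEP output.
import json, os, time
from urllib.parse import urlparse

TIMEOUT_SECS = 180  # the per-target 3-minute timeout mentioned above

def wait_for_outputs(bep_json_path):
    with open(bep_json_path) as f:
        for line in f:
            event = json.loads(line)
            completed = event.get("completed")
            if "targetCompleted" not in event.get("id", {}) or not completed:
                continue
            label = event["id"]["targetCompleted"]["label"]
            for out in completed.get("importantOutput", []):
                uri = out.get("uri", "")
                if not uri.startswith("file://"):
                    continue
                path = urlparse(uri).path
                deadline = time.time() + TIMEOUT_SECS
                while not os.path.exists(path):
                    if time.time() > deadline:
                        raise TimeoutError(f"{label}: {path} never appeared")
                    time.sleep(0.5)
                # ... run post-processing on `path` here ...

wait_for_outputs("bep.json")  # produced via --build_event_json_file=bep.json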
Which category does this issue belong to?
No response
What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.
We haven't been able to come up with a simple example that reproduces this outside of our codebase.
Which operating system are you running Bazel on?
x86-64 Linux
What is the output of bazel info release?
release 7.0.0
If bazel info release returns development version or (@non-git), tell us how you built Bazel.
No response
What's the output of git remote get-url origin; git rev-parse master; git rev-parse HEAD?
No response
Is this a regression? If yes, please try to identify the Bazel commit where the bug was introduced.
Yes, this is a regression. We have not tried to isolate the relevant commit(s).
Have you found anything relevant by searching the web?
I have looked through the last several months of GitHub issues, but do not see anything relevant.
bazel-discuss does not appear to have anything related.
Any other information, logs, or outputs that you want to share?
No response