bazel 5.0.0rc3 crashes with local and remote actions being done #14433
Comments
/cc @Wyverald |
Is this caused by manual interruptions during the build? |
/cc @larsrc-google |
I don't see anything in our logs that would indicate a manual interruption during the build. EDIT: It's certainly possible that this is a fluke (or maybe a bug in our CI system), but I figured it's worth posting in case it isn't. I've never encountered this error before in our CI system. |
I'm seeing this also on darwin, definitely no user interruption. |
What's the most recent version you're not seeing this issue with? |
I submitted a fix for this kind of problem in fe1e25b, and the further changes to src/main/java/com/google/devtools/build/lib/dynamic/DynamicSpawnStrategy.java should not be able to cause this. However, 04754ef can increase the number of actions getting executed without a delay, which could make this kind of problem more common. Before that, 1962a59 made significant changes, but those were picked into 4.2.0. If you want to help reproduce it, you can increase the chance of hitting this case by setting @jlaxson Are you seeing that on a CI system or with local builds? |
I'm seeing it locally on 5.0.0rc3, our CI is all remote with --remote_download_minimal. Anecdotally I don't remember it on rc2, but not sure if I was using dynamic execution for much of that time. But yesterday, I think I hit it about ten times in a row with maybe 1 dirty cpp file and a linking step. I'll dev on 5.0.0-pre.20210708.4 today and see what happens. |
I'm unfortunately on vacation until the new year so I haven't been running this RC as frequently as I normally would, apologies. I won't have any more data points until I get back from vacation. |
No problem. Enjoy your vacation! |
This is marked as the last release blocker. @philsc were you able to get more data points? |
Apologies, I've been working on #14505 to see if there's anything on our side causing it. I'll see if I can still reproduce this error. Though if it's a race condition (for example) I may not see it again. |
Since it came out, I ran 5.0.0rc3 a total of 8 times so far on our repo (8 different days) and I encountered the crash 4 times. Two of those times happened on the same build for different configurations. In other words, it seems fairly prevalent, at least in our setup. Is there some more debugging information I can try gathering? |
Running with Would you be able to run it with one of the earlier RCs? Does the error message always say that the local is cancelled and the remote is done, or does it vary? Are there earlier warnings about null results? The only way I can think of this happening is if the remote branch returns null, which it's not supposed to. |
These are the 4 crashes. I trimmed other text because it was just regular build progress output.
I cannot because they immediately trigger analysis errors. RC3 is the first one I was able to run because of #14289 (i.e. fixed in RC3). I will trigger some runs with |
I had one run with |
I've been on release 5.0.0-pre.20210708.4 for a few weeks and have noticed nothing. Forget why I chose that one... |
Those are interesting errors. The message is actually misleading - we caught the While waiting for legal to get back to you, just the outputs from those crashes might be useful in eliminating some possibilities. I'm also curious why your CI is using dynamic execution as Chi asked. I would have expected the CI to run all builds remotely. |
Can you clarify what you mean by outputs? The stack traces are all the same, but let me double-check.
I think @80degreeswest might be a better person to answer that. I haven't been involved with that aspect of our CI a whole lot. |
We use --experimental_spawn_scheduler to account for times where small targets seem to run very slow remotely, but execute extremely fast locally (not any specific targets, and not 100% of the time). |
@philsc I mean other outputs when run with |
Still trying to reproduce this locally. I have a setup that builds Bazel itself with dynamic execution and ran it for 15+ hours but no luck. |
@philsc Can you try with flag |
@philsc gentle ping. This is one of the last release blockers of 5.0 and I'd like to push out a final RC as soon as possible. |
@Wyverald , apologies. I'm back from traveling. EDIT: I submitted an anonymized log for their review. If accepted, I will post it here. |
@larsrc-google , no we are not. |
@philsc Thank you! Given how elusive this issue is and its relatively low prevalence, I talked to a few people and decided to go ahead and release an RC4 without resolving this issue. If we can get this debugged and resolved soon-ish (say within a week from now), then we can cherrypick the fix, otherwise we'll just ship 5.0 and aim for a 5.0.1 point release to eventually fix this. |
I now have a Bazel setup where I can do dynamic execution, but I don't have a target I can reproduce this with. If anyone sees this happening with a publicly available srcs, please let me know. Until then, I shall see what philsc can provide. |
Here's an anonymized build log of one of the crashes: bazel-issue-14433.txt |
Hm, not much to learn from that, alas. The local branch was cancelled, and then the remote branch got interrupted in a non-cancellation way. I don't think we can get this debugged quickly. Don't let it hold up the release. |
@bazel-io fork 5.1 |
This is marked as a release blocker for 5.1. Has any progress been made on it? |
Apologies. I haven't had the time to collect more data. Specifically, reproducing the crash with |
I'm able to reproduce this as well. I've shared logs using |
I'm having some trouble extracting these artifacts from our CI pipelines. I should be able to get this resolved, create an anonymized version of the artifacts, and upload them here. They appear to be ~300 MiB in size though. EDIT: I guess the file size is related to how far into the build the crash occurs. |
I found one crash that died very early in the run. Here's the anonymized output generated with That log corresponds to this crash:
Note: I did not make the filenames match in the text output and the GRPC log. If that's important, let me know and I can rework it. |
This looks similar to a thing we've been investigating internally. Our current hypothesis is that something in the remote branch uses interrupts wrongly. It doesn't look like these happen because the remote branch gets interrupted by the dynamic execution system. There's a general problem that if one of the branches uses interrupts internally, it's impossible to say if an interrupt was from the dynamic execution cancellation or from inside the branch. But that's not what we're seeing here. |
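To illustrate the general problem described in the previous comment, here is a minimal, self-contained Java sketch (not Bazel code; the class and names are hypothetical) of why a branch that uses interrupts internally makes the interrupt's origin ambiguous:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class InterruptAmbiguity {
  public static void main(String[] args) throws Exception {
    ExecutorService pool = Executors.newSingleThreadExecutor();

    // Stand-in for the "remote branch" of a dynamically executed action.
    Future<?> remoteBranch = pool.submit(() -> {
      try {
        // A helper inside the branch that uses interrupts internally may leave
        // the thread's interrupt flag set...
        Thread.currentThread().interrupt();
        // ...so the next blocking call observes a self-inflicted interrupt.
        Thread.sleep(1_000);
      } catch (InterruptedException e) {
        // From here there is no way to tell whether the dynamic scheduler
        // cancelled this branch (e.g. because the local branch won) or the
        // interrupt came from inside the branch itself.
        System.out.println("interrupted, but by whom?");
      }
    });

    remoteBranch.get();
    pool.shutdown();
  }
}
```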
Found the root cause! Working on the fix. |
… enabled. Fixes bazelbuild#14433. The root cause is that, inside `RemoteExecutionCache`, the result of `FindMissingDigests` is shared with other threads without considering error handling. For example, if two or more threads are uploading the same input and one thread is interrupted while waiting for the result of the `FindMissingDigests` call, the call is cancelled and the other threads still waiting for the upload receive an upload error due to the cancellation, which is wrong. This PR fixes this by effectively applying a reference count to the result of the `FindMissingDigests` call so that if one thread is interrupted, as long as other threads still depend on the result, the call won't be cancelled and the upload can continue. Closes bazelbuild#15001. PiperOrigin-RevId: 436180205
… enabled. (#15091) Fixes #14433. The root cause is that, inside `RemoteExecutionCache`, the result of `FindMissingDigests` is shared with other threads without considering error handling. For example, if two or more threads are uploading the same input and one thread is interrupted while waiting for the result of the `FindMissingDigests` call, the call is cancelled and the other threads still waiting for the upload receive an upload error due to the cancellation, which is wrong. This PR fixes this by effectively applying a reference count to the result of the `FindMissingDigests` call so that if one thread is interrupted, as long as other threads still depend on the result, the call won't be cancelled and the upload can continue. Closes #15001. PiperOrigin-RevId: 436180205
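For illustration, here is a minimal sketch of the reference-counting idea described in the commit message above. This is not the actual `RemoteExecutionCache` change; `SharedCall` and its members are hypothetical names, and the real fix applies the idea to the shared result of the `FindMissingDigests` call:

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

/** Shares the result of one in-flight call (e.g. an RPC) among several waiting threads. */
final class SharedCall<T> {
  private final Future<T> call;                          // the single shared call
  private final AtomicInteger waiters = new AtomicInteger();

  SharedCall(Future<T> call) {
    this.call = call;
  }

  /** Waits for the shared result; an interrupt here only cancels the call if nobody else is waiting. */
  T await() throws ExecutionException, InterruptedException {
    waiters.incrementAndGet();
    boolean interrupted = false;
    try {
      return call.get();
    } catch (InterruptedException e) {
      interrupted = true;
      throw e;
    } finally {
      // Only the last waiter to leave because of an interrupt cancels the
      // underlying call; threads that still depend on the result keep it alive.
      if (waiters.decrementAndGet() == 0 && interrupted) {
        call.cancel(true);
      }
    }
  }
}
```

The key property is that the shared call is only cancelled when its last remaining waiter is interrupted, so other threads that still depend on the result keep waiting and their uploads can continue.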
Description of the problem / feature request:
I'm trying out 5.0.0rc3 in our CI environment and saw the following crash:
Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.
This is just setting `.bazelversion` to `5.0.0rc3` and running it on our entire build. This involves a remote buildfarm cluster. I don't know of a "simple" way to reproduce this.
What operating system are you running Bazel on?
Everything's on x86_64 Linux.
What's the output of `bazel info release`?
Have you found anything relevant by searching the web?
I couldn't find anything pertinent.
Any other information, logs, or outputs that you want to share?
Not at this time.