bazel 5.0.0rc3 crashes with local and remote actions being done #14433
Comments
/cc @Wyverald |
Is this caused by manual interruptions during the build? |
/cc @larsrc-google |
I don't see anything in our logs that would indicate a manual interruption during the build. EDIT: It's certainly possible that this is a fluke (or maybe a bug in our CI system), but I figured it's worth posting in case it isn't. I've never encountered this error before in our CI system. |
I'm seeing this also on darwin, definitely no user interruption. |
What's the most recent version you're not seeing this issue with? |
I submitted a fix for this kind of problem in fe1e25b, and the further changes to src/main/java/com/google/devtools/build/lib/dynamic/DynamicSpawnStrategy.java should not be able to cause this. However, 04754ef can increase the number of actions getting executed without a delay, which could make this kind of problem more common. Before that, 1962a59 made significant changes, but those were picked into 4.2.0. If you want to help reproduce it, you can increase the chance of hitting this case by setting @jlaxson Are you seeing that on a CI system or with local builds? |
I'm seeing it locally on 5.0.0rc3, our CI is all remote with --remote_download_minimal. Anecdotally I don't remember it on rc2, but not sure if I was using dynamic execution for much of that time. But yesterday, I think I hit it about ten times in a row with maybe 1 dirty cpp file and a linking step. I'll dev on 5.0.0-pre.20210708.4 today and see what happens. |
I'm unfortunately on vacation until the new year so I haven't been running this RC as frequently as I normally would, apologies. I won't have any more data points until I get back from vacation. |
No problem. Enjoy your vacation! |
This is marked as the last release blocker. @philsc were you able to get more data points? |
Apologies, I've been working on #14505 to see if there's anything on our side causing it. I'll see if I can still reproduce this error. Though if it's a race condition (for example) I may not see it again. |
Since it came out, I ran 5.0.0rc3 a total of 8 times so far on our repo (8 different days) and I encountered the crash 4 times. Two of those times happened on the same build for different configurations. In other words, it seems fairly prevalent, at least in our setup. Is there some more debugging information I can try gathering? |
Running with Would you be able to run it with one of the earlier RCs? Does the error message always say that the local is cancelled and the remote is done, or does it vary? Are there earlier warnings about null results? The only way I can think of this happening is if the remote branch returns null, which it's not supposed to. |
These are the 4 crashes. I trimmed other text because it was just regular build progress output.
I cannot because they immediately trigger analysis errors. RC3 is the first one I was able to run because of #14289 (i.e. fixed in RC3). I will trigger some runs with |
I had one run with |
I've been on release 5.0.0-pre.20210708.4 for a few weeks and have noticed nothing. Forget why I chose that one... |
Those are interesting errors. The message is actually misleading - we caught the While waiting for legal to get back to you, just the outputs from those crashes might be useful in eliminating some possibilities. I'm also curious why your CI is using dynamic execution as Chi asked. I would have expected the CI to run all builds remotely. |
Can you clarify what you mean by outputs? The stack traces are all the same, but let me double-check.
I think @80degreeswest might be a better person to answer that. I haven't been involved with that aspect of our CI a whole lot. |
We use --experimental_spawn_scheduler to account for times where small targets seem to run very slow remotely, but execute extremely fast locally (not any specific targets, and not 100% of the time). |
@philsc I mean other outputs when run with |
Still trying to reproduce this locally. I have a setup that builds Bazel itself with dynamic execution and ran it for 15+ hours but no luck. |
@philsc Can you try with flag |
@philsc gentle ping. This is one of the last release blockers of 5.0 and I'd like to push out a final RC as soon as possible. |
@Wyverald , apologies. I'm back from traveling. EDIT: I submitted an anonymized log for their review. If accepted, I will post it here. |
@larsrc-google , no we are not. |
@philsc Thank you! Given how elusive this issue is and its relatively low prevalence, I talked to a few people and decided to go ahead and release an RC4 without resolving this issue. If we can get this debugged and resolved soon-ish (say within a week from now), then we can cherrypick the fix, otherwise we'll just ship 5.0 and aim for a 5.0.1 point release to eventually fix this. |
I now have a Bazel setup where I can do dynamic execution, but I don't have a target I can reproduce this with. If anyone sees this happening with a publicly available srcs, please let me know. Until then, I shall see what philsc can provide. |
Here's an anonymized build log of one of the crashes: bazel-issue-14433.txt |
Hm, not much to learn from that, alas. The local branch was cancelled, and then the remote branch got interrupted in a non-cancellation way. I don't think we can get this debugged quickly. Don't let it hold up the release. |
@bazel-io fork 5.1 |
This is marked as a release blocker for 5.1. Has any progress been made on it? |
Apologies. I haven't had the time to collect more data. Specifically, reproducing the crash with |
I'm able to reproduce this as well. I've shared logs using |
I'm having some trouble extracting these artifacts from our CI pipelines. I should be able to get this resolved, create an anonymized version of the artifacts, and upload them here. They appear to be ~300 MiB in size though. EDIT: I guess the file size is related to how far into the build the crash occurs. |
I found one crash that died very early in the run. Here's the anonymized output generated with That log corresponds to this crash:
Note: I did not make the filenames match in the text output and the GRPC log. If that's important, let me know and I can rework it. |
This looks similar to a thing we've been investigating internally. Our current hypothesis is that something in the remote branch uses interrupts wrongly. It doesn't look like these happen because the remote branch gets interrupted by the dynamic execution system. There's a general problem that if one of the branches uses interrupts internally, it's impossible to say if an interrupt was from the dynamic execution cancellation or from inside the branch. But that's not what we're seeing here. |
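To illustrate the general problem described in the previous comment, here is a minimal, self-contained Java sketch (not Bazel code; the class and names are hypothetical) of why a branch that uses interrupts internally makes the interrupt's origin ambiguous:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class InterruptAmbiguity {
  public static void main(String[] args) throws Exception {
    ExecutorService pool = Executors.newSingleThreadExecutor();

    // Stand-in for the "remote branch" of a dynamically executed action.
    Future<?> remoteBranch = pool.submit(() -> {
      try {
        // A helper inside the branch that uses interrupts internally may leave
        // the thread's interrupt flag set...
        Thread.currentThread().interrupt();
        // ...so the next blocking call observes a self-inflicted interrupt.
        Thread.sleep(1_000);
      } catch (InterruptedException e) {
        // From here there is no way to tell whether the dynamic scheduler
        // cancelled this branch (e.g. because the local branch won) or the
        // interrupt came from inside the branch itself.
        System.out.println("interrupted, but by whom?");
      }
    });

    remoteBranch.get();
    pool.shutdown();
  }
}
```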
Found the root cause! Working on the fix. |
… enabled. Fixes bazelbuild#14433. The root cause is that, inside `RemoteExecutionCache`, the result of `FindMissingDigests` is shared with other threads without considering error handling. For example, if two or more threads are uploading the same input and one thread is interrupted while waiting for the result of the `FindMissingDigests` call, the call is cancelled and the other threads still waiting for the upload receive an upload error due to the cancellation, which is wrong. This PR fixes this by effectively applying a reference count to the result of the `FindMissingDigests` call so that if one thread is interrupted, as long as other threads still depend on the result, the call won't be cancelled and the upload can continue. Closes bazelbuild#15001. PiperOrigin-RevId: 436180205
… enabled. (#15091) Fixes #14433. The root cause is that, inside `RemoteExecutionCache`, the result of `FindMissingDigests` is shared with other threads without considering error handling. For example, if two or more threads are uploading the same input and one thread is interrupted while waiting for the result of the `FindMissingDigests` call, the call is cancelled and the other threads still waiting for the upload receive an upload error due to the cancellation, which is wrong. This PR fixes this by effectively applying a reference count to the result of the `FindMissingDigests` call so that if one thread is interrupted, as long as other threads still depend on the result, the call won't be cancelled and the upload can continue. Closes #15001. PiperOrigin-RevId: 436180205
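For illustration, here is a minimal sketch of the reference-counting idea described in the commit message above. This is not the actual `RemoteExecutionCache` change; `SharedCall` and its members are hypothetical names, and the real fix applies the idea to the shared result of the `FindMissingDigests` call:

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

/** Shares the result of one in-flight call (e.g. an RPC) among several waiting threads. */
final class SharedCall<T> {
  private final Future<T> call;                          // the single shared call
  private final AtomicInteger waiters = new AtomicInteger();

  SharedCall(Future<T> call) {
    this.call = call;
  }

  /** Waits for the shared result; an interrupt here only cancels the call if nobody else is waiting. */
  T await() throws ExecutionException, InterruptedException {
    waiters.incrementAndGet();
    boolean interrupted = false;
    try {
      return call.get();
    } catch (InterruptedException e) {
      interrupted = true;
      throw e;
    } finally {
      // Only the last waiter to leave because of an interrupt cancels the
      // underlying call; threads that still depend on the result keep it alive.
      if (waiters.decrementAndGet() == 0 && interrupted) {
        call.cancel(true);
      }
    }
  }
}
```

The key property is that the shared call is only cancelled when its last remaining waiter is interrupted, so other threads that still depend on the result keep waiting and their uploads can continue.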
Description of the problem / feature request:
I'm trying out 5.0.0rc3 in our CI environment and saw the following crash:
Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.
This is just setting `.bazelversion` to `5.0.0rc3` and running it on our entire build. This involves a remote buildfarm cluster. I don't know of a "simple" way to reproduce this.
What operating system are you running Bazel on?
Everything's on x86_64 Linux.
What's the output of `bazel info release`?
Have you found anything relevant by searching the web?
I couldn't find anything pertinent.
Any other information, logs, or outputs that you want to share?
Not at this time.