Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Crash after build in downloadBlob #18337

Closed
jmmv opened this issue May 5, 2023 · 3 comments
Closed

Crash after build in downloadBlob #18337

jmmv opened this issue May 5, 2023 · 3 comments
Labels
P2 We'll consider working on this in future. (Assignee optional) team-Remote-Exec Issues and PRs for the Execution (Remote) team type: bug

Comments

@jmmv
Copy link
Contributor

jmmv commented May 5, 2023

Description of the bug:

While running a build against a remote build farm, we hit the crash below.

We configure --remote_timeout=10m, which probably explains the 10-minute delays I've annotated in the log. We are running with --remote_download_toplevel.

The crash seems to originate from downloadBlob. It's entirely possible that the connection to the build farm died / was aborted, but note that because the gaps are almost-exactly 10-minute long, this could also be Bazel itself aborting the connection due to a timeout. In any case, Bazel should definitely not crash.

2023-05-05 04:56:01  Target //foo:bar up-to-date:
2023-05-05 04:56:01    bazel-bin/foo/bar
2023-05-05 04:56:01  INFO: Elapsed time: 216.877s, Critical Path: 49.43s
2023-05-05 04:56:01  INFO: 15189 processes: 9232 remote cache hit, 5956 internal, 1 local.
2023-05-05 04:56:01  INFO: Build completed successfully, 15189 total actions
... 10 minutes pass with this message being repeatedly printed on onto the console ...
2023-05-05 05:06:02  INFO: Build completed successfully, 15189 total actions
... 10 minutes pass with radio silence ...
2023-05-05 05:16:10  INFO: Build completed successfully, 15189 total actions
2023-05-05 05:16:10  FATAL: bazel crashed due to an internal error. Printing stack trace:
2023-05-05 05:16:10  java.lang.RuntimeException: java.io.IOException: io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
2023-05-05 05:16:10  	at io.reactivex.rxjava3.internal.util.ExceptionHelper.wrapOrThrow(ExceptionHelper.java:46)
2023-05-05 05:16:10  	at io.reactivex.rxjava3.internal.observers.BlockingMultiObserver.blockingGet(BlockingMultiObserver.java:94)
2023-05-05 05:16:10  	at io.reactivex.rxjava3.core.Completable.blockingAwait(Completable.java:1468)
2023-05-05 05:16:10  	at com.google.devtools.build.lib.remote.util.AsyncTaskCache.awaitInProgressTasks(AsyncTaskCache.java:383)
2023-05-05 05:16:10  	at com.google.devtools.build.lib.remote.util.AsyncTaskCache$NoResult.awaitInProgressTasks(AsyncTaskCache.java:543)
2023-05-05 05:16:10  	at com.google.devtools.build.lib.remote.AbstractActionInputPrefetcher.flushOutputTree(AbstractActionInputPrefetcher.java:686)
2023-05-05 05:16:10  	at com.google.devtools.build.lib.remote.RemoteOutputService.flushOutputTree(RemoteOutputService.java:110)
2023-05-05 05:16:10  	at com.google.devtools.build.lib.runtime.commands.RunCommand.exec(RunCommand.java:314)
2023-05-05 05:16:10  	at com.google.devtools.build.lib.runtime.BlazeCommandDispatcher.execExclusively(BlazeCommandDispatcher.java:610)
2023-05-05 05:16:10  	at com.google.devtools.build.lib.runtime.BlazeCommandDispatcher.exec(BlazeCommandDispatcher.java:234)
2023-05-05 05:16:10  	at com.google.devtools.build.lib.server.GrpcServerImpl.executeCommand(GrpcServerImpl.java:550)
2023-05-05 05:16:10  	at com.google.devtools.build.lib.server.GrpcServerImpl.lambda$run$1(GrpcServerImpl.java:614)
2023-05-05 05:16:10  	at io.grpc.Context$1.run(Context.java:566)
2023-05-05 05:16:10  	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
2023-05-05 05:16:10  	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
2023-05-05 05:16:10  	at java.base/java.lang.Thread.run(Unknown Source)
2023-05-05 05:16:10  Caused by: java.io.IOException: io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
2023-05-05 05:16:10  	at com.google.devtools.build.lib.remote.GrpcCacheClient.lambda$downloadBlob$18(GrpcCacheClient.java:346)
2023-05-05 05:16:10  	at com.google.common.util.concurrent.AbstractCatchingFuture$AsyncCatchingFuture.doFallback(AbstractCatchingFuture.java:203)
2023-05-05 05:16:10  	at com.google.common.util.concurrent.AbstractCatchingFuture$AsyncCatchingFuture.doFallback(AbstractCatchingFuture.java:190)
2023-05-05 05:16:10  	at com.google.common.util.concurrent.AbstractCatchingFuture.run(AbstractCatchingFuture.java:133)
2023-05-05 05:16:10  	at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31)
2023-05-05 05:16:10  	at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1270)
2023-05-05 05:16:10  	at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1038)
2023-05-05 05:16:10  	at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:808)
2023-05-05 05:16:10  	at com.google.common.util.concurrent.AbstractTransformFuture.run(AbstractTransformFuture.java:104)
2023-05-05 05:16:10  	at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31)
2023-05-05 05:16:10  	at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1270)
2023-05-05 05:16:10  	at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1038)
2023-05-05 05:16:10  	at com.google.common.util.concurrent.AbstractFuture.setFuture(AbstractFuture.java:851)
2023-05-05 05:16:10  	at com.google.common.util.concurrent.AbstractCatchingFuture$AsyncCatchingFuture.setResult(AbstractCatchingFuture.java:214)
2023-05-05 05:16:10  	at com.google.common.util.concurrent.AbstractCatchingFuture$AsyncCatchingFuture.setResult(AbstractCatchingFuture.java:190)
2023-05-05 05:16:10  	at com.google.common.util.concurrent.AbstractCatchingFuture.run(AbstractCatchingFuture.java:142)
2023-05-05 05:16:10  	at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31)
2023-05-05 05:16:10  	at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1270)
2023-05-05 05:16:10  	at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1038)
2023-05-05 05:16:10  	at com.google.common.util.concurrent.AbstractFuture.setFuture(AbstractFuture.java:851)
2023-05-05 05:16:10  	at com.google.common.util.concurrent.AbstractCatchingFuture$AsyncCatchingFuture.setResult(AbstractCatchingFuture.java:214)
2023-05-05 05:16:10  	at com.google.common.util.concurrent.AbstractCatchingFuture$AsyncCatchingFuture.setResult(AbstractCatchingFuture.java:190)
2023-05-05 05:16:10  	at com.google.common.util.concurrent.AbstractCatchingFuture.run(AbstractCatchingFuture.java:142)
2023-05-05 05:16:10  	at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31)
2023-05-05 05:16:10  	at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1270)
2023-05-05 05:16:10  	at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1038)
2023-05-05 05:16:10  	at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:808)
2023-05-05 05:16:10  	at com.google.common.util.concurrent.SettableFuture.setException(SettableFuture.java:55)
2023-05-05 05:16:10  	at com.google.devtools.build.lib.remote.util.RxFutures$2.onError(RxFutures.java:257)
2023-05-05 05:16:10  	at io.reactivex.rxjava3.internal.operators.single.SingleFlatMap$SingleFlatMapCallback$FlatMapSingleObserver.onError(SingleFlatMap.java:117)
2023-05-05 05:16:10  	at io.reactivex.rxjava3.internal.operators.single.SingleUsing$UsingSingleObserver.onError(SingleUsing.java:180)
2023-05-05 05:16:10  	at io.reactivex.rxjava3.internal.operators.single.SingleCreate$Emitter.tryOnError(SingleCreate.java:95)
2023-05-05 05:16:10  	at io.reactivex.rxjava3.internal.operators.single.SingleCreate$Emitter.onError(SingleCreate.java:81)
2023-05-05 05:16:10  	at com.google.devtools.build.lib.remote.util.RxFutures$OnceSingleOnSubscribe$1.onFailure(RxFutures.java:172)
2023-05-05 05:16:10  	at com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1124)
2023-05-05 05:16:10  	at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31)
2023-05-05 05:16:10  	at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1270)
2023-05-05 05:16:10  	at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1038)
2023-05-05 05:16:10  	at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:808)
2023-05-05 05:16:10  	at com.google.common.util.concurrent.SettableFuture.setException(SettableFuture.java:55)
2023-05-05 05:16:10  	at com.google.devtools.build.lib.remote.GrpcCacheClient$1.onError(GrpcCacheClient.java:422)
2023-05-05 05:16:10  	at io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:478)
2023-05-05 05:16:10  	at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
2023-05-05 05:16:10  	at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
2023-05-05 05:16:10  	at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)
2023-05-05 05:16:10  	at com.google.devtools.build.lib.remote.NetworkTimeInterceptor$NetworkTimeCall$1.onClose(NetworkTimeInterceptor.java:81)
2023-05-05 05:16:10  	at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:562)
2023-05-05 05:16:10  	at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:70)
2023-05-05 05:16:10  	at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:743)
2023-05-05 05:16:10  	at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:722)
2023-05-05 05:16:10  	at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
2023-05-05 05:16:10  	at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
2023-05-05 05:16:10  	... 3 more
2023-05-05 05:16:10  Caused by: io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
2023-05-05 05:16:10  	at io.grpc.Status.asRuntimeException(Status.java:535)
2023-05-05 05:16:10  	... 14 more
2023-05-05 05:16:10  Caused by: io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

Probably: make Bazel download a very large artifact with --remote_download_toplevel (say ~4 GB) and then make the remote build farm node (cache?) crash to reset its open connections.

Code inspection based on the stack trace pasted above should lead to the place where the exception is not being handled though.

Which operating system are you running Bazel on?

Linux

What is the output of bazel info release?

bazel-6.1.1

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

No response

What's the output of git remote get-url origin; git rev-parse master; git rev-parse HEAD ?

No response

Have you found anything relevant by searching the web?

No response

Any other information, logs, or outputs that you want to share?

No response

@sgowroji sgowroji added the team-Remote-Exec Issues and PRs for the Execution (Remote) team label May 8, 2023
@wilwell wilwell added P2 We'll consider working on this in future. (Assignee optional) and removed untriaged labels May 9, 2023
@coeuvre
Copy link
Member

coeuvre commented Sep 7, 2023

Should be fixed with recent changes.

@coeuvre coeuvre closed this as completed Sep 7, 2023
@sluongng
Copy link
Contributor

@coeuvre could you please clarify what is the root cause for this issue? We have customers reported similar issues on 6.3.2 and 6.4.0. Could the fix be cherry-picked back to 6.x?

@coeuvre
Copy link
Member

coeuvre commented Jan 12, 2024

For the original issue, there were two problems:

  1. Bazel downloads outputs in the background. After the invocation is done, it waits for the download to finish. If there are errors from the downloads, it crashes.
  2. The download doesn't retry on connection reset.

I think only the first problem was fixed by moving the downloads into action execution and the changes for this fix cannot be cherry-picked into 6.x.

I think the second problem still exists in HEAD and the fix should be as simple as #14258 - we need to handle more exceptions in the RemoteRetrier to properly retry the download. Can you open a new issue with more context (e.g. update-to-date stack trace)?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
P2 We'll consider working on this in future. (Assignee optional) team-Remote-Exec Issues and PRs for the Execution (Remote) team type: bug
Projects
None yet
Development

No branches or pull requests

6 participants