Crash with --experimental_remote_download_outputs=toplevel
#8508
@keith there's no retries for the HTTP protocol yet. I believe this error happens independent of --experimental_remote_download_outputs=toplevel, no?
Regardless of retries, what's the expected behavior in this case?
@keith to fall back to local execution. It's a bug either way.
I can't reproduce this with 0.27.1 ... can you?
Do you have a specific fix in mind that would have addressed this? I'll have to merge this again on our side and test well enough to see if it breaks again.
It looks like I can't test this in our case because we hit this:
So if you think this is fixed on your side, I guess we can close this.
@keith in order to remove this error you need to clear your remote cache. Which remote cache are you using?
This is from an internal cache implementation. But in theory anything could be evicted from other caches during a build as well, right?
@buchgr do you know how Bazel would handle errors from the cache, even if the key itself hasn't been evicted and is just temporarily inaccessible? Specifically, if we returned a 500 in this case, would that be better than returning a 404?
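To make the distinction the question raises concrete, here is a minimal, purely hypothetical sketch of how a remote-cache client could treat the two status codes differently. This is not Bazel's actual implementation; the function name, return values, and retry limit are all invented for illustration.

```python
# Hypothetical sketch: a 404 means the entry is definitively gone (fall
# back to building locally), while a 5xx means transient server trouble
# (retry with backoff, then fall back rather than crash).

def handle_cache_response(status: int, attempt: int, max_retries: int = 5) -> str:
    """Return the action to take for a given HTTP status from the cache."""
    if status == 200:
        return "use-cached-output"
    if status == 404:
        # Permanent cache miss: only re-running the action locally can
        # regenerate the output.
        return "fallback-to-local"
    if 500 <= status < 600:
        # Transient failure: retry a bounded number of times, then give up
        # and build locally.
        return "retry" if attempt < max_retries else "fallback-to-local"
    return "error"

print(handle_cache_response(500, attempt=0))  # retry
print(handle_cache_response(500, attempt=5))  # fallback-to-local
print(handle_cache_response(404, attempt=0))  # fallback-to-local
```

Under this model a 500 is arguably "better" than a 404 for a temporarily inaccessible entry, because it signals that the entry may come back.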
@buchgr I enabled this flag set on Envoy and we've started seeing this crash as well: https://circleci.com/gh/envoyproxy/envoy/249586. We're pointing to the GCS remote cache. Any ideas on how we can mitigate this?
@keith disable garbage collection on GCS. The issue is that GCS does not understand the action graph of what it's caching and evicts entries purely based on time. So it might have a cache entry for an action A1 but have deleted all the outputs of this action. If an action A2 now needs the outputs of A1, then Bazel can't download them and also can't re-run action A1, and thus has to print this error. I'll have to write a blog post about this and discuss potential ways of mitigation. For GCS specifically, the only currently viable strategy is to disable garbage collection and wipe the whole cache from time to time. I imagine it would be easy enough to write a cloud function (or something) that reads in the action graph from GCS and properly evicts items from time to time.
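The eviction strategy Jakob suggests can be sketched as a mark-and-sweep over the cache: collect every output digest still referenced by an action-cache entry, then delete only unreferenced CAS blobs, so an action's entry and its outputs disappear together. The dict-based "buckets" below are stand-ins for real storage; this is not Bazel or GCS code.

```python
# Mark-and-sweep sketch: only evict CAS blobs that no surviving
# action-cache entry references. The data structures are simplified
# stand-ins, not the real remote-cache wire format.

def sweep_cas(action_cache: dict, cas: dict) -> dict:
    """Keep only CAS blobs referenced by some action result."""
    live = set()
    for result in action_cache.values():
        live.update(result["output_digests"])
    return {digest: blob for digest, blob in cas.items() if digest in live}

action_cache = {
    "a1": {"output_digests": {"d1", "d2"}},  # action A1 still cached
}
cas = {"d1": b"out1", "d2": b"out2", "d3": b"orphan"}

cas = sweep_cas(action_cache, cas)
print(sorted(cas))  # ['d1', 'd2'] -- the orphaned blob d3 is evicted
```

Time-based eviction breaks exactly because it can delete `d1` while `a1` survives, producing the cache-hit-without-outputs state described above.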
Pretty sure this is still a problem FWIW |
Thanks Keith, we should probably document what Jakob described. |
Is there any active work on builds without the bytes to make it gracefully handle download errors like the normal build does? I just uprev'd to v3.4.1 and tried experimenting with this feature. It is not a garbage collection issue in my case because we only clean up on a cron once a month, although I suspect it is a similar problem. For example:
I'm collecting all issues for build-without-the-bytes in the tracking issue (#6862). I'm making slow progress getting through the backlog. If you have issues that are easy to repro, I can take a look. |
I am unable to reproduce this issue. |
FYI I still see a similar stack to the one above here:
but I can file a new issue if you think it's unrelated.
Is this with the gRPC remote cache? AFAIK, the HTTP remote cache only supports HTTP 1.1. Also, can you provide the full stack trace? I think the interesting part is below where you cut it off. |
This is with GCP RBE via gRPC. Super long log: https://gist.github.com/keith/53adb5908b9419443d682398d7dcb832 |
This seems to be working as expected from Bazel's point of view. You should bring this up with the RBE team? @EricBurnett @bergsieker
So, I created a remote execution service that artificially causes GOAWAY like the one you are seeing. Bazel retries the error 5 times (for a total of 6 calls) with exponentially increasing delays between calls. The default exponential backoff configuration is still fairly aggressive, with delays of 100ms, 200ms, 400ms, 800ms, and 1600ms. If the remote system is overloaded for a period of time, then it is not entirely surprising if it returns GOAWAY 6 times in ~3.1 seconds.
If this is a common issue, then it looks like a service misconfiguration to me. For example, the service could reduce the number of concurrent streams the client is allowed to open, or it could use flow control to reduce the rate of incoming packets. We could also increase the backoff times in Bazel; they look pretty aggressive to me.
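The retry schedule described in that comment can be computed directly. The constants below mirror the delays quoted above (5 retries, starting at 100 ms, doubling each time); they are taken from the comment, not from Bazel's configuration source.

```python
# Exponential backoff schedule matching the delays quoted above:
# 100, 200, 400, 800, 1600 ms across 5 retries (6 calls total).

def backoff_delays_ms(initial_ms: int = 100, multiplier: int = 2, retries: int = 5):
    """Delay before each retry: initial_ms * multiplier**i."""
    return [initial_ms * multiplier ** i for i in range(retries)]

delays = backoff_delays_ms()
print(delays)              # [100, 200, 400, 800, 1600]
print(sum(delays) / 1000)  # 3.1 -> all 6 calls fit inside ~3.1 seconds
```

This shows why a service that stays overloaded for just a few seconds exhausts all of Bazel's retries.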
@ulfjack I have a reproducible example, although I haven't been able to narrow it down to a small example repro I can share here. In order to hopefully collect more information I built bazel from source and replaced the
I've been able to confirm that |
@ulfjack -- I have created a small repo that reproduces the issue by building an open source project (protobufs): https://github.com/bjacklyn/bazel-remote-download-failure-repro I can file a separate GitHub issue for this. It would be great if you can confirm that you are able to reproduce what I am seeing ^ though. There seem to be a lot of dials and knobs to turn in order to hit the issue, but if my understanding is correct the root cause seems straightforward: the channel is closed after it has been released, which causes the next caller that acquires it to fail.
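The failure mode described there (a connection closed after being returned to a pool, so the next acquirer gets a dead channel) can be sketched with a toy pool that checks liveness on acquire and discards closed channels instead of handing them out. This is purely illustrative and is not Bazel's actual gRPC channel pool.

```python
# Toy model of the described race: a channel is closed *after* release,
# so a naive pool would hand a dead channel to the next caller. This pool
# defends by dropping closed channels on acquire.
from collections import deque

class Channel:
    def __init__(self):
        self.closed = False
    def close(self):
        self.closed = True

class ChannelPool:
    def __init__(self):
        self._idle = deque()
    def release(self, ch: Channel) -> None:
        self._idle.append(ch)
    def acquire(self) -> Channel:
        while self._idle:
            ch = self._idle.popleft()
            if not ch.closed:      # drop channels closed while idle
                return ch
        return Channel()           # open a fresh channel instead of failing

pool = ChannelPool()
ch = Channel()
pool.release(ch)
ch.close()                # closed after release -- the race in question
fresh = pool.acquire()
print(fresh.closed)       # False: the dead channel was discarded
```

A pool without that liveness check would return `ch` here, and the next RPC on it would fail exactly as described.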
Bazel should run the local action to rebuild the missing entry. I will work on this. |
Hit this on 3.7.2 with
We use greenhouse from kubernetes/test-infra as a cache. It also only evicts based on last-access time, without understanding the action graph. It indeed looks like this only happens when a build is started and, while it is running, the cache evicts entries.
We are still hitting this with 4.0.0, with Google RBE gRPC remote cache:
@coeuvre is there a fix that we can cherry-pick into 4.1?
Sorry to be late. It looks like there are different issues posted on this thread.
@brentleyjones, can you please share more details about the error? (e.g. enable
@keith Do you mind if we close this issue and track it in #8250?
That was with |
Happy to track it wherever you'd prefer!
@brentleyjones Feel free to open a new issue with your problem. Let's track this issue in #8250. Closing. |
Description of the problem / feature request:
When using --experimental_remote_download_outputs=toplevel, and having a cache return a 500, Bazel crashes. I would expect it to recover in this case and fall back to retrying the request or, worst case, doing the build locally. Without this flag it does seem to be resilient to this.
Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.
bazel run TARGET when pointing at an unhealthy HTTP cache and passing --experimental_remote_download_outputs=toplevel
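As a repro aid (not part of the original report), an "unhealthy HTTP cache" can be simulated with Python's standard library: a server that answers 500 to every request. The port and URL below are arbitrary choices for illustration.

```python
# A stand-in "unhealthy" HTTP cache: every GET/PUT/HEAD gets a 500.
# Point Bazel at it with --remote_cache=http://localhost:<port>.
from http.server import BaseHTTPRequestHandler, HTTPServer
import threading

class AlwaysFailingCache(BaseHTTPRequestHandler):
    """Answers HTTP 500 to every request, mimicking an unhealthy cache."""
    def _fail(self):
        self.send_response(500)
        self.end_headers()
    do_GET = do_PUT = do_HEAD = _fail
    def log_message(self, *args):  # keep the repro output quiet
        pass

def start_unhealthy_cache(port: int = 0) -> HTTPServer:
    """Start the server in a background thread; port 0 picks a free port."""
    server = HTTPServer(("localhost", port), AlwaysFailingCache)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server

server = start_unhealthy_cache()
print(f"unhealthy cache on http://localhost:{server.server_address[1]}")
# then: bazel run TARGET --remote_cache=http://localhost:<port> \
#         --experimental_remote_download_outputs=toplevel
server.shutdown()
```

With the flag set, the report says Bazel crashes against such a cache; without it, Bazel tolerates the failures.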
What operating system are you running Bazel on?
macOS
What's the output of bazel info release?
release 0.26.0
cc @buchgr