Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bazel 5.x.x crashes if uncompressed CAS is uploaded to remote cache #14522

Open
shirchen opened this issue Jan 6, 2022 · 9 comments
Open

bazel 5.x.x crashes if uncompressed CAS is uploaded to remote cache #14522

shirchen opened this issue Jan 6, 2022 · 9 comments
Labels
help wanted Someone outside the Bazel team could own this P3 We're not considering working on this, but happy to review a PR. (No assignee) team-Remote-Exec Issues and PRs for the Execution (Remote) team type: bug

Comments

@shirchen
Copy link

shirchen commented Jan 6, 2022

Description of the problem / feature request:

When experimenting with compression (#14041), I ran into crashes if the server isn't on the same page as the client. I think these are edge cases as in this case, remote cache server hasn't fully implemented compression support, but it's possible bazel may want to guard against these and not crash fatally since we are just using remote cache.

Feature requests: what underlying problem are you trying to solve with this feature?

Running with compression flag: bazel build //... --experimental_remote_cache_compression --remote_upload_local_results=true

Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

If server claims to support zstd but doesn't, client (bazel) works fine. However, issue can be reproduced if we try and upload something that is compressed.

What operating system are you running Bazel on?

Mac OS.

If bazel info release returns "development version" or "(@non-git)", tell us how you built Bazel.

origin/release-5.0.0rc3, 5.1.0

Any other information, logs, or outputs that you want to share?

This stack is if server throws an error (expected), which I think should be warning and let bazel build locally instead of crashing.

...	
Caused by: io.grpc.StatusRuntimeException: INVALID_ARGUMENT: compressed-blobs/zstd/f7675cc879aba2e38d0d059ad60e3ed19d3d062c3fdb9b2fd37a2a8c8ed848ee/202: Url path not recognized
		at io.grpc.Status.asRuntimeException(Status.java:535)
		... 14 more
[96 / 99] GoCompilePkg src/code.uber.internal/base/io/fileutil/go_default_test_test.external.a; 0s remote-cache
FATAL: bazel crashed due to an internal error. Printing stack trace:
java.lang.RuntimeException: Unrecoverable error while evaluating node 'ActionLookupData{actionLookupKey=ConfiguredTargetKey{label=//src/code.uber.internal/base/io/fileutil:go_default_test, config=BuildConfigurationValue.Key[1bb1b4d6ce810b0badbe269c1bb76d3539cd5517823c629b93b9bf6dc6f01409]}, actionIndex=1}' (requested by nodes 'ArtifactNestedSetKey{rawChildren=[File:[[<execution_root>]bazel-out/darwin-fastbuild/bin]external/com_github_davecgh_go_spew/spew/spew.a, File:[[<execution_root>]bazel-out/darwin-fastbuild/bin]external/com_github_pmezard_go_difflib/difflib/difflib.a, File:[[<execution_root>]bazel-out/darwin-fastbuild/bin]external/in_gopkg_yaml_v3/yaml_v3.a, File:[[<execution_root>]bazel-out/darwin-fastbuild/bin]external/com_github_stretchr_testify/assert/assert.a, File:[[<execution_root>]bazel-out/darwin-fastbuild/bin]external/com_github_stretchr_testify/require/require.a, File:[[<execution_root>]bazel-out/darwin-fastbuild/bin]external/org_uber_go_atomic/atomic.a, File:[[<execution_root>]bazel-out/darwin-fastbuild/bin]external/org_uber_go_multierr/multierr.a, File:[[<execution_root>]bazel-out/darwin-fastbuild/bin]src/code.uber.internal/base/io/fileutil/go_default_test.internal.a, File:[[<execution_root>]bazel-out/darwin-fastbuild/bin]src/code.uber.internal/base/io/fileutil/go_default_test_test.external.a]}', ...)
	at com.google.devtools.build.skyframe.AbstractParallelEvaluator$Evaluate.run(AbstractParallelEvaluator.java:674)
	at com.google.devtools.build.lib.concurrent.AbstractQueueVisitor$WrappedRunnable.run(AbstractQueueVisitor.java:382)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	

and here is if server incorrectly passes back uncompressed CAS

	Caused by: java.io.IOException: Decompression error: Unknown frame descriptor
		at com.github.luben.zstd.ZstdInputStreamNoFinalizer.readInternal(ZstdInputStreamNoFinalizer.java:171)
		at com.github.luben.zstd.ZstdInputStreamNoFinalizer.read(ZstdInputStreamNoFinalizer.java:123)
		at com.github.luben.zstd.ZstdInputStream.read(ZstdInputStream.java:87)
		at com.google.protobuf.ByteString.readChunk(ByteString.java:540)
		at com.google.protobuf.ByteString.readFrom(ByteString.java:517)
		at com.google.protobuf.ByteString.readFrom(ByteString.java:485)
		at com.google.devtools.build.lib.remote.zstd.ZstdDecompressingOutputStream.write(ZstdDecompressingOutputStream.java:60)
		at com.google.devtools.build.lib.remote.zstd.ZstdDecompressingOutputStream.write(ZstdDecompressingOutputStream.java:54)
		at com.google.protobuf.ByteString$LiteralByteString.writeTo(ByteString.java:1381)
		at com.google.devtools.build.lib.remote.GrpcCacheClient$1.onNext(GrpcCacheClient.java:389)
		... 12 more
[96 / 99] GoCompilePkg src/code.uber.internal/base/io/fileutil/go_default_test_test.external.a; 0s remote-cache
FATAL: bazel crashed due to an internal error. Printing stack trace:
java.lang.RuntimeException: Unrecoverable error while evaluating node 'ActionLookupData{actionLookupKey=ConfiguredTargetKey{label=//src/code.uber.internal/base/io/fileutil:go_default_test, config=BuildConfigurationValue.Key[1bb1b4d6ce810b0badbe269c1bb76d3539cd5517823c629b93b9bf6dc6f01409]}, actionIndex=1}' (requested by nodes 'ArtifactNestedSetKey{rawChildren=[File:[[<execution_root>]bazel-out/darwin-fastbuild/bin]external/com_github_davecgh_go_spew/spew/spew.a, File:[[<execution_root>]bazel-out/darwin-fastbuild/bin]external/com_github_pmezard_go_difflib/difflib/difflib.a, File:[[<execution_root>]bazel-out/darwin-fastbuild/bin]external/in_gopkg_yaml_v3/yaml_v3.a, File:[[<execution_root>]bazel-out/darwin-fastbuild/bin]external/com_github_stretchr_testify/assert/assert.a, File:[[<execution_root>]bazel-out/darwin-fastbuild/bin]external/com_github_stretchr_testify/require/require.a, File:[[<execution_root>]bazel-out/darwin-fastbuild/bin]external/org_uber_go_atomic/atomic.a, File:[[<execution_root>]bazel-out/darwin-fastbuild/bin]external/org_uber_go_multierr/multierr.a, File:[[<execution_root>]bazel-out/darwin-fastbuild/bin]src/code.uber.internal/base/io/fileutil/go_default_test.internal.a, File:[[<execution_root>]bazel-out/darwin-fastbuild/bin]src/code.uber.internal/base/io/fileutil/go_default_test_test.external.a]}', ...)
@Wyverald
Copy link
Member

Wyverald commented Jan 7, 2022

cc @AlessandroPatti

This likely won't count as a release blocker since it's guarded behind an experimental flag and is not a regression.

@AlessandroPatti
Copy link
Contributor

@shirchen do you use --remote_download_[toplevel|minimal] (a.k.a Builds w/o the bytes)? AFAIK, Bazel is unable to recover from cache errors when it is enabled. The issue is more largely tracked in #6862

@shirchen
Copy link
Author

shirchen commented Jan 9, 2022

@AlessandroPatti no, it was just a straight build w/out bwotb. I'm familiar with bazel being unable to recover if a promised artifact is missing, however, i believe that may be fixed with recently merged rewind feature. Even then, I have never seen bazel crash with fatal and print a stack trace (usually it errors that something is missing and bazel can't handle that case), so do you mind linking to a task where using bwotb would cause a fatal and a stack trace from bazel?

Assuming that what I'm describing is an issue related to compression, I think problem here is that zstd throws an IO exception and bazel doesn't recover if a stream isn't actually zstd (throws that exception immediately).

I am not familiar with zstd library, but from UX for bazel I think warning a user that an artifact isn't compressed and then maybe trying to download it as uncompressed or just catching IO exception and moving on w/out using remote_cache would be a slight improvement. Maybe latter would be better?

@Wyverald I agree, this isn't release blocking since this feature is hidden behind an experimental flag.

@AlessandroPatti
Copy link
Contributor

@shirchen I tried reproducing the issue but I couldn't. I modified bazel-remote to return uncompressed blobs when compressed are requested and I do see the error, but it fails only if bwotb is enabled (otherwise it just shows a warning but the builds continues locally and succeeds) and the stack trace is only printed if --verbose_failures is passed.

Assuming that what I'm describing is an issue related to compression, I think problem here is that zstd throws an IO exception and bazel doesn't recover if a stream isn't actually zstd (throws that exception immediately).

This does not seem the only place where IOExceptions are wrapped in the future and returned (e.g. here, just one call below), so I would expect bazel to crash in more cases, for example if the remote cache returns an unexpected blob.

Do you have a small example I could use to reproduce the issue you are experiencing?

@oquenchil oquenchil added team-Remote-Exec Issues and PRs for the Execution (Remote) team type: bug untriaged labels Jan 11, 2022
@coeuvre coeuvre added P2 We'll consider working on this in future. (Assignee optional) more data needed and removed untriaged labels Jan 18, 2022
@sgowroji
Copy link
Member

Hello @shirchen, Could you please respond to the above comment. Thanks!

@sgowroji
Copy link
Member

sgowroji commented Jul 4, 2022

This issue has been marked as stale because it has not had any recent activity. It will be closed in 7 days, if no further activity occurs. Thank you.

@shirchen
Copy link
Author

Hello @shirchen, Could you please respond to the above comment. Thanks!

This does not seem the only place where IOExceptions are wrapped in the future and returned (e.g. here, just one call below), so I would expect bazel to crash in more cases, for example if the remote cache returns an unexpected blob.

I guess then maybe this bug can be reframed as better UX for client instead of a crash?

@sgowroji
Copy link
Member

As asked in the above comment. Can you provide an example to reproduce the issue. Thanks

@shirchen shirchen changed the title bazel 5.x.x crashes if uncompressed CAS is sent back to client bazel 5.x.x crashes if uncompressed CAS is uploaded to remote cace Jul 14, 2022
@shirchen shirchen changed the title bazel 5.x.x crashes if uncompressed CAS is uploaded to remote cace bazel 5.x.x crashes if uncompressed CAS is uploaded to remote cache Jul 14, 2022
@shirchen
Copy link
Author

@AlessandroPatti my bad, issue is when you try and upload, not download. once i added read support, and set --remote_upload_local_results=true only then did bazel start crashing with error ^. Since this only happens for uploads which shouldn't decide whether a build completes a not, this may be a legitimate issue.

@coeuvre coeuvre added P3 We're not considering working on this, but happy to review a PR. (No assignee) help wanted Someone outside the Bazel team could own this and removed P2 We'll consider working on this in future. (Assignee optional) labels Sep 7, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted Someone outside the Bazel team could own this P3 We're not considering working on this, but happy to review a PR. (No assignee) team-Remote-Exec Issues and PRs for the Execution (Remote) team type: bug
Projects
None yet
Development

No branches or pull requests

6 participants