Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CredentialHelper change broke unix domain socket support #16171

Closed
srago opened this issue Aug 26, 2022 · 10 comments
Closed

CredentialHelper change broke unix domain socket support #16171

srago opened this issue Aug 26, 2022 · 10 comments
Labels
team-Remote-Exec Issues and PRs for the Execution (Remote) team type: bug untriaged

Comments

@srago
Copy link

srago commented Aug 26, 2022

We use a unix domain socket for our remote_executor and remote_cache endpoints, because we use a side-car proxy to handle SPIFFE mTLS negotiation. This stopped working with the release of bazel 5.3.0. With 5.3.0, bazel fails before issuing its first GetCapabilitiesRequest. The error message was "ERROR: Failed to query remote execution capabilities: UNAUTHENTICATED: Failed computing credential metadata." When enabling verbose failures, the stack trace pointed to a failed precondition because the host portion of the URI was null. I played around with the code and got it working again, but the fix isn't production-quality -- it doesn't handle the case where the remote cache endpoint differs from the remote executor endpoint, and might need different credentials. Because we're using unix domain sockets, this shouldn't affect us at all. Anyway, consider the following patch a place to start the conversation about what the correct fix might look like.

index a7d793ad56..d7db02bd6f 100644
--- a/src/main/java/com/google/devtools/build/lib/remote/RemoteModule.java
+++ b/src/main/java/com/google/devtools/build/lib/remote/RemoteModule.java
@@ -1055,6 +1055,22 @@ public final class RemoteModule extends BlazeModule {
       AuthAndTLSOptions authAndTlsOptions,
       RemoteOptions remoteOptions)
       throws IOException {
+
+    // TODO: this is a temporary hack.  Because remoteCache and remoteExecutor could be different, they might need
+    // different credentials.  If the endpoints used to access remote services are unix domain sockets, credential
+    // helpers aren't needed -- the service is running on the same machine.
+    //
+    // Also note that URI("unix://...") will replace "unix" with "https" for some reason that escapes me, but that
+    // makes it tricky to use.  Somewhere (not here because of the checks in the try clause below), that was
+    // happening, because by the time findCredentialHelper() is called, these URIs have been changed from "unix://"
+    // to "https://".  According to the bazel documentation, the "unix:" scheme is supposed to disable TLS.
+    if (remoteOptions.remoteCache != null && Ascii.toLowerCase(remoteOptions.remoteCache).startsWith("unix:/")) {
+           return null;
+    }
+    if (remoteOptions.remoteExecutor != null && Ascii.toLowerCase(remoteOptions.remoteExecutor).startsWith("unix:/")) {
+           return null;
+    }
+
     Credentials credentials =
         GoogleAuthUtils.newCredentials(
             credentialHelperEnvironment, commandLinePathFactory, fileSystem, authAndTlsOptions);
@fmeum
Copy link
Collaborator

fmeum commented Aug 26, 2022

@bazel-io flag

@bazel-io bazel-io added the potential release blocker Flagged by community members using "@bazel-io flag". Should be added to a release blocker milestone label Aug 26, 2022
@tjgq
Copy link
Contributor

tjgq commented Aug 26, 2022

Sorry about that. This corner of Bazel is poorly tested; I guess something like this was bound to happen.

When enabling verbose failures, the stack trace pointed to a failed precondition because the host portion of the URI was null

Could you please share the full stack trace and the exact values of --remote_executor and --remote_cache you're using?

@srago
Copy link
Author

srago commented Aug 26, 2022

Here's the command line and stack trace. The line numbers might be off a bit -- I recompiled bazel from the head and added a couple printf statements.

$ bazel build --remote_cache=unix:///home/coder/.cache/fax/3a162c244f546dff591bc0d335e9c174/jail.sock --remote_executor=unix:///home/coder/.cache/fax/3a162c244f546dff591bc0d335e9c174/jail.sock :all --verbose_failures
Extracting Bazel installation...
Starting local Bazel server and connecting to it...
INFO: Invocation ID: 55dd5c7b-43ea-4a6f-a6cf-0c05ce656dbe
ERROR: java.io.IOException: io.grpc.StatusRuntimeException: UNAUTHENTICATED: Failed computing credential metadata
        at com.google.devtools.build.lib.remote.RemoteServerCapabilities.get(RemoteServerCapabilities.java:90)
        at com.google.devtools.build.lib.remote.RemoteModule.verifyServerCapabilities(RemoteModule.java:196)
        at com.google.devtools.build.lib.remote.RemoteModule.beforeCommand(RemoteModule.java:492)
        at com.google.devtools.build.lib.runtime.BlazeCommandDispatcher.execExclusively(BlazeCommandDispatcher.java:378)
        at com.google.devtools.build.lib.runtime.BlazeCommandDispatcher.exec(BlazeCommandDispatcher.java:232)
        at com.google.devtools.build.lib.server.GrpcServerImpl.executeCommand(GrpcServerImpl.java:550)
        at com.google.devtools.build.lib.server.GrpcServerImpl.lambda$run$1(GrpcServerImpl.java:614)
        at io.grpc.Context$1.run(Context.java:566)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.base/java.lang.Thread.run(Unknown Source)
Caused by: io.grpc.StatusRuntimeException: UNAUTHENTICATED: Failed computing credential metadata
        at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:262)
        at io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:243)
        at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:156)
        at build.bazel.remote.execution.v2.CapabilitiesGrpc$CapabilitiesBlockingStub.getCapabilities(CapabilitiesGrpc.java:215)
        at com.google.devtools.build.lib.remote.RemoteServerCapabilities.lambda$get$0(RemoteServerCapabilities.java:85)
        at com.google.devtools.build.lib.remote.ReferenceCountedChannel.lambda$withChannelBlocking$2(ReferenceCountedChannel.java:85)
        at com.google.devtools.build.lib.remote.ReferenceCountedChannel.lambda$withChannel$4(ReferenceCountedChannel.java:108)
        at io.reactivex.rxjava3.internal.operators.single.SingleUsing.subscribeActual(SingleUsing.java:59)
        at io.reactivex.rxjava3.core.Single.subscribe(Single.java:4855)
        at io.reactivex.rxjava3.internal.operators.single.SingleFlatMap$SingleFlatMapCallback.onSuccess(SingleFlatMap.java:85)
        at io.reactivex.rxjava3.internal.operators.single.SingleFlatMap$SingleFlatMapCallback$FlatMapSingleObserver.onSuccess(SingleFlatMap.java:112)
        at io.reactivex.rxjava3.internal.operators.single.SingleMap$MapSingleObserver.onSuccess(SingleMap.java:65)
        at io.reactivex.rxjava3.internal.operators.single.SingleDoOnDispose$DoOnDisposeObserver.onSuccess(SingleDoOnDispose.java:84)
        at io.reactivex.rxjava3.internal.operators.single.SingleDoOnError$DoOnError.onSuccess(SingleDoOnError.java:52)
        at io.reactivex.rxjava3.internal.operators.observable.ObservableSingleSingle$SingleElementObserver.onComplete(ObservableSingleSingle.java:110)
        at io.reactivex.rxjava3.internal.observers.DeferredScalarDisposable.complete(DeferredScalarDisposable.java:85)
        at io.reactivex.rxjava3.subjects.AsyncSubject.subscribeActual(AsyncSubject.java:233)
        at io.reactivex.rxjava3.core.Observable.subscribe(Observable.java:13176)
        at io.reactivex.rxjava3.internal.operators.observable.ObservableSingleSingle.subscribeActual(ObservableSingleSingle.java:36)
        at io.reactivex.rxjava3.core.Single.subscribe(Single.java:4855)
        at io.reactivex.rxjava3.internal.operators.single.SingleDoOnError.subscribeActual(SingleDoOnError.java:35)
        at io.reactivex.rxjava3.core.Single.subscribe(Single.java:4855)
        at io.reactivex.rxjava3.internal.operators.single.SingleDoOnDispose.subscribeActual(SingleDoOnDispose.java:38)
        at io.reactivex.rxjava3.core.Single.subscribe(Single.java:4855)
        at io.reactivex.rxjava3.internal.operators.single.SingleMap.subscribeActual(SingleMap.java:35)
        at io.reactivex.rxjava3.core.Single.subscribe(Single.java:4855)
        at io.reactivex.rxjava3.internal.operators.single.SingleFlatMap$SingleFlatMapCallback.onSuccess(SingleFlatMap.java:85)
        at io.reactivex.rxjava3.internal.operators.single.SingleCreate$Emitter.onSuccess(SingleCreate.java:68)
        at com.google.devtools.build.lib.remote.grpc.TokenBucket$1.onNext(TokenBucket.java:79)
        at io.reactivex.rxjava3.internal.util.NotificationLite.accept(NotificationLite.java:247)
        at io.reactivex.rxjava3.subjects.BehaviorSubject$BehaviorDisposable.test(BehaviorSubject.java:507)
        at io.reactivex.rxjava3.subjects.BehaviorSubject$BehaviorDisposable.emitFirst(BehaviorSubject.java:468)
        at io.reactivex.rxjava3.subjects.BehaviorSubject.subscribeActual(BehaviorSubject.java:224)
        at io.reactivex.rxjava3.core.Observable.subscribe(Observable.java:13176)
        at com.google.devtools.build.lib.remote.grpc.TokenBucket.lambda$acquireToken$0(TokenBucket.java:64)
        at io.reactivex.rxjava3.internal.operators.single.SingleCreate.subscribeActual(SingleCreate.java:40)
        at io.reactivex.rxjava3.core.Single.subscribe(Single.java:4855)
        at io.reactivex.rxjava3.internal.operators.single.SingleFlatMap.subscribeActual(SingleFlatMap.java:37)
        at io.reactivex.rxjava3.core.Single.subscribe(Single.java:4855)
        at io.reactivex.rxjava3.internal.operators.single.SingleDefer.subscribeActual(SingleDefer.java:43)
        at io.reactivex.rxjava3.core.Single.subscribe(Single.java:4855)
        at io.reactivex.rxjava3.internal.operators.single.SingleFlatMap.subscribeActual(SingleFlatMap.java:37)
        at io.reactivex.rxjava3.core.Single.subscribe(Single.java:4855)
        at io.reactivex.rxjava3.core.Single.blockingGet(Single.java:3644)
        at com.google.devtools.build.lib.remote.ReferenceCountedChannel.withChannelBlocking(ReferenceCountedChannel.java:85)
        at com.google.devtools.build.lib.remote.RemoteServerCapabilities.lambda$get$1(RemoteServerCapabilities.java:84)
        at com.google.devtools.build.lib.remote.Retrier.execute(Retrier.java:244)
        at com.google.devtools.build.lib.remote.RemoteRetrier.execute(RemoteRetrier.java:125)
        at com.google.devtools.build.lib.remote.RemoteRetrier.execute(RemoteRetrier.java:114)
        at com.google.devtools.build.lib.remote.RemoteServerCapabilities.get(RemoteServerCapabilities.java:82)
        ... 10 more
Caused by: java.lang.NullPointerException
        at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:889)
        at com.google.devtools.build.lib.authandtls.credentialhelper.CredentialHelperProvider.findCredentialHelper(CredentialHelperProvider.java:74)
        at com.google.devtools.build.lib.authandtls.credentialhelper.CredentialHelperCredentials$CredentialHelperCacheLoader.load(CredentialHelperCredentials.java:132)
        at com.google.devtools.build.lib.authandtls.credentialhelper.CredentialHelperCredentials$CredentialHelperCacheLoader.load(CredentialHelperCredentials.java:114)
        at com.github.benmanes.caffeine.cache.LocalLoadingCache.lambda$newMappingFunction$2(LocalLoadingCache.java:141)
        at com.github.benmanes.caffeine.cache.BoundedLocalCache.lambda$doComputeIfAbsent$14(BoundedLocalCache.java:2413)
        at java.base/java.util.concurrent.ConcurrentHashMap.compute(Unknown Source)
        at com.github.benmanes.caffeine.cache.BoundedLocalCache.doComputeIfAbsent(BoundedLocalCache.java:2411)
        at com.github.benmanes.caffeine.cache.BoundedLocalCache.computeIfAbsent(BoundedLocalCache.java:2394)
        at com.github.benmanes.caffeine.cache.LocalCache.computeIfAbsent(LocalCache.java:108)
        at com.github.benmanes.caffeine.cache.LocalLoadingCache.get(LocalLoadingCache.java:54)
        at com.google.devtools.build.lib.authandtls.credentialhelper.CredentialHelperCredentials.getRequestMetadataFromCredentialHelper(CredentialHelperCredentials.java:90)
        at com.google.devtools.build.lib.authandtls.credentialhelper.CredentialHelperCredentials.getRequestMetadata(CredentialHelperCredentials.java:74)
        at com.google.auth.Credentials.blockingGetToCallback(Credentials.java:112)
        at com.google.auth.Credentials$1.run(Credentials.java:98)
        at io.grpc.stub.ClientCalls$ThreadlessExecutor.waitAndDrain(ClientCalls.java:740)
        at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:149)
        ... 57 more

ERROR: Failed to query remote execution capabilities: UNAUTHENTICATED: Failed computing credential metadata

@tjgq
Copy link
Contributor

tjgq commented Aug 26, 2022

Thanks, this is helpful. It appears that new URI("unix:///foo/bar).getHost() == null and new URI("unix:/foo/bar").getHost() == null, but new URI("unix://foo/bar").getHost() == "foo". I wonder if using one or two slashes after unix: instead of three is a valid workaround. Hopefully the gRPC library isn't too picky...

That being said, I totally agree that we shouldn't crash here! I think the right fix is to replace the precondition check in findCredentialHelper with an early exit if the host part is empty/null. That way we don't have to make any assumptions about URI schemes. I'll put a fix together on Monday.

@srago
Copy link
Author

srago commented Aug 26, 2022

Regardless of the number of slashes, it used to work with versions of bazel older than 5.3. I've used it on 5.2.0, 5.1.0, 5.0.0, and 4.2.2 to name a few. But I'll look into seeing how it behaves when we have only 1 slash and let you know what I find.

@srago
Copy link
Author

srago commented Aug 26, 2022

Yep, regardless of the number of slashes, it fails at the same precondition check. (I tried 1, 2, and 3.) Also, for unix domain sockets, there is no host -- the entire remainder of the string is the pathname of the socket file, so getHost() probably should return null when the scheme is "unix:".

@Wyverald
Copy link
Member

@tjgq do you think this warrants a patch release?

tjgq added a commit to tjgq/bazel that referenced this issue Aug 29, 2022
URIs such as `unix:///...` (Unix domain socket) may legitimately be missing a
host component. We should gracefully handle them when looking for a credential
helper (no such helper can exist for them since the lookup is host-based, but
this could change in the future).

Fixes bazelbuild#16171 which is a regression in 5.3. We had some tests for this in
remote_execution_test.sh, but they were disabled due to flakiness in 8e3c0df.
tjgq added a commit to tjgq/bazel that referenced this issue Aug 29, 2022
URIs such as `unix:///...` (Unix domain socket) may legitimately be missing a
host component. We should gracefully handle them when looking for a credential
helper (no such helper can exist for them since the lookup is host-based, but
this could change in the future).

Fixes bazelbuild#16171 which is a regression in 5.3. We had some tests for this in
remote_execution_test.sh, but they were disabled due to flakiness in 8e3c0df.
tjgq added a commit to tjgq/bazel that referenced this issue Aug 29, 2022
URIs such as `unix://...` (Unix domain socket) may legitimately be missing a
host component. We should gracefully handle them when looking for a credential
helper (no such helper can exist for them since the lookup is host-based, but
this could change in the future).

Fixes bazelbuild#16171 which is a regression in 5.3. We had some tests for this in
remote_execution_test.sh, but they were disabled due to flakiness in 8e3c0df.
I've filed bazelbuild#16185 to reenable them.
@tjgq
Copy link
Contributor

tjgq commented Aug 29, 2022

@Wyverald It's a regression that affects existing users and there doesn't seem to be a workaround for it, so probably yes? (I don't know what's our patch release policy.)

@srago I'd like to go with #16184 as the fix. Could you please confirm that it works for you?

cc @Yannic - until #16185 is fixed, we should manually verify that a UDS proxy works when making changes to credential helpers.

@Wyverald
Copy link
Member

@bazel-io fork 5.3.1

@bazel-io bazel-io removed the potential release blocker Flagged by community members using "@bazel-io flag". Should be added to a release blocker milestone label Aug 29, 2022
@srago
Copy link
Author

srago commented Aug 29, 2022

Confirmed #16184 fixes the problem with Unix domain sockets. Thanks for the quick response time!

tjgq added a commit to tjgq/bazel that referenced this issue Aug 29, 2022
URIs such as `unix://...` (Unix domain socket) may legitimately be missing a
host component. We should gracefully handle them when looking for a credential
helper (no such helper can exist for them since the lookup is host-based, but
this could change in the future).

Fixes bazelbuild#16171 which is a regression in 5.3. We had some tests for this in
remote_execution_test.sh, but they were disabled due to flakiness in 8e3c0df.
I've filed bazelbuild#16185 to reenable them.
@sgowroji sgowroji added type: bug untriaged team-Remote-Exec Issues and PRs for the Execution (Remote) team labels Aug 29, 2022
tjgq added a commit to tjgq/bazel that referenced this issue Aug 29, 2022
These tests were disabled due to flakiness in 8e3c0df. As a consequence, we
introduced an undetected regression in 5.3 (bazelbuild#16171).

I've rerun the tests multiple times, both locally and remotely, and I've seen
no evidence of flakiness. Therefore, I'm declaring that this issue has
mysteriously fixed itself.

Notes:

* Use `python3` instead of `python` to run the UDS proxy, as the latter is no
  longer available in some of the environments where the tests run.
* The correct URI syntax for a Unix domain socket URI is unix:///path/to/socket
  rather than unix:/path/to/socket, as the URI technically has no host
  component. The latter form still works due to lenient parsing in the gRPC
  gRPC library, but this may change in the future, so the tests should exercise
  the correct form.

Fixes bazelbuild#16185.
ShreeM01 added a commit that referenced this issue Aug 31, 2022
URIs such as `unix://...` (Unix domain socket) may legitimately be missing a
host component. We should gracefully handle them when looking for a credential
helper (no such helper can exist for them since the lookup is host-based, but
this could change in the future).

Fixes #16171 which is a regression in 5.3. We had some tests for this in
remote_execution_test.sh, but they were disabled due to flakiness in 8e3c0df.
I've filed #16185 to reenable them.

Closes #16184.

PiperOrigin-RevId: 470724658
Change-Id: Ibd7203546f00fe2335c14e7a5fa6d5bf64868bac

Co-authored-by: Tiago Quelhas <[email protected]>
aiuto pushed a commit to aiuto/bazel that referenced this issue Oct 12, 2022
URIs such as `unix://...` (Unix domain socket) may legitimately be missing a
host component. We should gracefully handle them when looking for a credential
helper (no such helper can exist for them since the lookup is host-based, but
this could change in the future).

Fixes bazelbuild#16171 which is a regression in 5.3. We had some tests for this in
remote_execution_test.sh, but they were disabled due to flakiness in 8e3c0df.
I've filed bazelbuild#16185 to reenable them.

Closes bazelbuild#16184.

PiperOrigin-RevId: 470724658
Change-Id: Ibd7203546f00fe2335c14e7a5fa6d5bf64868bac
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
team-Remote-Exec Issues and PRs for the Execution (Remote) team type: bug untriaged
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants