Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't hang RPC when netty-tcnative .so fails to load due to (musl) linking errors #2599

Closed
garrettjonesgoogle opened this issue Jan 11, 2017 · 12 comments
Labels
Milestone

Comments

@garrettjonesgoogle
Copy link
Contributor

Please answer these questions before submitting your issue.

What version of gRPC are you using?

1.0.3

What JVM are you using (java -version)?

openjdk version "1.8.0_102"
OpenJDK Runtime Environment (build 1.8.0_102)
OpenJDK 64-Bit Server VM (build 25.102-b01, mixed mode)

What did you do?

If possible, provide a recipe for reproducing the error.

https://github.com/garrettjonesgoogle/gcloud-java/tree/deadline-exceeded-issue/google-cloud-example-docker-gradle-alpine

  1. ./gradlew jar shadowJar
  2. docker build .

Then deploy to a GCE instance and run it.

  1. sudo docker run -it YOUR_DOCKER_BUILD_ID_HERE sh
  2. java -Djava.util.logging.config.file=logging.properties -cp google-cloud-example-docker-gradle-alpine-all.jar com.google.cloud.pubsub.spi.v1.PublisherSmokeTest --project_id YOUR_PROJECT_ID_HERE

What did you expect to see?

An exception indicating that the netty dependency was unsatisfied

What did you see instead?

After the call times out, DEADLINE_EXCEEDED

Notes

If a user has a high timeout, it can take a long time for them to discover something is wrong. Then when they receive DEADLINE_EXCEEDED, they have no idea why - it doesn't guide them to the problem with the dependency. They have to know to turn on FINE logging and go log spelunking to fine the root cause. Example user-filed issue: googleapis/google-cloud-java#1430

@ejona86 ejona86 changed the title Throw exception instead of DEADLINE_EXCEEDED when netty-tcnative dependency not satisfied Don't hang RPC when netty-tcnative .so fails to load due to (musl) linking errors Jan 11, 2017
@garrettjonesgoogle
Copy link
Contributor Author

There are many other types of linking errors (besides musl) that can cause this situation too. We encountered it back when we upgraded google-cloud-java from 0.15.0 to 1.0.1 because the netty-tcnative dependency had to move in step with it, and we hadn't updated netty-tcnative.

@ejona86
Copy link
Member

ejona86 commented Jan 11, 2017

I agree that every linking error is equivalent for this issue.

@carl-mastrangelo
Copy link
Contributor

Usually such errors manifest as an exception that goes into some cubby, never to be read from. Did you get anything from logs that would indicate this happened?

@garrettjonesgoogle
Copy link
Contributor Author

Yes, if you turn on FINE logging, you will see errors like this:

FINE: Unable to load the library: netty-tcnative-linux-x86_64.
java.lang.UnsatisfiedLinkError: /tmp/libnetty-tcnative-linux-x86_646695900156633723588.so: Error loading shared library libcrypt.so.1: No such file or directory (needed by /tmp/libnetty-tcnative-linux-x86_646695900156633723588.so)
...
Jan 06, 2017 10:07:36 PM io.netty.handler.ssl.OpenSsl <clinit>
FINE: Failed to load netty-tcnative; OpenSslEngine will be unavailable, unless the application has already loaded the symbols by some other means. See http://netty.io/wiki/forked-tomcat-native.html for more information.
java.lang.IllegalArgumentException: Failed to load any of the given libraries: [netty-tcnative-linux-x86_64, netty-tcnative-linux-x86_64-fedora, netty-tcnative]
...
Jan 06, 2017 10:07:36 PM io.netty.handler.ssl.OpenSsl <clinit>
FINE: Failed to initialize netty-tcnative; OpenSslEngine will be unavailable. See http://netty.io/wiki/forked-tomcat-native.html for more information.
java.lang.UnsatisfiedLinkError: org.apache.tomcat.jni.Library.version(I)I

@carl-mastrangelo
Copy link
Contributor

@garrettjonesgoogle When building your server or chanel, GrpcSslContexts checks that either openssl is available, or that jetty alpn is. If both of those fail, it shouldn't even be possible to start a server or channel.

So, I guess the question is, how did you even get to handling RPCs if it should have failed much earlier?

@garrettjonesgoogle
Copy link
Contributor Author

The first attempt to make an RPC in the app encounters this issue. I'm not sure why there are no failures before that point; I would have expected gRPC to through an exception, but it's not.

@ejona86 ejona86 added the bug label Jan 21, 2017
@ejona86 ejona86 added this to the Next milestone Jan 21, 2017
@dklimkin
Copy link

We also face this issue, both with SSL lib or when a proto class is missing. So upvoting this bug.

@garrettjonesgoogle
Copy link
Contributor Author

I wanted to poke this issue. I verified that it still happens in gRPC 1.3.x. Here is a relevant stack trace:

Jun 01, 2017 11:48:40 PM io.grpc.internal.ChannelExecutor drain
WARNING: Runnable threw exception in ChannelExecutor
java.lang.IllegalArgumentException: Jetty ALPN/NPN has not been properly configured.
	at io.grpc.netty.GrpcSslContexts.selectApplicationProtocolConfig(GrpcSslContexts.java:174)
	at io.grpc.netty.GrpcSslContexts.configure(GrpcSslContexts.java:151)
	at io.grpc.netty.GrpcSslContexts.configure(GrpcSslContexts.java:139)
	at io.grpc.netty.GrpcSslContexts.forClient(GrpcSslContexts.java:109)
	at io.grpc.netty.NettyChannelBuilder.createProtocolNegotiatorByType(NettyChannelBuilder.java:335)
	at io.grpc.netty.NettyChannelBuilder.createProtocolNegotiator(NettyChannelBuilder.java:308)
	at io.grpc.netty.NettyChannelBuilder$NettyTransportFactory$DynamicNettyTransportParams.getProtocolNegotiator(NettyChannelBuilder.java:499)
	at io.grpc.netty.NettyChannelBuilder$NettyTransportFactory.newClientTransport(NettyChannelBuilder.java:448)
	at io.grpc.internal.CallCredentialsApplyingTransportFactory.newClientTransport(CallCredentialsApplyingTransportFactory.java:61)
	at io.grpc.internal.InternalSubchannel.startNewTransport(InternalSubchannel.java:209)
	at io.grpc.internal.InternalSubchannel.obtainActiveTransport(InternalSubchannel.java:186)
	at io.grpc.internal.ManagedChannelImpl$SubchannelImplImpl.obtainActiveTransport(ManagedChannelImpl.java:806)
	at io.grpc.internal.GrpcUtil.getTransportFromPickResult(GrpcUtil.java:568)
	at io.grpc.internal.DelayedClientTransport.reprocess(DelayedClientTransport.java:296)
	at io.grpc.internal.ManagedChannelImpl$LbHelperImpl$5.run(ManagedChannelImpl.java:724)
	at io.grpc.internal.ChannelExecutor.drain(ChannelExecutor.java:87)
	at io.grpc.internal.ManagedChannelImpl$LbHelperImpl.runSerialized(ManagedChannelImpl.java:715)
	at io.grpc.internal.ManagedChannelImpl$NameResolverListenerImpl.onUpdate(ManagedChannelImpl.java:752)
	at io.grpc.internal.DnsNameResolver$1.run(DnsNameResolver.java:174)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)

This failure does not propagate back to the initial RPC; the initial RPC eventually throws DEADLINE_EXCEEDED because the failure isn't propagated.

Is there any way to propagate this error to the RPC?

@ejona86
Copy link
Member

ejona86 commented Jun 5, 2017

@garrettjonesgoogle, that back trace is really useful. I thought the SslContext was created in the NettyChannelBuilder before/during build(). It looks like it is accidentally being delayed. We can fix that pretty easily.

@garrettjonesgoogle
Copy link
Contributor Author

@ejona86 that's great to hear!

ejona86 added a commit to ejona86/grpc-java that referenced this issue Jun 5, 2017
Creating the SslContext can throw, generally due to broken ALPN. We want
that to propagate to the caller of build(), instead of within the
channel where it could easily cause hangs.

We still delay creation until actual build() time, since TLS is not
guaranteed to work and the application may be configuring plaintext or
similar later before calling build() where SslContext is unnecessary.

The only externally-visible change should be the exception handling.
I'd add a test, but the things throwing are static and trying to inject
them would be pretty messy.

Fixes grpc#2599
@ejona86
Copy link
Member

ejona86 commented Jun 5, 2017

@garrettjonesgoogle, I just sent out #3060 which has the pretty small fix in 1c0f826. We were going to be releasing 1.4.0 tomorrow. This has been a big enough problem for users that you'd probably like us to include it in 1.4.0, right?

@ejona86 ejona86 modified the milestones: 1.4, Next Jun 5, 2017
@garrettjonesgoogle
Copy link
Contributor Author

@ejona86 absolutely - we probably get a new issue filed for this every couple weeks, and each person who encounters this probably burns tens of minutes to possibly hours trying to figure it out.

ejona86 added a commit to ejona86/grpc-java that referenced this issue Jun 5, 2017
Creating the SslContext can throw, generally due to broken ALPN. We want
that to propagate to the caller of build(), instead of within the
channel where it could easily cause hangs.

We still delay creation until actual build() time, since TLS is not
guaranteed to work and the application may be configuring plaintext or
similar later before calling build() where SslContext is unnecessary.

The only externally-visible change should be the exception handling.
I'd add a test, but the things throwing are static and trying to inject
them would be pretty messy.

Fixes grpc#2599
ejona86 added a commit that referenced this issue Jun 6, 2017
Creating the SslContext can throw, generally due to broken ALPN. We want
that to propagate to the caller of build(), instead of within the
channel where it could easily cause hangs.

We still delay creation until actual build() time, since TLS is not
guaranteed to work and the application may be configuring plaintext or
similar later before calling build() where SslContext is unnecessary.

The only externally-visible change should be the exception handling.
I'd add a test, but the things throwing are static and trying to inject
them would be pretty messy.

Fixes #2599
ejona86 added a commit to ejona86/grpc-java that referenced this issue Jun 6, 2017
Creating the SslContext can throw, generally due to broken ALPN. We want
that to propagate to the caller of build(), instead of within the
channel where it could easily cause hangs.

We still delay creation until actual build() time, since TLS is not
guaranteed to work and the application may be configuring plaintext or
similar later before calling build() where SslContext is unnecessary.

The only externally-visible change should be the exception handling.
I'd add a test, but the things throwing are static and trying to inject
them would be pretty messy.

Fixes grpc#2599
ejona86 added a commit that referenced this issue Jun 6, 2017
Creating the SslContext can throw, generally due to broken ALPN. We want
that to propagate to the caller of build(), instead of within the
channel where it could easily cause hangs.

We still delay creation until actual build() time, since TLS is not
guaranteed to work and the application may be configuring plaintext or
similar later before calling build() where SslContext is unnecessary.

The only externally-visible change should be the exception handling.
I'd add a test, but the things throwing are static and trying to inject
them would be pretty messy.

Fixes #2599
@lock lock bot locked as resolved and limited conversation to collaborators Sep 22, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

No branches or pull requests

4 participants