io.grpc.Context getting cancelled #4169
Tried v1.6.0 with the same result.
Looks like a similar binding issue to: googleapis/java-logging-logback#537
@amitgud-doordash Are you using the java agent or library instrumentation? If the latter, are you using a BatchSpanProcessor or a SimpleSpanProcessor? @jsuereth IIUC, assuming use of the BatchSpanProcessor, it should be different from the linked logback issue, since BatchSpanProcessor has its own thread and so wouldn't use a user gRPC context with a cancellation deadline.
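(For readers following along: a minimal sketch of the two setups being contrasted, using the SDK builder APIs. The class name and the OTLP exporter choice are illustrative, not taken from this thread.)

```java
import io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter;
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.export.BatchSpanProcessor;
import io.opentelemetry.sdk.trace.export.SimpleSpanProcessor;

class ProcessorSetup {
  // BatchSpanProcessor queues spans and exports them from its own worker
  // thread, so exports should not run under a request's gRPC context.
  static SdkTracerProvider batch() {
    return SdkTracerProvider.builder()
        .addSpanProcessor(
            BatchSpanProcessor.builder(OtlpGrpcSpanExporter.getDefault()).build())
        .build();
  }

  // SimpleSpanProcessor exports on the thread that ends the span, so the
  // export can inherit that thread's (possibly deadline-bound) gRPC context.
  static SdkTracerProvider simple() {
    return SdkTracerProvider.builder()
        .addSpanProcessor(SimpleSpanProcessor.create(OtlpGrpcSpanExporter.getDefault()))
        .build();
  }
}
```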
@anuraaga, I'm using the javaagent.
So, I have a reproduction that uses SimpleSpanProcessor, which we should likely fix anyway: open-telemetry/opentelemetry-java@main...jsuereth:wip-context-overlap I'm also investigating other issues that may have arisen in gRPC around context + timeout.
Shouldn't we be sending out the information that the span was cancelled? Also, when we have a log exporter, we may run into this issue there too. In my mind, it makes sense to untangle the telemetry plane (data plane?) from your "hot" network. I also suspect this is happening in BSP scenarios, given the sheer number of times I'm seeing this bug reported across projects that use Java gRPC, but I'm still working on a reproducible test case.
Any update on this? We are seeing other gRPC applications running into this.
hi @amitgud-doordash, any chance you can produce some kind of repro that we can use to troubleshoot?
I don't have a repro. But I believe @jsuereth does.
I don't have a repro that uses BatchSpanProcessor (only one using SimpleSpanProcessor), and I haven't been able to reproduce that scenario here. I still think we should detach from context in our exporters that use gRPC, even when using SimpleSpanProcessor. As for the broader question of whether this should be an issue on the agent, I can't duplicate it.
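(A minimal sketch of what "detach from context" in an exporter could look like, assuming gRPC's io.grpc.Context.ROOT, which has no deadline and no cancellation parent; the class and method names are illustrative.)

```java
import io.grpc.Context;

final class DetachedExport {
  // Run the export under the ROOT gRPC context so that a caller's
  // request-scoped deadline cannot cancel the export RPC.
  static void export(Runnable doExport) {
    Context.ROOT.run(doExport);
  }
}
```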
@amitgud-doordash I think we need a repro here - currently we can't find any reason for the cancellations when using the BatchSpanProcessor (the standard choice), so I'd like to understand what could be causing the issue. I actually suspect it's not related to exporters but is the gRPC instrumentation itself, presumably the context bridge. Perhaps there is a code path in there that needs to reset cancellation, though I can't find one through code inspection. It might not be the bridge but other interceptor code, though. Grasping at straws right now. If you're able to experiment, it would help to see if you still see the issues by setting …
otel version: 1.9.1. Basically, service B always hits this error in call chains like service A gRPC -> service B gRPC (asynchronous) -> service C.
Hi @chenkaiyin1201 - are you using the javaagent? And are you able to provide some code that can reproduce the issue? We have not been able to reproduce it so far.
Hi @anuraaga - we are using the javaagent, with links like service A http -> service B grpc -> service C grpc (asynchronous) -> service D grpc. It's basic Spring Boot code, nothing special. The error only occurs on the asynchronous gRPC hop, where we initialize ExecutorService executor = Executors.newFixedThreadPool(1);. In addition, we found that -Dotel.instrumentation.grpc.enabled=false works around this error.
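(To make the reported pattern concrete, a hypothetical sketch of the asynchronous hop described above. The class and Runnable names are invented; only the Executors.newFixedThreadPool(1) hand-off mirrors the report.)

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical handler: reply to the caller, then call the downstream
// service asynchronously from a single-threaded pool.
final class AsyncHopHandler {
  private final ExecutorService executor = Executors.newFixedThreadPool(1);

  void handle(Runnable sendResponse, Runnable callDownstream) {
    sendResponse.run();
    // The asynchronous hand-off where the reported cancellation shows up.
    executor.execute(callDownstream);
  }
}
```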
Hi @chenkaiyin1201, the above flag disables tracing for GRPC calls, right? Do you have a link that explains that flag? I've searched on this repo and on https://github.com/open-telemetry/opentelemetry-java/tree/main/sdk-extensions/autoconfigure but couldn't find related information. |
|
Hi @danhngo-lx, I need to use the gRPC instrumentation. The current workaround under consideration is to either use executor instrumentation or fork the gRPC context for passing it along.
@chenkaiyin1201 can you create a repro that shows the problem you are having? This issue is currently stuck because we need a repro in order to troubleshoot and fix it.
Hi @trask so... what kind of information should I provide?
@chenkaiyin1201 a standalone repro that can be used to see and troubleshoot the issue would be most helpful.
I downgraded from v1.9.1 to v1.1.0 as mentioned in digital-asset/daml#12568 and didn't see this error anymore. Will try to create a sample repo when I have more time.
@trask sorry for the late reply.
hi @chenkaiyin1201, can you create a github repository with a buildable/runnable repro, and give us instructions on how to generate the error you are seeing using the repro?
👋 Hi all, I was able to recreate the bug. I have a reproducer project available in this repo: ryandens/otel-grpc-context I'm hopeful I'll have time in the next few days to investigate and determine the source of the issue, but figured I'd give someone a starting point if they beat me to it. I was able to confirm that this error is not present in 1.1.0 but is present in 1.2.0.
I've had some time to eyeball this in the context of the above project, and two things jump out at me. I'm hopeful that someone more experienced with this instrumentation library can confirm or deny that these two things are odd.
Thanks a lot for the investigation! Just wanted to check my understanding: you mention "library instrumentation instantiates its own ContextStorageBridge. The GrpcSingletons class also instantiates its own ContextStorageBridge". How did you confirm both are being instantiated? The javaagent intends to be intercepting and skipping the call to …, so that's definitely not intended. Wondering how to verify that.
Ah, my mistake! I can see now that they aren't both being instantiated; I misunderstood how …
I ran a git bisect. I assume this means there is some missing logic in the …
Thanks for the bisect! It's hard for me, too, to remember the chain of code changes that connects back to that commit, but I suspect we need to call …. Or maybe this one: …. If you're able to give that a try, that would be really helpful - though it's great that we have a much better understanding now.
I tried both of those changes and neither seemed to have an impact on this bug presenting itself. I think it would be beneficial to reproduce this in a new test case in the …
After looking into this, I have a hypothesis that the instrumentation is working as intended - but in some sense, it is working too well, and perhaps we do need to change it somehow. Consider this type of code pattern inside a gRPC handler:

```java
class Greeter {
  ExecutorService executor = Executors.newFixedThreadPool(1);

  void sayHello() {
    // Complete the RPC, then kick off follow-up work on another thread.
    // (responseObserver and backend are provided elsewhere.)
    responseObserver.onCompleted();
    executor.execute(() -> backend.sayGoodbye());
  }
}
```

In strict terms, this is "incorrect" gRPC code - the gRPC context hasn't been propagated into the executor. This means the deadline is not propagated as it's supposed to be, and if using OpenCensus (gRPC's native instrumentation), the trace would be broken. Similarly, if using OpenTelemetry's gRPC library instrumentation, the trace would be broken due to the lack of context propagation. For this code to become correct, it needs to be something like the sketch below. Our problem here is that the javaagent is able to automatically propagate context for thread-pool executors. Because we sync OpenTelemetry and gRPC context, this means the gRPC deadline is also automatically propagated - so code that previously did not fail due to deadline issues will now fail. The early-return pattern we see in the repro is a trivial way of reproducing such code, but even traditional synchronous gRPC calls can fail if the client used a deadline. So the javaagent is making the app "better" by propagating the deadline to more places it should be :) But we still have a rule that the javaagent can't introduce exceptions into code that previously didn't throw them. I think I have reasoned out a workaround for the javaagent - let's see how it goes.
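(A minimal sketch of the corrected pattern, assuming gRPC's standard propagation helper io.grpc.Context.currentContextExecutor; the responseObserver and the downstream call are placeholders carried over from the pattern above.)

```java
import io.grpc.Context;
import io.grpc.stub.StreamObserver;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class Greeter {
  private final ExecutorService pool = Executors.newFixedThreadPool(1);

  void sayHello(StreamObserver<String> responseObserver, Runnable sayGoodbye) {
    responseObserver.onCompleted();
    // Wrap the pool so the task explicitly runs in the gRPC Context (and
    // deadline) that is current here, rather than relying on the agent.
    Executor contextAware = Context.currentContextExecutor(pool);
    contextAware.execute(sayGoodbye);
  }
}
```

For a single task, Context.current().wrap(runnable) does the same thing. The javaagent's executor instrumentation effectively applies this wrapping automatically, which is why the deadline suddenly starts propagating into code that never opted in.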
Thank you for the detailed analysis @anuraaga, I'll be sure to forward this on to the folks that asked me to look into this issue.
Thanks for looking into this @anuraaga and the thorough analysis! Kudos to all who worked on getting this issue fixed! 🙇
Describe the bug
Seeing gRPC context timeouts with 1.4.1 and 1.5.3.
Steps to reproduce
No specific recipe to share at this point. But this appears to be related to a few other issues with Google client APIs.
What did you expect to see?
gRPC context timeouts not being affected by OTel instrumentation.
What did you see instead?
What version are you using?
Tried with both 1.4.1 and 1.5.3 with the same result. Issue does not happen without instrumentation.
Environment
JDK 11