io.grpc.Context getting cancelled #4169

Closed
amitgud-doordash opened this issue Sep 20, 2021 · 35 comments · Fixed by #5543
Labels: bug (Something isn't working)

Comments

@amitgud-doordash
Contributor

Describe the bug
Seeing gRPC context timeouts with 1.4.1 and 1.5.3.

Steps to reproduce
No specific recipe to share at this point, but this appears to be related to a few other issues with Google client APIs.

What did you expect to see?
The gRPC context timeout should not be affected by the OTel instrumentation.

What did you see instead?

CANCELLED: io.grpc.Context was cancelled without error.
Stack: io.grpc.Status.asRuntimeException(Status.java:535),
io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:478),
interceptor-for-ClientCall.Listener<RespT>: onClose()
interceptor-for-ForwardingClientCall.SimpleForwardingClientCall<ReqT, RespT>: onClose()
...(truncated)...
(TracingClientInterceptor.java:139),
(DelayedClientCall.java:463),
(DelayedClientCall.java:427),
(DelayedClientCall.java:460),
(ClientCallImpl.java:553),
(ClientCallImpl.java:68),
(ClientCallImpl.java:212),
(ContextRunnable.java:37),

What version are you using?
Tried with both 1.4.1 and 1.5.3 with the same result. Issue does not happen without instrumentation.

Environment
JDK11

@amitgud-doordash added the bug label on Sep 20, 2021
@amitgud-doordash
Contributor Author

Tried v1.6.0 with the same result.

@jsuereth self-assigned this on Sep 21, 2021
@jsuereth
Contributor

Looks like a similar binding issue to: googleapis/java-logging-logback#537

@anuraaga
Contributor

@amitgud-doordash Are you using the javaagent or library instrumentation? If the latter, are you using a BatchSpanProcessor or a SimpleSpanProcessor?

@jsuereth IIUC, assuming use of the BatchSpanProcessor, this should be different from the linked logback issue: the BatchSpanProcessor has its own thread, so it wouldn't use a user gRPC context with a cancellation deadline.
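
For reference, a minimal sketch of the BatchSpanProcessor wiring being assumed here (the OTLP exporter is my assumption; any SpanExporter behaves the same way):

import io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter;
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.export.BatchSpanProcessor;

// BatchSpanProcessor queues ended spans and exports them from its own
// worker thread, so the export RPC never runs under a user request's
// io.grpc.Context (and therefore never under its cancellation deadline).
SdkTracerProvider tracerProvider =
    SdkTracerProvider.builder()
        .addSpanProcessor(
            BatchSpanProcessor.builder(OtlpGrpcSpanExporter.builder().build()).build())
        .build();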

@amitgud-doordash
Contributor Author

@anuraaga, I'm using the javaagent.

@jsuereth
Contributor

jsuereth commented Oct 1, 2021

So, I have a reproduction that uses SimpleSpanProcessor that we should likely fix anyway: open-telemetry/opentelemetry-java@main...jsuereth:wip-context-overlap

I'm also investigating other issues that may have arisen in gRPC around context + timeout.

@anuraaga
Contributor

anuraaga commented Oct 1, 2021

SimpleSpanProcessor is exactly the question - given the span is being exported on the hot thread anyway, wouldn't it be cancelled along with the request if the timeout is exceeded? My expectation would be yes, and either I would need to raise the timeout to accommodate hot-path span exporting or switch to the BSP - this seems quite normal.
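
To make that concrete, a minimal sketch of the SimpleSpanProcessor setup in question (the OTLP exporter is an assumption):

import io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter;
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.export.SimpleSpanProcessor;

// SimpleSpanProcessor exports each span synchronously on the thread that
// ended it. If that is a gRPC request thread carrying a deadline, the
// export RPC inherits the deadline and is cancelled along with the request.
SdkTracerProvider tracerProvider =
    SdkTracerProvider.builder()
        .addSpanProcessor(SimpleSpanProcessor.create(OtlpGrpcSpanExporter.builder().build()))
        .build();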

@jsuereth
Contributor

jsuereth commented Oct 1, 2021

Shouldn't we be sending out the information that the span was cancelled? Also, once we have a log exporter, we may run into this issue there as well. In my mind, it makes sense to disentangle the telemetry plane (data plane?) from your "hot" network path.

I also suspect this is happening in BSP scenarios, given the sheer number of times I'm seeing this bug reported across projects that use Java gRPC, but I'm still working on a reproducible test case.

@amitgud-doordash
Contributor Author

Any update on this? We are seeing other gRPC applications running into this.

@trask
Member

trask commented Nov 1, 2021

hi @amitgud-doordash, any chance you can produce some kind of repro that we can use to troubleshoot?

@amitgud-doordash
Contributor Author

hi @amitgud-doordash, any chance you can produce some kind of repro that we can use to troubleshoot?

I don't have a repro. But I believe @jsuereth does.

@jsuereth
Contributor

jsuereth commented Nov 3, 2021

I don't have a repro that uses the BatchSpanProcessor (only one using the SimpleSpanProcessor), and I haven't been able to reproduce that scenario here.

I still think we should detach from the gRPC context in our exporters, even when using the SimpleSpanProcessor. However, as for the broader question of whether this is an agent issue, I can't duplicate it.
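
As a sketch of what "detach from context" could mean here (Context.ROOT and Context.run are grpc-context APIs; doExport(spans) is a hypothetical stand-in for the exporter's send logic):

import io.grpc.Context;

// Running the export under Context.ROOT detaches it from whatever gRPC
// context (and deadline) the calling thread happens to carry.
Context.ROOT.run(() -> doExport(spans));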

@anuraaga
Contributor

anuraaga commented Nov 4, 2021

@amitgud-doordash I think we need a repro here - currently we can't find any reason for the cancellations when using the BatchSpanProcessor (the standard choice), so I'd like to understand what could be causing the issue. I actually suspect it's not related to exporters but to the gRPC instrumentation itself, presumably the context bridge:

https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/main/instrumentation/grpc-1.6/library/src/main/java/io/opentelemetry/instrumentation/grpc/v1_6/internal/ContextStorageBridge.java

Perhaps there is a code path in there that needs to reset cancellation, though I can't find one through code inspection. It might not be the bridge but other interceptor code. Grasping at straws right now.

If you're able to experiment, checking whether you still see the issue with OTEL_TRACES_EXPORTER=none would help narrow down whether the instrumentation is the problem.

@chenkaiyin1201

OTel version: 1.9.1
gRPC version: 2.9.0

Basically, service B hits this error whenever the call chain is service A gRPC -> service B gRPC (asynchronous) -> service C.

@anuraaga
Contributor

anuraaga commented Jan 5, 2022

Hi @chenkaiyin1201 - are you using the javaagent? And are you able to provide some code that can reproduce the issue? We currently have not been able to reproduce.

@chenkaiyin1201

Hi @anuraaga - we are using javaagent.

service A http -> service B grpc -> service C grpc(asynchronous) -> service D grpc links.

Basic Spring Boot code, nothing special. The only notable part is the asynchronous gRPC call, which initializes an executor with ExecutorService executor = Executors.newFixedThreadPool(1);

In addition, I found that -Dotel.instrumentation.grpc.enabled=false works around this error.

@danhngo-lx

In addition, I found that -Dotel.instrumentation.grpc.enabled=false works around this error.

Hi @chenkaiyin1201, the above flag disables tracing for gRPC calls, right? Do you have a link that explains that flag? I've searched in this repo and in https://github.com/open-telemetry/opentelemetry-java/tree/main/sdk-extensions/autoconfigure but couldn't find related information.

@trask
Member

trask commented Jan 20, 2022

the above flag disables tracing for gRPC calls, right? Do you have a link that explains that flag?

https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/main/docs/suppressing-instrumentation.md#suppressing-specific-agent-instrumentation

@chenkaiyin1201

chenkaiyin1201 commented Jan 21, 2022

Hi @danhngo-lx, I need the gRPC instrumentation enabled.
As for the cause: the asynchronous thread captures the gRPC context, but by the time it runs, the main request thread has already finished, so the context obtained via ContextStorageBridge.current() has already been cancelled.

The current options are to rely on executor instrumentation to propagate context, or to fork the gRPC context before passing it to the other thread (a sketch follows).
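
A minimal sketch of the fork-before-handoff idea (callBackend() is a hypothetical placeholder for the async work):

import io.grpc.Context;

// fork() returns a child context that keeps the parent's values but is
// not cancelled when the parent is cancelled or its deadline expires.
Context forked = Context.current().fork();
executor.execute(forked.wrap(() -> callBackend()));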

@trask
Member

trask commented Jan 21, 2022

@chenkaiyin1201 can you create a repro that shows the problem you are having? This issue is currently stuck because we need a repro in order to troubleshoot and fix it.

@chenkaiyin1201

Hi @trask, so... what kind of information should I provide?

@trask
Member

trask commented Jan 25, 2022

@chenkaiyin1201 a standalone repro that can be used to see and troubleshoot the issue would be most helpful

@danhngo-lx

I downgraded from v1.9.1 to v1.1.0 as mentioned in digital-asset/daml#12568 and didn't see this error anymore. Will try to create a sample repo when I have more time.

@chenkaiyin1201

chenkaiyin1201 commented Feb 10, 2022

@trask sorry for the late reply.
Here is a reproduction:


@GrpcService
public class GrpcTestService extends MyServiceGrpc.MyServiceImplBase {

  @Override
  public void sayHello(Request request, StreamObserver<Response> responseObserver) {
      String message = "Hello " + request.getName();
  
      Response build = Response.newBuilder().setMsg(message).build();
      responseObserver.onNext(build);
      responseObserver.onCompleted();
      receiveGreeting(message);
  }
  
  @GrpcClient("myService")
  private MyServiceGrpc.MyServiceBlockingStub myServiceBlockingStub;
  
  ExecutorService executor = Executors.newFixedThreadPool(1);
  public void receiveGreeting(String name) {
      executor.execute(() -> {
          Request build = Request.newBuilder().setName("in: " + name).build();
          myServiceBlockingStub.sayHello1(build).getMsg();
      });
  }
}

@trask
Member

trask commented Feb 22, 2022

hi @chenkaiyin1201, can you create a github repository with a buildable/runnable repro there, and give us instructions how to generate the error you are seeing using the repro?

@ryandens
Contributor

ryandens commented Mar 1, 2022

👋 Hi all, I was able to recreate the bug. I have a reproducer project available in this repo: ryandens/otel-grpc-context

I'm hopeful I'll have time in the next few days to investigate and determine the source of the issue, but figured I'd give someone a starting point if they beat me to it. I was able to confirm that this error is not present in 1.1.0 but is present in 1.2.0.

@ryandens
Contributor

ryandens commented Mar 2, 2022

I've had some time to eyeball this in the context of the above project, and two things jump out at me. I'm hopeful that someone more experienced with this instrumentation library can confirm or deny that these two things are odd:

  • The scope that is closed in ContextStorageBridge.detach(Context toDetach, Context toRestore) was changed from the scope associated with toDetach to the scope associated with toRestore: https://github.com/open-telemetry/opentelemetry-java-instrumentation/pull/2948/files#diff-a87df542768f5d808c2681aee36401ed51d879dc7f71d5b51bcc13548e4aed89R74.
  • The ContextStorageOverride class defined in the grpc-1.6 library instrumentation instantiates its own ContextStorageBridge. The GrpcSingletons class also instantiates its own ContextStorageBridge, resulting in two different instances of ContextStorageBridge being used. Update: although there are two usages of the ContextStorageBridge constructor, only one is ever instantiated (depending on how the instrumentation is applied), so this is not a potential source of the issue.

@anuraaga
Contributor

anuraaga commented Mar 3, 2022

Thanks a lot for the investigation! Just wanted to check understanding - you mention "library instrumentation instantiates its own ContextStorageBridge. The GrpcSingletons class also instantiates its own ContextStorageBridge".

How did you confirm both are being instantiated? The javaagent is intended to intercept and skip the call to Context$Storage.storage():

https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/main/instrumentation/grpc-1.6/javaagent/src/main/java/io/opentelemetry/javaagent/instrumentation/grpc/v1_6/GrpcContextInstrumentation.java#L39

So it's definitely not intended. Wondering how to verify that.

@ryandens
Contributor

ryandens commented Mar 3, 2022

Ah, my mistake! I can see now that they aren't both being instantiated, I misunderstood how ContextStorageOverride works in the presence of the GrpcContextInstrumentation.

@ryandens
Contributor

ryandens commented Mar 3, 2022

I ran a git bisect between the v1.1.0 and v1.2.0 tags and determined that a66a13f is the commit where this error first appeared. I was surprised to find that PR #2948 was not the source of the issue (in fact, it was merged after this bug started to appear).

I assume this means the ContextStorageBridge is missing some logic needed to prevent a span from leaking to the wrong thread, but I could be misunderstanding it. A fix doesn't immediately present itself to me, but I feel a bit better knowing which change caused it.

@anuraaga
Contributor

anuraaga commented Mar 3, 2022

Thanks for the bisect! It's hard for me, too, to trace the chain of code changes that connects back to that commit, but I suspect we need to call io.grpc.Context.fork() at an appropriate place. It might be this line:

https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/main/instrumentation/grpc-1.6/library/src/main/java/io/opentelemetry/instrumentation/grpc/v1_6/internal/ContextStorageBridge.java#L37

Or maybe this one:

https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/main/instrumentation/grpc-1.6/library/src/main/java/io/opentelemetry/instrumentation/grpc/v1_6/internal/ContextStorageBridge.java#L81

If you're able to give that a try, that would be really helpful. Either way, it's great that we have a much better understanding now.

@ryandens
Contributor

ryandens commented Mar 7, 2022

I tried both of those changes and neither had an impact on the bug presenting itself. I think it would be beneficial to reproduce this in a new test case in AbstractGrpcTest, but I'm not sure when I'll have the chance to do that or how feasible it is.

@anuraaga
Contributor

anuraaga commented Mar 8, 2022

Thanks @ryandens - started with #5521 to hopefully make it easier to add the repro to it

@anuraaga
Contributor

anuraaga commented Mar 11, 2022

After looking into this, I have a hypothesis that the instrumentation is working as intended - but in some sense, it is working too well and perhaps we do need to change it somehow.

When looking at this type of code pattern inside a gRPC handler:

class Greeter {
  ExecutorService executor = Executors.newFixedThreadPool(1);

  void sayHello(Request request, StreamObserver<Response> responseObserver) {
    responseObserver.onCompleted();
    executor.execute(() -> backend.sayGoodbye());
  }
}

It is, in strict terms, "incorrect" gRPC code - the gRPC context hasn't been propagated into the executor. This means the deadline is not propagated as it's supposed to be, and if using OpenCensus (gRPC's native instrumentation), the trace would be broken. Similarly, if using OpenTelemetry's gRPC library instrumentation, the trace would be broken due to the lack of context propagation.

For this code to become correct, it needs to be something like Executor executor = Context.currentContextExecutor(Executors.newFixedThreadPool(1));. With the deadline correctly propagated into the thread pool, the backend request would fail with the cancellation exception, no OpenTelemetry involved; presumably you would call Context.fork() in the business logic to fix this business-logic bug. Note that with such a change, adding the OpenTelemetry library instrumentation would also just work - we sync the context between gRPC and OpenTelemetry precisely so that people can add the library instrumentation interceptors and have applications that use gRPC context propagation work out of the box, especially to help with OpenCensus migration.
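
A minimal sketch of that corrected pattern (backend.sayGoodbye() is the placeholder call from the snippet above):

import io.grpc.Context;
import java.util.concurrent.Executor;
import java.util.concurrent.Executors;

class Greeter {
  // currentContextExecutor attaches the caller's gRPC Context (deadline
  // included) around every task - the "correct" gRPC propagation.
  final Executor executor =
      Context.currentContextExecutor(Executors.newFixedThreadPool(1));

  void afterResponse() {
    // This work is meant to outlive the request, so fork() first: the
    // child context keeps the values but drops the parent's cancellation.
    executor.execute(Context.current().fork().wrap(() -> backend.sayGoodbye()));
  }
}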

Our problem here is that the javaagent is able to automatically propagate context across thread-pool executors. Because we sync OpenTelemetry and gRPC context, the gRPC deadline is also automatically propagated - so code that previously did not fail due to deadline issues will now fail. The early-return pattern we see in the repro is a trivial way of reproducing such code, but even traditional synchronous gRPC calls can fail if the client used .withDeadline() and the deadline got propagated to a location where it wasn't intended. Usually it is intended, though.
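
For instance, a sketch using the blocking stub from the repro above (request and the Response type are assumed from that proto):

import java.util.concurrent.TimeUnit;

// The client-set deadline rides along in the gRPC context on the server.
// Once the javaagent also propagates context into executors, async work
// started by the handler inherits it - and can be cancelled by it - too.
Response response = myServiceBlockingStub
    .withDeadlineAfter(500, TimeUnit.MILLISECONDS)
    .sayHello(request);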

So the javaagent is making the app "better" by propagating the deadline to more of the places it should reach :) But we still have a rule that the javaagent can't introduce exceptions into code that previously didn't throw them. I think I have reasoned out a workaround for the javaagent - let's see how it goes.

@ryandens
Contributor

Thank you for the detailed analysis @anuraaga, I'll be sure to forward this on to the folks that asked me to look into this issue.

@amitgud-doordash
Contributor Author

Thanks for looking into this @anuraaga and the thorough analysis! Kudos to all who worked on getting this issue fixed! 🙇
