
StatusException: UNAVAILABLE on client when maxConnectionAge is set on server #9566

Open
usischev opened this issue Sep 27, 2022 · 20 comments

@usischev

What version of gRPC-Java are you using?

1.49.0
Also grpc-kotlin 1.3.0

What is your environment?

macOS 12.5.1 with OpenJDK 17.0.2
Ubuntu 18.04.3 with OpenJDK 17.0.3

What did you expect to see?

The gRPC NettyServer is configured with a maxConnectionAge of 1 second and a maxConnectionAgeGrace of 300 seconds. Client sends requests with large response size (4 MB). Client receives responses successfully.

What did you see instead?

Eventually the client throws io.grpc.StatusException: UNAVAILABLE after receiving a GOAWAY from the server. We first saw this behavior with large request sizes in our production and testing environments running on Linux. Then we were able to reproduce it using a minimal project with a large response size on macOS.

Steps to reproduce the bug
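As a rough, self-contained sketch of the setup described above (not the code from the minimal project; it assumes the helloworld Greeter service that appears in the stack traces later in the thread, and the payload contents and service wiring are illustrative):

import io.grpc.ManagedChannelBuilder
import io.grpc.examples.helloworld.GreeterGrpc
import io.grpc.examples.helloworld.HelloReply
import io.grpc.examples.helloworld.HelloRequest
import io.grpc.netty.NettyServerBuilder
import io.grpc.stub.StreamObserver
import java.util.concurrent.TimeUnit

// Server side: maxConnectionAge of 1 second, maxConnectionAgeGrace of 300 seconds,
// and a Greeter implementation that returns a ~4 MB response message.
fun startServer(port: Int) = NettyServerBuilder.forPort(port)
    .maxConnectionAge(1, TimeUnit.SECONDS)
    .maxConnectionAgeGrace(300, TimeUnit.SECONDS)
    .addService(object : GreeterGrpc.GreeterImplBase() {
        override fun sayHello(request: HelloRequest, responseObserver: StreamObserver<HelloReply>) {
            val payload = "x".repeat(4_194_000)  // large response body, matching the logged length
            responseObserver.onNext(HelloReply.newBuilder().setMessage(payload).build())
            responseObserver.onCompleted()
        }
    })
    .build()
    .start()

// Client side: blocking unary calls in a nested loop; with the settings above the
// UNAVAILABLE status is eventually thrown from sayHello().
fun main() {
    val port = 50051
    val server = startServer(port)
    val channel = ManagedChannelBuilder.forAddress("localhost", port).usePlaintext().build()
    val stub = GreeterGrpc.newBlockingStub(channel)
    repeat(10) { i ->
        repeat(100) { j ->
            val reply = stub.sayHello(HelloRequest.newBuilder().setName("test").build())
            println("$i $j ${reply.message.length}")
        }
    }
    channel.shutdownNow()
    server.shutdownNow()
}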

@sanjaypujare
Contributor

I looked at the logs but I don't see any server-side logs. Also, in "Client sends requests with large response size (4 MB)", did you by any chance mean to say "Client sends requests with large request message size (4 MB)"?

@usischev
Author

I looked at the logs but I don't see any server-side logs.

The logs in the minimal project are indeed client-side only, and are there primarily to show loop progress and exceptions with stack traces. In this project, the server runs in the same process as the client; both are created and started in Main.kt. What kind of logs would help to investigate the problem?

Also, in "Client sends requests with large response size (4 MB)", did you by any chance mean to say "Client sends requests with large request message size (4 MB)"?

I meant to say: the client sends requests, and to each request the server sends a response with a large message size (4 MB). This is what the minimal project tests.

We also observed the same exception with different code, so it also happens when the client sends requests with a large request message size. Again, it is not thrown for every request, but eventually the exception occurs. This case is not reflected in the minimal project.

@sanjaypujare
Contributor

Okay, I am taking a look at your repo, the source code and the logs. It does look like there's a bug.

One question: in the logs I see lines like

0 6 4194000

Can you explain what each of the numbers is? When does the io.grpc.StatusRuntimeException: UNAVAILABLE happen after the RPC is started? Is there a timestamp?

@usischev
Author

One question: in the logs I see lines like

0 6 4194000

Can you explain what each of the numbers is?

There is a nested loop in the code:

repeat(10) { i ->
    repeat(100) { j ->
        ...  // the body makes a blocking unary call and prints i, j and the response length
    }
}

The first two numbers are i and j, showing the loop iteration. The third number is the response message length, which is always 4194000 with this code.

When does the io.grpc.StatusRuntimeException: UNAVAILABLE happen after the RPC is started? Is there a timestamp?

There is no timestamp; the exception happens at different iteration numbers on different runs of the same code. I had some runs that did not throw an exception at all, but most of them do, sooner or later.

@sanjaypujare
Contributor

the exception happens at different iteration numbers on different runs of the same code

And the exception happens long before the maxConnectionAgeGrace of 300 seconds (5 minutes) elapses, is that right? In that case this is most probably an issue. CC @ejona86

@usischev
Author

And the exception happens long before the maxConnectionAgeGrace of 300 seconds (5 minutes) elapses, is that right? In that case this is most probably an issue. CC @ejona86

That is right. I have updated the code to print timestamps, and here is the resulting output. The exception happened less than 5 seconds after the first request.

2022-09-28T20:51:41.841390 STARTING
2022-09-28T20:51:42.880495 0 0 4194000
2022-09-28T20:51:43.187745 0 1 4194000
2022-09-28T20:51:43.383800 0 2 4194000
2022-09-28T20:51:43.579963 0 3 4194000
2022-09-28T20:51:43.746368 0 4 4194000
2022-09-28T20:51:43.908364 0 5 4194000
2022-09-28T20:51:44.067139 0 6 4194000
2022-09-28T20:51:44.231042 0 7 4194000
2022-09-28T20:51:44.396685 0 8 4194000
2022-09-28T20:51:44.559553 0 9 4194000
2022-09-28T20:51:44.717476 0 10 4194000
2022-09-28T20:51:44.876554 0 11 4194000
2022-09-28T20:51:45.027898 0 12 4194000
2022-09-28T20:51:45.201945 0 13 4194000
Exception in thread "main" io.grpc.StatusRuntimeException: UNAVAILABLE: Connection closed after GOAWAY. HTTP/2 error code: NO_ERROR, debug data: max_age
	at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:271)
	at io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:252)
	at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:165)
	at io.grpc.examples.helloworld.GreeterGrpc$GreeterBlockingStub.sayHello(GreeterGrpc.java:178)
	at org.example.grpc.poc.MainKt.blockingRequestCycle(Main.kt:71)
	at org.example.grpc.poc.MainKt.main(Main.kt:110)
	at org.example.grpc.poc.MainKt.main(Main.kt)

@sanjaypujare
Contributor

Okay, so I looked at the code and I realized this is working as expected with your values of maxConnectionAge of 1 second and maxConnectionAgeGrace of 300 seconds. Note that once 1 second has elapsed, the server will only allow existing RPCs (those that were started before the second elapsed) to finish; new RPCs after that 1 second may not be allowed. In this case your RPCs were allowed for about 4 to 5 seconds, which is beyond the 1 second window, but I don't think that's an issue for you. Do you agree?

@ejona86
Member

ejona86 commented Sep 28, 2022

In this case your RPCs were allowed for about 4 to 5 seconds

Why do you say that? The original connection was probably already closed; we're looking at the 2nd or 3rd connection. And each connection should be getting a graceful GOAWAY, which doesn't cause RPCs to fail due to the network race. There are internal gRPC races, but those should be handled by transparent retry.

In any case, this RPC seems like it was accepted by the server, because we use a different error message for RPCs rejected by GOAWAY (it generally mentions "abrupt" goaway). This message simply means there was a GOAWAY in the past for the connection, and then later the connection happened to be closed.

@ejona86
Member

ejona86 commented Oct 4, 2022

I'm wondering if this is caused by a TCP RST getting received before the trailers have been processed.

I used sudo tcpdump -i lo -B 10000 tcp port 50051 -w out.pcap -W 5 -C 20 to capture, and mergecap to concatenate the capture files when necessary.

ejona-trial1.pcap.gz
ejona-trial2.pcap.gz

@ejona86 ejona86 added the bug label Oct 4, 2022
@ejona86 ejona86 added this to the Next milestone Dec 20, 2022
@racevedoo

I'm seeing the same issue in our prod environment. Any updates here?

@laurovenancio

@ejona86 I've debugged this problem using the provided reproducer, and I can confirm that channelInactive is sometimes called before the final bytes of some streams are processed.

When I move the call to super.channelInactive(ctx) to the beginning of the function, I am not able to reproduce the problem anymore.
https://github.com/grpc/grpc-java/blob/master/netty/src/main/java/io/grpc/netty/NettyClientHandler.java#L462

@ejona86
Member

ejona86 commented Apr 14, 2023

@laurovenancio, I'm suspicious of that change. super.channelInactive() is an inbound event, so it calls fireChannelInactive(), and there are no other handlers later in the pipeline. That means it shouldn't matter at all when it is called, because it will be a no-op. I suspect that it just changed the timing slightly to hide the race, although that doesn't make full sense to me, since we have been able to semi-reliably hit the race and such a small change seems unlikely to change the race to that degree...

@laurovenancio

@ejona86 It doesn't seem to be just a timing issue, because I am still able to consistently reproduce the issue when I add a 100 ms sleep at the beginning of the channelInactive() function.

The following stack trace shows that the super.channelInactive() call closes the HTTP/2 base decoder, which ensures that all streams are properly closed.

"Test worker@1" prio=5 tid=0x1 nid=NA runnable
  java.lang.Thread.State: RUNNABLE
	  at io.grpc.netty.NettyClientHandler$2.onStreamClosed(NettyClientHandler.java:308)
	  at io.netty.handler.codec.http2.DefaultHttp2Connection.notifyClosed(DefaultHttp2Connection.java:357)
	  at io.netty.handler.codec.http2.DefaultHttp2Connection$ActiveStreams.removeFromActiveStreams(DefaultHttp2Connection.java:1007)
	  at io.netty.handler.codec.http2.DefaultHttp2Connection$ActiveStreams.deactivate(DefaultHttp2Connection.java:963)
	  at io.netty.handler.codec.http2.DefaultHttp2Connection$DefaultStream.close(DefaultHttp2Connection.java:515)
	  at io.netty.handler.codec.http2.DefaultHttp2Connection.close(DefaultHttp2Connection.java:153)
	  at io.netty.handler.codec.http2.Http2ConnectionHandler$BaseDecoder.channelInactive(Http2ConnectionHandler.java:217)
	  at io.netty.handler.codec.http2.Http2ConnectionHandler.channelInactive(Http2ConnectionHandler.java:432)
	  at io.grpc.netty.NettyClientHandler.channelInactive(NettyClientHandler.java:434)
	  at io.grpc.netty.NettyClientHandlerTest.channelShutdownShouldFailInFlightStreams(NettyClientHandlerTest.java:627)

So it seems to me that the root cause of this issue is the call to clientStream.transportReportStatus() in channelInactive(), while Netty provides no guarantee that all streams are finished at that point.

It seems that, to fix it, we have to move the clientStream.transportReportStatus() call from channelInactive() to onStreamClosed().

@ejona86
Member

ejona86 commented Apr 14, 2023

Oh, right, we are extending that HTTP/2 base class, so the call flow becomes complicated. onStreamClosed() doesn't help us at all, because the RPC will still fail but have even less information about why it was closed.

The issue at hand is "why are there active streams when the channel becomes inactive (TCP connection closes)?" And the RST I found is a smoking gun for that. That changes the question to "why is RST getting sent/how can we avoid the RST?"

When I move the call to super.channelInactive(ctx) to the beginning of the function, I am not able to reproduce the problem anymore.

I'd have to look very closely, but right now I'd be worried that such a change orphans some streams and they never get closed. So you might not be seeing an error any more because they hang instead.

@laurovenancio

The issue at hand is "why are there active streams when the channel becomes inactive (TCP connection closes)?" And the RST I found is a smoking gun for that. That changes the question to "why is RST getting sent/how can we avoid the RST?"

The provided reproducer shuts down both the client and the server when the issue is detected. I believe the RST we see in the tcpdump comes from this shutdown process and not from the issue itself.

I've modified the reproducer to always run until the end. After this change, no RST is sent when the problem is detected.

@laurovenancio

When I move the call to super.channelInactive(ctx) to the beginning of the function, I am not able to reproduce the problem anymore.

I'd have to look very closely, but right now I'd be worried that such a change orphans some streams and they never get closed. So you might not be seeing an error any more because they hang instead.

I agree that just moving the call to super.channelInactive() is not a proper fix. It was just a test. For instance, it breaks the channelShutdownShouldFailInFlightStreams test because transportReportStatus() is never called in some cases.

I've validated that all gRPC calls finished correctly, so there was no hang.

@laurovenancio

@ejona86 You are right. Moving the clientStream.transportReportStatus() call makes no difference, and there is an RST packet.

After debugging a little more, I found the following timeline:

  1. Server sends the first GOAWAY and a ping
  2. Client stops creating new streams and replies to the ping
  3. Server sends the second GOAWAY
  4. When all active streams are done, the server closes the connection
  5. Before receiving the final bytes of some streams, the client sends a WINDOW_UPDATE frame
  6. As the connection is already closed, the kernel sends an RST packet and discards all buffered data

This kind of issue has already been described for other HTTP/2 implementations. For instance, it was fixed by the following patch in the golang HTTP/2 implementation.

golang/net@cd69bc3

@ejona86
Member

ejona86 commented Apr 18, 2023

Oh, I really like that theory. Although looking at ejona-trial1.pcap.gz above, it doesn't entirely fit, as the client wasn't sending HTTP/2 frames. The client does send a TCP ACK, though, and it didn't ACK all the remaining data. So I think the general idea is likely right.

Delaying a second seems fair, in terms of complexity. I'm not quite sure where we'd want to make this change: in Netty or in gRPC.
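For illustration only, the shape of that delay in Netty terms might look like the sketch below (a hypothetical helper, not the actual gRPC or Netty change): after the second GOAWAY, once the last active stream completes, the close is scheduled slightly later, so that late client frames such as the WINDOW_UPDATE from the timeline above are read before the FIN instead of hitting an already-closed socket.

import io.netty.channel.ChannelHandlerContext
import java.util.concurrent.TimeUnit

// Hypothetical sketch: instead of closing the connection immediately after the last
// active stream completes (post-GOAWAY), schedule the close a moment later so that
// in-flight client frames (e.g. WINDOW_UPDATE) arrive before the FIN and the kernel
// does not answer them with an RST on an already-closed socket.
fun scheduleDelayedClose(ctx: ChannelHandlerContext, delaySeconds: Long = 1) {
    ctx.channel().eventLoop().schedule(Runnable { ctx.close() }, delaySeconds, TimeUnit.SECONDS)
}

Whether such a delay would live in Netty's HTTP/2 handler or in gRPC's Netty transport is exactly the open question above.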

@jinseo-jang

Was this issue resolved?

@glyiu98

glyiu98 commented Jul 23, 2024

Hello, is a fix planned for this one?
