
ASP.NET core GRPC server can terminate connections when MAX_CONCURRENT_STREAMS is breached #17484

Closed
JamesNK opened this issue Nov 29, 2019 · 14 comments
Labels
area-networking · Author: Migration Bot 🤖 (created by an issue mover bot; the author may not be the actual author) · Done (this issue has been fixed) · feature-kestrel · HTTP2

Comments

@JamesNK
Member

JamesNK commented Nov 29, 2019

Originally created at grpc/grpc-dotnet#673 by @djluck


What version of gRPC and what language are you using?

grpc=v2.24.0
lang=C#

What operating system (Linux, Windows,...) and version?

Windows, v10.0.18362 Build 18362

What runtime / compiler are you using (e.g. .NET Core SDK version dotnet --info)

v3.0.100

What did you do?

When issuing many requests on a fresh/idle connection, it's possible to exceed the server's MAX_CONCURRENT_STREAMS setting, causing the connection to fault and failing streams that were otherwise being processed successfully.
Exceeding MAX_CONCURRENT_STREAMS on a fresh/idle connection should be supported, as the initial value is assumed to be unlimited until a peer has communicated its preferred limit.

This pseudo-code highlights how to reproduce the issue on the client side:

using System;
using System.Linq;
using System.Threading.Tasks;
using Grpc.Health.V1;
using Grpc.Net.Client;

// This test fails about 1/3 of the time (yuk). A bit hacky, but let's repeat the test a number of times
// to verify everything is working correctly.
for (int i = 0; i < 5; i++)
{
    // Must use a fresh channel for each iteration of the test. The gRPC server will indicate its max concurrency
    // limit after the first request successfully completes. However, we are interested in making sure large bursts
    // on a connection that has not discovered this limit work as expected.
    using var c = GrpcChannel.ForAddress(new Uri("https://localhost:5001"));
    var client = new Health.HealthClient(c);

    var tasks = Enumerable.Range(1, 5)
        .Select(x => Task.Run(async () => await client.CheckAsync(new HealthCheckRequest())))
        .ToArray();

    await Task.WhenAll(tasks);
}

MAX_CONCURRENT_STREAMS should be set to a low value in the ASP.NET Core gRPC server (in appsettings.json):

{
  "Kestrel": {
    "Limits": {
      "Http2" : {
        "MaxStreamsPerConnection" : 2
      }
    }
  }
}

What did you expect to see?

The streams that exceed the MAX_CONCURRENT_STREAMS limit should be closed, and any DATA frames received by the server for these streams should result in a STREAM_CLOSED error, as per the HTTP/2 RFC (RFC 7540).

What did you see instead?

Streams that exceeded the MAX_CONCURRENT_STREAMS limit had DATA frames in flight, which resulted in the entire TCP connection being torn down:

[10:13:55 VRB] Connection id "0HLRKB2HNVILT" received SETTINGS frame for stream ID 0 with length 12 and flags NONE
[10:13:55 VRB] Connection id "0HLRKB2HNVILT" sending SETTINGS frame for stream ID 0 with length 0 and flags ACK
[10:13:55 VRB] Connection id "0HLRKB2HNVILT" received WINDOW_UPDATE frame for stream ID 0 with length 4 and flags 0x0
[10:13:55 VRB] Connection id "0HLRKB2HNVILT" received HEADERS frame for stream ID 1 with length 139 and flags END_HEADERS
[10:13:55 VRB] Connection id "0HLRKB2HNVILT" received HEADERS frame for stream ID 3 with length 139 and flags END_HEADERS
[10:13:55 VRB] Connection id "0HLRKB2HNVILT" received HEADERS frame for stream ID 5 with length 139 and flags END_HEADERS
[10:13:55 DBG] Connection id "0HLRKB2HNVILT": HTTP/2 stream error.
Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http2.Http2StreamErrorException: HTTP/2 stream ID 5 error (REFUSED_STREAM): A new stream was refused because this connection has reached its stream limit.
   at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http2.Http2Connection.StartStream()
   at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http2.Http2Connection.DecodeHeadersAsync(Boolean endHeaders, ReadOnlySequence`1& payload)
   at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http2.Http2Connection.ProcessHeadersFrameAsync[TContext](IHttpApplication`1 application, ReadOnlySequence`1& payload)
   at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http2.Http2Connection.ProcessFrameAsync[TContext](IHttpApplication`1 application, ReadOnlySequence`1& payload)
   at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http2.Http2Connection.ProcessRequestsAsync[TContext](IHttpApplication`1 application)
[10:13:55 INF] Request starting HTTP/2 POST https://localhost:5001/grpc.health.v1.Health/Check application/grpc
[10:13:55 INF] Request starting HTTP/2 POST https://localhost:5001/grpc.health.v1.Health/Check application/grpc
[10:13:55 VRB] Connection id "0HLRKB2HNVILT" sending RST_STREAM frame for stream ID 5 with length 4 and flags 0x0
[10:13:55 VRB] Connection id "0HLRKB2HNVILT" received HEADERS frame for stream ID 7 with length 139 and flags END_HEADERS
[10:13:55 DBG] Connection id "0HLRKB2HNVILT": HTTP/2 stream error.
Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http2.Http2StreamErrorException: HTTP/2 stream ID 7 error (REFUSED_STREAM): A new stream was refused because this connection has reached its stream limit.
   at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http2.Http2Connection.StartStream()
   at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http2.Http2Connection.DecodeHeadersAsync(Boolean endHeaders, ReadOnlySequence`1& payload)
   at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http2.Http2Connection.ProcessHeadersFrameAsync[TContext](IHttpApplication`1 application, ReadOnlySequence`1& payload)
   at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http2.Http2Connection.ProcessFrameAsync[TContext](IHttpApplication`1 application, ReadOnlySequence`1& payload)
   at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http2.Http2Connection.ProcessRequestsAsync[TContext](IHttpApplication`1 application)
[10:13:55 VRB] Connection id "0HLRKB2HNVILT" sending RST_STREAM frame for stream ID 7 with length 4 and flags 0x0
[10:13:55 VRB] Connection id "0HLRKB2HNVILT" received HEADERS frame for stream ID 9 with length 139 and flags END_HEADERS
[10:13:55 DBG] Connection id "0HLRKB2HNVILT": HTTP/2 stream error.
Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http2.Http2StreamErrorException: HTTP/2 stream ID 9 error (REFUSED_STREAM): A new stream was refused because this connection has reached its stream limit.
   at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http2.Http2Connection.StartStream()
   at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http2.Http2Connection.DecodeHeadersAsync(Boolean endHeaders, ReadOnlySequence`1& payload)
   at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http2.Http2Connection.ProcessHeadersFrameAsync[TContext](IHttpApplication`1 application, ReadOnlySequence`1& payload)
   at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http2.Http2Connection.ProcessFrameAsync[TContext](IHttpApplication`1 application, ReadOnlySequence`1& payload)
   at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http2.Http2Connection.ProcessRequestsAsync[TContext](IHttpApplication`1 application)
[10:13:55 VRB] Connection id "0HLRKB2HNVILT" sending RST_STREAM frame for stream ID 9 with length 4 and flags 0x0
[10:13:55 VRB] Connection id "0HLRKB2HNVILT" received SETTINGS frame for stream ID 0 with length 0 and flags ACK
[10:13:55 VRB] Connection id "0HLRKB2HNVILT" received DATA frame for stream ID 9 with length 5 and flags NONE
[10:13:55 DBG] Connection id "0HLRKB2HNVILT": HTTP/2 connection error.
Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http2.Http2ConnectionErrorException: HTTP/2 connection error (STREAM_CLOSED): The client sent a DATA frame to closed stream ID 9.
   at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http2.Http2Connection.ProcessDataFrameAsync(ReadOnlySequence`1& payload)
   at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http2.Http2Connection.ProcessFrameAsync[TContext](IHttpApplication`1 application, ReadOnlySequence`1& payload)
   at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http2.Http2Connection.ProcessRequestsAsync[TContext](IHttpApplication`1 application)
[10:13:55 DBG] Connection id "0HLRKB2HNVILT" is closed. The last processed stream ID was 9.

The impact of faulting the entire connection is that some requests may have been processed successfully and their CancellationToken may not trigger; this means it's impossible to recover from this kind of error gracefully.

Anything else we should know about your project / environment?

@JamesNK
Member Author

JamesNK commented Nov 29, 2019

cc @Tratcher @halter73

@Tratcher
Member

From the logs the extra streams do appear to be handled correctly (REFUSED_STREAM), it's only the data frames that cause an issue.
Http2ConnectionErrorException: HTTP/2 connection error (STREAM_CLOSED): The client sent a DATA frame to closed stream ID 9.

This looks like a duplicate of #13405 RE: gracefully draining data from refused or canceled streams.

MAX_CONCURRENT_STREAMS should be set to a low value in the ASP.NET grpc server (in appsettings.json)
"MaxStreamsPerConnection" : 2

How readily does this repro with the default value of 100? The spec discourages anything less:
It is recommended that this value be no smaller than 100, so as to not unnecessarily limit parallelism.

@JamesNK
Member Author

JamesNK commented Nov 29, 2019

Yes, I thought of that issue as soon as I saw this.

I wasn't aware the server terminated the connection when this happened. Killing the connection and all other in-progress streams on it is a pretty big side effect.

@djluck

djluck commented Nov 29, 2019

From the logs the extra streams do appear to be handled correctly (REFUSED_STREAM), it's only the data frames that cause an issue.

Hey @Tratcher, your summary is correct: it is only the data frames that cause the issue. However, because the client seems to transmit the header and data frames almost simultaneously, it has no ability to react to the RST_STREAM and "take back" the data frames.

How readily does this repro with the default value of 100?

This has been causing issues consistently in a production environment that uses the default value of 100; it was just easier to analyze the logs with the lower setting of 2. Our use case may see no gRPC requests sent for hours and then bursts of 1-2k req/sec, causing one request to fail with RST_STREAM and every other in-flight request to fail with a GOAWAY error.

@Tratcher
Member

and then bursts of 1-2k req/ sec

On a single connection?

@JamesNK Yeah, it's ugly. So far it's always been triggered by another bug so the triggers were higher priority to fix. This new scenario looks like the first trigger that isn't a bug itself. The client is aggressive but still technically within the spec.

@djluck

djluck commented Nov 29, 2019

@Tratcher correct. To add some further details about our production setup:

  • We use Grpc.Core, not Grpc.Net.Client, but we still observe the same set of errors as the reproduction code.
  • The bug can occur on idle TCP connections, not just new connections (our TCP connections time out after a few minutes of inactivity). This failure mode may be specific to Grpc.Core: it does not cache MAX_CONCURRENT_STREAMS values for gRPC peers, so when a connection times out and is re-established, the value has to be re-discovered.
  • Not every request in the 1-2k burst may fail; only requests in flight (bytes sent on the wire before receiving the new MAX_CONCURRENT_STREAMS setting/GOAWAY response) will fail. However, this is often the majority (1k+ requests); my guess is that our request bodies are small and don't take long to send to the server.

@Tratcher
Member

  • Bug can occur on idle TCP connections, not just new connections

How could that happen? The settings are negotiated at the beginning of the connection. Is the client not honoring the limit if there's a burst of requests?

@djluck

djluck commented Nov 29, 2019

@Tratcher I did modify my above comment but I'll expand upon it here.

When I say "idle" connections, I'm talking about TCP connections that have been terminated after a period of inactivity. In our production setup, our primary source of "idle" terminations is the load balancer we use (it ignores TCP keepalives). When idle connections are closed, Grpc.Core does not cache any discovered MAX_CONCURRENT_STREAMS setting for peers; it must be re-discovered for each new connection, allowing the bug described in this issue to occur.

After some quick testing, Grpc.Net.Client seems to behave the same way: by setting SocketsHttpHandler.PooledConnectionIdleTimeout to a short value, sending a single request, and then waiting for the connection to idle before sending a burst of requests, I was able to trigger the bug.
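
For illustration, a minimal sketch of that repro, assuming the channel is given an HttpClient backed by a SocketsHttpHandler (the timeout, delay, and request count are illustrative, not part of the original report):

using System;
using System.Linq;
using System.Net.Http;
using System.Threading.Tasks;
using Grpc.Health.V1;
using Grpc.Net.Client;

// Force the pooled HTTP/2 connection to idle out quickly so the next burst of
// requests starts on a fresh connection that has not yet processed the
// server's MAX_CONCURRENT_STREAMS setting.
var handler = new SocketsHttpHandler
{
    PooledConnectionIdleTimeout = TimeSpan.FromSeconds(5) // illustrative value
};

using var channel = GrpcChannel.ForAddress(
    "https://localhost:5001",
    new GrpcChannelOptions { HttpClient = new HttpClient(handler) });
var client = new Health.HealthClient(channel);

// Warm-up request establishes the first connection and discovers the limit.
await client.CheckAsync(new HealthCheckRequest());

// Wait long enough for the pooled connection to be closed as idle.
await Task.Delay(TimeSpan.FromSeconds(10));

// The burst lands on a brand-new connection and can exceed the server's
// stream limit before its SETTINGS frame has been applied.
await Task.WhenAll(Enumerable.Range(0, 200)
    .Select(_ => client.CheckAsync(new HealthCheckRequest()).ResponseAsync));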

@Tratcher
Member

Ah, idle connections aren't really special then, they're just closed so a new one gets automatically created.

I'm not aware of any provision in the spec for a client to cache settings across connections. It makes sense that both clients start with a clean slate on each connection.

@analogrelay added the Author: Migration Bot 🤖 label Dec 2, 2019
@analogrelay analogrelay added this to the 5.0.0-preview1 milestone Dec 2, 2019
@JamesNK
Member Author

JamesNK commented Dec 3, 2019

Is this 3.1 LTS worthy? It has real world impact. On the other hand I would guess the fix is non-trivial.

What is the mitigation here? Increasing MAX_CONCURRENT_STREAMS to a very high number would do it; alternatively, on the client you would need to count and balance calls between gRPC channels (each channel represents a connection).
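
For reference, the same limit can also be raised in code when configuring Kestrel; a minimal sketch of the generic-host setup (the value of 1000 is purely illustrative, and Startup is assumed to be the project's existing startup class):

using Microsoft.AspNetCore.Hosting;
using Microsoft.Extensions.Hosting;

public static class Program
{
    public static void Main(string[] args) =>
        Host.CreateDefaultBuilder(args)
            .ConfigureWebHostDefaults(webBuilder =>
            {
                webBuilder.ConfigureKestrel(options =>
                {
                    // Equivalent to Kestrel:Limits:Http2:MaxStreamsPerConnection in
                    // appsettings.json; a higher value makes it less likely a burst on a
                    // fresh connection exceeds the limit, but does not remove the race.
                    options.Limits.Http2.MaxStreamsPerConnection = 1000;
                })
                .UseStartup<Startup>();
            })
            .Build()
            .Run();
}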

@Tratcher
Member

Tratcher commented Dec 3, 2019

You'd have to throttle the client's concurrent requests. That should be pretty easy with a DelegatingHandler.
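
For illustration, a throttling handler along those lines might look like the sketch below; the class name and constructor are hypothetical, and 100 simply matches Kestrel's default MaxStreamsPerConnection:

using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical handler: caps how many requests the client has outstanding at
// once, so a burst cannot open more streams than the server allows before the
// server's SETTINGS frame has been observed.
public class ConcurrencyLimitingHandler : DelegatingHandler
{
    private readonly SemaphoreSlim _semaphore;

    public ConcurrencyLimitingHandler(HttpMessageHandler innerHandler, int maxConcurrentRequests)
        : base(innerHandler)
    {
        _semaphore = new SemaphoreSlim(maxConcurrentRequests);
    }

    protected override async Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        await _semaphore.WaitAsync(cancellationToken);
        try
        {
            return await base.SendAsync(request, cancellationToken);
        }
        finally
        {
            _semaphore.Release();
        }
    }
}

It could be wired up via GrpcChannelOptions.HttpClient, e.g. new HttpClient(new ConcurrencyLimitingHandler(new SocketsHttpHandler(), 100)). Note the semaphore is released when the response headers arrive, so for streaming calls this only approximates the number of open streams.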

We'll have to come up with a fix before we can evaluate the risk for backporting it.

@djluck

djluck commented Dec 3, 2019

For what it's worth, my workaround for the time being has been to establish a connection and then send periodic health check requests to keep the connection alive. It's not perfect, but it's worked for us so far.
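
For illustration, that keep-alive could be a background loop along these lines; the interval, error handling, and method name are assumptions rather than part of the actual workaround:

using System;
using System.Threading;
using System.Threading.Tasks;
using Grpc.Core;
using Grpc.Health.V1;

// Hypothetical keep-alive loop: a periodic health check keeps the connection
// from idling out, so the already-discovered MAX_CONCURRENT_STREAMS value
// stays in effect instead of being re-discovered on a fresh connection.
static async Task KeepConnectionAliveAsync(Health.HealthClient client, CancellationToken token)
{
    while (!token.IsCancellationRequested)
    {
        try
        {
            await Task.Delay(TimeSpan.FromSeconds(30), token); // interval is illustrative
            await client.CheckAsync(new HealthCheckRequest(), cancellationToken: token);
        }
        catch (RpcException)
        {
            // Ignore transient failures; the loop simply tries again on the next tick.
        }
        catch (OperationCanceledException)
        {
            // Shutting down; the while condition ends the loop.
        }
    }
}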

@Tratcher is DelegatingHandler specific to Grpc.Net.Client? Ideally I'd like to implement a more robust client throttling solution for Grpc.Core.

@Tratcher
Member

@djluck Yes, DelegatingHandler is specific to the .NET HttpClient. I have a fix in #17917 for the connection stability issue, but you're still going to get REFUSED_STREAM and ENHANCE_YOUR_CALM if you flood the server. See #17842.

@Tratcher
Member

Also keep in mind that 100 is the default stream limit for several other servers, so this is not going to be an isolated issue.

Tratcher added a commit that referenced this issue Dec 18, 2019
@Tratcher added the Done label Dec 18, 2019
@ghost ghost locked as resolved and limited conversation to collaborators Jan 17, 2020
@amcasey added the area-networking label and removed the area-runtime label Jun 2, 2023