Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Gracefully handle GoAway #320

Closed
chrisstaite-menlo opened this issue Oct 19, 2023 · 8 comments · Fixed by #627
Closed

Gracefully handle GoAway #320

chrisstaite-menlo opened this issue Oct 19, 2023 · 8 comments · Fixed by #627

Comments

@chrisstaite-menlo
Copy link
Collaborator

When being hammered a GrpcStore can end up encountering many transport errors:

[2023-10-19T08:43:40.895Z ERROR cas_server] find_missing_blobs Resp: 0.039009552 Err(Status { code: Internal, message: "status: Internal, message: \"h2 protocol error: http2 error: connection error received: unspecific protocol error detected (b\\\"[p]req HEADERS: max concurrency reached\\\")\", details: [], metadata: MetadataMap { headers: {} } : in GrpcStore::find_missing_blobs : In find_missing_blobs : Failed on find_missing_blobs() command", source: None })

It should be restricted in the number of simultaneous requests it can hammer at the upstream.

@allada
Copy link
Member

allada commented Oct 20, 2023

I cant find where that error could have came from in from tonic/hyper. Any idea if this could have come from another service (or even client)?

@chrisstaite
Copy link
Contributor

Yeah, I couldn't find it either, but it's only turbo-cache talking to the CAS everything else is proxied through a GrpcStore, so I don't see how a headers message could come from anywhere else.

@allada
Copy link
Member

allada commented Oct 21, 2023

Interesting, are you sure it's turbo-cache talking to turbo-cache? The reason I ask is because find_missing_blobs is never called inside TC except in GRPC forwarding store.

Is it possible that the client (ReClient?) sent this error over h2?

@chrisstaite
Copy link
Contributor

Actually, I have an nginx in front that forwards the CAS gRPC requests directly for legacy reasons. It may actually be Reclient sending it.

@chrisstaite-menlo chrisstaite-menlo changed the title Limit GrpcStore concurrency Gracefully handle GoAway Dec 13, 2023
@chrisstaite-menlo
Copy link
Collaborator Author

It seems like this is still an issue and is discussed in hyperium/hyper#2500. I get this issue with nativelink talking to nativelink directly GrpcStore and sometimes also stream session alloc failed and No stream session found. Getting one of these errors can often lead to a hang up in the build.

@chrisstaite-menlo
Copy link
Collaborator Author

Just looking and it appears that tonic provides both concurrency_limit and rate_limit for Endpoint. It would be good to expose these to config. It would be better if we could handle the GoAway a little nicer to perform concurrency detection, but this would at least allow us to add some kind of cap.

@chrisstaite-menlo
Copy link
Collaborator Author

Adding the rate limiting doesn't seem to help much. This is the logs from the GrpcStore server:

  2024-01-24T11:15:04.891584Z ERROR nativelink_service::bytestream_server: Write Resp: 0.004535798 None Status { code: Unavailable, message: "status: Unavailable, message: \"h2 protocol error: http2 error: stream error received: refused stream before processing any application logic\", details: [], metadata: MetadataMap { headers: {} } : in GrpcStore::write : in GrpcStore::update() : Error updating inner store : In ByteStreamServer::write()", source: None }
    at nativelink-service/src/bytestream_server.rs:535 on tokio-runtime-worker ThreadId(13)

  2024-01-24T11:15:04.891842Z ERROR nativelink_service::bytestream_server: Write Resp: 0.004686835 None Status { code: Unavailable, message: "status: Unavailable, message: \"h2 protocol error: http2 error: stream error received: refused stream before processing any application logic\", details: [], metadata: MetadataMap { headers: {} } : in GrpcStore::write : in GrpcStore::update() : Error updating inner store : In ByteStreamServer::write()", source: None }
    at nativelink-service/src/bytestream_server.rs:535 on tokio-runtime-worker ThreadId(14)

  2024-01-24T11:15:05.038798Z ERROR nativelink_service::bytestream_server: Write Resp: 0.051602904 None Status { code: Internal, message: "status: Internal, message: \"h2 protocol error: http2 error: connection error received: unspecific protocol error detected\", details: [], metadata: MetadataMap { headers: {} } : in GrpcStore::write : in GrpcStore::update() : Error updating inner store : In ByteStreamServer::write()", source: None }
    at nativelink-service/src/bytestream_server.rs:535 on tokio-runtime-worker ThreadId(3)

  2024-01-24T11:15:05.038955Z ERROR nativelink_service::bytestream_server: Write Resp: 0.05187966 None Status { code: Internal, message: "status: Internal, message: \"h2 protocol error: http2 error: connection error received: unspecific protocol error detected\", details: [], metadata: MetadataMap { headers: {} } : in GrpcStore::write : in GrpcStore::update() : Error updating inner store : In ByteStreamServer::write()", source: None }
    at nativelink-service/src/bytestream_server.rs:535 on tokio-runtime-worker ThreadId(13)

  2024-01-24T11:15:05.099769Z ERROR nativelink_service::bytestream_server: Write Resp: 0.000070117 None Status { code: InvalidArgument, message: "Received out of order data. Got 0, expected 967 : In ByteStreamServer::write()", source: None }
    at nativelink-service/src/bytestream_server.rs:535 on tokio-runtime-worker ThreadId(14)

  2024-01-24T11:15:05.102512Z ERROR nativelink_service::bytestream_server: Write Resp: 0.0001212 None Status { code: InvalidArgument, message: "Received out of order data. Got 0, expected 335 : In ByteStreamServer::write()", source: None }
    at nativelink-service/src/bytestream_server.rs:535 on tokio-runtime-worker ThreadId(16)

  2024-01-24T11:15:05.185536Z ERROR nativelink_service::bytestream_server: Write Resp: 0.000094612 None Status { code: InvalidArgument, message: "Received out of order data. Got 0, expected 813 : In ByteStreamServer::write()", source: None }
    at nativelink-service/src/bytestream_server.rs:535 on tokio-runtime-worker ThreadId(13)

  2024-01-24T11:15:05.244820Z ERROR nativelink_service::bytestream_server: Write Resp: 0.000124144 None Status { code: InvalidArgument, message: "Received out of order data. Got 0, expected 335 : In ByteStreamServer::write()", source: None }
    at nativelink-service/src/bytestream_server.rs:535 on tokio-runtime-worker ThreadId(14)`

And these are the logs from the storage server itself:

  2024-01-24T11:15:04.957840Z ERROR nativelink: Failed running service : hyper::Error(Http2, Error { kind: GoAway(b"", PROTOCOL_ERROR, Library) })
    at src/bin/nativelink.rs:605 on tokio-runtime-worker ThreadId(4)

This is with the above concurrency limit change and the concurrency set to 200, I'll try reducing it to 100.

@chrisstaite-menlo
Copy link
Collaborator Author

With it reduced to 100 I got:

GrpcStore:

  2024-01-24T11:18:40.013285Z ERROR nativelink_service::bytestream_server: Write Resp: 0.57577896 None Status { code: Unknown, message: "status: Unknown, message: \"transport error\", details: [], metadata: MetadataMap { headers: {} } : in GrpcStore::write : in GrpcStore::update() : Error updating inner store : In ByteStreamServer::write()", source: None }
    at nativelink-service/src/bytestream_server.rs:535 on tokio-runtime-worker ThreadId(13)

  2024-01-24T11:18:40.013300Z ERROR nativelink_service::bytestream_server: Write Resp: 0.575339 None Status { code: Unknown, message: "status: Unknown, message: \"transport error\", details: [], metadata: MetadataMap { headers: {} } : in GrpcStore::write : in GrpcStore::update() : Error updating inner store : In ByteStreamServer::write()", source: None }
    at nativelink-service/src/bytestream_server.rs:535 on tokio-runtime-worker ThreadId(14)

  2024-01-24T11:18:40.169492Z ERROR nativelink_service::bytestream_server: Write Resp: 0.000058742 None Status { code: InvalidArgument, message: "Received out of order data. Got 0, expected 1286 : In ByteStreamServer::write()", source: None }
    at nativelink-service/src/bytestream_server.rs:535 on tokio-runtime-worker ThreadId(13)

  2024-01-24T11:18:40.232436Z ERROR nativelink_service::bytestream_server: Write Resp: 0.000067092 None Status { code: InvalidArgument, message: "Received out of order data. Got 0, expected 335 : In ByteStreamServer::write()", source: None }
    at nativelink-service/src/bytestream_server.rs:535 on tokio-runtime-worker ThreadId(16)

Storage server:

  2024-01-24T11:18:39.933106Z ERROR nativelink: Failed running service : hyper::Error(Http2, Error { kind: GoAway(b"[p]CONTINUATION: No stream session found", PROTOCOL_ERROR, Remote) })
    at src/bin/nativelink.rs:605 on tokio-runtime-worker ThreadId(3)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants