Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Streaming RPC handlers are left running after client connections closed and server is stopped #6921

Closed
sunjayBhatia opened this issue Jan 12, 2024 · 3 comments · Fixed by #6922
Assignees

Comments

@sunjayBhatia
Copy link

sunjayBhatia commented Jan 12, 2024

What version of gRPC are you using?

v1.60.1

What version of Go are you using (go version)?

1.21.6

What operating system (Linux, Windows, …) and version?

Ubuntu 22.04.3, macOS Ventura

What did you do/what did you see?

See reproduction here: https://github.com/sunjayBhatia/grpc-go-issue-repro/tree/main/shutdown-cleanup

Clone the repo above, change to the shutdown-cleanup directory, run: TEST_COUNT=10 make test or TEST_COUNT=10 make test-race (using a higher number of test runs ensure the issue shows up, usually it doesn't take more than a couple runs however)

The repro starts a simple server with a streaming RPC, starts a client connection, sends/receives a message, closes the client connection, stops the server, and exits. The server handler logs when connections are opened/closed. It is hooked up to the test logger, which panics if logs are written after a test ends. We can see in this test that the streaming handler is still running after the test ends: we see the panics with the log lines StreamEcho error in Recv ... and StreamEcho ended.

Leaving the repro code as-is, should see a failure like:

% TEST_COUNT=10 make test-race
go test -v -race -count=10 .
=== RUN   TestShutdownCleanupStream
    repro_test.go:64: test started
    repro_test.go:72: server starting
    repro_test.go:88: sending request
    repro_test.go:20: StreamEcho started
    repro_test.go:93: receiving response
    repro_test.go:98: response: text:"foo"
    repro_test.go:100: closing client connection
    repro_test.go:103: stopping server
    repro_test.go:106: server stop finished
    repro_test.go:77: server exited
    repro_test.go:109: test ended
--- PASS: TestShutdownCleanupStream (0.01s)
=== RUN   TestShutdownCleanupStream
panic: Log in goroutine after TestShutdownCleanupStream has completed: StreamEcho error in Recv error: rpc error: code = Canceled desc = context canceled
        panic: Log in goroutine after TestShutdownCleanupStream has completed: StreamEcho ended


goroutine 50 [running]:
testing.(*common).logDepth(0xc000218680, {0xc0000f0078, 0x11}, 0x3)
        /opt/homebrew/Cellar/go/1.21.6/libexec/src/testing/testing.go:1022 +0x4f8
testing.(*common).log(...)
        /opt/homebrew/Cellar/go/1.21.6/libexec/src/testing/testing.go:1004
testing.(*common).Log(0xc000218680, {0xc0000f93d8, 0x1, 0x1})
        /opt/homebrew/Cellar/go/1.21.6/libexec/src/testing/testing.go:1045 +0x74
panic({0x105068640?, 0xc0000b0580?})
        /opt/homebrew/Cellar/go/1.21.6/libexec/src/runtime/panic.go:920 +0x26c
testing.(*common).logDepth(0xc000218680, {0xc0000a0780, 0x52}, 0x3)
        /opt/homebrew/Cellar/go/1.21.6/libexec/src/testing/testing.go:1022 +0x4f8
testing.(*common).log(...)
        /opt/homebrew/Cellar/go/1.21.6/libexec/src/testing/testing.go:1004
testing.(*common).Logf(0xc000218680, {0x104f36bc7, 0xc}, {0xc0000f9748, 0x2, 0x2})
        /opt/homebrew/Cellar/go/1.21.6/libexec/src/testing/testing.go:1055 +0x84
github.com/sunjayBhatia/grpc-go-issue-repro/shutdown-cleanup_test.(*EchoServer).StreamEcho.func1(...)
        /Users/sunjayb/workspace/grpc-go-issue-repro/shutdown-cleanup/repro_test.go:24
github.com/sunjayBhatia/grpc-go-issue-repro/shutdown-cleanup_test.(*EchoServer).StreamEcho(0xc00003ae40, {0x105135ed8, 0xc0002861c0})
        /Users/sunjayb/workspace/grpc-go-issue-repro/shutdown-cleanup/repro_test.go:37 +0x330
github.com/sunjayBhatia/grpc-go-issue-repro/shutdown-cleanup/echo._EchoService_StreamEcho_Handler({0x10509a540?, 0xc00003ae40}, {0x1051343c0?, 0xc0002be000})
        /Users/sunjayb/workspace/grpc-go-issue-repro/shutdown-cleanup/echo/echo_grpc.pb.go:100 +0xb0
google.golang.org/grpc.(*Server).processStreamingRPC(0xc000174200, {0x1051333e8, 0xc000282270}, {0x1051360a8, 0xc0003081a0}, 0xc0002ae000, 0xc00021a8d0, 0x1054ab720, 0x0)
        /Users/sunjayb/go/pkg/mod/google.golang.org/[email protected]/server.go:1666 +0x16f4
google.golang.org/grpc.(*Server).handleStream(0xc000174200, {0x1051360a8, 0xc0003081a0}, 0xc0002ae000)
        /Users/sunjayb/go/pkg/mod/google.golang.org/[email protected]/server.go:1787 +0x12f0
google.golang.org/grpc.(*Server).serveStreams.func2.1()
        /Users/sunjayb/go/pkg/mod/google.golang.org/[email protected]/server.go:1016 +0xa0
created by google.golang.org/grpc.(*Server).serveStreams.func2 in goroutine 37
        /Users/sunjayb/go/pkg/mod/google.golang.org/[email protected]/server.go:1027 +0x1d8
FAIL    github.com/sunjayBhatia/grpc-go-issue-repro/shutdown-cleanup    0.663s
FAIL
make: *** [test-race] Error 1

Looks like the goroutine that spawns the stream handler is not waited for so it may still be running even after we think the server has shut down:

grpc-go/server.go

Line 1027 in 953d12a

go f()

Similarly, when using a set number of server stream workers (uncomment/comment the relevant lines in the repro code):

% TEST_COUNT=10 make test-race
go test -v -race -count=10 .
=== RUN   TestShutdownCleanupStream
    repro_test.go:64: test started
    repro_test.go:72: server starting
    repro_test.go:88: sending request
    repro_test.go:93: receiving response
    repro_test.go:20: StreamEcho started
    repro_test.go:98: response: text:"foo"
    repro_test.go:100: closing client connection
    repro_test.go:103: stopping server
    repro_test.go:106: server stop finished
    repro_test.go:109: test ended
--- PASS: TestShutdownCleanupStream (0.00s)
=== RUN   TestShutdownCleanupStream
    repro_test.go:64: test started
panic: Log in goroutine after TestShutdownCleanupStream has completed: StreamEcho error in Recv error: rpc error: code = Canceled desc = context canceled
        panic: Log in goroutine after TestShutdownCleanupStream has completed: StreamEcho ended


goroutine 20 [running]:
testing.(*common).logDepth(0xc000183520, {0xc00028e120, 0x11}, 0x3)
        /opt/homebrew/Cellar/go/1.21.6/libexec/src/testing/testing.go:1022 +0x4f8
testing.(*common).log(...)
        /opt/homebrew/Cellar/go/1.21.6/libexec/src/testing/testing.go:1004
testing.(*common).Log(0xc000183520, {0xc00022d378, 0x1, 0x1})
        /opt/homebrew/Cellar/go/1.21.6/libexec/src/testing/testing.go:1045 +0x74
panic({0x104d5c660?, 0xc00028a8b0?})
        /opt/homebrew/Cellar/go/1.21.6/libexec/src/runtime/panic.go:920 +0x26c
testing.(*common).logDepth(0xc000183520, {0xc000296960, 0x52}, 0x3)
        /opt/homebrew/Cellar/go/1.21.6/libexec/src/testing/testing.go:1022 +0x4f8
testing.(*common).log(...)
        /opt/homebrew/Cellar/go/1.21.6/libexec/src/testing/testing.go:1004
testing.(*common).Logf(0xc000183520, {0x104c2acc7, 0xc}, {0xc00022d6e8, 0x2, 0x2})
        /opt/homebrew/Cellar/go/1.21.6/libexec/src/testing/testing.go:1055 +0x84
github.com/sunjayBhatia/grpc-go-issue-repro/shutdown-cleanup_test.(*EchoServer).StreamEcho.func1(...)
        /Users/sunjayb/workspace/grpc-go-issue-repro/shutdown-cleanup/repro_test.go:24
github.com/sunjayBhatia/grpc-go-issue-repro/shutdown-cleanup_test.(*EchoServer).StreamEcho(0xc0001a2e20, {0x104e29f78, 0xc0003821c0})
        /Users/sunjayb/workspace/grpc-go-issue-repro/shutdown-cleanup/repro_test.go:37 +0x330
github.com/sunjayBhatia/grpc-go-issue-repro/shutdown-cleanup/echo._EchoService_StreamEcho_Handler({0x104d8e560?, 0xc0001a2e20}, {0x104e28460?, 0xc0003b4000})
        /Users/sunjayb/workspace/grpc-go-issue-repro/shutdown-cleanup/echo/echo_grpc.pb.go:100 +0xb0
google.golang.org/grpc.(*Server).processStreamingRPC(0xc00025c200, {0x104e27488, 0xc00038c210}, {0x104e2a148, 0xc00008a9c0}, 0xc0003ae000, 0xc0002688d0, 0x10519f720, 0x0)
        /Users/sunjayb/go/pkg/mod/google.golang.org/[email protected]/server.go:1666 +0x16f4
google.golang.org/grpc.(*Server).handleStream(0xc00025c200, {0x104e2a148, 0xc00008a9c0}, 0xc0003ae000)
        /Users/sunjayb/go/pkg/mod/google.golang.org/[email protected]/server.go:1787 +0x12f0
google.golang.org/grpc.(*Server).serveStreams.func2.1()
        /Users/sunjayb/go/pkg/mod/google.golang.org/[email protected]/server.go:1016 +0xa0
google.golang.org/grpc.(*Server).serverWorker(0xc00025c200)
        /Users/sunjayb/go/pkg/mod/google.golang.org/[email protected]/server.go:618 +0x38
created by google.golang.org/grpc.(*Server).initServerWorkers in goroutine 19
        /Users/sunjayb/go/pkg/mod/google.golang.org/[email protected]/server.go:631 +0x1f0
FAIL    github.com/sunjayBhatia/grpc-go-issue-repro/shutdown-cleanup    0.511s
FAIL
make: *** [test-race] Error 1

It looks like server workers are left running without full coordination on shutdown as well:

go s.serverWorker()

This issue showed up in Contour CI after we bumped from grpc-go v1.59.0 to v1.60.0+

I've tracked this issue via git bisect down to this PR: #6489 which looks like it changed how coordination of stream handlers is set up. Looks like connection draining/closing might be working as expected, but stream handlers are not

What did you expect to see?

When server is stopped (gracefully or otherwise), it waits until all stream handlers finish to exit.

@dfawley
Copy link
Member

dfawley commented Jan 16, 2024

This would seem to be caused by this suggestion from that PR.

I wonder if there was a different root cause for the failures that were the reason we removed the waitgroup.

I agree this was a behavior change, and it would be good to bring the old behavior back (along with a test) since it was done that way before and is more ideal.

@dfawley dfawley self-assigned this Jan 16, 2024
@dfawley
Copy link
Member

dfawley commented Jan 16, 2024

(Still debugging...)

It seems like there must have been a bug with (or some other exception in) the previous implementation that did allow Stop to return without all handlers completing. E.g. this test performed a blocking Stop call explicitly with a handler running:

ss.S.Stop()

So it does seem this would be a behavior change to block if handlers were still running.

Maybe only GracefulStop did this blocking, and Stop just returned immediately.

@dfawley
Copy link
Member

dfawley commented Jan 16, 2024

I believe this (legacy) code in GracefulStop that was not present in Stop is what was causing us to wait for the method handlers to exit. We'll have to think about this a bit to see if we want to continue with this behavior or make Stop block until method handlers return.

grpc-go/server.go

Lines 1908 to 1910 in 82df321

for len(s.conns) != 0 {
s.cv.Wait()
}

It's arguably most correct to block until they exit, and I can't think of any valid reason for handlers to not respect the ctx given to them, which is the only way they would not exit. But it's still a breaking behavior change.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jul 17, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants