konnectivity-client: ensure connection read channel is closed #325
Conversation
tests/proxy_test.go (Outdated)

```diff
@@ -219,6 +220,58 @@ func TestProxyHandle_SlowContext_GRPC(t *testing.T) {
 	}
 }
 
+func TestProxyHandle_ContextCancelled_GRPC(t *testing.T) {
+	defer goleak.VerifyNone(t)
```
Without the bug fix in konnectivity-client/pkg/client/client.go:
```
$ go test ./... -run TestProxyHandle_ContextCancelled_GRPC
E0214 16:09:13.663053 847761 client.go:103] "stream read failure" err="rpc error: code = Canceled desc = context canceled"
E0214 16:09:13.663475 847761 server.go:390] "Stream read from frontend failure" err="rpc error: code = Canceled desc = context canceled"
E0214 16:09:13.664071 847761 server.go:751] "send to client stream failure" err="rpc error: code = Unavailable desc = transport is closing" serverID="846800f0-2f8e-4f29-b25a-50153d1f00cd" agentID="57a4766a-a392-4c4f-8146-c5a25b0c510f" connectionID=1
E0214 16:09:13.664223 847761 client.go:371] "could not read stream" err="rpc error: code = Unavailable desc = error reading from server: EOF"
E0214 16:09:13.664452 847761 server.go:515] "CLOSE_REQ to Backend failed" err="rpc error: code = Unavailable desc = transport is closing" serverID="846800f0-2f8e-4f29-b25a-50153d1f00cd"
E0214 16:09:13.664551 847761 server.go:669] "stream read failure" err="rpc error: code = Canceled desc = context canceled"
E0214 16:09:13.664551 847761 client.go:427] "close response failure" err="EOF"
E0214 16:09:13.665131 847761 server.go:705] "CLOSE_RSP to frontend failed" err="rpc error: code = Unavailable desc = transport is closing" serverID="846800f0-2f8e-4f29-b25a-50153d1f00cd" agentID="57a4766a-a392-4c4f-8146-c5a25b0c510f"
--- FAIL: TestProxyHandle_ContextCancelled_GRPC (2.45s)
leaks.go:78: found unexpected goroutines:
[Goroutine 19 in state chan receive, with k8s.io/klog/v2.(*loggingT).flushDaemon on top of the stack:
goroutine 19 [chan receive]:
k8s.io/klog/v2.(*loggingT).flushDaemon(0x0)
/home/andrewsy/go/pkg/mod/k8s.io/klog/[email protected]/klog.go:1169 +0x6a
created by k8s.io/klog/v2.init.0
/home/andrewsy/go/pkg/mod/k8s.io/klog/[email protected]/klog.go:417 +0xfb
Goroutine 31 in state chan receive, with sigs.k8s.io/apiserver-network-proxy/konnectivity-client/pkg/client.(*conn).Read on top of the stack:
goroutine 31 [chan receive]:
sigs.k8s.io/apiserver-network-proxy/konnectivity-client/pkg/client.(*conn).Read(0xc000485310, {0xc00025e000, 0x1000, 0x101000000000000})
/home/andrewsy/go/src/sigs.k8s.io/apiserver-network-proxy/konnectivity-client/pkg/client/conn.go:74 +0x5d
net/http.(*persistConn).Read(0xc00015cea0, {0xc00025e000, 0x40717d, 0x60})
/usr/local/go/src/net/http/transport.go:1926 +0x4e
bufio.(*Reader).fill(0xc0004de7e0)
/usr/local/go/src/bufio/bufio.go:101 +0x103
bufio.(*Reader).Peek(0xc0004de7e0, 0x1)
/usr/local/go/src/bufio/bufio.go:139 +0x5d
net/http.(*persistConn).readLoop(0xc00015cea0)
/usr/local/go/src/net/http/transport.go:2087 +0x1ac
created by net/http.(*Transport).dialConn
/usr/local/go/src/net/http/transport.go:1747 +0x1e05
]
FAIL
FAIL sigs.k8s.io/apiserver-network-proxy/tests 2.532s
FAIL
```
With the bug fix:
```
$ go test ./... -run TestProxyHandle_ContextCancelled_GRPC
E0214 16:16:42.596087 848745 client.go:112] "stream read failure" err="rpc error: code = Canceled desc = context canceled"
E0214 16:16:42.596265 848745 server.go:390] "Stream read from frontend failure" err="rpc error: code = Canceled desc = context canceled"
E0214 16:16:42.596361 848745 server.go:751] "send to client stream failure" err="rpc error: code = Canceled desc = context canceled" serverID="235e6686-f70f-4a1d-a6c6-dc46a811ce01" agentID="a947c418-4784-4b14-8fc9-c4cf0f47834b" connectionID=1
E0214 16:16:42.596473 848745 client.go:371] "could not read stream" err="rpc error: code = Unavailable desc = error reading from server: EOF"
E0214 16:16:42.596581 848745 server.go:515] "CLOSE_REQ to Backend failed" err="rpc error: code = Unavailable desc = transport is closing" serverID="235e6686-f70f-4a1d-a6c6-dc46a811ce01"
E0214 16:16:42.596624 848745 server.go:669] "stream read failure" err="rpc error: code = Canceled desc = context canceled"
E0214 16:16:42.596660 848745 client.go:427] "close response failure" err="EOF"
E0214 16:16:42.596694 848745 server.go:705] "CLOSE_RSP to frontend failed" err="rpc error: code = Unavailable desc = transport is closing" serverID="235e6686-f70f-4a1d-a6c6-dc46a811ce01" agentID="a947c418-4784-4b14-8fc9-c4cf0f47834b"
--- FAIL: TestProxyHandle_ContextCancelled_GRPC (2.45s)
leaks.go:78: found unexpected goroutines:
[Goroutine 6 in state chan receive, with k8s.io/klog/v2.(*loggingT).flushDaemon on top of the stack:
goroutine 6 [chan receive]:
k8s.io/klog/v2.(*loggingT).flushDaemon(0x0)
/home/andrewsy/go/pkg/mod/k8s.io/klog/[email protected]/klog.go:1169 +0x6a
created by k8s.io/klog/v2.init.0
/home/andrewsy/go/pkg/mod/k8s.io/klog/[email protected]/klog.go:417 +0xfb
]
FAIL
FAIL sigs.k8s.io/apiserver-network-proxy/tests 2.528s
FAIL
```
Note that the test always fails with one leaked goroutine coming from klog, which is expected.
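For context, goleak can also filter by a goroutine's top stack function, which would suppress only the klog flusher seen in the traces above. A minimal sketch, assuming go.uber.org/goleak's IgnoreTopFunction option; the test name and body are illustrative stand-ins:

```go
package tests

import (
	"testing"

	"go.uber.org/goleak"
)

func TestNoLeaksIgnoringKlog(t *testing.T) {
	// Suppress only the klog flushDaemon goroutine by its stack-top
	// function, matching the frame reported in the failure above.
	defer goleak.VerifyNone(t,
		goleak.IgnoreTopFunction("k8s.io/klog/v2.(*loggingT).flushDaemon"),
	)
	// ... test body ...
}
```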
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if, as soon as this test function starts (or if we can initialize klog early first), we use IgnoreCurrent to make goleak ignore the klog goroutine, so we can keep the test?
This package seems super useful; I'd love to keep it in all tests.
(BTW, I wasn't aware of this module, it seems quite cool. Thanks!!)
> What if, as soon as this test function starts (or if we can initialize klog early first), we use IgnoreCurrent to make goleak ignore the klog goroutine, so we can keep the test?
Good call, let me try this!
Done
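For reference, a minimal sketch of the IgnoreCurrent pattern just discussed. The test name and body are illustrative stand-ins, not the PR's exact code; importing klog is what starts its flush goroutine before the snapshot is taken:

```go
package tests

import (
	"testing"

	"go.uber.org/goleak"
	_ "k8s.io/klog/v2" // package init starts klog's flushDaemon goroutine
)

func TestProxyHandle_Example_GRPC(t *testing.T) {
	// Arguments to a deferred call are evaluated immediately, so
	// IgnoreCurrent snapshots the goroutines alive right now (including
	// klog's flusher) and VerifyNone ignores them at test exit.
	defer goleak.VerifyNone(t, goleak.IgnoreCurrent())

	// ... test body ...
}
```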
And just verifying the test works:
Without defer close(conn.readCh):
```
$ go test ./... -run TestProxyHandle_ContextCancelled_GRPC
E0216 09:32:28.416716 1191438 client.go:103] "stream read failure" err="rpc error: code = Canceled desc = context canceled"
E0216 09:32:28.417368 1191438 server.go:751] "send to client stream failure" err="rpc error: code = Unavailable desc = transport is closing" serverID="acbcdbed-f7a1-40b5-a02c-c65186c172a0" agentID="b441b2f6-c9c1-43a9-9f33-f30d1fb1306e" connectionID=1
E0216 09:32:28.417398 1191438 server.go:390] "Stream read from frontend failure" err="rpc error: code = Canceled desc = context canceled"
E0216 09:32:28.417907 1191438 client.go:371] "could not read stream" err="rpc error: code = Unavailable desc = error reading from server: read tcp [::1]:39060->[::1]:40799: read: connection reset by peer"
E0216 09:32:28.419031 1191438 client.go:427] "close response failure" err="EOF"
E0216 09:32:28.419143 1191438 server.go:515] "CLOSE_REQ to Backend failed" err="rpc error: code = Unavailable desc = transport is closing" serverID="acbcdbed-f7a1-40b5-a02c-c65186c172a0"
E0216 09:32:28.419955 1191438 server.go:669] "stream read failure" err="rpc error: code = Canceled desc = context canceled"
E0216 09:32:28.420246 1191438 server.go:705] "CLOSE_RSP to frontend failed" err="rpc error: code = Unavailable desc = transport is closing" serverID="acbcdbed-f7a1-40b5-a02c-c65186c172a0" agentID="b441b2f6-c9c1-43a9-9f33-f30d1fb1306e"
--- FAIL: TestProxyHandle_ContextCancelled_GRPC (2.46s)
leaks.go:78: found unexpected goroutines:
[Goroutine 100 in state chan receive, with sigs.k8s.io/apiserver-network-proxy/konnectivity-client/pkg/client.(*conn).Read on top of the stack:
goroutine 100 [chan receive]:
sigs.k8s.io/apiserver-network-proxy/konnectivity-client/pkg/client.(*conn).Read(0xc00003d450, {0xc0001a7000, 0x1000, 0x101000000000000})
/home/andrewsy/go/src/sigs.k8s.io/apiserver-network-proxy/konnectivity-client/pkg/client/conn.go:74 +0x5d
net/http.(*persistConn).Read(0xc0004207e0, {0xc0001a7000, 0x40717d, 0x60})
/usr/local/go/src/net/http/transport.go:1926 +0x4e
bufio.(*Reader).fill(0xc000099e00)
/usr/local/go/src/bufio/bufio.go:101 +0x103
bufio.(*Reader).Peek(0xc000099e00, 0x1)
/usr/local/go/src/bufio/bufio.go:139 +0x5d
net/http.(*persistConn).readLoop(0xc0004207e0)
/usr/local/go/src/net/http/transport.go:2087 +0x1ac
created by net/http.(*Transport).dialConn
/usr/local/go/src/net/http/transport.go:1747 +0x1e05
]
FAIL
FAIL sigs.k8s.io/apiserver-network-proxy/tests 2.549s
FAIL
```
With defer close(conn.readCh):
```
$ go test ./... -run TestProxyHandle_ContextCancelled_GRPC -count 1
ok sigs.k8s.io/apiserver-network-proxy/tests 2.222s
```
Awesome, thanks! :)
```diff
@@ -9,7 +9,8 @@ require (
 	github.com/prometheus/client_golang v1.7.1
 	github.com/spf13/cobra v0.0.3
 	github.com/spf13/pflag v1.0.5
 	golang.org/x/net v0.0.0-20201110031124-69a78807bb2b
+	go.uber.org/goleak v1.1.12
```
Going to remove the go.mod changes later, just adding them temporarily to run some tests.
I'm fine with keeping it as long as it's just for use in our tests.
Signed-off-by: Andrew Sy Kim <[email protected]>
…textDeadlineExceeded_GRPC Signed-off-by: Andrew Sy Kim <[email protected]>
Force-pushed from acdb437 to 5badf2f
… ProxyHandle tests Signed-off-by: Andrew Sy Kim <[email protected]>
Not a maintainer here, but this LGTM, just one minor comment :)
```diff
@@ -140,7 +145,9 @@ func TestProxyHandleDialError_GRPC(t *testing.T) {
 	}
 }
 
-func TestProxyHandle_DoneContext_GRPC(t *testing.T) {
+func TestProxyHandle_ContextDeadlineExceeded_GRPC(t *testing.T) {
+	defer goleak.VerifyNone(t, goleak.IgnoreCurrent())
```
nit: I'd add a comment saying we need to IgnoreCurrent due to klog leaks
Good point, will add that
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: andrewsykim, rata. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
…log spawns goroutine for flushing logs) Signed-off-by: Andrew Sy Kim <[email protected]>
I fixed the test: in the happy path we do actually close the channel properly, so we should only close the readCh if we never saw the close happen.
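A sketch of the guarded close being described, under the assumption that the CLOSE_RSP happy path and the cleanup path can both try to close readCh. The type and names here are simplified stand-ins for the konnectivity-client conn, and sync.Once is one way to make the close idempotent:

```go
package client

import "sync"

// conn is a simplified stand-in for the konnectivity-client connection.
type conn struct {
	readCh    chan []byte
	closeOnce sync.Once
}

// closeRead can be called from both the CLOSE_RSP handler (happy path) and
// the tunnel's cleanup path; sync.Once guarantees readCh is closed exactly
// once, so neither caller can trigger a double-close panic.
func (c *conn) closeRead() {
	c.closeOnce.Do(func() { close(c.readCh) })
}
```

This shape would also address the double-close concern raised later in the thread.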
I added the goroutine leak check to a bunch of other tests in proxy_test.go, and it looks like we found another one :)
Turns out ^ is not an actual goroutine leak; we just need to ensure the client calls conn.Close by setting req.Close = true on the test request.
Force-pushed from 542dd2e to e4cb139
@mihivagyok if you have time, do you think you can test this PR to see if you can reproduce hanging streams / memory leaks?
@andrewsykim Yes, I could try it tomorrow. I'll let you know about the results! Thanks!
@mihivagyok I realized that this patch requires a change in kube-apiserver, so it might be a bit difficult to test.
```diff
@@ -95,7 +95,28 @@ func (t *grpcTunnel) serve(c clientConn) {
 	defer c.Close()
 
 	for {
-		pkt, err := t.stream.Recv()
+		done := make(chan struct{})
```
@mihivagyok I added a timeout on the client-side for stream.Recv(), similar to #324 (and based on grpc/grpc-go#1229). Going to test it against a real cluster later today
Force-pushed from 85a1a50 to 851963a
```diff
+			pkt, err := t.stream.Recv()
+			pktChan <- pkt
+			errChan <- err
+			close(done)
```
This should probably be a defer close(done) at the beginning of the anonymous function.
```diff
+		done := make(chan struct{})
+		pktChan := make(chan *client.Packet, 1)
+		errChan := make(chan error, 1)
+		go func() {
```
Do we really want to spin up a separate goroutine for each packet we want to receive?
It's not ideal, but it's a side effect of stream.Recv not accepting a context.
I'm on the fence about keeping this change though; I pushed it so I could test it, but I haven't seen any evidence yet that it helps with the issue.
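Putting the diff fragments above together, the workaround has roughly this shape. This is a sketch based on the pattern from grpc/grpc-go#1229; the Packet and stream types are simplified stand-ins, and recvWithContext is a hypothetical name, not the PR's actual function:

```go
package client

import "context"

// Packet is a stand-in for the konnectivity-client proto type.
type Packet struct{}

// recvStream is the minimal stream surface the workaround needs.
type recvStream interface {
	Recv() (*Packet, error)
}

// recvWithContext lets a caller stop waiting on a blocking Recv when ctx is
// cancelled. Recv itself cannot be interrupted: the goroutine runs until it
// returns, and the buffered channels let it finish without blocking even if
// the caller has already given up.
func recvWithContext(ctx context.Context, stream recvStream) (*Packet, error) {
	done := make(chan struct{})
	pktChan := make(chan *Packet, 1)
	errChan := make(chan error, 1)
	go func() {
		defer close(done) // closed via defer, per the review comment above
		pkt, err := stream.Recv()
		pktChan <- pkt
		errChan <- err
	}()
	select {
	case <-ctx.Done():
		return nil, ctx.Err()
	case <-done:
		return <-pktChan, <-errChan
	}
}
```

Note the per-call goroutine that the review comment above questions: cancellation abandons the wait rather than interrupting the read, so each in-flight Recv still runs to completion.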
/assign @cheftako
Force-pushed from 851963a to e4cb139
```diff
 	if err != nil {
 		t.Error(err)
 	}
+	// client needs to Close the connection immediately to pass goroutine leak test
```
Could you elaborate? This seems a bit concerning.
Not clear we should be relying on the client to do this sort of thing to prevent a leak...
The default behavior of the http client is to re-use connections between requests to the same host. But relying on that default behavior leaves us with a goroutine at the end of the test run. Explicitly setting request.Close is as if we disabled keep-alives on the connection, which also means the goroutine completes when the test completes.
In other words, setting req.Close = true ensures that conn.Close() gets called, which exercises the read channel closing.
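A sketch of what that looks like on the test's http client; the helper name is hypothetical:

```go
package tests

import (
	"net/http"
	"testing"
)

// newOneShotRequest builds a request that opts out of connection reuse, so
// the transport closes the underlying connection (running conn.Close) once
// the response is consumed, instead of parking it in the idle pool past the
// end of the test.
func newOneShotRequest(t *testing.T, url string) *http.Request {
	t.Helper()
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		t.Fatal(err)
	}
	req.Close = true // equivalent to disabling keep-alives for this request
	return req
}
```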
Generally seems like a good fix. A couple of minor nits that would be good to address.
/hold
As I mentioned here, I once saw the apiserver crash with this:
This was with this PR and #337 applied. See more details in the comment I linked, but I could not reproduce this; it only happened once. It was not in a "weird dirty state": I had just compiled k8s with #337 added and manually ran a kubectl logs command (not sure if the apiserver crashed before I ran the run.sh script we use for the repro in #276 or not). Has anyone seen this when testing this PR? PR #337 seems unrelated (but, well, it was applied; it might make something easier to hit or just be noise, I don't know yet).
Do you see any case where this PR might add a double close on a channel, and therefore cause the apiserver to panic? I'm a bit concerned: the apiserver panic'ed, and since this change is in konn-client, a follow-up fix requires customers to update the k8s version, which usually isn't as easy to update as the proxy-server (at least for us). On the other hand, I can't reproduce this, so I'm unsure what to propose... If you can't reproduce it either and this looks correct to you, I'm fine merging :)
I think you're right. I don't think we need this PR anymore: the goroutine leak was only reproducible in the test because the http client in the test was not calling conn.Close(). Will close it for now.
/close
@andrewsykim: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
This PR ensures that the readCh on the connection from a single-use gRPC tunnel is always closed. While running some tests, I noticed that goroutines can leak while blocked reading conn.readCh. This happens when the client never receives CLOSE_RSP, which seems possible if the transport already closed for some other reason.
Note: sorry, I accidentally pushed this branch to upstream instead of my fork.
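To illustrate the leak the description refers to, here is a simplified stand-in (not the actual conn type) showing why the channel must be closed:

```go
package client

import "io"

// leakyConn mimics the shape of the tunnel connection's read side.
type leakyConn struct {
	readCh chan []byte
}

// Read blocks on the channel receive. If no CLOSE_RSP ever arrives and
// readCh is never closed, any goroutine sitting here (e.g. net/http's
// readLoop in the traces above) is stuck forever. Closing readCh unblocks
// the receive with ok == false, and Read returns io.EOF.
func (c *leakyConn) Read(p []byte) (int, error) {
	b, ok := <-c.readCh
	if !ok {
		return 0, io.EOF
	}
	return copy(p, b), nil
}
```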