ccl/sqlproxyccl: fix inaccurate CurConnCount metric due to goroutine leak #82652
Conversation
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/ccl/sqlproxyccl/proxy_handler_test.go
line 1331 at r1 (raw file):
// Previously, there was a regression where the CurConnCount metric wasn't
// decremented whenever the connections were closed due to a goroutine leak.
func TestCurConnCountMetric(t *testing.T) {
I cannot reproduce this issue with a test locally. Putting this up to see how the TC CI behaves with stress. However, I managed to produce this with an external script, and confirmed that the patch does fix the issue as described.
COCKROACH_TRUST_CLIENT_PROVIDED_SQL_REMOTE_ADDR=true ./cockroach demo --insecure
...
./cockroach mt start-proxy --skip-verify --disable-connection-rebalancing --insecure --routing-rule=127.0.0.1:26257 --listen-metrics=127.0.0.1:9080 --log="{ sinks : { stderr : { filter : info } } }"
...
# Starts X connections, runs Ping() queries, and closes the connections right away.
go run demo.go
...
# The connection count would get stuck despite having the demo.go program terminated.
curl -s http://localhost:9080/_status/vars/ | grep proxy_sql_conns
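The demo.go script itself isn't included in the thread. As context, here is a minimal sketch of what such a script might look like, assuming the lib/pq driver, 50 connections, and a proxy SQL listen address of 127.0.0.1:46257 (substitute whatever --listen-addr the proxy actually uses; none of these details are taken from the PR):

```go
// demo.go (hypothetical reconstruction): opens a batch of connections through
// the proxy, pings each one, and closes it right away. The driver, address,
// and connection count are assumptions, not taken from the PR.
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq"
)

func main() {
	const numConns = 50
	for i := 0; i < numConns; i++ {
		db, err := sql.Open("postgres",
			"postgresql://root@127.0.0.1:46257/defaultdb?sslmode=disable")
		if err != nil {
			log.Fatal(err)
		}
		if err := db.Ping(); err != nil {
			log.Fatal(err)
		}
		// Closing immediately after the ping is what exposed the leak: the
		// proxy-side forwarder could end up blocked in waitResumed after the
		// client connection had already gone away.
		if err := db.Close(); err != nil {
			log.Fatal(err)
		}
	}
}
```

After such a script exits, proxy_sql_conns should drop back to 0; with the leak, it stays stuck at a non-zero value.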
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/ccl/sqlproxyccl/proxy_handler_test.go
line 1331 at r1 (raw file):
Previously, jaylim-crl (Jay Lim) wrote…
I cannot reproduce this issue with a test locally. […]
Good news. Looks like the test was failing as expected without the fix, but it took a while: https://teamcity.cockroachdb.com/buildConfiguration/Cockroach_UnitTests/5421651?showRootCauses=false&expandBuildProblemsSection=true&expandBuildChangesSection=true&expandBuildTestsSection=true&showLog=5421649_1266_570.1256.1266&logFilter=debug
06:08:08 182 runs so far, 0 failures, over 17m55s
06:08:08 182 runs so far, 0 failures, over 18m0s
06:08:08 182 runs so far, 0 failures, over 18m5s
...
proxy_handler_test.go:1389: condition failed to evaluate within 45s: expected CurConnCount=0, but got 1
Stack traces show that we were blocked in the waitResumed call, which was the described issue:
goroutine 3133 [sync.Cond.Wait, 13 minutes]:
sync.runtime_notifyListWait(0xc001ba60d0, 0x0)
/usr/local/go/src/runtime/sema.go:513 +0x13d
sync.(*Cond).Wait(0x45d1ce)
/usr/local/go/src/sync/cond.go:56 +0x8c
github.com/cockroachdb/cockroach/pkg/ccl/sqlproxyccl.(*processor).waitResumed(0xc002ac9220, {0x57fe970, 0xc006477cc0})
/go/src/github.com/cockroachdb/cockroach/pkg/ccl/sqlproxyccl/forwarder.go:559 +0xb0
github.com/cockroachdb/cockroach/pkg/ccl/sqlproxyccl.(*forwarder).resumeProcessors(0xc001e7ebe0)
/go/src/github.com/cockroachdb/cockroach/pkg/ccl/sqlproxyccl/forwarder.go:319 +0x11b
github.com/cockroachdb/cockroach/pkg/ccl/sqlproxyccl.(*forwarder).run(0xc001e7ebe0, {0x5863950, 0xc00555aa80}, {0x5863ab0, 0xc0031fe378})
/go/src/github.com/cockroachdb/cockroach/pkg/ccl/sqlproxyccl/forwarder.go:191 +0x91
github.com/cockroachdb/cockroach/pkg/ccl/sqlproxyccl.(*proxyHandler).handle(0xc005007e00, {0x57fea18, 0xc0050b0960}, 0xc004f4b320)
/go/src/github.com/cockroachdb/cockroach/pkg/ccl/sqlproxyccl/proxy_handler.go:407 +0xeb9
github.com/cockroachdb/cockroach/pkg/ccl/sqlproxyccl.(*Server).Serve.func2({0x57fea18, 0xc004f4b350})
/go/src/github.com/cockroachdb/cockroach/pkg/ccl/sqlproxyccl/server.go:179 +0x196
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2()
/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:494 +0x16f
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx
/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:483 +0x445
LGTM
pkg/ccl/sqlproxyccl/forwarder.go
Outdated
// stuck because waiting forever, resulting in a leak.
request, response := f.getProcessors()
if request != nil {
	request.broadcast()
I like this solution, but it is breaking the pattern that typically applies to condition variables: i.e. they are paired with some mutex and the condition variable is notified when the state guarded by the mutex changes.
Alternatives to consider:
- Add a "closed" state to the processors. Call .Close() in forwarder.Close() and check the closed state in the wait* functions. I know I pushed back on just such a state before, but in retrospec I think it is a mistake to mix channels (context cancellation) and condition variables.
- Remove the condition variable and use channels to synchronize instead. resume() could accept or return a channel that it signals when the resume is complete. ditto for suspend. This implies that concurrent calls should be rejected, but that is probably fine.
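For illustration, a minimal sketch of the closed-state pattern described in the first alternative. This is not the PR's actual code: the field and method names are hypothetical stand-ins, and the real processor in forwarder.go carries more state.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errProcessorClosed is a hypothetical sentinel for this sketch.
var errProcessorClosed = errors.New("processor is closed")

// processor is a pared-down stand-in for the real type in forwarder.go.
type processor struct {
	mu struct {
		sync.Mutex
		cond    *sync.Cond // paired with the mutex; broadcast on every state change
		resumed bool
		closed  bool
	}
}

func newProcessor() *processor {
	p := &processor{}
	p.mu.cond = sync.NewCond(&p.mu.Mutex)
	return p
}

// waitResumed blocks until the processor has resumed, or returns an error once
// the processor is closed, instead of waiting forever on a connection that has
// already gone away.
func (p *processor) waitResumed() error {
	p.mu.Lock()
	defer p.mu.Unlock()
	for !p.mu.resumed {
		if p.mu.closed {
			return errProcessorClosed
		}
		p.mu.cond.Wait()
	}
	return nil
}

// close marks the processor as closed and wakes all waiters so they can
// observe the new state under the same mutex that guards it.
func (p *processor) close() {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.mu.closed = true
	p.mu.cond.Broadcast()
}

func main() {
	p := newProcessor()
	done := make(chan error, 1)
	go func() { done <- p.waitResumed() }()
	// Closing the processor unblocks the waiter instead of leaking it.
	p.close()
	fmt.Println(<-done) // prints: processor is closed
}
```

The key property is the one the review comment asks for: the condition variable stays paired with the mutex, and waiters are woken whenever the guarded state (resumed or closed) changes.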
…leak

Previously, there was a possibility where a processor can return from resuming because the client's connection was closed _before_ waitResumed even has the chance to wake up to check on the resumed field. When that happens, the connection goroutine will be blocked forever, and the CurConnCount metric will never be decremented, even if the connection has already been terminated.

When the client's connection was closed, the forwarder's context will be cancelled as well. The ideal behavior would be to terminate all waiters when that happens, but the current code does not do that. This commit fixes that issue by adding a new closed state to the processors, and ensuring that the processor is closed whenever resume returns with an error. waitResumed can then check on this state before going back to wait.

Release note: None
@jeffswenson I went ahead with Alternative 1. This PR is ready for another round of review.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jaylim-crl and @pjtatlow)
@@ -495,6 +507,9 @@ func (p *processor) resume(ctx context.Context) error {
	}

	if err := enterResume(); err != nil {
		if errors.Is(err, errProcessorResumed) {
I don't understand the intent behind this change. Why do we return nil here? Is there something about the PR that introduced a new case to check, or is this cleaning up misbehavior that already existed?
This is to clean up an issue that already existed. resumeProcessors may be called concurrently: once when the connection starts, and again when a transfer is triggered. When that happens, resume returns errProcessorResumed, which causes the forwarder to be closed.
// resumeProcessors starts both the request and response processors
// asynchronously. The forwarder will be closed if any of the processors
// return an error while resuming. This is idempotent as resume() will return
// nil if the processor has already been started.
The comment states that resume should return nil if the processor has already been started.
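A hedged sketch of the idempotent behavior being discussed, reusing the enterResume and errProcessorResumed names from the diff snippet above; everything else (the pared-down processor type, the sentinel's definition, the elided forwarding loop) is assumed for illustration and is not the PR's actual code:

```go
package sketch

import (
	"context"
	"errors"
	"sync"
)

// Hypothetical definition mirroring the sentinel name in the diff.
var errProcessorResumed = errors.New("processor has already been resumed")

// processor is a pared-down stand-in for the real type in forwarder.go.
type processor struct {
	mu struct {
		sync.Mutex
		resumed bool
	}
}

// resume is idempotent: if a concurrent caller has already resumed the
// processor (resumeProcessors can run both when the connection starts and
// when a transfer is triggered), it returns nil instead of an error that
// would cause the forwarder to be closed.
func (p *processor) resume(ctx context.Context) error {
	enterResume := func() error {
		p.mu.Lock()
		defer p.mu.Unlock()
		if p.mu.resumed {
			return errProcessorResumed
		}
		p.mu.resumed = true
		return nil
	}
	if err := enterResume(); err != nil {
		if errors.Is(err, errProcessorResumed) {
			// Another caller already started this processor; nothing to do.
			return nil
		}
		return err
	}
	// ... forwarding loop elided; on returning with an error, the processor
	// would be closed so that waitResumed does not block forever ...
	return nil
}
```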
TFTR! bors r=JeffSwenson
The Bazel CI failure is in a docker image test and is unrelated to this PR.
Build succeeded:
Previously, it was possible for a processor to return from resuming because
the client's connection was closed before waitResumed even had the chance to
wake up and check the resumed field. When that happened, the connection
goroutine was blocked forever, and the CurConnCount metric was never
decremented, even though the connection had already been terminated.

When the client's connection is closed, the forwarder's context is cancelled
as well. The ideal behavior would be to terminate all waiters when that
happens, but the current code does not do that. This commit fixes the issue
by adding a new closed state to the processors and ensuring that the
processor is closed whenever resume returns with an error. waitResumed can
then check this state before going back to wait.

Release note: None