
ccl/sqlproxyccl: fix inaccurate CurConnCount metric due to goroutine leak #82652

Merged: 1 commit merged into cockroachdb:master from the jay/220608-fix-slow-leak branch on Jun 9, 2022

Conversation

@jaylim-crl jaylim-crl (Collaborator) commented Jun 9, 2022

Previously, there was a possibility that a processor could return from
resuming because the client's connection was closed before waitResumed even
had the chance to wake up and check the resumed field. When that happened,
the connection goroutine would be blocked forever, and the CurConnCount metric
would never be decremented, even though the connection had already been
terminated.

When the client's connection is closed, the forwarder's context is cancelled
as well. The ideal behavior would be to terminate all waiters when that
happens, but the current code does not do that. This commit fixes the issue
by adding a new closed state to the processors, and ensuring that a processor
is closed whenever resume returns with an error. waitResumed can then check
this state before going back to waiting.

Release note: None
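
Conceptually, the fix boils down to the pattern below. This is a simplified sketch rather than the actual forwarder.go code; names such as errProcessorClosed and the exact method signatures are assumptions:

```go
package sketch

import (
	"errors"
	"sync"
)

// errProcessorClosed is an assumed sentinel for this sketch.
var errProcessorClosed = errors.New("processor is closed")

// processor is a trimmed-down stand-in for the sqlproxyccl processor.
type processor struct {
	mu struct {
		sync.Mutex
		cond    *sync.Cond
		resumed bool
		closed  bool
	}
}

func newProcessor() *processor {
	p := &processor{}
	p.mu.cond = sync.NewCond(&p.mu.Mutex)
	return p
}

// waitResumed blocks until the processor has resumed or has been closed.
// Context cancellation alone never wakes a condition variable; it is the
// broadcast in close(), invoked whenever resume returns with an error, that
// guarantees waiters are released.
func (p *processor) waitResumed() error {
	p.mu.Lock()
	defer p.mu.Unlock()
	for !p.mu.resumed {
		if p.mu.closed {
			return errProcessorClosed
		}
		p.mu.cond.Wait()
	}
	return nil
}

// close transitions the processor to the closed state and wakes all waiters.
func (p *processor) close() {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.mu.closed = true
	p.mu.cond.Broadcast()
}
```

The key point is that close() broadcasts on the same condition variable that waitResumed sleeps on, so a processor that fails to resume can no longer strand its waiter.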

@cockroach-teamcity (Member): This change is Reviewable

@jaylim-crl jaylim-crl (Collaborator, Author) left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/ccl/sqlproxyccl/proxy_handler_test.go line 1331 at r1 (raw file):

// Previously, there was a regression where the CurConnCount metric wasn't
// decremented whenever the connections were closed due to a goroutine leak.
func TestCurConnCountMetric(t *testing.T) {

I cannot reproduce this issue with a test locally. Putting this up to see how the TC CI behaves under stress. However, I managed to reproduce it with an external script, and confirmed that the patch does fix the issue as described.

COCKROACH_TRUST_CLIENT_PROVIDED_SQL_REMOTE_ADDR=true ./cockroach demo --insecure
...
./cockroach mt start-proxy --skip-verify --disable-connection-rebalancing --insecure --routing-rule=127.0.0.1:26257 --listen-metrics=127.0.0.1:9080 --log="{ sinks : { stderr : { filter : info } } }"
...
# Starts X connections, runs Ping() queries, and closes the connections right away.
go run demo.go
...
# The connection count gets stuck even after the demo.go program has terminated.
curl -s http://localhost:9080/_status/vars/ | grep proxy_sql_conns
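
For reference, the demo.go driver is roughly of the following shape. This is a minimal sketch, not the actual script: the connection URL, the connection count, and the lib/pq driver choice are placeholders, so point the URL at the proxy's --listen-addr.

```go
// demo.go: a sketch of the external repro driver described above.
package main

import (
	"database/sql"
	"log"
	"sync"

	_ "github.com/lib/pq"
)

func main() {
	// Placeholder: adjust to the sqlproxy's --listen-addr.
	const connURL = "postgres://root@127.0.0.1:46257/defaultdb?sslmode=disable"
	const numConns = 200 // the "X connections" from the repro steps

	var wg sync.WaitGroup
	for i := 0; i < numConns; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Open a connection through the proxy, run a single Ping, and
			// close it right away.
			db, err := sql.Open("postgres", connURL)
			if err != nil {
				log.Printf("open: %v", err)
				return
			}
			defer db.Close()
			if err := db.Ping(); err != nil {
				log.Printf("ping: %v", err)
			}
		}()
	}
	wg.Wait()
	// After this program exits, proxy_sql_conns should eventually drop back
	// to zero; before the fix it could get stuck at a non-zero value.
}
```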

@jaylim-crl jaylim-crl (Collaborator, Author) left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/ccl/sqlproxyccl/proxy_handler_test.go line 1331 at r1 (raw file):

Previously, jaylim-crl (Jay Lim) wrote…

(see the repro steps quoted above)

Good news. Looks like the test was failing as expected without the fix, but it took a while: https://teamcity.cockroachdb.com/buildConfiguration/Cockroach_UnitTests/5421651?showRootCauses=false&expandBuildProblemsSection=true&expandBuildChangesSection=true&expandBuildTestsSection=true&showLog=5421649_1266_570.1256.1266&logFilter=debug

06:08:08       182 runs so far, 0 failures, over 17m55s
06:08:08       182 runs so far, 0 failures, over 18m0s
06:08:08       182 runs so far, 0 failures, over 18m5s
...
proxy_handler_test.go:1389: condition failed to evaluate within 45s: expected CurConnCount=0, but got 1

Stack traces show that we were blocked in the waitResumed call, which was the described issue:

goroutine 3133 [sync.Cond.Wait, 13 minutes]:
sync.runtime_notifyListWait(0xc001ba60d0, 0x0)
	/usr/local/go/src/runtime/sema.go:513 +0x13d
sync.(*Cond).Wait(0x45d1ce)
	/usr/local/go/src/sync/cond.go:56 +0x8c
github.com/cockroachdb/cockroach/pkg/ccl/sqlproxyccl.(*processor).waitResumed(0xc002ac9220, {0x57fe970, 0xc006477cc0})
	/go/src/github.com/cockroachdb/cockroach/pkg/ccl/sqlproxyccl/forwarder.go:559 +0xb0
github.com/cockroachdb/cockroach/pkg/ccl/sqlproxyccl.(*forwarder).resumeProcessors(0xc001e7ebe0)
	/go/src/github.com/cockroachdb/cockroach/pkg/ccl/sqlproxyccl/forwarder.go:319 +0x11b
github.com/cockroachdb/cockroach/pkg/ccl/sqlproxyccl.(*forwarder).run(0xc001e7ebe0, {0x5863950, 0xc00555aa80}, {0x5863ab0, 0xc0031fe378})
	/go/src/github.com/cockroachdb/cockroach/pkg/ccl/sqlproxyccl/forwarder.go:191 +0x91
github.com/cockroachdb/cockroach/pkg/ccl/sqlproxyccl.(*proxyHandler).handle(0xc005007e00, {0x57fea18, 0xc0050b0960}, 0xc004f4b320)
	/go/src/github.com/cockroachdb/cockroach/pkg/ccl/sqlproxyccl/proxy_handler.go:407 +0xeb9
github.com/cockroachdb/cockroach/pkg/ccl/sqlproxyccl.(*Server).Serve.func2({0x57fea18, 0xc004f4b350})
	/go/src/github.com/cockroachdb/cockroach/pkg/ccl/sqlproxyccl/server.go:179 +0x196
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2()
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:494 +0x16f
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:483 +0x445
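
For context, the assertion behind the "condition failed to evaluate within 45s" message is the usual retry-until-zero pattern, roughly as sketched below; the helper shape and the way the gauge is passed in are assumptions rather than the exact test code:

```go
package sqlproxyccl_test

import (
	"testing"

	"github.com/cockroachdb/cockroach/pkg/testutils"
	"github.com/cockroachdb/cockroach/pkg/util/metric"
	"github.com/cockroachdb/errors"
)

// requireCurConnCountZero is an illustrative helper, not the real test body:
// it retries until the gauge drains back to zero or the retry budget runs
// out, which is the shape of the failure reported above.
func requireCurConnCountZero(t *testing.T, curConnCount *metric.Gauge) {
	testutils.SucceedsSoon(t, func() error {
		if n := curConnCount.Value(); n != 0 {
			return errors.Newf("expected CurConnCount=0, but got %d", n)
		}
		return nil
	})
}
```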

@jaylim-crl jaylim-crl force-pushed the jay/220608-fix-slow-leak branch from b5ff88f to 65ad80d on June 9, 2022 11:51
@jaylim-crl jaylim-crl marked this pull request as ready for review June 9, 2022 11:59
@jaylim-crl jaylim-crl requested review from a team as code owners June 9, 2022 11:59
@jaylim-crl jaylim-crl requested review from jeffswenson and pjtatlow and removed request for a team June 9, 2022 11:59
@jaylim-crl jaylim-crl force-pushed the jay/220608-fix-slow-leak branch from 65ad80d to f9f42e4 on June 9, 2022 12:06
@jeffswenson jeffswenson (Collaborator) left a comment


LGTM

// stuck because waiting forever, resulting in a leak.
request, response := f.getProcessors()
if request != nil {
	request.broadcast()
@jeffswenson jeffswenson (Collaborator) commented Jun 9, 2022


I like this solution, but it breaks the pattern that typically applies to condition variables: they are paired with some mutex, and the condition variable is notified when the state guarded by the mutex changes.

Alternatives to consider:

  1. Add a "closed" state to the processors. Call .Close() in forwarder.Close() and check the closed state in the wait* functions. I know I pushed back on just such a state before, but in retrospect I think it is a mistake to mix channels (context cancellation) and condition variables.
  2. Remove the condition variable and use channels to synchronize instead. resume() could accept or return a channel that it signals when the resume is complete. Ditto for suspend. This implies that concurrent calls should be rejected, but that is probably fine. (See the sketch after this list.)
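
For comparison, a minimal, self-contained sketch of what option 2 could look like. All names here are illustrative; this is not the code that was ultimately merged:

```go
package sketch

import (
	"context"
	"sync"
)

// processor is an illustrative stand-in for option 2: the resumed condition
// variable is replaced by a channel that is closed exactly once when resume
// has started, so waiters can also observe context cancellation.
type processor struct {
	resumedOnce sync.Once
	resumedCh   chan struct{}
}

func newProcessor() *processor {
	return &processor{resumedCh: make(chan struct{})}
}

// markResumed signals all current and future waiters; sync.Once guards
// against the concurrent-resume case mentioned above.
func (p *processor) markResumed() {
	p.resumedOnce.Do(func() { close(p.resumedCh) })
}

// waitResumed returns once the processor has resumed or the context is
// cancelled, so a closed client connection can no longer strand the waiter.
func (p *processor) waitResumed(ctx context.Context) error {
	select {
	case <-p.resumedCh:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}
```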

@jaylim-crl jaylim-crl force-pushed the jay/220608-fix-slow-leak branch from f9f42e4 to 5a4517f on June 9, 2022 17:12
@jaylim-crl jaylim-crl (Collaborator, Author) left a comment


@jeffswenson I went ahead with Alternative 1. This PR is ready for another round of review.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jaylim-crl and @pjtatlow)

@@ -495,6 +507,9 @@ func (p *processor) resume(ctx context.Context) error {
	}

	if err := enterResume(); err != nil {
		if errors.Is(err, errProcessorResumed) {
Collaborator


I don't understand the intent behind this change. Why do we return nil here? Is there something about the PR that introduced a new case to check, or is this cleaning up misbehavior that already existed?

Collaborator Author


This is to clean up an issue that already existed. resumeProcessors may be called concurrently: once when the connection starts, and again when a transfer gets triggered. When that happens, resume returns errProcessorResumed, which causes the forwarder to be closed.

// resumeProcessors starts both the request and response processors
// asynchronously. The forwarder will be closed if any of the processors
// return an error while resuming. This is idempotent as resume() will return
// nil if the processor has already been started.

The comment states that resume should return nil if the processor has already been started.
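
A self-contained sketch of that idempotent behavior, using illustrative names rather than the actual forwarder.go internals:

```go
package sketch

import (
	"context"
	"errors"
	"sync"
)

var errProcessorResumed = errors.New("processor has already been resumed")

// processor is an illustrative stand-in; only the idempotency handling of
// concurrent resume calls is modeled here.
type processor struct {
	mu      sync.Mutex
	resumed bool
}

// enterResume marks the processor as resumed, or reports that another caller
// got there first.
func (p *processor) enterResume() error {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.resumed {
		return errProcessorResumed
	}
	p.resumed = true
	return nil
}

// resume is idempotent: a caller that loses the race observes
// errProcessorResumed and returns nil, so resumeProcessors never interprets
// the benign race as a fatal error that would close the forwarder.
func (p *processor) resume(ctx context.Context) error {
	if err := p.enterResume(); err != nil {
		if errors.Is(err, errProcessorResumed) {
			return nil
		}
		return err
	}
	// The real processor forwards packets here until an error occurs or the
	// context is cancelled; that part is elided from this sketch.
	<-ctx.Done()
	return ctx.Err()
}
```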

pkg/ccl/sqlproxyccl/forwarder.go (comment resolved)
@jaylim-crl (Collaborator, Author) commented:

TFTR!

bors r=JeffSwenson

@jaylim-crl (Collaborator, Author) commented:

The Bazel CI failure is related to a Docker image test and is unrelated to this PR.

@craig craig bot (Contributor) commented Jun 9, 2022

Build succeeded:

@craig craig bot merged commit 7b88642 into cockroachdb:master Jun 9, 2022
@jaylim-crl jaylim-crl deleted the jay/220608-fix-slow-leak branch June 9, 2022 22:04