
ccl/sqlproxyccl: fix inaccurate CurConnCount metric due to goroutine leak #82652

Merged: 1 commit merged into cockroachdb:master from the jay/220608-fix-slow-leak branch on Jun 9, 2022

Conversation

@jaylim-crl jaylim-crl (Collaborator) commented Jun 9, 2022

Previously, there was a possibility that a processor could return from
resuming because the client's connection was closed before waitResumed even
had the chance to wake up and check the resumed field. When that happened,
the connection goroutine would be blocked forever, and the CurConnCount metric
would never be decremented, even though the connection had already been
terminated.

When the client's connection is closed, the forwarder's context is cancelled
as well. The ideal behavior would be to terminate all waiters when that
happens, but the current code does not do that. This commit fixes the issue
by adding a new closed state to the processors, and ensuring that a processor
is closed whenever resume returns with an error. waitResumed can then check
this state before going back to waiting.

Release note: None
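
Conceptually, the fix boils down to the pattern below. This is a simplified sketch rather than the actual forwarder.go code; names such as errProcessorClosed and the exact method signatures are assumptions:

```go
package sketch

import (
	"errors"
	"sync"
)

// errProcessorClosed is an assumed sentinel for this sketch.
var errProcessorClosed = errors.New("processor is closed")

// processor is a trimmed-down stand-in for the sqlproxyccl processor.
type processor struct {
	mu struct {
		sync.Mutex
		cond    *sync.Cond
		resumed bool
		closed  bool
	}
}

func newProcessor() *processor {
	p := &processor{}
	p.mu.cond = sync.NewCond(&p.mu.Mutex)
	return p
}

// waitResumed blocks until the processor has resumed or has been closed.
// Context cancellation alone never wakes a condition variable; it is the
// broadcast in close(), invoked whenever resume returns with an error, that
// guarantees waiters are released.
func (p *processor) waitResumed() error {
	p.mu.Lock()
	defer p.mu.Unlock()
	for !p.mu.resumed {
		if p.mu.closed {
			return errProcessorClosed
		}
		p.mu.cond.Wait()
	}
	return nil
}

// close transitions the processor to the closed state and wakes all waiters.
func (p *processor) close() {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.mu.closed = true
	p.mu.cond.Broadcast()
}
```

The key point is that close() broadcasts on the same condition variable that waitResumed sleeps on, so a processor that fails to resume can no longer strand its waiter.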

@cockroach-teamcity (Member): This change is Reviewable

@jaylim-crl jaylim-crl (Collaborator, Author) left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/ccl/sqlproxyccl/proxy_handler_test.go line 1331 at r1 (raw file):

// Previously, there was a regression where the CurConnCount metric wasn't
// decremented whenever the connections were closed due to a goroutine leak.
func TestCurConnCountMetric(t *testing.T) {

I cannot reproduce this issue with a test locally. Putting this up to see how the TC CI behaves under stress. However, I managed to reproduce it with an external script, and confirmed that the patch does fix the issue as described.

COCKROACH_TRUST_CLIENT_PROVIDED_SQL_REMOTE_ADDR=true ./cockroach demo --insecure
...
./cockroach mt start-proxy --skip-verify --disable-connection-rebalancing --insecure --routing-rule=127.0.0.1:26257 --listen-metrics=127.0.0.1:9080 --log="{ sinks : { stderr : { filter : info } } }"
...
# Starts X connections, runs Ping() queries, and closes the connections right away.
go run demo.go
...
# The connection count gets stuck even after the demo.go program has terminated.
curl -s http://localhost:9080/_status/vars/ | grep proxy_sql_conns
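
For reference, the demo.go driver is roughly of the following shape. This is a minimal sketch, not the actual script: the connection URL, the connection count, and the lib/pq driver choice are placeholders, so point the URL at the proxy's --listen-addr.

```go
// demo.go: a sketch of the external repro driver described above.
package main

import (
	"database/sql"
	"log"
	"sync"

	_ "github.com/lib/pq"
)

func main() {
	// Placeholder: adjust to the sqlproxy's --listen-addr.
	const connURL = "postgres://root@127.0.0.1:46257/defaultdb?sslmode=disable"
	const numConns = 200 // the "X connections" from the repro steps

	var wg sync.WaitGroup
	for i := 0; i < numConns; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Open a connection through the proxy, run a single Ping, and
			// close it right away.
			db, err := sql.Open("postgres", connURL)
			if err != nil {
				log.Printf("open: %v", err)
				return
			}
			defer db.Close()
			if err := db.Ping(); err != nil {
				log.Printf("ping: %v", err)
			}
		}()
	}
	wg.Wait()
	// After this program exits, proxy_sql_conns should eventually drop back
	// to zero; before the fix it could get stuck at a non-zero value.
}
```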

@jaylim-crl jaylim-crl (Collaborator, Author) left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/ccl/sqlproxyccl/proxy_handler_test.go line 1331 at r1 (raw file):

Previously, jaylim-crl (Jay Lim) wrote…

(see the repro steps quoted above)

Good news. Looks like the test was failing as expected without the fix, but it took a while: https://teamcity.cockroachdb.com/buildConfiguration/Cockroach_UnitTests/5421651?showRootCauses=false&expandBuildProblemsSection=true&expandBuildChangesSection=true&expandBuildTestsSection=true&showLog=5421649_1266_570.1256.1266&logFilter=debug

06:08:08       182 runs so far, 0 failures, over 17m55s
06:08:08       182 runs so far, 0 failures, over 18m0s
06:08:08       182 runs so far, 0 failures, over 18m5s
...
proxy_handler_test.go:1389: condition failed to evaluate within 45s: expected CurConnCount=0, but got 1

Stack traces show that we were blocked in the waitResumed call, which was the described issue:

goroutine 3133 [sync.Cond.Wait, 13 minutes]:
sync.runtime_notifyListWait(0xc001ba60d0, 0x0)
	/usr/local/go/src/runtime/sema.go:513 +0x13d
sync.(*Cond).Wait(0x45d1ce)
	/usr/local/go/src/sync/cond.go:56 +0x8c
github.com/cockroachdb/cockroach/pkg/ccl/sqlproxyccl.(*processor).waitResumed(0xc002ac9220, {0x57fe970, 0xc006477cc0})
	/go/src/github.com/cockroachdb/cockroach/pkg/ccl/sqlproxyccl/forwarder.go:559 +0xb0
github.com/cockroachdb/cockroach/pkg/ccl/sqlproxyccl.(*forwarder).resumeProcessors(0xc001e7ebe0)
	/go/src/github.com/cockroachdb/cockroach/pkg/ccl/sqlproxyccl/forwarder.go:319 +0x11b
github.com/cockroachdb/cockroach/pkg/ccl/sqlproxyccl.(*forwarder).run(0xc001e7ebe0, {0x5863950, 0xc00555aa80}, {0x5863ab0, 0xc0031fe378})
	/go/src/github.com/cockroachdb/cockroach/pkg/ccl/sqlproxyccl/forwarder.go:191 +0x91
github.com/cockroachdb/cockroach/pkg/ccl/sqlproxyccl.(*proxyHandler).handle(0xc005007e00, {0x57fea18, 0xc0050b0960}, 0xc004f4b320)
	/go/src/github.com/cockroachdb/cockroach/pkg/ccl/sqlproxyccl/proxy_handler.go:407 +0xeb9
github.com/cockroachdb/cockroach/pkg/ccl/sqlproxyccl.(*Server).Serve.func2({0x57fea18, 0xc004f4b350})
	/go/src/github.com/cockroachdb/cockroach/pkg/ccl/sqlproxyccl/server.go:179 +0x196
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2()
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:494 +0x16f
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:483 +0x445
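
For context, the assertion behind the "condition failed to evaluate within 45s" message is the usual retry-until-zero pattern, roughly as sketched below; the helper shape and the way the gauge is passed in are assumptions rather than the exact test code:

```go
package sqlproxyccl_test

import (
	"testing"

	"github.com/cockroachdb/cockroach/pkg/testutils"
	"github.com/cockroachdb/cockroach/pkg/util/metric"
	"github.com/cockroachdb/errors"
)

// requireCurConnCountZero is an illustrative helper, not the real test body:
// it retries until the gauge drains back to zero or the retry budget runs
// out, which is the shape of the failure reported above.
func requireCurConnCountZero(t *testing.T, curConnCount *metric.Gauge) {
	testutils.SucceedsSoon(t, func() error {
		if n := curConnCount.Value(); n != 0 {
			return errors.Newf("expected CurConnCount=0, but got %d", n)
		}
		return nil
	})
}
```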

@jaylim-crl jaylim-crl force-pushed the jay/220608-fix-slow-leak branch from b5ff88f to 65ad80d on June 9, 2022 11:51
@jaylim-crl jaylim-crl marked this pull request as ready for review June 9, 2022 11:59
@jaylim-crl jaylim-crl requested review from a team as code owners June 9, 2022 11:59
@jaylim-crl jaylim-crl requested review from jeffswenson and pjtatlow and removed request for a team June 9, 2022 11:59
@jaylim-crl jaylim-crl force-pushed the jay/220608-fix-slow-leak branch from 65ad80d to f9f42e4 on June 9, 2022 12:06
@jeffswenson jeffswenson (Collaborator) left a comment


LGTM

// stuck because waiting forever, resulting in a leak.
request, response := f.getProcessors()
if request != nil {
	request.broadcast()
@jeffswenson jeffswenson (Collaborator) commented Jun 9, 2022


I like this solution, but it breaks the pattern that typically applies to condition variables: they are paired with some mutex, and the condition variable is notified when the state guarded by the mutex changes.

Alternatives to consider:

  1. Add a "closed" state to the processors. Call .Close() in forwarder.Close() and check the closed state in the wait* functions. I know I pushed back on just such a state before, but in retrospect I think it is a mistake to mix channels (context cancellation) and condition variables.
  2. Remove the condition variable and use channels to synchronize instead. resume() could accept or return a channel that it signals when the resume is complete. Ditto for suspend. This implies that concurrent calls should be rejected, but that is probably fine. (See the sketch after this list.)
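
For comparison, a minimal, self-contained sketch of what option 2 could look like. All names here are illustrative; this is not the code that was ultimately merged:

```go
package sketch

import (
	"context"
	"sync"
)

// processor is an illustrative stand-in for option 2: the resumed condition
// variable is replaced by a channel that is closed exactly once when resume
// has started, so waiters can also observe context cancellation.
type processor struct {
	resumedOnce sync.Once
	resumedCh   chan struct{}
}

func newProcessor() *processor {
	return &processor{resumedCh: make(chan struct{})}
}

// markResumed signals all current and future waiters; sync.Once guards
// against the concurrent-resume case mentioned above.
func (p *processor) markResumed() {
	p.resumedOnce.Do(func() { close(p.resumedCh) })
}

// waitResumed returns once the processor has resumed or the context is
// cancelled, so a closed client connection can no longer strand the waiter.
func (p *processor) waitResumed(ctx context.Context) error {
	select {
	case <-p.resumedCh:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}
```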

@jaylim-crl jaylim-crl force-pushed the jay/220608-fix-slow-leak branch from f9f42e4 to 5a4517f on June 9, 2022 17:12
@jaylim-crl jaylim-crl (Collaborator, Author) left a comment


@jeffswenson I went ahead with Alternative 1. This PR is ready for another round of review.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jaylim-crl and @pjtatlow)

@@ -495,6 +507,9 @@ func (p *processor) resume(ctx context.Context) error {
	}

	if err := enterResume(); err != nil {
		if errors.Is(err, errProcessorResumed) {
Collaborator


I don't understand the intent behind this change. Why do we return nil here? Is there something about the PR that introduced a new case to check, or is this cleaning up misbehavior that already existed?

Collaborator Author


This is to clean up an issue that already existed. resumeProcessors may be called concurrently: once when the connection starts, and again when a transfer gets triggered. When that happens, resume returns errProcessorResumed, which causes the forwarder to be closed.

// resumeProcessors starts both the request and response processors
// asynchronously. The forwarder will be closed if any of the processors
// return an error while resuming. This is idempotent as resume() will return
// nil if the processor has already been started.

The comment states that resume should return nil if the processor has already been started.
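
A self-contained sketch of that idempotent behavior, using illustrative names rather than the actual forwarder.go internals:

```go
package sketch

import (
	"context"
	"errors"
	"sync"
)

var errProcessorResumed = errors.New("processor has already been resumed")

// processor is an illustrative stand-in; only the idempotency handling of
// concurrent resume calls is modeled here.
type processor struct {
	mu      sync.Mutex
	resumed bool
}

// enterResume marks the processor as resumed, or reports that another caller
// got there first.
func (p *processor) enterResume() error {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.resumed {
		return errProcessorResumed
	}
	p.resumed = true
	return nil
}

// resume is idempotent: a caller that loses the race observes
// errProcessorResumed and returns nil, so resumeProcessors never interprets
// the benign race as a fatal error that would close the forwarder.
func (p *processor) resume(ctx context.Context) error {
	if err := p.enterResume(); err != nil {
		if errors.Is(err, errProcessorResumed) {
			return nil
		}
		return err
	}
	// The real processor forwards packets here until an error occurs or the
	// context is cancelled; that part is elided from this sketch.
	<-ctx.Done()
	return ctx.Err()
}
```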

pkg/ccl/sqlproxyccl/forwarder.go (comment resolved)
@jaylim-crl (Collaborator, Author) commented:

TFTR!

bors r=JeffSwenson

@jaylim-crl (Collaborator, Author) commented:

The Bazel CI failure is related to a Docker image test and is unrelated to this PR.

@craig craig bot (Contributor) commented Jun 9, 2022

Build succeeded:

@craig craig bot merged commit 7b88642 into cockroachdb:master Jun 9, 2022
@jaylim-crl jaylim-crl deleted the jay/220608-fix-slow-leak branch June 9, 2022 22:04