sql: fix span leak in connExecutor.close #61438

Merged
merged 2 commits into cockroachdb:master from fix59315 on Mar 9, 2021

Conversation

@asubiotto (Contributor) commented Mar 4, 2021

Release justification: high benefit change to existing functionality. This PR fixes a tracing leak that severely reduces the benefit of the new tracing registry for in-flight spans, since until now we had to bypass adding these important spans to the registry to avoid memory blowups.

PTAL at the individual commits. The first one is the leak fix, the second removes the option to bypass the tracing registry.

Release note: None (no user-observable change)

cc @angelapwen @irfansharif

Fixes #59315

@asubiotto added the do-not-merge label (bors won't merge a PR with this label) on Mar 4, 2021
@cockroach-teamcity (Member): This change is Reviewable

@asubiotto changed the title from "[WIP] sql: fix tracing span leak when cancelling a transaction in stateCommitWait" to "sql: fix span leak when calling cancelSession in stateCommitWait" on Mar 4, 2021
@asubiotto requested review from tbg, andreimatei and a team on March 4, 2021 16:22
@asubiotto (Contributor, Author):

@tbg would you mind helping me out with the contention_event failure? Removing WithBypassRegistry and the logictest config makes contention_event fail with:

    logic.go:2292:

        testdata/logic_test/contention_event:50: WITH spans AS (
          SELECT span_id
          FROM crdb_internal.node_inflight_trace_spans
          WHERE trace_id = crdb_internal.trace_id()
        ), payloads AS (
          SELECT *
          FROM spans, LATERAL crdb_internal.payloads_for_span(spans.span_id)
        ) SELECT count(*) > 0
          FROM payloads
          WHERE payload_type = 'roachpb.ContentionEvent'
          AND crdb_internal.pretty_key(decode(payload_jsonb->>'key', 'base64'), 1) LIKE '/1/"k"/%'
        expected:
            true

        but found (query options: "") :
            false

The statements following that one (including the check against the contention events virtual table) also fail. The logictest passes on the first commit, so it looks like the second commit causes these failures (which is weird, because all it does is remove the bypass-registry option).

@andreimatei (Contributor) left a comment

I don't think what you've done here is quite right. If I do RELEASE SAVEPOINT cockroach_restart; select, I think we want to continue staying in commit wait after the 'select', not in no txn. A transaction needs to finish with a commit/rollback; I think we should stick to this.

Let me ask you this first - you're trying to fix something specifically for commit wait, but don't we have the exact same problem in aborted? There's no finishSQLTxn done before going to aborted, is there? We're similarly waiting for a rollback to come and clean up.

Assuming we indeed need to fix the leak for both aborted and commit wait, I think what we should do is introduce a new event dedicated to the connection being closed, separate from error events coming from running a statement. WDYT?

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @asubiotto and @tbg)


pkg/sql/conn_fsm.go, line 419 at r1 (raw file):

			Next:        stateNoTxn{},
			Action: func(args fsm.Args) error {
				return args.Extended.(*txnState).finishTxn(txnRollback)

This txnRollback feels wrong. It doesn't make sense to talk about rollbacks when you're in commit wait. Why not txnCommit?


pkg/sql/txn_state_test.go, line 641 at r1 (raw file):

			expState:  stateNoTxn{},
			expAdv: expAdvance{
				expCode: advanceOne,

isn't skipBatch what we want here? I think skipBatch is right for all errors, isn't it?

@andreimatei (Contributor) left a comment

And when I say that we have the problem in aborted, by extension I mean that we also have it in open - because, on conn close, we just transition open -> aborted.

I think that, to fix this without introducing a new event type, we need to come as close as possible to executing a rollback statement when the connection is closed (or literally execute that statement, unless there are technical difficulties). As you've seen, the rollback results in an eventTxnFinishCommitted when we're in commit wait.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @asubiotto and @tbg)

@asubiotto (Contributor, Author):

I feel like I'm missing a lot of expertise here, making it difficult to have a productive conversation.

> Let me ask you this first - you're trying to fix something specifically for commit wait, but don't we have the exact same problem in aborted? There's no finishSQLTxn done before going to aborted, is there? We're similarly waiting for a rollback to come and clean up.

We do have the same problem for aborted, as evidenced by #60915 and by looking at conn_fsm. Postgres does what @tbg expects to happen over there.

> Assuming we indeed need to fix the leak for both aborted and commit wait, I think what we should do is introduce a new event dedicated to the connection being closed, separate from error events coming from running a statement. WDYT?

I think this makes the most sense to me and I'm happy to do that, but the event should be more like "clean up this transaction" and transition to state no txn, right? And by event you mean a transactionEvent, not an fsm.Event, right?

@andreimatei (Contributor):

> I feel like I'm missing a lot of expertise here, making it difficult to have a productive conversation.

Next week I can be all yours.

> Let me ask you this first - you're trying to fix something specifically for commit wait, but don't we have the exact same problem in aborted? There's no finishSQLTxn done before going to aborted, is there? We're similarly waiting for a rollback to come and clean up.

> We do have the same problem for aborted, as evidenced by #60915 and by looking at conn_fsm. Postgres does what @tbg expects to happen over there.

That's different though, isn't it? That's about when exactly we roll back the kv txn. This is about how we sometimes never close the "sql txn" - which is pretty much just a span (and also the txnState.cancel which needs to be called). Basically, finishSQLTxn() is what you want. The fact that we roll back the kv txn on connection close is not in question. Right?

> Assuming we indeed need to fix the leak for both aborted and commit wait, I think what we should do is introduce a new event dedicated to the connection being closed, separate from error events coming from running a statement. WDYT?

> I think this makes the most sense to me and I'm happy to do that, but the event should be more like "clean up this transaction" and transition to state no txn, right? And by event you mean a transactionEvent, not an fsm.Event, right?

I mean an fsm.Event. The span you're trying to close is txnState.mu.sp (right?). The txnState is supposed to only be mutated by the state machine (conn_fsm.go).
But actually, what if we simply call finishSQLTxn here:


This would be an exception to the rule that only the conn_fsm changes it, but since the connection is shutting down, who cares. You'd only call it if we're in states aborted or commit wait (and assert that we're in one of those states, because the ApplyWithPayload above is supposed to always transition you to one of these two states).
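
For illustration, here is a minimal, self-contained sketch of that idea (finish the SQL txn when the connection closes while the state machine is still parked in aborted or commit wait). Every name below is a toy stand-in, not the actual connExecutor/txnState code:

// Toy sketch only: none of these types are the real connExecutor/txnState.
package main

import "fmt"

type txnStateKind int

const (
	stateNoTxn txnStateKind = iota
	stateOpen
	stateAborted
	stateCommitWait
)

// span stands in for a tracing span that must be finished exactly once.
type span struct{ finished bool }

func (s *span) Finish() { s.finished = true }

// txnState stands in for the state machine's extended state, which owns the
// transaction's tracing span.
type txnState struct {
	kind txnStateKind
	sp   *span
}

// finishSQLTxn releases per-transaction resources; the part that matters for
// this PR is finishing the tracing span so it leaves the inflight registry.
func (ts *txnState) finishSQLTxn() {
	if ts.sp != nil {
		ts.sp.Finish()
		ts.sp = nil
	}
	ts.kind = stateNoTxn
}

// closeConn models the tail of a connection close: after the state machine
// has been driven as far as a rollback can take it, the connection may still
// sit in aborted or commit wait. No further statements will arrive, so the
// SQL txn must be finished here or its span leaks.
func closeConn(ts *txnState) {
	switch ts.kind {
	case stateNoTxn:
		// Already cleaned up.
	case stateAborted, stateCommitWait:
		ts.finishSQLTxn()
	default:
		// The real code would treat this as an assertion failure.
		panic(fmt.Sprintf("unexpected state %d at close", ts.kind))
	}
}

func main() {
	ts := &txnState{kind: stateCommitWait, sp: &span{}}
	closeConn(ts)
	fmt.Println("leaked span:", ts.sp != nil) // leaked span: false
}

The interesting part is only the switch in closeConn: by the time the connection is torn down nothing else will drive the fsm, so whatever span is still open has to be finished right there.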

@asubiotto (Contributor, Author):

> I feel like I'm missing a lot of expertise here, making it difficult to have a productive conversation.

> Next week I can be all yours.

🙌🏼

> Let me ask you this first - you're trying to fix something specifically for commit wait, but don't we have the exact same problem in aborted? There's no finishSQLTxn done before going to aborted, is there? We're similarly waiting for a rollback to come and clean up.

> We do have the same problem for aborted, as evidenced by #60915 and by looking at conn_fsm. Postgres does what @tbg expects to happen over there.

> That's different though, isn't it? That's about when exactly we roll back the kv txn. This is about how we sometimes never close the "sql txn" - which is pretty much just a span (and also the txnState.cancel which needs to be called). Basically, finishSQLTxn() is what you want. The fact that we roll back the kv txn on connection close is not in question. Right?

Right. I mentioned it because I think they do overlap in terms of state transitions.

> Assuming we indeed need to fix the leak for both aborted and commit wait, I think what we should do is introduce a new event dedicated to the connection being closed, separate from error events coming from running a statement. WDYT?

> I think this makes the most sense to me and I'm happy to do that, but the event should be more like "clean up this transaction" and transition to state no txn, right? And by event you mean a transactionEvent, not an fsm.Event, right?

> I mean an fsm.Event. The span you're trying to close is txnState.mu.sp (right?). The txnState is supposed to only be mutated by the state machine (conn_fsm.go).
> But actually, what if we simply call finishSQLTxn here:
>
> This would be an exception to the rule that only the conn_fsm changes it, but since the connection is shutting down, who cares. You'd only call it if we're in states aborted or commit wait (and assert that we're in one of those states, because the ApplyWithPayload above is supposed to always transition you to one of these two states).

Good point. That was more or less my initial approach but using the fsm seemed the "right" way to do it. I'll prototype this and update the commit.

@asubiotto changed the title from "sql: fix span leak when calling cancelSession in stateCommitWait" to "sql: fix span leak in connExecutor.close" on Mar 4, 2021
@asubiotto removed the do-not-merge label (bors won't merge a PR with this label) on Mar 4, 2021
@asubiotto (Contributor, Author):

Updated the first commit, still need to figure out that contention_event failure.

@irfansharif (Contributor):

Still grokking how logictests work, but prior to this we only ran this particular logictest under the 5node configuration, which does work and is probably what the second commit should revert to:

ac7b6da#diff-0db9c066ae369f681e8878a52ea6917e0d7e8d5adaf92beb84b9e753d6635032R1-R2

@irfansharif (Contributor):

Seems like that logictest doesn't work with any of the fakedist configurations. I'm not really sure why, or how this fake SpanResolver guy comes into play. I can look at it again tomorrow if nobody has any ideas, but for this PR, relegating contention_event to only run under 5node seems fine to me. Also, we could apply the following diff:

diff --git i/pkg/sql/logictest/testdata/logic_test/contention_event w/pkg/sql/logictest/testdata/logic_test/contention_event
index e1f4f9ff84..6416b4f109 100644
--- i/pkg/sql/logictest/testdata/logic_test/contention_event
+++ w/pkg/sql/logictest/testdata/logic_test/contention_event
@@ -44,9 +44,6 @@ user root
 #
 # NB: the contention event is not in our trace span but in one of its
 # children, so it wouldn't be found if we filtered by the trace span ID.
-#
-# NB: this needs the 5node-pretend59315 config because otherwise the span is not
-# tracked.
 query B
 WITH spans AS (
   SELECT span_id

@irfansharif (Contributor):

(Second commit LGTM mod the discussion above.)

@tbg (Member) commented Mar 5, 2021

Thanks @irfansharif for looking! Good catch that this probably just never worked under fakedist. I have no idea why that is; injecting the fakedist seems... innocuous? But it is definitely exactly that thing that breaks it. Locally (on master) I am comparing these two configs

	{
		name:                "1node-fakedist-pretend59315IsFixed",
		numNodes:            1,
		useFakeSpanResolver: true,
		overrideDistSQLMode: "on",
		overrideAutoStats:   "false",
		pretend59315IsFixed: true,
	},
	{
		name:                "1node-pretend59315IsFixed",
		numNodes:            1,
		useFakeSpanResolver: false,
		overrideDistSQLMode: "on",
		overrideAutoStats:   "false",
		pretend59315IsFixed: true,
	},

and the fakedist one fails. 👁️ 👁️ 👁️

@tbg (Member) commented Mar 5, 2021

So the only thing the fake resolver does is inject pretend splits. The logic test also adds its own splits; I wonder if those somehow don't play well together. I disabled the split generation in the span resolver (at least I tried to, by setting numSplits = 0 in fake_span_resolver.go) but it didn't change anything.
Interestingly though, if I orthogonally disable "our own" splits in contention_event, the test passes the statement it fails at above and only fails here:

# Check that there is at least 1 contention event for kv table in the contention
# virtual tables.
query B
SELECT count(*) > 0 FROM crdb_internal.cluster_contention_events WHERE table_id = 53
----
true

Feels like there's something real going on here.

@tbg (Member) commented Mar 5, 2021

Oh, never mind. It passes with this diff:

diff --git a/pkg/sql/physicalplan/fake_span_resolver.go b/pkg/sql/physicalplan/fake_span_resolver.go
index 5be1d92a96..0c16b2c6d4 100644
--- a/pkg/sql/physicalplan/fake_span_resolver.go
+++ b/pkg/sql/physicalplan/fake_span_resolver.go
@@ -80,7 +80,15 @@ func (fit *fakeSpanResolverIterator) Seek(
        }
 
        // Scan the range and keep a list of all potential split keys.
-       kvs, err := fit.txn.Scan(ctx, span.Key, span.EndKey, 0)
+       //
+       // TODO(someone): this scan can have undesired side effects. For
+       // example, it can change when contention occurs and swallows tracing
+       // payloads, leading to unexpected test outcomes as observed in:
+       //
+       // https://github.com/cockroachdb/cockroach/pull/61438
+       // This should use an inconsistent span outside of the txn instead.
+       //kvs, err := fit.txn.Scan(ctx, span.Key, span.EndKey, 0)
+       var kvs []roachpb.KeyValue
+       var err error
        if err != nil {
                log.Errorf(ctx, "error in fake span resolver scan: %s", err)
                fit.err = err

So this is a limitation of fakedist.

@asubiotto mind adding the TODO comment and also a comment here:

useFakeSpanResolver bool
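
For what it's worth, the requested comment on that field could read roughly like the following; the wording is only a suggestion based on the TODO above, not necessarily what actually landed:

// useFakeSpanResolver, if set, installs a fake DistSQL span resolver that
// injects pretend splits. Note that the fake resolver picks split points by
// scanning through the txn, which can have undesired side effects (e.g. it
// can change when contention occurs and swallow tracing payloads), so tests
// like contention_event cannot run under the fakedist configs. See the TODO
// in fake_span_resolver.go and #61438.
useFakeSpanResolver bool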

@asubiotto (Contributor, Author):

Thanks for investigating! Updated and RFAL.

@tbg (Member) left a comment

The second commit (removal of the option) looks good to me. Unfortunately, I can't speak to the other commit.

Reviewed 5 of 5 files at r2, 7 of 7 files at r3.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @asubiotto)

First commit message:

    Release justification: low risk, high benefit change to existing functionality.

    This commit adds a call to finishSQLTxn in connExecutor.close. Previously, not
    doing so could result in resource leaks, most recently observed as unfinished
    spans.

    Release note: None (no user-visible change)

Second commit message:

    Release justification: low risk, high benefit change to existing functionality.

    Release note: None (transparent to the end-user)
@asubiotto (Contributor, Author):

Friendly ping @andreimatei for a review of the first commit

@andreimatei (Contributor) left a comment

:lgtm: but no test? :) Sounds like with this new span registry we can nicely test for leaks?

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @asubiotto and @tbg)

@asubiotto (Contributor, Author):

TFTRs! We already check for leaks, which is why this span had to be created with the WithBypassRegistry option until we tracked down the leak. Now that the option is removed, there should be no regression.
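
As an aside, the kind of leak check being referred to can be sketched like this; the registry and span types below are toy stand-ins, not the actual tracing registry API:

// Toy sketch of a "no leaked spans" assertion built on an inflight-span
// registry; the types here are stand-ins, not the CockroachDB tracing API.
package main

import (
	"fmt"
	"sync"
)

type registry struct {
	mu    sync.Mutex
	spans map[int]string // span ID -> operation name
	next  int
}

type span struct {
	id  int
	reg *registry
}

// startSpan registers a new span so it is visible as "inflight".
func (r *registry) startSpan(op string) *span {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.next++
	r.spans[r.next] = op
	return &span{id: r.next, reg: r}
}

// Finish removes the span from the registry; a span that is never finished
// stays registered forever, which is exactly the leak this PR fixes.
func (s *span) Finish() {
	s.reg.mu.Lock()
	defer s.reg.mu.Unlock()
	delete(s.reg.spans, s.id)
}

// checkNoLeakedSpans is what a test could assert after closing a connection.
func checkNoLeakedSpans(r *registry) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if len(r.spans) > 0 {
		return fmt.Errorf("%d span(s) still inflight: %v", len(r.spans), r.spans)
	}
	return nil
}

func main() {
	r := &registry{spans: map[int]string{}}
	sp := r.startSpan("sql txn")
	sp.Finish() // without finishing the SQL txn on close, this would be skipped
	fmt.Println(checkNoLeakedSpans(r)) // <nil>
}

A real test along these lines could open a session, leave it in aborted or commit wait, close the connection, and then assert that no spans remain registered.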

bors r=tbg,andreimatei

@craig (bot) commented Mar 9, 2021

Build succeeded.

@craig (bot) merged commit 83a177c into cockroachdb:master on Mar 9, 2021
@asubiotto deleted the fix59315 branch on March 17, 2021 15:21
@rafiss added this to the 21.1 milestone on Apr 22, 2021
Successfully merging this pull request may close these issues:

tracing,sql: remove option to bypass the tracing registry
6 participants