sql: fix span leak in connExecutor.close #61438

Merged
merged 2 commits into cockroachdb:master from fix59315 on Mar 9, 2021

Conversation

@asubiotto (Contributor) commented Mar 4, 2021

Release justification: high benefit change to existing functionality. This PR fixes a tracing leak that severely reduces the benefit of the new tracing registry for in-flight spans, since until now we had to bypass adding these important spans to the registry to avoid memory blowups.

PTAL at the individual commits. The first one is the leak fix, the second removes the option to bypass the tracing registry.

Release note: None (no user-observable change)

cc @angelapwen @irfansharif

Fixes #59315

@asubiotto added the do-not-merge label (bors won't merge a PR with this label) on Mar 4, 2021
@cockroach-teamcity (Member): This change is Reviewable

@asubiotto changed the title from "[WIP] sql: fix tracing span leak when cancelling a transaction in stateCommitWait" to "sql: fix span leak when calling cancelSession in stateCommitWait" on Mar 4, 2021
@asubiotto requested review from tbg, andreimatei and a team on March 4, 2021 16:22
@asubiotto (Contributor, Author):

@tbg would you mind helping me out with the contention_event failure? Removing WithBypassRegistry and the logictest config makes contention_event fail with:

    logic.go:2292:

        testdata/logic_test/contention_event:50: WITH spans AS (
          SELECT span_id
          FROM crdb_internal.node_inflight_trace_spans
          WHERE trace_id = crdb_internal.trace_id()
        ), payloads AS (
          SELECT *
          FROM spans, LATERAL crdb_internal.payloads_for_span(spans.span_id)
        ) SELECT count(*) > 0
          FROM payloads
          WHERE payload_type = 'roachpb.ContentionEvent'
          AND crdb_internal.pretty_key(decode(payload_jsonb->>'key', 'base64'), 1) LIKE '/1/"k"/%'
        expected:
            true

        but found (query options: "") :
            false

The statements following that one (including the check against the contention events virtual table) also fail. The logictest passes on the first commit, so it looks like the second commit causes these failures (which is weird, because all it does is remove the bypass-registry option).

@andreimatei (Contributor) left a comment

I don't think what you've done here is quite right. If I do RELEASE SAVEPOINT cockroach_restart; select, I think we want to continue staying in commit wait after the 'select', not in no txn. A transaction needs to finish with a commit/rollback; I think we should stick to this.

Let me ask you this first - you're trying to fix something specifically for commit wait, but don't we have the exact same problem in aborted? There's no finishSQLTxn done before going to aborted, is there? We're similarly waiting for a rollback to come and clean up.

Assuming we indeed need to fix the leak for both aborted and commit wait, I think what we should do is introduce a new event dedicated to the connection being closed, separate from error events coming from running a statement. WDYT?

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @asubiotto and @tbg)


pkg/sql/conn_fsm.go, line 419 at r1 (raw file):

			Next:        stateNoTxn{},
			Action: func(args fsm.Args) error {
				return args.Extended.(*txnState).finishTxn(txnRollback)

This txnRollback feels wrong. It doesn't make sense to talk about rollbacks when you're in commit wait. Why not txnCommit?


pkg/sql/txn_state_test.go, line 641 at r1 (raw file):

			expState:  stateNoTxn{},
			expAdv: expAdvance{
				expCode: advanceOne,

isn't skipBatch what we want here? I think skipBatch is right for all errors, isn't it?

@andreimatei (Contributor) left a comment

And when I say that we have the problem in aborted, by extension I mean that we also have it in open - because, on conn close, we just transition open -> aborted.

I think that, to fix this without introducing a new event type, we need to come as close as possible to executing a rollback statement when the connection is closed (or literally execute that statement, unless there are technical difficulties). As you've seen, the rollback results in an eventTxnFinishCommitted when we're in commit wait.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @asubiotto and @tbg)

@asubiotto (Contributor, Author):

I feel like I'm missing a lot of expertise here, making it difficult to have a productive conversation.

> Let me ask you this first - you're trying to fix something specifically for commit wait, but don't we have the exact same problem in aborted? There's no finishSQLTxn done before going to aborted, is there? We're similarly waiting for a rollback to come and clean up.

We do have the same problem for aborted, as evidenced by #60915 and by looking at conn_fsm. Postgres does what @tbg expects to happen over there.

> Assuming we indeed need to fix the leak for both aborted and commit wait, I think what we should do is introduce a new event dedicated to the connection being closed, separate from error events coming from running a statement. WDYT?

I think this makes the most sense to me and I'm happy to do that, but the event should be more like "clean up this transaction" and transition to state no txn, right? And by event you mean a transactionEvent, not an fsm.Event, right?

@andreimatei (Contributor):

> I feel like I'm missing a lot of expertise here, making it difficult to have a productive conversation.

Next week I can be all yours.

> Let me ask you this first - you're trying to fix something specifically for commit wait, but don't we have the exact same problem in aborted? There's no finishSQLTxn done before going to aborted, is there? We're similarly waiting for a rollback to come and clean up.

> We do have the same problem for aborted, as evidenced by #60915 and by looking at conn_fsm. Postgres does what @tbg expects to happen over there.

That's different though, isn't it? That's about when exactly we roll back the kv txn. This is about how we sometimes never close the "sql txn" - which is pretty much just a span (and also the txnState.cancel which needs to be called). Basically, finishSQLTxn() is what you want. The fact that we roll back the kv txn on connection close is not in question. Right?

> Assuming we indeed need to fix the leak for both aborted and commit wait, I think what we should do is introduce a new event dedicated to the connection being closed, separate from error events coming from running a statement. WDYT?

> I think this makes the most sense to me and I'm happy to do that, but the event should be more like "clean up this transaction" and transition to state no txn, right? And by event you mean a transactionEvent, not an fsm.Event, right?

I mean an fsm.Event. The span you're trying to close is txnState.mu.sp (right?). The txnState is supposed to only be mutated by the state machine (conn_fsm.go).
But actually, what if we simply call finishSQLTxn here:


This would be an exception to the rule that only the conn_fsm changes it, but since the connection is shutting down, who cares. You'd only call it if we're in states aborted or commit wait (and assert that we're in one of those states, because the ApplyWithPayload above is supposed to always transition you to one of these two states).
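
For illustration, here is a minimal, self-contained sketch of that idea (finish the SQL txn when the connection closes while the state machine is still parked in aborted or commit wait). Every name below is a toy stand-in, not the actual connExecutor/txnState code:

// Toy sketch only: none of these types are the real connExecutor/txnState.
package main

import "fmt"

type txnStateKind int

const (
	stateNoTxn txnStateKind = iota
	stateOpen
	stateAborted
	stateCommitWait
)

// span stands in for a tracing span that must be finished exactly once.
type span struct{ finished bool }

func (s *span) Finish() { s.finished = true }

// txnState stands in for the state machine's extended state, which owns the
// transaction's tracing span.
type txnState struct {
	kind txnStateKind
	sp   *span
}

// finishSQLTxn releases per-transaction resources; the part that matters for
// this PR is finishing the tracing span so it leaves the inflight registry.
func (ts *txnState) finishSQLTxn() {
	if ts.sp != nil {
		ts.sp.Finish()
		ts.sp = nil
	}
	ts.kind = stateNoTxn
}

// closeConn models the tail of a connection close: after the state machine
// has been driven as far as a rollback can take it, the connection may still
// sit in aborted or commit wait. No further statements will arrive, so the
// SQL txn must be finished here or its span leaks.
func closeConn(ts *txnState) {
	switch ts.kind {
	case stateNoTxn:
		// Already cleaned up.
	case stateAborted, stateCommitWait:
		ts.finishSQLTxn()
	default:
		// The real code would treat this as an assertion failure.
		panic(fmt.Sprintf("unexpected state %d at close", ts.kind))
	}
}

func main() {
	ts := &txnState{kind: stateCommitWait, sp: &span{}}
	closeConn(ts)
	fmt.Println("leaked span:", ts.sp != nil) // leaked span: false
}

The interesting part is only the switch in closeConn: by the time the connection is torn down nothing else will drive the fsm, so whatever span is still open has to be finished right there.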

@asubiotto (Contributor, Author):

> I feel like I'm missing a lot of expertise here, making it difficult to have a productive conversation.

> Next week I can be all yours.

🙌🏼

> Let me ask you this first - you're trying to fix something specifically for commit wait, but don't we have the exact same problem in aborted? There's no finishSQLTxn done before going to aborted, is there? We're similarly waiting for a rollback to come and clean up.

> We do have the same problem for aborted, as evidenced by #60915 and by looking at conn_fsm. Postgres does what @tbg expects to happen over there.

> That's different though, isn't it? That's about when exactly we roll back the kv txn. This is about how we sometimes never close the "sql txn" - which is pretty much just a span (and also the txnState.cancel which needs to be called). Basically, finishSQLTxn() is what you want. The fact that we roll back the kv txn on connection close is not in question. Right?

Right. I mentioned it because I think they do overlap in terms of state transitions.

> Assuming we indeed need to fix the leak for both aborted and commit wait, I think what we should do is introduce a new event dedicated to the connection being closed, separate from error events coming from running a statement. WDYT?

> I think this makes the most sense to me and I'm happy to do that, but the event should be more like "clean up this transaction" and transition to state no txn, right? And by event you mean a transactionEvent, not an fsm.Event, right?

> I mean an fsm.Event. The span you're trying to close is txnState.mu.sp (right?). The txnState is supposed to only be mutated by the state machine (conn_fsm.go).
> But actually, what if we simply call finishSQLTxn here:
>
> This would be an exception to the rule that only the conn_fsm changes it, but since the connection is shutting down, who cares. You'd only call it if we're in states aborted or commit wait (and assert that we're in one of those states, because the ApplyWithPayload above is supposed to always transition you to one of these two states).

Good point. That was more or less my initial approach but using the fsm seemed the "right" way to do it. I'll prototype this and update the commit.

@asubiotto changed the title from "sql: fix span leak when calling cancelSession in stateCommitWait" to "sql: fix span leak in connExecutor.close" on Mar 4, 2021
@asubiotto removed the do-not-merge label (bors won't merge a PR with this label) on Mar 4, 2021
@asubiotto (Contributor, Author):

Updated the first commit, still need to figure out that contention_event failure.

@irfansharif (Contributor):

Still grokking how logictests work, but prior to this we only ran this particular logictest under the 5node configuration, which does work and is probably what the second commit should revert to:

ac7b6da#diff-0db9c066ae369f681e8878a52ea6917e0d7e8d5adaf92beb84b9e753d6635032R1-R2

@irfansharif (Contributor):

Seems like that logictest doesn't work with any of the fakedist configurations. I'm not really sure why, or how this fake SpanResolver guy comes into play. I can look at it again tomorrow if nobody has any ideas, but for this PR, relegating contention_event to only run under 5node seems fine to me. Also, we could apply the following diff:

diff --git i/pkg/sql/logictest/testdata/logic_test/contention_event w/pkg/sql/logictest/testdata/logic_test/contention_event
index e1f4f9ff84..6416b4f109 100644
--- i/pkg/sql/logictest/testdata/logic_test/contention_event
+++ w/pkg/sql/logictest/testdata/logic_test/contention_event
@@ -44,9 +44,6 @@ user root
 #
 # NB: the contention event is not in our trace span but in one of its
 # children, so it wouldn't be found if we filtered by the trace span ID.
-#
-# NB: this needs the 5node-pretend59315 config because otherwise the span is not
-# tracked.
 query B
 WITH spans AS (
   SELECT span_id

@irfansharif (Contributor):

(Second commit LGTM mod the discussion above.)

@tbg (Member) commented Mar 5, 2021

Thanks @irfansharif for looking! Good catch that this probably just never worked under fakedist. I have no idea why that is; injecting the fakedist seems... innocuous? But it is definitely exactly that thing that breaks it. Locally (on master) I am comparing these two configs

	{
		name:                "1node-fakedist-pretend59315IsFixed",
		numNodes:            1,
		useFakeSpanResolver: true,
		overrideDistSQLMode: "on",
		overrideAutoStats:   "false",
		pretend59315IsFixed: true,
	},
	{
		name:                "1node-pretend59315IsFixed",
		numNodes:            1,
		useFakeSpanResolver: false,
		overrideDistSQLMode: "on",
		overrideAutoStats:   "false",
		pretend59315IsFixed: true,
	},

and the fakedist one fails. 👁️ 👁️ 👁️

@tbg (Member) commented Mar 5, 2021

So the only thing the fake resolver does is inject pretend splits. The logic test also adds its own splits; I wonder if those somehow don't play well together. I disabled the split generation in the span resolver (at least I tried to, by setting numSplits = 0 in fake_span_resolver.go) but it didn't change anything.
Interestingly though, if I orthogonally disable "our own" splits in contention_event, the test passes the statement it fails at above and only fails here:

# Check that there is at least 1 contention event for kv table in the contention
# virtual tables.
query B
SELECT count(*) > 0 FROM crdb_internal.cluster_contention_events WHERE table_id = 53
----
true

Feels like there's something real going on here.

@tbg (Member) commented Mar 5, 2021

Oh, never mind. It passes with this diff:

diff --git a/pkg/sql/physicalplan/fake_span_resolver.go b/pkg/sql/physicalplan/fake_span_resolver.go
index 5be1d92a96..0c16b2c6d4 100644
--- a/pkg/sql/physicalplan/fake_span_resolver.go
+++ b/pkg/sql/physicalplan/fake_span_resolver.go
@@ -80,7 +80,15 @@ func (fit *fakeSpanResolverIterator) Seek(
        }
 
        // Scan the range and keep a list of all potential split keys.
-       kvs, err := fit.txn.Scan(ctx, span.Key, span.EndKey, 0)
+       //
+       // TODO(someone): this scan can have undesired side effects. For
+       // example, it can change when contention occurs and swallows tracing
+       // payloads, leading to unexpected test outcomes as observed in:
+       //
+       // https://github.com/cockroachdb/cockroach/pull/61438
+       // This should use an inconsistent span outside of the txn instead.
+       //kvs, err := fit.txn.Scan(ctx, span.Key, span.EndKey, 0)
+       var kvs []roachpb.KeyValue
+       var err error
        if err != nil {
                log.Errorf(ctx, "error in fake span resolver scan: %s", err)
                fit.err = err

So this is a limitation of fakedist.

@asubiotto mind adding the TODO comment and also a comment here:

useFakeSpanResolver bool
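
For what it's worth, the requested comment on that field could read roughly like the following; the wording is only a suggestion based on the TODO above, not necessarily what actually landed:

// useFakeSpanResolver, if set, installs a fake DistSQL span resolver that
// injects pretend splits. Note that the fake resolver picks split points by
// scanning through the txn, which can have undesired side effects (e.g. it
// can change when contention occurs and swallow tracing payloads), so tests
// like contention_event cannot run under the fakedist configs. See the TODO
// in fake_span_resolver.go and #61438.
useFakeSpanResolver bool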

@asubiotto (Contributor, Author):

Thanks for investigating! Updated and RFAL.

@tbg (Member) left a comment

The second commit (removal of the option) looks good to me. Unfortunately, I can't speak to the other commit.

Reviewed 5 of 5 files at r2, 7 of 7 files at r3.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @asubiotto)

First commit message:

    Release justification: low risk, high benefit change to existing functionality.

    This commit adds a call to finishSQLTxn in connExecutor.close. Previously, not
    doing so could result in resource leaks, most recently observed as unfinished
    spans.

    Release note: None (no user-visible change)

Second commit message:

    Release justification: low risk, high benefit change to existing functionality.

    Release note: None (transparent to the end-user)
@asubiotto (Contributor, Author):

Friendly ping @andreimatei for a review of the first commit

@andreimatei (Contributor) left a comment

:lgtm: but no test? :) Sounds like with this new span registry we can nicely test for leaks?

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @asubiotto and @tbg)

@asubiotto (Contributor, Author):

TFTRs! We already check for leaks, which is why this span had to be created with the WithBypassRegistry option until we tracked down the leak. Now that the option is removed, there should be no regression.
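
As an aside, the kind of leak check being referred to can be sketched like this; the registry and span types below are toy stand-ins, not the actual tracing registry API:

// Toy sketch of a "no leaked spans" assertion built on an inflight-span
// registry; the types here are stand-ins, not the CockroachDB tracing API.
package main

import (
	"fmt"
	"sync"
)

type registry struct {
	mu    sync.Mutex
	spans map[int]string // span ID -> operation name
	next  int
}

type span struct {
	id  int
	reg *registry
}

// startSpan registers a new span so it is visible as "inflight".
func (r *registry) startSpan(op string) *span {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.next++
	r.spans[r.next] = op
	return &span{id: r.next, reg: r}
}

// Finish removes the span from the registry; a span that is never finished
// stays registered forever, which is exactly the leak this PR fixes.
func (s *span) Finish() {
	s.reg.mu.Lock()
	defer s.reg.mu.Unlock()
	delete(s.reg.spans, s.id)
}

// checkNoLeakedSpans is what a test could assert after closing a connection.
func checkNoLeakedSpans(r *registry) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if len(r.spans) > 0 {
		return fmt.Errorf("%d span(s) still inflight: %v", len(r.spans), r.spans)
	}
	return nil
}

func main() {
	r := &registry{spans: map[int]string{}}
	sp := r.startSpan("sql txn")
	sp.Finish() // without finishing the SQL txn on close, this would be skipped
	fmt.Println(checkNoLeakedSpans(r)) // <nil>
}

A real test along these lines could open a session, leave it in aborted or commit wait, close the connection, and then assert that no spans remain registered.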

bors r=tbg,andreimatei

@craig (bot) commented Mar 9, 2021

Build succeeded.

@craig (bot) merged commit 83a177c into cockroachdb:master on Mar 9, 2021
@asubiotto deleted the fix59315 branch on March 17, 2021 15:21
@rafiss added this to the 21.1 milestone on Apr 22, 2021
Successfully merging this pull request may close these issues:

tracing,sql: remove option to bypass the tracing registry
6 participants