sql: fix span leak in connExecutor.close #61438
@tbg would you mind helping me out with the
The statements following that (including checking the contention virtual events table) also fail. The logictest runs fine on the first commit, so it looks like it's the second commit that causes these failures (which is weird because all we're doing is removing the bypass registry option).
I think what you've done here is quite right. If I do `RELEASE SAVEPOINT cockroach_restart; select`, I think we want to continue staying in `commit wait` after 'select', not in `no txn`. A transaction needs to finish with `commit/rollback`; I think we should stick to this.
Let me ask you this first - you're trying to fix something specifically for `commit wait`, but don't we have the exact same problem in `aborted`? There's no `finishSQLTxn` done before going to `aborted`, is there? We're similarly waiting for a `rollback` to come and clean up.
Assuming we indeed need to fix the leak for both `aborted` and `commit wait`, I think what we should do is introduce a new event dedicated to the connection being closed, separate from error events coming from running a statement. WDYT?
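To make the proposal concrete, here is a minimal, self-contained sketch of a state machine with a dedicated connection-closed event. Everything in it (`toyTxn`, `txnPhase`, `evStmtError`, `evConnClosed`) is a hypothetical stand-in invented for illustration and is not the real `conn_fsm` API. It only shows the intended behavior: a statement error never ends the transaction, while the close event always cleans up and lands in no txn.

```go
package main

import "fmt"

// txnPhase is a toy stand-in for the transaction states discussed above.
// It is NOT CockroachDB's actual state representation.
type txnPhase int

const (
	phaseNoTxn txnPhase = iota
	phaseOpen
	phaseAborted
	phaseCommitWait
)

// event is a toy stand-in for FSM events. evConnClosed is the hypothetical
// dedicated event proposed in the comment above.
type event int

const (
	evStmtError event = iota
	evConnClosed
)

// toyTxn tracks the current phase and whether the "SQL txn" cleanup ran.
type toyTxn struct {
	phase    txnPhase
	finished bool
}

// apply performs one transition of the toy state machine.
func (t *toyTxn) apply(ev event) {
	switch ev {
	case evStmtError:
		// Assumption: a statement error moves open -> aborted and leaves
		// aborted and commit wait untouched; the client still has to send
		// a commit/rollback to finish the transaction.
		if t.phase == phaseOpen {
			t.phase = phaseAborted
		}
	case evConnClosed:
		// The dedicated event: whatever phase we are in, run the cleanup
		// once and transition to no txn.
		if t.phase != phaseNoTxn {
			t.finished = true
			t.phase = phaseNoTxn
		}
	}
}

func main() {
	txn := &toyTxn{phase: phaseCommitWait}
	txn.apply(evStmtError)
	fmt.Println("still in commit wait:", txn.phase == phaseCommitWait)
	txn.apply(evConnClosed)
	fmt.Println("cleaned up:", txn.finished, "no txn:", txn.phase == phaseNoTxn)
}
```

Keeping the two events separate is the point of the sketch: the error path keeps its existing semantics, and only the close path is allowed to end the transaction without a commit/rollback.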
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @asubiotto and @tbg)
pkg/sql/conn_fsm.go, line 419 at r1 (raw file):
Next: stateNoTxn{},
Action: func(args fsm.Args) error {
	return args.Extended.(*txnState).finishTxn(txnRollback)
This `txnRollback` feels wrong. It doesn't make sense to talk about rollbacks when you're in `commit wait`. Why not `txnCommit`?
pkg/sql/txn_state_test.go, line 641 at r1 (raw file):
expState: stateNoTxn{},
expAdv: expAdvance{
	expCode: advanceOne,
Isn't `skipBatch` what we want here? I think `skipBatch` is right for all errors, isn't it?
And when I say that we have the problem in `aborted`, by extension I mean that we also have it in `open`, because on conn close we just transition `open` -> `aborted`.
I think to fix this without introducing a new event type, we need to come as close as possible to executing a `rollback` statement when the connection is closed (or literally execute the statement unless there are technical difficulties). As you've seen, the `rollback` results in an `eventTxnFinishCommitted` when we're in `commit wait`.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @asubiotto and @tbg)
I feel like I'm missing a lot of expertise here, making it difficult to have a productive conversation.
We do have the same problem for aborted as evidenced by this issue (#60915) and looking at
I think this makes most sense to me and I'm happy to do that, but the event should be more like "cleanup this transaction" and transition to state no txn, right? And by event you mean a
Next week I can be all yours.
That's different tho, isn't it? That's about when exactly we rollback the kv txn. This is about how we sometimes never close the "sql txn" - which is pretty much just a span (and also the
I mean an
cockroach/pkg/sql/conn_executor.go Line 828 in 67099fa
This would be an exception from the rule that only the conn_fsm changes it, but since the connection is shutting down, who cares. You'd only call it if we're in states aborted or commit wait (and assert that we're in one of those states, because the ApplyWithPayload above is supposed to always transition you to one of these two states).
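The shape of that suggestion can be sketched as follows. This is a toy model under stated assumptions, not the actual `connExecutor` code: `toyExecutor`, `toySpan`, and the state constants are invented for illustration, and the `finishSQLTxn` method here only mimics the role the real one plays (finishing the span that represents the "sql txn").

```go
package main

import "fmt"

// toyState is a hypothetical stand-in for the connection's txn state.
type toyState int

const (
	stateNoTxn toyState = iota
	stateOpen
	stateAborted
	stateCommitWait
)

// toySpan stands in for the tracing span owned by the "sql txn".
type toySpan struct{ finished bool }

func (s *toySpan) Finish() { s.finished = true }

// toyExecutor is a toy model of a connection executor.
type toyExecutor struct {
	state toyState
	span  *toySpan
}

// finishSQLTxn releases the resources tied to the "sql txn" and resets the
// state. In this sketch that is just the span.
func (ex *toyExecutor) finishSQLTxn() {
	if ex.span != nil {
		ex.span.Finish()
		ex.span = nil
	}
	ex.state = stateNoTxn
}

// close models connection shutdown: abort an open txn through the normal
// transition, assert the resulting state, then clean up directly.
func (ex *toyExecutor) close() error {
	if ex.state == stateNoTxn {
		return nil // nothing to clean up
	}
	if ex.state == stateOpen {
		// Stands in for the state-machine transition applied on close.
		ex.state = stateAborted
	}
	if ex.state != stateAborted && ex.state != stateCommitWait {
		return fmt.Errorf("unexpected state %d on close", ex.state)
	}
	// Exception to the "only the FSM changes the state" rule, acceptable
	// because the connection is going away.
	ex.finishSQLTxn()
	return nil
}

func main() {
	span := &toySpan{}
	ex := &toyExecutor{state: stateCommitWait, span: span}
	if err := ex.close(); err != nil {
		fmt.Println("close failed:", err)
		return
	}
	fmt.Println("span finished:", span.finished)
}
```

The assertion before the cleanup mirrors the reasoning in the comment: if the close-time transition ever left the executor in some other state, the direct call would be hiding a bug rather than fixing a leak.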
🙌🏼
Right. I mentioned it because I think they do overlap in terms of state transitions.
Good point. That was more or less my initial approach but using the fsm seemed the "right" way to do it. I'll prototype this and update the commit.
Updated the first commit, still need to figure out that
Still grokking how logictests work, but previous to this we only ran the particular logictest under 5node, which is a configuration that does work and should probably be what the second commit reverts to: ac7b6da#diff-0db9c066ae369f681e8878a52ea6917e0d7e8d5adaf92beb84b9e753d6635032R1-R2
Seems like that logictest doesn't work with any of the fakedist configurations. I'm not really sure why, or how this fake SpanResolver guy comes into play. I can look at it again tomorrow if nobody has any ideas, but for this PR, relegating contention_event to only run under 5node seems fine to me. Also, we could apply the following diff:
diff --git i/pkg/sql/logictest/testdata/logic_test/contention_event w/pkg/sql/logictest/testdata/logic_test/contention_event
index e1f4f9ff84..6416b4f109 100644
--- i/pkg/sql/logictest/testdata/logic_test/contention_event
+++ w/pkg/sql/logictest/testdata/logic_test/contention_event
@@ -44,9 +44,6 @@ user root
#
# NB: the contention event is not in our trace span but in one of its
# children, so it wouldn't be found if we filtered by the trace span ID.
-#
-# NB: this needs the 5node-pretend59315 config because otherwise the span is not
-# tracked.
query B
WITH spans AS (
SELECT span_id
(Second commit LGTM mod the discussion above.)
Thanks @irfansharif for looking! Good catch that this probably just never worked under
and the fakedist one fails. 👁️ 👁️ 👁️
So the only thing that fake resolver does is inject pretend-splits. The logic test also already adds its own splits; I wonder if those somehow don't play together well. I disabled the split generation in the span resolver (at least tried to - setting
Feels like there's something real going on here.
Oh nevermind. Passes with this diff:
diff --git a/pkg/sql/physicalplan/fake_span_resolver.go b/pkg/sql/physicalplan/fake_span_resolver.go
index 5be1d92a96..0c16b2c6d4 100644
--- a/pkg/sql/physicalplan/fake_span_resolver.go
+++ b/pkg/sql/physicalplan/fake_span_resolver.go
@@ -80,7 +80,15 @@ func (fit *fakeSpanResolverIterator) Seek(
}
// Scan the range and keep a list of all potential split keys.
- kvs, err := fit.txn.Scan(ctx, span.Key, span.EndKey, 0)
+ //
+ // TODO(someone): this scan can have undesired side effects. For
+ // example, it can change when contention occurs and swallows tracing
+ // payloads, leading to unexpected test outcomes as observed in:
+ //
+ // https://github.com/cockroachdb/cockroach/pull/61438
+ // This should use an inconsistent span outside of the txn instead.
+ //kvs, err := fit.txn.Scan(ctx, span.Key, span.EndKey, 0)
+ var kvs []roachpb.KeyValue
+ var err error
if err != nil {
log.Errorf(ctx, "error in fake span resolver scan: %s", err)
fit.err = err
So this is a limitation of fakedist. @asubiotto mind adding the TODO comment and also a comment here:
cockroach/pkg/sql/logictest/logic.go Line 436 in 9c24267
Thanks for investigating! Updated and RFAL.
The second commit (removal of the option) looks good to me. Unfortunately, I can't speak to the other commit.
Reviewed 5 of 5 files at r2, 7 of 7 files at r3.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @asubiotto)
Release justification: low risk, high benefit change to existing functionality. This commit adds a call to finishSQLTxn in connExecutor.close. Previously, not doing so could result in resource leaks, most recently observed as unfinished spans. Release note: None (no user-visible change)
Release justification: low risk, high benefit change to existing functionality. Release note: None (transparent to the end-user)
Friendly ping @andreimatei for a review of the first commit
but no test? :) Sounds like with this new span registry we can nicely test for leaks?
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @asubiotto and @tbg)
TFTRs! We already check for leaks, which is why this span had to be created with the
bors r=tbg,andreimatei
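For readers unfamiliar with the idea of checking a span registry for leaks, the following is a rough, self-contained sketch of what such a check can look like. `toyRegistry` and its methods are hypothetical names made up for this example; they do not reflect the real tracing registry's API, only the general technique of tracking in-flight spans and asserting that none remain.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// toyRegistry tracks spans that were started but not yet finished.
// It is a hypothetical illustration, not the real tracing registry.
type toyRegistry struct {
	mu     sync.Mutex
	nextID uint64
	active map[uint64]string // span ID -> operation name
}

func newToyRegistry() *toyRegistry {
	return &toyRegistry{active: make(map[uint64]string)}
}

// start registers a span and returns the function that finishes it.
func (r *toyRegistry) start(op string) (finish func()) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.nextID++
	id := r.nextID
	r.active[id] = op
	return func() {
		r.mu.Lock()
		defer r.mu.Unlock()
		delete(r.active, id)
	}
}

// leaked returns the sorted operation names of all unfinished spans.
func (r *toyRegistry) leaked() []string {
	r.mu.Lock()
	defer r.mu.Unlock()
	ops := make([]string, 0, len(r.active))
	for _, op := range r.active {
		ops = append(ops, op)
	}
	sort.Strings(ops)
	return ops
}

func main() {
	reg := newToyRegistry()
	finishTxn := reg.start("sql txn")
	finishStmt := reg.start("exec stmt")
	finishStmt()

	// The statement span is done, but the txn span is still registered.
	fmt.Println("unfinished spans:", reg.leaked())

	finishTxn()
	fmt.Println("unfinished spans after cleanup:", len(reg.leaked()))
}
```

A test harness built on this idea would call something like `leaked()` at shutdown and fail if the result is non-empty, which is why a span that is never finished has to either be fixed or be kept out of the registry.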
Build succeeded:
Release justification: high benefit change to existing functionality. This PR fixes a tracing leak that severely reduces the benefit of the new tracing registry for in-flight spans, since we previously had to bypass adding these important spans to the registry to avoid memory blowups.
PTAL at the individual commits. The first one is the leak fix, the second removes the option to bypass the tracing registry.
Release note: None (no user-observable change)
cc @angelapwen @irfansharif
Fixes #59315