kv,kvcoord,sql: poison txnCoordSender after a retryable error #74563
Nice job! It will be a big step forward to finally resolve this issue.
Reviewed 10 of 10 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @lidorcarmel)
pkg/kv/txn.go, line 978 at r1 (raw file):
if err == nil {
	txn.mu.Lock()
	txnErr := txn.mu.sender.GetTxnRetryableErr(ctx)
Why do we need to explicitly check this here? Shouldn't the call to txn.Commit() below and the logic in maybeRejectClientLocked be enough to detect a poisoned txn and return an error?
pkg/kv/txn.go, line 1421 at r1 (raw file):
}
// Verify we have an initialized Transaction before creating a new root sender.
if newTxn.ID == (uuid.UUID{}) {
Is this a valid case to hit?
pkg/kv/kvclient/kvcoord/txn_coord_sender.go, line 115 at r1 (raw file):
txnState txnState
// storedRetryableErr is set when tsnState == txnRetryableError. This
s/tsnState/txnState/
pkg/sql/conn_fsm.go, line 377 at r1 (raw file):
Description: "ROLLBACK TO SAVEPOINT (not cockroach_restart) failed because txn needs restart",
Next: stateAborted{},
Action: prepareTxnForRetry,
Are you sure this is where we wanted the call to PrepareForRetry, as opposed to the next event which handles the cockroach_restart savepoint and advances the txnState with a txnRestart event? I'm a little surprised to not see a 1:1 mapping between uses of txnRestart and calls to PrepareForRetry.
Is this related to the need to provide an error to PrepareForRetry?
pkg/kv/db_test.go, line 724 at r1 (raw file):
// Only write if we have not written before, because otherwise we will keep aborting
// the other txn forever.
r, e := hpTxn.Get(ctx, "aa")
nit: s/e/err/ throughout.
pkg/kv/db_test.go, line 755 at r1 (raw file):
	return e
})
require.NoError(t, err)
Want to add require.Equal(t, 1, runNumber) here to assert that we performed exactly one retry?
pkg/kv/db_test.go, line 758 at r1 (raw file):
err1 := db.Txn(context.Background(), func(ctx context.Context, txn *kv.Txn) error {
	// The high priority txn was overridden by the successful retry.
nit: extra space in "successful retry". Also, I think s/overridden/overwritten/ is slightly more accurate.
pkg/kv/db_test.go, line 822 at r1 (raw file):
// db.Txn should return the retryable error that poisoned txn.
expectedErr := (*roachpb.TransactionRetryWithProtoRefreshError)(nil)
require.True(t, errors.As(err, &expectedErr))
Same thing here. Consider a require.Equal(t, 0, runNumber).
pkg/kv/kvclient/kvcoord/txn_coord_sender_test.go, line 705 at r1 (raw file):
{
	// On abort, nothing changes but we get a new priority to use for
	// the next attempt.
It looks like we broke what the test was trying to demonstrate. Can we revive it by adding a call to PrepareForRetry?
hi @nvanbenschoten, it looks like that test failure in rttanalysis was real. When things are slow (for some reason only on bazel), the txn may get aborted by a heartbeat failure. But when we then try to PrepareForRetry without an error, we fail because we cannot find the error: the retryable error created by the heartbeat failure did not poison the handle. So it's a real bug! thanks bazel :)
I tried adding a test, please feel free to tell me that there is a much better/easier way to do this, but this test does the job.
BTW I think that regardless of this change we should call handleRetryableErrLocked() when we see that heartbeat failure.
Last - the lock/unlock stuff is not related to the test failure.
PTAL, thanks!
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei, @lidorcarmel, @nvanbenschoten, and @yuzefovich)
Reviewed 2 of 5 files at r5, 4 of 4 files at r6, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei, @lidorcarmel, and @yuzefovich)
pkg/kv/kvclient/kvcoord/txn_coord_sender.go, line 731 at r6 (raw file):
abortedErr := roachpb.NewErrorWithTxn(
	roachpb.NewTransactionAbortedError(roachpb.ABORT_REASON_CLIENT_REJECT), &tc.mu.txn)
if tc.typ == kv.LeafTxn {
LeafTxns don't use txnHeartbeaters, so I don't see how this case could be possible. I think I just missed that when this was last touched in 1a84480. Do you mind instrumenting it with a fatal and seeing whether we ever hit it in unit tests? If not, I think we should remove it.
pkg/kv/kvclient/kvcoord/txn_coord_sender.go, line 737 at r6 (raw file):
}
// Root txns handle retriable errors.
return roachpb.NewError(tc.handleRetryableErrLocked(ctx, abortedErr))
This is a good catch. I agree with you that we should make this change. Can we do so in a separate commit though? That way, we can better pinpoint any fallout from it.
pkg/sql/conn_fsm.go, line 506 at r6 (raw file):
ts := args.Extended.(*txnState)
ts.mu.Lock()
ts.mu.txn.PrepareForRetry(ts.Ctx)
I don't think it needs to hold up this PR, but one thing in here that I'd be interested to get @andreimatei's take on is whether it is more appropriate for these calls to PrepareForRetry to live in these fsm Action functions or whether they should be moved into txnStateTransitionsApplyWrapper and handled on the txnRestart event. I don't think I ever understood what belongs where and why all connExecutor state updates weren't handled directly in an fsm Action.
pkg/kv/db_test.go, line 17 at r6 (raw file):
"context"
"fmt"
"github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord"
nit: move this into the grouping below. crlfmt should do this automatically, so I wonder whether you have that set up: https://cockroachlabs.atlassian.net/wiki/spaces/ENG/pages/154206209/Goland+Tips+and+Tricks#Enable-crlfmt-Watcher. If not, I'd recommend it.
pkg/kv/db_test.go, line 802 at r6 (raw file):
}
// Verify the txn sees a retryable error without using the handle: Normally the
This might be a more appropriate test for pkg/kv/kvclient/kvcoord/txn_coord_sender_test.go.
pkg/kv/db_test.go, line 810 at r6 (raw file):
// handle stays in txnPending then this means we do not have the error. This
// error is needed for resetting the handle in PrepareForRetry.
func TestDB_PrepareForRetryAfterHeartbeatFailure(t *testing.T) {
How long does this test take to run? At least 1 second, right? If you replace setup with a call to serverutils.StartServer directly, you should be able to configure a shorter txn heartbeat interval?
pkg/kv/db_test.go, line 833 at r6 (raw file):
// This is here because we want to call maybeRejectClientLocked, which will
// poison the handle if there is a heartbeat failure.
tc.GetLeafTxnFinalState(ctx, kv.OnlyPending)
nit: _, _ = tc.GetLeafTxnFinalState(ctx, kv.OnlyPending) to show that you're intentionally ignoring the result. CI might actually be complaining about this already.
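For readers unfamiliar with the idiom in this nit, here is a minimal sketch of calling a function purely for its side effect and discarding both results explicitly. The toyCoord type and its finalState method are hypothetical stand-ins invented for illustration; they are not the real TxnCoordSender or GetLeafTxnFinalState.

```go
package main

import (
	"errors"
	"fmt"
)

// toyCoord is a hypothetical coordinator, used only to illustrate the idiom.
type toyCoord struct {
	heartbeatFailed bool
	poisoned        bool
}

// finalState returns a (state, error) pair, but it also has a side effect:
// it poisons the handle when the heartbeat has already failed.
func (c *toyCoord) finalState() (string, error) {
	if c.heartbeatFailed {
		c.poisoned = true
		return "", errors.New("txn aborted by heartbeat failure")
	}
	return "pending", nil
}

func main() {
	c := &toyCoord{heartbeatFailed: true}

	// The call is made only for its side effect. Assigning both results to
	// the blank identifier makes it visible that they are ignored on purpose.
	_, _ = c.finalState()

	fmt.Println("poisoned:", c.poisoned)
}
```

The explicit `_, _ =` form changes nothing at runtime; it only documents intent for reviewers and for linters that look for unchecked errors.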
pkg/kv/db_test.go, line 841 at r6 (raw file):
	return errors.New("the handle is not poisoned yet")
}
return nil
Should we assert on the contents of the error after we've checked that it's non-nil?
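The closure quoted in this thread returns an error until the handle is poisoned, which is the shape a SucceedsSoon-style loop expects. As a rough sketch of that pattern only, the helper below (succeedsWithin, a hypothetical name, not the real test utility) polls a function until it returns nil or a deadline passes.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// succeedsWithin is a hypothetical polling helper: it calls fn repeatedly
// until fn returns nil, and gives up once the timeout has elapsed.
func succeedsWithin(timeout, interval time.Duration, fn func() error) error {
	deadline := time.Now().Add(timeout)
	for {
		err := fn()
		if err == nil {
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("condition not met within %s: %w", timeout, err)
		}
		time.Sleep(interval)
	}
}

// pollDemo simulates a handle that becomes poisoned on the third check and
// reports how many checks were needed.
func pollDemo() (int, error) {
	calls := 0
	err := succeedsWithin(time.Second, time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return errors.New("the handle is not poisoned yet")
		}
		return nil
	})
	return calls, err
}

// pollTimeoutDemo shows the failure path: the condition never holds.
func pollTimeoutDemo() error {
	return succeedsWithin(5*time.Millisecond, time.Millisecond, func() error {
		return errors.New("never ready")
	})
}

func main() {
	calls, err := pollDemo()
	fmt.Println(calls, err)
	fmt.Println(pollTimeoutDemo() != nil)
}
```

Once the loop has returned nil, the test can go on to assert on the stored error itself, which is what the review comment asks for.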
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei, @lidorcarmel, @nvanbenschoten, and @yuzefovich)
pkg/kv/kvclient/kvcoord/txn_coord_sender.go, line 731 at r6 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
LeafTxns don't use txnHeartbeaters, so I don't see how this case could be possible. I think I just missed that when this was last touched in 1a84480. Do you mind instrumenting it with a fatal and seeing whether we ever hit it in unit tests? If not, I think we should remove it.
Tested locally with some unit tests, and also pushed to test in CI.
Removing this in a separate commit (I'm getting used to thinking about commits!).
CI is still running, if it fails I'll revert.
pkg/kv/kvclient/kvcoord/txn_coord_sender.go, line 737 at r6 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
This is a good catch. I agree with you that we should make this change. Can we do so in a separate commit though? That way, we can better pinpoint any fallout from it.
Done.
pkg/sql/conn_fsm.go, line 377 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
These changes look very good. Thank you for making them.
Thank you!
pkg/sql/conn_fsm.go, line 506 at r6 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I don't think it needs to hold up this PR, but one thing in here that I'd be interested to get @andreimatei's take on is whether it is more appropriate for these calls to PrepareForRetry to live in these fsm Action functions or whether they should be moved into txnStateTransitionsApplyWrapper and handled on the txnRestart event. I don't think I ever understood what belongs where and why all connExecutor state updates weren't handled directly in an fsm Action.
Looks like we're good here, verified with Andrei, conn_fsm should manage the txn object.
pkg/kv/db_test.go, line 17 at r6 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: move this into the grouping below. crlfmt should do this automatically, so I wonder whether you have that set up: https://cockroachlabs.atlassian.net/wiki/spaces/ENG/pages/154206209/Goland+Tips+and+Tricks#Enable-crlfmt-Watcher. If not, I'd recommend it.
Done. Strange! I had crlfmt.. and I'm sure it was working until recently but now it was disabled.... mmm.
pkg/kv/db_test.go, line 802 at r6 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
This might be a more appropriate test for pkg/kv/kvclient/kvcoord/txn_coord_sender_test.go.
Done.
pkg/kv/db_test.go, line 810 at r6 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
How long does this test take to run? At least 1 second, right? If you replace setup with a call to serverutils.StartServer directly, you should be able to configure a shorter txn heartbeat interval?
Done.
pkg/kv/db_test.go, line 833 at r6 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: _, _ = tc.GetLeafTxnFinalState(ctx, kv.OnlyPending) to show that you're intentionally ignoring the result. CI might actually be complaining about this already.
Done.
No idea why it didn't complain :(
Code quote (from pkg/kv/kvclient/kvcoord/txn_coord_sender.go):
newTxn := roachpb.PrepareTransactionForRetry(
ctx, abortedErr, roachpb.NormalUserPriority, tc.clock)
return roachpb.NewError(roachpb.NewTransactionRetryWithProtoRefreshError(
abortedErr.String(), tc.mu.txn.ID, newTxn))
pkg/kv/db_test.go, line 841 at r6 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Should we assert on the contents of the error after we've checked that it's non-nil?
I added require.Equal(t, txn.ID(), pErr.TxnID). I'm happy to add more.
pkg/kv/kvclient/kvcoord/txn_coord_sender_test.go, line 323 at r8 (raw file):
// happens due to a heartbeat failure. On the other hand
// GetLeafTxnFinalState() calls maybeRejectClientLocked() without trying a
// Send().
Please take a look at this comment Nathan, I forgot this when we talked - this is why Get() doesn't work here. We can do a sleep and then Get() but this is cleaner.
This LGTM, although I think we can clean up the test just a bit further!
Reviewed 13 of 13 files at r7, 12 of 13 files at r8, 1 of 1 files at r9, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei, @lidorcarmel, @nvanbenschoten, and @yuzefovich)
pkg/kv/kvclient/kvcoord/txn_coord_sender_test.go, line 272 at r8 (raw file):
// Create a DB with a short heartbeat interval.
s := createTestDBWithKnobs(t, &kvserver.StoreTestingKnobs{
	DisableScanner: true,
Do we need these three knobs?
pkg/kv/kvclient/kvcoord/txn_coord_sender_test.go, line 323 at r8 (raw file):
Previously, lidorcarmel (Lidor Carmel) wrote…
Please take a look at this comment Nathan, I forgot this when we talked - this is why Get() doesn't work here. We can do a sleep and then Get() but this is cleaner.
Interesting. I wasn't expecting the Get to notice that the txn was aborted and return an error. This must be because we check the abort span during the read, which is populated on the target range when the high-priority txn aborts our intent. See WriteAbortSpanOnResolve and the pkg/kv/kvserver/abortspan package for a bit of an explanation. The high-level role of the "abort span" is to prevent zombie transactions from failing to read their own writes after those writes have been aborted.
One way to get around this is to send the put and the get to different ranges. For instance, you could perform a split on key "b" (keyB := roachpb.Key("b")) using s.DB.AdminSplit and then perform Get("b") requests in this SucceedsSoon loop. What do you think about trying that out?
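To make the reasoning above concrete, here is a toy model of the idea, assuming only what the comment states: each range keeps its own abort span, aborting an intent records the txn on that range, and reads consult the abort span of the range they land on. Every name here (toyCluster, toyRange, abortIntent, get) is hypothetical, and the model deliberately ignores what a real split does with existing abort span entries.

```go
package main

import (
	"errors"
	"fmt"
)

var errTxnAborted = errors.New("txn found in abort span")

// toyRange is a hypothetical range: it owns keys >= start (up to the next
// range's start) and keeps its own abort span.
type toyRange struct {
	start     string
	abortSpan map[string]bool // txn ID -> aborted on this range
}

// toyCluster is a hypothetical keyspace made of toy ranges.
type toyCluster struct {
	ranges []*toyRange
}

// rangeFor returns the range with the greatest start key <= key.
func (c *toyCluster) rangeFor(key string) *toyRange {
	var best *toyRange
	for _, r := range c.ranges {
		if r.start <= key && (best == nil || r.start > best.start) {
			best = r
		}
	}
	return best
}

// split adds a new range starting at key, with an empty abort span.
func (c *toyCluster) split(key string) {
	c.ranges = append(c.ranges, &toyRange{start: key, abortSpan: map[string]bool{}})
}

// abortIntent models a high-priority txn aborting an intent at key: the
// victim's ID is recorded in the abort span of the range holding key.
func (c *toyCluster) abortIntent(key, txnID string) {
	c.rangeFor(key).abortSpan[txnID] = true
}

// get models a read: it fails only if the reading txn appears in the abort
// span of the range that holds key.
func (c *toyCluster) get(key, txnID string) error {
	if c.rangeFor(key).abortSpan[txnID] {
		return errTxnAborted
	}
	return nil
}

// abortSpanDemo writes on "a", splits at "b", aborts the intent on "a", and
// then reads from both ranges as the aborted txn.
func abortSpanDemo() (sameRange, otherRange error) {
	c := &toyCluster{ranges: []*toyRange{{start: "", abortSpan: map[string]bool{}}}}
	c.split("b")
	c.abortIntent("a", "txn1")
	return c.get("a", "txn1"), c.get("b", "txn1")
}

func main() {
	same, other := abortSpanDemo()
	fmt.Println("read on the intent's range:", same)
	fmt.Println("read on the other range:", other)
}
```

In this sketch the read of "b" does not notice the abort, which is why sending the Get to a different range leaves the heartbeat failure as the only way for the handle to learn about it.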
pkg/kv/kvclient/kvcoord/txn_coord_sender_test.go, line 336 at r8 (raw file):
pErr := tc.GetTxnRetryableErr(ctx)
require.NotNil(t, pErr)
require.Equal(t, txn.ID(), pErr.TxnID)
Should we assert the error type here as well?
yep, thanks, PTAL.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei, @nvanbenschoten, and @yuzefovich)
pkg/kv/kvclient/kvcoord/txn_coord_sender_test.go, line 272 at r8 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Do we need these three knobs?
No, removed.
pkg/kv/kvclient/kvcoord/txn_coord_sender_test.go, line 323 at r8 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Interesting. I wasn't expecting the Get to notice that the txn was aborted and return an error. This must be because we check the abort span during the read, which is populated on the target range when the high-priority txn aborts our intent. See WriteAbortSpanOnResolve and the pkg/kv/kvserver/abortspan package for a bit of an explanation. The high-level role of the "abort span" is to prevent zombie transactions from failing to read their own writes after those writes have been aborted.
One way to get around this is to send the put and the get to different ranges. For instance, you could perform a split on key "b" (keyB := roachpb.Key("b")) using s.DB.AdminSplit and then perform Get("b") requests in this SucceedsSoon loop. What do you think about trying that out?
Done, works like a charm.
I did like my "creative" approach but this is probably better, thanks for the details!
pkg/kv/kvclient/kvcoord/txn_coord_sender_test.go, line 336 at r8 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Should we assert the error type here as well?
so, GetTxnRetryableErr() returns *roachpb.TransactionRetryWithProtoRefreshError, not "error", so the type will always be fine, right? or maybe I'm missing something..
feel free to merge if/when you feel good about the last two comments. No need for another round of reviews.
Reviewed all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei, @lidorcarmel, @nvanbenschoten, and @yuzefovich)
pkg/kv/kvclient/kvcoord/txn_coord_sender_test.go, line 336 at r8 (raw file):
Previously, lidorcarmel (Lidor Carmel) wrote…
so, GetTxnRetryableErr() returns *roachpb.TransactionRetryWithProtoRefreshError, not "error", so the type will always be fine, right? or maybe I'm missing something..
Ah, ok, then maybe just add an assertion that pErr.Transaction.ID is not equal to pErr.TxnID, which indicates an aborted txn.
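A small sketch of the check suggested here, using hypothetical stand-in types that carry only the two fields named in the comment: TxnID is the txn that hit the error, and Transaction is the txn to use for the retry. If the retry has to run under a different ID, the previous txn was aborted rather than restarted.

```go
package main

import "fmt"

// toyTxnProto is a hypothetical, minimal transaction record.
type toyTxnProto struct {
	ID    string
	Epoch int
}

// toyRetryErr is a hypothetical stand-in for the retryable error: it keeps
// only the two fields discussed in the review.
type toyRetryErr struct {
	TxnID       string      // ID of the txn that encountered the error
	Transaction toyTxnProto // txn to use for the next attempt
}

// prevTxnAborted reports whether the retry must run as a brand new txn,
// which is the case when the two IDs differ.
func (e *toyRetryErr) prevTxnAborted() bool {
	return e.Transaction.ID != e.TxnID
}

func main() {
	// Same ID, higher epoch: the txn restarts in place.
	restart := &toyRetryErr{TxnID: "t1", Transaction: toyTxnProto{ID: "t1", Epoch: 1}}
	// Different ID: the old txn was aborted and a new one replaces it.
	aborted := &toyRetryErr{TxnID: "t1", Transaction: toyTxnProto{ID: "t2", Epoch: 0}}

	fmt.Println(restart.prevTxnAborted(), aborted.prevTxnAborted())
}
```

In the test under review the equivalent assertion would be an inequality between pErr.Transaction.ID and pErr.TxnID, next to the existing equality check on txn.ID().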
pkg/kv/kvclient/kvcoord/txn_coord_sender_test.go, line 300 at r10 (raw file):
// Note that if we read from the same range then we will cause the txn to be
// poisoned immediately (Send() will fail), even before the heartbeat failure,
// which is not the case we want to test here.
Let's mention the abort span here to explain why we need to send the Get to a different range.
Thanks! done.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei, @nvanbenschoten, and @yuzefovich)
pkg/kv/kvclient/kvcoord/txn_coord_sender_test.go, line 336 at r8 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Ah, ok, then maybe just add an assertion that pErr.Transaction.ID is not equal to pErr.TxnID, which indicates an aborted txn.
Done.
pkg/kv/kvclient/kvcoord/txn_coord_sender_test.go, line 300 at r10 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Let's mention the abort span here to explain why we need to send the Get to a different range.
Done.
Previously, when the transaction saw a retryable error from a heartbeat failure, we did work similar to handleRetryableErrLocked() instead of calling it. It's better to call the existing handleRetryableErrLocked() so that those errors are handled in one place. Release note: None
Previously kv users could lose parts of a transaction without getting an error. After Send() returned a retryable error the state of txn got reset which made it usable again. If the caller ignored the error they could continue applying more operations without realizing the first part of the transaction was discarded. See more details in the issue (cockroachdb#22615).
The simple case example is where the retryable closure of DB.Txn() returns nil instead of returning the retryable error back to the retry loop - in this case the retry loop declares success without realizing we lost the first part of the transaction (all the operations before the retryable error).
This PR leaves the txn in a "poisoned" state after encountering an error, so that all future operations fail fast. The caller is therefore expected to reset the txn handle back to a usable state intentionally, by calling Txn.PrepareForRetry(). In the simple case of DB.Txn() the retry loop will reset the handle and run the retry even if the callback returned nil.
Closes cockroachdb#22615
Release note: None
LeafTxns don't use heartbeats, therefore we should never hit this condition. Release note: None
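The behaviour described in these commit messages can be sketched with a toy model. This is only an illustration under stated assumptions: toyTxn, its Put method, the failNext hook, and runTxn are all hypothetical and much simpler than kv.Txn and DB.Txn(); only the name PrepareForRetry is borrowed from the PR. The sketch shows a handle that stays poisoned after a retryable error, and a retry loop that notices the poison even when the closure swallows the error and returns nil.

```go
package main

import (
	"errors"
	"fmt"
)

// errRetryable is a hypothetical stand-in for a retryable KV error.
var errRetryable = errors.New("retryable txn error")

// toyTxn is a hypothetical, heavily simplified transaction handle.
type toyTxn struct {
	epoch    int   // incremented on every PrepareForRetry
	writes   int   // writes accepted in the current epoch
	poisoned error // set by a retryable error, cleared by PrepareForRetry
	failNext bool  // demo hook: the next Put hits a retryable error
}

// Put fails fast while the handle is poisoned, so later operations cannot
// silently continue after the first part of the txn was lost.
func (t *toyTxn) Put() error {
	if t.poisoned != nil {
		return t.poisoned
	}
	if t.failNext {
		t.failNext = false
		t.poisoned = errRetryable
		return errRetryable
	}
	t.writes++
	return nil
}

// PrepareForRetry is the explicit reset the caller must perform.
func (t *toyTxn) PrepareForRetry() {
	t.epoch++
	t.writes = 0
	t.poisoned = nil
}

// runTxn is a toy retry loop. It retries when the closure returns the
// retryable error and also when the closure returned nil but left the
// handle poisoned. It returns the number of attempts made.
func runTxn(t *toyTxn, fn func(*toyTxn) error) (int, error) {
	for attempts := 1; ; attempts++ {
		err := fn(t)
		if err == nil && t.poisoned != nil {
			err = t.poisoned // the closure swallowed the retryable error
		}
		if !errors.Is(err, errRetryable) {
			return attempts, err
		}
		t.PrepareForRetry()
	}
}

// poisonDemo runs a closure that ignores every error and always returns nil.
func poisonDemo() (attempts, writes, epoch int, err error) {
	txn := &toyTxn{failNext: true}
	attempts, err = runTxn(txn, func(t *toyTxn) error {
		_ = t.Put() // first attempt: retryable error, ignored by the caller
		_ = t.Put() // first attempt: rejected because the handle is poisoned
		return nil
	})
	return attempts, txn.writes, txn.epoch, err
}

func main() {
	attempts, writes, epoch, err := poisonDemo()
	fmt.Println(attempts, writes, epoch, err)
}
```

In this sketch the closure runs twice: the first attempt leaves the handle poisoned, the loop resets it, and the second attempt applies both writes in the new epoch. Without the poison check the loop would have declared success after the first attempt with the writes lost.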
TFTR Nathan!
Build succeeded.