-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kv,storage: add support for child transactions #56588
base: master
Are you sure you want to change the base?
Conversation
91c9080
to
03f7a56
Compare
d771dd9
to
fd6b310
Compare
@nvanbenschoten I still haven't added the flag to disallow interactions between children and their ancestor intents but I'd love if you could give this a pass as I think there's some controversial stuff. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking good. Sorry for the long delay on the review - it got lost in my inbox.
What were the controversial parts that you wanted to discuss further? Did I catch them in the review or were there others you'd also like to raise and discuss?
Reviewed 5 of 5 files at r1, 5 of 5 files at r2, 6 of 6 files at r3, 3 of 3 files at r4, 4 of 4 files at r5, 4 of 4 files at r6, 6 of 6 files at r7.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @nvanbenschoten)
pkg/kv/sender.go, line 302 at r7 (raw file):
// transaction and must be committed. If the set of read spans // of the transaction overlaps with the writes of the child (as accounted // for in InFlightWrites and LockSpans), an error will be returned.
Oh, that's interesting. So we don't even want to support cases where the parent has already read keys that the child ends up writing? If so, does that simplify the lock table handling?
pkg/kv/sender.go, line 303 at r7 (raw file):
// of the transaction overlaps with the writes of the child (as accounted // for in InFlightWrites and LockSpans), an error will be returned. ForwardToChild(ctx context.Context, child *roachpb.Transaction) error
My immediate reaction to this signature is "why isn't this ForwardTimestampAndRefresh
?"
pkg/kv/txn.go, line 1290 at r3 (raw file):
} // ChildTxn executes retryable in the context of a distributed transaction. The
We'll want to say more here about the interaction between the parent txn and the child txn. What happens when the child txn reads a key that the parent txn wrote to? What happens when it writes to the same key as the parent txn? Similarly, what are the answers to those questions in reverse, once the child commits and the parent begins operating again?
pkg/kv/txn.go, line 1314 at r3 (raw file):
// TODO(ajwerner): Validate the state of the parent transaction. It should be // open (non-terminal), and a root transaction at the very least.
Yeah, this seems like a good idea.
pkg/kv/txn.go, line 1317 at r3 (raw file):
kvTxn := txn.Sender().NewChildTransaction() childTxn := NewTxnFromProto(ctx, txn.db, txn.gatewayNodeID, kvTxn.ReadTimestamp, RootTxn, kvTxn)
I'm not quite sure what's going on with txn.mu.userPriority
in this method, but I'm not sure it's going to behave correctly if kvTxn
has a min or max priority.
pkg/kv/txn.go, line 1324 at r3 (raw file):
if paErr := (*roachpb.ParentAbortedError)(nil); errors.As(err, &paErr) && paErr.ParentTxn.ID == txn.ID() { return roachpb.NewTransactionRetryWithProtoRefreshError(
Do we need to call txn.mu.sender.ManualRestart
or GenerateForcedRetryableError
to ensure that the parent txn's TxnCoordSender
itself is restarted?
pkg/kv/txn.go, line 1332 at r3 (raw file):
// a higher-level txn to be retried. We don't do this in any of the other // functions in DB; I guess we should. if errors.HasType(err, (*roachpb.TransactionRetryWithProtoRefreshError)(nil)) {
Shouldn't the child transaction have retried in childTxn.exec
if it saw one of these errors?
Also, is errors.Wrapf
enough to terminate the error? Won't errors.HasType
still find the error cause?
pkg/kv/txn.go, line 1348 at r7 (raw file):
// the child. This will also detect whether the child illegally wrote over // the parent's reads. return txn.Sender().ForwardToChild(ctx, childTxn.TestingCloneTxn())
nit: we shouldn't be using TestingCloneTxn
in non-testing code.
pkg/kv/txn_external_test.go, line 229 at r3 (raw file):
// that requests issued in a child transaction carry the TxnMeta of the // parent. func TestChildTransactionMetadataPropagates(t *testing.T) {
Do we need another test for various retry conditions around child and parent txns?
pkg/kv/txn_external_test.go, line 362 at r4 (raw file):
// // Without any transaction timeouts, the two possible outcomes should be // either that A and its child success or that B and its child success.
s/success/succeed/
pkg/kv/txn_external_test.go, line 365 at r4 (raw file):
// // There are some extreme scenarios which can happen under transaction // expiration whereby A ends up getting aborted because of a timeout and A1
What kind of timeout?
pkg/kv/txn_external_test.go, line 369 at r4 (raw file):
// could happen for both of them. We do not anticipate or allow these // scenarios. if (errA == nil && errB == nil) || (errA != nil && errB != nil) {
nit: (errA == nil) == (errB == nil)
pkg/kv/kvclient/kvcoord/txn_coord_sender.go, line 1153 at r3 (raw file):
// TODO(ajwerner): Consider stripping information from the parent proto like // LockSpans or perhaps introducing a new proto altogether for parents. The // latter is likely to be onerous.
A TxnMeta
doesn't work?
pkg/kv/kvclient/kvcoord/txn_coord_sender.go, line 1172 at r3 (raw file):
// deadlock. It seems that priority levels should only increase and thus // we shouldn't have these scenarios so long as the priority starts at at // least that of the parent.
I don't quite understand this comment. Is it assuming that priorities immediately resolve pushes one way or another? That isn't really the case – they mostly just impact who gets aborted once a deadlock has been detected.
pkg/kv/kvclient/kvcoord/txn_coord_sender.go, line 1219 at r7 (raw file):
for _, readSpan := range tc.interceptorAlloc.refreshFootprint.asSlice() { if writes.Overlaps(readSpan.AsRange()) { return errors.New("cannot forward provisional commit timestamp " +
If we refreshed the parent immediately to the new write timestamp, could we detect conflicts without all this?
Or does that not work for cases where the parent and child txns have the same timestamp? Actually, maybe that's fine for the same reason we can short circuit on child.WriteTimestamp.LessEq(tc.mu.txn.ReadTimestamp)
.
pkg/kv/kvserver/concurrency/concurrency_manager.go, line 268 at r6 (raw file):
log.Fatalf(ctx, "%v", err) } for cur := g.Req.Txn; cur != nil && parentIntentError == nil; cur = cur.Parent {
Do we need this if we detect the conflict in the lock table? I guess we do in cases where the lock table is disabled (!added
)?
If that's so, then should we detect this before even calling AddDiscoveredLock
?
pkg/kv/kvserver/concurrency/lock_table.go, line 432 at r6 (raw file):
// the system. // // case | parent | child
This is pretty interesting - and a little scary. How many of these combinations do we actually need to support? The first thing that jumps out to me is that this should probably be talking in terms of locking strength, not locking durability. It's understandable that you didn't do this because we only currently support a single locking strength.
Once we do that, I wonder if we can simplify this a little by saying that a lock held by a parent txn also implies that the lock is held by the child txn, up to the locking strength. So a child transaction only needs to acquire a lock that conflicts with one that its parent acquired if it needs a stronger lock.
Viewing this problem as one of lock promotion is nice because it pretty clearly raises the question of what happens when the parent shares a SHARED lock with another txn and then the child tries to upgrade to an exclusive lock. This creates the deadlock scenario that you've been thinking through.
It also raises the question of what to do if the parent already holds an exclusive lock (all four of these cases). In such cases, do we need to support the child being able to perform a write? It sounds like we don't for replicated/replicated. Do we need to for any other case?
pkg/kv/kvserver/concurrency/lock_table_test.go, line 50 at r6 (raw file):
Creates a lockTable. The lockTable is initially enabled. new-txn txn=<name> ts=<int>[,<int>] epoch=<int> [seq=<int>]
Add a comment about the new new-child-txn
directive.
pkg/kv/kvserver/concurrency/lock_table_test.go, line 154 at r6 (raw file):
guardsByReqName = make(map[string]lockTableGuard) return "" case "new-child-txn":
nit: drop this below new-txn
pkg/kv/kvserver/concurrency/testdata/lock_table/child_txn, line 2 at r6 (raw file):
# Test that we do not block on a parent's intent. # Test that we do not block on a parent's read lock.
s/read/unreplicated exclusive/
pkg/kv/kvserver/concurrency/testdata/lock_table/child_txn, line 3 at r6 (raw file):
# Test that we do not block on a parent's intent. # Test that we do not block on a parent's read lock. # TODO(ajwerner): We need to test some of the more pernicious issues.
👍
pkg/kv/kvserver/txnwait/queue.go, line 140 at r4 (raw file):
set := map[uuid.UUID]struct{}{} for _, push := range pt.waitingPushes { for cur := &push.req.PusherTxn; cur != nil; cur = cur.Parent {
Give this line a comment. It's pretty subtle.
pkg/kv/kvserver/txnwait/queue.go, line 484 at r4 (raw file):
readyCh = make(chan struct{}, 1) queryPusherCh, queryPusherErrCh = q.startQueryPusherTxn(ctx, push, readyCh) // Ensure that the pusher querying goroutine is complete at exit.
Let's keep/adapt this comment.
pkg/kv/kvserver/txnwait/queue.go, line 500 at r4 (raw file):
for i, cur := 0, &req.PusherTxn; cur != nil; cur = cur.Parent { if cur.ID == uuid.Nil || !cur.IsLocking() {
nit: consider pulling this into a helper and sharing that with getNumPusherTxnsToTrack
.
pkg/kv/kvserver/txnwait/queue.go, line 653 at r4 (raw file):
// not the immediate parent? Ideally we'd abort all of the parents. // I think this will work so long as we propagate this error and // only convert it to a retry at the relevant parent.
Yeah, that's how I figured this would work too. But we'll need to make sure of it in the client.
pkg/kv/kvserver/txnwait/queue.go, line 683 at r4 (raw file):
// TODO(ajwerner): What should be the semantics here? Why would we release // if we're going to forcePush ourselves? Should we release if it isn't // the pusher which is updated but rather some parent?
I think we still want to release if it isn't the pusher which is updated, because we may still have noticed new dependents.
pkg/kv/kvserver/txnwait/queue.go, line 705 at r4 (raw file):
} // Signal the pusher query txn loop to continue. // TODO(ajwerner): Do we need all of these to be notified?
It doesn't look like it. Only the goroutine that sent on queryPusherCh
needs to be notified.
pkg/roachpb/data.proto, line 423 at r1 (raw file):
// // TODO(ajwerner): come back and describe why we want a TxnMeta here as it // pertains to read visibility of the parent.
Do we want a TxnMeta
or a Transaction
?
pkg/roachpb/errors.proto, line 482 at r2 (raw file):
// behavior. message ParentAbortedError { optional roachpb.Transaction parent_txn = 1 [(gogoproto.nullable) = false];
Is this the direct parent or the ancestor that was aborted? If the latter (which I suspect it is so that we can catch the retry error at the right level), should we rename this to AncestorAbortedError
? Or at least spell this all out?
pkg/storage/mvcc_history_test.go, line 1007 at r5 (raw file):
} func (e *evalCtx) newChildTxn(
nit: move this below newTxn
pkg/storage/pebble_mvcc_scanner.go, line 471 at r5 (raw file):
if epoch < p.meta.Txn.Epoch { // 12. We're reading our own txn's intent but the current txn has
12
?
pkg/storage/pebble_mvcc_scanner.go, line 489 at r5 (raw file):
// ownsIntent return true if the current intent is owned by the reading // transaction or one of its parents. If it is owned by the reading transaction,
The second sentence makes it sound like the return values are only set if the intent is owned by the reading transaction and not one of its parents. Consider updating this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for taking a look. Generally agree with your testing comments. Also agree with being more explicit about some properties in comments. Curious to explore some locking upgrade concepts to make that code clearer.
I'll come back and take another pass on the code after we discuss some of the things here.
I think the most controversial things were:
- The protobuf design - using the transaction directly and chaining them
- The forwarding APIs - it started as simply as you imagined but being able to detect the difference between a true refresh failure and a real problem due to parent-child conflicts drove me to this API
- The various error types and handling in:
- lock table when encountering a parent's write in a write-write conflict (or read lock-write conflict I suppose)
- When the child detects an aborted ancestor in the txnwait.Queue
Another note is that we've come up with a case where we think we're going to want to use these child transactions in 21.1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @nvanbenschoten)
pkg/kv/sender.go, line 302 at r7 (raw file):
If the parent read the keys and then the child writes, it seems that the parent should be doomed to fail. In particular, we want to make sure that the parent never causally precedes the child by the time it commits. If the parent read, then the child wrote, the child will get pushed above the parent because of the timestamp cache. That will force the parent to refresh here leading to certain failure.
If so, does that simplify the lock table handling?
What do you mean?
pkg/kv/sender.go, line 303 at r7 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
My immediate reaction to this signature is "why isn't this
ForwardTimestampAndRefresh
?"
It originally was however there's a funny cases I ran into making sure we can detect bad client behavior.
pkg/kv/txn.go, line 1317 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I'm not quite sure what's going on with
txn.mu.userPriority
in this method, but I'm not sure it's going to behave correctly ifkvTxn
has a min or max priority.
The child needs to have priority at least as high as its parent. Otherwise there are some funny live-lock cases that can arise where children get aborted but parents don't and these things end up looping.
pkg/kv/txn.go, line 1324 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Do we need to call
txn.mu.sender.ManualRestart
orGenerateForcedRetryableError
to ensure that the parent txn'sTxnCoordSender
itself is restarted?
ManualRestart I think makes sense. It felt like the SQL layer was dealing with this properly but I may just be misunderstanding how this should work.
pkg/kv/txn.go, line 1332 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Shouldn't the child transaction have retried in
childTxn.exec
if it saw one of these errors?Also, is
errors.Wrapf
enough to terminate the error? Won'terrors.HasType
still find the error cause?
I agree this is a problem. I don't really know what error should be returned here. I don't think the parent can handle this.
pkg/kv/txn.go, line 1348 at r7 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: we shouldn't be using
TestingCloneTxn
in non-testing code.
Agreed, I think I want to make it non-testing. Any issues with that or alternative suggestions?
pkg/kv/txn_external_test.go, line 229 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Do we need another test for various retry conditions around child and parent txns?
Surely
pkg/kv/txn_external_test.go, line 365 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
What kind of timeout?
expiration like because heartbeats aren't working
pkg/kv/kvclient/kvcoord/txn_coord_sender.go, line 1153 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
A
TxnMeta
doesn't work?
Well, for one, we need the ignored sequence numbers and we need to propagate the entire chain of ancestors. My other concern was that it would make the code that walks the ancestor chain messier.
pkg/kv/kvclient/kvcoord/txn_coord_sender.go, line 1172 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I don't quite understand this comment. Is it assuming that priorities immediately resolve pushes one way or another? That isn't really the case – they mostly just impact who gets aborted once a deadlock has been detected.
Correct, this is entirely in the context of deadlocks. Nevertheless, imagine the above scenario. When deciding to break a deadlock, it's the pusher that makes the determination to do the force push. If both pushers have lower priorities than the transactions which they are pushing, then there's a problem, right?
pkg/kv/kvclient/kvcoord/txn_coord_sender.go, line 1219 at r7 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
If we refreshed the parent immediately to the new write timestamp, could we detect conflicts without all this?
Or does that not work for cases where the parent and child txns have the same timestamp? Actually, maybe that's fine for the same reason we can short circuit on
child.WriteTimestamp.LessEq(tc.mu.txn.ReadTimestamp)
.
I originally just pushed the parent's write timestamp. That will eventually force a refresh. The problem is that if that refresh fails, we won't know whether that's a legitimate serializable restart or a programming bug (this case). Because we can't know which transaction caused the refresh failure, we can't just refresh unless we want all refreshes due to child pushes always fail, which seems unacceptable.
A test that I wrote ran into this where the parent transaction just retried forever when a child had invalidated one of its reads.
One way out of this would be if we forced parents to run in a pessimistic mode with replicated locks on all reads. Then we could detect any such conflict before it happened rather than after. Of course, that doesn't seem like something that's going to happen.
pkg/kv/kvserver/concurrency/concurrency_manager.go, line 268 at r6 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Do we need this if we detect the conflict in the lock table? I guess we do in cases where the lock table is disabled (
!added
)?If that's so, then should we detect this before even calling
AddDiscoveredLock
?
I feel like I'm missing something here.
pkg/kv/kvserver/concurrency/lock_table.go, line 432 at r6 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
This is pretty interesting - and a little scary. How many of these combinations do we actually need to support? The first thing that jumps out to me is that this should probably be talking in terms of locking strength, not locking durability. It's understandable that you didn't do this because we only currently support a single locking strength.
Once we do that, I wonder if we can simplify this a little by saying that a lock held by a parent txn also implies that the lock is held by the child txn, up to the locking strength. So a child transaction only needs to acquire a lock that conflicts with one that its parent acquired if it needs a stronger lock.
Viewing this problem as one of lock promotion is nice because it pretty clearly raises the question of what happens when the parent shares a SHARED lock with another txn and then the child tries to upgrade to an exclusive lock. This creates the deadlock scenario that you've been thinking through.
It also raises the question of what to do if the parent already holds an exclusive lock (all four of these cases). In such cases, do we need to support the child being able to perform a write? It sounds like we don't for replicated/replicated. Do we need to for any other case?
Fundamentally I think if the parent holds an exclusive lock then we should have an assertion failure. I had a hard time understanding how to make that happen so instead I allowed it to happen and then forced the assertion failure in other places. I'd love to make this interaction more explicit and fix any problems here.
pkg/roachpb/data.proto, line 423 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Do we want a
TxnMeta
or aTransaction
?
Right, s/TxnMeta/Transaction/.
Discussed elsewhere but the issues are the ignored sequence numbers and also the chain of ancestors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @nvanbenschoten)
pkg/kv/kvclient/kvcoord/txn_coord_sender.go, line 1219 at r7 (raw file):
Previously, ajwerner wrote…
I originally just pushed the parent's write timestamp. That will eventually force a refresh. The problem is that if that refresh fails, we won't know whether that's a legitimate serializable restart or a programming bug (this case). Because we can't know which transaction caused the refresh failure, we can't just refresh unless we want all refreshes due to child pushes always fail, which seems unacceptable.
A test that I wrote ran into this where the parent transaction just retried forever when a child had invalidated one of its reads.
One way out of this would be if we forced parents to run in a pessimistic mode with replicated locks on all reads. Then we could detect any such conflict before it happened rather than after. Of course, that doesn't seem like something that's going to happen.
Alright, thinking a bit more about this, it seems like if the parent is forced to do a refresh before the child commits and proves that that succeeds, then it feels like that could just work. It might be the case that the child gets pushed even further between the refresh and its commit but that's okay because we'll have a proof that the child didn't invalidate any of the parent reads. This also would be better because then the child wouldn't have to commit first.
This would require:
- Building the machinery to force the parent to refresh eagerly
- Make sure to propagate a reasonable error when that refresh hits a child's intent
- Find a way to ensure that the EndTxn cannot happen before the parent does its stuff. Perhaps the way that happens is by having the child require that the read timestamp of the parent is at least as high as the timestamp of any intent it has written. I don't know how trivial this will be.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @nvanbenschoten)
pkg/kv/kvclient/kvcoord/txn_coord_sender.go, line 1219 at r7 (raw file):
Previously, ajwerner wrote…
Alright, thinking a bit more about this, it seems like if the parent is forced to do a refresh before the child commits and proves that that succeeds, then it feels like that could just work. It might be the case that the child gets pushed even further between the refresh and its commit but that's okay because we'll have a proof that the child didn't invalidate any of the parent reads. This also would be better because then the child wouldn't have to commit first.
This would require:
- Building the machinery to force the parent to refresh eagerly
- Make sure to propagate a reasonable error when that refresh hits a child's intent
- Find a way to ensure that the EndTxn cannot happen before the parent does its stuff. Perhaps the way that happens is by having the child require that the read timestamp of the parent is at least as high as the timestamp of any intent it has written. I don't know how trivial this will be.
If we do want to push and refresh prior to committing the child, we'll need a way to do that for all of the ancestors. Perhaps this implies a need for tigher coupling of children to parents inside of the txnCoordSender.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The protobuf design - using the transaction directly and chaining them
I'm 👍 on this.
The forwarding APIs - it started as simply as you imagined but being able to detect the difference between a true refresh failure and a real problem due to parent-child conflicts drove me to this API
Yeah, I'm still not clear what the best option for this is either. It's a tricky problem. I like the idea of using refreshing to detect the conflicts so that we don't need to rely on the client having a precise memory of parent reads and child writes, but getting that to work properly is tough.
The various error types and handling in:
lock table when encountering a parent's write in a write-write conflict (or read lock-write conflict I suppose)
Seems like this should just be an assertion failure, right?
When the child detects an aborted ancestor in the txnwait.Queue
I like the solution you've come up with here for this problem (ParentAbortedError
+ client-side unwinding).
Reviewed 1 of 1 files at r8, 3 of 3 files at r9.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @nvanbenschoten)
pkg/kv/sender.go, line 302 at r7 (raw file):
What do you mean?
This was in regard to the discussion we were having in lock_table.go
. I thought that we had to handle complex cases like Shared locks held by the parent being promoted to Exclusive locks by the child, but it sounds like we don't. In that case, we can throw an error in the lockTable whenever we find conflicts between ancestors, regardless of locking strength.
pkg/kv/txn.go, line 1317 at r3 (raw file):
Previously, ajwerner wrote…
The child needs to have priority at least as high as its parent. Otherwise there are some funny live-lock cases that can arise where children get aborted but parents don't and these things end up looping.
Right, so what's up with txn.mu.userPriority = roachpb.NormalUserPriority
in NewTxnFromProto
? Is that going to break things?
pkg/kv/txn.go, line 1348 at r7 (raw file):
Previously, ajwerner wrote…
Agreed, I think I want to make it non-testing. Any issues with that or alternative suggestions?
Nope, that SGTM.
pkg/kv/txn_external_test.go, line 365 at r4 (raw file):
Previously, ajwerner wrote…
expiration like because heartbeats aren't working
👍 maybe add that as an example here.
pkg/kv/kvclient/kvcoord/txn_coord_sender.go, line 1153 at r3 (raw file):
Previously, ajwerner wrote…
Well, for one, we need the ignored sequence numbers and we need to propagate the entire chain of ancestors. My other concern was that it would make the code that walks the ancestor chain messier.
Ack. I thought the ignored sequence numbers were in the TxnMeta, but it appears that they are not.
pkg/kv/kvclient/kvcoord/txn_coord_sender.go, line 1172 at r3 (raw file):
Previously, ajwerner wrote…
Correct, this is entirely in the context of deadlocks. Nevertheless, imagine the above scenario. When deciding to break a deadlock, it's the pusher that makes the determination to do the force push. If both pushers have lower priorities than the transactions which they are pushing, then there's a problem, right?
Yeah, I see what you're saying. So it is not safe for a child to have a lower priority than its parent. I guess that intuitively makes sense from a behavioral standpoint, but it's a good catch that this could actually cause a stall in deadlock resolution.
pkg/kv/kvclient/kvcoord/txn_coord_sender.go, line 1219 at r7 (raw file):
If we do want to push and refresh prior to committing the child, we'll need a way to do that for all of the ancestors.
We have this problem with the current approach as well, right? We're calling ForwardToChild
on the child's immediate parent, but we should really be doing this on all of the child's ancestors.
The idea of refreshing each of a child's ancestors before the child commits seems pretty complex, but your point about infinite retries is valid. Is the idea that refreshing earlier gives us the opportunity to detect conflicts while the child's writes are still intents (i.e. still contain its ID) and therefore can still be distinguished from other transactions' intents?
pkg/kv/kvserver/concurrency/concurrency_manager.go, line 268 at r6 (raw file):
Previously, ajwerner wrote…
I feel like I'm missing something here.
Me too. My thought was that if we detect the illegal conflict in the lock table already, why do so here as well? We just added the intent to the lock table and we're just about to go back and scan it again.
pkg/kv/kvserver/concurrency/lock_table.go, line 432 at r6 (raw file):
Previously, ajwerner wrote…
Fundamentally I think if the parent holds an exclusive lock then we should have an assertion failure. I had a hard time understanding how to make that happen so instead I allowed it to happen and then forced the assertion failure in other places. I'd love to make this interaction more explicit and fix any problems here.
Well all locks today are exclusive, regardless of whether they're replicated or not. I agree that we should try to make this interaction more explicit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the look. I'm thinking that the biggest sticking point now is the forwarding behavior. I've commented on that thread. I think we should rework it to do a preemptive refresh of all ancestors prior to the committing of the child. We'll still need to forward the immediate ancestor's read timestamp after, just in case the commit still gets pushed, but we don't need that to be preemptive. Any other thoughts before I move in that direction?
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @nvanbenschoten)
pkg/kv/kvclient/kvcoord/txn_coord_sender.go, line 1219 at r7 (raw file):
Yeah, that seems right. We currently aren't detecting issues where a grandchild invalidates a grand-parent's read. This seems like a hint that this deferred checking isn't going to cut it.
Is the idea that refreshing earlier gives us the opportunity to detect conflicts while the child's writes are still intents (i.e. still contain its ID) and therefore can still be distinguished from other transactions' intents?
Exactly. I agree it's pretty complex but it feels like the right thing to do. I'm okay with the child committing and the parent not, there's no way to avoid that. However, I'm not that keen on the possibility of the child committing and the parent not in the case of this egregious assertion failure. I'm starting to think that the right way to spin this is to do some coupling of the child TxnCoordSender to its parent.
Are you okay with my moving forward with the child TxnCoordSender
holding a reference to the parent and having an interceptor which grabs the EndTxn
and forces the parents to refresh? That will simplify the lifecycle handling in the kv
layer and ensure that any of these bogus conflicts are detected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any other thoughts before I move in that direction?
I don't think so. Let's move this PR in that direction and see how it looks.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @nvanbenschoten)
pkg/kv/kvclient/kvcoord/txn_coord_sender.go, line 1219 at r7 (raw file):
Previously, ajwerner wrote…
Yeah, that seems right. We currently aren't detecting issues where a grandchild invalidates a grand-parent's read. This seems like a hint that this deferred checking isn't going to cut it.
Is the idea that refreshing earlier gives us the opportunity to detect conflicts while the child's writes are still intents (i.e. still contain its ID) and therefore can still be distinguished from other transactions' intents?
Exactly. I agree it's pretty complex but it feels like the right thing to do. I'm okay with the child committing and the parent not, there's no way to avoid that. However, I'm not that keen on the possibility of the child committing and the parent not in the case of this egregious assertion failure. I'm starting to think that the right way to spin this is to do some coupling of the child TxnCoordSender to its parent.
Are you okay with my moving forward with the child
TxnCoordSender
holding a reference to the parent and having an interceptor which grabs theEndTxn
and forces the parents to refresh? That will simplify the lifecycle handling in thekv
layer and ensure that any of these bogus conflicts are detected.
So is the proposal that we'll augment RefreshRequest
and RefreshRangeRequest
with a new "invalid intent ID" field? Or introduce a structured error for it to return in cases where a conflicting value/intent is detected (with a way to distinguish between the two) and then detect the illegal conflicts in the client? All things considered, the second option seems cleaner to me.
Other than that, I think child TxnCoordSender
s holding a reference to their parent seems ok. Or at least, some slim handle to avoid the possibility of too much coupling.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so. Let's move this PR in that direction and see how it looks.
👍 let's see how it goes
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @nvanbenschoten)
pkg/kv/kvclient/kvcoord/txn_coord_sender.go, line 1219 at r7 (raw file):
Or introduce a structured error for it to return in cases where a conflicting value/intent is detected (with a way to distinguish between the two) and then detect the illegal conflicts in the client?
A structured error for sure.
Or at least, some slim handle to avoid the possibility of too much coupling.
Yep, I was thinking the child would hold onto an interface that allows it to preemptively push the parent and run a refresh.
ea2d440
to
18b6645
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nvanbenschoten I reworked the pushing bits to have the child push its ancestors. It's honestly pretty messy. I haven't yet gone back to deal with the lock table. I ended up augmenting the refresh requests to include the set of descendants. It's not terrible. I'd love to get your take on this as it is before making another pass. The interfaces are not making me very happy. You tend to have a way with these things.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @nvanbenschoten)
pkg/kv/txn.go, line 1317 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Right, so what's up with
txn.mu.userPriority = roachpb.NormalUserPriority
inNewTxnFromProto
? Is that going to break things?
Done.
pkg/kv/kvserver/concurrency/lock_table_test.go, line 154 at r6 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: drop this below
new-txn
Done.
pkg/storage/mvcc_history_test.go, line 1007 at r5 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: move this below
newTxn
Done.
pkg/storage/pebble_mvcc_scanner.go, line 471 at r5 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
12
?
Done.
pkg/storage/pebble_mvcc_scanner.go, line 489 at r5 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
The second sentence makes it sound like the return values are only set if the intent is owned by the reading transaction and not one of its parents. Consider updating this.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing I'm trying to determine is whether we need to take into consideration server-side refreshes, which could raise the commit timestamp of a child transaction after the point at which we refresh all ancestors. If that's an issue, then this kind of pre-commit hook isn't going to work out.
I see why we arrived where we did here, but it's all starting to become pretty heavyweight. This makes me think you may have had things right earlier on, when you compared the write footprint of the child transaction after committing with the refresh spans of each ancestor transaction and called it a day. Since this is all in service of detecting programming errors, it feels a little wrong to be doing something so expensive and to add so much new logic in. Which way are you leaning on all of this? It sounds like you aren't too happy with the recursion in the txn_coord_sender.
Reviewed 1 of 29 files at r10, 3 of 6 files at r12, 3 of 4 files at r15, 3 of 4 files at r17, 3 of 4 files at r18, 1 of 1 files at r19, 1 of 2 files at r20.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @nvanbenschoten)
pkg/kv/kvclient/kvcoord/txn_coord_sender.go, line 183 at r20 (raw file):
return nil } updated, err := tc.interceptorAlloc.forceRefreshLocked(ctx, txn, childrenIDs...)
nit here and below: calling the method/access fields on the interceptorAlloc
itself is pretty confusing. Consider referencing the specific interceptor directly, like we do elsewhere in this file.
pkg/roachpb/api.proto, line 1591 at r20 (raw file):
// their descendants. Encountering such an intent must trigger an assertion // failure as it is a programming error. repeated bytes descendant_txns = 4 [(gogoproto.customtype) = "github.com/cockroachdb/cockroach/pkg/util/uuid.UUID"];
If we're already returning the conflicting intent in a structured RefreshFailedError
, why do we also need to send this in the request itself? Can't we make a determination about this on the client by looking at the RefreshFailedError
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing I'm trying to determine is whether we need to take into consideration server-side refreshes, which could raise the commit timestamp of a child transaction after the point at which we refresh all ancestors. If that's an issue, then this kind of pre-commit hook isn't going to work out.
Makes sense. I was worried about something like that and was thinking about splitting off the EndTxn
to deal with it. This is especially a problem if the final batch of contains a write that would do the invalidating. I'm somewhat unhappy with the post-commit hook to forward the parent. I'm not upset about the recursion per se, but I'm upset that the recursion potentially runs recursive RPCs.
What if we did a hybrid of all of this where the child hangs on to a handle to its parent (and thus all of its ancestors) and then, upon commit EndTxn
, ensures that the set of writes do not invalidate any parent reads. It's almost the same as the old approach except that it removes the crufty API from the kv.TxnSender
.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @nvanbenschoten)
pkg/roachpb/api.proto, line 1591 at r20 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
If we're already returning the conflicting intent in a structured
RefreshFailedError
, why do we also need to send this in the request itself? Can't we make a determination about this on the client by looking at theRefreshFailedError
?
The issue is that there might be multiple violating intents and we want to make sure to discover the right one. That might just be premature optimization given this is all about detecting a programming error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if we did a hybrid of all of this where the child hangs on to a handle to its parent (and thus all of its ancestors) and then, upon commit EndTxn, ensures that the set of writes do not invalidate any parent reads. It's almost the same as the old approach except that it removes the crufty API from the kv.TxnSender.
This approach sounds reasonable to me.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @nvanbenschoten)
pkg/roachpb/api.proto, line 1591 at r20 (raw file):
Previously, ajwerner wrote…
The issue is that there might be multiple violating intents and we want to make sure to discover the right one. That might just be premature optimization given this is all about detecting a programming error.
Yeah, I would think that we would only need to reject due to a programming error if we only ran into our child's intent. Otherwise, we're also seeing a real conflict, so we're either not in an infinite retry loop (meaning we'll find the programming error after a retry) or we have other real issues.
18b6645
to
81cc1e0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright, this is RFAL.
The lock table stuff is now simplified and more explicit but it still deserves more scrutiny. Maybe I can get @sumeerbhola to sign off.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)
pkg/kv/txn.go, line 1332 at r3 (raw file):
Previously, ajwerner wrote…
I agree this is a problem. I don't really know what error should be returned here. I don't think the parent can handle this.
I just removed this. It wasn't doing anything valuable.
pkg/kv/txn_external_test.go, line 362 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
s/success/succeed/
Done.
pkg/kv/txn_external_test.go, line 365 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
👍 maybe add that as an example here.
Done.
pkg/kv/txn_external_test.go, line 369 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit:
(errA == nil) == (errB == nil)
Done.
pkg/kv/kvclient/kvcoord/txn_coord_sender.go, line 1172 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Yeah, I see what you're saying. So it is not safe for a child to have a lower priority than its parent. I guess that intuitively makes sense from a behavioral standpoint, but it's a good catch that this could actually cause a stall in deadlock resolution.
This has been pulled into the TxnCoordSender.
pkg/kv/kvclient/kvcoord/txn_coord_sender.go, line 183 at r20 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit here and below: calling the method/access fields on the
interceptorAlloc
itself is pretty confusing. Consider referencing the specific interceptor directly, like we do elsewhere in this file.
No longer relevant.
pkg/kv/kvserver/concurrency/concurrency_manager.go, line 268 at r6 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Me too. My thought was that if we detect the illegal conflict in the lock table already, why do so here as well? We just added the intent to the lock table and we're just about to go back and scan it again.
I moved this above adding the discovered lock. I do think this still needs to be here as the lock table doesn't have a complete view.
pkg/kv/kvserver/concurrency/lock_table.go, line 432 at r6 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Well all locks today are exclusive, regardless of whether they're replicated or not. I agree that we should try to make this interaction more explicit.
Okay, made it all more explicit.
pkg/kv/kvserver/concurrency/lock_table_test.go, line 50 at r6 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Add a comment about the new
new-child-txn
directive.
Done.
pkg/kv/kvserver/concurrency/testdata/lock_table/child_txn, line 3 at r6 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
👍
Added some more testing and clarified the behavior here.
pkg/kv/kvserver/txnwait/queue.go, line 140 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Give this line a comment. It's pretty subtle.
Done.
pkg/kv/kvserver/txnwait/queue.go, line 484 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Let's keep/adapt this comment.
Done.
pkg/kv/kvserver/txnwait/queue.go, line 500 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: consider pulling this into a helper and sharing that with
getNumPusherTxnsToTrack
.
Done.
pkg/kv/kvserver/txnwait/queue.go, line 683 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I think we still want to release if it isn't the pusher which is updated, because we may still have noticed new dependents.
Removed the comment.
pkg/kv/kvserver/txnwait/queue.go, line 705 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
It doesn't look like it. Only the goroutine that sent on
queryPusherCh
needs to be notified.
Done.
pkg/roachpb/api.proto, line 1591 at r20 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Yeah, I would think that we would only need to reject due to a programming error if we only ran into our child's intent. Otherwise, we're also seeing a real conflict, so we're either not in an infinite retry loop (meaning we'll find the programming error after a retry) or we have other real issues.
This is gone.
pkg/roachpb/errors.proto, line 482 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Is this the direct parent or the ancestor that was aborted? If the latter (which I suspect it is so that we can catch the retry error at the right level), should we rename this to
AncestorAbortedError
? Or at least spell this all out?
Done.
c016b77
to
7182dc0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 4 files at r25, 1 of 7 files at r26, 1 of 6 files at r27.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @nvanbenschoten)
pkg/kv/txn.go, line 1291 at r28 (raw file):
// from recoverable internal errors, and is automatically committed // otherwise. The retryable function should have no side effects which could // cause problems in the event it must be run more than once.
does this "run more than once" include "run and committed more than once", i.e., do the writes need to be idempotent?
pkg/kv/txn.go, line 1300 at r28 (raw file):
// // * No concurrent operations may be performed on the parent transaction. // * The child transaction must not invalidate any reads of the parent.
so it can't write to any keys that the parent has already read? Is this beyond the protection that the ts cache gives us? I am assuming this includes the case where the child txn is writing at a higher timestamp, which the ts cache will allow, but we are explicitly saying it should not do so.
But it could write to keys that the parent will read after the child commits?
Without some documented constraints on the "read set compression", I am not sure how the child txn can do any writes. Can we document them here?
What if the parent has acquired an unreplicated exclusive lock that has been forgotten. We can't guarantee an assertion failure.
In practice do we need many levels of ancestors or will this be used with only 2-levels? If the latter, can the overall code be simplified under that constraint.
pkg/kv/txn.go, line 1318 at r28 (raw file):
// if the parent is aborted during deadlock detection and the child observes // that it has been aborted, the child will also be aborted and the parent // will encounter a retry.
This behavior is not needed for correctness, yes?
pkg/kv/txn.go, line 1323 at r28 (raw file):
// timestamp of the parent. // * The child may commit values which the parent subsequently reads. This is // fine and can be desirable.
These committed values are observable by other unrelated txns that are concurrent, yes?
So is the only thing that makes the parent-child relationship special the one-way information transfer from the parent to the child of writes by the parent being observed by the child? The other direction of information is just whether the child committed or not, which the parent would know about if it used a non-child transaction?
(I'm asking things that were probably documented in the RFC, but I've swapped out some of that state, and I am guessing some things may have changed since the RFC)
pkg/kv/kvserver/concurrency/concurrency_manager.go, line 268 at r6 (raw file):
Previously, ajwerner wrote…
I moved this above adding the discovered lock. I do think this still needs to be here as the lock table doesn't have a complete view.
This seems reasonable.
Will the parentIntentError
cause this Request
to stop evaluating and call FinishReq
? Do we want the call to AddDiscoveredLock
when this error occurs?
pkg/kv/kvserver/concurrency/lock_table.go, line 454 at r28 (raw file):
// will result in an assertion failure error. This is consistent with the // rules of child transactions which state that descendants my not acquire // exclusive locks or write underneath the reads or writes of any ancestors.
What if the descendent is writing above a read of the ancestor? We can't detect it here, yes?
Since the lock table is incomplete wrt both replicated and unreplicated locks, and this write-write conflict seems mainly an error check against our internal use of child transactions, I am wondering whether we need to make any changes to the lock table at all and just treat an ancestor txn lock as equivalent to the child txn already holding the lock. Then the evaluation will pick up any errors via the parentIntentError
path.
We're also introducing optimistic locking where the checking happens after evaluation #58670, so I'm trying to see what changes are essential to the lock table and what are optional (to reduce its complexity as much as possible).
I'm probably missing some subtlety here.
pkg/storage/pebble_mvcc_scanner.go, line 92 at r28 (raw file):
// Go port of mvccScanner in libroach/mvcc.h. Stores all variables relating to // one MVCCGet / MVCCScan call. type pebbleMVCCScanner struct {
Do we only need changes to pebbleMVCCScanner
in the MVCC path since the only special handling for child txns is the write => read data transfer from parent to child?
I keep forgetting that even MVCCGet
uses pebbleMVCCScanner
. The code comment for this struct is both dated and insufficient. Could you add another sentence saying that pebbleMVCCScanner
must be used for all Gets and Scans since it correctly handles child transactions.
pkg/storage/pebble_mvcc_scanner.go, line 373 at r28 (raw file):
} ownIntent, epoch, seqNum, ignoredSeqNums := p.ownsIntent()
How does the epoch
and seqNum
of the child txn relate to that of the parent? I am guessing it is derived from the parent since we are continuing to do comparisons like seqNum >= p.meta.Txn.Sequence
and epoch == p.meta.Txn.Epoch
. Is this already documented elsewhere in the code (I didn't look at the whole PR)?
And this works because of the "No concurrent operations may be performed on the parent transaction"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sumeerbhola I'm replying to your comments before addressing the ones which indicate the clear action to keep the conversation going.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @nvanbenschoten, and @sumeerbhola)
pkg/kv/txn.go, line 1291 at r28 (raw file):
Previously, sumeerbhola wrote…
does this "run more than once" include "run and committed more than once", i.e., do the writes need to be idempotent?
No, it does not mean run and commit more than once. It's trying to express the same behavior as (*kv.DB).Txn()
.
pkg/kv/txn.go, line 1300 at r28 (raw file):
Without some documented constraints on the "read set compression", I am not sure how the child txn can do any writes. Can we document them here?
This is a fair and valid point. If the refresh_span_bytes were to be set to zero, child transactions which get pushed can never commit. If the child never gets pushed, there's no problem. If the child were to invalidate the write of an ancestor, it would have to be pushed.
If I read the span condensing logic correctly, we'll never add span bounds which we've not added. For example, we won't pick up spans which are outside of the spans which were read. Say we had AB-AC
,AD-AE
,AF-AG
We might compress that to AB-AG
but I don't think we'll ever change the bounds to A-B
.
The principle this ends up with is that if you want to be safe, never write to any of the same table spans in child transactions that the parent has read from and even then don't read from tables which might be in-between tables that the parent has read from. This ends up being both quite restrictive and yet good enough for the use-cases I currently envision. We could probably tighten this up to change the span compression logic to not cross table boundaries.
In practice do we need many levels of ancestors or will this be used with only 2-levels? If the latter, can the overall code be simplified under that constraint.
I don't have an immediate use-case for more than one layer. @nvanbenschoten proposed the generalization. I'm not sure that it does too much to simplify any of the code.
pkg/kv/txn.go, line 1318 at r28 (raw file):
Previously, sumeerbhola wrote…
This behavior is not needed for correctness, yes?
Correct.
pkg/kv/txn.go, line 1323 at r28 (raw file):
These committed values are observable by other unrelated txns that are concurrent, yes?
Right
So is the only thing that makes the parent-child relationship special the one-way information transfer from the parent to the child of writes by the parent being observed by the child? The other direction of information is just whether the child committed or not, which the parent would know about if it used a non-child transaction?
Yes, well that and the parent will be pushed to the child's commit timestamp so that it cannot commit under the parent. Ideally if code needs information to transfer from the child back to the parent, it will do it through the closure.
pkg/kv/kvserver/concurrency/concurrency_manager.go, line 268 at r6 (raw file):
Previously, sumeerbhola wrote…
This seems reasonable.
Will theparentIntentError
cause thisRequest
to stop evaluating and callFinishReq
? Do we want the call toAddDiscoveredLock
when this error occurs?
Before this most recent revision, we were calling AddDiscoveredLock
. I don't think it's a correctness problem either way, right?
pkg/kv/kvserver/concurrency/lock_table.go, line 454 at r28 (raw file):
Previously, sumeerbhola wrote…
What if the descendent is writing above a read of the ancestor? We can't detect it here, yes?
Since the lock table is incomplete wrt both replicated and unreplicated locks, and this write-write conflict seems mainly an error check against our internal use of child transactions, I am wondering whether we need to make any changes to the lock table at all and just treat an ancestor txn lock as equivalent to the child txn already holding the lock. Then the evaluation will pick up any errors via the
parentIntentError
path.We're also introducing optimistic locking where the checking happens after evaluation #58670, so I'm trying to see what changes are essential to the lock table and what are optional (to reduce its complexity as much as possible).
I'm probably missing some subtlety here.
We detect the descendant writing over an ancestor's read at commit time by comparing the writes of the descendant to the read spans of its ancestors. All of this is just a protection against programming errors so I'd like it to be minimally disruptive.
pkg/storage/pebble_mvcc_scanner.go, line 92 at r28 (raw file):
Previously, sumeerbhola wrote…
Do we only need changes to
pebbleMVCCScanner
in the MVCC path since the only special handling for child txns is the write => read data transfer from parent to child?
I keep forgetting that evenMVCCGet
usespebbleMVCCScanner
. The code comment for this struct is both dated and insufficient. Could you add another sentence saying thatpebbleMVCCScanner
must be used for all Gets and Scans since it correctly handles child transactions.
Will do.
pkg/storage/pebble_mvcc_scanner.go, line 373 at r28 (raw file):
I'll comment ownsIntent()
to better explain what's going on. The idea is that the descendant is viewing the parent's writes as though it were the ancestor as of the snapshot of the ancestor when the child is created.
And this works because of the "No concurrent operations may be performed on the parent transaction"?
Indeed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The approach looks good to me now! Thanks for iteratively refining this.
Reviewed 2 of 32 files at r21, 6 of 6 files at r22, 3 of 7 files at r23, 3 of 4 files at r24, 4 of 4 files at r25, 6 of 7 files at r26, 5 of 6 files at r27, 2 of 2 files at r28.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @nvanbenschoten, and @sumeerbhola)
pkg/kv/txn.go, line 1324 at r3 (raw file):
Previously, ajwerner wrote…
ManualRestart I think makes sense. It felt like the SQL layer was dealing with this properly but I may just be misunderstanding how this should work.
Did you get a chance to confirm that the SQL layer is handling this correctly?
pkg/kv/txn.go, line 1318 at r28 (raw file):
Previously, sumeerbhola wrote…
This behavior is not needed for correctness, yes?
I don't think so. It would be impossible to guarantee.
pkg/kv/txn.go, line 1323 at r28 (raw file):
Previously, sumeerbhola wrote…
These committed values are observable by other unrelated txns that are concurrent, yes?
So is the only thing that makes the parent-child relationship special the one-way information transfer from the parent to the child of writes by the parent being observed by the child? The other direction of information is just whether the child committed or not, which the parent would know about if it used a non-child transaction?(I'm asking things that were probably documented in the RFC, but I've swapped out some of that state, and I am guessing some things may have changed since the RFC)
The other thing that makes the relationship special is its impact on deadlock detection between transactions. Parent transactions have an implicit dependency on their children transactions.
pkg/kv/txn_external_test.go, line 233 at r28 (raw file):
func TestChildTransactionMetadataPropagates(t *testing.T) { defer leaktest.AfterTest(t)() //defer log.Scope(t).Close(t)
Did you mean to uncomment this?
pkg/kv/kvclient/kvcoord/txn_coord_sender.go, line 159 at r28 (raw file):
both in the txnSpanRefresher
This part isn't true anymore, right?
pkg/kv/kvclient/kvcoord/txn_coord_sender.go, line 170 at r28 (raw file):
// error. // // Note that if the commitCommitTimestamp is equal to the parent's current
childCommitTimestamp
pkg/kv/kvclient/kvcoord/txn_coord_sender.go, line 828 at r28 (raw file):
tc.mu.txn.Update(br.Txn) if justCommitted { return roachpb.NewError(tc.pushParentUponCommitLocked(ctx))
It seems like this should live near maybeSleepForLinearizable
, maybe in some method that handles post-commit actions. And commitReadOnlyTxnLocked
should also call this post-commit hook.
pkg/kv/kvclient/kvcoord/txn_coord_sender_test.go, line 2415 at r28 (raw file):
require.NotNil(t, child.Parent) require.Equal(t, parent.ID, child.Parent.ID) require.Equal(t, parent.Name+" child", child.Name)
nit: are we ok with the recursive " child"s on grandchildren?
pkg/kv/kvserver/concurrency/lock_table.go, line 275 at r28 (raw file):
ts hlc.Timestamp spans *spanset.SpanSet readTS hlc.Timestamp
I think these are leftovers from a bad rebase. They were removed a few weeks go.
pkg/kv/kvserver/concurrency/lock_table.go, line 453 at r28 (raw file):
// ReadWrite requests due to descendants seeking to acquire locks over ancestors // will result in an assertion failure error. This is consistent with the // rules of child transactions which state that descendants my not acquire
"may"
pkg/kv/kvserver/concurrency/lock_table_test.go, line 55 at r28 (raw file):
Creates a Transaction. new-child-txn parent=<name> txn=<name> ts=<int>[,<int>] epoch=<int> [seq=<int>]
Feel free to ignore, but would it be easier and result in less duplicate code to add an optional parent
field to new-txn
?
pkg/kv/kvserver/txnwait/queue_test.go, line 491 at r28 (raw file):
} // noopT implements assert.TestingT so that its tests can be used without
:expandingbrain:
pkg/roachpb/data.proto, line 410 at r28 (raw file):
// STAGING to COMMITTED. Because of this, the set will only ever contain // entries when the transaction is STAGING. For more, see txnCommitter. It // may be the case that InFlightWrites is populated in a transaction which
Mind changing "in a transaction which" to "in a transaction object which"? This took me a second to understand.
pkg/roachpb/data.proto, line 431 at r28 (raw file):
// parent with this metadata. // // TODO(ajwerner): come back and describe why we want a TxnMeta here as it
Do you intend to address this TODO here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @nvanbenschoten, and @sumeerbhola)
pkg/kv/kvserver/concurrency/concurrency_manager.go, line 268 at r6 (raw file):
Previously, ajwerner wrote…
Before this most recent revision, we were calling
AddDiscoveredLock
. I don't think it's a correctness problem either way, right?
Yes, its harmless to call AddDiscoveredLock
even if this request is never going to go and wait at that lock.
pkg/kv/kvserver/concurrency/lock_table.go, line 454 at r28 (raw file):
Previously, ajwerner wrote…
We detect the descendant writing over an ancestor's read at commit time by comparing the writes of the descendant to the read spans of its ancestors. All of this is just a protection against programming errors so I'd like it to be minimally disruptive.
Are you saying we don't actually need any changes to lock_table.go
other than making the parent txn appear equivalent to the child wrt the child not waiting on the parent's locks? That is, is my supposition that the detection of bad behavior via parentIntentError
, actually sufficient?
00101e6
to
d260fce
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like the sticking point here is going to be the lock table changes. I don't feel strongly either way. I think the error propagation changes in the PR are sort of ugly, but they do at least make some of the interactions more explicit. Until we offer lock strengths weaker than exclusive, I think just letting children pass through under locks held by ancestors works well. If the ancestor were to hold a shared lock, I think the interactions get less clear.
Dismissed @sumeerbhola from a discussion.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @sumeerbhola)
pkg/kv/txn.go, line 1324 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Did you get a chance to confirm that the SQL layer is handling this correctly?
Added a test that shows that it behaves properly.
pkg/kv/txn.go, line 1323 at r28 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
The other thing that makes the relationship special is its impact on deadlock detection between transactions. Parent transactions have an implicit dependency on their children transactions.
Right, that's sort of the most important part.
pkg/kv/txn_external_test.go, line 233 at r28 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Did you mean to uncomment this?
Yes, done.
pkg/kv/kvclient/kvcoord/txn_coord_sender.go, line 159 at r28 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
both in the txnSpanRefresher
This part isn't true anymore, right?
Fixed.
pkg/kv/kvclient/kvcoord/txn_coord_sender.go, line 170 at r28 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
childCommitTimestamp
Done.
pkg/kv/kvclient/kvcoord/txn_coord_sender.go, line 828 at r28 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
It seems like this should live near
maybeSleepForLinearizable
, maybe in some method that handles post-commit actions. AndcommitReadOnlyTxnLocked
should also call this post-commit hook.
Done.
pkg/kv/kvclient/kvcoord/txn_coord_sender_test.go, line 2415 at r28 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: are we ok with the recursive " child"s on grandchildren?
I am. It's at least pretty clear. I don't know when we're going to have grandchildren.
pkg/kv/kvserver/concurrency/concurrency_manager.go, line 268 at r6 (raw file):
Previously, sumeerbhola wrote…
Yes, its harmless to call
AddDiscoveredLock
even if this request is never going to go and wait at that lock.
Reverted.
pkg/kv/kvserver/concurrency/lock_table.go, line 275 at r28 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I think these are leftovers from a bad rebase. They were removed a few weeks go.
Done.
pkg/kv/kvserver/concurrency/lock_table.go, line 453 at r28 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
"may"
Done.
pkg/kv/kvserver/concurrency/lock_table.go, line 454 at r28 (raw file):
Previously, sumeerbhola wrote…
Are you saying we don't actually need any changes to
lock_table.go
other than making the parent txn appear equivalent to the child wrt the child not waiting on the parent's locks? That is, is my supposition that the detection of bad behavior viaparentIntentError
, actually sufficient?
My original implementation was exactly what I think you're saying. It just made child transactions appear equivalent to the parent in the lock table and then any issues were caught in the HandleWriteIntentError
. @nvanbenschoten felt that this was too implicit so I added the more explicit error checking here.
This is what that commit looked like (the testing was bogus but otherwise, I do think it was okay and simpler): 30037f1
pkg/kv/kvserver/concurrency/lock_table_test.go, line 55 at r28 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Feel free to ignore, but would it be easier and result in less duplicate code to add an optional
parent
field tonew-txn
?
Ack.
pkg/roachpb/data.proto, line 410 at r28 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Mind changing "in a transaction which" to "in a transaction object which"? This took me a second to understand.
Done.
pkg/roachpb/data.proto, line 431 at r28 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Do you intend to address this TODO here?
Removed. That was very old.
pkg/storage/pebble_mvcc_scanner.go, line 92 at r28 (raw file):
Previously, ajwerner wrote…
Will do.
Done.
0a017e4
to
00abaee
Compare
58979: schemachanger,*: introduce new schema changer r=lucy-zhang a=ajwerner This PR introduces the new schema changer, for which a partially written RFC (including a "guide-level explanation") can be found in #59288. It contains a somewhat complete implementation for a subset of `ALTER TABLE ... ADD COLUMN`. Like the existing implementation, these schema changes are planned and partially executed in the user transaction, but most of the work happens in an asynchronous job that runs after the transaction has committed. We use the index backfiller to build new columns, as described in #47989. See the RFC and individual commits for details. Missing things that will be in later PRs: - Making execution changes to support concurrent writes for columns added using the index backfiller. We currently don't have a way to write column values which are not part of the primary index. See #59149. (#58990 takes care of the backfill part). - Ensuring compatible coexistence with the old schema changer; in particular, polling for queued mutations to finish before any schema changes can start, using child transactions within the user transaction (#56588). The new schema changer assumes it has exclusive rights to perform schema changes on the table, which anticipates the transactional schema changes of the future. - Reverting schema changes when failed or canceled. - Properly checkpointing and reading job progress for the backfill. Currently there's a placeholder implementation that doesn't interact with the job at all. Co-authored-by: Andrew Werner <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 31 files at r29, 6 of 6 files at r30, 3 of 7 files at r31, 3 of 4 files at r32, 4 of 4 files at r33, 7 of 7 files at r34, 5 of 6 files at r35, 2 of 2 files at r36.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @sumeerbhola)
pkg/kv/kvserver/concurrency/lock_table.go, line 454 at r28 (raw file):
Previously, ajwerner wrote…
My original implementation was exactly what I think you're saying. It just made child transactions appear equivalent to the parent in the lock table and then any issues were caught in the
HandleWriteIntentError
. @nvanbenschoten felt that this was too implicit so I added the more explicit error checking here.This is what that commit looked like (the testing was bogus but otherwise, I do think it was okay and simpler): 30037f1
The hazard I was concerned with was allowing a child txn to acquire a replicated lock on top of their parent's unreplicated lock, the (3) | u | r
case from that ref. The semantics of such a lock acquisition are not clear to me. Do we replace the parent's unreplicated lock with the child's replicated lock? What then happens when the child commits? Does the parent lose its lock? It's not clear how the lockHolderInfo
structure can even represent such a lock state.
Maybe I missed an implied "we don't care about corrupting a lockHolderInfo
because this is an error case anyway and no one should be able to commit".
I agree that we should get to a place that we're all happy with here, as @sumeerbhola's point about this needing to interact well with optimistic locking is very valid.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 4 files at r33, 6 of 7 files at r34.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @nvanbenschoten, and @sumeerbhola)
pkg/kv/kvserver/concurrency/concurrency_control.go, line 464 at r36 (raw file):
// The returned error indicates illegal operations by descendant transactions // interacting with their ancestors' locks. It is illegal for a descendant to // try to acquire a lock that overlaps with any of its descendants.
I think the "it is illegal" statement should say that the descendant should not have a SpanReadWrite
span that overlaps with an Exclusive lock held by an ancestor.
pkg/kv/kvserver/concurrency/concurrency_control.go, line 601 at r36 (raw file):
// CurState returns the latest waiting state. It returns an error if the // lock which was encountered is not allowed as it is due to an ancestor // transaction.
(nit: to be more consistent with the earlier comment)
... was encountered is illegal for this descendant transaction to conflict with, since it is held by an ancestor transaction.
pkg/kv/kvserver/concurrency/concurrency_manager.go, line 168 at r36 (raw file):
// Scan for conflicting locks. log.Event(ctx, "scanning lock table for conflicting locks") g.ltg, _ = m.lt.ScanAndEnqueue(g.Req, g.ltg)
why is this ignoring the error return value?
pkg/kv/kvserver/concurrency/concurrency_manager.go, line 274 at r36 (raw file):
wait = true } for cur := g.Req.Txn; cur != nil && parentIntentError == nil; cur = cur.Parent {
Can you add a comment before this for loop
// The lock table detects an illegal conflict with the parent, in a manner similar to this parentIntentError computation. So usually going back and waiting at the lock table after calling AddDiscoveredLock would be good enough, and we would not need to do this computation here. However, the lockTable may not be enabled, which would prevent that detection. So we make sure by computing parentIntentError.
pkg/kv/kvserver/concurrency/lock_table.go, line 454 at r28 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
The hazard I was concerned with was allowing a child txn to acquire a replicated lock on top of their parent's unreplicated lock, the
(3) | u | r
case from that ref. The semantics of such a lock acquisition are not clear to me. Do we replace the parent's unreplicated lock with the child's replicated lock? What then happens when the child commits? Does the parent lose its lock? It's not clear how thelockHolderInfo
structure can even represent such a lock state.Maybe I missed an implied "we don't care about corrupting a
lockHolderInfo
because this is an error case anyway and no one should be able to commit".I agree that we should get to a place that we're all happy with here, as @sumeerbhola's point about this needing to interact well with optimistic locking is very valid.
I see. In light of this I am fine with the change here. It won't complicate optimistic locking, I think. An optimistically evaluated request by a descendant will discover an error due to an unreplicated lock by the ancestor, after it evaluates (when it goes to check the lock table).
pkg/kv/kvserver/concurrency/lock_table.go, line 446 at r36 (raw file):
} // isAncestorTxn returns true if txn corresponds any of the ancestors of the
... if txn is an ancestor of the txn in g.
pkg/kv/kvserver/concurrency/lock_table.go, line 456 at r36 (raw file):
// Note that parent transactions may have replicated exclusive locks on keys // due to writes (intents) which are not discovered in the lock table. That is // fine and dealt with in HandleWriteIntentError. This is not handled perfectly
I worry about this "This is not handled perfectly ..." stuff.
Having a descendants unreplicated lock over an ancestor's replicated lock seems as confusing as the hazard @nvanbenschoten was mentioning where the ancestor had the unreplicated lock and the descendant had the replicated one. Given we control what these descendants are doing, can we just ensure (while the only locks are Exclusive locks -- we can revisit when adding support for other strengths) that the descendants don't unnecessarily scan with locks, unless they plan to write to that span. Then we can declare this case as illegal too.
Hmm, maybe I'm misreading this comment. The code in tryActiveWait
is doing the straightforward thing and not caring whether the currently held lock is replicated or unreplicated.
pkg/storage/pebble_mvcc_scanner.go, line 441 at r36 (raw file):
if seqNum >= p.meta.Txn.Sequence && !enginepb.TxnSeqIsIgnored(p.meta.Txn.Sequence, ignoredSeqNums) { // 9. We're reading our own txn's intent at an equal or higher sequence.
would be nice to update all these comments that say "our own".
00abaee
to
7f1efd8
Compare
Release note: None
Adds an error to deal with the case where a child detects that an ancestor is aborted. We can then turn this into a restart error for the appropriate ancestor and abort all of the other relevant descendents of that ancestor. Release note: None
This commit introduces the interface to child transactions. The method, `ChildTxn()` is symmetrical with `Txn()` in that it takes a function which will be handed the child transaction. This method additionally will deal with translating certain errors for the parent and with pushing the parent as makes sense. Release note: None
This commit adds logic to detect deadlocks involving child transactions. Child transactions are interpretted as being implicitly pushed by their parents. Release note: None
This commit augments the MVCC layer to understand child transactions. A child transaction can read data written by the parent as though it was the parent.a Release note: None
This commit adds support for child transactions to the lockTable and deals with unexpected WriteIntentErrors due to write-write conflicts between the parent and child transaction. It also disallows attempts by child transactions to acquire exclusive locks over unreplicated exclusive locks of their ancestors. Note that because the lock table does not have a complete view of locks, some locks will be discovered during evaluation. That case is handled during HandleWriteIntentError. Release note: None
This commit adds logic to ensure that the child transaction pushes its ancestors to at least its commit timestamp. It also ensures that it can detect illegal writes performed by a child which would invalidate the reads of an ancestor and result in an infinite retry loop. Release note: None
Release note: None
7f1efd8
to
fd39881
Compare
This PR adds support for child transactions. Still missing is a flag to ensure that parent writes are not encountered.
Fixes #54633