
kv,storage: add support for child transactions #56588

Open · wants to merge 8 commits into master from ajwerner/child-transactions
Conversation

@ajwerner ajwerner (Contributor) commented Nov 11, 2020

This PR adds support for child transactions. Still missing is a flag to ensure that parent writes are not encountered.

Fixes #54633
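For readers following along, here is a rough sketch of how the closure-based child-transaction API discussed in this review might be used. The ChildTxn name and signature are assumptions inferred from the review comments below (it is described as mirroring the (*kv.DB).Txn closure pattern), not necessarily the final API.

// Hedged sketch only; not code from this PR.
if err := db.Txn(ctx, func(ctx context.Context, parent *kv.Txn) error {
	// The parent performs some work under its own transaction.
	if err := parent.Put(ctx, "parent-key", "v1"); err != nil {
		return err
	}
	// Run a child transaction, which may commit before the parent does.
	// The child observes the parent's provisional writes but must not
	// invalidate any of the parent's reads.
	return parent.ChildTxn(ctx, func(ctx context.Context, child *kv.Txn) error {
		return child.Put(ctx, "child-key", "v2")
	})
}); err != nil {
	// Handle the error; the parent may need to retry.
	return err
}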

@cockroach-teamcity (Member): This change is Reviewable

@ajwerner ajwerner force-pushed the ajwerner/child-transactions branch 3 times, most recently from 91c9080 to 03f7a56 on November 12, 2020 15:42
@ajwerner ajwerner force-pushed the ajwerner/child-transactions branch 3 times, most recently from d771dd9 to fd6b310 on December 1, 2020 15:32
@ajwerner ajwerner marked this pull request as ready for review December 1, 2020 15:32
@ajwerner ajwerner (Contributor Author) commented Dec 1, 2020

@nvanbenschoten I still haven't added the flag to disallow interactions between children and their ancestor intents, but I'd love it if you could give this a pass, as I think there's some controversial stuff.

@nvanbenschoten nvanbenschoten (Member) left a comment

This is looking good. Sorry for the long delay on the review - it got lost in my inbox.

What were the controversial parts that you wanted to discuss further? Did I catch them in the review or were there others you'd also like to raise and discuss?

Reviewed 5 of 5 files at r1, 5 of 5 files at r2, 6 of 6 files at r3, 3 of 3 files at r4, 4 of 4 files at r5, 4 of 4 files at r6, 6 of 6 files at r7.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @nvanbenschoten)


pkg/kv/sender.go, line 302 at r7 (raw file):

	// transaction and must be committed. If the set of read spans
	// of the transaction overlaps with the writes of the child (as accounted
	// for in InFlightWrites and LockSpans), an error will be returned.

Oh, that's interesting. So we don't even want to support cases where the parent has already read keys that the child ends up writing? If so, does that simplify the lock table handling?


pkg/kv/sender.go, line 303 at r7 (raw file):

	// of the transaction overlaps with the writes of the child (as accounted
	// for in InFlightWrites and LockSpans), an error will be returned.
	ForwardToChild(ctx context.Context, child *roachpb.Transaction) error

My immediate reaction to this signature is "why isn't this ForwardTimestampAndRefresh?"


pkg/kv/txn.go, line 1290 at r3 (raw file):

}

// ChildTxn executes retryable in the context of a distributed transaction. The

We'll want to say more here about the interaction between the parent txn and the child txn. What happens when the child txn reads a key that the parent txn wrote to? What happens when it writes to the same key as the parent txn? Similarly, what are the answers to those questions in reverse, once the child commits and the parent begins operating again?


pkg/kv/txn.go, line 1314 at r3 (raw file):

	// TODO(ajwerner): Validate the state of the parent transaction. It should be
	// open (non-terminal), and a root transaction at the very least.

Yeah, this seems like a good idea.


pkg/kv/txn.go, line 1317 at r3 (raw file):

	kvTxn := txn.Sender().NewChildTransaction()
	childTxn := NewTxnFromProto(ctx, txn.db, txn.gatewayNodeID, kvTxn.ReadTimestamp, RootTxn, kvTxn)

I'm not quite sure what's going on with txn.mu.userPriority in this method, but I'm not sure it's going to behave correctly if kvTxn has a min or max priority.


pkg/kv/txn.go, line 1324 at r3 (raw file):

	if paErr := (*roachpb.ParentAbortedError)(nil); errors.As(err, &paErr) &&
		paErr.ParentTxn.ID == txn.ID() {
		return roachpb.NewTransactionRetryWithProtoRefreshError(

Do we need to call txn.mu.sender.ManualRestart or GenerateForcedRetryableError to ensure that the parent txn's TxnCoordSender itself is restarted?


pkg/kv/txn.go, line 1332 at r3 (raw file):

	// a higher-level txn to be retried. We don't do this in any of the other
	// functions in DB; I guess we should.
	if errors.HasType(err, (*roachpb.TransactionRetryWithProtoRefreshError)(nil)) {

Shouldn't the child transaction have retried in childTxn.exec if it saw one of these errors?

Also, is errors.Wrapf enough to terminate the error? Won't errors.HasType still find the error cause?
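To make the second question concrete, a hedged illustration (assuming the usual github.com/cockroachdb/errors semantics; not code from this PR): Wrapf preserves the cause chain, so HasType on the wrapped error would still find the retry error underneath.

// Hypothetical snippet for discussion only.
var retryErr error = &roachpb.TransactionRetryWithProtoRefreshError{ /* fields elided */ }
wrapped := errors.Wrapf(retryErr, "child transaction failed")
// HasType walks the chain of causes, so this is still true after wrapping.
found := errors.HasType(wrapped, (*roachpb.TransactionRetryWithProtoRefreshError)(nil))
_ = found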


pkg/kv/txn.go, line 1348 at r7 (raw file):

	// the child. This will also detect whether the child illegally wrote over
	// the parent's reads.
	return txn.Sender().ForwardToChild(ctx, childTxn.TestingCloneTxn())

nit: we shouldn't be using TestingCloneTxn in non-testing code.


pkg/kv/txn_external_test.go, line 229 at r3 (raw file):

// that requests issued in a child transaction carry the TxnMeta of the
// parent.
func TestChildTransactionMetadataPropagates(t *testing.T) {

Do we need another test for various retry conditions around child and parent txns?


pkg/kv/txn_external_test.go, line 362 at r4 (raw file):

	//
	// Without any transaction timeouts, the two possible outcomes should be
	// either that A and its child success or that B and its child success.

s/success/succeed/


pkg/kv/txn_external_test.go, line 365 at r4 (raw file):

	//
	// There are some extreme scenarios which can happen under transaction
	// expiration whereby A ends up getting aborted because of a timeout and A1

What kind of timeout?


pkg/kv/txn_external_test.go, line 369 at r4 (raw file):

	// could happen for both of them. We do not anticipate or allow these
	// scenarios.
	if (errA == nil && errB == nil) || (errA != nil && errB != nil) {

nit: (errA == nil) == (errB == nil)


pkg/kv/kvclient/kvcoord/txn_coord_sender.go, line 1153 at r3 (raw file):

	// TODO(ajwerner): Consider stripping information from the parent proto like
	// LockSpans or perhaps introducing a new proto altogether for parents. The
	// latter is likely to be onerous.

A TxnMeta doesn't work?


pkg/kv/kvclient/kvcoord/txn_coord_sender.go, line 1172 at r3 (raw file):

	// deadlock. It seems that priority levels should only increase and thus
	// we shouldn't have these scenarios so long as the priority starts at at
	// least that of the parent.

I don't quite understand this comment. Is it assuming that priorities immediately resolve pushes one way or another? That isn't really the case – they mostly just impact who gets aborted once a deadlock has been detected.


pkg/kv/kvclient/kvcoord/txn_coord_sender.go, line 1219 at r7 (raw file):

		for _, readSpan := range tc.interceptorAlloc.refreshFootprint.asSlice() {
			if writes.Overlaps(readSpan.AsRange()) {
				return errors.New("cannot forward provisional commit timestamp " +

If we refreshed the parent immediately to the new write timestamp, could we detect conflicts without all this?

Or does that not work for cases where the parent and child txns have the same timestamp? Actually, maybe that's fine for the same reason we can short circuit on child.WriteTimestamp.LessEq(tc.mu.txn.ReadTimestamp).


pkg/kv/kvserver/concurrency/concurrency_manager.go, line 268 at r6 (raw file):

			log.Fatalf(ctx, "%v", err)
		}
		for cur := g.Req.Txn; cur != nil && parentIntentError == nil; cur = cur.Parent {

Do we need this if we detect the conflict in the lock table? I guess we do in cases where the lock table is disabled (!added)?

If that's so, then should we detect this before even calling AddDiscoveredLock?


pkg/kv/kvserver/concurrency/lock_table.go, line 432 at r6 (raw file):

// the system.
//
//   case | parent | child

This is pretty interesting - and a little scary. How many of these combinations do we actually need to support? The first thing that jumps out to me is that this should probably be talking in terms of locking strength, not locking durability. It's understandable that you didn't do this because we only currently support a single locking strength.

Once we do that, I wonder if we can simplify this a little by saying that a lock held by a parent txn also implies that the lock is held by the child txn, up to the locking strength. So a child transaction only needs to acquire a lock that conflicts with one that its parent acquired if it needs a stronger lock.

Viewing this problem as one of lock promotion is nice because it pretty clearly raises the question of what happens when the parent shares a SHARED lock with another txn and then the child tries to upgrade to an exclusive lock. This creates the deadlock scenario that you've been thinking through.

It also raises the question of what to do if the parent already holds an exclusive lock (all four of these cases). In such cases, do we need to support the child being able to perform a write? It sounds like we don't for replicated/replicated. Do we need to for any other case?


pkg/kv/kvserver/concurrency/lock_table_test.go, line 50 at r6 (raw file):

  Creates a lockTable. The lockTable is initially enabled.

new-txn txn=<name> ts=<int>[,<int>] epoch=<int> [seq=<int>]

Add a comment about the new new-child-txn directive.


pkg/kv/kvserver/concurrency/lock_table_test.go, line 154 at r6 (raw file):

				guardsByReqName = make(map[string]lockTableGuard)
				return ""
			case "new-child-txn":

nit: drop this below new-txn


pkg/kv/kvserver/concurrency/testdata/lock_table/child_txn, line 2 at r6 (raw file):

# Test that we do not block on a parent's intent.
# Test that we do not block on a parent's read lock.

s/read/unreplicated exclusive/


pkg/kv/kvserver/concurrency/testdata/lock_table/child_txn, line 3 at r6 (raw file):

# Test that we do not block on a parent's intent.
# Test that we do not block on a parent's read lock.
# TODO(ajwerner): We need to test some of the more pernicious issues.

👍


pkg/kv/kvserver/txnwait/queue.go, line 140 at r4 (raw file):

	set := map[uuid.UUID]struct{}{}
	for _, push := range pt.waitingPushes {
		for cur := &push.req.PusherTxn; cur != nil; cur = cur.Parent {

Give this line a comment. It's pretty subtle.


pkg/kv/kvserver/txnwait/queue.go, line 484 at r4 (raw file):

		readyCh = make(chan struct{}, 1)
		queryPusherCh, queryPusherErrCh = q.startQueryPusherTxn(ctx, push, readyCh)
		// Ensure that the pusher querying goroutine is complete at exit.

Let's keep/adapt this comment.


pkg/kv/kvserver/txnwait/queue.go, line 500 at r4 (raw file):

		for i, cur := 0, &req.PusherTxn; cur != nil; cur = cur.Parent {
			if cur.ID == uuid.Nil || !cur.IsLocking() {

nit: consider pulling this into a helper and sharing that with getNumPusherTxnsToTrack.
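One possible shape for the suggested helper, purely as a sketch (the name, signature, and skip-vs-stop behavior are assumptions, not code from this PR):

// visitLockingPusherTxns visits every locking transaction in the pusher's
// ancestor chain, skipping entries that are not locking or have no ID.
func visitLockingPusherTxns(pusher *roachpb.Transaction, visit func(*roachpb.Transaction)) {
	for cur := pusher; cur != nil; cur = cur.Parent {
		if cur.ID == uuid.Nil || !cur.IsLocking() {
			continue
		}
		visit(cur)
	}
}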


pkg/kv/kvserver/txnwait/queue.go, line 653 at r4 (raw file):

					// not the immediate parent? Ideally we'd abort all of the parents.
					// I think this will work so long as we propagate this error and
					// only convert it to a retry at the relevant parent.

Yeah, that's how I figured this would work too. But we'll need to make sure of it in the client.


pkg/kv/kvserver/txnwait/queue.go, line 683 at r4 (raw file):

			// TODO(ajwerner): What should be the semantics here? Why would we release
			// if we're going to forcePush ourselves? Should we release if it isn't
			// the pusher which is updated but rather some parent?

I think we still want to release if it isn't the pusher which is updated, because we may still have noticed new dependents.


pkg/kv/kvserver/txnwait/queue.go, line 705 at r4 (raw file):

			}
			// Signal the pusher query txn loop to continue.
			// TODO(ajwerner): Do we need all of these to be notified?

It doesn't look like it. Only the goroutine that sent on queryPusherCh needs to be notified.


pkg/roachpb/data.proto, line 423 at r1 (raw file):

  //
  // TODO(ajwerner): come back and describe why we want a TxnMeta here as it
  // pertains to read visibility of the parent.

Do we want a TxnMeta or a Transaction?


pkg/roachpb/errors.proto, line 482 at r2 (raw file):

// behavior.
message ParentAbortedError {
  optional roachpb.Transaction parent_txn = 1 [(gogoproto.nullable) = false];

Is this the direct parent or the ancestor that was aborted? If the latter (which I suspect it is so that we can catch the retry error at the right level), should we rename this to AncestorAbortedError? Or at least spell this all out?


pkg/storage/mvcc_history_test.go, line 1007 at r5 (raw file):

}

func (e *evalCtx) newChildTxn(

nit: move this below newTxn


pkg/storage/pebble_mvcc_scanner.go, line 471 at r5 (raw file):

	if epoch < p.meta.Txn.Epoch {
		// 12. We're reading our own txn's intent but the current txn has

12?


pkg/storage/pebble_mvcc_scanner.go, line 489 at r5 (raw file):

// ownsIntent return true if the current intent is owned by the reading
// transaction or one of its parents. If it is owned by the reading transaction,

The second sentence makes it sound like the return values are only set if the intent is owned by the reading transaction and not one of its parents. Consider updating this.

@ajwerner ajwerner (Contributor Author) left a comment

Thanks for taking a look. Generally agree with your testing comments. Also agree with being more explicit about some properties in comments. Curious to explore some locking upgrade concepts to make that code clearer.

I'll come back and take another pass on the code after we discuss some of the things here.

I think the most controversial things were:

  • The protobuf design - using the transaction directly and chaining them
  • The forwarding APIs - it started as simply as you imagined but being able to detect the difference between a true refresh failure and a real problem due to parent-child conflicts drove me to this API
  • The various error types and handling in:
    • lock table when encountering a parent's write in a write-write conflict (or read lock-write conflict I suppose)
    • When the child detects an aborted ancestor in the txnwait.Queue

Another note is that we've come up with a case where we think we're going to want to use these child transactions in 21.1.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @nvanbenschoten)


pkg/kv/sender.go, line 302 at r7 (raw file):
If the parent reads the keys and then the child writes to them, it seems that the parent should be doomed to fail. In particular, we want to make sure that the parent never causally precedes the child by the time it commits. If the parent read and then the child wrote, the child will get pushed above the parent because of the timestamp cache. That will force the parent to refresh here, leading to certain failure.

If so, does that simplify the lock table handling?

What do you mean?


pkg/kv/sender.go, line 303 at r7 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

My immediate reaction to this signature is "why isn't this ForwardTimestampAndRefresh?"

It originally was; however, there's a funny case I ran into while making sure we can detect bad client behavior.


pkg/kv/txn.go, line 1317 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I'm not quite sure what's going on with txn.mu.userPriority in this method, but I'm not sure it's going to behave correctly if kvTxn has a min or max priority.

The child needs to have priority at least as high as its parent. Otherwise there are some funny live-lock cases that can arise where children get aborted but parents don't and these things end up looping.


pkg/kv/txn.go, line 1324 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Do we need to call txn.mu.sender.ManualRestart or GenerateForcedRetryableError to ensure that the parent txn's TxnCoordSender itself is restarted?

ManualRestart I think makes sense. It felt like the SQL layer was dealing with this properly but I may just be misunderstanding how this should work.


pkg/kv/txn.go, line 1332 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Shouldn't the child transaction have retried in childTxn.exec if it saw one of these errors?

Also, is errors.Wrapf enough to terminate the error? Won't errors.HasType still find the error cause?

I agree this is a problem. I don't really know what error should be returned here. I don't think the parent can handle this.


pkg/kv/txn.go, line 1348 at r7 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: we shouldn't be using TestingCloneTxn in non-testing code.

Agreed, I think I want to make it non-testing. Any issues with that or alternative suggestions?


pkg/kv/txn_external_test.go, line 229 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Do we need another test for various retry conditions around child and parent txns?

Surely


pkg/kv/txn_external_test.go, line 365 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

What kind of timeout?

Expiration, e.g. because heartbeats aren't working.


pkg/kv/kvclient/kvcoord/txn_coord_sender.go, line 1153 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

A TxnMeta doesn't work?

Well, for one, we need the ignored sequence numbers and we need to propagate the entire chain of ancestors. My other concern was that it would make the code that walks the ancestor chain messier.


pkg/kv/kvclient/kvcoord/txn_coord_sender.go, line 1172 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I don't quite understand this comment. Is it assuming that priorities immediately resolve pushes one way or another? That isn't really the case – they mostly just impact who gets aborted once a deadlock has been detected.

Correct, this is entirely in the context of deadlocks. Nevertheless, imagine the above scenario. When deciding to break a deadlock, it's the pusher that makes the determination to do the force push. If both pushers have lower priorities than the transactions which they are pushing, then there's a problem, right?


pkg/kv/kvclient/kvcoord/txn_coord_sender.go, line 1219 at r7 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

If we refreshed the parent immediately to the new write timestamp, could we detect conflicts without all this?

Or does that not work for cases where the parent and child txns have the same timestamp? Actually, maybe that's fine for the same reason we can short circuit on child.WriteTimestamp.LessEq(tc.mu.txn.ReadTimestamp).

I originally just pushed the parent's write timestamp. That will eventually force a refresh. The problem is that if that refresh fails, we won't know whether that's a legitimate serializable restart or a programming bug (this case). Because we can't know which transaction caused the refresh failure, we can't just refresh unless we want all refreshes due to child pushes to always fail, which seems unacceptable.

A test that I wrote ran into this where the parent transaction just retried forever when a child had invalidated one of its reads.

One way out of this would be if we forced parents to run in a pessimistic mode with replicated locks on all reads. Then we could detect any such conflict before it happened rather than after. Of course, that doesn't seem like something that's going to happen.


pkg/kv/kvserver/concurrency/concurrency_manager.go, line 268 at r6 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Do we need this if we detect the conflict in the lock table? I guess we do in cases where the lock table is disabled (!added)?

If that's so, then should we detect this before even calling AddDiscoveredLock?

I feel like I'm missing something here.


pkg/kv/kvserver/concurrency/lock_table.go, line 432 at r6 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

This is pretty interesting - and a little scary. How many of these combinations do we actually need to support? The first thing that jumps out to me is that this should probably be talking in terms of locking strength, not locking durability. It's understandable that you didn't do this because we only currently support a single locking strength.

Once we do that, I wonder if we can simplify this a little by saying that a lock held by a parent txn also implies that the lock is held by the child txn, up to the locking strength. So a child transaction only needs to acquire a lock that conflicts with one that its parent acquired if it needs a stronger lock.

Viewing this problem as one of lock promotion is nice because it pretty clearly raises the question of what happens when the parent shares a SHARED lock with another txn and then the child tries to upgrade to an exclusive lock. This creates the deadlock scenario that you've been thinking through.

It also raises the question of what to do if the parent already holds an exclusive lock (all four of these cases). In such cases, do we need to support the child being able to perform a write? It sounds like we don't for replicated/replicated. Do we need to for any other case?

Fundamentally I think if the parent holds an exclusive lock then we should have an assertion failure. I had a hard time understanding how to make that happen so instead I allowed it to happen and then forced the assertion failure in other places. I'd love to make this interaction more explicit and fix any problems here.


pkg/roachpb/data.proto, line 423 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Do we want a TxnMeta or a Transaction?

Right, s/TxnMeta/Transaction/.

Discussed elsewhere but the issues are the ignored sequence numbers and also the chain of ancestors.

@ajwerner ajwerner (Contributor Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @nvanbenschoten)


pkg/kv/kvclient/kvcoord/txn_coord_sender.go, line 1219 at r7 (raw file):

Previously, ajwerner wrote…

I originally just pushed the parent's write timestamp. That will eventually force a refresh. The problem is that if that refresh fails, we won't know whether that's a legitimate serializable restart or a programming bug (this case). Because we can't know which transaction caused the refresh failure, we can't just refresh unless we want all refreshes due to child pushes to always fail, which seems unacceptable.

A test that I wrote ran into this where the parent transaction just retried forever when a child had invalidated one of its reads.

One way out of this would be if we forced parents to run in a pessimistic mode with replicated locks on all reads. Then we could detect any such conflict before it happened rather than after. Of course, that doesn't seem like something that's going to happen.

Alright, thinking a bit more about this, it seems like if the parent is forced to do a refresh before the child commits and proves that that succeeds, then it feels like that could just work. It might be the case that the child gets pushed even further between the refresh and its commit but that's okay because we'll have a proof that the child didn't invalidate any of the parent reads. This also would be better because then the child wouldn't have to commit first.

This would require:

  1. Building the machinery to force the parent to refresh eagerly
  2. Make sure to propagate a reasonable error when that refresh hits a child's intent
  3. Find a way to ensure that the EndTxn cannot happen before the parent does its stuff. Perhaps the way that happens is by having the child require that the read timestamp of the parent is at least as high as the timestamp of any intent it has written. I don't know how trivial this will be.

@ajwerner ajwerner (Contributor Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @nvanbenschoten)


pkg/kv/kvclient/kvcoord/txn_coord_sender.go, line 1219 at r7 (raw file):

Previously, ajwerner wrote…

Alright, thinking a bit more about this, it seems like if the parent is forced to do a refresh before the child commits and proves that that succeeds, then it feels like that could just work. It might be the case that the child gets pushed even further between the refresh and its commit but that's okay because we'll have a proof that the child didn't invalidate any of the parent reads. This also would be better because then the child wouldn't have to commit first.

This would require:

  1. Building the machinery to force the parent to refresh eagerly
  2. Make sure to propagate a reasonable error when that refresh hits a child's intent
  3. Find a way to ensure that the EndTxn cannot happen before the parent does its stuff. Perhaps the way that happens is by having the child require that the read timestamp of the parent is at least as high as the timestamp of any intent it has written. I don't know how trivial this will be.

If we do want to push and refresh prior to committing the child, we'll need a way to do that for all of the ancestors. Perhaps this implies a need for tighter coupling of children to parents inside of the txnCoordSender.

@nvanbenschoten nvanbenschoten (Member) left a comment

The protobuf design - using the transaction directly and chaining them

I'm 👍 on this.

The forwarding APIs - it started as simply as you imagined but being able to detect the difference between a true refresh failure and a real problem due to parent-child conflicts drove me to this API

Yeah, I'm still not clear what the best option for this is either. It's a tricky problem. I like the idea of using refreshing to detect the conflicts so that we don't need to rely on the client having a precise memory of parent reads and child writes, but getting that to work properly is tough.

The various error types and handling in:
lock table when encountering a parent's write in a write-write conflict (or read lock-write conflict I suppose)

Seems like this should just be an assertion failure, right?

When the child detects an aborted ancestor in the txnwait.Queue

I like the solution you've come up with here for this problem (ParentAbortedError + client-side unwinding).

Reviewed 1 of 1 files at r8, 3 of 3 files at r9.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @nvanbenschoten)


pkg/kv/sender.go, line 302 at r7 (raw file):

What do you mean?

This was in regard to the discussion we were having in lock_table.go. I thought that we had to handle complex cases like Shared locks held by the parent being promoted to Exclusive locks by the child, but it sounds like we don't. In that case, we can throw an error in the lockTable whenever we find conflicts between ancestors, regardless of locking strength.
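A hedged sketch of what that lockTable check might look like (hypothetical helper; the Parent field walk and the exact call site are assumptions based on this PR, not its actual code):

// isHeldByAncestor reports whether the lock holder is an ancestor of the
// requesting transaction. Any conflicting acquisition in that case is a
// programming error rather than ordinary contention.
func isHeldByAncestor(holder, req *roachpb.Transaction) bool {
	for cur := req.Parent; cur != nil; cur = cur.Parent {
		if cur.ID == holder.ID {
			return true
		}
	}
	return false
}

// At the conflict-detection point, something like:
//   if isHeldByAncestor(holderTxn, g.Req.Txn) {
//       return errors.AssertionFailedf("txn %s conflicts with lock held by ancestor %s",
//           g.Req.Txn.ID, holderTxn.ID)
//   }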


pkg/kv/txn.go, line 1317 at r3 (raw file):

Previously, ajwerner wrote…

The child needs to have priority at least as high as its parent. Otherwise there are some funny live-lock cases that can arise where children get aborted but parents don't and these things end up looping.

Right, so what's up with txn.mu.userPriority = roachpb.NormalUserPriority in NewTxnFromProto? Is that going to break things?


pkg/kv/txn.go, line 1348 at r7 (raw file):

Previously, ajwerner wrote…

Agreed, I think I want to make it non-testing. Any issues with that or alternative suggestions?

Nope, that SGTM.


pkg/kv/txn_external_test.go, line 365 at r4 (raw file):

Previously, ajwerner wrote…

Expiration, e.g. because heartbeats aren't working.

👍 maybe add that as an example here.


pkg/kv/kvclient/kvcoord/txn_coord_sender.go, line 1153 at r3 (raw file):

Previously, ajwerner wrote…

Well, for one, we need the ignored sequence numbers and we need to propagate the entire chain of ancestors. My other concern was that it would make the code that walks the ancestor chain messier.

Ack. I thought the ignored sequence numbers were in the TxnMeta, but it appears that they are not.


pkg/kv/kvclient/kvcoord/txn_coord_sender.go, line 1172 at r3 (raw file):

Previously, ajwerner wrote…

Correct, this is entirely in the context of deadlocks. Nevertheless, imagine the above scenario. When deciding to break a deadlock, it's the pusher that makes the determination to do the force push. If both pushers have lower priorities than the transactions which they are pushing, then there's a problem, right?

Yeah, I see what you're saying. So it is not safe for a child to have a lower priority than its parent. I guess that intuitively makes sense from a behavioral standpoint, but it's a good catch that this could actually cause a stall in deadlock resolution.


pkg/kv/kvclient/kvcoord/txn_coord_sender.go, line 1219 at r7 (raw file):

If we do want to push and refresh prior to committing the child, we'll need a way to do that for all of the ancestors.

We have this problem with the current approach as well, right? We're calling ForwardToChild on the child's immediate parent, but we should really be doing this on all of the child's ancestors.

The idea of refreshing each of a child's ancestors before the child commits seems pretty complex, but your point about infinite retries is valid. Is the idea that refreshing earlier gives us the opportunity to detect conflicts while the child's writes are still intents (i.e. still contain its ID) and therefore can still be distinguished from other transactions' intents?


pkg/kv/kvserver/concurrency/concurrency_manager.go, line 268 at r6 (raw file):

Previously, ajwerner wrote…

I feel like I'm missing something here.

Me too. My thought was that if we detect the illegal conflict in the lock table already, why do so here as well? We just added the intent to the lock table and we're just about to go back and scan it again.


pkg/kv/kvserver/concurrency/lock_table.go, line 432 at r6 (raw file):

Previously, ajwerner wrote…

Fundamentally I think if the parent holds an exclusive lock then we should have an assertion failure. I had a hard time understanding how to make that happen so instead I allowed it to happen and then forced the assertion failure in other places. I'd love to make this interaction more explicit and fix any problems here.

Well all locks today are exclusive, regardless of whether they're replicated or not. I agree that we should try to make this interaction more explicit.

@ajwerner ajwerner (Contributor Author) left a comment

Thanks for the look. I'm thinking that the biggest sticking point now is the forwarding behavior. I've commented on that thread. I think we should rework it to do a preemptive refresh of all ancestors prior to the committing of the child. We'll still need to forward the immediate ancestor's read timestamp after, just in case the commit still gets pushed, but we don't need that to be preemptive. Any other thoughts before I move in that direction?

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @nvanbenschoten)


pkg/kv/kvclient/kvcoord/txn_coord_sender.go, line 1219 at r7 (raw file):
Yeah, that seems right. We currently aren't detecting issues where a grandchild invalidates a grand-parent's read. This seems like a hint that this deferred checking isn't going to cut it.

Is the idea that refreshing earlier gives us the opportunity to detect conflicts while the child's writes are still intents (i.e. still contain its ID) and therefore can still be distinguished from other transactions' intents?

Exactly. I agree it's pretty complex but it feels like the right thing to do. I'm okay with the child committing and the parent not; there's no way to avoid that. However, I'm not that keen on the possibility of the child committing and the parent not in the case of this egregious assertion failure. I'm starting to think that the right way to spin this is to do some coupling of the child TxnCoordSender to its parent.

Are you okay with my moving forward with the child TxnCoordSender holding a reference to the parent and having an interceptor which grabs the EndTxn and forces the parents to refresh? That will simplify the lifecycle handling in the kv layer and ensure that any of these bogus conflicts are detected.

@nvanbenschoten nvanbenschoten (Member) left a comment

Any other thoughts before I move in that direction?

I don't think so. Let's move this PR in that direction and see how it looks.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @nvanbenschoten)


pkg/kv/kvclient/kvcoord/txn_coord_sender.go, line 1219 at r7 (raw file):

Previously, ajwerner wrote…

Yeah, that seems right. We currently aren't detecting issues where a grandchild invalidates a grand-parent's read. This seems like a hint that this deferred checking isn't going to cut it.

Is the idea that refreshing earlier gives us the opportunity to detect conflicts while the child's writes are still intents (i.e. still contain its ID) and therefore can still be distinguished from other transactions' intents?

Exactly. I agree it's pretty complex but it feels like the right thing to do. I'm okay with the child committing and the parent not; there's no way to avoid that. However, I'm not that keen on the possibility of the child committing and the parent not in the case of this egregious assertion failure. I'm starting to think that the right way to spin this is to do some coupling of the child TxnCoordSender to its parent.

Are you okay with my moving forward with the child TxnCoordSender holding a reference to the parent and having an interceptor which grabs the EndTxn and forces the parents to refresh? That will simplify the lifecycle handling in the kv layer and ensure that any of these bogus conflicts are detected.

So is the proposal that we'll augment RefreshRequest and RefreshRangeRequest with a new "invalid intent ID" field? Or introduce a structured error for it to return in cases where a conflicting value/intent is detected (with a way to distinguish between the two) and then detect the illegal conflicts in the client? All things considered, the second option seems cleaner to me.

Other than that, I think child TxnCoordSenders holding a reference to their parent seems ok. Or at least, some slim handle to avoid the possibility of too much coupling.

@ajwerner ajwerner (Contributor Author) left a comment

I don't think so. Let's move this PR in that direction and see how it looks.

👍 let's see how it goes

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @nvanbenschoten)


pkg/kv/kvclient/kvcoord/txn_coord_sender.go, line 1219 at r7 (raw file):

Or introduce a structured error for it to return in cases where a conflicting value/intent is detected (with a way to distinguish between the two) and then detect the illegal conflicts in the client?

A structured error for sure.

Or at least, some slim handle to avoid the possibility of too much coupling.

Yep, I was thinking the child would hold onto an interface that allows it to preemptively push the parent and run a refresh.
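For concreteness, one possible shape for that slim handle, with a hypothetical name and method (a sketch for discussion, not the PR's API):

// parentHandle is the narrow interface a child TxnCoordSender might hold to
// its ancestors, rather than a reference to the full TxnCoordSender.
type parentHandle interface {
	// PushAndRefresh forwards the ancestor's provisional commit timestamp to
	// at least ts and refreshes its reads, returning an error if any read was
	// invalidated (e.g. by a descendant's intent).
	PushAndRefresh(ctx context.Context, ts hlc.Timestamp) error
}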

@ajwerner ajwerner (Contributor Author) left a comment

@nvanbenschoten I reworked the pushing bits to have the child push its ancestors. It's honestly pretty messy. I haven't yet gone back to deal with the lock table. I ended up augmenting the refresh requests to include the set of descendants. It's not terrible. I'd love to get your take on this as it is before making another pass. The interfaces are not making me very happy. You tend to have a way with these things.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @nvanbenschoten)


pkg/kv/txn.go, line 1317 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Right, so what's up with txn.mu.userPriority = roachpb.NormalUserPriority in NewTxnFromProto? Is that going to break things?

Done.


pkg/kv/kvserver/concurrency/lock_table_test.go, line 154 at r6 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: drop this below new-txn

Done.


pkg/storage/mvcc_history_test.go, line 1007 at r5 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: move this below newTxn

Done.


pkg/storage/pebble_mvcc_scanner.go, line 471 at r5 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

12?

Done.


pkg/storage/pebble_mvcc_scanner.go, line 489 at r5 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

The second sentence makes it sound like the return values are only set if the intent is owned by the reading transaction and not one of its parents. Consider updating this.

Done.

@nvanbenschoten nvanbenschoten (Member) left a comment

One thing I'm trying to determine is whether we need to take into consideration server-side refreshes, which could raise the commit timestamp of a child transaction after the point at which we refresh all ancestors. If that's an issue, then this kind of pre-commit hook isn't going to work out.

I see why we arrived where we did here, but it's all starting to become pretty heavyweight. This makes me think you may have had things right earlier on, when you compared the write footprint of the child transaction after committing with the refresh spans of each ancestor transaction and called it a day. Since this is all in service of detecting programming errors, it feels a little wrong to be doing something so expensive and to add so much new logic in. Which way are you leaning on all of this? It sounds like you aren't too happy with the recursion in the txn_coord_sender.

Reviewed 1 of 29 files at r10, 3 of 6 files at r12, 3 of 4 files at r15, 3 of 4 files at r17, 3 of 4 files at r18, 1 of 1 files at r19, 1 of 2 files at r20.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @nvanbenschoten)


pkg/kv/kvclient/kvcoord/txn_coord_sender.go, line 183 at r20 (raw file):

		return nil
	}
	updated, err := tc.interceptorAlloc.forceRefreshLocked(ctx, txn, childrenIDs...)

nit here and below: calling the method/access fields on the interceptorAlloc itself is pretty confusing. Consider referencing the specific interceptor directly, like we do elsewhere in this file.


pkg/roachpb/api.proto, line 1591 at r20 (raw file):

  // their descendants. Encountering such an intent must trigger an assertion
  // failure as it is a programming error.
  repeated bytes descendant_txns = 4 [(gogoproto.customtype) = "github.com/cockroachdb/cockroach/pkg/util/uuid.UUID"];

If we're already returning the conflicting intent in a structured RefreshFailedError, why do we also need to send this in the request itself? Can't we make a determination about this on the client by looking at the RefreshFailedError?

@ajwerner ajwerner (Contributor Author) left a comment

One thing I'm trying to determine is whether we need to take into consideration server-side refreshes, which could raise the commit timestamp of a child transaction after the point at which we refresh all ancestors. If that's an issue, then this kind of pre-commit hook isn't going to work out.

Makes sense. I was worried about something like that and was thinking about splitting off the EndTxn to deal with it. This is especially a problem if the final batch contains a write that would do the invalidating. I'm somewhat unhappy with the post-commit hook to forward the parent. I'm not upset about the recursion per se, but I'm upset that the recursion potentially runs recursive RPCs.

What if we did a hybrid of all of this where the child hangs on to a handle to its parent (and thus all of its ancestors) and then, at commit (EndTxn) time, ensures that its set of writes does not invalidate any parent reads? It's almost the same as the old approach except that it removes the crufty API from the kv.TxnSender.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @nvanbenschoten)


pkg/roachpb/api.proto, line 1591 at r20 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

If we're already returning the conflicting intent in a structured RefreshFailedError, why do we also need to send this in the request itself? Can't we make a determination about this on the client by looking at the RefreshFailedError?

The issue is that there might be multiple violating intents and we want to make sure to discover the right one. That might just be premature optimization given this is all about detecting a programming error.

@nvanbenschoten nvanbenschoten (Member) left a comment

What if we did a hybrid of all of this where the child hangs on to a handle to its parent (and thus all of its ancestors) and then, at commit (EndTxn) time, ensures that its set of writes does not invalidate any parent reads? It's almost the same as the old approach except that it removes the crufty API from the kv.TxnSender.

This approach sounds reasonable to me.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @nvanbenschoten)


pkg/roachpb/api.proto, line 1591 at r20 (raw file):

Previously, ajwerner wrote…

The issue is that there might be multiple violating intents and we want to make sure to discover the right one. That might just be premature optimization given this is all about detecting a programming error.

Yeah, I would think that we would only need to reject due to a programming error if we only ran into our child's intent. Otherwise, we're also seeing a real conflict, so we're either not in an infinite retry loop (meaning we'll find the programming error after a retry) or we have other real issues.

@ajwerner ajwerner (Contributor Author) left a comment

Alright, this is RFAL.

The lock table stuff is now simplified and more explicit but it still deserves more scrutiny. Maybe I can get @sumeerbhola to sign off.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)


pkg/kv/txn.go, line 1332 at r3 (raw file):

Previously, ajwerner wrote…

I agree this is a problem. I don't really know what error should be returned here. I don't think the parent can handle this.

I just removed this. It wasn't doing anything valuable.


pkg/kv/txn_external_test.go, line 362 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

s/success/succeed/

Done.


pkg/kv/txn_external_test.go, line 365 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

👍 maybe add that as an example here.

Done.


pkg/kv/txn_external_test.go, line 369 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: (errA == nil) == (errB == nil)

Done.


pkg/kv/kvclient/kvcoord/txn_coord_sender.go, line 1172 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Yeah, I see what you're saying. So it is not safe for a child to have a lower priority than its parent. I guess that intuitively makes sense from a behavioral standpoint, but it's a good catch that this could actually cause a stall in deadlock resolution.

This has been pulled into the TxnCoordSender.


pkg/kv/kvclient/kvcoord/txn_coord_sender.go, line 183 at r20 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit here and below: calling the method/access fields on the interceptorAlloc itself is pretty confusing. Consider referencing the specific interceptor directly, like we do elsewhere in this file.

No longer relevant.


pkg/kv/kvserver/concurrency/concurrency_manager.go, line 268 at r6 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Me too. My thought was that if we detect the illegal conflict in the lock table already, why do so here as well? We just added the intent to the lock table and we're just about to go back and scan it again.

I moved this above adding the discovered lock. I do think this still needs to be here as the lock table doesn't have a complete view.


pkg/kv/kvserver/concurrency/lock_table.go, line 432 at r6 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Well all locks today are exclusive, regardless of whether they're replicated or not. I agree that we should try to make this interaction more explicit.

Okay, made it all more explicit.


pkg/kv/kvserver/concurrency/lock_table_test.go, line 50 at r6 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Add a comment about the new new-child-txn directive.

Done.


pkg/kv/kvserver/concurrency/testdata/lock_table/child_txn, line 3 at r6 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

👍

Added some more testing and clarified the behavior here.


pkg/kv/kvserver/txnwait/queue.go, line 140 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Give this line a comment. It's pretty subtle.

Done.


pkg/kv/kvserver/txnwait/queue.go, line 484 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Let's keep/adapt this comment.

Done.


pkg/kv/kvserver/txnwait/queue.go, line 500 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: consider pulling this into a helper and sharing that with getNumPusherTxnsToTrack.

Done.


pkg/kv/kvserver/txnwait/queue.go, line 683 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I think we still want to release if it isn't the pusher which is updated, because we may still have noticed new dependents.

Removed the comment.


pkg/kv/kvserver/txnwait/queue.go, line 705 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

It doesn't look like it. Only the goroutine that sent on queryPusherCh needs to be notified.

Done.


pkg/roachpb/api.proto, line 1591 at r20 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Yeah, I would think that we would only need to reject due to a programming error if we only ran into our child's intent. Otherwise, we're also seeing a real conflict, so we're either not in an infinite retry loop (meaning we'll find the programming error after a retry) or we have other real issues.

This is gone.


pkg/roachpb/errors.proto, line 482 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Is this the direct parent or the ancestor that was aborted? If the latter (which I suspect it is so that we can catch the retry error at the right level), should we rename this to AncestorAbortedError? Or at least spell this all out?

Done.

@ajwerner ajwerner force-pushed the ajwerner/child-transactions branch 2 times, most recently from c016b77 to 7182dc0 on February 1, 2021 08:35
@sumeerbhola sumeerbhola (Collaborator) left a comment

Reviewed 1 of 4 files at r25, 1 of 7 files at r26, 1 of 6 files at r27.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @nvanbenschoten)


pkg/kv/txn.go, line 1291 at r28 (raw file):

// from recoverable internal errors, and is automatically committed
// otherwise. The retryable function should have no side effects which could
// cause problems in the event it must be run more than once.

does this "run more than once" include "run and committed more than once", i.e., do the writes need to be idempotent?


pkg/kv/txn.go, line 1300 at r28 (raw file):

//
//   * No concurrent operations may be performed on the parent transaction.
//   * The child transaction must not invalidate any reads of the parent.

so it can't write to any keys that the parent has already read? Is this beyond the protection that the ts cache gives us? I am assuming this includes the case where the child txn is writing at a higher timestamp, which the ts cache will allow, but we are explicitly saying it should not do so.

But it could write to keys that the parent will read after the child commits?

Without some documented constraints on the "read set compression", I am not sure how the child txn can do any writes. Can we document them here?

What if the parent has acquired an unreplicated exclusive lock that has been forgotten? We can't guarantee an assertion failure.

In practice do we need many levels of ancestors or will this be used with only 2-levels? If the latter, can the overall code be simplified under that constraint.


pkg/kv/txn.go, line 1318 at r28 (raw file):

//     if the parent is aborted during deadlock detection and the child observes
//     that it has been aborted, the child will also be aborted and the parent
//     will encounter a retry.

This behavior is not needed for correctness, yes?


pkg/kv/txn.go, line 1323 at r28 (raw file):

//     timestamp of the parent.
//   * The child may commit values which the parent subsequently reads. This is
//     fine and can be desirable.

These committed values are observable by other unrelated txns that are concurrent, yes?
So is the only thing that makes the parent-child relationship special the one-way information transfer from the parent to the child of writes by the parent being observed by the child? The other direction of information is just whether the child committed or not, which the parent would know about if it used a non-child transaction?

(I'm asking things that were probably documented in the RFC, but I've swapped out some of that state, and I am guessing some things may have changed since the RFC)


pkg/kv/kvserver/concurrency/concurrency_manager.go, line 268 at r6 (raw file):

Previously, ajwerner wrote…

I moved this above adding the discovered lock. I do think this still needs to be here as the lock table doesn't have a complete view.

This seems reasonable.
Will the parentIntentError cause this Request to stop evaluating and call FinishReq? Do we want the call to AddDiscoveredLock when this error occurs?


pkg/kv/kvserver/concurrency/lock_table.go, line 454 at r28 (raw file):

// will result in an assertion failure error. This is consistent with the
// rules of child transactions which state that descendants my not acquire
// exclusive locks or write underneath the reads or writes of any ancestors.

What if the descendent is writing above a read of the ancestor? We can't detect it here, yes?

Since the lock table is incomplete wrt both replicated and unreplicated locks, and this write-write conflict seems mainly an error check against our internal use of child transactions, I am wondering whether we need to make any changes to the lock table at all and just treat an ancestor txn lock as equivalent to the child txn already holding the lock. Then the evaluation will pick up any errors via the parentIntentError path.

We're also introducing optimistic locking where the checking happens after evaluation #58670, so I'm trying to see what changes are essential to the lock table and what are optional (to reduce its complexity as much as possible).

I'm probably missing some subtlety here.


pkg/storage/pebble_mvcc_scanner.go, line 92 at r28 (raw file):

// Go port of mvccScanner in libroach/mvcc.h. Stores all variables relating to
// one MVCCGet / MVCCScan call.
type pebbleMVCCScanner struct {

Do we only need changes to pebbleMVCCScanner in the MVCC path since the only special handling for child txns is the write => read data transfer from parent to child?
I keep forgetting that even MVCCGet uses pebbleMVCCScanner. The code comment for this struct is both dated and insufficient. Could you add another sentence saying that pebbleMVCCScanner must be used for all Gets and Scans since it correctly handles child transactions.


pkg/storage/pebble_mvcc_scanner.go, line 373 at r28 (raw file):

	}

	ownIntent, epoch, seqNum, ignoredSeqNums := p.ownsIntent()

How does the epoch and seqNum of the child txn relate to that of the parent? I am guessing it is derived from the parent since we are continuing to do comparisons like seqNum >= p.meta.Txn.Sequence and epoch == p.meta.Txn.Epoch. Is this already documented elsewhere in the code (I didn't look at the whole PR)?

And this works because of the "No concurrent operations may be performed on the parent transaction"?

@ajwerner ajwerner (Contributor Author) left a comment

@sumeerbhola To keep the conversation going, I'm replying to your comments before addressing the ones which indicate a clear action.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @nvanbenschoten, and @sumeerbhola)


pkg/kv/txn.go, line 1291 at r28 (raw file):

Previously, sumeerbhola wrote…

does this "run more than once" include "run and committed more than once", i.e., do the writes need to be idempotent?

No, it does not mean run and commit more than once. It's trying to express the same behavior as (*kv.DB).Txn().


pkg/kv/txn.go, line 1300 at r28 (raw file):

Without some documented constraints on the "read set compression", I am not sure how the child txn can do any writes. Can we document them here?

This is a fair and valid point. If the refresh_span_bytes were to be set to zero, child transactions which get pushed can never commit. If the child never gets pushed, there's no problem. If the child were to invalidate the write of an ancestor, it would have to be pushed.

If I read the span condensing logic correctly, we'll never widen the span bounds beyond what was actually read. For example, we won't pick up spans which are outside of the spans which were read. Say we had AB-AC, AD-AE, AF-AG; we might compress that to AB-AG, but I don't think we'll ever change the bounds to A-B.

The principle this ends up with is that if you want to be safe, never write in child transactions to any of the table spans that the parent has read from, and don't write to tables which might fall in between tables that the parent has read from (since condensing may merge the parent's read spans across them). This ends up being quite restrictive and yet good enough for the use cases I currently envision. We could probably tighten this up by changing the span compression logic to not cross table boundaries.
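A toy illustration of the condensing behavior described above (heavily simplified; the real span-condensing code in the TxnCoordSender is budget-driven and more selective):

// condense collapses a set of read spans to a single bounding span. Note that
// the result never extends past the smallest and largest keys actually read.
func condense(spans []roachpb.Span) roachpb.Span {
	out := spans[0]
	for _, sp := range spans[1:] {
		if sp.Key.Compare(out.Key) < 0 {
			out.Key = sp.Key
		}
		if sp.EndKey.Compare(out.EndKey) > 0 {
			out.EndKey = sp.EndKey
		}
	}
	return out
}

// condense([AB-AC, AD-AE, AF-AG]) => AB-AG, but never A-B.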

In practice do we need many levels of ancestors or will this be used with only 2-levels? If the latter, can the overall code be simplified under that constraint.

I don't have an immediate use-case for more than one layer. @nvanbenschoten proposed the generalization. I'm not sure that it does too much to simplify any of the code.


pkg/kv/txn.go, line 1318 at r28 (raw file):

Previously, sumeerbhola wrote…

This behavior is not needed for correctness, yes?

Correct.


pkg/kv/txn.go, line 1323 at r28 (raw file):

These committed values are observable by other unrelated txns that are concurrent, yes?

Right

So is the only thing that makes the parent-child relationship special the one-way information transfer from the parent to the child of writes by the parent being observed by the child? The other direction of information is just whether the child committed or not, which the parent would know about if it used a non-child transaction?

Yes, well that and the parent will be pushed to the child's commit timestamp so that it cannot commit under the parent. Ideally if code needs information to transfer from the child back to the parent, it will do it through the closure.


pkg/kv/kvserver/concurrency/concurrency_manager.go, line 268 at r6 (raw file):

Previously, sumeerbhola wrote…

This seems reasonable.
Will the parentIntentError cause this Request to stop evaluating and call FinishReq? Do we want the call to AddDiscoveredLock when this error occurs?

Before this most recent revision, we were calling AddDiscoveredLock. I don't think it's a correctness problem either way, right?


pkg/kv/kvserver/concurrency/lock_table.go, line 454 at r28 (raw file):

Previously, sumeerbhola wrote…

What if the descendant is writing above a read of the ancestor? We can't detect it here, yes?

Since the lock table is incomplete wrt both replicated and unreplicated locks, and this write-write conflict seems mainly an error check against our internal use of child transactions, I am wondering whether we need to make any changes to the lock table at all and just treat an ancestor txn lock as equivalent to the child txn already holding the lock. Then the evaluation will pick up any errors via the parentIntentError path.

We're also introducing optimistic locking where the checking happens after evaluation #58670, so I'm trying to see what changes are essential to the lock table and what are optional (to reduce its complexity as much as possible).

I'm probably missing some subtlety here.

We detect the descendant writing over an ancestor's read at commit time by comparing the writes of the descendant to the read spans of its ancestors. All of this is just a protection against programming errors so I'd like it to be minimally disruptive.
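
Concretely, the commit-time check boils down to a span-overlap test against the ancestors' read spans. A sketch of the shape (not the actual code in this PR; how the child's lock spans and each ancestor's refresh spans get plumbed here is assumed):

package example

import (
	"github.com/cockroachdb/cockroach/pkg/roachpb"
	"github.com/cockroachdb/errors"
)

// checkChildWritesAgainstAncestorReads rejects a child commit whose write
// (lock) spans overlap any ancestor's refresh (read) spans, since committing
// such a child would invalidate the ancestor's reads and leave it retrying
// forever. This is an illustration of the commit-time check described above,
// not the PR's actual code.
func checkChildWritesAgainstAncestorReads(
	childWriteSpans []roachpb.Span, ancestorReadSpans [][]roachpb.Span,
) error {
	for _, reads := range ancestorReadSpans {
		for _, read := range reads {
			for _, write := range childWriteSpans {
				if write.Overlaps(read) {
					return errors.Errorf("child write span %s overlaps ancestor read span %s", write, read)
				}
			}
		}
	}
	return nil
}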


pkg/storage/pebble_mvcc_scanner.go, line 92 at r28 (raw file):

Previously, sumeerbhola wrote…

Do we only need changes to pebbleMVCCScanner in the MVCC path since the only special handling for child txns is the write => read data transfer from parent to child?
I keep forgetting that even MVCCGet uses pebbleMVCCScanner. The code comment for this struct is both dated and insufficient. Could you add another sentence saying that pebbleMVCCScanner must be used for all Gets and Scans since it correctly handles child transactions.

Will do.


pkg/storage/pebble_mvcc_scanner.go, line 373 at r28 (raw file):
I'll comment ownsIntent() to better explain what's going on. The idea is that the descendant views the parent's writes as though it were the ancestor, as of the ancestor's snapshot at the time the child was created.

And this works because of the "No concurrent operations may be performed on the parent transaction"?

Indeed.
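
For posterity, the shape of the check is roughly the following; the Parent chain on the scanner's txn metadata is an assumption made for the sketch, and the real ownsIntent in pebble_mvcc_scanner.go may differ:

package example

import "github.com/cockroachdb/cockroach/pkg/storage/enginepb"

// txnMetaWithParent is a stand-in for the scanner's transaction metadata; the
// Parent chain is an assumption based on this PR, not an existing field.
type txnMetaWithParent struct {
	enginepb.TxnMeta
	Parent *txnMetaWithParent
}

// ownsIntent sketches the idea above: an intent counts as "our own" if it was
// written by the scanning transaction or by any of its ancestors, whose writes
// the descendant reads as of the snapshot taken when the child was created
// (safe because no concurrent operations run on the parent).
func ownsIntent(scanner *txnMetaWithParent, intentTxn *enginepb.TxnMeta) bool {
	for cur := scanner; cur != nil; cur = cur.Parent {
		if cur.ID == intentTxn.ID {
			return true
		}
	}
	return false
}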

Member

@nvanbenschoten nvanbenschoten left a comment

The approach looks good to me now! Thanks for iteratively refining this.

Reviewed 2 of 32 files at r21, 6 of 6 files at r22, 3 of 7 files at r23, 3 of 4 files at r24, 4 of 4 files at r25, 6 of 7 files at r26, 5 of 6 files at r27, 2 of 2 files at r28.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @nvanbenschoten, and @sumeerbhola)


pkg/kv/txn.go, line 1324 at r3 (raw file):

Previously, ajwerner wrote…

ManualRestart I think makes sense. It felt like the SQL layer was dealing with this properly but I may just be misunderstanding how this should work.

Did you get a chance to confirm that the SQL layer is handling this correctly?


pkg/kv/txn.go, line 1318 at r28 (raw file):

Previously, sumeerbhola wrote…

This behavior is not needed for correctness, yes?

I don't think so. It would be impossible to guarantee.


pkg/kv/txn.go, line 1323 at r28 (raw file):

Previously, sumeerbhola wrote…

These committed values are observable by other unrelated txns that are concurrent, yes?
So is the only thing that makes the parent-child relationship special the one-way information transfer from the parent to the child of writes by the parent being observed by the child? The other direction of information is just whether the child committed or not, which the parent would know about if it used a non-child transaction?

(I'm asking things that were probably documented in the RFC, but I've swapped out some of that state, and I am guessing some things may have changed since the RFC)

The other thing that makes the relationship special is its impact on deadlock detection between transactions. Parent transactions have an implicit dependency on their child transactions.


pkg/kv/txn_external_test.go, line 233 at r28 (raw file):

func TestChildTransactionMetadataPropagates(t *testing.T) {
	defer leaktest.AfterTest(t)()
	//defer log.Scope(t).Close(t)

Did you mean to uncomment this?


pkg/kv/kvclient/kvcoord/txn_coord_sender.go, line 159 at r28 (raw file):

both in the txnSpanRefresher

This part isn't true anymore, right?


pkg/kv/kvclient/kvcoord/txn_coord_sender.go, line 170 at r28 (raw file):

//     error.
//
// Note that if the commitCommitTimestamp is equal to the parent's current

childCommitTimestamp


pkg/kv/kvclient/kvcoord/txn_coord_sender.go, line 828 at r28 (raw file):

		tc.mu.txn.Update(br.Txn)
		if justCommitted {
			return roachpb.NewError(tc.pushParentUponCommitLocked(ctx))

It seems like this should live near maybeSleepForLinearizable, maybe in some method that handles post-commit actions. And commitReadOnlyTxnLocked should also call this post-commit hook.


pkg/kv/kvclient/kvcoord/txn_coord_sender_test.go, line 2415 at r28 (raw file):

	require.NotNil(t, child.Parent)
	require.Equal(t, parent.ID, child.Parent.ID)
	require.Equal(t, parent.Name+" child", child.Name)

nit: are we ok with the recursive " child"s on grandchildren?


pkg/kv/kvserver/concurrency/lock_table.go, line 275 at r28 (raw file):

	ts      hlc.Timestamp
	spans   *spanset.SpanSet
	readTS  hlc.Timestamp

I think these are leftovers from a bad rebase. They were removed a few weeks go.


pkg/kv/kvserver/concurrency/lock_table.go, line 453 at r28 (raw file):

// ReadWrite requests due to descendants seeking to acquire locks over ancestors
// will result in an assertion failure error. This is consistent with the
// rules of child transactions which state that descendants my not acquire

"may"


pkg/kv/kvserver/concurrency/lock_table_test.go, line 55 at r28 (raw file):

 Creates a Transaction.

new-child-txn parent=<name> txn=<name> ts=<int>[,<int>] epoch=<int> [seq=<int>]

Feel free to ignore, but would it be easier and result in less duplicate code to add an optional parent field to new-txn?


pkg/kv/kvserver/txnwait/queue_test.go, line 491 at r28 (raw file):

}

// noopT implements assert.TestingT so that its tests can be used without

:expandingbrain:


pkg/roachpb/data.proto, line 410 at r28 (raw file):

  // STAGING to COMMITTED. Because of this, the set will only ever contain
  // entries when the transaction is STAGING. For more, see txnCommitter. It
  // may be the case that InFlightWrites is populated in a transaction which

Mind changing "in a transaction which" to "in a transaction object which"? This took me a second to understand.


pkg/roachpb/data.proto, line 431 at r28 (raw file):

  // parent with this metadata.
  //
  // TODO(ajwerner): come back and describe why we want a TxnMeta here as it

Do you intend to address this TODO here?

Collaborator

@sumeerbhola sumeerbhola left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @nvanbenschoten, and @sumeerbhola)


pkg/kv/kvserver/concurrency/concurrency_manager.go, line 268 at r6 (raw file):

Previously, ajwerner wrote…

Before this most recent revision, we were calling AddDiscoveredLock. I don't think it's a correctness problem either way, right?

Yes, its harmless to call AddDiscoveredLock even if this request is never going to go and wait at that lock.


pkg/kv/kvserver/concurrency/lock_table.go, line 454 at r28 (raw file):

Previously, ajwerner wrote…

We detect the descendant writing over an ancestor's read at commit time by comparing the writes of the descendant to the read spans of its ancestors. All of this is just a protection against programming errors so I'd like it to be minimally disruptive.

Are you saying we don't actually need any changes to lock_table.go other than making the parent txn appear equivalent to the child wrt the child not waiting on the parent's locks? That is, is my supposition correct that the detection of bad behavior via parentIntentError is actually sufficient?

@ajwerner ajwerner force-pushed the ajwerner/child-transactions branch from 00101e6 to d260fce Compare February 2, 2021 06:02
Contributor Author

@ajwerner ajwerner left a comment

Seems like the sticking point here is going to be the lock table changes. I don't feel strongly either way. I think the error propagation changes in the PR are sort of ugly, but they do at least make some of the interactions more explicit. Until we offer lock strengths weaker than exclusive, I think just letting children pass through under locks held by ancestors works well. If the ancestor were to hold a shared lock, I think the interactions get less clear.
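
To spell out the semantics I mean by letting children "pass through" (a sketch only; the type, field names, and Parent chain are assumptions, and the real logic lives in tryActiveWait and HandleWriteIntentError):

package example

import (
	"github.com/cockroachdb/cockroach/pkg/util/uuid"
	"github.com/cockroachdb/errors"
)

// reqTxn is a pared-down stand-in for a request's transaction; the Parent
// chain is an assumption based on this PR.
type reqTxn struct {
	ID     uuid.UUID
	Parent *reqTxn
}

// decideOnAncestorLock sketches the semantics: a request from a descendant
// does not wait on a lock held by an ancestor; if the request is itself
// trying to lock that key, that is a programming error and is rejected.
// Unrelated holders are waited on as usual.
func decideOnAncestorLock(req *reqTxn, holder uuid.UUID, reqAcquiresLock bool) (bool, error) {
	for cur := req; cur != nil; cur = cur.Parent {
		if cur.ID == holder {
			if reqAcquiresLock {
				return false, errors.Errorf("descendant %s may not acquire a lock held by ancestor %s", req.ID, holder)
			}
			return false, nil // pass through under the ancestor's exclusive lock
		}
	}
	return true, nil // unrelated holder: wait as usual
}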

Dismissed @sumeerbhola from a discussion.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @sumeerbhola)


pkg/kv/txn.go, line 1324 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Did you get a chance to confirm that the SQL layer is handling this correctly?

Added a test that shows that it behaves properly.


pkg/kv/txn.go, line 1323 at r28 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

The other thing that makes the relationship special is its impact on deadlock detection between transactions. Parent transactions have an implicit dependency on their children transactions.

Right, that's sort of the most important part.


pkg/kv/txn_external_test.go, line 233 at r28 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Did you mean to uncomment this?

Yes, done.


pkg/kv/kvclient/kvcoord/txn_coord_sender.go, line 159 at r28 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
both in the txnSpanRefresher

This part isn't true anymore, right?

Fixed.


pkg/kv/kvclient/kvcoord/txn_coord_sender.go, line 170 at r28 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

childCommitTimestamp

Done.


pkg/kv/kvclient/kvcoord/txn_coord_sender.go, line 828 at r28 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

It seems like this should live near maybeSleepForLinearizable, maybe in some method that handles post-commit actions. And commitReadOnlyTxnLocked should also call this post-commit hook.

Done.


pkg/kv/kvclient/kvcoord/txn_coord_sender_test.go, line 2415 at r28 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: are we ok with the recursive " child"s on grandchildren?

I am. It's at least pretty clear. I don't know when we're going to have grandchildren.


pkg/kv/kvserver/concurrency/concurrency_manager.go, line 268 at r6 (raw file):

Previously, sumeerbhola wrote…

Yes, its harmless to call AddDiscoveredLock even if this request is never going to go and wait at that lock.

Reverted.


pkg/kv/kvserver/concurrency/lock_table.go, line 275 at r28 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I think these are leftovers from a bad rebase. They were removed a few weeks go.

Done.


pkg/kv/kvserver/concurrency/lock_table.go, line 453 at r28 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

"may"

Done.


pkg/kv/kvserver/concurrency/lock_table.go, line 454 at r28 (raw file):

Previously, sumeerbhola wrote…

Are you saying we don't actually need any changes to lock_table.go other than making the parent txn appear equivalent to the child wrt the child not waiting on the parent's locks? That is, is my supposition that the detection of bad behavior via parentIntentError, actually sufficient?

My original implementation was exactly what I think you're saying. It just made child transactions appear equivalent to the parent in the lock table and then any issues were caught in the HandleWriteIntentError. @nvanbenschoten felt that this was too implicit so I added the more explicit error checking here.

This is what that commit looked like (the testing was bogus but otherwise, I do think it was okay and simpler): 30037f1


pkg/kv/kvserver/concurrency/lock_table_test.go, line 55 at r28 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Feel free to ignore, but would it be easier and result in less duplicate code to add an optional parent field to new-txn?

Ack.


pkg/roachpb/data.proto, line 410 at r28 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Mind changing "in a transaction which" to "in a transaction object which"? This took me a second to understand.

Done.


pkg/roachpb/data.proto, line 431 at r28 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Do you intend to address this TODO here?

Removed. That was very old.


pkg/storage/pebble_mvcc_scanner.go, line 92 at r28 (raw file):

Previously, ajwerner wrote…

Will do.

Done.

@ajwerner ajwerner force-pushed the ajwerner/child-transactions branch 5 times, most recently from 0a017e4 to 00abaee Compare February 2, 2021 18:50
craig bot pushed a commit that referenced this pull request Feb 2, 2021
58979: schemachanger,*: introduce new schema changer r=lucy-zhang a=ajwerner

This PR introduces the new schema changer, for which a partially written RFC (including a "guide-level explanation") can be found in #59288. It contains a somewhat complete implementation for a subset of `ALTER TABLE ... ADD COLUMN`. Like the existing implementation, these schema changes are planned and partially executed in the user transaction, but most of the work happens in an asynchronous job that runs after the transaction has committed. We use the index backfiller to build new columns, as described in #47989.

See the RFC and individual commits for details.

Missing things that will be in later PRs:

- Making execution changes to support concurrent writes for columns added using the index backfiller. We currently don't have a way to write column values which are not part of the primary index. See #59149. (#58990 takes care of the backfill part).
- Ensuring compatible coexistence with the old schema changer; in particular, polling for queued mutations to finish before any schema changes can start, using child transactions within the user transaction (#56588). The new schema changer assumes it has exclusive rights to perform schema changes on the table, which anticipates the transactional schema changes of the future.
- Reverting schema changes when failed or canceled.
- Properly checkpointing and reading job progress for the backfill. Currently there's a placeholder implementation that doesn't interact with the job at all.

Co-authored-by: Andrew Werner <[email protected]>
Member

@nvanbenschoten nvanbenschoten left a comment

Reviewed 1 of 31 files at r29, 6 of 6 files at r30, 3 of 7 files at r31, 3 of 4 files at r32, 4 of 4 files at r33, 7 of 7 files at r34, 5 of 6 files at r35, 2 of 2 files at r36.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @sumeerbhola)


pkg/kv/kvserver/concurrency/lock_table.go, line 454 at r28 (raw file):

Previously, ajwerner wrote…

My original implementation was exactly what I think you're saying. It just made child transactions appear equivalent to the parent in the lock table and then any issues were caught in the HandleWriteIntentError. @nvanbenschoten felt that this was too implicit so I added the more explicit error checking here.

This is what that commit looked like (the testing was bogus but otherwise, I do think it was okay and simpler): 30037f1

The hazard I was concerned with was allowing a child txn to acquire a replicated lock on top of its parent's unreplicated lock, the (3) | u | r case from that ref. The semantics of such a lock acquisition are not clear to me. Do we replace the parent's unreplicated lock with the child's replicated lock? What then happens when the child commits? Does the parent lose its lock? It's not clear how the lockHolderInfo structure can even represent such a lock state.

Maybe I missed an implied "we don't care about corrupting a lockHolderInfo because this is an error case anyway and no one should be able to commit".

I agree that we should get to a place that we're all happy with here, as @sumeerbhola's point about this needing to interact well with optimistic locking is very valid.

Collaborator

@sumeerbhola sumeerbhola left a comment

Reviewed 1 of 4 files at r33, 6 of 7 files at r34.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @nvanbenschoten, and @sumeerbhola)


pkg/kv/kvserver/concurrency/concurrency_control.go, line 464 at r36 (raw file):

	// The returned error indicates illegal operations by descendant transactions
	// interacting with their ancestors' locks. It is illegal for a descendant to
	// try to acquire a lock that overlaps with any of its descendants.

I think the "it is illegal" statement should say that the descendant should not have a SpanReadWrite span that overlaps with an Exclusive lock held by an ancestor.


pkg/kv/kvserver/concurrency/concurrency_control.go, line 601 at r36 (raw file):

	// CurState returns the latest waiting state. It returns an error if the
	// lock which was encountered is not allowed as it is due to an ancestor
	// transaction.

(nit: to be more consistent with the earlier comment)

... was encountered is illegal for this descendant transaction to conflict with, since it is held by an ancestor transaction.


pkg/kv/kvserver/concurrency/concurrency_manager.go, line 168 at r36 (raw file):

		// Scan for conflicting locks.
		log.Event(ctx, "scanning lock table for conflicting locks")
		g.ltg, _ = m.lt.ScanAndEnqueue(g.Req, g.ltg)

why is this ignoring the error return value?


pkg/kv/kvserver/concurrency/concurrency_manager.go, line 274 at r36 (raw file):

			wait = true
		}
		for cur := g.Req.Txn; cur != nil && parentIntentError == nil; cur = cur.Parent {

Can you add a comment before this for loop? Something like:
// The lock table detects an illegal conflict with the parent, in a manner similar to this parentIntentError computation. So usually going back and waiting at the lock table after calling AddDiscoveredLock would be good enough, and we would not need to do this computation here. However, the lockTable may not be enabled, which would prevent that detection. So we make sure by computing parentIntentError.


pkg/kv/kvserver/concurrency/lock_table.go, line 454 at r28 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

The hazard I was concerned with was allowing a child txn to acquire a replicated lock on top of their parent's unreplicated lock, the (3) | u | r case from that ref. The semantics of such a lock acquisition are not clear to me. Do we replace the parent's unreplicated lock with the child's replicated lock? What then happens when the child commits? Does the parent lose its lock? It's not clear how the lockHolderInfo structure can even represent such a lock state.

Maybe I missed an implied "we don't care about corrupting a lockHolderInfo because this is an error case anyway and no one should be able to commit".

I agree that we should get to a place that we're all happy with here, as @sumeerbhola's point about this needing to interact well with optimistic locking is very valid.

I see. In light of this I am fine with the change here. It won't complicate optimistic locking, I think. An optimistically evaluated request by a descendant will discover an error due to an unreplicated lock by the ancestor, after it evaluates (when it goes to check the lock table).


pkg/kv/kvserver/concurrency/lock_table.go, line 446 at r36 (raw file):

}

// isAncestorTxn returns true if txn corresponds any of the ancestors of the

... if txn is an ancestor of the txn in g.


pkg/kv/kvserver/concurrency/lock_table.go, line 456 at r36 (raw file):

// Note that parent transactions may have replicated exclusive locks on keys
// due to writes (intents) which are not discovered in the lock table. That is
// fine and dealt with in HandleWriteIntentError. This is not handled perfectly

I worry about this "This is not handled perfectly ..." stuff.
Having a descendant's unreplicated lock over an ancestor's replicated lock seems as confusing as the hazard @nvanbenschoten was mentioning, where the ancestor had the unreplicated lock and the descendant had the replicated one. Given we control what these descendants are doing, can we just ensure (while the only locks are Exclusive locks -- we can revisit when adding support for other strengths) that descendants don't unnecessarily scan with locks unless they plan to write to that span? Then we can declare this case illegal too.

Hmm, maybe I'm misreading this comment. The code in tryActiveWait is doing the straightforward thing and not caring whether the currently held lock is replicated or unreplicated.


pkg/storage/pebble_mvcc_scanner.go, line 441 at r36 (raw file):

		if seqNum >= p.meta.Txn.Sequence &&
			!enginepb.TxnSeqIsIgnored(p.meta.Txn.Sequence, ignoredSeqNums) {
			// 9. We're reading our own txn's intent at an equal or higher sequence.

would be nice to update all these comments that say "our own".

@tbg tbg added the X-noremind Bots won't notify about PRs with X-noremind label May 6, 2021
@ajwerner ajwerner force-pushed the ajwerner/child-transactions branch from 00abaee to 7f1efd8 Compare March 27, 2023 01:22
@ajwerner ajwerner requested a review from a team March 27, 2023 01:22
@ajwerner ajwerner requested review from a team as code owners March 27, 2023 01:22
@ajwerner ajwerner requested a review from RaduBerinde March 27, 2023 01:22
Adds an error to deal with the case where a child detects that an ancestor
is aborted. We can then turn this into a restart error for the appropriate
ancestor and abort all of the other relevant descendants of that ancestor.

Release note: None
This commit introduces the interface to child transactions. The method,
`ChildTxn()`, is symmetric with `Txn()` in that it takes a function which
will be handed the child transaction. This method additionally will deal
with translating certain errors for the parent and with pushing the parent
as makes sense.

Release note: None
This commit adds logic to detect deadlocks involving child transactions.
Child transactions are interpreted as being implicitly pushed by their
parents.

Release note: None
This commit augments the MVCC layer to understand child transactions. A child
transaction can read data written by the parent as though it were the parent.

Release note: None
This commit adds support for child transactions to the lockTable and
deals with unexpected WriteIntentErrors due to write-write conflicts
between the parent and child transaction. It also disallows attempts
by child transactions to acquire exclusive locks over unreplicated
exclusive locks of their ancestors.

Note that because the lock table does not have a complete view of
locks, some locks will be discovered during evaluation. That case
is handled during HandleWriteIntentError.

Release note: None
This commit adds logic to ensure that the child transaction pushes its
ancestors to at least its commit timestamp. It also ensures that it can
detect illegal writes performed by a child which would invalidate the
reads of an ancestor and result in an infinite retry loop.

Release note: None
@ajwerner ajwerner force-pushed the ajwerner/child-transactions branch from 7f1efd8 to fd39881 Compare March 27, 2023 02:49
Labels
X-noremind Bots won't notify about PRs with X-noremind
Projects
None yet
Development

Successfully merging this pull request may close these issues.

kv: child transactions
5 participants