Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kv: child transactions #54633

Closed
ajwerner opened this issue Sep 21, 2020 · 9 comments · May be fixed by #56588
Closed

kv: child transactions #54633

ajwerner opened this issue Sep 21, 2020 · 9 comments · May be fixed by #56588
Labels
A-kv-transactions Relating to MVCC and the transactional model. A-schema-transactional C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) no-issue-activity T-kv KV Team X-stale

Comments

@ajwerner
Copy link
Contributor

ajwerner commented Sep 21, 2020

Is your feature request related to a problem? Please describe.

KV transactions provide serializable isolation for a series of KV operations. Importantly, the KV layer also provides deadlock detection to ensure that the system remains live. This deadlock detection system makes an assumption that each operation transaction corresponds to a single thread of execution which dependent on the liveness of another transaction to make progress. We've observed a variety of deadlocks related to descriptor lease acquisition and query planning which invalidated that assumption (#24885, #46224, #46447). The solutions (#46170, #46384, #46588) to these deadlocks was to run the "child" operation at PRIORITY HIGH in hopes that the intent on which the operations blocked were not run at that priority. Unfortunately this was really just a band-aid and if users did run their schema change operations at PRIORITY HIGH we'd still have those deadlocks.

Another related problem with the current transactional model is that it does not provide a mechanism to run a secondary transaction on the thread of execution of another transaction which might encounter intents (of the parent transaction or of another transaction which might be blocked on the parents). This is a critical component of the proposed scheme for providing transactional schema changes. It is important to enable changes to the database for the purpose of coordinating with concurrent transactions but allowing the parent to avoid interacting with changing keys in order to retain its serializable view of the database.

Describe the solution you'd like

The proposal is that we create a new type of transaction, a child transaction to join the existing two types of transactions, root transactions and leaf transactions. A child transaction is an independent read-write transaction that refers to a parent transaction at a given sequence and snapshot. For the purpose of deadlock detection, the parent will be seen as being blocked on the child (more details below). The child transaction will read written values of the parent as though it were the parent transaction.

Implementation details

The child transaction will have a new field in its enginepb.TxnMeta. Child transactions may create their own child transactions. In that case, the TxnMeta.Parent.Parent will be not nil. This will effectively form a linked-list of parents. We do not intend for the depth to get very deep.

// Parent, if non-nil, represents the parent of this transaction ...
Parent *TxnMeta

When the MVCC code encounters an intent it will check if it refers to a parent (recursively).

Deadlock breaking during lease acquisition

A goal of this work should be to cleanup the hacks introduced in #46170 and #46384 which use priority high to push a deadlock caused by lease acquisition blocking on an earlier epoch or rolled back sequence number. Unfortunately, this isn't that simple! The problem is that today we use a single-flight group to coalesce lease acquisition. We have no way to represent the dependency on that singleflight in our distributed deadlock detection.

Jira issue: CRDB-3733

@ajwerner ajwerner added A-kv-transactions Relating to MVCC and the transactional model. A-schema-transactional labels Sep 21, 2020
@blathers-crl
Copy link

blathers-crl bot commented Sep 21, 2020

Hi @ajwerner, I've guessed the C-ategory of your issue and suitably labeled it. Please re-label if inaccurate.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@blathers-crl blathers-crl bot added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Sep 21, 2020
@nvanbenschoten
Copy link
Member

The proposal is that we create a new type of transaction, a child transaction to join the existing two types of transactions, root transactions and leaf transactions. A child transaction is an independent read-write transaction that refers to a parent transaction at a given sequence and snapshot. For the purpose of deadlock detection, the parent will be seen as being blocked on the child (more details below). The child transaction will read written values of the parent as though it were the parent transaction.

Will we ever want to run DistSQL flows for child transactions? Everything else here lines up with my expectations except for the extension to the TxnType enumeration. Is there some reason why we need to distinguish between RootTxn and ChildTxn (and by extension, disallow/complicate "LeafChildTxn")? Even if so, I'm not sure the TxnType extension is appropriate here, as my understanding of the type is that it more closely expresses "transaction coordinator instance type" and not "transaction type".

When the MVCC code encounters an intent it will check if it refers to a parent (recursively).

To confirm, the TxnMeta of a grandchild txn will ship around and persist the full len=2 linked-list of its transaction hierarchy, right? That seems fine to me – we don't expect this hierarchy to get very deep. I just want to make sure we have the right information when we need it and don't need to recursively QueryTxn in the txnWaitQueue or anything.

@ajwerner
Copy link
Contributor Author

Ack on the TxnType enum, it felt right but I don't have strong feelings. Will just remove.

The proposal is that we create a new type of transaction, a child transaction to join the existing two types of transactions, root transactions and leaf transactions. A child transaction is an independent read-write transaction that refers to a parent transaction at a given sequence and snapshot. For the purpose of deadlock detection, the parent will be seen as being blocked on the child (more details below). The child transaction will read written values of the parent as though it were the parent transaction.

Will we ever want to run DistSQL flows for child transactions? Everything else here lines up with my expectations except for the extension to the TxnType enumeration. Is there some reason why we need to distinguish between RootTxn and ChildTxn (and by extension, disallow/complicate "LeafChildTxn")? Even if so, I'm not sure the TxnType extension is appropriate here, as my understanding of the type is that it more closely expresses "transaction coordinator instance type" and not "transaction type".

Fair, it just felt right, I don't have strong feelings. I'll just remove that whole thing

When the MVCC code encounters an intent it will check if it refers to a parent (recursively).

To confirm, the TxnMeta of a grandchild txn will ship around and persist the full len=2 linked-list of its transaction hierarchy, right? That seems fine to me – we don't expect this hierarchy to get very deep. I just want to make sure we have the right information when we need it and don't need to recursively QueryTxn in the txnWaitQueue or anything.

Correct, that's the idea. I'll make it more explicit.

@ajwerner
Copy link
Contributor Author

@nvanbenschoten added some more commentary on a problem I'm seeing to remove the PRIORITY HIGH hack.

@nvanbenschoten
Copy link
Member

A goal of this work should be to cleanup the hacks introduced in #46170 and #46384 which use priority high to push a deadlock caused by lease acquisition blocking on an earlier epoch or rolled back sequence number. Unfortunately, this isn't that simple! The problem is that today we use a single-flight group to coalesce lease acquisition. We have no way to represent the dependency on that singleflight in our distributed deadlock detection.

Yeah, this seems tricky. I don't see a great way that we can coalesce all lease acquisitions using child transactions without introducing some form of multiple inheritance, which we should try to avoid at all costs. Do we need to coalesce lease acquisition in all cases? Can we determine when the transaction inheritance is needed to avoid deadlocks and avoid coalescing in those cases, using a child transaction instead?

One thing that's not clear to me is whether the child transactions need to be able to read the provisional values of the parent transactions in this case, or whether they should just be able to push them. Are there two different use cases here? Can we get away with only supporting one or the other?

@ajwerner
Copy link
Contributor Author

Can we determine when the transaction inheritance is needed to avoid deadlocks and avoid coalescing in those cases, using a child transaction instead?

It's not obvious to me how to do that. In general I think this issue is thorny. One solution would be to continue to use PRIORITY HIGH in lease acquistion and to avoid coalescing lease resolution on behalf of PRIORITY HIGH user transactions or something like that (this is intentionally vague).

One thing that's not clear to me is whether the child transactions need to be able to read the provisional values of the parent transactions in this case, or whether they should just be able to push them. Are there two different use cases here? Can we get away with only supporting one or the other?

Yeah, totally agree. Reading the provisional values is really needed for the transactional schema changes. For the leasing it's not even desirable. I was thinking that supporting it but being able to set the sequence number was a generalization.

@nvanbenschoten
Copy link
Member

It turns out that making distributed deadlock detection work correctly with child transactions is a little more involved than I thought it was going to be. It looks like we're going to need to have a child transaction watch all of its ancestor's records as well as its own when in the txnWaitQueue. Additionally, we're going to have have to augment the WaitingTxns slice attached to a QueryTxnResponse with the ancestors of each waiting transaction. I think the place to do that will be in pendingTxn.getDependentsSet.

Here's an example that demonstrates why we need both of these changes:

//   - txnA launches txnA2 as a child transaction
//   - txnA2 enters txnB's txnWaitQueue during a PushTxn request (MaybeWaitForPush)
//   - txnB launches txnB2 as a child transaction
//   - txnB2 enters txnA's txnWaitQueue during a PushTxn request (MaybeWaitForPush)
//
//          .-------------------------------------------------------.
//          |                                                       |
//          v                                                       |
//    [txnA record]     [txnA2 record] --> [txnB record]     [txnB2 record]
//     deps:                                deps:             
//     - txnB2                               - txnA2            
//
//   - txnA2 queries its own txnWaitQueue using a QueryTxn request (MaybeWaitForQuery)
//   - txnB2 queries its own txnWaitQueue using a QueryTxn request (MaybeWaitForQuery)
//
//          .-------------------------------------------------------.
//          | ..................................................    |
//          | .                ...................             .    |
//          v .                v                 .             v    |
//    [txnA record]     [txnA2 record] --> [txnB record]     [txnB2 record]
//     deps:                                deps:             
//     - txnB2                               - txnA2            
//
//   - txnA2 finds no dependencies
//   - txnB2 finds no dependencies
//   - if this was it, we wouldn't be able to detect deadlocks of this form
//
//   - but this isn't all child transaction do. They also need to query their ancestor's records
//   - txnA2 queries txnA's txnWaitQueue using a QueryTxn request (MaybeWaitForQuery)
//   - txnB2 queries txnB's txnWaitQueue using a QueryTxn request (MaybeWaitForQuery)
//
//          .-------------------------------------------------------.
//          | .......................................               |
//          | .  .................................  .               |
//          v .  v                               .  v               |
//    [txnA record]     [txnA2 record] --> [txnB record]     [txnB2 record]
//     deps:                                deps:             
//     - txnB2                               - txnA2      
//
//   - txnA2 finds that txnB2 AND txnB (its parent!) are dependents. It transfers these dependency to txnB
//   - txnB2 finds that txnA2 AND txnA (its parent!) are dependents. It transfers these dependency to txnA
//
//          .-------------------------------------------------------.
//          |                                                       |
//          v                                                       |
//    [txnA record]     [txnA2 record] --> [txnB record]     [txnB2 record]
//     deps:                                deps:             
//     - txnB2                               - txnA2            
//     - txnA                                - txnB            
//     - txnA2                               - txnB2            
//                                         
//   - txnA2 notices that txnB is a transitive dependency of itself. This indicates
//     a cycle in the global wait-for graph. txnB is aborted, breaking the cycle
//     and the deadlock.
//   
//   OR 
//                                         
//   - txnB2 notices that txnA is a transitive dependency of itself. This indicates
//     a cycle in the global wait-for graph. txnA is aborted, breaking the cycle
//     and the deadlock.

Interestingly, this means that we do not need to persist a transaction's ancestor list inside its intents, so I don't even think it needs to live inside of TxnMeta. Instead, we can just store Parent *TxnMeta in the Transaction proto carried on requests and keep it off disk.

@ajwerner
Copy link
Contributor Author

ajwerner commented Oct 2, 2020

Thanks for thinking through this!

@github-actions
Copy link

github-actions bot commented Sep 6, 2023

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
10 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Sep 19, 2023
@github-project-automation github-project-automation bot moved this to Closed in KV Aug 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-kv-transactions Relating to MVCC and the transactional model. A-schema-transactional C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) no-issue-activity T-kv KV Team X-stale
Projects
No open projects
Archived in project
Development

Successfully merging a pull request may close this issue.

3 participants