storage: introduce an "ignore list" for seqnums in MVCC reads #41612
Comments
Discussed with @petermattis: since @itsbilal is currently rewriting the MVCC code, he's the best person to look at this (or, conversely, it would be disruptive to his work for anyone else to look at it).
The ideal timeline would be to enable early testing of these semantics before the end of November. Please reach out to @andreimatei or myself to discuss milestones.
Note that the seqnums being talked about here are
Good point. Bilal and I were just understanding that by looking at the code.
Discussing implementation with @nvanbenschoten. The code currently uses First idea was to apply Nathan suggests iterated binary search:
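The snippet that followed this comment is not preserved here. As a rough illustration of what a binary search over an ignore list could look like, here is a minimal Go sketch. It assumes the ignore list is a slice of closed seqnum ranges, sorted by start and non-overlapping; the names `IgnoredSeqNumRange`, `seqIsIgnored` and `newestNonIgnored` are hypothetical and not taken from the CockroachDB code.

```go
package main

import (
	"fmt"
	"sort"
)

// IgnoredSeqNumRange is a hypothetical closed range [Start, End] of rolled-back
// sequence numbers. The slice passed around is assumed sorted and non-overlapping.
type IgnoredSeqNumRange struct {
	Start, End int32
}

// seqIsIgnored binary-searches for the first range whose End is >= seq and
// then checks whether that range actually contains seq.
func seqIsIgnored(seq int32, ignored []IgnoredSeqNumRange) bool {
	i := sort.Search(len(ignored), func(i int) bool {
		return ignored[i].End >= seq
	})
	return i < len(ignored) && ignored[i].Start <= seq
}

// newestNonIgnored walks a list of written seqnums (ascending) from newest to
// oldest and repeats the binary search for each one until it finds a seqnum
// that has not been rolled back.
func newestNonIgnored(seqs []int32, ignored []IgnoredSeqNumRange) (int32, bool) {
	for i := len(seqs) - 1; i >= 0; i-- {
		if !seqIsIgnored(seqs[i], ignored) {
			return seqs[i], true
		}
	}
	return 0, false
}

func main() {
	ignored := []IgnoredSeqNumRange{{Start: 3, End: 5}, {Start: 8, End: 8}}
	seq, ok := newestNonIgnored([]int32{1, 2, 4, 8}, ignored)
	fmt.Println(seq, ok)
}
```

With an empty ignore list, `sort.Search` returns 0 immediately and the check is a single comparison, which fits the expectation that most scans carry no ignored seqnums.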
Discussion with Nathan and @petermattis:
Peter's answer:
My opinion (knz): even if there's just one entry, that's 2 heap allocations instead of just 1, for every single intent in the system.
If you want to reduce heap allocations in the C++ proto code, I believe the right answer is to use arenas. I'll be moderately surprised if this turns out to be worthwhile.
Investigating this further: I'm hitting a snag (perhaps two). First finding is that there are two copies of the TxnMeta object I should probably care about:
Now there are two problems.
This matches my understanding.
This slice will be empty for most scans, right? I'd hope the overhead of passing is non-existent if the slice is empty. Note that passing a read-only slice of primitive values to C++ has near zero overhead as the C++ side can read the Go memory. The challenge is if the slice has objects which contain pointers.
Is
This is correct. The txn in the
+1
Yes, and note that the ignore list will be provided in
Is that up to date with the latest txn record at that point?
Probably not.
If the latest list of ignored seqnums is not available, or not up-to-date, this function may erroneously preserve a written value that should have been rolled back via a savepoint rollback. In other words, intent resolution needs to operate on the latest list of ignored seqnums. I do not know (yet) how to guarantee this.
Thanks, that was helpful.
I see this to be intuitively true, but there's some plumbing I need to figure out.
What I'm going to do now is cobble something together that will look ugly and misguided to your educated eyes, and then you're going to tell me how to fix it.
I'm not sure if there are any examples of this, but there shouldn't be a problem with doing so. You basically will pass a pointer and a length.
Ha! Perfect! That works for me.
done #42152
(@nvanbenschoten can you check the following) Coming back to the work specification: the RFC also calls for detaching the read sequence from the write sequence. Initially, I thought that this requirement was only a matter of the TxnCoordSender populating the sequence field differently for read and write requests. However, that was mistaken. Both seqnums are needed for every write request and handled in MVCC:
Additionally, to ensure replayability (idempotence), the following invariant must hold: every write request at a given write seqnum must always be issued with the same read seqnum. This won't be a problem in TxnCoordSender, but it must become part of the API spec.
Is considering the CPut's read sequence to be 1 less than its write sequence insufficient? This should be what we do today. I can't imagine a case where we'd want to do anything other than this.
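To make the convention in this comment concrete, here is a small Go sketch of a write that reads at one less than its own write seqnum. The types and function names (`SeqNum`, `historyEntry`, `readSeqForWrite`, `valueSeenByWrite`) are hypothetical and only illustrate the idea; they are not the actual MVCC code.

```go
package main

import "fmt"

// SeqNum is a hypothetical transaction sequence number type.
type SeqNum int32

// historyEntry is a hypothetical record of a value the transaction wrote to
// a key at a given sequence number.
type historyEntry struct {
	Seq   SeqNum
	Value string
}

// readSeqForWrite derives the read seqnum from the write seqnum: always one
// less. Because it is a pure function of writeSeq, a replayed write at the
// same write seqnum necessarily reads at the same read seqnum.
func readSeqForWrite(writeSeq SeqNum) SeqNum {
	return writeSeq - 1
}

// valueSeenByWrite returns the newest value in history (ascending by Seq)
// written at or before the derived read seqnum.
func valueSeenByWrite(history []historyEntry, writeSeq SeqNum) (string, bool) {
	readSeq := readSeqForWrite(writeSeq)
	for i := len(history) - 1; i >= 0; i-- {
		if history[i].Seq <= readSeq {
			return history[i].Value, true
		}
	}
	return "", false
}

func main() {
	// The write at seq 2 has already been applied once; a replay of it must
	// not observe its own value, only the one written at seq 1.
	history := []historyEntry{{Seq: 1, Value: "a"}, {Seq: 2, Value: "b"}}
	v, ok := valueSeenByWrite(history, 2)
	fmt.Println(v, ok)
}
```

Under this assumption the idempotence invariant holds by construction, since the read seqnum is never supplied independently of the write seqnum.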
Why is this? I'd expect it to be the previous value as per that value's write seq num (as it is today). As is, I'm still not convinced that "the RFC also calls for detaching the read sequence from the write sequence" is true.
Let's hope you're right. I would also prefer this to be true as it is simpler.
Required for SQL savepoints, as discussed in #41569.
To support partial txn rollbacks (savepoint rollbacks) we need to skip over values written in the past that are associated with rolled back seqnums.
Today, the MVCC read logic is already equipped to skip all values written after a specific seqnum stored in the meta txn proto (Sequence).
We want to extend the read logic to also skip over values written at seqnums that are part of an "ignore list":
This logic should be available for both the rocksdb and pebble engines.
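The intended read behavior can be sketched in Go as follows. This is a minimal illustration under assumptions, not the engine code: `IgnoredSeqNumRange`, `SequencedValue` and `visibleValue` are hypothetical names, the ignore list is modeled as closed seqnum ranges, and the history of a key's values written by the transaction is assumed to be sorted by ascending seqnum.

```go
package main

import "fmt"

// IgnoredSeqNumRange is a hypothetical closed range [Start, End] of sequence
// numbers rolled back by a savepoint rollback.
type IgnoredSeqNumRange struct {
	Start, End int32
}

// SequencedValue is a hypothetical entry recording the value a transaction
// wrote to a key at a given sequence number.
type SequencedValue struct {
	Seq   int32
	Value string
}

// seqIsIgnored reports whether seq falls inside any ignored range.
func seqIsIgnored(seq int32, ignored []IgnoredSeqNumRange) bool {
	for _, r := range ignored {
		if seq >= r.Start && seq <= r.End {
			return true
		}
	}
	return false
}

// visibleValue returns the newest value that the reader may observe.
// history must be sorted by ascending Seq.
func visibleValue(history []SequencedValue, readSeq int32, ignored []IgnoredSeqNumRange) (string, bool) {
	for i := len(history) - 1; i >= 0; i-- {
		e := history[i]
		if e.Seq > readSeq {
			continue // existing rule: skip values written after the read seqnum
		}
		if seqIsIgnored(e.Seq, ignored) {
			continue // new rule: skip values written at rolled-back seqnums
		}
		return e.Value, true
	}
	return "", false
}

func main() {
	history := []SequencedValue{{1, "a"}, {3, "b"}, {5, "c"}}
	// Seqnums 3 through 5 were rolled back, so a read at seq 6 falls back to
	// the value written at seq 1.
	v, ok := visibleValue(history, 6, []IgnoredSeqNumRange{{Start: 3, End: 5}})
	fmt.Println(v, ok)
}
```

The linear scan over the ignore list keeps the sketch short; it does not say anything about how either engine should implement the lookup.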