-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
storage: write at provisional commit ts, not orig ts #32487
Conversation
The big downside of this change is that we really want TBIs to be useful for CDC in 2.1. CDC in 2.2 will make more use of range feeds and less use of polling so TBIs will be less necessary. But this change is definitely too scary to backport to 2.1. |
Are we exploring ripping out TBIs and pushing logic to C++? That sounded like the least scary venue. |
My gut says that that won't be comparable to the benefit you get when you only need to scan the memtable and some L0 files. There is a fundamental limit to how fast scanning n keys can be, and TBIs are currently our only way of reducing n rather than the constant factor. Perhaps we'd get lucky and things would be fast enough though. It's definitely worth an experiment. |
Oh, this is subtle. The CDC poller doesn't actually care about uncommitted intents, and it will see the committed value at its new timestamp when and if it is able to commit. But the tight loop starves out the transaction so it can't commit. How does this change solve the problem? We write above the poller's read, but what stops the poller from advancing the timestamp cache again? Is it just that the poller stops advancing while there is a pending intent? We have some tests that use SNAPSHOT isolation; these tests will (hopefully) become flaky with a change like this. We should get rid of tests that use SNAPSHOT before we make changes that will break them.
We can't break SNAPSHOT in 2.1.x; clusters being upgraded from 2.0 may still perform snapshot-isolation transactions in a mixed-version configuration. If reads still use OrigTimestamp but writes use Timestamp, how does that affect commands that do both like ConditionalPut? |
I think it's actually possible that the CDC poller misses the value entirely. Say the poller reads at (ts1, ts2], and a concurrent transaction writes at ts1 (its orig timestamp) while getting pushed to ts3. If that transaction takes a while to send its commit RPC, the CDC poller might read (ts2, ts3] before the intent at ts3 gets resolved. And then it will be reading (ts3, ts4] by the time the intent gets resolved and never see the value committed at ts3. I could be off base here, but I think the intuition is that an intent is a lock on [tintent, ∞). You can't say that you've read everything for t ≥ tintent until that intent is resolved because the intent could commit at any such t.
Right. If the poller's ExportRequest catches a WriteIntentError, it won't update the timestamp cache, and so the intent can get bumped.
Good point.
I don't know. I suspect it works fine because all reads get refreshed to the commit timestamp anyway, so we're just doing extra work. But I need to think about this more. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM mod the fact that this really needs tests for all the changes in applyTimestampCache
.
I think we should couple this patch with a change to PushTxn
which makes a push succeed if it's trying to push the txn at a ts <= its current timestamp (i.e. the timestamps known by the txn record). Like, if txn1 wrote at ts 5, and then it got bumped to ts 10, and now we have a read at ts 7, that read should succeed.
Otherwise, with this change, we can have situations where, because a txn now is leaving intents at different timestamps, whether or not two txn conflict can depend on which intent from the first the second one runs into.
In other words, I'm not so sure about the "the new behavior is more
intuitive than the old behavior" claim; the old one seems more intuitive to me. But otherwise I do think I like this change - there are benefits to writing at timestamps as high as possible.
// but to avoid being starved by short-lived transactions on that key which | ||
// would otherwise not have to go through conflict resolution with txn1. | ||
// | ||
// Again, keep in mind that, when the transaction commits, all the intents are |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this part of the comment should be kept somewhere. The comment on the timestamp
field should be expended to explain how its used for writes, and how a txn might be laying down intents at different timestamps as it goes along.
pkg/storage/replica.go
Outdated
@@ -2746,6 +2746,7 @@ func (r *Replica) applyTimestampCache( | |||
txn := ba.Txn.Clone() | |||
bumped = txn.Timestamp.Forward(nextTS) || bumped | |||
ba.Txn = &txn | |||
ba.Timestamp = txn.Timestamp |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please add a comment at this change and below, reiterating that we want to write at the bumped ts.
Oh I wanted to say something else too - let's also couple this patch with deleting every trace of |
And backports aren't allowed to introduce cluster versions unless it's a really exceptional case. |
We were just discussing offline how you don't really need a new cluster version, you could just gate on the |
Snapshot isolation was disabled in v2.1, so it is safe to remove all code paths that handle snapshot isolation in v2.2. This is a necessary precursor to cockroachdb#32487. Release note: None
Snapshot isolation was disabled in v2.1, so it is safe to remove all code paths that handle snapshot isolation in v2.2. Note that the change to the MVCCMetadata checksum in pkg/storage/below_raft_protos_test.go is safe, because the actual encoding of the protobuf is not affected by this change. What is affected is NewPopulatedMVCCMetadata, which no longer returns a TxnMeta with a non-zero IsolationType, since the IsolationType field has been removed. This is a necessary precursor to cockroachdb#32487. Release note: None
32860: kv,storage,sql: purge snapshot isolation r=andreimatei,nvanbenschoten,bdarnell a=benesch This touches a lot so I cast a wide net for reviewers. Feel free to remove yourself! --- Snapshot isolation was disabled in v2.1, so it is safe to remove all code paths that handle snapshot isolation in v2.2. This is a necessary precursor to #32487. Release note: None Co-authored-by: Nikhil Benesch <nikhil.benesch@gmail.com>
a801911
to
26ea8f2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, PTAL. I added a test and updated some more comments.
LGTM mod the fact that this really needs tests for all the changes in applyTimestampCache.
Done.
Oh I wanted to say something else too - let's also couple this patch with deleting every trace of snapshot txns from our code. You can do it in a different commit if you want to backport the first (although the backport I guess would need to be put behind a cluster version per Ben's comments)
Done in #32860.
I think we should couple this patch with a change to PushTxn which makes a push succeed if it's trying to push the txn at a ts <= its current timestamp (i.e. the timestamps known by the txn record). Like, if txn1 wrote at ts 5, and then it got bumped to ts 10, and now we have a read at ts 7, that read should succeed.
Isn't this already the case?
cockroach/pkg/storage/batcheval/cmd_push_txn.go
Lines 161 to 166 in 2aef042
// If we're trying to move the timestamp forward, and it's already | |
// far enough forward, return success. | |
if args.PushType == roachpb.PUSH_TIMESTAMP && args.PushTo.Less(reply.PusheeTxn.Timestamp) { | |
// Trivial noop. | |
return result.Result{}, nil | |
} |
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/storage/replica.go, line 2749 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
please add a comment at this change and below, reiterating that we want to write at the bumped ts.
Done.
It's greeeeen! 🍏📗💚🥗 Thank goodness. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 12 of 12 files at r3.
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/storage/engine/mvcc.go, line 1310 at r3 (raw file):
// A txn writes intents at its provisional commit timestamp. See the // comment on the txn.Timestamp field definition for rationale. writeTimestamp.Forward(txn.Timestamp)
Should it be an error for txn.Timestamp
to be less than timestamp
?
pkg/storage/engine/mvcc_test.go, line 436 at r3 (raw file):
txn := *txn1 txn.Sequence++ txn.Timestamp = hlc.Timestamp{WallTime: 1}
The tests would pass without this, right? (because of the way .Forward
is used instead of just overwriting timestamp
with txn.Timestamp
)
pkg/storage/engine/mvcc_test.go, line 1114 at r3 (raw file):
ts := []hlc.Timestamp{{Logical: 1}, {Logical: 2}, {Logical: 3}, {Logical: 4}, {Logical: 5}, {Logical: 6}} txn1ts := *txn1
Maybe the txn1
global should be replaced with a factory function since it's no longer used as-is anywhere (or is it?).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM mod a comment on the interface of MVCCPut.
And a question: What is the timestamp used for populating the write timestamp cache (by DelRange)?
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/storage/replica_test.go, line 2621 at r3 (raw file):
} func TestReplicaTSCacheForwardsIntentTS(t *testing.T) {
pls add a comment to the test with some words
pkg/storage/replica_test.go, line 2666 at r3 (raw file):
t.Fatal(pErr) } iter := tc.engine.NewIterator(engine.IterOptions{Prefix: true})
respect for going medieval
pkg/storage/engine/mvcc.go, line 1310 at r3 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
Should it be an error for
txn.Timestamp
to be less thantimestamp
?
yeah, I also wonder. And this function's interface, taking both a txn and a timestamp, is confusing. Please add some comments to these args. Maybe the timestamp
can be documented and asserted to be nil if a txn
is passed.
pkg/storage/engine/enginepb/mvcc3.proto, line 63 at r3 (raw file):
// original timestamp: // // 1. This timestamp is forwarded by the timestamp cache when this
good comment
39dba20
to
50869b1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whew, ok, addressed all the feedback. PTAL as the required changes to the storage/engine tests were rather large.
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/storage/replica_test.go, line 2621 at r3 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
pls add a comment to the test with some words
Done.
pkg/storage/engine/mvcc.go, line 1310 at r3 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
yeah, I also wonder. And this function's interface, taking both a txn and a timestamp, is confusing. Please add some comments to these args. Maybe the
timestamp
can be documented and asserted to be nil if atxn
is passed.
Yes, it's extremely confusing. We can't use nil
because hlc.Timestamp is passed by value. We don't really want to use hlc.Timestamp{}
because that already means inline put. (Yes, technically you can't do an inline put within a txn, but seems bad to overload that sentinel.)
The best thing I came up with was to mandate that timestamp matches txn.OrigTimestamp
(or txn.RefreshedTimestamp
if that's set). This is quite redundant, but it matches today's usage and I can't come up with anything better that doesn't require a major refactor.
Note that enforcing this actually required a ton of changes to the tests in storage/engine, since those tests were frequently violating this new timestamp-must-be-the-txn's-read-timestamp rule.
pkg/storage/engine/mvcc_test.go, line 436 at r3 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
The tests would pass without this, right? (because of the way
.Forward
is used instead of just overwritingtimestamp
withtxn.Timestamp
)
That's true in the revision you made this comment on, but is no longer true in the latest revision. See comment above for details.
pkg/storage/engine/mvcc_test.go, line 1114 at r3 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
Maybe the
txn1
global should be replaced with a factory function since it's no longer used as-is anywhere (or is it?).
There's still a bunch of tests that use it directly. (Many tests are happy with txn1's default timestamp.)
pkg/storage/engine/enginepb/mvcc3.proto, line 63 at r3 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
good comment
Thank you! :)
pkg/storage/replica.go, line 2749 at r1 (raw file):
Previously, benesch (Nikhil Benesch) wrote…
Done.
No longer relevant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was asking:
What is the timestamp used for populating the write timestamp cache (by DelRange)?
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/storage/engine/mvcc.go, line 1267 at r4 (raw file):
// for inline puts.) // // TODO(benesch): explore alternate function signatures that reduce
umm... I don't think you should be putting your name in TODOs any more (unless you've reconsidered :P).
I'd encourage you to add an options struct to this method that lets one clearly express inline vs intent vs non-transactional, and boil the ocean :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/storage/engine/mvcc.go, line 1267 at r4 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
umm... I don't think you should be putting your name in TODOs any more (unless you've reconsidered :P).
I'd encourage you to add an options struct to this method that lets one clearly express inline vs intent vs non-transactional, and boil the ocean :)
Can I put your name on the TODO then? :sinceretroll:
I've got a bunch of merge stuff that's higher priority than this refactor, so I'm going to merge this as-is unless you object strongly. I think there's an 80% chance or so that I can actually get back to this refactor before I leave. (I've already done the refactor for MVCCGet (#33213) and MVCCScan (#32389).)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/storage/engine/mvcc.go, line 1267 at r4 (raw file):
Previously, benesch (Nikhil Benesch) wrote…
Can I put your name on the TODO then? :sinceretroll:
I've got a bunch of merge stuff that's higher priority than this refactor, so I'm going to merge this as-is unless you object strongly. I think there's an 80% chance or so that I can actually get back to this refactor before I leave. (I've already done the refactor for MVCCGet (#33213) and MVCCScan (#32389).)
yeah, use my name
The fact that transactions write at their original timestamp, and not their provisional commit timestamp, allows leaving an intent under a read. The timestamp cache will ensure that the transaction can't actually commit unless it can bump its intents above the read, but it will still leave an intent under the read in the meantime. This can lead to starvation. Intents are meant to function as a sort of lock on a key. Once a writer lays down an intent, no readers should be allowed to read above that intent until that intent is resolved. Otherwise a continual stream of readers could prevent the writer from ever managing to commit by continually bumping the timestamp cache. Now consider how CDC's poller works: it reads (tsmin, ts1], then (ts1, ts2], then (ts2, ts3], and so on, in a tight loop. Since it uses a time-bound iterator under the hood, reading (ts2, ts3], for example, cannot return an intent written at ts2. But the idea was that we inductively guranteed that we never read above an intent. If an intent was written at ts2, even though the read from (ts2, ts3] would fail to observe it, the previous read from (ts1, ts2] would have. Of course, since transactions write at their original timestamp, a transaction with an original timestamp of ts2 can write an intent at ts2 *after* the CDC poller has read (ts1, ts2]. (The transaction will be forced to commit at ts3 or later to be sequenced after the CDC poller's read, but it will leave the intent at ts2.) The CDC poller's next read, from (ts2, ts3], thus won't see the intent, nor will any future reads at higher timestamps. And so the CDC poller will continually bump the timestamp cache, completely starving the writer. Fix the problem by writing at the transaction's provisional commit timestamp (i.e., the timestamp that has been forwarded by the timestamp cache, if necessary) instead of the transaction's original timestamp. Writing at the original timestamp was only necessary to prevent a lost update in snapshot isolation mode, which is no longer supported. In serializable mode, the anomaly is protected against by the read refresh mechanism. Besides fixing the starvation problem, the new behavior is more intuitive than the old behavior. It also might have some performance benefits, as it is less likely that intents will need to be bumped at commit time, which saves on RocksDB writes. Note that one conceptual oddity surrounds transactions reading their own writes: a transaction that is reading at ts1 and writing at ts2 will be able to read its own writes, even though simple timestamp comparison would make those writes invisible. This oddity is not introduced by this commit, though this commit does make this situation more common. Touches cockroachdb#32433. Release note: None
🎉 🎉 🎉 bors r=andreimatei,bdarnell |
32487: storage: write at provisional commit ts, not orig ts r=andreimatei,bdarnell a=benesch The fact that transactions write at their original timestamp, and not their provisional commit timestamp, allows leaving an intent under a read. The timestamp cache will ensure that the transaction can't actually commit unless it can bump its intents above the read, but it will still leave an intent under the read in the meantime. This can lead to starvation. Intents are meant to function as a sort of lock on a key. Once a writer lays down an intent, no readers should be allowed to read above that intent until that intent is resolved. Otherwise a continual stream of readers could prevent the writer from ever managing to commit by continually bumping the timestamp cache. Now consider how CDC's poller works: it reads (tsmin, ts1], then (ts1, ts2], then (ts2, ts3], and so on, in a tight loop. Since it uses a time-bound iterator under the hood, reading (ts2, ts3], for example, cannot return an intent written at ts2. But the idea was that we inductively guranteed that we never read above an intent. If an intent was written at ts2, even though the read from (ts2, ts3] would fail to observe it, the previous read from (ts1, ts2] would have. Of course, since transactions write at their original timestamp, a transaction with an original timestamp of ts2 can write an intent at ts2 *after* the CDC poller has read (ts1, ts2]. (The transaction will be forced to commit at ts3 or later to be sequenced after the CDC poller's read, but it will leave the intent at ts2.) The CDC poller's next read, from (ts2, ts3], thus won't see the intent, nor will any future reads at higher timestamps. And so the CDC poller will continually bump the timestamp cache, completely starving the writer. Fix the problem by writing at the transaction's provisional commit timestamp (i.e., the timestamp that has been forwarded by the timestamp cache, if necessary) instead of the transaction's original timestamp. Writing at the original timestamp was only necessary to prevent a lost update in snapshot isolation mode, which is no longer supported. In serializable mode, the anomaly is protected against by the read refresh mechanism. Besides fixing the starvation problem, the new behavior is more intuitive than the old behavior. It also might have some performance benefits, as it is less likely that intents will need to be bumped at commit time, which saves on RocksDB writes. Touches #32433. Release note: None Co-authored-by: Nikhil Benesch <nikhil.benesch@gmail.com>
Build succeeded |
33534: Revert "Revert "engineccl: ignore intents beneath start in MVCCIncrem… r=bdarnell,tbg a=benesch …entalIterator"" This reverts commit ceb4072. This previously caused flakiness in some CDC tests, but that's been fixed by #32487. Release note: None Co-authored-by: Nikhil Benesch <nikhil.benesch@gmail.com>
…rrors Fixes cockroachdb#34853. This change fixes cockroachdb#34853, which appears to have been broken by cockroachdb@57d0201#diff-41e9e1b7829f8e25daeb564f42c26017L1514. That change made it so that intents are now written at a transaction's provisional commit timestamp instead of its original timestamp (which is fantastic!). In doing so, it removed special purpose code to forward the `MVCCLogicalOpDetails` timestamp to the provisional commit timestamp of the transaction because it assumed that the new `writeTimestamp` variable that the change introduced already assumed this role. Unfortunately, the change allowed this `writeTimestamp` variable to regress below the transaction's provisional commit timestamp in subtle cases where a committed value ended up between the transaction's original timestamp and its provisional commit timestamp (a case that generates `WriteTooOldError`s). In this case, we were replacing the `writeTimestamp` with the committed value's timestamp + 1 logical tick. This regression allowed `MVCCWriteIntentOp`s to look like they were ignoring closed timestamps. The fix was to either introduce the special purpose code again or to avoid letting `writeTimestamp` regress in this `WriteTooOldError` code path. This PR chose the latter solution. In doing so, this PR also improves upon the benefit already provided by cockroachdb#32487 - it reduces the number of intents that need to be moved on transaction commit. This is because it ensures that if a transaction has an original (read) timestamp below a committed value and a provisional commit (write) timestamp above a committed value, an intent left by that transaction will be in a "committable" state without a need to rewrite it during intent resolution. Release note: None
…rrors Fixes cockroachdb#34853. This change fixes cockroachdb#34853, which appears to have been broken by cockroachdb@57d0201#diff-41e9e1b7829f8e25daeb564f42c26017L1514. That change made it so that intents are now written at a transaction's provisional commit timestamp instead of its original timestamp (which is fantastic!). In doing so, it removed special purpose code to forward the `MVCCLogicalOpDetails` timestamp to the provisional commit timestamp of the transaction because it assumed that the new `writeTimestamp` variable that the change introduced already assumed this role. Unfortunately, the change allowed this `writeTimestamp` variable to regress below the transaction's provisional commit timestamp in subtle cases where a committed value ended up between the transaction's original timestamp and its provisional commit timestamp (a case that generates `WriteTooOldError`s). In this case, we were replacing the `writeTimestamp` with the committed value's timestamp + 1 logical tick. This regression allowed `MVCCWriteIntentOp`s to look like they were ignoring closed timestamps. The fix was to either introduce the special purpose code again or to avoid letting `writeTimestamp` regress in this `WriteTooOldError` code path. This PR chose the latter solution. In doing so, this PR also improves upon the benefit already provided by cockroachdb#32487 - it reduces the number of intents that need to be moved on transaction commit. This is because it ensures that if a transaction has an original (read) timestamp below a committed value and a provisional commit (write) timestamp above a committed value, an intent left by that transaction will be in a "committable" state without a need to rewrite it during intent resolution. Release note: None
…rrors Fixes cockroachdb#34853. This change fixes cockroachdb#34853, which appears to have been broken by cockroachdb@57d0201#diff-41e9e1b7829f8e25daeb564f42c26017L1514. That change made it so that intents are now written at a transaction's provisional commit timestamp instead of its original timestamp (which is fantastic!). In doing so, it removed special purpose code to forward the `MVCCLogicalOpDetails` timestamp to the provisional commit timestamp of the transaction because it assumed that the new `writeTimestamp` variable that the change introduced already assumed this role. Unfortunately, the change allowed this `writeTimestamp` variable to regress below the transaction's provisional commit timestamp in subtle cases where a committed value ended up between the transaction's original timestamp and its provisional commit timestamp (a case that generates `WriteTooOldError`s). In this case, we were replacing the `writeTimestamp` with the committed value's timestamp + 1 logical tick. This regression allowed `MVCCWriteIntentOp`s to look like they were ignoring closed timestamps. The fix was to either introduce the special purpose code again or to avoid letting `writeTimestamp` regress in this `WriteTooOldError` code path. This PR chose the latter solution. In doing so, this PR also improves upon the benefit already provided by cockroachdb#32487 - it reduces the number of intents that need to be moved on transaction commit. This is because it ensures that if a transaction has an original (read) timestamp below a committed value and a provisional commit (write) timestamp above a committed value, an intent left by that transaction will be in a "committable" state without a need to rewrite it during intent resolution. Release note: None
35081: sql: Fix test flake r=andy-kimball a=andy-kimball Filter out PushTxn/ResolveIntent trace messages, since they represent intent cleanup from earlier operations that interferes with subsequent tests. This was already being done in other variations, so this commit extends that to all the variations that are doing filtering of messages. Fixes #34911 Release note: None 35086: mvcc: don't write below provisional commit timestamp on WriteTooOld errors r=tbg a=nvanbenschoten Fixes #34853. This change fixes #34853, which appears to have been broken by 57d0201#diff-41e9e1b7829f8e25daeb564f42c26017L1514. That change made it so that intents are now written at a transaction's provisional commit timestamp instead of its original timestamp (which is fantastic!). In doing so, it removed special-purpose code to forward the `MVCCLogicalOpDetails` timestamp to the provisional commit timestamp of the transaction because it assumed that the new `writeTimestamp` variable that the change introduced already assumed this role. Unfortunately, the change allowed this `writeTimestamp` variable to regress below the transaction's provisional commit timestamp in subtle cases where a committed value ended up between the transaction's original timestamp and its provisional commit timestamp (a case that generates `WriteTooOldError`s). In this case, we were replacing the `writeTimestamp` with the committed value's timestamp + 1 logical tick. This regression allowed `MVCCWriteIntentOp`s to look like they were ignoring closed timestamps. The fix was to either introduce the special purpose code again or to avoid letting `writeTimestamp` regress in this `WriteTooOldError` code path. This PR chose the latter solution. In doing so, this PR also improves upon the benefit already provided by #32487 - it reduces the number of intents that need to be moved on transaction commit. This is because it ensures that if a transaction has an original (read) timestamp below a committed value and a provisional commit (write) timestamp above a committed value, an intent left by that transaction will be in a "committable" state without a need to rewrite it during intent resolution. Release note: None 35112: kv: assert an absence of retries in TestTxnCoordSenderRetries r=tbg,andreimatei a=nvanbenschoten The test was only asserting that cases that should observe retries did, not the cases that should not observe retries did not. The PR also fixes `TestPropagateTxnOnError`, which wasn't really testing anything. We now set txn.Writing on the client, so it wasn't verifying what it thought it was. We switch it to assert that observed timestamps are propagated on error instead. 35118: storage: Downgrade a spammy log message r=tbg a=bdarnell This log message only occurs on clusters with multiple stores per node, but on those clusters it fires multiple times per second. Release note: None Co-authored-by: Andrew Kimball <andyk@cockroachlabs.com> Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com> Co-authored-by: Ben Darnell <ben@bendarnell.com>
The fact that transactions write at their original timestamp, and not
their provisional commit timestamp, allows leaving an intent under a
read. The timestamp cache will ensure that the transaction can't
actually commit unless it can bump its intents above the read, but it
will still leave an intent under the read in the meantime.
This can lead to starvation. Intents are meant to function as a sort of
lock on a key. Once a writer lays down an intent, no readers should be
allowed to read above that intent until that intent is resolved.
Otherwise a continual stream of readers could prevent the writer from
ever managing to commit by continually bumping the timestamp cache.
Now consider how CDC's poller works: it reads (tsmin, ts1], then (ts1,
ts2], then (ts2, ts3], and so on, in a tight loop. Since it uses a
time-bound iterator under the hood, reading (ts2, ts3], for example,
cannot return an intent written at ts2. But the idea was that we
inductively guranteed that we never read above an intent. If an intent
was written at ts2, even though the read from (ts2, ts3] would fail to
observe it, the previous read from (ts1, ts2] would have.
Of course, since transactions write at their original timestamp, a
transaction with an original timestamp of ts2 can write an intent at ts2
after the CDC poller has read (ts1, ts2]. (The transaction will be
forced to commit at ts3 or later to be sequenced after the CDC poller's
read, but it will leave the intent at ts2.) The CDC poller's next read,
from (ts2, ts3], thus won't see the intent, nor will any future reads at
higher timestamps. And so the CDC poller will continually bump the
timestamp cache, completely starving the writer.
Fix the problem by writing at the transaction's provisional commit
timestamp (i.e., the timestamp that has been forwarded by the timestamp
cache, if necessary) instead of the transaction's original timestamp.
Writing at the original timestamp was only necessary to prevent a lost
update in snapshot isolation mode, which is no longer supported. In
serializable mode, the anomaly is protected against by the read refresh
mechanism.
Besides fixing the starvation problem, the new behavior is more
intuitive than the old behavior. It also might have some performance
benefits, as it is less likely that intents will need to be bumped at
commit time, which saves on RocksDB writes.
Touches #32433.
Release note: None