storage: write at provisional commit ts, not orig ts
The fact that transactions write at their original timestamp, and not their provisional commit timestamp, allows leaving an intent under a read. The timestamp cache will ensure that the transaction can't actually commit unless it can bump its intents above the read, but it will still leave an intent under the read in the meantime.

This can lead to starvation. Intents are meant to function as a sort of lock on a key. Once a writer lays down an intent, no readers should be allowed to read above that intent until that intent is resolved. Otherwise a continual stream of readers could prevent the writer from ever managing to commit by continually bumping the timestamp cache.

Now consider how CDC's poller works: it reads (tsmin, ts1], then (ts1, ts2], then (ts2, ts3], and so on, in a tight loop. Since it uses a time-bound iterator under the hood, reading (ts2, ts3], for example, cannot return an intent written at ts2. But the idea was that we inductively guaranteed that we never read above an intent: if an intent was written at ts2, even though the read from (ts2, ts3] would fail to observe it, the previous read from (ts1, ts2] would have.

Of course, since transactions write at their original timestamp, a transaction with an original timestamp of ts2 can write an intent at ts2 *after* the CDC poller has read (ts1, ts2]. (The transaction will be forced to commit at ts3 or later to be sequenced after the CDC poller's read, but it will leave the intent at ts2.) The CDC poller's next read, from (ts2, ts3], thus won't see the intent, nor will any future reads at higher timestamps. And so the CDC poller will continually bump the timestamp cache, completely starving the writer.

Fix the problem by writing at the transaction's provisional commit timestamp (i.e., the timestamp that has been forwarded by the timestamp cache, if necessary) instead of the transaction's original timestamp.
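The starvation scenario and the fix can be illustrated with a toy simulation. This is only a sketch under simplifying assumptions, not CockroachDB code: timestamps are plain integers, there is a single key, the poller always wins the race against the writer's commit attempt, and every name (`key`, `read`, `write`, `simulate`) is hypothetical.

```go
package main

import "fmt"

// key is a toy model of one key's state. All names are hypothetical.
type key struct {
	tsCache  int // highest timestamp at which the key has been read
	intentTS int // timestamp of the outstanding intent; 0 means none
}

// read models a time-bound read over (lo, hi]. It reports whether it ran
// into an intent inside that window. An unblocked read bumps the toy
// timestamp cache to hi.
func (k *key) read(lo, hi int) (blocked bool) {
	if k.intentTS > lo && k.intentTS <= hi {
		return true
	}
	if hi > k.tsCache {
		k.tsCache = hi
	}
	return false
}

// write lays down an intent for a transaction with the given original
// timestamp and returns its provisional commit timestamp, which is
// forwarded past the timestamp cache if necessary.
func (k *key) write(origTS int, atProvisional bool) (provisional int) {
	provisional = origTS
	if k.tsCache >= provisional {
		provisional = k.tsCache + 1
	}
	if atProvisional {
		k.intentTS = provisional // new behavior
	} else {
		k.intentTS = origTS // old behavior: intent left under the read
	}
	return provisional
}

// simulate runs the poller for a number of rounds against one writer and
// reports whether the writer committed and how many reads were blocked.
func simulate(atProvisional bool, rounds int) (committed bool, blockedReads int) {
	k := &key{}
	k.read(0, 1)
	k.read(1, 2) // the poller has already read (ts1, ts2]
	provisional := k.write(2, atProvisional)
	lo := 2
	for i := 0; i < rounds; i++ {
		// Assumption: the poller's next read always arrives before the
		// writer's next commit attempt.
		if k.read(lo, lo+1) {
			blockedReads++ // the reader waits on the intent
		} else {
			lo++
		}
		if k.tsCache >= provisional {
			provisional = k.tsCache + 1 // pushed by the read; retry later
			continue
		}
		return true, blockedReads
	}
	return false, blockedReads
}

func main() {
	for _, atProvisional := range []bool{false, true} {
		committed, blocked := simulate(atProvisional, 100)
		fmt.Printf("write at provisional ts=%v: committed=%v blockedReads=%d\n",
			atProvisional, committed, blocked)
	}
}
```

In this model, writing at the original timestamp leaves the intent at ts2, outside every later read window, so the poller never blocks and the writer is pushed on every round. Writing at the provisional timestamp places the intent inside the poller's next window, so the read blocks and the writer commits.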
Writing at the original timestamp was only necessary to prevent a lost update in snapshot isolation mode, which is no longer supported. In serializable mode, the anomaly is protected against by the read refresh mechanism.

Besides fixing the starvation problem, the new behavior is more intuitive than the old behavior. It also might have some performance benefits, as it is less likely that intents will need to be bumped at commit time, which saves on RocksDB writes.

Touches #32433.

Release note: None
Showing 5 changed files with 92 additions and 78 deletions.