Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage: write at provisional commit ts, not orig ts #32487

Merged
merged 1 commit into from
Dec 19, 2018

Conversation

benesch
Copy link
Contributor

@benesch benesch commented Nov 19, 2018

The fact that transactions write at their original timestamp, and not
their provisional commit timestamp, allows leaving an intent under a
read. The timestamp cache will ensure that the transaction can't
actually commit unless it can bump its intents above the read, but it
will still leave an intent under the read in the meantime.

This can lead to starvation. Intents are meant to function as a sort of
lock on a key. Once a writer lays down an intent, no readers should be
allowed to read above that intent until that intent is resolved.
Otherwise a continual stream of readers could prevent the writer from
ever managing to commit by continually bumping the timestamp cache.

Now consider how CDC's poller works: it reads (tsmin, ts1], then (ts1,
ts2], then (ts2, ts3], and so on, in a tight loop. Since it uses a
time-bound iterator under the hood, reading (ts2, ts3], for example,
cannot return an intent written at ts2. But the idea was that we
inductively guranteed that we never read above an intent. If an intent
was written at ts2, even though the read from (ts2, ts3] would fail to
observe it, the previous read from (ts1, ts2] would have.

Of course, since transactions write at their original timestamp, a
transaction with an original timestamp of ts2 can write an intent at ts2
after the CDC poller has read (ts1, ts2]. (The transaction will be
forced to commit at ts3 or later to be sequenced after the CDC poller's
read, but it will leave the intent at ts2.) The CDC poller's next read,
from (ts2, ts3], thus won't see the intent, nor will any future reads at
higher timestamps. And so the CDC poller will continually bump the
timestamp cache, completely starving the writer.

Fix the problem by writing at the transaction's provisional commit
timestamp (i.e., the timestamp that has been forwarded by the timestamp
cache, if necessary) instead of the transaction's original timestamp.
Writing at the original timestamp was only necessary to prevent a lost
update in snapshot isolation mode, which is no longer supported. In
serializable mode, the anomaly is protected against by the read refresh
mechanism.

Besides fixing the starvation problem, the new behavior is more
intuitive than the old behavior. It also might have some performance
benefits, as it is less likely that intents will need to be bumped at
commit time, which saves on RocksDB writes.

Touches #32433.

Release note: None

@benesch benesch requested review from bdarnell, andreimatei, tbg and a team November 19, 2018 22:19
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@benesch benesch changed the title storage: write at provisional commit ts, not orig ts [semiwip] storage: write at provisional commit ts, not orig ts Nov 19, 2018
@benesch
Copy link
Contributor Author

benesch commented Nov 19, 2018

The big downside of this change is that we really want TBIs to be useful for CDC in 2.1. CDC in 2.2 will make more use of range feeds and less use of polling so TBIs will be less necessary. But this change is definitely too scary to backport to 2.1.

@tbg
Copy link
Member

tbg commented Nov 19, 2018

Are we exploring ripping out TBIs and pushing logic to C++? That sounded like the least scary venue.

@benesch
Copy link
Contributor Author

benesch commented Nov 19, 2018

Are we exploring ripping out TBIs and pushing logic to C++? That sounded like the least scary venue.

My gut says that that won't be comparable to the benefit you get when you only need to scan the memtable and some L0 files. There is a fundamental limit to how fast scanning n keys can be, and TBIs are currently our only way of reducing n rather than the constant factor.

Perhaps we'd get lucky and things would be fast enough though. It's definitely worth an experiment.

@bdarnell
Copy link
Contributor

The CDC poller's next read,
from (ts2, ts3], thus won't see the intent, nor will any future reads at
higher timestamps. And so the CDC poller will continually bump the
timestamp cache, completely starving the writer.

Oh, this is subtle. The CDC poller doesn't actually care about uncommitted intents, and it will see the committed value at its new timestamp when and if it is able to commit. But the tight loop starves out the transaction so it can't commit.

How does this change solve the problem? We write above the poller's read, but what stops the poller from advancing the timestamp cache again? Is it just that the poller stops advancing while there is a pending intent?

We have some tests that use SNAPSHOT isolation; these tests will (hopefully) become flaky with a change like this. We should get rid of tests that use SNAPSHOT before we make changes that will break them.

But this change is definitely too scary to backport to 2.1.

We can't break SNAPSHOT in 2.1.x; clusters being upgraded from 2.0 may still perform snapshot-isolation transactions in a mixed-version configuration.

If reads still use OrigTimestamp but writes use Timestamp, how does that affect commands that do both like ConditionalPut?

@benesch
Copy link
Contributor Author

benesch commented Nov 20, 2018

Oh, this is subtle. The CDC poller doesn't actually care about uncommitted intents, and it will see the committed value at its new timestamp when and if it is able to commit. But the tight loop starves out the transaction so it can't commit.

I think it's actually possible that the CDC poller misses the value entirely. Say the poller reads at (ts1, ts2], and a concurrent transaction writes at ts1 (its orig timestamp) while getting pushed to ts3. If that transaction takes a while to send its commit RPC, the CDC poller might read (ts2, ts3] before the intent at ts3 gets resolved. And then it will be reading (ts3, ts4] by the time the intent gets resolved and never see the value committed at ts3.

I could be off base here, but I think the intuition is that an intent is a lock on [tintent, ∞). You can't say that you've read everything for t ≥ tintent until that intent is resolved because the intent could commit at any such t.

How does this change solve the problem? We write above the poller's read, but what stops the poller from advancing the timestamp cache again? Is it just that the poller stops advancing while there is a pending intent?

Right. If the poller's ExportRequest catches a WriteIntentError, it won't update the timestamp cache, and so the intent can get bumped.

We can't break SNAPSHOT in 2.1.x; clusters being upgraded from 2.0 may still perform snapshot-isolation transactions in a mixed-version configuration.

Good point.

If reads still use OrigTimestamp but writes use Timestamp, how does that affect commands that do both like ConditionalPut?

I don't know. I suspect it works fine because all reads get refreshed to the commit timestamp anyway, so we're just doing extra work. But I need to think about this more.

Copy link
Contributor

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM mod the fact that this really needs tests for all the changes in applyTimestampCache.

I think we should couple this patch with a change to PushTxn which makes a push succeed if it's trying to push the txn at a ts <= its current timestamp (i.e. the timestamps known by the txn record). Like, if txn1 wrote at ts 5, and then it got bumped to ts 10, and now we have a read at ts 7, that read should succeed.
Otherwise, with this change, we can have situations where, because a txn now is leaving intents at different timestamps, whether or not two txn conflict can depend on which intent from the first the second one runs into.
In other words, I'm not so sure about the "the new behavior is more
intuitive than the old behavior" claim; the old one seems more intuitive to me. But otherwise I do think I like this change - there are benefits to writing at timestamps as high as possible.

// but to avoid being starved by short-lived transactions on that key which
// would otherwise not have to go through conflict resolution with txn1.
//
// Again, keep in mind that, when the transaction commits, all the intents are
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this part of the comment should be kept somewhere. The comment on the timestamp field should be expended to explain how its used for writes, and how a txn might be laying down intents at different timestamps as it goes along.

@@ -2746,6 +2746,7 @@ func (r *Replica) applyTimestampCache(
txn := ba.Txn.Clone()
bumped = txn.Timestamp.Forward(nextTS) || bumped
ba.Txn = &txn
ba.Timestamp = txn.Timestamp
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add a comment at this change and below, reiterating that we want to write at the bumped ts.

@andreimatei
Copy link
Contributor

Oh I wanted to say something else too - let's also couple this patch with deleting every trace of snapshot txns from our code. You can do it in a different commit if you want to backport the first (although the backport I guess would need to be put behind a cluster version per Ben's comments).

@bdarnell
Copy link
Contributor

although the backport I guess would need to be put behind a cluster version per Ben's comments

And backports aren't allowed to introduce cluster versions unless it's a really exceptional case.

@andreimatei
Copy link
Contributor

And backports aren't allowed to introduce cluster versions unless it's a really exceptional case.

We were just discussing offline how you don't really need a new cluster version, you could just gate on the 2.1 one.

benesch added a commit to benesch/cockroach that referenced this pull request Dec 5, 2018
Snapshot isolation was disabled in v2.1, so it is safe to remove all
code paths that handle snapshot isolation in v2.2.

This is a necessary precursor to cockroachdb#32487.

Release note: None
benesch added a commit to benesch/cockroach that referenced this pull request Dec 6, 2018
Snapshot isolation was disabled in v2.1, so it is safe to remove all
code paths that handle snapshot isolation in v2.2.

Note that the change to the MVCCMetadata checksum in
pkg/storage/below_raft_protos_test.go is safe, because the actual
encoding of the protobuf is not affected by this change. What is
affected is NewPopulatedMVCCMetadata, which no longer returns a TxnMeta
with a non-zero IsolationType, since the IsolationType field has been
removed.

This is a necessary precursor to cockroachdb#32487.

Release note: None
craig bot pushed a commit that referenced this pull request Dec 6, 2018
32860: kv,storage,sql: purge snapshot isolation r=andreimatei,nvanbenschoten,bdarnell a=benesch

This touches a lot so I cast a wide net for reviewers. Feel free to remove yourself!

---

Snapshot isolation was disabled in v2.1, so it is safe to remove all
code paths that handle snapshot isolation in v2.2.

This is a necessary precursor to #32487.

Release note: None

Co-authored-by: Nikhil Benesch <nikhil.benesch@gmail.com>
@benesch benesch force-pushed the wts branch 2 times, most recently from a801911 to 26ea8f2 Compare December 10, 2018 22:47
Copy link
Contributor Author

@benesch benesch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, PTAL. I added a test and updated some more comments.

LGTM mod the fact that this really needs tests for all the changes in applyTimestampCache.

Done.

Oh I wanted to say something else too - let's also couple this patch with deleting every trace of snapshot txns from our code. You can do it in a different commit if you want to backport the first (although the backport I guess would need to be put behind a cluster version per Ben's comments)

Done in #32860.

I think we should couple this patch with a change to PushTxn which makes a push succeed if it's trying to push the txn at a ts <= its current timestamp (i.e. the timestamps known by the txn record). Like, if txn1 wrote at ts 5, and then it got bumped to ts 10, and now we have a read at ts 7, that read should succeed.

Isn't this already the case?

// If we're trying to move the timestamp forward, and it's already
// far enough forward, return success.
if args.PushType == roachpb.PUSH_TIMESTAMP && args.PushTo.Less(reply.PusheeTxn.Timestamp) {
// Trivial noop.
return result.Result{}, nil
}

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/storage/replica.go, line 2749 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

please add a comment at this change and below, reiterating that we want to write at the bumped ts.

Done.

@benesch benesch changed the title [semiwip] storage: write at provisional commit ts, not orig ts storage: write at provisional commit ts, not orig ts Dec 10, 2018
@benesch benesch requested review from a team December 14, 2018 20:29
@benesch
Copy link
Contributor Author

benesch commented Dec 18, 2018

It's greeeeen! 🍏📗💚🥗 Thank goodness.

Copy link
Contributor

@bdarnell bdarnell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 12 of 12 files at r3.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/storage/engine/mvcc.go, line 1310 at r3 (raw file):

		// A txn writes intents at its provisional commit timestamp. See the
		// comment on the txn.Timestamp field definition for rationale.
		writeTimestamp.Forward(txn.Timestamp)

Should it be an error for txn.Timestamp to be less than timestamp?


pkg/storage/engine/mvcc_test.go, line 436 at r3 (raw file):

	txn := *txn1
	txn.Sequence++
	txn.Timestamp = hlc.Timestamp{WallTime: 1}

The tests would pass without this, right? (because of the way .Forward is used instead of just overwriting timestamp with txn.Timestamp)


pkg/storage/engine/mvcc_test.go, line 1114 at r3 (raw file):

	ts := []hlc.Timestamp{{Logical: 1}, {Logical: 2}, {Logical: 3}, {Logical: 4}, {Logical: 5}, {Logical: 6}}

	txn1ts := *txn1

Maybe the txn1 global should be replaced with a factory function since it's no longer used as-is anywhere (or is it?).

Copy link
Contributor

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM mod a comment on the interface of MVCCPut.

And a question: What is the timestamp used for populating the write timestamp cache (by DelRange)?

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/storage/replica_test.go, line 2621 at r3 (raw file):

}

func TestReplicaTSCacheForwardsIntentTS(t *testing.T) {

pls add a comment to the test with some words


pkg/storage/replica_test.go, line 2666 at r3 (raw file):

				t.Fatal(pErr)
			}
			iter := tc.engine.NewIterator(engine.IterOptions{Prefix: true})

respect for going medieval


pkg/storage/engine/mvcc.go, line 1310 at r3 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Should it be an error for txn.Timestamp to be less than timestamp?

yeah, I also wonder. And this function's interface, taking both a txn and a timestamp, is confusing. Please add some comments to these args. Maybe the timestamp can be documented and asserted to be nil if a txn is passed.


pkg/storage/engine/enginepb/mvcc3.proto, line 63 at r3 (raw file):

  // original timestamp:
  //
  //    1. This timestamp is forwarded by the timestamp cache when this

good comment

@benesch benesch force-pushed the wts branch 4 times, most recently from 39dba20 to 50869b1 Compare December 19, 2018 06:32
Copy link
Contributor Author

@benesch benesch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whew, ok, addressed all the feedback. PTAL as the required changes to the storage/engine tests were rather large.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/storage/replica_test.go, line 2621 at r3 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

pls add a comment to the test with some words

Done.


pkg/storage/engine/mvcc.go, line 1310 at r3 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

yeah, I also wonder. And this function's interface, taking both a txn and a timestamp, is confusing. Please add some comments to these args. Maybe the timestamp can be documented and asserted to be nil if a txn is passed.

Yes, it's extremely confusing. We can't use nil because hlc.Timestamp is passed by value. We don't really want to use hlc.Timestamp{} because that already means inline put. (Yes, technically you can't do an inline put within a txn, but seems bad to overload that sentinel.)

The best thing I came up with was to mandate that timestamp matches txn.OrigTimestamp (or txn.RefreshedTimestamp if that's set). This is quite redundant, but it matches today's usage and I can't come up with anything better that doesn't require a major refactor.

Note that enforcing this actually required a ton of changes to the tests in storage/engine, since those tests were frequently violating this new timestamp-must-be-the-txn's-read-timestamp rule.


pkg/storage/engine/mvcc_test.go, line 436 at r3 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

The tests would pass without this, right? (because of the way .Forward is used instead of just overwriting timestamp with txn.Timestamp)

That's true in the revision you made this comment on, but is no longer true in the latest revision. See comment above for details.


pkg/storage/engine/mvcc_test.go, line 1114 at r3 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Maybe the txn1 global should be replaced with a factory function since it's no longer used as-is anywhere (or is it?).

There's still a bunch of tests that use it directly. (Many tests are happy with txn1's default timestamp.)


pkg/storage/engine/enginepb/mvcc3.proto, line 63 at r3 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

good comment

Thank you! :)


pkg/storage/replica.go, line 2749 at r1 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

Done.

No longer relevant.

Copy link
Contributor

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was asking:

What is the timestamp used for populating the write timestamp cache (by DelRange)?

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/storage/engine/mvcc.go, line 1267 at r4 (raw file):

// for inline puts.)
//
// TODO(benesch): explore alternate function signatures that reduce

umm... I don't think you should be putting your name in TODOs any more (unless you've reconsidered :P).

I'd encourage you to add an options struct to this method that lets one clearly express inline vs intent vs non-transactional, and boil the ocean :)

Copy link
Contributor Author

@benesch benesch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/storage/engine/mvcc.go, line 1267 at r4 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

umm... I don't think you should be putting your name in TODOs any more (unless you've reconsidered :P).

I'd encourage you to add an options struct to this method that lets one clearly express inline vs intent vs non-transactional, and boil the ocean :)

Can I put your name on the TODO then? :sinceretroll:

I've got a bunch of merge stuff that's higher priority than this refactor, so I'm going to merge this as-is unless you object strongly. I think there's an 80% chance or so that I can actually get back to this refactor before I leave. (I've already done the refactor for MVCCGet (#33213) and MVCCScan (#32389).)

Copy link
Contributor

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/storage/engine/mvcc.go, line 1267 at r4 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

Can I put your name on the TODO then? :sinceretroll:

I've got a bunch of merge stuff that's higher priority than this refactor, so I'm going to merge this as-is unless you object strongly. I think there's an 80% chance or so that I can actually get back to this refactor before I leave. (I've already done the refactor for MVCCGet (#33213) and MVCCScan (#32389).)

yeah, use my name

The fact that transactions write at their original timestamp, and not
their provisional commit timestamp, allows leaving an intent under a
read. The timestamp cache will ensure that the transaction can't
actually commit unless it can bump its intents above the read, but it
will still leave an intent under the read in the meantime.

This can lead to starvation. Intents are meant to function as a sort of
lock on a key. Once a writer lays down an intent, no readers should be
allowed to read above that intent until that intent is resolved.
Otherwise a continual stream of readers could prevent the writer from
ever managing to commit by continually bumping the timestamp cache.

Now consider how CDC's poller works: it reads (tsmin, ts1], then (ts1,
ts2], then (ts2, ts3], and so on, in a tight loop. Since it uses a
time-bound iterator under the hood, reading (ts2, ts3], for example,
cannot return an intent written at ts2. But the idea was that we
inductively guranteed that we never read above an intent. If an intent
was written at ts2, even though the read from (ts2, ts3] would fail to
observe it, the previous read from (ts1, ts2] would have.

Of course, since transactions write at their original timestamp, a
transaction with an original timestamp of ts2 can write an intent at ts2
*after* the CDC poller has read (ts1, ts2]. (The transaction will be
forced to commit at ts3 or later to be sequenced after the CDC poller's
read, but it will leave the intent at ts2.) The CDC poller's next read,
from (ts2, ts3], thus won't see the intent, nor will any future reads at
higher timestamps. And so the CDC poller will continually bump the
timestamp cache, completely starving the writer.

Fix the problem by writing at the transaction's provisional commit
timestamp (i.e., the timestamp that has been forwarded by the timestamp
cache, if necessary) instead of the transaction's original timestamp.
Writing at the original timestamp was only necessary to prevent a lost
update in snapshot isolation mode, which is no longer supported. In
serializable mode, the anomaly is protected against by the read refresh
mechanism.

Besides fixing the starvation problem, the new behavior is more
intuitive than the old behavior. It also might have some performance
benefits, as it is less likely that intents will need to be bumped at
commit time, which saves on RocksDB writes.

Note that one conceptual oddity surrounds transactions reading their own
writes: a transaction that is reading at ts1 and writing at ts2 will be
able to read its own writes, even though simple timestamp comparison
would make those writes invisible. This oddity is not introduced by
this commit, though this commit does make this situation more common.

Touches cockroachdb#32433.

Release note: None
@benesch
Copy link
Contributor Author

benesch commented Dec 19, 2018

🎉 🎉 🎉

bors r=andreimatei,bdarnell

craig bot pushed a commit that referenced this pull request Dec 19, 2018
32487: storage: write at provisional commit ts, not orig ts r=andreimatei,bdarnell a=benesch


The fact that transactions write at their original timestamp, and not
their provisional commit timestamp, allows leaving an intent under a
read. The timestamp cache will ensure that the transaction can't
actually commit unless it can bump its intents above the read, but it
will still leave an intent under the read in the meantime.

This can lead to starvation. Intents are meant to function as a sort of
lock on a key. Once a writer lays down an intent, no readers should be
allowed to read above that intent until that intent is resolved.
Otherwise a continual stream of readers could prevent the writer from
ever managing to commit by continually bumping the timestamp cache.

Now consider how CDC's poller works: it reads (tsmin, ts1], then (ts1,
ts2], then (ts2, ts3], and so on, in a tight loop. Since it uses a
time-bound iterator under the hood, reading (ts2, ts3], for example,
cannot return an intent written at ts2. But the idea was that we
inductively guranteed that we never read above an intent. If an intent
was written at ts2, even though the read from (ts2, ts3] would fail to
observe it, the previous read from (ts1, ts2] would have.

Of course, since transactions write at their original timestamp, a
transaction with an original timestamp of ts2 can write an intent at ts2
*after* the CDC poller has read (ts1, ts2]. (The transaction will be
forced to commit at ts3 or later to be sequenced after the CDC poller's
read, but it will leave the intent at ts2.) The CDC poller's next read,
from (ts2, ts3], thus won't see the intent, nor will any future reads at
higher timestamps. And so the CDC poller will continually bump the
timestamp cache, completely starving the writer.

Fix the problem by writing at the transaction's provisional commit
timestamp (i.e., the timestamp that has been forwarded by the timestamp
cache, if necessary) instead of the transaction's original timestamp.
Writing at the original timestamp was only necessary to prevent a lost
update in snapshot isolation mode, which is no longer supported. In
serializable mode, the anomaly is protected against by the read refresh
mechanism.

Besides fixing the starvation problem, the new behavior is more
intuitive than the old behavior. It also might have some performance
benefits, as it is less likely that intents will need to be bumped at
commit time, which saves on RocksDB writes.

Touches #32433.

Release note: None

Co-authored-by: Nikhil Benesch <nikhil.benesch@gmail.com>
@craig
Copy link
Contributor

craig bot commented Dec 19, 2018

Build succeeded

@craig craig bot merged commit 57d0201 into cockroachdb:master Dec 19, 2018
@benesch benesch deleted the wts branch December 19, 2018 22:43
craig bot pushed a commit that referenced this pull request Jan 7, 2019
33534: Revert "Revert "engineccl: ignore intents beneath start in MVCCIncrem… r=bdarnell,tbg a=benesch

…entalIterator""

This reverts commit ceb4072. This previously caused flakiness in some CDC tests, but that's been fixed by #32487.

Release note: None

Co-authored-by: Nikhil Benesch <nikhil.benesch@gmail.com>
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Feb 20, 2019
…rrors

Fixes cockroachdb#34853.

This change fixes cockroachdb#34853, which appears to have been broken by
cockroachdb@57d0201#diff-41e9e1b7829f8e25daeb564f42c26017L1514.
That change made it so that intents are now written at a transaction's
provisional commit timestamp instead of its original timestamp (which is
fantastic!). In doing so, it removed special purpose code to forward the
`MVCCLogicalOpDetails` timestamp to the provisional commit timestamp of the
transaction because it assumed that the new `writeTimestamp` variable that the
change introduced already assumed this role.

Unfortunately, the change allowed this `writeTimestamp` variable to regress
below the transaction's provisional commit timestamp in subtle cases where a
committed value ended up between the transaction's original timestamp and its
provisional commit timestamp (a case that generates `WriteTooOldError`s). In
this case, we were replacing the `writeTimestamp` with the committed value's
timestamp + 1 logical tick. This regression allowed `MVCCWriteIntentOp`s to look
like they were ignoring closed timestamps.

The fix was to either introduce the special purpose code again or to avoid
letting `writeTimestamp` regress in this `WriteTooOldError` code path. This PR
chose the latter solution.

In doing so, this PR also improves upon the benefit already provided by cockroachdb#32487 -
it reduces the number of intents that need to be moved on transaction commit.
This is because it ensures that if a transaction has an original (read)
timestamp below a committed value and a provisional commit (write) timestamp
above a committed value, an intent left by that transaction will be in a
"committable" state without a need to rewrite it during intent resolution.

Release note: None
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Feb 20, 2019
…rrors

Fixes cockroachdb#34853.

This change fixes cockroachdb#34853, which appears to have been broken by
cockroachdb@57d0201#diff-41e9e1b7829f8e25daeb564f42c26017L1514.
That change made it so that intents are now written at a transaction's
provisional commit timestamp instead of its original timestamp (which is
fantastic!). In doing so, it removed special purpose code to forward the
`MVCCLogicalOpDetails` timestamp to the provisional commit timestamp of the
transaction because it assumed that the new `writeTimestamp` variable that the
change introduced already assumed this role.

Unfortunately, the change allowed this `writeTimestamp` variable to regress
below the transaction's provisional commit timestamp in subtle cases where a
committed value ended up between the transaction's original timestamp and its
provisional commit timestamp (a case that generates `WriteTooOldError`s). In
this case, we were replacing the `writeTimestamp` with the committed value's
timestamp + 1 logical tick. This regression allowed `MVCCWriteIntentOp`s to look
like they were ignoring closed timestamps.

The fix was to either introduce the special purpose code again or to avoid
letting `writeTimestamp` regress in this `WriteTooOldError` code path. This PR
chose the latter solution.

In doing so, this PR also improves upon the benefit already provided by cockroachdb#32487 -
it reduces the number of intents that need to be moved on transaction commit.
This is because it ensures that if a transaction has an original (read)
timestamp below a committed value and a provisional commit (write) timestamp
above a committed value, an intent left by that transaction will be in a
"committable" state without a need to rewrite it during intent resolution.

Release note: None
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Feb 21, 2019
…rrors

Fixes cockroachdb#34853.

This change fixes cockroachdb#34853, which appears to have been broken by
cockroachdb@57d0201#diff-41e9e1b7829f8e25daeb564f42c26017L1514.
That change made it so that intents are now written at a transaction's
provisional commit timestamp instead of its original timestamp (which is
fantastic!). In doing so, it removed special purpose code to forward the
`MVCCLogicalOpDetails` timestamp to the provisional commit timestamp of the
transaction because it assumed that the new `writeTimestamp` variable that the
change introduced already assumed this role.

Unfortunately, the change allowed this `writeTimestamp` variable to regress
below the transaction's provisional commit timestamp in subtle cases where a
committed value ended up between the transaction's original timestamp and its
provisional commit timestamp (a case that generates `WriteTooOldError`s). In
this case, we were replacing the `writeTimestamp` with the committed value's
timestamp + 1 logical tick. This regression allowed `MVCCWriteIntentOp`s to look
like they were ignoring closed timestamps.

The fix was to either introduce the special purpose code again or to avoid
letting `writeTimestamp` regress in this `WriteTooOldError` code path. This PR
chose the latter solution.

In doing so, this PR also improves upon the benefit already provided by cockroachdb#32487 -
it reduces the number of intents that need to be moved on transaction commit.
This is because it ensures that if a transaction has an original (read)
timestamp below a committed value and a provisional commit (write) timestamp
above a committed value, an intent left by that transaction will be in a
"committable" state without a need to rewrite it during intent resolution.

Release note: None
craig bot pushed a commit that referenced this pull request Feb 21, 2019
35081: sql: Fix test flake r=andy-kimball a=andy-kimball

Filter out PushTxn/ResolveIntent trace messages, since they represent intent
cleanup from earlier operations that interferes with subsequent tests. This
was already being done in other variations, so this commit extends that to
all the variations that are doing filtering of messages.

Fixes #34911

Release note: None

35086: mvcc: don't write below provisional commit timestamp on WriteTooOld errors r=tbg a=nvanbenschoten

Fixes #34853.

This change fixes #34853, which appears to have been broken by
57d0201#diff-41e9e1b7829f8e25daeb564f42c26017L1514.
That change made it so that intents are now written at a transaction's
provisional commit timestamp instead of its original timestamp (which is
fantastic!). In doing so, it removed special-purpose code to forward the
`MVCCLogicalOpDetails` timestamp to the provisional commit timestamp of the
transaction because it assumed that the new `writeTimestamp` variable that the
change introduced already assumed this role.

Unfortunately, the change allowed this `writeTimestamp` variable to regress
below the transaction's provisional commit timestamp in subtle cases where a
committed value ended up between the transaction's original timestamp and its
provisional commit timestamp (a case that generates `WriteTooOldError`s). In
this case, we were replacing the `writeTimestamp` with the committed value's
timestamp + 1 logical tick. This regression allowed `MVCCWriteIntentOp`s to look
like they were ignoring closed timestamps.

The fix was to either introduce the special purpose code again or to avoid
letting `writeTimestamp` regress in this `WriteTooOldError` code path. This PR
chose the latter solution.

In doing so, this PR also improves upon the benefit already provided by #32487 -
it reduces the number of intents that need to be moved on transaction commit.
This is because it ensures that if a transaction has an original (read)
timestamp below a committed value and a provisional commit (write) timestamp
above a committed value, an intent left by that transaction will be in a
"committable" state without a need to rewrite it during intent resolution.

Release note: None

35112: kv: assert an absence of retries in TestTxnCoordSenderRetries  r=tbg,andreimatei a=nvanbenschoten

The test was only asserting that cases that should observe retries did, not the cases that should not observe retries did not.

The PR also fixes `TestPropagateTxnOnError`, which wasn't really testing anything. We now set txn.Writing on the client, so it wasn't verifying what it thought it was. We switch it to assert that observed timestamps are propagated on error instead.

35118: storage: Downgrade a spammy log message r=tbg a=bdarnell

This log message only occurs on clusters with multiple stores per
node, but on those clusters it fires multiple times per second.

Release note: None

Co-authored-by: Andrew Kimball <andyk@cockroachlabs.com>
Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Co-authored-by: Ben Darnell <ben@bendarnell.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants