-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
storage: minimize retries with 1PC commits #22315
storage: minimize retries with 1PC commits #22315
Conversation
b1d7e68
to
c2a562e
Compare
@nvanbenschoten this now evaluates the full batch for both transactional and non-transactional batches to find the highest timestamp to satisfy write too old errors. Local retries on the replica at evaluation time are allowed for non-transactional batches always and for serializable 1PC transactional batches, as long as they have the |
Thanks @spencerkimball, I'm more comfortable with the change now that we're attempting to minimize the timestamp that these transactions will be pushed to. This eliminates the increase in |
@nvanbenschoten it does eliminate those yes. |
Reviewed 5 of 7 files at r1. pkg/roachpb/api.proto, line 458 at r1 (raw file):
Does it ever make sense to set this for snapshot txns? pkg/roachpb/api.proto, line 461 at r1 (raw file):
replace "use the executing server's hlc clock now timestamp" with "increase their timestamp on the executing server" pkg/roachpb/api.proto, line 462 at r1 (raw file):
Do you think we should make this flag even more general? Something like As discussed with @andreimatei, we don't want to increase our transaction timestamp if we've done anything that relies on our initial timestamp. This includes having performed reads and writes at the timestamp but can also include leaking the timestamp in other ways like reporting it to SQL functions. pkg/storage/replica.go, line 5209 at r1 (raw file):
nit: pull pkg/storage/replica.go, line 5273 at r1 (raw file):
Is this actually possible? Doesn't Comments from Reviewable |
c2a562e
to
057d20f
Compare
Review status: 5 of 7 files reviewed at latest revision, 5 unresolved discussions, all commit checks successful. pkg/roachpb/api.proto, line 458 at r1 (raw file): Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
We don't track refresh spans for snapshot txns because we don't use them for refreshing, so this optimization doesn't apply to SNAPSHOT. pkg/roachpb/api.proto, line 461 at r1 (raw file): Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Oops. this was a leftover bit. pkg/roachpb/api.proto, line 462 at r1 (raw file): Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
OK I'm fine with that. Changed it to pkg/storage/replica.go, line 5209 at r1 (raw file): Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Done. pkg/storage/replica.go, line 5273 at r1 (raw file): Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Yes, happens all the time if the batch affects keys which have newer timestamps from the timestamp cache. I improved the comment. Comments from Reviewable |
@nvanbenschoten I realize we can actually avoid retries on non fast path (non-1PC) transactions as well with this knowledge about having no refresh spans. |
057d20f
to
662e706
Compare
Looks like this should reference (and potentially close)
#15797.
On Fri, Feb 2, 2018 at 6:35 PM Spencer Kimball ***@***.***> wrote:
@nvanbenschoten <https://github.com/nvanbenschoten> I realize we can
actually avoid retries on non fast path (non-1PC) transactions as well with
this knowledge about having no refresh spans.
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
<#22315 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AE135GeShxSMu-G0QQYfE6kd32aWxBrUks5tQ5ujgaJpZM4R2cRc>
.
--
…-- Tobias
|
Fixes #15797 With this change,
Previous sha's performance with same options has:
|
Nice!
…On Fri, Feb 2, 2018, 7:15 PM Spencer Kimball ***@***.***> wrote:
Fixes #15797 <#15797>
With this change, kv --concurrency=8 --cycle-length=1 exhibits:
_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
1s 0 2855.9 2855.9 2.8 3.4 5.0 6.8
2s 0 2950.1 2903.0 2.8 3.0 4.7 6.8
3s 0 2886.7 2897.6 2.8 3.3 5.8 7.9
4s 0 2899.1 2898.0 2.8 3.1 5.8 7.3
5s 0 2937.1 2905.8 2.6 3.0 6.3 8.9
6s 0 2915.8 2907.5 2.8 3.0 6.6 7.1
7s 0 2862.1 2901.0 2.8 3.1 7.1 11.5
8s 0 2717.8 2878.1 2.8 3.9 8.9 15.2
9s 0 2750.0 2863.8 2.8 3.5 8.1 11.5
10s 0 2872.2 2864.7 2.8 3.1 8.4 9.4
_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
10.0s 0 28649 2864.7 2.8 2.8 3.3 6.8 15.2
Previous sha's performance with same options has:
_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
1s 0 462.7 462.7 11.5 48.2 62.9 79.7
2s 0 449.4 456.0 16.3 46.1 65.0 88.1
3s 0 441.6 451.2 16.8 48.2 58.7 79.7
4s 0 458.0 452.9 15.7 46.1 52.4 75.5
5s 0 438.2 450.0 16.8 48.2 60.8 71.3
6s 0 451.4 450.2 10.0 46.1 56.6 71.3
7s 0 361.1 437.5 22.0 56.6 75.5 109.1
8s 0 416.1 434.8 8.9 52.4 75.5 96.5
9s 0 440.3 435.5 13.6 50.3 58.7 62.9
10s 0 451.4 437.0 9.4 46.1 58.7 79.7
_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
10.0s 0 4371 437.0 18.3 14.2 48.2 65.0 109.1
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#22315 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AE135F1R0cL33Zg1owPIaH9YoVlwSra5ks5tQ6U2gaJpZM4R2cRc>
.
--
-- Tobias
|
Yeah, that's a good point. We can always allow transaction timestamps to be pushed forward to avoid retries up until the timestamp is leaked for the first time. To make that change I think we'll need to move the Reviewed 4 of 6 files at r2. pkg/roachpb/data.go, line 1145 at r2 (raw file):
Why was this suddenly caught? Did part of this change cause it to become more important?
Also, It's unfortunate that the discrepancy between Comments from Reviewable |
@spencerkimball, do you mind trying this change out on TPC-C as well? #15797 (comment) implies that TPC-C might also be affected by an approach like this. |
Reviewed 1 of 7 files at r1, 6 of 6 files at r2. pkg/kv/txn_coord_sender.go, line 464 at r2 (raw file):
While this is all fresh in your mind, make sure there is commentary somewhere about why we can't do this for snapshot transactions. (is it just because we don't accumulate refresh spans for non-serializable transactions or is there a deeper reason? If we weren't thinking about getting rid of snapshot completely, it might be worth saving refresh spans just to enable this optimization) Maybe setting the isolation to snapshot should set RefreshValid to false so we don't need the separate IsSerializable check here. pkg/kv/txn_coord_sender.go, line 466 at r2 (raw file):
Can the DistSender split the batch below this point? We need to make sure to clear the CanForwardCommitTimestamp flag if that happens. pkg/storage/replica.go, line 5238 at r2 (raw file):
If pkg/storage/replica.go, line 5487 at r2 (raw file):
I think some comments about how we define "returning early" here (i.e. the end of the batch) would be helpful. This is a heuristic that may not always be optimal. Finishing the batch helps us choose the right timestamp in the event that there are multiple conflicts and the client retry will end up writing to the same keys the next time around. If the client behavior changes on retries then laying down the intents on this attempt was wasted work. If running the rest of the batch is beneficial, then the same logic might argue for running the entire remainder of the transaction. Returning a WriteTooOldError at the end of the batch is simply choosing a local minimum in the cost/benefit curve, since the cost of continuing the transaction is much less when we're not requiring additional round trips. Comments from Reviewable |
@nvanbenschoten are there instructions for running TPC-C? Review status: all files reviewed at latest revision, 5 unresolved discussions, all commit checks successful. pkg/kv/txn_coord_sender.go, line 464 at r2 (raw file): Previously, bdarnell (Ben Darnell) wrote…
We could do this for snapshot transactions, but I don't think we should. As you point out, we don't accumulate the refresh spans for SNAPSHOT. Not sure it makes sense to add any commentary considering the reason is so straightforward. pkg/kv/txn_coord_sender.go, line 466 at r2 (raw file): Previously, bdarnell (Ben Darnell) wrote…
It can. Originally the name of this flag was pkg/roachpb/data.go, line 1145 at r2 (raw file): Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I added this addition for some unittest cases. It never fires in real life. But turns out that when I first added it, just using pkg/storage/replica.go, line 5238 at r2 (raw file): Previously, bdarnell (Ben Darnell) wrote…
You can get more than one pkg/storage/replica.go, line 5487 at r2 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Added to the comment. An important part of the story here is when this returning early situation is triggered by a conditional put (increment is not even really a factor in practice). On a conditional put, the chance that the retry will succeed is pretty slim, and so laying down intents is almost always going to be a bad idea. The reason cputs are likely to fail is that they're often either based on a read of the underlying value which will change on the forwarded retry timestamp, or they're expecting an empty value, which won't be true on the retry. Comments from Reviewable |
Introduce a new `NoRefreshSpans` field to the `EndTransactionRequest` arguments. This specifies that a serializable isolation transaction has encountered no refresh spans. On a 1PC commit, this can be used to avoid serializable restarts by re-executing the 1PC transaction at an appropriately higher timestamp in the event of the timestamp being forwarded by the timestamp cache or because of write-too-old errors. When evaluating a write batch, we now allow a local retry for write too old errors for non-transactional batches, and for serializable 1PC txns where `NoRefreshSpans` is true. Release note: None
662e706
to
c6f9b92
Compare
@spencerkimball yes, the load generator is here. You should just be able to run Review status: 1 of 7 files reviewed at latest revision, 5 unresolved discussions, some commit checks failed. Comments from Reviewable |
FWIW |
The results are not appreciably different for the |
Review status: 1 of 7 files reviewed at latest revision, 5 unresolved discussions, all commit checks successful. pkg/kv/txn_coord_sender.go, line 466 at r2 (raw file):
Then don't we need to clear this flag (whatever it's called) when DistSender splits the batch? The sub-batches could create refresh spans but the EndTransaction sent on its own would still have NoRefreshSpans set. pkg/roachpb/api.proto, line 463 at r2 (raw file):
What happened to the part about leaking a commit timestamp to the SQL client? Don't we still need to do that? Comments from Reviewable |
Review status: 1 of 7 files reviewed at latest revision, 6 unresolved discussions, all commit checks successful. pkg/roachpb/api.proto, line 463 at r2 (raw file): Previously, bdarnell (Ben Darnell) wrote…
On a related note, I want to check that we're good on other txns that care about their timestamps particularly - the incremental backup roll-up txns. Are those unaffected by this series of changes? Comments from Reviewable |
Review status: 1 of 7 files reviewed at latest revision, 6 unresolved discussions, all commit checks successful. pkg/roachpb/api.proto, line 463 at r2 (raw file): Previously, andreimatei (Andrei Matei) wrote…
Comments from Reviewable |
The txn start time was not always the txn commit time, as with SNAPSHOT txns. So Comments from Reviewable |
yeah I am not concerned about snapshot. We never pretended to do anything correct with it anyway :) Meanwhile, I think we can defer working on this until 2.1 work starts, assuming we clearly document what is going on. A user who wishes to preserve the pre-optimization semantics can tweak their SQL code to force the txn to become 2PC, in which case I believe the previous behavior holds (is that correct?) |
I must have missed context here: why does Review status: 1 of 7 files reviewed at latest revision, 6 unresolved discussions, all commit checks successful. Comments from Reviewable |
test monotonic inserts: There are variants of this that use multiple tables that check this is also true even when the txns do not conflict on writes (just on reads). |
That monotonic insert test as written above however does not use 1PC (because it needs to read the max) so it's not impacted. But I wonder if there are other possible equivalent tests that can use 1PC and that would be impacted. |
This particular PR is about 1PC, but the broader set of changes affects all txns. |
And just to be clear, a disclaimer: please take my comments above with a grain of salt, it's been a while since I last checked what these tests were actually testing. I recall we went through multiple iterations back then to properly define what tests were correct (i.e. the behavior they were testing was strictly required from the most general definition of serializable isolation). |
Ok, this is vastly different than saying |
Yes that is why above I said "only true if the txns have touching keys" |
It might be nice, but I wouldn't prioritize it. The current issue about commit timestamps changing applies to both one- and two-phase transactions. Review status: 1 of 7 files reviewed at latest revision, 6 unresolved discussions, all commit checks successful. pkg/kv/txn_coord_sender.go, line 466 at r2 (raw file): Previously, spencerkimball (Spencer Kimball) wrote…
I don't understand. If the batch is split, then the first parts of the batch could create refresh spans, invalidating the NoRefreshSpans flag on the EndTransactionRequest. (I'm talking about the splitting by type of pkg/roachpb/api.proto, line 463 at r2 (raw file):
But unless you're in snapshot isolation, if there's a change in the timestamp the transaction would restart and the SQL layer would see the updated OrigTimestamp. I'm not too concerned personally about most of the SQL timestamp functions (although I remember there was a lot of consideration given to what we should provide here, so maybe I'm forgetting something important). However, Comments from Reviewable |
Review status: 1 of 7 files reviewed at latest revision, 6 unresolved discussions, all commit checks successful. pkg/kv/txn_coord_sender.go, line 466 at r2 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Well the code in this particular PR only executes on the 1PC path, so the batch could not have been split. The change I'm working on now will allow replica-side retry optimization, and that change resets the pkg/roachpb/api.proto, line 463 at r2 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Looks like we're having to handle this case after all. The fixes are in an upcoming change. Comments from Reviewable |
Review status: 1 of 7 files reviewed at latest revision, 6 unresolved discussions, all commit checks successful. pkg/kv/txn_coord_sender.go, line 466 at r2 (raw file):
Ah, that's what I was missing.
OK. That change is going to cause problems for mixed-version clusters, since nodes that only have this commit will be incorrectly sending NoRefreshSpans on EndTransactionRequests that have been split. An easy way to get around that would be to change the tag number for the NoRefreshSpans proto field. Comments from Reviewable |
pkg/kv/txn_coord_sender.go, line 466 at r2 (raw file): Previously, bdarnell (Ben Darnell) wrote…
I didn't realize we needed to handle mixed-version clusters on these unstable point releases. I don't think that should be a requirement, especially for a problem like this. Comments from Reviewable |
Review status: 1 of 7 files reviewed at latest revision, 6 unresolved discussions, all commit checks successful. pkg/kv/txn_coord_sender.go, line 466 at r2 (raw file): Previously, spencerkimball (Spencer Kimball) wrote…
We do rolling upgrades of our continuous and test clusters, and sometimes tell customers to do the same. We generally don't do this with critical data, but it's still good to get into the habit of compatibility between versions at any time. (In this case, the best solution would have been to clear the NoRefreshSpans on split in the original commit, but now that it's there and not cleared I think we should change the tag number to preserve compatibility) Comments from Reviewable |
ts := ba.Timestamp | ||
if br != nil { | ||
ts = br.Timestamp |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@spencerkimball or @bdarnell, do you know what makes this safe in relation to lease transfers? Specifically, a batch goes through the following steps:
1. update the local node's clock with `ba.Timestamp` when passing through the Store
2. under lock, we check whether a lease transfer is in flight
2a. if so, wait for the lease to stabilize
3. if not, evaluate the batch without ever coordinating with possible lease transfers again
4. bump timestamp cache with batch's timestamp (this code)
From the lease transfers point of view:
1. lock
2. grab a local clock reading
3. start transfer with that time, which will be used as the new leaseholder's tscache lower bound
4. unlock
Since this timestamp is larger than the original ba.Timestamp
, isn't it possible that it's contribution to the timestamp cache is missed? Are we relying on some implicit assumption that this timestamp will never be bumped past clock.Now()
? Is that a fair assumption to make in all cases? Are we assuming that this timestamp will always be based on existing data on the Range, which in turn must be at a lower timestamp than the clock on the Store that houses it? If so, I don't see how we're confident about that conclussion, given that we only set transactional batches' timestamps to the txn's orig timestamp (where it reads) and we don't include a timestamp when resolving intents.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think we're implicitly relying on an assumption that any KV data that might bump the response timestamp will have been through some path that would have updated the HLC. I don't know whether that's always valid. It sounds like we might be missing some cases to pull timestamps out of the request (like in intent resolution).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's safe because the br.Timestamp
is only ever forwarded to the timestamp of a key written to the current range. In other words, the local clock must already be at or past any value of key written to the range. You might want to augment the comment here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This thought experiment blossomed into the beautiful issue we see in #36431. Over there we note that even if we did update the leaseholder's HLC clock on intent resolution, we'd still run into issues with transaction uncertainty when intents are moved forward.
We'll need to address that issue by enforcing clock updates during txn refreshes. In addition, we should do more to ensure that clock updates are never missed, even when BatchRequests
don't explicitly set a header timestamp. Trying to address this on a case-by-case basis (e.g. #35297 and #36307) seems like a fraught effort. There has been some discussion about adding a gRPC interceptor to deal with clock updates at a lower-level than in the BatchRequest
domain.
This new assertion would fire if bug like the one hypothesized in the following comment exist: cockroachdb#22315 (comment) Release note: None
This new assertion would fire if bug like the one hypothesized in the following comment exist: cockroachdb#22315 (comment) Release note: None
35297: storage: ensure that PushTxn request's timestamp cache updates are safe r=nvanbenschoten a=nvanbenschoten This PR's overall goal is to ensure that the timestamp cache updates performed by `PushTxn` requests are always safe. It does this in a series of steps: 1. it makes its `PushTo` argument inclusive, making it easier to use and easier to assert against (e.g. `req.PushTo <= req.Timestamp`). 2. it removes the `Now` argument and begins using the batch's `Timestamp` field instead. This field is used to update the receiving store's clock _before_ the request is evaluated and before the lease transfer check is performed. 3. it adds an assertion that all timestamp cache updates are safe given their local clock. This will also catch hypothesized bugs like #22315 (comment). I'm planning on getting this in to 19.1. Co-authored-by: Nathan VanBenschoten <[email protected]>
This fixes a bug introduced in cockroachdb#22315 with the introduction of a "local retry" to avoid WriteTooOldError inside of 1PC transactions that were not carrying out any reads. Such a transaction can simply be re-evaluated at a higher timestamp, which can allow it to commit. The bug was that when such a re-evaluation was carried out, we were not discarding the MVCCStats accrued from the first attempt. In effect, the request would thus be double-counted in stats, which would set off the consistency checker. The fix is easy: discard the delta before re-evaluating. Fixes cockroachdb#31870. Release note (bug fix): remove a source of (benign) stats inconsistencies (i.e. the stats of a range not accurately reflecting its contents).
41983: engine: don't create aux dir for read-only engine r=petermattis a=tbg It was mildly irritating that ./cockroach debug raft-log not/there 1 would create `not/there/auxiliary`. Release note: None 41986: storage: fix stats inconsistency due to WriteTooOld optimization r=irfansharif a=tbg This fixes a bug introduced in #22315 with the introduction of a "local retry" to avoid WriteTooOldError inside of 1PC transactions that were not carrying out any reads. Such a transaction can simply be re-evaluated at a higher timestamp, which can allow it to commit. The bug was that when such a re-evaluation was carried out, we were not discarding the MVCCStats accrued from the first attempt. In effect, the request would thus be double-counted in stats, which would set off the consistency checker. The fix is easy: discard the delta before re-evaluating. Fixes #31870. Release note (bug fix): remove a source of (benign) stats inconsistencies (i.e. the stats of a range not accurately reflecting its contents). Co-authored-by: Tobias Schottdorf <[email protected]>
This fixes a bug introduced in cockroachdb#22315 with the introduction of a "local retry" to avoid WriteTooOldError inside of 1PC transactions that were not carrying out any reads. Such a transaction can simply be re-evaluated at a higher timestamp, which can allow it to commit. The bug was that when such a re-evaluation was carried out, we were not discarding the MVCCStats accrued from the first attempt. In effect, the request would thus be double-counted in stats, which would set off the consistency checker. The fix is easy: discard the delta before re-evaluating. Fixes cockroachdb#31870. Release note (bug fix): remove a source of (benign) stats inconsistencies (i.e. the stats of a range not accurately reflecting its contents).
Introduce a new
NoRefreshSpans
field to theEndTransactionRequest
arguments. This specifies that a serializable isolation transaction has
encountered no refresh spans. On a 1PC commit, this can be used to
avoid serializable restarts by re-executing the 1PC transaction at an
appropriately higher timestamp in the event of the timestamp being
forwarded by the timestamp cache or because of write-too-old errors.
When evaluating a write batch, we now allow a local retry for write
too old errors for non-transactional batches, and for serializable
1PC txns where
NoRefreshSpans
is true.Release note: None