
kv: closed timestamp can starve read-write txns that take longer than 600ms #51294

Closed · nvanbenschoten opened this issue Jul 10, 2020 · 6 comments · Fixed by #52885
@nvanbenschoten (Member)

Now that we've dropped kv.closed_timestamp.target_duration down to 3s, we've seen an increase in read-write mutations that get starved by the closed timestamp. The earlier understanding was that these transactions would only be bumped by the closed timestamp if they took over 3s to commit, and that, if bumped, they would then be given another 3s to refresh and complete their final batch. The first part is true, but the second part was a misunderstanding.

If a transaction takes more than 3s, it will be bumped up to the current closed timestamp's value. It will then hit a retry error on its EndTxn batch. It will refresh at the old closed timestamp value and then try to commit again. If this commit is again bumped by the closed timestamp, we can enter a refresh loop that terminates after 5 attempts and kicks the error back to the client. Since we're only ever refreshing up to the current closed timestamp, the transaction actually only has kv.closed_timestamp.target_duration * kv.closed_timestamp.close_fraction = 600ms to refresh and commit.
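For reference, a back-of-the-envelope sketch of that refresh budget (the 0.2 close fraction is assumed here; it is what the 600ms figure above implies for a 3s target duration):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Cluster settings discussed above.
	targetDuration := 3 * time.Second // kv.closed_timestamp.target_duration
	closeFraction := 0.2              // kv.closed_timestamp.close_fraction (assumed)

	// A txn that only ever refreshes up to the current closed timestamp has
	// roughly one closed timestamp interval to refresh and commit before it
	// is bumped again.
	refreshBudget := time.Duration(float64(targetDuration) * closeFraction)
	fmt.Println(refreshBudget) // 600ms
}
```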

The easiest way to reproduce this is to perform a large DELETE statement in an implicit txn. If the DELETE takes long enough, it can get stuck in a starvation loop. This looks something like:

1. read all rows: Batch{Scan, Scan, ...}
2. try to delete all rows and commit: Batch{Delete, Delete, ..., EndTxn}
3. first two steps take over 3s so batch is pushed to `now - 3s` and hits a retry error when evaluating its EndTxn
4. refresh @ now - 3s
5. retry Batch{Delete, Delete, ..., EndTxn}
6. steps 4 and 5 took over 600ms so batch is pushed to `now - 3s`
7. retries up to 5 times before giving up and either returning the error to the client or retrying in the conn_executor
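For concreteness, a minimal reproduction sketch in Go; the connection string and table name below are placeholders (not taken from the issue), and any table large enough that the DELETE outlives the closed timestamp target should do:

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // Postgres-wire driver; works against CockroachDB
)

func main() {
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// A single statement outside BEGIN/COMMIT runs as an implicit transaction.
	// If the scan + delete takes longer than kv.closed_timestamp.target_duration,
	// the txn can enter the refresh/retry loop described in the steps above.
	if _, err := db.Exec(`DELETE FROM kv WHERE true`); err != nil {
		log.Fatal(err)
	}
}
```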

This starvation is clearly undesirable, but it doesn't seem fundamental. There seem to be a few different things we could do here to improve the situation:

  1. if the closed timestamp bumped victims up to the present time instead of its current value, requests would be given more time to refresh and complete: 3s instead of 600ms. This would avoid a large class of these issues, in exchange for increasing the chance that refreshes fail under contention and increasing the contention footprint of transactions that get bumped by the closed timestamp.
  2. we noticed in #36478 (storage: adaptive closed timestamp target duration) that we don't need to respect the closed timestamp when re-writing intents. This means that we could avoid getting bumped by the closed timestamp in step 5 when retrying the batch of DELETEs. OR, we could revive an optimization we had previously removed and avoid issuing that prefix of the batch entirely after the refresh, since it had already succeeded. We probably don't want to do this second part because it gets confusing and may not be as useful in the context of parallel commits.
  3. we also noticed in #36478 that EndTxn requests don't need to respect the closed timestamp. This means that we could ignore the closed timestamp entirely when issuing an EndTxn.

So in an optimal world, step 5 would be able to ignore the closed timestamp entirely as long as its intents were previously written in step 2.

nvanbenschoten added the C-enhancement, S-3-ux-surprise, and A-kv-transactions labels on Jul 10, 2020
@aayushshah15 (Contributor) commented Jul 29, 2020
While stressing kvnemesis with merges enabled for #50265 (though I can reproduce this failure mode with and without that PR), I ran into timeouts that seem to be caused by roughly what this issue describes.

// kvnemesis found a rare bug with closed timestamps when merges happen
// on a multinode cluster. Disable the combo for now to keep the test
// from flaking. See #44878.
config.Ops.Merge = MergeConfig{}

I was using a relatively low closed timestamp duration (500ms) because that was more likely to lead to scenarios like the one that PR fixes. I noticed that txns that scanned a lot of keys and then wrote something would refresh/retry a lot (what this issue talks about). However, the logs also indicate that a ton of time was being spent pushing conflicting transactions.
[Screenshot: log excerpt showing substantial time spent pushing conflicting transactions]

The logs also indicate that the sort of transactions that can end up in this endless retry loop tend to have long queues of pusher txns waiting on them. I saw over 1000 pending txns in my logs waiting on one such endlessly-retrying txn (the highest epoch I saw for it was ~50).

Maybe I'm missing something here, but I don't see anything preventing this sort of transitive starvation (of the waiting pushers). The select case in MaybeWaitForPush queries for the pushee's state every 5 seconds and just loops back if the pushee hasn't been committed or aborted. It doesn't care whether the pushee has already had its timestamp bumped somehow.
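To illustrate the polling pattern being described (hypothetical types, not the actual txnwait code), a pushee that merely has its timestamp bumped, rather than being committed or aborted, never unblocks its pushers:

```go
package txnwaitsketch

import (
	"context"
	"time"
)

type TxnStatus int

const (
	Pending TxnStatus = iota
	Committed
	Aborted
)

// waitForPushee polls the pushee's record on an interval (5s per the comment
// above) and returns only once the pushee is finalized. A pushee that keeps
// retrying at bumped timestamps stays Pending, so every pusher queued behind
// it keeps looping here.
func waitForPushee(ctx context.Context, queryPushee func() TxnStatus) error {
	ticker := time.NewTicker(5 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			if st := queryPushee(); st == Committed || st == Aborted {
				return nil
			}
			// No case reacts to the pushee's timestamp having been bumped;
			// only finalization matters here.
		}
	}
}
```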

When I don’t use a low closed timestamp, I can’t reproduce this failure mode.

Edit: Here is a log file from one such run in case someone is interested:
lot_of_pushers_in_wait_queue.zip
(apologies for the extremely verbose logs, the pushee that a ton of pushers were waiting on has txn id: c4a330aa)

@andreimatei (Contributor)

> Maybe I'm missing something here, but I don't see anything preventing this sort of transitive starvation (of the waiting pushers). The select case in MaybeWaitForPush queries for the pushee's state every 5 seconds and just loops back if the pushee hasn't been committed or aborted. It doesn't care whether the pushee has already had its timestamp bumped somehow.

This is tangential to the issue, but I would have guessed that readers at lower timestamps get unblocked when the pushee bumps its timestamp. Is that not the case?

@aayushshah15 (Contributor)

It's not happening in that select loop in MaybeWaitForPush, but I don't know if it is supposed to happen elsewhere in the code.

@andreimatei (Contributor)

@nvanbenschoten what's the story with -^ ?
As you've recently shown me, the lock table unblocks lower readers when it finds out that a conflicting txn's timestamp has been pushed. But what about the txnWaitQueue? If a pushee's timestamp changes, is the txn at the head of the txnWaitQueue supposed to react to that? And is the info supposed to make it through the txnWaitQueue to all the lock tables everywhere?

@nvanbenschoten (Member, Author)

If the pushee's timestamp changes due to a push then yes, all txns in the txnWaitQueue will react to it. The queue reacts to calls to OnTransactionUpdated, which triggers a broadcast to all pushers here.

However, if the pushee gets bumped due to the closed timestamp or the timestamp cache while issuing one of its requests, then no one in the txnWaitQueue will hear about it. We could potentially feed this information to the transaction record during txn heartbeats, but that still might be up to a second later.

> And is the info supposed to make it through the txnWaitQueue to all the lock tables everywhere?

Yes, anyone waiting in the txnWaitQueue is also waiting at the head of a lock wait-queue in a lockTable. See this diagram. So if their push succeeds, they'll update the corresponding lock.
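For illustration, a rough sketch of that broadcast behavior with hypothetical names (not the actual txnwait.Queue API): a successful push against the pushee's transaction record wakes every queued pusher at once.

```go
package txnwaitsketch

import "sync"

// pusherQueue is a toy stand-in for the per-transaction wait queue.
type pusherQueue struct {
	mu      sync.Mutex
	cond    *sync.Cond
	updated bool
}

func newPusherQueue() *pusherQueue {
	q := &pusherQueue{}
	q.cond = sync.NewCond(&q.mu)
	return q
}

// OnTransactionUpdated is called when a push against the pushee's transaction
// record succeeds; it broadcasts to every waiting pusher.
func (q *pusherQueue) OnTransactionUpdated() {
	q.mu.Lock()
	q.updated = true
	q.mu.Unlock()
	q.cond.Broadcast()
}

// wait blocks a pusher until the pushee's record is updated by a push. If the
// pushee is instead bumped server-side (closed timestamp, timestamp cache),
// nothing calls OnTransactionUpdated and the pusher stays blocked.
func (q *pusherQueue) wait() {
	q.mu.Lock()
	defer q.mu.Unlock()
	for !q.updated {
		q.cond.Wait()
	}
}
```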

@nvanbenschoten (Member, Author)

We need to do something here for v20.2 and maybe consider backporting something to v20.1. I suggest the following:

  1. Don't bump EndTxn requests up to the closed timestamp (option 3 from above). It turns out that this is already the case, so there's nothing to actually change here. In applyTimestampCache, we only bump the batch by the closed timestamp if it contains a request with the consultsTSCache flag. Since the introduction of CanCreateTxnRecord, EndTxn no longer carries this flag.
  2. Split EndTxn from the rest of its batch after 1 (or 2?) successful refreshes in the txnSpanRefresher. The idea here is that after a successful refresh, we want to make sure we make some forward progress with a batch of writes and an EndTxn. What we see in this issue is that because the EndTxn throws an error due to the closed timestamp bump (on the writes earlier in the batch), it prevents the writes earlier in the batch from ever succeeding. This means that they continue to be included in the batch and continue to hit the closed timestamp after each refresh. So the proposal is to issue the writes separately and let them succeed before issuing the EndTxn (a sketch of this split follows the list). There is precedent for this – we do something similar, for similar but not identical reasons, in the DistSender after a transaction has restarted at least once.
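A minimal sketch of the proposed split, using hypothetical request and batch types rather than CockroachDB's actual KV API:

```go
package txnsketch

// Request and Batch are simplified stand-ins for the real KV request types.
type Request struct {
	Method string // e.g. "Delete", "EndTxn"
}

type Batch struct {
	Requests []Request
}

// splitEndTxn separates a committing batch into its write prefix and a batch
// holding only the trailing EndTxn. After a successful refresh, the prefix is
// sent first so the writes lay down intents and make progress even if they
// get pushed; the EndTxn then follows (refreshing again if necessary).
func splitEndTxn(b Batch) (prefix Batch, endTxn *Batch) {
	n := len(b.Requests)
	if n == 0 || b.Requests[n-1].Method != "EndTxn" {
		return b, nil // nothing to split
	}
	return Batch{Requests: b.Requests[:n-1]}, &Batch{Requests: b.Requests[n-1:]}
}
```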

With these in place, the timeline of the offending query would look something like the following and we would avoid the indefinite starvation regardless of the closed timestamp duration:

1. read all rows: Batch{Scan, Scan, ...}
2. try to delete all rows and commit: Batch{Delete, Delete, ..., EndTxn}
3. first two steps take over 3s so batch is pushed to `now - 3s` and hits a retry error when evaluating its EndTxn
4. refresh @ now - 3s
5. retry Batch{Delete, Delete, ...}. Gets pushed but succeeds at a higher timestamp
6. retry EndTxn, throws error during evaluation
7. refresh @ now - 3s
8. retry EndTxn, succeeds

Option 1 from above is also something we can consider, although we'd need to weigh its trade-offs.

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Aug 16, 2020
Relates to cockroachdb#51294.

This commit adds logic to the txnSpanRefresher to preemptively perform
refreshes before issuing requests when doing so is guaranteed to be
beneficial. We perform a preemptive refresh if either a) doing so would
be free because we have not yet accumulated any refresh spans, or b) the
batch contains a committing EndTxn request that we know will be rejected
if issued.

The first case is straightforward. If the transaction has yet to perform
any reads but has had its write timestamp bumped, refreshing is a
trivial no-op. In this case, refreshing eagerly prevents the transaction
from performing any future reads at its current read timestamp. Not doing
so preemptively guarantees that we will need to perform a real refresh
in the future if the transaction ever performs a read. At best, this
would be wasted work. At worst, this could result in the future refresh
failing. So we might as well refresh preemptively while doing so is
free.

Note that this first case here does NOT obviate the need for server-side
refreshes. Notably, a transaction's write timestamp might be bumped in
the same batch in which it performs its first read. In such cases, a
preemptive refresh would not be needed but a reactive refresh would not
be a trivial no-op. These situations are common for one-phase commit
transactions.

The second case is more complex. If the batch contains a committing
EndTxn request that we know will need a refresh, we don't want to bother
issuing it just for it to be rejected. Instead, preemptively refresh
before issuing the EndTxn batch. If we view reads as acquiring a form of
optimistic read locks under an optimistic concurrency control scheme (as
is discussed in the comment on txnSpanRefresher) then this preemptive
refresh immediately before the EndTxn is synonymous with the
"validation" phase of a standard OCC transaction model. However, as an
optimization compared to standard OCC, the validation phase is only
performed when necessary in CockroachDB (i.e. if the transaction's
writes have been pushed to higher timestamps).

This second case will play into the solution for cockroachdb#51294. Now that we
perform these preemptive refreshes when possible, we know that it is
always the right choice to split off the EndTxn from the rest of the
batch during a txnSpanRefresher auto-retry. Without this change, it was
unclear whether the first refresh of an EndTxn batch was caused by
earlier requests in the transaction or by other requests in the current
batch. Now, it is always caused by other requests in the same batch, so
it is always clear that we should split the EndTxn from the rest of the
batch immediately after the first refresh.

Release notes (performance improvement): validation of optimistic reads
is now performed earlier in transactions when doing so can save work.
This eliminates certain types of transaction retry errors and avoids
wasted RPC traffic.
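As a rough sketch of the two refresh conditions described in the commit message above (hypothetical fields and scalar timestamps, not the real txnSpanRefresher state):

```go
package txnsketch

// refresherState is a simplified stand-in for the txn coordinator's view of
// its transaction.
type refresherState struct {
	ReadTimestamp  int64 // simplified scalar timestamps
	WriteTimestamp int64
	RefreshSpans   int // read spans a refresh would have to re-validate
}

// canRefreshFree reports whether bumping the read timestamp validates nothing
// and therefore cannot fail.
func canRefreshFree(s refresherState) bool {
	return s.RefreshSpans == 0
}

// shouldRefreshPreemptively reports whether to refresh before sending a batch:
// (a) the refresh is free, or (b) the batch commits the transaction and would
// otherwise be rejected server-side with a retry error.
func shouldRefreshPreemptively(s refresherState, batchCommitsTxn bool) bool {
	if s.WriteTimestamp <= s.ReadTimestamp {
		return false // timestamps agree; nothing to refresh away
	}
	return canRefreshFree(s) || batchCommitsTxn
}
```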
nvanbenschoten self-assigned this on Aug 17, 2020
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Aug 17, 2020
Fixes cockroachdb#51294.
First two commits from cockroachdb#52884.

This commit updates the txnSpanRefresher to split off EndTxn requests
into their own partial batches on auto-retries after successful
refreshes as a means of preventing starvation. This avoids starvation in
two ways. First, it helps ensure that we lay down intents if any of the
other requests in the batch are writes. Second, it ensures that if any
writes are getting pushed due to contention with reads or due to the
closed timestamp, they will still succeed and allow the batch to make
forward progress. Without this, each retry attempt may get pushed
because of writes in the batch and then rejected wholesale when the
EndTxn tries to evaluate the pushed batch. When split, the writes will
be pushed but succeed, the transaction will be refreshed, and the EndTxn
will succeed.

I still need to confirm that this fixes this indefinite stall
[here](https://github.com/cockroachlabs/misc_projects_glenn/tree/master/rw_blockage#implicit-query-hangs--explict-query-works),
but I suspect that it will.

Release note (bug fix): A change in v20.1 caused a certain class of bulk
UPDATEs and DELETE statements to hang indefinitely if run in an implicit
transaction. We now break up these statements to avoid starvation and
prevent them from hanging indefinitely.
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Aug 21, 2020
Relates to cockroachdb#51294. (Commit message identical to the Aug 16 commit above.)
craig bot pushed a commit that referenced this issue Aug 21, 2020
52884: kv: preemptively refresh transaction timestamps r=nvanbenschoten a=nvanbenschoten

Relates to #51294. (PR description identical to the commit message above. The same batch merge also included unrelated PRs #52881, #53074, and #53086.)

Co-authored-by: Nathan VanBenschoten <[email protected]>
craig bot pushed a commit that referenced this issue Aug 21, 2020
52885: kv: split EndTxn into sub-batch on auto-retry after successful refresh r=nvanbenschoten a=nvanbenschoten

Fixes #51294.

(PR description identical to the commit message above.)

Co-authored-by: Nathan VanBenschoten <[email protected]>
craig bot closed this as completed in cc233c5 on Aug 21, 2020
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Aug 31, 2020
Fixes cockroachdb#51294. (Commit message identical to the Aug 17 commit above.)