
kv: closed timestamp can starve read-write txns that take longer than 600ms #51294

Closed · nvanbenschoten opened this issue Jul 10, 2020 · 6 comments · Fixed by #52885
@nvanbenschoten (Member)

Now that we've dropped kv.closed_timestamp.target_duration down to 3s, we've seen an increase in read-write mutations that get starved by the closed timestamp. The earlier understanding was that these transactions would only be bumped by the closed timestamp if they took over 3s to commit, and that, if bumped, they would then be given another 3s to refresh and complete their final batch. The first part is true, but the second part was a misunderstanding.

If a transaction takes more than 3s, it will be bumped up to the current closed timestamp's value. It will then hit a retry error on its EndTxn batch. It will refresh at the old closed timestamp value and then try to commit again. If this commit is again bumped by the closed timestamp, we can enter a refresh loop that terminates after 5 attempts and kicks the error back to the client. Since we're only ever refreshing up to the current closed timestamp, the transaction actually only has kv.closed_timestamp.target_duration * kv.closed_timestamp.close_fraction = 600ms to refresh and commit.
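For reference, a back-of-the-envelope sketch of that refresh budget (the 0.2 close fraction is assumed here; it is what the 600ms figure above implies for a 3s target duration):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Cluster settings discussed above.
	targetDuration := 3 * time.Second // kv.closed_timestamp.target_duration
	closeFraction := 0.2              // kv.closed_timestamp.close_fraction (assumed)

	// A txn that only ever refreshes up to the current closed timestamp has
	// roughly one closed timestamp interval to refresh and commit before it
	// is bumped again.
	refreshBudget := time.Duration(float64(targetDuration) * closeFraction)
	fmt.Println(refreshBudget) // 600ms
}
```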

The easiest way to reproduce this is to perform a large DELETE statement in an implicit txn. If the DELETE takes long enough, it can get stuck in a starvation loop. This looks something like:

1. read all rows: Batch{Scan, Scan, ...}
2. try to delete all rows and commit: Batch{Delete, Delete, ..., EndTxn}
3. first two steps take over 3s so batch is pushed to `now - 3s` and hits a retry error when evaluating its EndTxn
4. refresh @ now - 3s
5. retry Batch{Delete, Delete, ..., EndTxn}
6. steps 4 and 5 took over 600ms so batch is pushed to `now - 3s`
7. retries up to 5 times before giving up and either returning the error to the client or retrying in the conn_executor
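For concreteness, a minimal reproduction sketch in Go; the connection string and table name below are placeholders (not taken from the issue), and any table large enough that the DELETE outlives the closed timestamp target should do:

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // Postgres-wire driver; works against CockroachDB
)

func main() {
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// A single statement outside BEGIN/COMMIT runs as an implicit transaction.
	// If the scan + delete takes longer than kv.closed_timestamp.target_duration,
	// the txn can enter the refresh/retry loop described in the steps above.
	if _, err := db.Exec(`DELETE FROM kv WHERE true`); err != nil {
		log.Fatal(err)
	}
}
```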

This starvation is clearly undesirable, but it doesn't seem fundamental. There seem to be a few different things we could do here to improve the situation:

  1. if the closed timestamp bumped victims up to the present time instead of its current value, requests would be given more time to refresh and complete: 3s instead of 600ms. This would avoid a large class of these issues, in exchange for increasing the chance that refreshes fail under contention and increasing the contention footprint of transactions that get bumped by the closed timestamp.
  2. we noticed in #36478 (storage: adaptive closed timestamp target duration) that we don't need to respect the closed timestamp when re-writing intents. This means that we could avoid getting bumped by the closed timestamp in step 5 when retrying the batch of DELETEs. OR, we could revive an optimization we had previously removed and avoid issuing that prefix of the batch entirely after the refresh, since it had already succeeded. We probably don't want to do this second part because it gets confusing and may not be as useful in the context of parallel commits.
  3. we also noticed in #36478 that EndTxn requests don't need to respect the closed timestamp. This means that we could ignore the closed timestamp entirely when issuing an EndTxn.

So in an optimal world, step 5 would be able to ignore the closed timestamp entirely as long as its intents were previously written in step 2.

nvanbenschoten added the C-enhancement, S-3-ux-surprise, and A-kv-transactions labels on Jul 10, 2020
@aayushshah15 (Contributor) commented Jul 29, 2020
While stressing kvnemesis with merges enabled for #50265 (though I can reproduce this failure mode with and without that PR), I ran into timeouts that seem to be caused by roughly what this issue describes.

// kvnemesis found a rare bug with closed timestamps when merges happen
// on a multinode cluster. Disable the combo for now to keep the test
// from flaking. See #44878.
config.Ops.Merge = MergeConfig{}

I was using a relatively low closed timestamp duration (500ms) because that was more likely to lead to scenarios like the one that PR fixes. I noticed that txns that scanned a lot of keys and then wrote something would refresh/retry a lot (what this issue talks about). However, the logs also indicate that a ton of time was being spent pushing conflicting transactions.
[Screenshot: log excerpt showing substantial time spent pushing conflicting transactions]

The logs also indicate that the sort of transactions that can end up in this endless retry loop tend to have long queues of pusher txns waiting on them. I saw over 1000 pending txns in my logs waiting on one such endlessly-retrying txn (the highest epoch I saw for it was ~50).

Maybe I'm missing something here, but I don't see anything preventing this sort of transitive starvation (of the waiting pushers). The select case in MaybeWaitForPush queries for the pushee's state every 5 seconds and just loops back if the pushee hasn't been committed or aborted. It doesn't care whether the pushee has already had its timestamp bumped somehow.
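To illustrate the polling pattern being described (hypothetical types, not the actual txnwait code), a pushee that merely has its timestamp bumped, rather than being committed or aborted, never unblocks its pushers:

```go
package txnwaitsketch

import (
	"context"
	"time"
)

type TxnStatus int

const (
	Pending TxnStatus = iota
	Committed
	Aborted
)

// waitForPushee polls the pushee's record on an interval (5s per the comment
// above) and returns only once the pushee is finalized. A pushee that keeps
// retrying at bumped timestamps stays Pending, so every pusher queued behind
// it keeps looping here.
func waitForPushee(ctx context.Context, queryPushee func() TxnStatus) error {
	ticker := time.NewTicker(5 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			if st := queryPushee(); st == Committed || st == Aborted {
				return nil
			}
			// No case reacts to the pushee's timestamp having been bumped;
			// only finalization matters here.
		}
	}
}
```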

When I don’t use a low closed timestamp, I can’t reproduce this failure mode.

Edit: Here is a log file from one such run in case someone is interested:
lot_of_pushers_in_wait_queue.zip
(apologies for the extremely verbose logs, the pushee that a ton of pushers were waiting on has txn id: c4a330aa)

@andreimatei (Contributor)

> Maybe I'm missing something here, but I don't see anything preventing this sort of transitive starvation (of the waiting pushers). The select case in MaybeWaitForPush queries for the pushee's state every 5 seconds and just loops back if the pushee hasn't been committed or aborted. It doesn't care whether the pushee has already had its timestamp bumped somehow.

This is tangential to the issue, but I would have guessed that readers at lower timestamps get unblocked when the pushee bumps its timestamp. Is that not the case?

@aayushshah15 (Contributor)

It's not happening in that select loop in MaybeWaitForPush, but I don't know if it is supposed to happen elsewhere in the code.

@andreimatei (Contributor)

@nvanbenschoten what's the story with -^ ?
As you've recently shown me, the lock table unblocks lower readers when it finds out that a conflicting txn's timestamp has been pushed. But what about the txnWaitQueue? If a pushee's timestamp changes, is the txn at the head of the txnWaitQueue supposed to react to that? And is the info supposed to make it through the txnWaitQueue to all the lock tables everywhere?

@nvanbenschoten (Member, Author)

If the pushee's timestamp changes due to a push then yes, all txns in the txnWaitQueue will react to it. The queue reacts to calls to OnTransactionUpdated, which triggers a broadcast to all pushers here.

However, if the pushee gets bumped due to the closed timestamp or the timestamp cache while issuing one of its requests, then no one in the txnWaitQueue will hear about it. We could potentially feed this information to the transaction record during txn heartbeats, but that still might be up to a second later.

> And is the info supposed to make it through the txnWaitQueue to all the lock tables everywhere?

Yes, anyone waiting in the txnWaitQueue is also waiting at the head of a lock wait-queue in a lockTable. See this diagram. So if their push succeeds, they'll update the corresponding lock.
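For illustration, a rough sketch of that broadcast behavior with hypothetical names (not the actual txnwait.Queue API): a successful push against the pushee's transaction record wakes every queued pusher at once.

```go
package txnwaitsketch

import "sync"

// pusherQueue is a toy stand-in for the per-transaction wait queue.
type pusherQueue struct {
	mu      sync.Mutex
	cond    *sync.Cond
	updated bool
}

func newPusherQueue() *pusherQueue {
	q := &pusherQueue{}
	q.cond = sync.NewCond(&q.mu)
	return q
}

// OnTransactionUpdated is called when a push against the pushee's transaction
// record succeeds; it broadcasts to every waiting pusher.
func (q *pusherQueue) OnTransactionUpdated() {
	q.mu.Lock()
	q.updated = true
	q.mu.Unlock()
	q.cond.Broadcast()
}

// wait blocks a pusher until the pushee's record is updated by a push. If the
// pushee is instead bumped server-side (closed timestamp, timestamp cache),
// nothing calls OnTransactionUpdated and the pusher stays blocked.
func (q *pusherQueue) wait() {
	q.mu.Lock()
	defer q.mu.Unlock()
	for !q.updated {
		q.cond.Wait()
	}
}
```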

@nvanbenschoten (Member, Author)

We need to do something here for v20.2 and maybe consider backporting something to v20.1. I suggest the following:

  1. Don't bump EndTxn requests up to the closed timestamp (option 3 from above). It turns out that this is already the case, so there's nothing to actually change here. In applyTimestampCache, we only bump the batch by the closed timestamp if it contains a request with the consultsTSCache flag. Since the introduction of CanCreateTxnRecord, EndTxn no longer carries this flag.
  2. Split EndTxn from the rest of its batch after 1 (or 2?) successful refreshes in the txnSpanRefresher. The idea here is that after a successful refresh, we want to make sure we make some forward progress with a batch of writes and an EndTxn. What we see in this issue is that because the EndTxn throws an error due to the closed timestamp bump (on the writes earlier in the batch), it prevents the writes earlier in the batch from ever succeeding. This means that they continue to be included in the batch and continue to hit the closed timestamp after each refresh. So the proposal is to issue the writes separately and let them succeed before issuing the EndTxn (a sketch of this split follows the list). There is precedent for this – we do something similar, for similar but not identical reasons, in the DistSender after a transaction has restarted at least once.
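A minimal sketch of the proposed split, using hypothetical request and batch types rather than CockroachDB's actual KV API:

```go
package txnsketch

// Request and Batch are simplified stand-ins for the real KV request types.
type Request struct {
	Method string // e.g. "Delete", "EndTxn"
}

type Batch struct {
	Requests []Request
}

// splitEndTxn separates a committing batch into its write prefix and a batch
// holding only the trailing EndTxn. After a successful refresh, the prefix is
// sent first so the writes lay down intents and make progress even if they
// get pushed; the EndTxn then follows (refreshing again if necessary).
func splitEndTxn(b Batch) (prefix Batch, endTxn *Batch) {
	n := len(b.Requests)
	if n == 0 || b.Requests[n-1].Method != "EndTxn" {
		return b, nil // nothing to split
	}
	return Batch{Requests: b.Requests[:n-1]}, &Batch{Requests: b.Requests[n-1:]}
}
```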

With these in place, the timeline of the offending query would look something like the following and we would avoid the indefinite starvation regardless of the closed timestamp duration:

1. read all rows: Batch{Scan, Scan, ...}
2. try to delete all rows and commit: Batch{Delete, Delete, ..., EndTxn}
3. first two steps take over 3s so batch is pushed to `now - 3s` and hits a retry error when evaluating its EndTxn
4. refresh @ now - 3s
5. retry Batch{Delete, Delete, ...}. Gets pushed but succeeds at a higher timestamp
6. retry EndTxn, throws error during evaluation
7. refresh @ now - 3s
8. retry EndTxn, succeeds

Option 1 from above is also something we can consider, although we'd need to weigh its trade-offs.

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Aug 16, 2020
Relates to cockroachdb#51294.

This commit adds logic to the txnSpanRefresher to preemptively perform
refreshes before issuing requests when doing so is guaranteed to be
beneficial. We perform a preemptive refresh if either a) doing so would
be free because we have not yet accumulated any refresh spans, or b) the
batch contains a committing EndTxn request that we know will be rejected
if issued.

The first case is straightforward. If the transaction has yet to perform
any reads but has had its write timestamp bumped, refreshing is a
trivial no-op. In this case, refreshing eagerly prevents the transaction
from performing any future reads at its current read timestamp. Not doing
so preemptively guarantees that we will need to perform a real refresh
in the future if the transaction ever performs a read. At best, this
would be wasted work. At worst, this could result in the future refresh
failing. So we might as well refresh preemptively while doing so is
free.

Note that this first case here does NOT obviate the need for server-side
refreshes. Notably, a transaction's write timestamp might be bumped in
the same batch in which it performs its first read. In such cases, a
preemptive refresh would not be needed but a reactive refresh would not
be a trivial no-op. These situations are common for one-phase commit
transactions.

The second case is more complex. If the batch contains a committing
EndTxn request that we know will need a refresh, we don't want to bother
issuing it just for it to be rejected. Instead, preemptively refresh
before issuing the EndTxn batch. If we view reads as acquiring a form of
optimistic read locks under an optimistic concurrency control scheme (as
is discussed in the comment on txnSpanRefresher) then this preemptive
refresh immediately before the EndTxn is synonymous with the
"validation" phase of a standard OCC transaction model. However, as an
optimization compared to standard OCC, the validation phase is only
performed when necessary in CockroachDB (i.e. if the transaction's
writes have been pushed to higher timestamps).

This second case will play into the solution for cockroachdb#51294. Now that we
perform these preemptive refreshes when possible, we know that it is
always the right choice to split off the EndTxn from the rest of the
batch during a txnSpanRefresher auto-retry. Without this change, it was
unclear whether the first refresh of an EndTxn batch was caused by
earlier requests in the transaction or by other requests in the current
batch. Now, it is always caused by other requests in the same batch, so
it is always clear that we should split the EndTxn from the rest of the
batch immediately after the first refresh.

Release notes (performance improvement): validation of optimistic reads
is now performed earlier in transactions when doing so can save work.
This eliminates certain types of transaction retry errors and avoids
wasted RPC traffic.
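As a rough sketch of the two refresh conditions described in the commit message above (hypothetical fields and scalar timestamps, not the real txnSpanRefresher state):

```go
package txnsketch

// refresherState is a simplified stand-in for the txn coordinator's view of
// its transaction.
type refresherState struct {
	ReadTimestamp  int64 // simplified scalar timestamps
	WriteTimestamp int64
	RefreshSpans   int // read spans a refresh would have to re-validate
}

// canRefreshFree reports whether bumping the read timestamp validates nothing
// and therefore cannot fail.
func canRefreshFree(s refresherState) bool {
	return s.RefreshSpans == 0
}

// shouldRefreshPreemptively reports whether to refresh before sending a batch:
// (a) the refresh is free, or (b) the batch commits the transaction and would
// otherwise be rejected server-side with a retry error.
func shouldRefreshPreemptively(s refresherState, batchCommitsTxn bool) bool {
	if s.WriteTimestamp <= s.ReadTimestamp {
		return false // timestamps agree; nothing to refresh away
	}
	return canRefreshFree(s) || batchCommitsTxn
}
```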
nvanbenschoten self-assigned this on Aug 17, 2020
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Aug 17, 2020
Fixes cockroachdb#51294.
First two commits from cockroachdb#52884.

This commit updates the txnSpanRefresher to split off EndTxn requests
into their own partial batches on auto-retries after successful
refreshes as a means of preventing starvation. This avoids starvation in
two ways. First, it helps ensure that we lay down intents if any of the
other requests in the batch are writes. Second, it ensures that if any
writes are getting pushed due to contention with reads or due to the
closed timestamp, they will still succeed and allow the batch to make
forward progress. Without this, each retry attempt may get pushed
because of writes in the batch and then rejected wholesale when the
EndTxn tries to evaluate the pushed batch. When split, the writes will
be pushed but succeed, the transaction will be refreshed, and the EndTxn
will succeed.

I still need to confirm that this fixes this indefinite stall
[here](https://github.com/cockroachlabs/misc_projects_glenn/tree/master/rw_blockage#implicit-query-hangs--explict-query-works),
but I suspect that it will.

Release note (bug fix): A change in v20.1 caused a certain class of bulk
UPDATEs and DELETE statements to hang indefinitely if run in an implicit
transaction. We now break up these statements to avoid starvation and
prevent them from hanging indefinitely.
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Aug 21, 2020
Relates to cockroachdb#51294. (Commit message identical to the Aug 16 commit above.)
craig bot pushed a commit that referenced this issue Aug 21, 2020
52884: kv: preemptively refresh transaction timestamps r=nvanbenschoten a=nvanbenschoten

Relates to #51294. (PR description identical to the commit message above. The same batch merge also included unrelated PRs #52881, #53074, and #53086.)

Co-authored-by: Nathan VanBenschoten <[email protected]>
craig bot pushed a commit that referenced this issue Aug 21, 2020
52885: kv: split EndTxn into sub-batch on auto-retry after successful refresh r=nvanbenschoten a=nvanbenschoten

Fixes #51294.

(PR description identical to the commit message above.)

Co-authored-by: Nathan VanBenschoten <[email protected]>
craig bot closed this as completed in cc233c5 on Aug 21, 2020
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Aug 31, 2020
Fixes cockroachdb#51294. (Commit message identical to the Aug 17 commit above.)