kv: eliminate write-too-old deferral mechanism #102751
Informs cockroachdb#102751.

This commit eliminates the write-too-old deferral mechanism, where blind-write BatchRequests that hit a WriteTooOld error would successfully write intents and then return a Transaction proto with the WriteTooOld flag to the client. The client would then immediately refresh to remove this flag. In the intermediate period, the written intents would act as locks, ensuring that if the refresh succeeded, the writer would have exclusive access to the previously written keys and would not hit a WriteTooOld error on its next attempt.

The rationale for the removal of this mechanism is outlined in cockroachdb#102751. At a high level, the mechanism is complex, error-prone, and largely unnecessary today due to unreplicated locks and server-side refreshes. It also interacts poorly with weak isolation levels, which further motivates its removal.

Cases where the write-too-old deferral mechanism is still hypothetically useful are difficult to construct, especially from SQL's limited use of KV. They require all of the following conditions to hold:

1. a blind-writing BatchRequest (containing Put or Delete, but not ConditionalPut)
2. a BatchRequest without the CanForwardReadTimestamp flag (needs client-side refresh)
3. a write-write conflict that will not cause a refresh to fail

These requirements are almost always contradictory. A write-write conflict implies a failed refresh if the refresher has already read the conflicting key. So the cases where this mechanism helps are limited to those where the writer has not already read the conflicting key. However, SQL rarely issues blind-write KV requests for keys that it has not already read. The cases where this might come up are fast-path DELETE statements that issue DeleteRequest (not DeleteRangeRequest) and fast-path UPSERT statements that write all columns in a table. If either of these is heavily contended and takes place in a multi-statement transaction that previously read, this mechanism could help.

However, I suspect that these scenarios are very uncommon. If customers do see them, they can avoid problems by using SELECT FOR UPDATE earlier in the transaction or by using Read Committed (which effectively resets the CanForwardReadTimestamp flag at each statement boundary).

The commit does not yet remove the Transaction.WriteTooOld field, as this must remain until compatibility with v23.1 nodes is no longer a concern.

Release note: None
Informs cockroachdb#102751. This commit eliminates the MVCC-level portion of the write-too-old deferral mechanism. It adjusts `mvccPutInternal` to return a WriteTooOld error immediately when a write-write version conflict is encountered, without also writing an intent after the conflicting value. This partial-success, partial-error state is no longer needed now that KV no longer defers write-too-old error handling, so we can remove the complexity. Release note: None
So is this what prevents starvation and loss of performance in the non-SFU case? IIUC, this would mean that the MVCC layer will return an error without doing the write, and the server-side refresh will bump the timestamp (while holding the latch and the lock reservation) and retry the write successfully. So just some extra iterator seeks for the second Put attempt. Am I correct?
You are correct that the server-side refresh will avoid starvation in cases where that is possible. However, this PR won't introduce any new iterator seeks, as we were previously performing a server-side refresh (when permitted by the client) after the deferred write-too-old error anyway: cockroach/pkg/kv/kvserver/replica_write.go, lines 671 to 674 in 18fa115.
So the case where this could make a difference is when a server-side refresh is not possible but a client-side refresh is possible and succeeds. The description on #102808 goes into more detail about this. It discusses the limited cases where the write-too-old deferral mechanism could have hypothetically still been useful for avoiding starvation, even with SFU and server-side refreshes. The scenarios that meet all requirements are involved and I don't know of any real workloads that behave like them. You can much more easily construct scenarios where the client-side refresh is bound to fail and the write-too-old deferral is detrimental.
102808: kv: eliminate write-too-old deferral mechanism r=nvanbenschoten a=nvanbenschoten

KV half of #102751. This commit eliminates the write-too-old deferral mechanism, where blind-write BatchRequests that hit a WriteTooOld error would write intents and defer the error to the client; the full rationale is in #102751 and in the commit description above. The commit does not yet remove the Transaction.WriteTooOld field, as this must remain until compatibility with v23.1 nodes is no longer a concern. Release note: None

103353: sql: skip flaky TestTxnContentionEventsTable r=gtr a=gtr

Informs #102660. Release note: None

103354: admission: fix comment in admission.go r=bananabrick a=bananabrick

Epic: none Release note: None

103404: concurrency: fix bug in lockStateInfo r=nvanbenschoten a=arulajmani

All waiting readers are considered to be actively waiting at a lock; there is no concept of inactive waiting readers in the lock table. Previously, when converting a lockState into a roachpb.LockStateInfo, we would erroneously denote any waiting readers as inactive. This patch fixes this and adds a test for it. This patch also fixes how reservations are designated as active or inactive. Previously, reservations would be marked as active waiters. This isn't true -- reservation holders do not actively wait at a lock they hold reservations for. They only wait at other locks or proceed to evaluation. Now, we mark reservations as inactive waiters as well. The diff in existing tests is a result of this change. Epic: none Release note: None

Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: gtr <[email protected]>
Co-authored-by: Arjun Nair <[email protected]>
Co-authored-by: Arul Ajmani <[email protected]>
Blind-writes can currently defer the handling of `WriteTooOld` errors on write-write conflicts until after they have written their intent. This serves as a form of pessimistic locking to avoid starvation in the case of contending blind writes to the same key. The mechanism was introduced in CockroachDB early in its life, unintentionally removed in #38668, and then revived by #44654 (issue: #44653).

This mechanism has been confusing, complex, and error-prone. For example, it requires `mvccPutInternal` and its callers to perform all side-effects (e.g. populate WriteBatches) even if a WriteTooOld error has been thrown, in case the error is deferred above in `evaluateBatch`. We've wanted to remove this mechanism since at least 79c711d#diff-c9f3b8fbff25265dd51555fa7099eae566e5904c4e158332afba5551332ca5bcR293, but the time wasn't right.

The mechanism is also problematic for weaker isolation levels. Specifically, it interacts poorly with parallel commits, conflating read-write conflicts with write-write conflicts and permitting the following bug:
Since the deferral mechanism was re-introduced, CockroachDB's KV layer has evolved in two ways which obviate the need for it: the introduction of unreplicated locks and the introduction of server-side refreshes.
With these two mechanisms in place, we are ready to eliminate the write-too-old deferral mechanism.
Jira issue: CRDB-27634