kv: eliminate write-too-old deferral mechanism #102751
Informs cockroachdb#102751.

This commit eliminates the write-too-old deferral mechanism, where blind-write BatchRequests that hit a WriteTooOld error would successfully write intents and then return a Transaction proto with the WriteTooOld flag to the client. The client would then immediately refresh to remove this flag. In the intermediate period, the written intents would act as locks, ensuring that if the refresh succeeded, the writer would have exclusive access to the previously written keys and would not hit a WriteTooOld error on its next attempt.

The rationale for the removal of this mechanism is outlined in cockroachdb#102751. At a high level, the mechanism is complex, error-prone, and largely unnecessary today due to unreplicated locks and server-side refreshes. It also interacts poorly with weak isolation levels, which further motivates its removal.

Cases where the write-too-old deferral mechanism is still hypothetically useful are difficult to construct, especially from SQL's limited use of KV. They require all of the following conditions to hold:

1. a blind-writing BatchRequest (containing Put or Delete, but not ConditionalPut)
2. a BatchRequest without the CanForwardReadTimestamp flag (needs client-side refresh)
3. a write-write conflict that will not cause a refresh to fail

These requirements are almost always contradictory. A write-write conflict implies a failed refresh if the refresher has already read the conflicting key. So the cases where this mechanism helps are limited to those where the writer has not already read the conflicting key. However, SQL rarely issues blind-write KV requests for keys that it has not already read. The cases where this might come up are fast-path DELETE statements that issue DeleteRequest (not DeleteRangeRequest) and fast-path UPSERT statements that write all columns in a table. If either of these is heavily contended and takes place in a multi-statement transaction that previously read, this mechanism could help.

However, I suspect that these scenarios are very uncommon. If customers do see them, they can avoid problems by using SELECT FOR UPDATE earlier in the transaction or by using Read Committed (which effectively resets the CanForwardReadTimestamp flag at each statement boundary).

The commit does not yet remove the Transaction.WriteTooOld field, as this must remain until compatibility with v23.1 nodes is no longer a concern.

Release note: None
Informs cockroachdb#102751. This commit eliminates the MVCC-level portion of the write-too-old deferral mechanism. It adjusts `mvccPutInternal` to return a WriteTooOld error immediately when a write-write version conflict is encountered, without also writing an intent after the conflicting value. This partial-success, partial-error state is no longer needed now that KV no longer defers write-too-old error handling, so we can remove the complexity. Release note: None
So is this what prevents starvation and loss of performance in the non-SFU case? IIUC, this would mean that the MVCC layer will return an error without doing the write, and the server-side refresh will bump the timestamp (while holding the latch and the lock reservation) and retry the write successfully. So just some extra iterator seeks for the second Put attempt. Am I correct?
You are correct that the server-side refresh will avoid starvation in cases where that is possible. However, this PR won't introduce any new iterator seeks, as we were previously performing a server-side refresh (when permitted by the client) after the deferred write-too-old error anyway: cockroach/pkg/kv/kvserver/replica_write.go, lines 671 to 674 in 18fa115.
So the case where this could make a difference is when a server-side refresh is not possible but a client-side refresh is possible and succeeds. The description on #102808 goes into more detail about this. It discusses the limited cases where the write-too-old deferral mechanism could have hypothetically still been useful for avoiding starvation, even with SFU and server-side refreshes. The scenarios that meet all requirements are involved and I don't know of any real workloads that behave like them. You can much more easily construct scenarios where the client-side refresh is bound to fail and the write-too-old deferral is detrimental.
102808: kv: eliminate write-too-old deferral mechanism r=nvanbenschoten a=nvanbenschoten

KV half of #102751. This commit eliminates the write-too-old deferral mechanism, where blind-write BatchRequests that hit a WriteTooOld error would write intents and defer the error to the client; the full rationale is in #102751 and in the commit description above. The commit does not yet remove the Transaction.WriteTooOld field, as this must remain until compatibility with v23.1 nodes is no longer a concern. Release note: None

103353: sql: skip flaky TestTxnContentionEventsTable r=gtr a=gtr

Informs #102660. Release note: None

103354: admission: fix comment in admission.go r=bananabrick a=bananabrick

Epic: none Release note: None

103404: concurrency: fix bug in lockStateInfo r=nvanbenschoten a=arulajmani

All waiting readers are considered to be actively waiting at a lock; there is no concept of inactive waiting readers in the lock table. Previously, when converting a lockState into a roachpb.LockStateInfo, we would erroneously denote any waiting readers as inactive. This patch fixes this and adds a test for it. This patch also fixes how reservations are designated as active or inactive. Previously, reservations would be marked as active waiters. This isn't true -- reservation holders do not actively wait at a lock they hold reservations for. They only wait at other locks or proceed to evaluation. Now, we mark reservations as inactive waiters as well. The diff in existing tests is a result of this change. Epic: none Release note: None

Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: gtr <[email protected]>
Co-authored-by: Arjun Nair <[email protected]>
Co-authored-by: Arul Ajmani <[email protected]>
Blind-writes can currently defer the handling of `WriteTooOld` errors on write-write conflicts until after they have written their intent. This serves as a form of pessimistic locking to avoid starvation in the case of contending blind writes to the same key. The mechanism was introduced in CockroachDB early in its life, unintentionally removed in #38668, and then revived by #44654 (issue: #44653).

This mechanism has been confusing, complex, and error-prone. For example, it requires `mvccPutInternal` and its callers to perform all side-effects (e.g. populate WriteBatches) even if a WriteTooOld error has been thrown, in case the error is deferred above in `evaluateBatch`. We've wanted to remove this mechanism since at least 79c711d#diff-c9f3b8fbff25265dd51555fa7099eae566e5904c4e158332afba5551332ca5bcR293, but the time wasn't right.

The mechanism is also problematic for weaker isolation levels. Specifically, it interacts poorly with parallel commits, conflating read-write conflicts with write-write conflicts and permitting the following bug:
Since the deferral mechanism was re-introduced, CockroachDB's KV layer has evolved in two ways which obviate the need for it: the introduction of unreplicated locks and the introduction of server-side refreshes.
With these two mechanisms in place, we are ready to eliminate the write-too-old deferral mechanism.
Jira issue: CRDB-27634