kv: support server-side refreshes of uncertainty errors #75905

Conversation

@nvanbenschoten nvanbenschoten commented Feb 2, 2022

Extracted from #73732, with relevant comments addressed.

This commit adds support for server-side refreshes of
ReadWithinUncertaintyIntervalErrors. This serves as a performance optimization
for transactional requests, which now benefit from this new capability to
refresh away ReadWithinUncertaintyIntervalErrors early in their transaction,
before they've accumulated any refresh spans. There's some complexity around
supporting this form of server-side retry, because it must be done above
latching, instead of below. However, the recent refactoring in #73557 has made
this possible to support cleanly.

Specifically, we now handle ReadWithinUncertaintyIntervalError as a concurrency
error in the executeWithConcurrencyRetries retry loop. This is different from
other server-side retries, which are hit during writes and can be handled without
releasing latches. This difference stems from the difference in how read and write
latches behave. Write latches protect their MVCC timestamp and any later time.
Meanwhile, read latches protect their MVCC timestamp and any earlier time. This
means that a request holding read latches that hits an uncertainty error can't
refresh without dropping those latches and acquiring new ones.
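The read/write latch asymmetry described above can be illustrated with a small sketch. The `latch` type and `protects` helper below are hypothetical simplifications for illustration only (scalar timestamps, no spans), not CockroachDB's actual latch manager:

```go
package main

import "fmt"

// A latch held at timestamp ts protects a range of MVCC time:
//   - a write latch protects ts and all later timestamps;
//   - a read latch protects ts and all earlier timestamps.
type latch struct {
	write bool
	ts    int64 // simplified scalar timestamp
}

// protects reports whether a conflicting access at accessTs would be
// blocked by this latch (illustrative only).
func (l latch) protects(accessTs int64) bool {
	if l.write {
		return accessTs >= l.ts // write latches protect later times
	}
	return accessTs <= l.ts // read latches protect earlier times
}

func main() {
	read := latch{write: false, ts: 10}
	// A read latch at ts=10 does NOT protect ts=15. So a request that
	// refreshes its read to a higher timestamp cannot rely on its old
	// read latches: a conflicting write at ts=15 was never blocked, and
	// new latches must be acquired at the bumped timestamp.
	fmt.Println(read.protects(15)) // false
	fmt.Println(read.protects(5))  // true

	write := latch{write: true, ts: 10}
	// A write latch at ts=10 protects ts=15, which is why write-path
	// retries can be handled below latching without re-acquisition.
	fmt.Println(write.protects(15)) // true
}
```

This is why the uncertainty-error retry must drop latches and re-sequence, while the pre-existing write-path retries need not.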

This is also a prerequisite to giving non-transactional requests uncertainty
intervals (#73732), because we don't want ReadWithinUncertaintyIntervalErrors to
reach the client for non-transactional requests. Conveniently, because
non-transactional requests are always scoped to a single range, those that hit
uncertainty errors will always be able to retry on the server, so these errors
will never bubble up to the client that initiated the request.

Release note (performance improvement): Certain forms of automatically retried
"read uncertainty" errors are now retried more efficiently, avoiding a network
round trip.

@nvanbenschoten nvanbenschoten requested a review from a team as a code owner February 2, 2022 23:09

@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/serverSideUncertaintyRefresh branch 2 times, most recently from 318fe86 to 22b4952 Compare February 3, 2022 04:22
@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/serverSideUncertaintyRefresh branch from 22b4952 to 944ab5a Compare February 5, 2022 19:10

@andreimatei andreimatei left a comment


:lgtm:

Please put more words in the commit message, spelling out how the patch does what it does. Spell out that we now do some refreshes above latching, in executeWithConcurrencyRetries, whereas the pre-existing ones remain below latches.
But actually - would we lose much if we kept only the above-latches retries? Cause the code would probably be significantly simpler.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei and @nvanbenschoten)


pkg/kv/kvserver/replica_batch_updates.go, line 213 at r1 (raw file):

// not be bumped.
func tryBumpBatchTimestamp(
	ctx context.Context, ba *roachpb.BatchRequest, g *concurrency.Guard, ts hlc.Timestamp,

comment that g can be nil


pkg/kv/kvserver/replica_evaluate.go, line 526 at r1 (raw file):

// indicates that it is not holding latches and can therefore more freely adjust
// its timestamp because it will re-acquire latches at whatever timestamp the
// batch is bumped to.

please clarify even more that this function is called from different places - both above latching and below. Although I suspected it during the review, it took me a while to map where this is called from.


pkg/kv/kvserver/replica_send.go, line 397 at r1 (raw file):

// accessing shared data structures.
//
// If the execution function hits a concurrency error like a WriteIntentError or

consider taking the opportunity to clarify that latches are re-acquired for each retry, potentially at different timestamps for each retry


pkg/kv/kvserver/replica_send.go, line 514 at r1 (raw file):

			// the uncertainty error, then it will have a new timestamp. Recompute the
			// latch and lock spans accordingly.
			latchSpans, lockSpans, requestEvalKind, err = r.collectSpans(ba)

consider having a single call to collectSpans in this function, inside the loop, guarded by a needUpdateSpans bool that is set here. I think that would clarify that the SequenceReq call might depend on what happened on the previous attempt.
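The suggested restructuring can be simulated with a toy loop. All names and behavior here are hypothetical stand-ins (a counter in place of `r.collectSpans(ba)`, an integer in place of real retry conditions), sketching the shape of the idea rather than the actual `executeWithConcurrencyRetries` implementation:

```go
package main

import "fmt"

// runWithRetries simulates a retry loop with a single span-collection
// call site: spans are computed on the first iteration and recomputed
// on a later iteration only when the previous attempt bumped the batch
// timestamp, guarded by needUpdateSpans.
func runWithRetries(retriesBeforeSuccess int) (collectCalls int) {
	needUpdateSpans := true // always compute spans on the first pass
	for attempt := 0; ; attempt++ {
		if needUpdateSpans {
			collectCalls++ // stand-in for r.collectSpans(ba)
			needUpdateSpans = false
		}
		// Stand-in for sequencing and evaluating the request.
		if attempt >= retriesBeforeSuccess {
			return collectCalls // evaluation succeeded
		}
		// Simulate an uncertainty error that bumps the batch timestamp,
		// so the latch and lock spans must be recomputed next time.
		needUpdateSpans = true
	}
}

func main() {
	fmt.Println(runWithRetries(0)) // 1: no retries, spans computed once
	fmt.Println(runWithRetries(2)) // 3: each retry recomputes the spans
}
```

The single call site makes it explicit that each `SequenceReq` can depend on state produced by the previous attempt.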


pkg/kv/kvserver/replica_send.go, line 580 at r1 (raw file):

	case *roachpb.ReadWithinUncertaintyIntervalError:
		// If a request hits a ReadWithinUncertaintyIntervalError, it was performing
		// a non-locking read and encountered a (committed or provisional) write

how do you know that it's non-locking and why does it matter? Consider adding more words to the comment.


pkg/kv/kvserver/replica_send.go, line 585 at r1 (raw file):

		// adjust its timestamp and retry on the server.
		//
		// This is similar other server-side retries that we allow below latching,

similar to


pkg/kv/kvserver/replica_send.go, line 734 at r1 (raw file):

}

func (r *Replica) handleReadWithinUncertaintyIntervalError(

please give this a comment


pkg/kv/kvserver/replica_send.go, line 743 at r1 (raw file):

	// latchSpans, because we have already released our latches and plan to
	// re-acquire them if the retry is allowed.
	if !canDoServersideRetry(ctx, pErr, ba, nil, nil, nil) {

please put comments on these nils.

@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/serverSideUncertaintyRefresh branch from 944ab5a to a095546 Compare February 8, 2022 04:37

@nvanbenschoten nvanbenschoten left a comment


TFTR!

Spell out that we now do some refreshes above latching, in executeWithConcurrencyRetries, whereas the pre-existing ones remain below latches.

Done.

But actually - would we lose much if we kept only the above-latches retries? Cause the code would probably be significantly simpler.

I considered this, but I don't think we would want to do that. The reason that ReadWithinUncertaintyIntervalErrors are OK to handle above latches is that there is a limit to how many times a request can hit them — eventually, its read timestamp exceeds its fixed uncertainty window and it's done refreshing. This isn't the case for other forms of server-side retries (e.g. WriteTooOld errors), which could be indefinitely starved if we released latches between retries. I added some wording around this to the commit message and to the code.
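The bounded-retry argument above can be sketched as follows. All names are hypothetical and timestamps are simplified to integers; the point is only that each refresh moves the read timestamp strictly upward but never past the fixed uncertainty limit, so the retry loop terminates:

```go
package main

import "fmt"

// refreshUntilCertain simulates repeated server-side uncertainty
// refreshes. A value is uncertain only if it lies above the current
// read timestamp and at or below the fixed uncertainty limit. Each
// refresh bumps readTs up to the uncertain value's timestamp, so the
// number of refreshes is bounded: once readTs reaches the limit, no
// value can be uncertain (unlike WriteTooOld retries, which have no
// such bound and could starve if latches were dropped between them).
func refreshUntilCertain(readTs, uncertaintyLimit int64, valueTimestamps []int64) (finalTs int64, refreshes int) {
	for _, valTs := range valueTimestamps {
		if valTs > readTs && valTs <= uncertaintyLimit {
			readTs = valTs // refresh above the uncertain value
			refreshes++
		}
	}
	return readTs, refreshes
}

func main() {
	// Values at 7 and 9 are uncertain; the value at 12 is beyond the
	// limit of 10 and therefore never causes an uncertainty error.
	ts, n := refreshUntilCertain(5, 10, []int64{7, 9, 12})
	fmt.Println(ts, n) // 9 2
}
```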

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei and @nvanbenschoten)


pkg/kv/kvserver/replica_batch_updates.go, line 213 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

comment that g can be nil

Done.


pkg/kv/kvserver/replica_evaluate.go, line 526 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

please clarify even more that this function is called from different places - both above latching and below. Although I suspected it during the review, it took me a while to map where this is called from.

Done.


pkg/kv/kvserver/replica_send.go, line 397 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

consider taking the opportunity to clarify that latches are re-acquired for each retry, potentially at different timestamps for each retry

This isn't true in all cases (depending on the type of error), so the inline comments down below are the best source of truth for this.


pkg/kv/kvserver/replica_send.go, line 514 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

consider having a single call to collectSpans in this function, inside the loop, guarded by a needUpdateSpans bool that is set here. I think that would clarify that the SequenceReq call might depend on what happened on the previous attempt.

Done.


pkg/kv/kvserver/replica_send.go, line 580 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…

how do you know that it's non-locking and why does it matter? Consider adding more words to the comment.

Locking reads don't hit uncertainty errors, they always hit WriteTooOld (WTO) errors instead.

Done.


pkg/kv/kvserver/replica_send.go, line 585 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

similar to

Done.


pkg/kv/kvserver/replica_send.go, line 743 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

please put comments on these nils.

Done.

@nvanbenschoten

bors r+

craig bot commented Feb 8, 2022

Build succeeded:

@craig craig bot merged commit 03c50b0 into cockroachdb:master Feb 8, 2022
@nvanbenschoten nvanbenschoten deleted the nvanbenschoten/serverSideUncertaintyRefresh branch February 8, 2022 13:51