kv: support server-side refreshes of uncertainty errors #75905

Conversation

@nvanbenschoten nvanbenschoten commented Feb 2, 2022

Extracted from #73732, with relevant comments addressed.

This commit adds support for server-side refreshes of
ReadWithinUncertaintyIntervalErrors. This serves as a performance optimization
for transactional requests, which now benefit from this new capability to
refresh away ReadWithinUncertaintyIntervalErrors early in their transaction,
before they've accumulated any refresh spans. There's some complexity around
supporting this form of server-side retry, because it must be done above
latching, instead of below. However, the recent refactoring in #73557 has made
this possible to support cleanly.

Specifically, we now handle ReadWithinUncertaintyIntervalError as a concurrency
error in the executeWithConcurrencyRetries retry loop. This is different from
other server-side retries, which are hit during writes and can be handled without
releasing latches. This difference stems from the difference in how read and write
latches behave. Write latches protect their MVCC timestamp and any later time.
Meanwhile, read latches protect their MVCC timestamp and any earlier time. This
means that a request holding read latches that hits an uncertainty error can't
refresh without dropping those latches and acquiring new ones.
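The read/write latch asymmetry described above can be illustrated with a small sketch. The `latch` type and `protects` helper below are hypothetical simplifications for illustration only (scalar timestamps, no spans), not CockroachDB's actual latch manager:

```go
package main

import "fmt"

// A latch held at timestamp ts protects a range of MVCC time:
//   - a write latch protects ts and all later timestamps;
//   - a read latch protects ts and all earlier timestamps.
type latch struct {
	write bool
	ts    int64 // simplified scalar timestamp
}

// protects reports whether a conflicting access at accessTs would be
// blocked by this latch (illustrative only).
func (l latch) protects(accessTs int64) bool {
	if l.write {
		return accessTs >= l.ts // write latches protect later times
	}
	return accessTs <= l.ts // read latches protect earlier times
}

func main() {
	read := latch{write: false, ts: 10}
	// A read latch at ts=10 does NOT protect ts=15. So a request that
	// refreshes its read to a higher timestamp cannot rely on its old
	// read latches: a conflicting write at ts=15 was never blocked, and
	// new latches must be acquired at the bumped timestamp.
	fmt.Println(read.protects(15)) // false
	fmt.Println(read.protects(5))  // true

	write := latch{write: true, ts: 10}
	// A write latch at ts=10 protects ts=15, which is why write-path
	// retries can be handled below latching without re-acquisition.
	fmt.Println(write.protects(15)) // true
}
```

This is why the uncertainty-error retry must drop latches and re-sequence, while the pre-existing write-path retries need not.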

This is also a prerequisite to giving non-transactional requests uncertainty
intervals (#73732), because we don't want ReadWithinUncertaintyIntervalErrors to
reach the client for non-transactional requests. Conveniently, because
non-transactional requests are always scoped to a single range, those that hit
uncertainty errors will always be able to retry on the server, so these errors
will never bubble up to the client that initiated the request.

Release note (performance improvement): Certain forms of automatically retried
"read uncertainty" errors are now retried more efficiently, avoiding a network
round trip.

@nvanbenschoten nvanbenschoten requested a review from a team as a code owner February 2, 2022 23:09

@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/serverSideUncertaintyRefresh branch 2 times, most recently from 318fe86 to 22b4952 Compare February 3, 2022 04:22
@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/serverSideUncertaintyRefresh branch from 22b4952 to 944ab5a Compare February 5, 2022 19:10

@andreimatei andreimatei left a comment


:lgtm:

Please put more words in the commit message, spelling out how the patch does what it does. Spell out that we now do some refreshes above latching, in executeWithConcurrencyRetries, whereas the pre-existing ones remain below latches.
But actually - would we lose much if we kept only the above-latches retries? Cause the code would probably be significantly simpler.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei and @nvanbenschoten)


pkg/kv/kvserver/replica_batch_updates.go, line 213 at r1 (raw file):

// not be bumped.
func tryBumpBatchTimestamp(
	ctx context.Context, ba *roachpb.BatchRequest, g *concurrency.Guard, ts hlc.Timestamp,

comment that g can be nil


pkg/kv/kvserver/replica_evaluate.go, line 526 at r1 (raw file):

// indicates that it is not holding latches and can therefore more freely adjust
// its timestamp because it will re-acquire latches at whatever timestamp the
// batch is bumped to.

please clarify even more that this function is called from different places - both above latching and below. Although I suspected it during the review, it took me a while to map where this is called from.


pkg/kv/kvserver/replica_send.go, line 397 at r1 (raw file):

// accessing shared data structures.
//
// If the execution function hits a concurrency error like a WriteIntentError or

consider taking the opportunity to clarify that latches are re-acquired for each retry, potentially at different timestamps for each retry


pkg/kv/kvserver/replica_send.go, line 514 at r1 (raw file):

			// the uncertainty error, then it will have a new timestamp. Recompute the
			// latch and lock spans accordingly.
			latchSpans, lockSpans, requestEvalKind, err = r.collectSpans(ba)

consider having a single call to collectSpans in this function, inside the loop, guarded by a needUpdateSpans bool that is set here. I think that would clarify that the SequenceReq call might depend on what happened on the previous attempt.
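The suggested restructuring can be simulated with a toy loop. All names and behavior here are hypothetical stand-ins (a counter in place of `r.collectSpans(ba)`, an integer in place of real retry conditions), sketching the shape of the idea rather than the actual `executeWithConcurrencyRetries` implementation:

```go
package main

import "fmt"

// runWithRetries simulates a retry loop with a single span-collection
// call site: spans are computed on the first iteration and recomputed
// on a later iteration only when the previous attempt bumped the batch
// timestamp, guarded by needUpdateSpans.
func runWithRetries(retriesBeforeSuccess int) (collectCalls int) {
	needUpdateSpans := true // always compute spans on the first pass
	for attempt := 0; ; attempt++ {
		if needUpdateSpans {
			collectCalls++ // stand-in for r.collectSpans(ba)
			needUpdateSpans = false
		}
		// Stand-in for sequencing and evaluating the request.
		if attempt >= retriesBeforeSuccess {
			return collectCalls // evaluation succeeded
		}
		// Simulate an uncertainty error that bumps the batch timestamp,
		// so the latch and lock spans must be recomputed next time.
		needUpdateSpans = true
	}
}

func main() {
	fmt.Println(runWithRetries(0)) // 1: no retries, spans computed once
	fmt.Println(runWithRetries(2)) // 3: each retry recomputes the spans
}
```

The single call site makes it explicit that each `SequenceReq` can depend on state produced by the previous attempt.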


pkg/kv/kvserver/replica_send.go, line 580 at r1 (raw file):

	case *roachpb.ReadWithinUncertaintyIntervalError:
		// If a request hits a ReadWithinUncertaintyIntervalError, it was performing
		// a non-locking read and encountered a (committed or provisional) write

how do you know that it's non-locking and why does it matter? Consider adding more words to the comment.


pkg/kv/kvserver/replica_send.go, line 585 at r1 (raw file):

		// adjust its timestamp and retry on the server.
		//
		// This is similar other server-side retries that we allow below latching,

similar to


pkg/kv/kvserver/replica_send.go, line 734 at r1 (raw file):

}

func (r *Replica) handleReadWithinUncertaintyIntervalError(

please give this a comment


pkg/kv/kvserver/replica_send.go, line 743 at r1 (raw file):

	// latchSpans, because we have already released our latches and plan to
	// re-acquire them if the retry is allowed.
	if !canDoServersideRetry(ctx, pErr, ba, nil, nil, nil) {

please put comments on these nils.

@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/serverSideUncertaintyRefresh branch from 944ab5a to a095546 Compare February 8, 2022 04:37

@nvanbenschoten nvanbenschoten left a comment


TFTR!

Spell out that we now do some refreshes above latching, in executeWithConcurrencyRetries, whereas the pre-existing ones remain below latches.

Done.

But actually - would we lose much if we kept only the above-latches retries? Cause the code would probably be significantly simpler.

I considered this, but I don't think we would want to do that. The reason that ReadWithinUncertaintyIntervalErrors are OK to handle above latches is that there is a limit to how many times a request can hit them — eventually, its read timestamp exceeds its fixed uncertainty window and it's done refreshing. This isn't the case for other forms of server-side retries (e.g. WriteTooOld errors), which could be indefinitely starved if we released latches between retries. I added some wording around this to the commit message and to the code.
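The bounded-retry argument above can be sketched as follows. All names are hypothetical and timestamps are simplified to integers; the point is only that each refresh moves the read timestamp strictly upward but never past the fixed uncertainty limit, so the retry loop terminates:

```go
package main

import "fmt"

// refreshUntilCertain simulates repeated server-side uncertainty
// refreshes. A value is uncertain only if it lies above the current
// read timestamp and at or below the fixed uncertainty limit. Each
// refresh bumps readTs up to the uncertain value's timestamp, so the
// number of refreshes is bounded: once readTs reaches the limit, no
// value can be uncertain (unlike WriteTooOld retries, which have no
// such bound and could starve if latches were dropped between them).
func refreshUntilCertain(readTs, uncertaintyLimit int64, valueTimestamps []int64) (finalTs int64, refreshes int) {
	for _, valTs := range valueTimestamps {
		if valTs > readTs && valTs <= uncertaintyLimit {
			readTs = valTs // refresh above the uncertain value
			refreshes++
		}
	}
	return readTs, refreshes
}

func main() {
	// Values at 7 and 9 are uncertain; the value at 12 is beyond the
	// limit of 10 and therefore never causes an uncertainty error.
	ts, n := refreshUntilCertain(5, 10, []int64{7, 9, 12})
	fmt.Println(ts, n) // 9 2
}
```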

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei and @nvanbenschoten)


pkg/kv/kvserver/replica_batch_updates.go, line 213 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

comment that g can be nil

Done.


pkg/kv/kvserver/replica_evaluate.go, line 526 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

please clarify even more that this function is called from different places - both above latching and below. Although I suspected it during the review, it took me a while to map where this is called from.

Done.


pkg/kv/kvserver/replica_send.go, line 397 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

consider taking the opportunity to clarify that latches are re-acquired for each retry, potentially at different timestamps for each retry

This isn't true in all cases (depending on the type of error), so the inline comments down below are the best source of truth for this.


pkg/kv/kvserver/replica_send.go, line 514 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

consider having a single call to collectSpans in this function, inside the loop, guarded by a needUpdateSpans bool that is set here. I think that would clarify that the SequenceReq call might depend on what happened on the previous attempt.

Done.


pkg/kv/kvserver/replica_send.go, line 580 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…

how do you know that it's non-locking and why does it matter? Consider adding more words to the comment.

Locking reads don't hit uncertainty errors, they always hit WriteTooOld (WTO) errors instead.

Done.


pkg/kv/kvserver/replica_send.go, line 585 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

similar to

Done.


pkg/kv/kvserver/replica_send.go, line 743 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

please put comments on these nils.

Done.

@nvanbenschoten

bors r+

craig bot commented Feb 8, 2022

Build succeeded:

@craig craig bot merged commit 03c50b0 into cockroachdb:master Feb 8, 2022
@nvanbenschoten nvanbenschoten deleted the nvanbenschoten/serverSideUncertaintyRefresh branch February 8, 2022 13:51