kv: support server-side refreshes of uncertainty errors #75905
Conversation
Force-pushed from 318fe86 to 22b4952
Force-pushed from 22b4952 to 944ab5a
Please put more words in the commit message, spelling out how the patch does what it does. Spell out that we now do some refreshes above latching, in `executeWithConcurrencyRetries`, whereas the pre-existing ones remain below latches.
But actually - would we lose much if we kept only the above-latches retries? Cause the code would probably be significantly simpler.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei and @nvanbenschoten)
pkg/kv/kvserver/replica_batch_updates.go, line 213 at r1 (raw file):
```go
// not be bumped.
func tryBumpBatchTimestamp(
	ctx context.Context, ba *roachpb.BatchRequest, g *concurrency.Guard, ts hlc.Timestamp,
```
comment that `g` can be `nil`
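For concreteness, the requested comment might read something like this (a sketch on the signature quoted above; the semantics of a nil `g` are inferred from the surrounding discussion, not from the diff):

```go
// tryBumpBatchTimestamp attempts to bump ba's read and write timestamps to ts.
//
// g may be nil, which indicates that the caller is not holding latches (e.g.
// when invoked above latching in executeWithConcurrencyRetries). A non-nil g
// lets the function consult the latches that are held before deciding whether
// the batch's timestamp can be bumped.
func tryBumpBatchTimestamp(
	ctx context.Context, ba *roachpb.BatchRequest, g *concurrency.Guard, ts hlc.Timestamp,
) bool {
	// ... body unchanged ...
}
```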
pkg/kv/kvserver/replica_evaluate.go, line 526 at r1 (raw file):
```go
// indicates that it is not holding latches and can therefore more freely adjust
// its timestamp because it will re-acquire latches at whatever timestamp the
// batch is bumped to.
```
please clarify even more that this function is called from different places - both above latching and below. Although I suspected it during the review, it took me a while to map where this is called from.
pkg/kv/kvserver/replica_send.go, line 397 at r1 (raw file):
```go
// accessing shared data structures.
//
// If the execution function hits a concurrency error like a WriteIntentError or
```
consider taking the opportunity to clarify that latches are re-acquired for each retry, potentially at different timestamps for each retry
pkg/kv/kvserver/replica_send.go, line 514 at r1 (raw file):
```go
// the uncertainty error, then it will have a new timestamp. Recompute the
// latch and lock spans accordingly.
latchSpans, lockSpans, requestEvalKind, err = r.collectSpans(ba)
```
consider having a single call to `collectSpans` in this function, inside the loop, guarded by a `needUpdateSpans` bool that is set here. I think that would clarify that the `SequenceReq` call might depend on what happened on the previous attempt.
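A rough, self-contained sketch of the loop structure being suggested (simplified stand-in types, and `needUpdateSpans` taken from the comment above; this is not the actual `executeWithConcurrencyRetries` code):

```go
package main

import "fmt"

// spans is a simplified stand-in for the latch and lock span sets.
type spans struct{ latch, lock string }

type replica struct{}

// collectSpans is a toy version: spans depend on the batch's timestamp.
func (r *replica) collectSpans(ts int64) spans {
	return spans{latch: fmt.Sprintf("latch@%d", ts), lock: fmt.Sprintf("lock@%d", ts)}
}

func main() {
	r := &replica{}
	ts := int64(10)
	needUpdateSpans := true // spans must be computed on the first attempt
	var s spans
	for attempt := 1; attempt <= 3; attempt++ {
		if needUpdateSpans {
			// The single collectSpans call site, guarded by the bool.
			s = r.collectSpans(ts)
			needUpdateSpans = false
		}
		// Stand-in for SequenceReq + evaluation at the current spans.
		fmt.Printf("attempt %d: sequencing with %+v\n", attempt, s)
		// Simulate a server-side refresh of an uncertainty error: the batch
		// timestamp changed, so the next attempt needs recomputed spans.
		ts++
		needUpdateSpans = true
	}
}
```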
pkg/kv/kvserver/replica_send.go, line 580 at r1 (raw file):
```go
case *roachpb.ReadWithinUncertaintyIntervalError:
	// If a request hits a ReadWithinUncertaintyIntervalError, it was performing
	// a non-locking read and encountered a (committed or provisional) write
```
how do you know that it's non-locking and why does it matter? Consider adding more words to the comment.
pkg/kv/kvserver/replica_send.go, line 585 at r1 (raw file):
```go
// adjust its timestamp and retry on the server.
//
// This is similar other server-side retries that we allow below latching,
```
similar to
pkg/kv/kvserver/replica_send.go, line 734 at r1 (raw file):
```go
}

func (r *Replica) handleReadWithinUncertaintyIntervalError(
```
please give this a comment
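For concreteness, the requested comment might look something like this (a sketch based on what the surrounding discussion says the function does, not the text that was actually merged):

```go
// handleReadWithinUncertaintyIntervalError attempts to adjust the batch's
// timestamp to forget about the uncertain value and retry on the server,
// rather than returning the error to the client. It runs above latching,
// after the latches from the failed attempt have been released; if the retry
// is allowed, the caller re-sequences and re-acquires latches at the bumped
// timestamp.
func (r *Replica) handleReadWithinUncertaintyIntervalError(
```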
pkg/kv/kvserver/replica_send.go, line 743 at r1 (raw file):
```go
// latchSpans, because we have already released our latches and plan to
// re-acquire them if the retry is allowed.
if !canDoServersideRetry(ctx, pErr, ba, nil, nil, nil) {
```
please put comments on these nils.
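In the codebase's usual inline-comment style, that might look roughly like the following (the parameter names `br`, `g`, and `deadline` are guesses from context; the quote above doesn't show the signature):

```go
if !canDoServersideRetry(ctx, pErr, ba, nil /* br */, nil /* g */, nil /* deadline */) {
```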
Force-pushed from 944ab5a to a095546
TFTR!
Spell out that we now do some refreshes above latching, in executeWithConcurrencyRetries, whereas the pre-existing ones remain below latches.
Done.
But actually - would we lose much if we kept only the above-latches retries? Cause the code would probably be significantly simpler.
I considered this, but I don't think we would want to do that. The reason that `ReadWithinUncertaintyIntervalError`s are ok to be handled above latches is that there is a limit to how many times a request can hit them — eventually, its read timestamp exceeds its fixed uncertainty window and it's done refreshing. This isn't the case for other forms of server-side retries (e.g. WriteTooOld errors), which could be indefinitely starved if we released latches between retries. I added some wording around this to the commit message and to the code.
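To make the boundedness argument concrete, here is a toy model with plain int64 timestamps and invented names (nothing here is actual CockroachDB code): each refresh ratchets the read timestamp upward toward a fixed limit, and once the read timestamp reaches that limit the uncertainty window is empty, so no further uncertainty errors are possible.

```go
package main

import "fmt"

func main() {
	// A txn reads at readTS and is uncertain about any value in
	// (readTS, limit]; the limit is fixed when the txn starts.
	readTS, limit := int64(10), int64(15)
	uncertainWrites := []int64{11, 13, 14} // writes inside the uncertainty window

	for _, valTS := range uncertainWrites {
		if valTS > readTS && valTS <= limit {
			fmt.Printf("uncertainty error at %d: refresh %d -> %d\n", valTS, readTS, valTS)
			// A refresh bumps readTS to the uncertain value's timestamp,
			// which is at most the fixed limit, so progress is guaranteed.
			readTS = valTS
		}
	}
	// Once readTS reaches the limit, the window (readTS, limit] is empty
	// and the request is done refreshing.
	fmt.Printf("final readTS %d; window empties at limit %d\n", readTS, limit)
}
```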
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei and @nvanbenschoten)
pkg/kv/kvserver/replica_batch_updates.go, line 213 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
comment that `g` can be `nil`
Done.
pkg/kv/kvserver/replica_evaluate.go, line 526 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
please clarify even more that this function is called from different places - both above latching and below. Although I suspected it during the review, it took me a while to map where this is called from.
Done.
pkg/kv/kvserver/replica_send.go, line 397 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
consider taking the opportunity to clarify that latches are re-acquired for each retry, potentially at different timestamps for each retry
This isn't true in all cases (depending on the type of error), so the inline comments down below are the best source of truth for this.
pkg/kv/kvserver/replica_send.go, line 514 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
consider having a single call to `collectSpans` in this function, inside the loop, guarded by a `needUpdateSpans` bool that is set here. I think that would clarify that the `SequenceReq` call might depend on what happened on the previous attempt.
Done.
pkg/kv/kvserver/replica_send.go, line 580 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
how do you know that it's non-locking and why does it matter? Consider adding more words to the comment.
Locking reads don't hit uncertainty errors, they always hit WTO errors instead.
Done.
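A toy way to state that dichotomy (my phrasing, not kvserver code):

```go
package main

import "fmt"

// errorFor classifies, per the reply above, what a request gets when it
// encounters a conflicting newer value: locking reads behave like writes and
// get WriteTooOld errors, so only non-locking reads can surface
// ReadWithinUncertaintyIntervalError.
func errorFor(locking bool) string {
	if locking {
		return "WriteTooOldError"
	}
	return "ReadWithinUncertaintyIntervalError"
}

func main() {
	fmt.Println(errorFor(true))  // WriteTooOldError
	fmt.Println(errorFor(false)) // ReadWithinUncertaintyIntervalError
}
```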
pkg/kv/kvserver/replica_send.go, line 585 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
similar to
Done.
pkg/kv/kvserver/replica_send.go, line 743 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
please put comments on these nils.
Done.
bors r+
Build succeeded.
Extracted from #73732, with relevant comments addressed.
This commit adds support for server-side refreshes of
`ReadWithinUncertaintyIntervalError`s. This serves as a performance optimization
for transactional requests, which now benefit from this new capability to
refresh away `ReadWithinUncertaintyIntervalError`s early in their transaction,
before they've accumulated any refresh spans. There's some complexity around
supporting this form of server-side retry, because it must be done above
latching, instead of below. However, the recent refactoring in #73557 has made
this possible to support cleanly.
Specifically, we now handle `ReadWithinUncertaintyIntervalError` as a concurrency
error in the `executeWithConcurrencyRetries` retry loop. This is different from
other server-side retries, which are hit during writes and can be handled without
releasing latches. This difference stems from the difference in how read and write
latches behave. Write latches protect their MVCC timestamp and any later time.
Meanwhile, read latches protect their MVCC timestamp and any earlier time. This
means that a request holding read latches that hits an uncertainty error can't
refresh without dropping those latches and acquiring new ones.
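As a toy illustration of this latch-semantics point (simplified int64 timestamps and invented helper names; the real latch logic lives in the spanlatch manager): a write latch at ts protects ts and all later times, while a read latch protects ts and all earlier times, so forwarding a writer's timestamp stays within its latch's protection while forwarding a reader's does not.

```go
package main

import "fmt"

// writeLatchCovers reports whether a write latch acquired at latchTS still
// protects an operation at opTS. Write latches protect their timestamp and
// any later time, so bumping the operation's timestamp stays covered.
func writeLatchCovers(latchTS, opTS int64) bool { return opTS >= latchTS }

// readLatchCovers reports whether a read latch acquired at latchTS still
// protects an operation at opTS. Read latches protect their timestamp and
// any earlier time, so bumping the operation's timestamp escapes them.
func readLatchCovers(latchTS, opTS int64) bool { return opTS <= latchTS }

func main() {
	latchTS := int64(10)
	bumpedTS := int64(12) // timestamp after refreshing away an uncertainty error

	fmt.Println(writeLatchCovers(latchTS, bumpedTS)) // true: writes can retry in place
	fmt.Println(readLatchCovers(latchTS, bumpedTS))  // false: reads must drop latches
	// and re-acquire them at the new timestamp, which is why this retry lives
	// above latching.
}
```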
This is also a prerequisite to giving non-transactional requests uncertainty
intervals (#73732), because we don't want ReadWithinUncertaintyIntervalErrors to
reach the client for non-transactional requests. Conveniently, because
non-transactional requests are always scoped to a single range, those that hit
uncertainty errors will always be able to retry on the server, so these errors
will never bubble up to the client that initiated the request.
Release note (performance improvement): Certain forms of automatically retried
"read uncertainty" errors are now retried more efficiently, avoiding a network
round trip.