release-21.2: kv: redirect follower reads to leaseholder on contention #71884

Merged 1 commit into release-21.2 from blathers/backport-release-21.2-70382 on Dec 6, 2021

Conversation

blathers-crl bot commented Oct 22, 2021

Backport 1/1 commits from #70382 on behalf of @nvanbenschoten.

/cc @cockroachdb/release


Fixes #57686.

This commit adjusts the handling of follower reads to redirect to the leaseholder immediately if a conflicting intent is observed while reading. This replaces the previous behavior of attempting to resolve the intents from the follower using an inefficient method (i.e. without batching and with multiple follower<->leaseholder hops) and then re-evaluating after the resolution had completed.
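
A minimal sketch of the new control flow, with hypothetical names (`followerRead`, `errRedirectToLeaseholder`) standing in for CockroachDB's internals rather than the actual API:

```go
package main

import (
	"errors"
	"fmt"
)

// errRedirectToLeaseholder is a hypothetical stand-in for the error that
// causes the client to retry the read on the leaseholder replica.
var errRedirectToLeaseholder = errors.New("conflicting intent: redirect to leaseholder")

// followerRead sketches the new behavior: as soon as a conflicting intent
// is observed during evaluation, the follower returns a redirect error
// instead of pushing and resolving the intent itself (the old, unbatched,
// multi-hop behavior).
func followerRead(keys []string, intents map[string]bool) ([]string, error) {
	var values []string
	for _, k := range keys {
		if intents[k] {
			return nil, errRedirectToLeaseholder
		}
		values = append(values, "value@"+k)
	}
	return values, nil
}

func main() {
	intents := map[string]bool{"b": true} // a conflicting intent on key "b"
	if _, err := followerRead([]string{"a", "b", "c"}, intents); err != nil {
		fmt.Println(err)
	}
}
```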

In general, waiting for conflicting intents on the leaseholder instead of on a follower is preferable because:

  • the leaseholder is notified of and reactive to lock-table state transitions.
  • the leaseholder is able to more efficiently resolve intents, if necessary, without the risk of multiple follower<->leaseholder round-trips compounding. If the follower were to attempt to resolve multiple intents during a follower read, then the `PushTxn` and `ResolveIntent` requests would quickly become more expensive (in terms of latency) than simply redirecting the entire read request to the leaseholder and letting the leaseholder coordinate the intent resolution (see the cost sketch after this list).
  • after the leaseholder has received a response from a `ResolveIntent` request, it has a guarantee that the intent resolution has been applied locally and that no future read will observe the intent. This is not true on follower replicas. Due to the asynchronous nature of Raft, both due to quorum voting and due to async commit acknowledgement from leaders to followers, it is possible for a `ResolveIntent` request to complete and then for a future read on a follower to observe the pre-resolution state of the intent. This effect is transient and will eventually disappear once the follower catches up on its Raft log, but it creates an opportunity for momentary thrashing if a follower read were to resolve an intent and then immediately attempt to read again.
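
To make the latency argument in the second bullet concrete, here is a back-of-envelope cost model in follower<->leaseholder round-trips (illustrative numbers only, not measured behavior):

```go
package main

import "fmt"

// Resolving n conflicting intents from the follower costs roughly one
// PushTxn plus one ResolveIntent hop per intent, while redirecting the
// read is a single hop regardless of how many intents the leaseholder
// must then coordinate locally.
func hopsResolveFromFollower(n int) int { return 2 * n } // PushTxn + ResolveIntent per intent
func hopsRedirect(n int) int            { return 1 }     // one redirected read, for any n

func main() {
	for _, n := range []int{1, 10, 100} {
		fmt.Printf("%4d intents: resolve-from-follower ~%3d hops, redirect ~%d hop\n",
			n, hopsResolveFromFollower(n), hopsRedirect(n))
	}
}
```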

This behavior of redirecting follower read attempts to the leaseholder replica if they encounter conflicting intents on a follower means that follower read eligibility is a function of the "resolved timestamp" over a read's key span, and not just the "closed timestamp" over its key span. Architecturally, this is consistent with Google Spanner, which maintains the concepts of "safe time", "paxos safe time", and "transaction manager safe time". "safe time" is analogous to the "resolved timestamp" in CockroachDB, and "paxos safe time" is analogous to the "closed timestamp" in CockroachDB. In Spanner, it is the "safe time" of a replica that determines follower read eligibility.
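
A toy illustration of the eligibility rule (the helper is hypothetical; CockroachDB's real check lives in the closed-timestamp and lock-table machinery):

```go
package main

import (
	"fmt"
	"time"
)

// canServeFollowerRead is a hypothetical eligibility check. The resolved
// timestamp never leads the closed timestamp; it trails it while
// unresolved intents remain in the read's key span. Gating on the
// resolved timestamp (Spanner's "safe time") is therefore strictly more
// conservative than gating on the closed timestamp alone ("paxos safe time").
func canServeFollowerRead(readTS, resolvedTS time.Time) bool {
	return !readTS.After(resolvedTS)
}

func main() {
	now := time.Now()
	closedTS := now.Add(-2 * time.Second)               // replication has closed out this time
	resolvedTS := closedTS.Add(-500 * time.Millisecond) // held back by an unresolved intent

	readTS := closedTS.Add(-100 * time.Millisecond)       // below the closed timestamp...
	fmt.Println(canServeFollowerRead(readTS, resolvedTS)) // ...but not the resolved one: false
}
```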

There are some downsides to this change which I think are interesting to point out, but I don't think are meaningfully concerning:

  1. we don't detect the difference between the resolved timestamp and the closed timestamp until after we have begun evaluating the follower read and scanning MVCC data. This lazy detection of follower read eligibility can lead to wasted work. In the future, we may consider making this detection eager once we address #69717 (kv: only scan separated intents span for QueryResolvedTimestamp requests); a sketch of what that could look like follows this list.
  2. redirecting follower reads to leaseholders can lead to large response payloads being shipped over wide-area network links. So far, this PR has compared the latency of multiple WAN hops for intent resolution to a single WAN hop for read redirection, but that doesn't recognize the potential asymmetry in cost, at least at the extreme, between control-plane requests like `PushTxn` and `ResolveIntent` and data-plane requests like `Scan` and `Get`. In the future, I'd like to recognize this asymmetry and explore ideas around never redirecting the data-plane portion of follower reads to leaseholders and instead only ever sending control-plane requests to proactively close time and relay log positions back to the followers. This is similar to what Spanner does, see https://www.cockroachlabs.com/blog/follower-reads-stale-data/#comparing-cockroachdb-with-spanner. For now, though, I don't think redirecting marginally more often is concerning.
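
As referenced in item 1 above, a sketch of eager detection (the `queryResolvedTimestamp` helper here is hypothetical; #69717 tracks the work needed to make such a check cheap):

```go
package main

import (
	"fmt"
	"time"
)

// queryResolvedTimestamp stands in for a cheap pre-check of the resolved
// timestamp over the read's key span. The signature is illustrative.
func queryResolvedTimestamp(span string) time.Time {
	return time.Now().Add(-5 * time.Second) // pretend an old intent holds it back
}

// serveRead sketches eager detection: decide follower-read eligibility
// before evaluation, so an ineligible read redirects without scanning
// any MVCC data (avoiding the wasted work described in item 1).
func serveRead(span string, readTS time.Time) string {
	if readTS.After(queryResolvedTimestamp(span)) {
		return "redirect to leaseholder (no MVCC data scanned)"
	}
	return "serve follower read locally"
}

func main() {
	fmt.Println(serveRead("[a,c)", time.Now().Add(-1*time.Second)))
}
```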

Release note (performance improvement): follower reads that encounter many abandoned intents are now able to efficiently resolve those intents. This resolves an asymmetry where follower reads were previously less efficient at resolving abandoned intents than regular reads evaluated on a leaseholder.


Release justification: needed to avoid slow intent resolution for important customer workloads.

@blathers-crl blathers-crl bot requested a review from a team as a code owner October 22, 2021 19:58
@blathers-crl blathers-crl bot force-pushed the blathers/backport-release-21.2-70382 branch from a0e9feb to 745c4ee on October 22, 2021 19:58
blathers-crl bot (Author) commented Oct 22, 2021

Thanks for opening a backport.

Please check the backport criteria before merging:

  • Patches should only be created for serious issues or test-only changes.
  • Patches should not break backwards-compatibility.
  • Patches should change as little code as possible.
  • Patches should not change on-disk formats or node communication protocols.
  • Patches should not add new functionality.
  • Patches must not add, edit, or otherwise modify cluster versions; or add version gates.
If some of the basic criteria cannot be satisfied, ensure that the following exceptional criteria are satisfied:
  • There is a high priority need for the functionality that cannot wait until the next release and is difficult to address in another way.
  • The new functionality is additive-only and only runs for clusters which have specifically “opted in” to it (e.g. by a cluster setting).
  • New code is protected by a conditional check that is trivial to verify and ensures that it only runs for opt-in clusters.
  • The PM and TL on the team that owns the changed code have signed off that the change obeys the above rules.

Add a brief release justification to the body of your PR to justify this backport.

Some other things to consider:

  • What did we do to ensure that a user that doesn’t know & care about this backport, has no idea that it happened?
  • Will this work in a cluster of mixed patch versions? Did we test that?
  • If a user upgrades a patch version, uses this feature, and then downgrades, what happens?

cockroach-teamcity (Member) commented

This change is Reviewable

andreimatei (Contributor) commented

How come you want to backport this? Is anyone clamoring for it?

nvanbenschoten (Member) commented

Yes, the customer that hit this (support#1220) was asking for it to be backported back to v20.2. It can't go back that far without a large lift, but it can make it back to v21.1.

andreimatei (Contributor) commented Oct 25, 2021 via email

@rafiss rafiss added the blathers-backport (This is a backport that Blathers created automatically.) and O-robot (Originated from a bot.) labels on Nov 29, 2021
@nvanbenschoten nvanbenschoten force-pushed the blathers/backport-release-21.2-70382 branch from 745c4ee to 80e3a7a on December 6, 2021 04:04
@nvanbenschoten nvanbenschoten merged commit abe8933 into release-21.2 Dec 6, 2021
@nvanbenschoten nvanbenschoten deleted the blathers/backport-release-21.2-70382 branch December 6, 2021 16:39