release-21.2: kv: redirect follower reads to leaseholder on contention #71884

Merged 1 commit into release-21.2 from blathers/backport-release-21.2-70382 on Dec 6, 2021

Conversation

blathers-crl bot commented Oct 22, 2021

Backport 1/1 commits from #70382 on behalf of @nvanbenschoten.

/cc @cockroachdb/release


Fixes #57686.

This commit adjusts the handling of follower reads to redirect to the leaseholder immediately if a conflicting intent is observed while reading. This replaces the previous behavior of attempting to resolve the intents from the follower using an inefficient method (i.e. without batching and with multiple follower<->leaseholder hops) and then re-evaluating after the resolution had completed.
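
A minimal sketch of the new control flow, with hypothetical names (`followerRead`, `errRedirectToLeaseholder`) standing in for CockroachDB's internals rather than the actual API:

```go
package main

import (
	"errors"
	"fmt"
)

// errRedirectToLeaseholder is a hypothetical stand-in for the error that
// causes the client to retry the read on the leaseholder replica.
var errRedirectToLeaseholder = errors.New("conflicting intent: redirect to leaseholder")

// followerRead sketches the new behavior: as soon as a conflicting intent
// is observed during evaluation, the follower returns a redirect error
// instead of pushing and resolving the intent itself (the old, unbatched,
// multi-hop behavior).
func followerRead(keys []string, intents map[string]bool) ([]string, error) {
	var values []string
	for _, k := range keys {
		if intents[k] {
			return nil, errRedirectToLeaseholder
		}
		values = append(values, "value@"+k)
	}
	return values, nil
}

func main() {
	intents := map[string]bool{"b": true} // a conflicting intent on key "b"
	if _, err := followerRead([]string{"a", "b", "c"}, intents); err != nil {
		fmt.Println(err)
	}
}
```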

In general, waiting for conflicting intents on the leaseholder instead of on a follower is preferable because:

  • the leaseholder is notified of and reactive to lock-table state transitions.
  • the leaseholder is able to more efficiently resolve intents, if necessary, without the risk of multiple follower<->leaseholder round-trips compounding. If the follower were to attempt to resolve multiple intents during a follower read, then the `PushTxn` and `ResolveIntent` requests would quickly become more expensive (in terms of latency) than simply redirecting the entire read request to the leaseholder and letting the leaseholder coordinate the intent resolution (see the cost sketch after this list).
  • after the leaseholder has received a response from a `ResolveIntent` request, it has a guarantee that the intent resolution has been applied locally and that no future read will observe the intent. This is not true on follower replicas. Due to the asynchronous nature of Raft, both due to quorum voting and due to async commit acknowledgement from leaders to followers, it is possible for a `ResolveIntent` request to complete and then for a future read on a follower to observe the pre-resolution state of the intent. This effect is transient and will eventually disappear once the follower catches up on its Raft log, but it creates an opportunity for momentary thrashing if a follower read were to resolve an intent and then immediately attempt to read again.
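
To make the latency argument in the second bullet concrete, here is a back-of-envelope cost model in follower<->leaseholder round-trips (illustrative numbers only, not measured behavior):

```go
package main

import "fmt"

// Resolving n conflicting intents from the follower costs roughly one
// PushTxn plus one ResolveIntent hop per intent, while redirecting the
// read is a single hop regardless of how many intents the leaseholder
// must then coordinate locally.
func hopsResolveFromFollower(n int) int { return 2 * n } // PushTxn + ResolveIntent per intent
func hopsRedirect(n int) int            { return 1 }     // one redirected read, for any n

func main() {
	for _, n := range []int{1, 10, 100} {
		fmt.Printf("%4d intents: resolve-from-follower ~%3d hops, redirect ~%d hop\n",
			n, hopsResolveFromFollower(n), hopsRedirect(n))
	}
}
```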

This behavior of redirecting follower read attempts to the leaseholder replica if they encounter conflicting intents on a follower means that follower read eligibility is a function of the "resolved timestamp" over a read's key span, and not just the "closed timestamp" over its key span. Architecturally, this is consistent with Google Spanner, which maintains the concepts of "safe time", "paxos safe time", and "transaction manager safe time". "safe time" is analogous to the "resolved timestamp" in CockroachDB, and "paxos safe time" is analogous to the "closed timestamp" in CockroachDB. In Spanner, it is the "safe time" of a replica that determines follower read eligibility.
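
A toy illustration of the eligibility rule (the helper is hypothetical; CockroachDB's real check lives in the closed-timestamp and lock-table machinery):

```go
package main

import (
	"fmt"
	"time"
)

// canServeFollowerRead is a hypothetical eligibility check. The resolved
// timestamp never leads the closed timestamp; it trails it while
// unresolved intents remain in the read's key span. Gating on the
// resolved timestamp (Spanner's "safe time") is therefore strictly more
// conservative than gating on the closed timestamp alone ("paxos safe time").
func canServeFollowerRead(readTS, resolvedTS time.Time) bool {
	return !readTS.After(resolvedTS)
}

func main() {
	now := time.Now()
	closedTS := now.Add(-2 * time.Second)               // replication has closed out this time
	resolvedTS := closedTS.Add(-500 * time.Millisecond) // held back by an unresolved intent

	readTS := closedTS.Add(-100 * time.Millisecond)       // below the closed timestamp...
	fmt.Println(canServeFollowerRead(readTS, resolvedTS)) // ...but not the resolved one: false
}
```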

There are some downsides to this change which I think are interesting to point out, but I don't think are meaningfully concerning:

  1. we don't detect the difference between the resolved timestamp and the closed timestamp until after we have begun evaluating the follower read and scanning MVCC data. This lazy detection of follower read eligibility can lead to wasted work. In the future, we may consider making this detection eager once we address #69717 (kv: only scan separated intents span for QueryResolvedTimestamp requests); a sketch of what that could look like follows this list.
  2. redirecting follower reads to leaseholders can lead to large response payloads being shipped over wide-area network links. So far, this PR has compared the latency of multiple WAN hops for intent resolution to a single WAN hop for read redirection, but that doesn't recognize the potential asymmetry in cost, at least at the extreme, between control-plane requests like `PushTxn` and `ResolveIntent` and data-plane requests like `Scan` and `Get`. In the future, I'd like to recognize this asymmetry and explore ideas around never redirecting the data-plane portion of follower reads to leaseholders and instead only ever sending control-plane requests to proactively close time and relay log positions back to the followers. This is similar to what Spanner does, see https://www.cockroachlabs.com/blog/follower-reads-stale-data/#comparing-cockroachdb-with-spanner. For now, though, I don't think redirecting marginally more often is concerning.
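
As referenced in item 1 above, a sketch of eager detection (the `queryResolvedTimestamp` helper here is hypothetical; #69717 tracks the work needed to make such a check cheap):

```go
package main

import (
	"fmt"
	"time"
)

// queryResolvedTimestamp stands in for a cheap pre-check of the resolved
// timestamp over the read's key span. The signature is illustrative.
func queryResolvedTimestamp(span string) time.Time {
	return time.Now().Add(-5 * time.Second) // pretend an old intent holds it back
}

// serveRead sketches eager detection: decide follower-read eligibility
// before evaluation, so an ineligible read redirects without scanning
// any MVCC data (avoiding the wasted work described in item 1).
func serveRead(span string, readTS time.Time) string {
	if readTS.After(queryResolvedTimestamp(span)) {
		return "redirect to leaseholder (no MVCC data scanned)"
	}
	return "serve follower read locally"
}

func main() {
	fmt.Println(serveRead("[a,c)", time.Now().Add(-1*time.Second)))
}
```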

Release note (performance improvement): follower reads that encounter many abandoned intents are now able to efficiently resolve those intents. This resolves an asymmetry where follower reads were previously less efficient at resolving abandoned intents than regular reads evaluated on a leaseholder.


Release justification: needed to avoid slow intent resolution for important customer workloads.

@blathers-crl blathers-crl bot requested a review from a team as a code owner October 22, 2021 19:58
@blathers-crl blathers-crl bot force-pushed the blathers/backport-release-21.2-70382 branch from a0e9feb to 745c4ee on October 22, 2021 19:58
blathers-crl bot (Author) commented Oct 22, 2021

Thanks for opening a backport.

Please check the backport criteria before merging:

  • Patches should only be created for serious issues or test-only changes.
  • Patches should not break backwards-compatibility.
  • Patches should change as little code as possible.
  • Patches should not change on-disk formats or node communication protocols.
  • Patches should not add new functionality.
  • Patches must not add, edit, or otherwise modify cluster versions; or add version gates.
If some of the basic criteria cannot be satisfied, ensure that the following exceptional criteria are satisfied:
  • There is a high priority need for the functionality that cannot wait until the next release and is difficult to address in another way.
  • The new functionality is additive-only and only runs for clusters which have specifically “opted in” to it (e.g. by a cluster setting).
  • New code is protected by a conditional check that is trivial to verify and ensures that it only runs for opt-in clusters.
  • The PM and TL on the team that owns the changed code have signed off that the change obeys the above rules.

Add a brief release justification to the body of your PR to justify this backport.

Some other things to consider:

  • What did we do to ensure that a user that doesn’t know & care about this backport, has no idea that it happened?
  • Will this work in a cluster of mixed patch versions? Did we test that?
  • If a user upgrades a patch version, uses this feature, and then downgrades, what happens?

cockroach-teamcity (Member) commented

This change is Reviewable

andreimatei (Contributor) commented

How come you want to backport this? Is anyone clamoring for it?

nvanbenschoten (Member) commented

Yes, the customer that hit this (support#1220) was asking for it to be backported back to v20.2. It can't go back that far without a large lift, but it can make it back to v21.1.

andreimatei (Contributor) commented Oct 25, 2021 via email

@rafiss rafiss added the blathers-backport (This is a backport that Blathers created automatically.) and O-robot (Originated from a bot.) labels on Nov 29, 2021
@nvanbenschoten nvanbenschoten force-pushed the blathers/backport-release-21.2-70382 branch from 745c4ee to 80e3a7a on December 6, 2021 04:04
@nvanbenschoten nvanbenschoten merged commit abe8933 into release-21.2 Dec 6, 2021
@nvanbenschoten nvanbenschoten deleted the blathers/backport-release-21.2-70382 branch December 6, 2021 16:39