kv: only scan separated intents span for QueryResolvedTimestamp requests #69717
Closed
Tracked by #67562
Labels: A-kv-transactions (Relating to MVCC and the transactional model), C-enhancement (Solution expected to add code/behavior and preserve backward-compat; pg compat issues are an exception), C-performance (Perf of queries or internals; solution not expected to change functional behavior), T-kv (KV Team)

Comments
nvanbenschoten added the C-enhancement, C-performance, A-kv-transactions, and T-kv labels on Sep 1, 2021
shralex pushed a commit to shralex/cockroach that referenced this issue on Sep 29, 2021
Previously, the QueryResolvedTimestamp request, which determines the "resolved timestamp" of a key span on a given replica, cost O(num_keys_in_span): it had to find all intents in the span, and intents were interleaved with the MVCC key-value versions. The multi-release project to separate intents from MVCC data is now far enough along that the request can instead scan only the lock-table keyspace, reducing its cost to O(num_locks_in_span).

Release note: None

Release justification: This is going into 22.1, where the migration to separated intents is complete.

Fixes: cockroachdb#69717
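As a rough illustration of the change this commit describes (using simplified, hypothetical types and a made-up key prefix, not CockroachDB's real key encoding or storage iterators): separated intents live under a dedicated lock-table portion of the keyspace, so a request that only needs intents can bound its scan to that slice and never visit MVCC versions.

```go
// Illustrative sketch only: lockTablePrefix and the flat key slice below are
// stand-ins, not CockroachDB's actual key encoding or storage API.
package main

import (
	"fmt"
	"sort"
	"strings"
)

const lockTablePrefix = "/Local/Lock/" // hypothetical prefix for illustration

// kv is a flat, sorted keyspace containing both lock-table entries (intents)
// and MVCC versions, as a real LSM would, in disjoint regions of the keyspace.
var kv = []string{
	"/Local/Lock/b",    // intent on user key "b"
	"/Local/Lock/m",    // intent on user key "m"
	"/Table/1/a@ts=10", // MVCC versions: never visited by the intent scan
	"/Table/1/b@ts=12",
	"/Table/1/m@ts=9",
	"/Table/1/z@ts=15",
}

// intentsInSpan scans only the lock-table slice of the keyspace that maps to
// the user span [start, end), so its cost is O(num_locks_in_span) rather than
// O(num_keys_in_span).
func intentsInSpan(start, end string) []string {
	lo := sort.SearchStrings(kv, lockTablePrefix+start)
	hi := sort.SearchStrings(kv, lockTablePrefix+end)
	var intents []string
	for _, k := range kv[lo:hi] {
		intents = append(intents, strings.TrimPrefix(k, lockTablePrefix))
	}
	return intents
}

func main() {
	fmt.Println(intentsInSpan("a", "z")) // [b m]
}
```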
craig bot pushed a commit that referenced this issue on Oct 3, 2021
70852: kvserver: QueryResolvedTimestamp should look for intents in LockTable r=shralex a=shralex

Co-authored-by: Alexander Shraer <[email protected]>
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue on Oct 9, 2021
Fixes cockroachdb#57686.

This commit adjusts the handling of follower reads to redirect to the leaseholder immediately if a conflicting intent is observed while reading. This replaces the previous behavior of attempting to resolve the intents from the follower using an inefficient method (without batching, and with multiple follower<->leaseholder hops) and then re-evaluating once the resolution had completed.

In general, waiting for conflicting intents on the leaseholder instead of on a follower is preferable because:

- The leaseholder is notified of, and reactive to, lock-table state transitions.
- The leaseholder can resolve intents more efficiently, if necessary, without the risk of multiple follower<->leaseholder round-trips compounding. If the follower were to attempt to resolve multiple intents during a follower read, the PushTxn and ResolveIntent requests would quickly become more expensive (in terms of latency) than simply redirecting the entire read request to the leaseholder and letting the leaseholder coordinate the intent resolution.
- After the leaseholder has received a response from a ResolveIntent request, it has a guarantee that the intent resolution has been applied locally and that no future read will observe the intent. This is not true on follower replicas. Due to the asynchronous nature of Raft, both because of quorum voting and because of async commit acknowledgement from leaders to followers, it is possible for a ResolveIntent request to complete and then for a future read on a follower to observe the pre-resolution state of the intent. This effect is transient and disappears once the follower catches up on its Raft log, but it creates an opportunity for momentary thrashing if a follower read were to resolve an intent and then immediately attempt to read again.

Redirecting follower read attempts to the leaseholder replica when they encounter conflicting intents on a follower means that follower read eligibility is a function of the "resolved timestamp" over a read's key span, and not just the "closed timestamp" over its key span. Architecturally, this is consistent with Google Spanner, which maintains the concepts of "safe time", "paxos safe time", and "transaction manager safe time". Spanner's "safe time" is analogous to the "resolved timestamp" in CockroachDB, and its "paxos safe time" is analogous to the "closed timestamp" in CockroachDB. In Spanner, it is the "safe time" of a replica that determines follower read eligibility.

There are some downsides to this change which I think are interesting to point out, but which I don't think are meaningfully concerning:

1. We don't detect the difference between the resolved timestamp and the closed timestamp until after we have begun evaluating the follower read and scanning MVCC data. This lazy detection of follower read eligibility can lead to wasted work. In the future, we may consider making this detection eager once we address cockroachdb#69717.
2. Redirecting follower reads to leaseholders can lead to large response payloads being shipped over wide-area network links. So far, this PR has compared the latency of multiple WAN hops for intent resolution to a single WAN hop for read redirection, but that doesn't recognize the potential asymmetry in cost, at least at the extreme, between control-plane requests like `PushTxn` and `ResolveIntent` and data-plane requests like `Scan` and `Get`.

In the future, I'd like to recognize this asymmetry and explore ideas around never redirecting the data-plane portion of follower reads to leaseholders, and instead only ever sending control-plane requests to proactively close time and relay log positions back to the followers. This is similar to what Spanner does; see https://www.cockroachlabs.com/blog/follower-reads-stale-data/#comparing-cockroachdb-with-spanner. For now, though, I don't think redirecting marginally more often is concerning.

Release note (performance improvement): follower reads that encounter many abandoned intents are now able to efficiently resolve those intents. This resolves an asymmetry where follower reads were previously less efficient at resolving abandoned intents than regular reads evaluated on a leaseholder.
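A minimal sketch of the redirect-on-contention policy described above; the types and names here are invented for illustration and are not the real kvserver API. A read evaluating on a follower gives up as soon as it sees a conflicting intent and redirects to the leaseholder, rather than pushing and resolving the intent from the follower.

```go
// Simplified sketch, not the real CockroachDB replica or lock-table code.
package main

import (
	"errors"
	"fmt"
)

type replica struct {
	isLeaseholder bool
	// intents maps user keys to the ID of the txn holding an intent on them.
	intents map[string]string
	values  map[string]string
}

var errRedirectToLeaseholder = errors.New("conflicting intent: redirect read to leaseholder")

// get evaluates a read at this replica. On a follower, any conflicting intent
// immediately terminates evaluation with a redirect; on the leaseholder, the
// real system would instead queue behind the lock and resolve it if abandoned.
func (r *replica) get(key string) (string, error) {
	if _, locked := r.intents[key]; locked {
		if !r.isLeaseholder {
			return "", errRedirectToLeaseholder
		}
		// Leaseholder path (elided here): wait in the lock table, push the
		// txn, and resolve the intent before retrying the read.
	}
	return r.values[key], nil
}

func main() {
	follower := &replica{
		isLeaseholder: false,
		intents:       map[string]string{"k1": "txn-abc"},
		values:        map[string]string{"k1": "v-old", "k2": "v2"},
	}
	if _, err := follower.get("k1"); err != nil {
		fmt.Println(err) // conflicting intent: redirect read to leaseholder
	}
	v, _ := follower.get("k2") // no intent: the follower read succeeds locally
	fmt.Println(v)
}
```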
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue on Oct 22, 2021
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue on Oct 22, 2021
craig bot pushed a commit that referenced this issue on Oct 22, 2021
70382: kv: redirect follower reads to leaseholder on contention r=irfansharif,andreimatei a=nvanbenschoten

Co-authored-by: Nathan VanBenschoten <[email protected]>
blathers-crl bot pushed a commit that referenced this issue on Oct 22, 2021
nvanbenschoten added a commit that referenced this issue on Dec 6, 2021
Related to #67562 and #67554.
This optimization was originally outlined in the bounded staleness RFC.
Today, a QueryResolvedTimestamp request performs a scan over the MVCC keyspace to retrieve the intents in its target key span:

cockroach/pkg/kv/kvserver/batcheval/cmd_query_resolved_timestamp.go (lines 92 to 105 in ffbbf81)
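For intuition about why this costs O(num_keys_in_span), here is a rough sketch (simplified types, not the batcheval code referenced above): with interleaved intents, every key in the span must be visited just to discover which ones carry an intent.

```go
// Illustrative sketch only: versionedKey is a simplified stand-in for an
// interleaved MVCC keyspace, not a real CockroachDB storage type.
package main

import "fmt"

// versionedKey models an interleaved representation: each user key carries
// committed MVCC versions and, optionally, an interleaved intent.
type versionedKey struct {
	key       string
	hasIntent bool
}

// intentsInterleaved walks every key in [start, end) and filters for intents,
// so its cost grows with the number of keys, not the number of intents.
func intentsInterleaved(keys []versionedKey, start, end string) []string {
	var intents []string
	for _, k := range keys { // visits every key in the span
		if k.key >= start && k.key < end && k.hasIntent {
			intents = append(intents, k.key)
		}
	}
	return intents
}

func main() {
	keys := []versionedKey{
		{"a", false}, {"b", true}, {"c", false}, {"d", false}, {"m", true},
	}
	fmt.Println(intentsInterleaved(keys, "a", "z")) // [b m], after visiting all 5 keys
}
```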
With intents now separated into their own keyspace, we should be able to perform a more efficient scan over the lock-table keyspace to retrieve the active intents in the QueryResolvedTimestamp request's target key span. This avoids the need to merge the lock-table iterator with an MVCC iterator, turning an O(num_keys_in_span) operation into an O(num_locks_in_span) operation.

This is a reasonably hard blocker for stage 2 of bounded staleness reads.
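For completeness, a rough, self-contained sketch of how a resolved timestamp could be derived from a closed timestamp and the intents returned by such a lock-table scan. This is meant to convey the intuition only (the resolved timestamp may not advance past any unresolved intent); the helper below is hypothetical and not CockroachDB's exact algorithm.

```go
// Illustrative only: not the actual resolved-timestamp computation in
// cmd_query_resolved_timestamp.go.
package main

import (
	"fmt"
	"time"
)

// resolvedTimestamp starts from the replica's closed timestamp and ratchets it
// below the write timestamp of every intent in the span, since a value could
// still commit at an intent's timestamp.
func resolvedTimestamp(closedTS time.Time, intentTimestamps []time.Time) time.Time {
	resolved := closedTS
	for _, ts := range intentTimestamps {
		if !ts.After(resolved) {
			// Back the resolved timestamp down to just before the intent.
			resolved = ts.Add(-time.Nanosecond)
		}
	}
	return resolved
}

func main() {
	closed := time.Unix(100, 0)
	intents := []time.Time{time.Unix(95, 0), time.Unix(120, 0)} // one intent below the closed ts, one above
	fmt.Println(resolvedTimestamp(closed, intents).Unix())      // 94: just below the older intent
}
```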