kv: always transfer expiration-based leases during lease transfers #81764
Labels
- A-kv: Anything in KV that doesn't belong in a more specific category.
- C-enhancement: Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception).
- N-followup: Needs followup.
- O-postmortem: Originated from a Postmortem action item.
- T-kv: KV Team.
Comments
nvanbenschoten added the C-enhancement and A-kv labels on May 24, 2022.
lunevalex added the O-postmortem and N-followup labels on May 26, 2022.
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue on Jun 12, 2022:
Fixes cockroachdb#81763. Part of cockroachdb#81561.

### Background

When performing a lease transfer, the outgoing leaseholder revokes its lease before proposing the lease transfer request, meaning that it promises to stop using the previous lease to serve reads or writes. The lease transfer request is then proposed and committed to the Raft log, at which point the new lease officially becomes active. However, this new lease is not usable until the incoming leaseholder applies the Raft entry that contains the lease transfer and notices that it is now the leaseholder for the range.

The effect of this handoff is that there exists a "power vacuum" time period when the outgoing leaseholder has revoked its previous lease but the incoming leaseholder has not yet applied its new lease. During this time period, a range is effectively unavailable for strong reads and writes, because no replica will act as the leaseholder. Instead, requests that require the lease will be redirected back and forth between the outgoing leaseholder and the incoming leaseholder (the client backs off). To minimize the disruption caused by lease transfers, we need to minimize this time period.

We assume that if a lease transfer target is sufficiently caught up on its log that it will be able to apply the lease transfer through log entry application, then this unavailability window will be acceptable. This may be a faulty assumption in cases with severe replication lag, but we must balance any heuristic that attempts to determine "too much lag" against the possibility of starving lease transfers under sustained write load and the resulting sustained replication lag. See cockroachdb#38065 and cockroachdb#42379, which removed such a heuristic. For now, we don't try to make such a determination.

### Patch Details

With this change, we now draw a distinction between lease transfer targets that will be able to apply the lease transfer through log entry application and those that will require a Raft snapshot to catch up and apply the lease transfer. Raft snapshots are more expensive than Raft entry replication. They are also significantly more likely to be delayed due to queueing behind other snapshot traffic in the system. This potential for delay makes transferring a lease to a replica that needs a snapshot very risky, as doing so has the effect of inducing range unavailability until the snapshot completes, which could take seconds, minutes, or hours.

In the future, we will likely get better at prioritizing snapshots to improve the responsiveness of snapshots that are needed to recover availability. However, even in that world, it is not worth inducing unavailability that can only be recovered through a Raft snapshot. It is better to catch the desired lease target up on the log first and then initiate the lease transfer once its log is connected to the leader's. For this reason, unless we can guarantee that the lease transfer target does not need a Raft snapshot, we don't let the transfer through.

This commit adds protection against such risky lease transfers at two levels. First, it includes hardened protection in the Replica proposal buffer, immediately before handing the lease transfer proposal off to `etcd/raft`. Second, it includes best-effort protection before a Replica begins to initiate a lease transfer in `AdminTransferLease`, which all lease transfer operations flow through.

The change includes protection at two levels because rejecting a lease transfer in the proposal buffer, after we have revoked our current lease, is more disruptive than doing so earlier, before we have revoked our current lease. The best-effort protection is also able to respond more gracefully to invalid targets (e.g. by picking the next-best target). However, the check in the Replica proposal buffer is the only place where the protection is airtight against race conditions, because the check is performed:

1. by the current Raft leader, else the proposal will fail
2. while holding latches that prevent interleaving log truncation

### Remaining Work

With this change, there is a single known race which can lead to an incoming leaseholder needing a snapshot. This is the case when a leader/leaseholder transfers the lease and then quickly loses Raft leadership to a peer that has a shorter log. Even if the old leader could have caught the incoming leaseholder up on its log, the new leader may not be able to, because its log may not go back as far. Such a scenario has been possible since we stopped ensuring that all replicas have logs that start at the same index. For more details, see the discussion about cockroachdb#35701 in cockroachdb#81561. This race is theoretical; we have not seen it in practice. It's not clear whether we will try to address it or rely on a mitigation like the one described in cockroachdb#81764 to limit its blast radius.

----

Release note (bug fix): Range lease transfers are no longer permitted to follower replicas that may require a Raft snapshot. This ensures that lease transfers are never delayed behind snapshots, which could previously create range unavailability until the snapshot completed. Lease transfers now err on the side of caution and are only allowed when the outgoing leaseholder can guarantee that the incoming leaseholder does not need a snapshot.
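The commit message names the two checks but not their mechanics. Below is a minimal, self-contained Go sketch of the snapshot-risk test; the types and function names (`FollowerState`, `needsSnapshot`, `checkTransferSafe`) are hypothetical, not CockroachDB's actual API. The real check is performed by the Raft leader while holding latches, as described above; the core idea is that a follower whose replicated log prefix ends before the leader's first retained entry can only be caught up by a snapshot, so the transfer is rejected.

```go
// Illustrative sketch only; not CockroachDB's lease-transfer code.
package main

import (
	"errors"
	"fmt"
)

// FollowerState is a hypothetical summary of a follower's Raft progress,
// as seen by the leader.
type FollowerState struct {
	Match      uint64 // highest log index known to be replicated to the follower
	FirstIndex uint64 // leader's first retained log index (entries before it are truncated)
}

// needsSnapshot reports whether catching the follower up would require a
// Raft snapshot rather than ordinary log entry replication. The follower
// needs entry Match+1 next; if the leader has truncated it, only a
// snapshot can reconnect the follower's log.
func needsSnapshot(f FollowerState) bool {
	return f.Match+1 < f.FirstIndex
}

var errSnapshotNeeded = errors.New("lease transfer target may need a Raft snapshot")

// checkTransferSafe mirrors the proposal-buffer-style decision: reject the
// transfer rather than strand the lease on a replica that cannot apply it.
func checkTransferSafe(target FollowerState) error {
	if needsSnapshot(target) {
		return errSnapshotNeeded
	}
	return nil
}

func main() {
	caughtUp := FollowerState{Match: 100, FirstIndex: 50}
	lagging := FollowerState{Match: 10, FirstIndex: 50}
	fmt.Println(checkTransferSafe(caughtUp)) // <nil>
	fmt.Println(checkTransferSafe(lagging))  // lease transfer target may need a Raft snapshot
}
```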
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue on Jun 21, 2022, with the same commit message as above.
craig bot pushed a commit that referenced this issue on Jun 22, 2022:
82560: sql: removed redundant parameter in method to schedule sql stats compaction r=rafiss a=surahman

The `crdb_internal.schedule_sql_stats_compaction` SQL function does not require the `byte` string parameter, so it has been removed. Closes #78332. Jira issue: [CRDB-14071](https://cockroachlabs.atlassian.net/browse/CRDB-14071) `@Azhng`

82758: kv: don't allow lease transfers to replicas in need of snapshot r=nvanbenschoten a=nvanbenschoten

Fixes #81763. Fixes #79385. Part of #81561. The full background, patch details, remaining work, and release note are given in the commit message above; in short, lease transfers are rejected, both best-effort in `AdminTransferLease` and authoritatively in the Replica proposal buffer, unless the outgoing leaseholder can guarantee that the target does not need a Raft snapshot.

83109: asim: batch workload events r=kvoli a=kvoli

This patch introduces batching for load events. Previously, load events were generated per key and applied individually to the simulator state by finding the range containing that key. This patch batches load events on the same key, then applies the batched events in ascending key order over the range tree (see the sketch below). This results in a 5x speedup on a cluster with 32 stores, 32k replicas, and 16k QPS.

Release note: None

Co-authored-by: Saad Ur Rahman <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Austen McClernon <[email protected]>
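Here is a minimal Go sketch of the batching idea in 83109; `LoadEvent` and `batchByKey` are hypothetical stand-ins, not the simulator's real types. Per-key events are merged into one event per key and sorted, so the simulator can make a single in-order pass over the range tree instead of one range lookup per event.

```go
// Illustrative sketch only; not the asim package's actual implementation.
package main

import (
	"fmt"
	"sort"
)

// LoadEvent is a hypothetical stand-in for a per-key workload event.
type LoadEvent struct {
	Key string
	QPS int
}

// batchByKey merges events that share a key and returns the batch sorted
// in ascending key order, ready to be applied in one pass.
func batchByKey(events []LoadEvent) []LoadEvent {
	agg := make(map[string]int)
	for _, e := range events {
		agg[e.Key] += e.QPS
	}
	batched := make([]LoadEvent, 0, len(agg))
	for key, qps := range agg {
		batched = append(batched, LoadEvent{Key: key, QPS: qps})
	}
	sort.Slice(batched, func(i, j int) bool { return batched[i].Key < batched[j].Key })
	return batched
}

func main() {
	events := []LoadEvent{{"b", 1}, {"a", 2}, {"b", 3}}
	fmt.Println(batchByKey(events)) // [{a 2} {b 4}]
}
```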
irfansharif added a commit to irfansharif/cockroach that referenced this issue on Aug 6, 2022:
Fixes cockroachdb#81764. In addition to ranges that unconditionally require expiration-based leases (node liveness and earlier), we now also use them during lease transfers for all other ranges. After acquiring such expiration-based leases, the leaseholders are expected to soon upgrade them to the more efficient epoch-based ones. By transferring an expiration-based lease, we can limit the effect of an ill-advised lease transfer, since the incoming leaseholder needs to recognize itself as such within a few seconds; if it doesn't (we accidentally sent the lease to a replica in need of a snapshot), the lease is up for grabs. If we simply transferred epoch-based leases, it would be possible for a new leaseholder in need of a snapshot to maintain its lease as long as the node it is on is able to heartbeat its liveness record. Release note: None.
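As a rough illustration of this handoff, here is a minimal Go sketch; `Lease`, `transferLease`, and `maybeUpgrade` are hypothetical names, not CockroachDB's lease machinery. The transferred lease carries a short expiration, so an incoming leaseholder that never applies the transfer simply lets the lease lapse, while a healthy one upgrades it to an epoch-based lease.

```go
// Illustrative sketch only; not CockroachDB's actual lease code.
package main

import (
	"fmt"
	"time"
)

type LeaseType int

const (
	Expiration LeaseType = iota
	Epoch
)

type Lease struct {
	Type      LeaseType
	Holder    string
	ExpiresAt time.Time // only meaningful for expiration-based leases
}

// transferLease builds the lease handed to the transfer target. Even for
// ranges that normally use epoch-based leases, the transferred lease is
// expiration-based, bounding the damage of an ill-advised transfer: if
// the target never applies it, it expires within `interval`.
func transferLease(target string, now time.Time, interval time.Duration) Lease {
	return Lease{Type: Expiration, Holder: target, ExpiresAt: now.Add(interval)}
}

// maybeUpgrade models what the incoming leaseholder does once it applies
// the transfer: promote the lease to the more efficient epoch-based kind.
func maybeUpgrade(l Lease) Lease {
	if l.Type == Expiration {
		l.Type = Epoch
	}
	return l
}

func main() {
	l := transferLease("n2", time.Now(), 6*time.Second)
	fmt.Printf("transferred: %+v\n", l)
	fmt.Printf("upgraded:    %+v\n", maybeUpgrade(l))
}
```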
irfansharif added a commit to irfansharif/cockroach that referenced this issue on Aug 10, 2022, with the same commit message.
irfansharif added two more commits to irfansharif/cockroach that referenced this issue on Aug 19, 2022, with the same commit message.
craig bot pushed a commit that referenced this issue on Aug 20, 2022:
85629: kvserver: always transfer expiration-based leases r=irfansharif a=irfansharif

Fixes #81764. In addition to ranges that unconditionally require expiration-based leases (node liveness and earlier), we now also use them during lease transfers for all other ranges. After acquiring such expiration-based leases, the leaseholders are expected to soon upgrade them to the more efficient epoch-based ones. By transferring an expiration-based lease, we can limit the effect of an ill-advised lease transfer, since the incoming leaseholder needs to recognize itself as such within a few seconds; if it doesn't (we accidentally sent the lease to a replica in need of a snapshot), the lease is up for grabs. If we simply transferred epoch-based leases, it would be possible for a new leaseholder in need of a snapshot to maintain its lease as long as the node it is on is able to heartbeat its liveness record.

Release note: None. Release justification: stability work

86269: ttl: validate ttl_expiration_expression is a TIMESTAMPTZ r=ajwerner,postamar a=rafiss

Fixes #86410. Fixes #85531. Fixes #84725. Previously, the only validation was that the expression is valid SQL syntax. Now the expression must also be a TIMESTAMPTZ, which prevents errors in the TTL job when comparing against the cutoff. Validation is also added so that a column cannot be dropped, or altered in type, if it is used in the TTL expression. If the column is renamed, the expression is automatically updated.

Release note: None Release justification: low risk changes to validation

Co-authored-by: irfan sharif <[email protected]>
Co-authored-by: Evan Wall <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
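As a rough illustration of the 86269 validation rules, here is a minimal Go sketch; `TTLExpr`, `validateTTLExpr`, and `checkDropColumn` are hypothetical, and CockroachDB performs this through its SQL type-checking rather than anything like the code below. The point is simply that the expression must resolve to TIMESTAMPTZ and that columns it references cannot be dropped.

```go
// Illustrative sketch only; not CockroachDB's SQL-layer validation.
package main

import (
	"errors"
	"fmt"
)

// ColumnType is a hypothetical stand-in for a SQL result type.
type ColumnType string

const TimestampTZ ColumnType = "TIMESTAMPTZ"

// TTLExpr is a hypothetical parsed TTL expression: its result type plus
// the set of columns it references.
type TTLExpr struct {
	ResultType ColumnType
	ColumnRefs map[string]struct{}
}

// validateTTLExpr enforces the TIMESTAMPTZ requirement so the TTL job can
// compare the expression's value against the cutoff.
func validateTTLExpr(e TTLExpr) error {
	if e.ResultType != TimestampTZ {
		return fmt.Errorf("ttl_expiration_expression must be %s, got %s", TimestampTZ, e.ResultType)
	}
	return nil
}

// checkDropColumn rejects dropping a column the TTL expression depends on.
func checkDropColumn(e TTLExpr, col string) error {
	if _, ok := e.ColumnRefs[col]; ok {
		return errors.New("cannot drop column used in ttl_expiration_expression")
	}
	return nil
}

func main() {
	expr := TTLExpr{ResultType: TimestampTZ, ColumnRefs: map[string]struct{}{"expires_at": {}}}
	fmt.Println(validateTTLExpr(expr))               // <nil>
	fmt.Println(checkDropColumn(expr, "expires_at")) // cannot drop column used in ttl_expiration_expression
}
```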
Action item from #81561.
Described in #81561 (comment).
Jira issue: CRDB-16064
Epic CRDB-16160