
kvserver: mitigate chronic lagging rangefeeds due to rac2 send queues #136214

Closed
kvoli opened this issue Nov 26, 2024 · 2 comments · Fixed by #137531 or mohini-crl/cockroach#8
Assignees
Labels
A-kv-rangefeed Rangefeed infrastructure, server+client A-replication-admission-control-v2 Related to introduction of replication AC v2 branch-release-25.1 C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) GA-blocker T-kv KV Team

Comments

@kvoli
Collaborator

kvoli commented Nov 26, 2024

Now that entries can be arbitrarily delayed by RACv2 pacing, a rangefeed planned on a chronically behind store is more likely to exhibit correspondingly chronic rangefeed lag.

This issue tracks mitigating chronic rangefeed lag that results from RACv2 pacing entries at quorum speed.

Jira issue: CRDB-44928

@kvoli kvoli added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-kv KV Team A-kv-rangefeed Rangefeed infrastructure, server+client A-replication-admission-control-v2 Related to introduction of replication AC v2 labels Nov 26, 2024
@kvoli kvoli changed the title kvserver: mitigate chronic lagging changefeeds due to rac2 send queues kvserver: mitigate chronic lagging rangefeeds due to rac2 send queues Nov 26, 2024

blathers-crl bot commented Dec 3, 2024

Hi @kvoli, please add branch-* labels to identify which branch(es) this GA-blocker affects.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@sumeerbhola
Collaborator

related issue #119490

kvoli added a commit to kvoli/cockroach that referenced this issue Dec 5, 2024
kvoli added a commit to kvoli/cockroach that referenced this issue Dec 5, 2024
See comments in this patch. To run the added test, which verifies that the rangefeed moves:

```bash
dev test pkg/kv/kvserver -v --vmodule='replica_raft=1,kvflowcontroller=2,replica_proposal_buf=1,raft_transport=2,kvflowdispatch=1,kvadmission=1,kvflowhandle=1,work_queue=1,replica_flow_control=1,tracker=1,client_raft_helpers_test=1,raft=1,admission=1,replica_flow_control=1,work_queue=1,replica_raft=1,replica_proposal_buf=1,raft_transport=2,kvadmission=1,work_queue=1,replica_flow_control=1,client_raft_helpers_test=1,range_controller=2,token_counter=2,token_tracker=2,processor=2,kvflowhandle=1' -f  TestFlowControlSendQueueRangeFeed --show-logs --rewrite --timeout=100s
```

To verify that the rangefeed doesn't move without this patch, I used the following diff, which hardcodes the threshold gating the main logic added here to be longer than the test timeout:

```diff
diff --git a/pkg/kv/kvserver/replica_rangefeed.go b/pkg/kv/kvserver/replica_rangefeed.go
index ad2bc7967b9..91c4ed2a50a 100644
--- a/pkg/kv/kvserver/replica_rangefeed.go
+++ b/pkg/kv/kvserver/replica_rangefeed.go
@@ -907,7 +907,7 @@ func (r *Replica) handleClosedTimestampUpdateRaftMuLocked(
 	// and be more easily understood.
 	// TODO(kvoli): Check that the mux rangefeed is not cancelling every
 	// rangefeed on a node when the signal fires, returning an error.
-	extremelySlowLagThresh := 20 * slowClosedTSThresh
+	extremelySlowLagThresh := 1 * time.Hour
 	extremelySlowCloseThresh := 10 * slowClosedTSThresh
 	shouldCancelTooFarBehind := false
 	extremelyFarBehindStartTime := time.Time{}
```

The test then times out waiting for the rangefeed to move:

```
goroutine 66 [sleep]:
time.Sleep(0x3b9aca00)
	GOROOT/src/runtime/time.go:195 +0xfc
github.com/cockroachdb/cockroach/pkg/util/retry.ForDuration(0xa7a358200, 0x1400e686d20)
	pkg/util/retry/retry.go:220 +0x78
github.com/cockroachdb/cockroach/pkg/testutils.SucceedsWithinError(0x1400689f960, 0xa7a358200)
	pkg/testutils/soon.go:77 +0x60
github.com/cockroachdb/cockroach/pkg/testutils.SucceedsWithin({0x10d07cbc8, 0x14004da81a0}, 0x1400689f960, 0xa7a358200)
	pkg/testutils/soon.go:56 +0x44
github.com/cockroachdb/cockroach/pkg/testutils.SucceedsSoon({0x10d07cbc8, 0x14004da81a0}, 0x1400689f960)
	pkg/testutils/soon.go:38 +0x4c
github.com/cockroachdb/cockroach/pkg/kv/kvserver_test.TestFlowControlSendQueueRangeFeed(0x14004da81a0)
	pkg/kv/kvserver/flow_control_integration_test.go:5903 +0x10b8
testing.tRunner(0x14004da81a0, 0x10d019568)
	GOROOT/src/testing/testing.go:1689 +0xec
created by testing.(*T).Run in goroutine 1
	GOROOT/src/testing/testing.go:1742 +0x318
```

There's an open question: does the mux rangefeed cancel every rangefeed on a node when it receives an error? That seems broken, and if it is true, I can see why the stuck rangefeed watcher needed to go. But then, how do replica removals work? Does every rangefeed need to be restarted?

Here are some sparse notes I took, in case they are of any use:

Initial replica selection uses OptimizeReplicaOrder, the same as other DistSender requests, called from newTransportForRange:
- https://github.com/cockroachdb/cockroach/blob/b4280f7c6613f362791478696adbdf2cc67dd874/pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go#L613-L613
- locality > latency > attr affinity
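
For illustration only, here is a minimal Go sketch of ordering candidate replicas by that priority (locality match, then latency, then attribute affinity); the struct and helper names are assumptions for this example, not the actual OptimizeReplicaOrder code:

```go
// Illustrative only: a simplified ordering of candidate replicas by
// locality match, then measured latency, then attribute affinity,
// mirroring the priority noted above. Not the real OptimizeReplicaOrder.
package main

import (
	"fmt"
	"sort"
	"time"
)

type replicaCandidate struct {
	nodeID        int
	localityMatch int           // matching locality tiers with the client (higher is better)
	latency       time.Duration // measured RTT to the node (lower is better)
	attrMatches   int           // matching node/store attributes (higher is better)
}

// orderReplicas sorts candidates so the most preferred replica comes first.
func orderReplicas(cands []replicaCandidate) {
	sort.SliceStable(cands, func(i, j int) bool {
		a, b := cands[i], cands[j]
		if a.localityMatch != b.localityMatch {
			return a.localityMatch > b.localityMatch
		}
		if a.latency != b.latency {
			return a.latency < b.latency
		}
		return a.attrMatches > b.attrMatches
	})
}

func main() {
	cands := []replicaCandidate{
		{nodeID: 1, localityMatch: 1, latency: 2 * time.Millisecond, attrMatches: 0},
		{nodeID: 2, localityMatch: 2, latency: 9 * time.Millisecond, attrMatches: 1},
		{nodeID: 3, localityMatch: 2, latency: 4 * time.Millisecond, attrMatches: 0},
	}
	orderReplicas(cands)
	fmt.Println(cands[0].nodeID) // 3: best locality match, then lowest latency
}
```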

The store-local replica rangefeed registration (post-store routing) is Replica.RangeFeed:
- https://github.com/cockroachdb/cockroach/blob/b4280f7c6613f362791478696adbdf2cc67dd874/pkg/kv/kvserver/replica_rangefeed.go#L242-L242

Closed timestamp updater:
- https://github.com/cockroachdb/cockroach/blob/b4280f7c6613f362791478696adbdf2cc67dd874/pkg/kv/kvserver/store.go#L2472-L2472

Informs: cockroachdb#136214
Release note: None
kvoli added a commit to kvoli/cockroach that referenced this issue Dec 5, 2024
@kvoli kvoli self-assigned this Dec 11, 2024
kvoli added a commit to kvoli/cockroach that referenced this issue Dec 12, 2024
kvoli added a commit to kvoli/cockroach that referenced this issue Dec 12, 2024
When a rangefeed's closed timestamp lags behind the current time, any
writes that have occurred in between will not be emitted. This is
problematic in cases where the lag is significant and chronic, as
consumers (changefeeds, logical data replication, physical cluster
replication) are likewise delayed in their processing. Observing a
rangefeed with a chronically lagging closed timestamp will become
relatively more likely with quorum replication flow control, as entries
are deliberately queued, instead of being sent, to stores which do not
have sufficient send tokens.

This commit (re)introduces the concept of cancelling lagging rangefeeds,
so that they may be replanned and retried on another replica. The other
replica may also have this issue; however, there should be at least a
quorum of voting replicas with a similar closed timestamp that would be
suitable.

The replanning on a different replica is already handled by existing
machinery. This commit introduces an observer which generates a signal
indicating that the rangefeed should be cancelled. The signal also
encapsulates the existing logic to nudge a rangefeed.

The criteria for cancelling a rangefeed are controlled by two thresholds,
defined as cluster settings:

```
kv.rangefeed.lagging_closed_timestamp_cancel_multiple
(default = 20 x closed ts target duration = 60s)
```

```
kv.rangefeed.lagging_closed_timestamp_cancel_min_lagging_duration
(default = 60s)
```

When a replica's closed timestamp has sustained lag greater than:

```
kv.rangefeed.lagging_closed_timestamp_cancel_multiple * kv.closed_timestamp.target_duration
```

For at least:

```
kv.rangefeed.lagging_closed_timestamp_cancel_min_lagging_duration
```

duration, the rangefeed will be cancelled and then replanned on the
client. This can be visualized in the following diagram, where an
initial excursion past the lag threshold recovers, so the rangefeed
isn't cancelled. The second excursion past the lag threshold is
sustained for longer than the duration threshold, so the rangefeed is
then cancelled for replanning:

````
lag=0          ─────────────────────────────────────────────────────

observed lag   ─────────┐
                        │
                        │
                        │     ┌───────┐
lag threshold  ─────────┼─────┼───────┼──────────────────────────────
                        │     │       └───┐
                        │     │           └─────┐
                        └─────┘                 └──────┐
                                                       └────────────
                                      ◄────────────────────────────►
                                         exceeds duration threshold
````
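
To make the criteria above concrete, here is a minimal, self-contained Go sketch of the sustained-lag decision; the observer shape, field names, and call pattern are illustrative assumptions, not the actual observer introduced by the commit:

```go
// A minimal sketch (not the CockroachDB implementation) of the
// sustained-lag cancellation decision described above.
package main

import (
	"fmt"
	"time"
)

// laggingRangefeedObserver tracks how long a replica's closed timestamp
// has continuously exceeded the lag threshold.
type laggingRangefeedObserver struct {
	// Hypothetical stand-ins for the cluster settings mentioned above.
	cancelMultiple     float64       // kv.rangefeed.lagging_closed_timestamp_cancel_multiple
	targetDuration     time.Duration // kv.closed_timestamp.target_duration
	minLaggingDuration time.Duration // ..._cancel_min_lagging_duration

	exceededSince time.Time // zero if the lag is currently under the threshold
}

// observe is called periodically with the current closed timestamp lag.
// It returns true once the lag has stayed above the threshold for at
// least minLaggingDuration, signalling that the rangefeed should be
// cancelled and replanned on another replica.
func (o *laggingRangefeedObserver) observe(now time.Time, lag time.Duration) bool {
	threshold := time.Duration(o.cancelMultiple * float64(o.targetDuration))
	if lag <= threshold {
		o.exceededSince = time.Time{} // lag recovered; reset the clock
		return false
	}
	if o.exceededSince.IsZero() {
		o.exceededSince = now // first observation over the threshold
	}
	return now.Sub(o.exceededSince) >= o.minLaggingDuration
}

func main() {
	o := &laggingRangefeedObserver{
		cancelMultiple:     20,
		targetDuration:     3 * time.Second, // 20 * 3s = 60s lag threshold
		minLaggingDuration: 60 * time.Second,
	}
	start := time.Now()
	// A brief excursion over the threshold recovers, so no cancellation...
	fmt.Println(o.observe(start, 90*time.Second))                       // false (clock starts)
	fmt.Println(o.observe(start.Add(30*time.Second), 10*time.Second))   // false (recovered, reset)
	// ...but sustained lag over the threshold eventually triggers it.
	fmt.Println(o.observe(start.Add(60*time.Second), 120*time.Second))  // false (clock restarts)
	fmt.Println(o.observe(start.Add(130*time.Second), 150*time.Second)) // true (70s > 60s)
}
```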

TODO(kvoli):
- metric for cancellation signal triggering server side
- new error value for cancel reason that is specific to lag
- reject planning rangefeed
- clean-up added tests
- roachtest which ensures many rangefeeds are on a specific store, limits
  that store's write bandwidth, and then asserts that the lag resolves via
  the rangefeed moving.

Fixes: cockroachdb#136214
Release note: None
kvoli added a commit to kvoli/cockroach that referenced this issue Dec 13, 2024
kvoli added a commit to kvoli/cockroach that referenced this issue Dec 13, 2024
kvoli added a commit to kvoli/cockroach that referenced this issue Dec 13, 2024
kvoli added a commit to kvoli/cockroach that referenced this issue Dec 13, 2024
kvoli added a commit to kvoli/cockroach that referenced this issue Dec 13, 2024
Note that we could also prevent accepting a rangefeed registration if the
lag were sufficient; however, the behavior change here applies only to lag
which has been observed to be sustained over time. Without historical
data, we cannot apply identical decision logic at registration.

Fixes: cockroachdb#136214
Release note: None
kvoli added a commit to kvoli/cockroach that referenced this issue Dec 13, 2024
kvoli added a commit to kvoli/cockroach that referenced this issue Dec 13, 2024
kvoli added a commit to kvoli/cockroach that referenced this issue Dec 16, 2024
kvoli added a commit to kvoli/cockroach that referenced this issue Dec 16, 2024
kvoli added a commit to kvoli/cockroach that referenced this issue Dec 17, 2024
kvoli added a commit to kvoli/cockroach that referenced this issue Dec 17, 2024
kvoli added a commit to kvoli/cockroach that referenced this issue Dec 18, 2024
Add a new counter metric,
`kv.rangefeed.closed_timestamp.slow_ranges.cancelled`, which is
incremented each time a rangefeed is cancelled server-side due to a
chronically lagging closed timestamp (see cockroachdb#137531).

Part of: cockroachdb#136214
Release note: None
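
As a rough, generic illustration of the counter's shape (using a plain atomic counter rather than CockroachDB's metric registry; the function and variable names here are hypothetical):

```go
// Illustrative only: a process-wide counter bumped each time a rangefeed
// is cancelled for a chronically lagging closed timestamp. The real
// change wires this through CockroachDB's metric registry instead.
package main

import (
	"fmt"
	"sync/atomic"
)

var rangefeedSlowRangesCancelled atomic.Int64

// onRangefeedCancelledForLag would be called from the server-side
// cancellation path described above.
func onRangefeedCancelledForLag() {
	rangefeedSlowRangesCancelled.Add(1)
}

func main() {
	onRangefeedCancelledForLag()
	fmt.Println(rangefeedSlowRangesCancelled.Load()) // 1
}
```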
kvoli added a commit to kvoli/cockroach that referenced this issue Dec 18, 2024
kvoli added a commit to kvoli/cockroach that referenced this issue Dec 18, 2024
kvoli added a commit to kvoli/cockroach that referenced this issue Dec 18, 2024
kvoli added a commit to kvoli/cockroach that referenced this issue Dec 18, 2024
Add a new counter metric,
`kv.rangefeed.closed_timestamp.slow_ranges.cancelled`, which is
incremented each time a rangefeed is cancelled server-side due to a
chronically lagging closed timestamp (see cockroachdb#137531).

Part of: cockroachdb#136214
Release note: None
kvoli added a commit to kvoli/cockroach that referenced this issue Dec 18, 2024
When a rangefeed's closed timestamp lags behind the current time, any
writes that have occurred in-between will not be emitted. This is
problematic in cases where the lag is significant and chronic, as
consumers (changefeeds, logical data replication, physical cluster
replication) are likewise delayed in their processing. Observing a
rangefeed with a chronic lagging closed timestamp will become relatively
more likely with quorum replication flow control, as entries are
deliberately queued, instead of being sent, to stores which do not have
sufficient send tokens.

This commit (re)introduces the concept of cancelling lagging rangefeeds,
so that they may be replanned and retried on another replica. The other
replica may also have this issue, however there should be at least a
quorum of voting replicas with a similar closed timestamp that would be
suitable.

The replanning on a different replica is handled already by existing
machinery. This commit introduces an observer which generates a signal
indicating that the rangefeed should be cancelled. The signal also
encapsulates the existing logic to nudge a rangefeed as well.

The criteria for cancelling a rangefeed is influenced by two thresholds,
defined as cluster settings:

```
kv.rangefeed.lagging_closed_timestamp_cancel_multiple
(default = 20 x closed ts target duration = 60s)
```

```
kv.rangefeed.lagging_closed_timestamp_cancel_min_lagging_duration
(default = 60s)
```

When a replica's closed timestamp has sustained lag greater than:

```
kv.rangefeed.lagging_closed_timestamp_cancel_multiple * kv.closed_timestamp.target_duration
```

For at least:

```
`kv.rangefeed.lagging_closed_timestamp_cancel_min_lagging_duration`
```

duration, the rangefeed will be cancelled and then re-planned on the
client. This can be visualized in the following diagram, where there is
an initial spike over the lag threshold, which is recovered from so the
rangefeed wouldn't be cancelled. The second drop below the lag
threshold is sustained for greater than the duration threshold, so the
rangefeed is then cancelled for replanning:

```
lag=0          ─────────────────────────────────────────────────────

observed lag   ─────────┐
                        │
                        │
                        │     ┌───────┐
lag threshold  ─────────┼─────┼───────┼──────────────────────────────
                        │     │       └───┐
                        │     │           └─────┐
                        └─────┘                 └──────┐
                                                       └────────────
                                      ◄────────────────────────────►
                                         exceeds duration threshold
```

Note we could also prevent accepting a rangefeed registration if the lag
were sufficient, however the behavior change here applies only to lag
which has been observed to be sustained over time. Without historical
data, we cannot apply identical decision logic on registration.

Fixes: cockroachdb#136214
Release note: None
kvoli added a commit to kvoli/cockroach that referenced this issue Dec 18, 2024
Add a new counter metric,
`kv.rangefeed.closed_timestamp.slow_ranges.cancelled`, which is
incremented each time a rangefeed is cancelled server-side due to a
chronically lagging closed timestamp (see cockroachdb#137531).

Part of: cockroachdb#136214
Release note: None
kvoli added a commit to kvoli/cockroach that referenced this issue Dec 18, 2024
Add a new counter metric,
`kv.rangefeed.closed_timestamp.slow_ranges.cancelled`, which is
incremented each time a rangefeed is cancelled server-side due to a
chronically lagging closed timestamp (see cockroachdb#137531).

Part of: cockroachdb#136214
Release note: None
craig bot pushed a commit that referenced this issue Dec 18, 2024
137531: kv: replan rangefeeds with chronic closed ts lag r=sumeerbhola,wenyihu6 a=kvoli

When a rangefeed's closed timestamp lags behind the current time, any writes that have occurred in-between will not be emitted. This is problematic in cases where the lag is significant and chronic, as consumers (changefeeds, logical data replication, physical cluster replication) are likewise delayed in their processing. Observing a rangefeed with a chronic lagging closed timestamp will become relatively more likely with quorum replication flow control, as entries are deliberately queued, instead of being sent, to stores which do not have sufficient send tokens.

This commit (re)introduces the concept of cancelling lagging rangefeeds, so that they may be replanned and retried on another replica. The other replica may also have this issue, however there should be at least a quorum of voting replicas with a similar closed timestamp that would be suitable.

The replanning on a different replica is handled already by existing machinery. This commit introduces an observer which generates a signal indicating that the rangefeed should be cancelled. The signal also encapsulates the existing logic to nudge a rangefeed as well.

The criteria for cancelling a rangefeed are influenced by two thresholds, defined as cluster settings:

```
kv.rangefeed.lagging_closed_timestamp_cancel_multiple
(default = 20 x closed ts target duration = 60s)
```

```
kv.rangefeed.lagging_closed_timestamp_cancel_min_lagging_duration
(default = 60s)
```

When a replica's closed timestamp has sustained lag greater than:

```
kv.rangefeed.lagging_closed_timestamp_cancel_multiple * kv.closed_timestamp.target_duration
```

For at least:

```
kv.rangefeed.lagging_closed_timestamp_cancel_min_lagging_duration
```

duration, the rangefeed will be cancelled and then re-planned on the client. This can be visualized in the following diagram, where there is an initial spike over the lag threshold that is recovered from, so the rangefeed wouldn't be cancelled. The second drop below the lag threshold is sustained for longer than the duration threshold, so the rangefeed is then cancelled for replanning:

```
lag=0          ─────────────────────────────────────────────────────

observed lag   ─────────┐
                        │
                        │
                        │     ┌───────┐
lag threshold  ─────────┼─────┼───────┼──────────────────────────────
                        │     │       └───┐
                        │     │           └─────┐
                        └─────┘                 └──────┐
                                                       └────────────
                                      ◄────────────────────────────►
                                         exceeds duration threshold
```

Note we could also prevent accepting a rangefeed registration if the lag were sufficient; however, the behavior change here applies only to lag which has been observed to be sustained over time. Without historical data, we cannot apply identical decision logic on registration.

---

kvserver: add metric for rangefeed cancellations due to lag

Add a new counter metric,
`kv.rangefeed.closed_timestamp.slow_ranges.cancelled`, which is
incremented each time a rangefeed is cancelled server-side due to a
chronically lagging closed timestamp (see #137531).

Fixes: #136214
Release note: None

Co-authored-by: Austen McClernon <[email protected]>
craig bot closed this as completed in ad6f23c Dec 18, 2024