kvserver: mitigate chronic lagging rangefeeds due to rac2 send queues #136214
Closed
Labels
A-kv-rangefeed
Rangefeed infrastructure, server+client
A-replication-admission-control-v2
Related to introduction of replication AC v2
branch-release-25.1
C-enhancement
Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
GA-blocker
T-kv
KV Team
Comments
kvoli added the C-enhancement, T-kv, A-kv-rangefeed, and A-replication-admission-control-v2 labels on Nov 26, 2024
kvoli changed the title from "kvserver: mitigate chronic lagging changefeeds due to rac2 send queues" to "kvserver: mitigate chronic lagging rangefeeds due to rac2 send queues" on Nov 26, 2024
Hi @kvoli, please add branch-* labels to identify which branch(es) this GA-blocker affects. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
Related issue: #119490
kvoli added a commit to kvoli/cockroach that referenced this issue on Dec 5, 2024:

See comments in this patch. To run the test added, which verifies the rangefeed moves:

```bash
dev test pkg/kv/kvserver -v --vmodule='replica_raft=1,kvflowcontroller=2,replica_proposal_buf=1,raft_transport=2,kvflowdispatch=1,kvadmission=1,kvflowhandle=1,work_queue=1,replica_flow_control=1,tracker=1,client_raft_helpers_test=1,raft=1,admission=1,replica_flow_control=1,work_queue=1,replica_raft=1,replica_proposal_buf=1,raft_transport=2,kvadmission=1,work_queue=1,replica_flow_control=1,client_raft_helpers_test=1,range_controller=2,token_counter=2,token_tracker=2,processor=2,kvflowhandle=1' -f TestFlowControlSendQueueRangeFeed --show-logs --rewrite --timeout=100s
```

To verify that the rangefeed doesn't move without this patch, I used this diff, which hardcodes the threshold (the one the main logic added here is gated on) to be longer than the test timeout:

```diff
diff --git a/pkg/kv/kvserver/replica_rangefeed.go b/pkg/kv/kvserver/replica_rangefeed.go
index ad2bc7967b9..91c4ed2a50a 100644
--- a/pkg/kv/kvserver/replica_rangefeed.go
+++ b/pkg/kv/kvserver/replica_rangefeed.go
@@ -907,7 +907,7 @@ func (r *Replica) handleClosedTimestampUpdateRaftMuLocked(
 	// and be more easily understood.
 	// TODO(kvoli): Check that the mux rangefeed is not cancelling every
 	// rangefeed on a node when the signal fires, returning an error.
-	extremelySlowLagThresh := 20 * slowClosedTSThresh
+	extremelySlowLagThresh := 1 * time.Hour
 	extremelySlowCloseThresh := 10 * slowClosedTSThresh
 	shouldCancelTooFarBehind := false
 	extremelyFarBehindStartTime := time.Time{}
```

With that diff applied, the test times out waiting for the rangefeed to move:

```
goroutine 66 [sleep]:
time.Sleep(0x3b9aca00)
	GOROOT/src/runtime/time.go:195 +0xfc
github.com/cockroachdb/cockroach/pkg/util/retry.ForDuration(0xa7a358200, 0x1400e686d20)
	pkg/util/retry/retry.go:220 +0x78
github.com/cockroachdb/cockroach/pkg/testutils.SucceedsWithinError(0x1400689f960, 0xa7a358200)
	pkg/testutils/soon.go:77 +0x60
github.com/cockroachdb/cockroach/pkg/testutils.SucceedsWithin({0x10d07cbc8, 0x14004da81a0}, 0x1400689f960, 0xa7a358200)
	pkg/testutils/soon.go:56 +0x44
github.com/cockroachdb/cockroach/pkg/testutils.SucceedsSoon({0x10d07cbc8, 0x14004da81a0}, 0x1400689f960)
	pkg/testutils/soon.go:38 +0x4c
github.com/cockroachdb/cockroach/pkg/kv/kvserver_test.TestFlowControlSendQueueRangeFeed(0x14004da81a0)
	pkg/kv/kvserver/flow_control_integration_test.go:5903 +0x10b8
testing.tRunner(0x14004da81a0, 0x10d019568)
	GOROOT/src/testing/testing.go:1689 +0xec
created by testing.(*T).Run in goroutine 1
	GOROOT/src/testing/testing.go:1742 +0x318
```

There's an open question: does the mux rangefeed cancel every rangefeed on a node when it receives an error? That seems broken, and if it is true, I could see why the stuck rangefeed watcher needed to go. But then, how do replica removals work? Does every rangefeed need to be restarted?

Here are some sparse notes I took, in case they are of any use:

- Initial replica selection uses `OptimizeReplicaOrder`, same as other DistSender requests, called from `newTransportForRange` (ordering: locality > latency > attr affinity):
  https://github.com/cockroachdb/cockroach/blob/b4280f7c6613f362791478696adbdf2cc67dd874/pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go#L613-L613
- The store-local replica rangefeed entry registration (post-store routing) is `Replica.RangeFeed`:
  https://github.com/cockroachdb/cockroach/blob/b4280f7c6613f362791478696adbdf2cc67dd874/pkg/kv/kvserver/replica_rangefeed.go#L242-L242
- Closed timestamp updater:
  https://github.com/cockroachdb/cockroach/blob/b4280f7c6613f362791478696adbdf2cc67dd874/pkg/kv/kvserver/store.go#L2472-L2472

Informs: cockroachdb#136214

Release note: None
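The replica-selection note above can be made concrete with a small toy example. This is only an illustrative sketch of the stated "locality > latency > attr affinity" preference, not CockroachDB's actual `OptimizeReplicaOrder`; the `candidate` struct and its fields are invented for the example:

```go
// Toy ordering of rangefeed candidate replicas: prefer closer locality,
// then lower latency, then attribute affinity. Illustrative only.
package main

import (
	"fmt"
	"sort"
	"time"
)

type candidate struct {
	node            string
	localityMatches int           // locality tiers shared with the client; higher is better
	latency         time.Duration // measured RTT; lower is better
	attrMatches     int           // matching node attributes; higher is better
}

func orderCandidates(cs []candidate) {
	sort.SliceStable(cs, func(i, j int) bool {
		a, b := cs[i], cs[j]
		if a.localityMatches != b.localityMatches {
			return a.localityMatches > b.localityMatches
		}
		if a.latency != b.latency {
			return a.latency < b.latency
		}
		return a.attrMatches > b.attrMatches
	})
}

func main() {
	cs := []candidate{
		{"n1", 1, 2 * time.Millisecond, 0},
		{"n2", 2, 9 * time.Millisecond, 1},
		{"n3", 2, 4 * time.Millisecond, 0},
	}
	orderCandidates(cs)
	fmt.Println(cs[0].node) // n3: best locality match, then lowest latency
}
```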
kvoli added a commit to kvoli/cockroach that referenced this issue on Dec 12, 2024:

When a rangefeed's closed timestamp lags behind the current time, any writes that have occurred in between will not be emitted. This is problematic in cases where the lag is significant and chronic, as consumers (changefeeds, logical data replication, physical cluster replication) are likewise delayed in their processing. Observing a rangefeed with a chronically lagging closed timestamp will become relatively more likely with quorum replication flow control, as entries are deliberately queued, instead of being sent, to stores which do not have sufficient send tokens.

This commit (re)introduces the concept of cancelling lagging rangefeeds, so that they may be replanned and retried on another replica. The other replica may also have this issue; however, there should be at least a quorum of voting replicas with a similar closed timestamp that would be suitable. The replanning on a different replica is already handled by existing machinery. This commit introduces an observer which generates a signal indicating that the rangefeed should be cancelled. The signal also encapsulates the existing logic to nudge a rangefeed.

The criteria for cancelling a rangefeed are governed by two thresholds, defined as cluster settings:

```
kv.rangefeed.lagging_closed_timestamp_cancel_multiple
(default = 20; 20 x closed ts target duration = 60s)
```

```
kv.rangefeed.lagging_closed_timestamp_cancel_min_lagging_duration
(default = 60s)
```

When a replica's closed timestamp has sustained lag greater than

```
kv.rangefeed.lagging_closed_timestamp_cancel_multiple * kv.closed_timestamp.target_duration
```

for at least the `kv.rangefeed.lagging_closed_timestamp_cancel_min_lagging_duration` duration, the rangefeed will be cancelled and then re-planned on the client.

This can be visualized in the following diagram. There is an initial spike over the lag threshold, which is recovered from, so the rangefeed wouldn't be cancelled. The second drop below the lag threshold is sustained for longer than the duration threshold, so the rangefeed is then cancelled for replanning:

```
lag=0         ─────────────────────────────────────────────────────

observed lag  ─────────┐
                       │
                       │
                       │     ┌───────┐
lag threshold ─────────┼─────┼───────┼─────────────────────────────
                       │     │       └───┐
                       │     │           └─────┐
                       └─────┘                 └──────┐
                                                      └────────────
                                     ◄────────────────────────────►
                                       exceeds duration threshold
```

TODO(kvoli):

- metric for cancellation signal triggering server side
- new error value for cancel reason that is specific to lag
- roachtest which simulates a

Fixes: cockroachdb#136214

Release note: None
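As a rough sketch of the cancellation criteria described above (illustrative only; the `laggingObserver` type and its field names are invented for this example and are not the code added by the commit), the decision amounts to recording when the closed timestamp lag first exceeded `cancel_multiple * target_duration` and signalling cancellation once that state has persisted for the minimum lagging duration:

```go
// Minimal sketch of a "sustained lag" cancellation signal. Identifiers are
// invented and do not correspond to CockroachDB's actual implementation.
package main

import (
	"fmt"
	"time"
)

type laggingObserver struct {
	cancelMultiple     int64         // kv.rangefeed.lagging_closed_timestamp_cancel_multiple
	targetDuration     time.Duration // kv.closed_timestamp.target_duration
	minLaggingDuration time.Duration // kv.rangefeed.lagging_closed_timestamp_cancel_min_lagging_duration

	exceededSince time.Time // zero while lag is at or below the threshold
}

// observe is called on each closed timestamp update. It returns true once the
// lag has stayed above the threshold for at least minLaggingDuration,
// signalling that the rangefeed should be cancelled and replanned.
func (o *laggingObserver) observe(now, closedTS time.Time) bool {
	lag := now.Sub(closedTS)
	threshold := time.Duration(o.cancelMultiple) * o.targetDuration

	if lag <= threshold {
		// Lag recovered: reset the sustained-lag clock.
		o.exceededSince = time.Time{}
		return false
	}
	if o.exceededSince.IsZero() {
		o.exceededSince = now
	}
	return now.Sub(o.exceededSince) >= o.minLaggingDuration
}

func main() {
	o := &laggingObserver{cancelMultiple: 20, targetDuration: 3 * time.Second, minLaggingDuration: 60 * time.Second}
	start := time.Now()
	// Closed timestamp stuck two minutes behind: only after the lag has been
	// sustained for 60s does the observer ask for cancellation.
	fmt.Println(o.observe(start, start.Add(-2*time.Minute)))                     // false (just started lagging)
	fmt.Println(o.observe(start.Add(61*time.Second), start.Add(-2*time.Minute))) // true (sustained >= 60s)
}
```

Resetting the sustained-lag clock whenever the lag dips back below the threshold is what distinguishes the brief initial spike in the diagram (not cancelled) from the sustained second dip (cancelled).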
kvoli added a commit to kvoli/cockroach that referenced this issue on Dec 18, 2024:

Add a new counter metric, `kv.rangefeed.closed_timestamp.slow_ranges.cancelled`, which is incremented each time a rangefeed is cancelled server-side due to a chronically lagging closed timestamp (see cockroachdb#137531).

Part of: cockroachdb#136214

Release note: None
craig bot pushed a commit that referenced this issue on Dec 18, 2024:

137531: kv: replan rangefeeds with chronic closed ts lag r=sumeerbhola,wenyihu6 a=kvoli

When a rangefeed's closed timestamp lags behind the current time, any writes that have occurred in between will not be emitted. This is problematic in cases where the lag is significant and chronic, as consumers (changefeeds, logical data replication, physical cluster replication) are likewise delayed in their processing. Observing a rangefeed with a chronically lagging closed timestamp will become relatively more likely with quorum replication flow control, as entries are deliberately queued, instead of being sent, to stores which do not have sufficient send tokens.

This commit (re)introduces the concept of cancelling lagging rangefeeds, so that they may be replanned and retried on another replica. The other replica may also have this issue; however, there should be at least a quorum of voting replicas with a similar closed timestamp that would be suitable. The replanning on a different replica is already handled by existing machinery. This commit introduces an observer which generates a signal indicating that the rangefeed should be cancelled. The signal also encapsulates the existing logic to nudge a rangefeed.

The criteria for cancelling a rangefeed are governed by two thresholds, defined as cluster settings:

```
kv.rangefeed.lagging_closed_timestamp_cancel_multiple
(default = 20; 20 x closed ts target duration = 60s)
```

```
kv.rangefeed.lagging_closed_timestamp_cancel_min_lagging_duration
(default = 60s)
```

When a replica's closed timestamp has sustained lag greater than

```
kv.rangefeed.lagging_closed_timestamp_cancel_multiple * kv.closed_timestamp.target_duration
```

for at least the `kv.rangefeed.lagging_closed_timestamp_cancel_min_lagging_duration` duration, the rangefeed will be cancelled and then re-planned on the client.

This can be visualized in the following diagram. There is an initial spike over the lag threshold, which is recovered from, so the rangefeed wouldn't be cancelled. The second drop below the lag threshold is sustained for longer than the duration threshold, so the rangefeed is then cancelled for replanning:

```
lag=0         ─────────────────────────────────────────────────────

observed lag  ─────────┐
                       │
                       │
                       │     ┌───────┐
lag threshold ─────────┼─────┼───────┼─────────────────────────────
                       │     │       └───┐
                       │     │           └─────┐
                       └─────┘                 └──────┐
                                                      └────────────
                                     ◄────────────────────────────►
                                       exceeds duration threshold
```

Note that we could also prevent accepting a rangefeed registration if the lag were sufficient; however, the behavior change here applies only to lag that has been observed to be sustained over time. Without historical data, we cannot apply identical decision logic at registration.

---

kvserver: add metric for rangefeed cancellations due to lag

Add a new counter metric, `kv.rangefeed.closed_timestamp.slow_ranges.cancelled`, which is incremented each time a rangefeed is cancelled server-side due to a chronically lagging closed timestamp (see #137531).

Fixes: #136214

Release note: None

Co-authored-by: Austen McClernon <[email protected]>
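A quick back-of-the-envelope check of the defaults quoted in the merged commit message above. This is a sketch for illustration only; the 3s value for `kv.closed_timestamp.target_duration` is an assumption inferred from the "20 x closed ts target duration = 60s" note, not stated independently here:

```go
// Compute the effective lag threshold and sustain duration from the quoted
// defaults. The 3s target duration is an assumption (see lead-in above).
package main

import (
	"fmt"
	"time"
)

func main() {
	const cancelMultiple = 20                   // kv.rangefeed.lagging_closed_timestamp_cancel_multiple
	const targetDuration = 3 * time.Second      // kv.closed_timestamp.target_duration (assumed default)
	const minLaggingDuration = 60 * time.Second // kv.rangefeed.lagging_closed_timestamp_cancel_min_lagging_duration

	lagThreshold := cancelMultiple * targetDuration
	fmt.Println(lagThreshold)       // 1m0s: closed timestamp lag that counts as "too far behind"
	fmt.Println(minLaggingDuration) // 1m0s: how long that lag must persist before cancellation
}
```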
Now that entries can be arbitrarily delayed by RACv2 pacing, it is more likely that a rangefeed planned on a chronically behind store will have correspondingly chronic rangefeed lag.
This issue tracks mitigating chronic rangefeed lag that results from RACv2 pacing entries at quorum speed.
Jira issue: CRDB-44928