
kvserver: mitigate chronic lagging rangefeeds due to rac2 send queues #136214

Closed
kvoli opened this issue Nov 26, 2024 · 2 comments · Fixed by #137531 or mohini-crl/cockroach#8
Assignees
Labels
A-kv-rangefeed Rangefeed infrastructure, server+client A-replication-admission-control-v2 Related to introduction of replication AC v2 branch-release-25.1 C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) GA-blocker T-kv KV Team

Comments

@kvoli
Collaborator

kvoli commented Nov 26, 2024

Now that entries can be arbitrarily delayed by RACv2 pacing, a rangefeed planned on a chronically behind store is more likely to exhibit correspondingly chronic rangefeed lag.

This issue tracks mitigating chronic rangefeed lag that results from RACv2 pacing entries at quorum speed.

Jira issue: CRDB-44928

@kvoli kvoli added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-kv KV Team A-kv-rangefeed Rangefeed infrastructure, server+client A-replication-admission-control-v2 Related to introduction of replication AC v2 labels Nov 26, 2024
@kvoli kvoli changed the title kvserver: mitigate chronic lagging changefeeds due to rac2 send queues kvserver: mitigate chronic lagging rangefeeds due to rac2 send queues Nov 26, 2024

blathers-crl bot commented Dec 3, 2024

Hi @kvoli, please add branch-* labels to identify which branch(es) this GA-blocker affects.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@sumeerbhola
Collaborator

related issue #119490

kvoli added a commit to kvoli/cockroach that referenced this issue Dec 5, 2024
kvoli added a commit to kvoli/cockroach that referenced this issue Dec 5, 2024
See comments in this patch. To run the added test, which verifies that the rangefeed moves:

```bash
dev test pkg/kv/kvserver -v --vmodule='replica_raft=1,kvflowcontroller=2,replica_proposal_buf=1,raft_transport=2,kvflowdispatch=1,kvadmission=1,kvflowhandle=1,work_queue=1,replica_flow_control=1,tracker=1,client_raft_helpers_test=1,raft=1,admission=1,replica_flow_control=1,work_queue=1,replica_raft=1,replica_proposal_buf=1,raft_transport=2,kvadmission=1,work_queue=1,replica_flow_control=1,client_raft_helpers_test=1,range_controller=2,token_counter=2,token_tracker=2,processor=2,kvflowhandle=1' -f  TestFlowControlSendQueueRangeFeed --show-logs --rewrite --timeout=100s
```

To verify that the rangefeed doesn't move without this patch, I used the following diff, which hardcodes the threshold gating the main logic added here to be longer than the test timeout:

```diff
diff --git a/pkg/kv/kvserver/replica_rangefeed.go b/pkg/kv/kvserver/replica_rangefeed.go
index ad2bc7967b9..91c4ed2a50a 100644
--- a/pkg/kv/kvserver/replica_rangefeed.go
+++ b/pkg/kv/kvserver/replica_rangefeed.go
@@ -907,7 +907,7 @@ func (r *Replica) handleClosedTimestampUpdateRaftMuLocked(
 	// and be more easily understood.
 	// TODO(kvoli): Check that the mux rangefeed is not cancelling every
 	// rangefeed on a node when the signal fires, returning an error.
-	extremelySlowLagThresh := 20 * slowClosedTSThresh
+	extremelySlowLagThresh := 1 * time.Hour
 	extremelySlowCloseThresh := 10 * slowClosedTSThresh
 	shouldCancelTooFarBehind := false
 	extremelyFarBehindStartTime := time.Time{}
```

The test then times out waiting for the rangefeed to move:

```
goroutine 66 [sleep]:
time.Sleep(0x3b9aca00)
	GOROOT/src/runtime/time.go:195 +0xfc
github.com/cockroachdb/cockroach/pkg/util/retry.ForDuration(0xa7a358200, 0x1400e686d20)
	pkg/util/retry/retry.go:220 +0x78
github.com/cockroachdb/cockroach/pkg/testutils.SucceedsWithinError(0x1400689f960, 0xa7a358200)
	pkg/testutils/soon.go:77 +0x60
github.com/cockroachdb/cockroach/pkg/testutils.SucceedsWithin({0x10d07cbc8, 0x14004da81a0}, 0x1400689f960, 0xa7a358200)
	pkg/testutils/soon.go:56 +0x44
github.com/cockroachdb/cockroach/pkg/testutils.SucceedsSoon({0x10d07cbc8, 0x14004da81a0}, 0x1400689f960)
	pkg/testutils/soon.go:38 +0x4c
github.com/cockroachdb/cockroach/pkg/kv/kvserver_test.TestFlowControlSendQueueRangeFeed(0x14004da81a0)
	pkg/kv/kvserver/flow_control_integration_test.go:5903 +0x10b8
testing.tRunner(0x14004da81a0, 0x10d019568)
	GOROOT/src/testing/testing.go:1689 +0xec
created by testing.(*T).Run in goroutine 1
	GOROOT/src/testing/testing.go:1742 +0x318
```

There's an open question: does the mux rangefeed cancel every rangefeed on a node when it receives an error? That seems broken, and if it is true, I can see why the stuck rangefeed watcher needed to go. But then, how do replica removals work? Does every rangefeed need to be restarted?

Here are some sparse notes I took, in case they are of any use:

Initial replica selection uses OptimizeReplicaOrder, the same as other DistSender requests, called from newTransportForRange:
- https://github.com/cockroachdb/cockroach/blob/b4280f7c6613f362791478696adbdf2cc67dd874/pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go#L613-L613
- locality > latency > attr affinity
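
For illustration only, here is a minimal Go sketch of ordering candidate replicas by that priority (locality match, then latency, then attribute affinity); the struct and helper names are assumptions for this example, not the actual OptimizeReplicaOrder code:

```go
// Illustrative only: a simplified ordering of candidate replicas by
// locality match, then measured latency, then attribute affinity,
// mirroring the priority noted above. Not the real OptimizeReplicaOrder.
package main

import (
	"fmt"
	"sort"
	"time"
)

type replicaCandidate struct {
	nodeID        int
	localityMatch int           // matching locality tiers with the client (higher is better)
	latency       time.Duration // measured RTT to the node (lower is better)
	attrMatches   int           // matching node/store attributes (higher is better)
}

// orderReplicas sorts candidates so the most preferred replica comes first.
func orderReplicas(cands []replicaCandidate) {
	sort.SliceStable(cands, func(i, j int) bool {
		a, b := cands[i], cands[j]
		if a.localityMatch != b.localityMatch {
			return a.localityMatch > b.localityMatch
		}
		if a.latency != b.latency {
			return a.latency < b.latency
		}
		return a.attrMatches > b.attrMatches
	})
}

func main() {
	cands := []replicaCandidate{
		{nodeID: 1, localityMatch: 1, latency: 2 * time.Millisecond, attrMatches: 0},
		{nodeID: 2, localityMatch: 2, latency: 9 * time.Millisecond, attrMatches: 1},
		{nodeID: 3, localityMatch: 2, latency: 4 * time.Millisecond, attrMatches: 0},
	}
	orderReplicas(cands)
	fmt.Println(cands[0].nodeID) // 3: best locality match, then lowest latency
}
```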

The store-local replica rangefeed registration (post-store routing) is Replica.RangeFeed:
- https://github.com/cockroachdb/cockroach/blob/b4280f7c6613f362791478696adbdf2cc67dd874/pkg/kv/kvserver/replica_rangefeed.go#L242-L242

Closed timestamp updater:
- https://github.com/cockroachdb/cockroach/blob/b4280f7c6613f362791478696adbdf2cc67dd874/pkg/kv/kvserver/store.go#L2472-L2472

Informs: cockroachdb#136214
Release note: None
kvoli added a commit to kvoli/cockroach that referenced this issue Dec 5, 2024
@kvoli kvoli self-assigned this Dec 11, 2024
kvoli added a commit to kvoli/cockroach that referenced this issue Dec 12, 2024
kvoli added a commit to kvoli/cockroach that referenced this issue Dec 12, 2024
When a rangefeed's closed timestamp lags behind the current time, any
writes that have occurred in between will not be emitted. This is
problematic in cases where the lag is significant and chronic, as
consumers (changefeeds, logical data replication, physical cluster
replication) are likewise delayed in their processing. Observing a
rangefeed with a chronically lagging closed timestamp will become
relatively more likely with quorum replication flow control, as entries
are deliberately queued, instead of being sent, to stores which do not
have sufficient send tokens.

This commit (re)introduces the concept of cancelling lagging rangefeeds,
so that they may be replanned and retried on another replica. The other
replica may also have this issue; however, there should be at least a
quorum of voting replicas with a similar closed timestamp that would be
suitable.

The replanning on a different replica is already handled by existing
machinery. This commit introduces an observer which generates a signal
indicating that the rangefeed should be cancelled. The signal also
encapsulates the existing logic to nudge a rangefeed.

The criteria for cancelling a rangefeed are controlled by two thresholds,
defined as cluster settings:

```
kv.rangefeed.lagging_closed_timestamp_cancel_multiple
(default = 20 x closed ts target duration = 60s)
```

```
kv.rangefeed.lagging_closed_timestamp_cancel_min_lagging_duration
(default = 60s)
```

When a replica's closed timestamp has sustained lag greater than:

```
kv.rangefeed.lagging_closed_timestamp_cancel_multiple * kv.closed_timestamp.target_duration
```

For at least:

```
kv.rangefeed.lagging_closed_timestamp_cancel_min_lagging_duration
```

duration, the rangefeed will be cancelled and then replanned on the
client. This can be visualized in the following diagram, where an
initial excursion past the lag threshold recovers, so the rangefeed
isn't cancelled. The second excursion past the lag threshold is
sustained for longer than the duration threshold, so the rangefeed is
then cancelled for replanning:

````
lag=0          ─────────────────────────────────────────────────────

observed lag   ─────────┐
                        │
                        │
                        │     ┌───────┐
lag threshold  ─────────┼─────┼───────┼──────────────────────────────
                        │     │       └───┐
                        │     │           └─────┐
                        └─────┘                 └──────┐
                                                       └────────────
                                      ◄────────────────────────────►
                                         exceeds duration threshold
````
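
To make the criteria above concrete, here is a minimal, self-contained Go sketch of the sustained-lag decision; the observer shape, field names, and call pattern are illustrative assumptions, not the actual observer introduced by the commit:

```go
// A minimal sketch (not the CockroachDB implementation) of the
// sustained-lag cancellation decision described above.
package main

import (
	"fmt"
	"time"
)

// laggingRangefeedObserver tracks how long a replica's closed timestamp
// has continuously exceeded the lag threshold.
type laggingRangefeedObserver struct {
	// Hypothetical stand-ins for the cluster settings mentioned above.
	cancelMultiple     float64       // kv.rangefeed.lagging_closed_timestamp_cancel_multiple
	targetDuration     time.Duration // kv.closed_timestamp.target_duration
	minLaggingDuration time.Duration // ..._cancel_min_lagging_duration

	exceededSince time.Time // zero if the lag is currently under the threshold
}

// observe is called periodically with the current closed timestamp lag.
// It returns true once the lag has stayed above the threshold for at
// least minLaggingDuration, signalling that the rangefeed should be
// cancelled and replanned on another replica.
func (o *laggingRangefeedObserver) observe(now time.Time, lag time.Duration) bool {
	threshold := time.Duration(o.cancelMultiple * float64(o.targetDuration))
	if lag <= threshold {
		o.exceededSince = time.Time{} // lag recovered; reset the clock
		return false
	}
	if o.exceededSince.IsZero() {
		o.exceededSince = now // first observation over the threshold
	}
	return now.Sub(o.exceededSince) >= o.minLaggingDuration
}

func main() {
	o := &laggingRangefeedObserver{
		cancelMultiple:     20,
		targetDuration:     3 * time.Second, // 20 * 3s = 60s lag threshold
		minLaggingDuration: 60 * time.Second,
	}
	start := time.Now()
	// A brief excursion over the threshold recovers, so no cancellation...
	fmt.Println(o.observe(start, 90*time.Second))                       // false (clock starts)
	fmt.Println(o.observe(start.Add(30*time.Second), 10*time.Second))   // false (recovered, reset)
	// ...but sustained lag over the threshold eventually triggers it.
	fmt.Println(o.observe(start.Add(60*time.Second), 120*time.Second))  // false (clock restarts)
	fmt.Println(o.observe(start.Add(130*time.Second), 150*time.Second)) // true (70s > 60s)
}
```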

TODO(kvoli):
- metric for cancellation signal triggering server side
- new error value for cancel reason that is specific to lag
- reject planning rangefeed
- clean-up added tests
- roachtest which ensures many rangefeeds are on a specific store, limits
  that store's write bandwidth, and then asserts that the lag resolves via
  the rangefeed moving.

Fixes: cockroachdb#136214
Release note: None
kvoli added a commit to kvoli/cockroach that referenced this issue Dec 13, 2024
kvoli added a commit to kvoli/cockroach that referenced this issue Dec 13, 2024
kvoli added a commit to kvoli/cockroach that referenced this issue Dec 13, 2024
kvoli added a commit to kvoli/cockroach that referenced this issue Dec 13, 2024
kvoli added a commit to kvoli/cockroach that referenced this issue Dec 13, 2024
Note that we could also prevent accepting a rangefeed registration if the
lag were sufficient; however, the behavior change here applies only to lag
which has been observed to be sustained over time. Without historical
data, we cannot apply identical decision logic at registration.

Fixes: cockroachdb#136214
Release note: None
kvoli added a commit to kvoli/cockroach that referenced this issue Dec 13, 2024
kvoli added a commit to kvoli/cockroach that referenced this issue Dec 13, 2024
kvoli added a commit to kvoli/cockroach that referenced this issue Dec 16, 2024
kvoli added a commit to kvoli/cockroach that referenced this issue Dec 16, 2024
kvoli added a commit to kvoli/cockroach that referenced this issue Dec 17, 2024
kvoli added a commit to kvoli/cockroach that referenced this issue Dec 17, 2024
kvoli added a commit to kvoli/cockroach that referenced this issue Dec 18, 2024
Add a new counter metric,
`kv.rangefeed.closed_timestamp.slow_ranges.cancelled`, which is
incremented each time a rangefeed is cancelled server-side due to a
chronically lagging closed timestamp (see cockroachdb#137531).

Part of: cockroachdb#136214
Release note: None
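
As a rough, generic illustration of the counter's shape (using a plain atomic counter rather than CockroachDB's metric registry; the function and variable names here are hypothetical):

```go
// Illustrative only: a process-wide counter bumped each time a rangefeed
// is cancelled for a chronically lagging closed timestamp. The real
// change wires this through CockroachDB's metric registry instead.
package main

import (
	"fmt"
	"sync/atomic"
)

var rangefeedSlowRangesCancelled atomic.Int64

// onRangefeedCancelledForLag would be called from the server-side
// cancellation path described above.
func onRangefeedCancelledForLag() {
	rangefeedSlowRangesCancelled.Add(1)
}

func main() {
	onRangefeedCancelledForLag()
	fmt.Println(rangefeedSlowRangesCancelled.Load()) // 1
}
```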
kvoli added a commit to kvoli/cockroach that referenced this issue Dec 18, 2024
kvoli added a commit to kvoli/cockroach that referenced this issue Dec 18, 2024
kvoli added a commit to kvoli/cockroach that referenced this issue Dec 18, 2024
kvoli added a commit to kvoli/cockroach that referenced this issue Dec 18, 2024
Add a new counter metric,
`kv.rangefeed.closed_timestamp.slow_ranges.cancelled`, which is
incremented each time a rangefeed is cancelled server-side due to a
chronically lagging closed timestamp (see cockroachdb#137531).

Part of: cockroachdb#136214
Release note: None
kvoli added a commit to kvoli/cockroach that referenced this issue Dec 18, 2024
When a rangefeed's closed timestamp lags behind the current time, any
writes that have occurred in-between will not be emitted. This is
problematic in cases where the lag is significant and chronic, as
consumers (changefeeds, logical data replication, physical cluster
replication) are likewise delayed in their processing. Observing a
rangefeed with a chronic lagging closed timestamp will become relatively
more likely with quorum replication flow control, as entries are
deliberately queued, instead of being sent, to stores which do not have
sufficient send tokens.

This commit (re)introduces the concept of cancelling lagging rangefeeds,
so that they may be replanned and retried on another replica. The other
replica may also have this issue, however there should be at least a
quorum of voting replicas with a similar closed timestamp that would be
suitable.

The replanning on a different replica is handled already by existing
machinery. This commit introduces an observer which generates a signal
indicating that the rangefeed should be cancelled. The signal also
encapsulates the existing logic to nudge a rangefeed as well.

The criteria for cancelling a rangefeed is influenced by two thresholds,
defined as cluster settings:

```
kv.rangefeed.lagging_closed_timestamp_cancel_multiple
(default = 20 x closed ts target duration = 60s)
```

```
kv.rangefeed.lagging_closed_timestamp_cancel_min_lagging_duration
(default = 60s)
```

When a replica's closed timestamp has sustained lag greater than:

```
kv.rangefeed.lagging_closed_timestamp_cancel_multiple * kv.closed_timestamp.target_duration
```

For at least:

```
`kv.rangefeed.lagging_closed_timestamp_cancel_min_lagging_duration`
```

duration, the rangefeed will be cancelled and then re-planned on the
client. This can be visualized in the following diagram, where there is
an initial spike over the lag threshold, which is recovered from so the
rangefeed wouldn't be cancelled. The second drop below the lag
threshold is sustained for greater than the duration threshold, so the
rangefeed is then cancelled for replanning:

```
lag=0          ─────────────────────────────────────────────────────

observed lag   ─────────┐
                        │
                        │
                        │     ┌───────┐
lag threshold  ─────────┼─────┼───────┼──────────────────────────────
                        │     │       └───┐
                        │     │           └─────┐
                        └─────┘                 └──────┐
                                                       └────────────
                                      ◄────────────────────────────►
                                         exceeds duration threshold
```

Note we could also prevent accepting a rangefeed registration if the lag
were sufficient, however the behavior change here applies only to lag
which has been observed to be sustained over time. Without historical
data, we cannot apply identical decision logic on registration.

Fixes: cockroachdb#136214
Release note: None
kvoli added a commit to kvoli/cockroach that referenced this issue Dec 18, 2024
Add a new counter metric,
`kv.rangefeed.closed_timestamp.slow_ranges.cancelled`, which is
incremented each time a rangefeed is cancelled server-side due to a
chronically lagging closed timestamp (see cockroachdb#137531).

Part of: cockroachdb#136214
Release note: None
kvoli added a commit to kvoli/cockroach that referenced this issue Dec 18, 2024
Add a new counter metric,
`kv.rangefeed.closed_timestamp.slow_ranges.cancelled`, which is
incremented each time a rangefeed is cancelled server-side due to a
chronically lagging closed timestamp (see cockroachdb#137531).

Part of: cockroachdb#136214
Release note: None
craig bot pushed a commit that referenced this issue Dec 18, 2024
137531: kv: replan rangefeeds with chronic closed ts lag r=sumeerbhola,wenyihu6 a=kvoli

When a rangefeed's closed timestamp lags behind the current time, any writes that have occurred in-between will not be emitted. This is problematic in cases where the lag is significant and chronic, as consumers (changefeeds, logical data replication, physical cluster replication) are likewise delayed in their processing. Observing a rangefeed with a chronic lagging closed timestamp will become relatively more likely with quorum replication flow control, as entries are deliberately queued, instead of being sent, to stores which do not have sufficient send tokens.

This commit (re)introduces the concept of cancelling lagging rangefeeds, so that they may be replanned and retried on another replica. The other replica may also have this issue, however there should be at least a quorum of voting replicas with a similar closed timestamp that would be suitable.

The replanning on a different replica is handled already by existing machinery. This commit introduces an observer which generates a signal indicating that the rangefeed should be cancelled. The signal also encapsulates the existing logic to nudge a rangefeed as well.

The criteria for cancelling a rangefeed are influenced by two thresholds, defined as cluster settings:

```
kv.rangefeed.lagging_closed_timestamp_cancel_multiple
(default = 20 x closed ts target duration = 60s)
```

```
kv.rangefeed.lagging_closed_timestamp_cancel_min_lagging_duration
(default = 60s)
```

When a replica's closed timestamp has sustained lag greater than:

```
kv.rangefeed.lagging_closed_timestamp_cancel_multiple * kv.closed_timestamp.target_duration
```

For at least:

```
kv.rangefeed.lagging_closed_timestamp_cancel_min_lagging_duration
```

duration, the rangefeed will be cancelled and then re-planned on the client. This can be visualized in the following diagram, where there is an initial spike over the lag threshold that is recovered from, so the rangefeed wouldn't be cancelled. The second drop below the lag threshold is sustained for longer than the duration threshold, so the rangefeed is then cancelled for replanning:

```
lag=0          ─────────────────────────────────────────────────────

observed lag   ─────────┐
                        │
                        │
                        │     ┌───────┐
lag threshold  ─────────┼─────┼───────┼──────────────────────────────
                        │     │       └───┐
                        │     │           └─────┐
                        └─────┘                 └──────┐
                                                       └────────────
                                      ◄────────────────────────────►
                                         exceeds duration threshold
```

Note we could also prevent accepting a rangefeed registration if the lag were sufficient; however, the behavior change here applies only to lag which has been observed to be sustained over time. Without historical data, we cannot apply identical decision logic on registration.

---

kvserver: add metric for rangefeed cancellations due to lag

Add a new counter metric,
`kv.rangefeed.closed_timestamp.slow_ranges.cancelled`, which is
incremented each time a rangefeed is cancelled server-side due to a
chronically lagging closed timestamp (see #137531).

Fixes: #136214
Release note: None

Co-authored-by: Austen McClernon <[email protected]>
craig bot closed this as completed in ad6f23c Dec 18, 2024