kv: replan rangefeeds with chronic closed ts lag
When a rangefeed's closed timestamp lags behind the current time, any writes that have occurred in-between will not be emitted. This is problematic in cases where the lag is significant and chronic, as consumers (changefeeds, logical data replication, physical cluster replication) are likewise delayed in their processing. Observing a rangefeed with a chronically lagging closed timestamp will become relatively more likely with quorum replication flow control, as entries are deliberately queued, instead of being sent, to stores which do not have sufficient send tokens.

This commit (re)introduces the concept of cancelling lagging rangefeeds, so that they may be replanned and retried on another replica. The other replica may also have this issue; however, there should be at least a quorum of voting replicas with a similar closed timestamp that would be suitable. The replanning on a different replica is handled by existing machinery.

This commit introduces an observer which generates a signal indicating that the rangefeed should be cancelled. The signal also encapsulates the existing logic to nudge a rangefeed. The criteria for cancelling a rangefeed are influenced by two thresholds, defined as cluster settings:

```
kv.rangefeed.lagging_closed_timestamp_cancel_multiple
(default = 20 x closed ts target duration = 60s)

kv.rangefeed.lagging_closed_timestamp_cancel_min_lagging_duration
(default = 60s)
```

When a replica's closed timestamp has sustained lag greater than:

```
kv.rangefeed.lagging_closed_timestamp_cancel_multiple *
kv.closed_timestamp.target_duration
```

for at least the

```
kv.rangefeed.lagging_closed_timestamp_cancel_min_lagging_duration
```

duration, the rangefeed will be cancelled and then re-planned on the client. This can be visualized in the following diagram, where there is an initial spike over the lag threshold, which is recovered from, so the rangefeed wouldn't be cancelled. The second drop below the lag threshold is sustained for longer than the duration threshold, so the rangefeed is then cancelled for replanning:

```
lag=0          ─────────────────────────────────────────────────────

observed lag   ─────────┐
                        │
                        │
                        │     ┌───────┐
lag threshold  ─────────┼─────┼───────┼──────────────────────────────
                        │     │       └───┐
                        │     │           └─────┐
                        └─────┘                 └──────┐
                                                       └────────────
                              ◄────────────────────────────►
                               exceeds duration threshold
```

Note we could also prevent accepting a rangefeed registration if the lag were sufficient; however, the behavior change here applies only to lag which has been observed to be sustained over time. Without historical data, we cannot apply identical decision logic at registration.

Fixes: cockroachdb#136214

Release note: None
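With the current defaults, the cancel threshold works out to 20 x 3s = 60s of closed timestamp lag, which must then be sustained for at least another 60s. The following is a minimal, self-contained Go sketch of that two-threshold decision, using hypothetical names (`laggingObserver`, `shouldCancel`); the actual kvserver observer appears in the diff below.

```
package main

import (
	"fmt"
	"time"
)

// Hypothetical constants mirroring the default cluster settings described
// above: a 3s closed timestamp target, a 20x cancel multiple (60s of lag),
// and a 60s minimum sustained-lag duration.
const (
	targetDuration    = 3 * time.Second
	cancelMultiple    = 20
	cancelMinDuration = time.Minute
)

// laggingObserver remembers when lag first exceeded the cancel threshold, so
// that only sustained lag (not a transient spike) triggers cancellation.
type laggingObserver struct {
	exceededSince time.Time
}

// shouldCancel reports whether the rangefeed should be cancelled for
// replanning: lag above cancelMultiple*targetDuration, sustained for more
// than cancelMinDuration.
func (o *laggingObserver) shouldCancel(closedTS, now time.Time) bool {
	lag := now.Sub(closedTS)
	if lag <= cancelMultiple*targetDuration {
		o.exceededSince = time.Time{} // lag recovered; reset the clock
		return false
	}
	if o.exceededSince.IsZero() {
		o.exceededSince = now // first observation above the threshold
	}
	return now.Sub(o.exceededSince) > cancelMinDuration
}

func main() {
	o := &laggingObserver{}
	start := time.Now()
	closedTS := start.Add(-2 * time.Minute) // 2m of lag, above the 60s threshold

	fmt.Println(o.shouldCancel(closedTS, start))                     // false: threshold only just exceeded
	fmt.Println(o.shouldCancel(closedTS, start.Add(90*time.Second))) // true: sustained for more than 60s
}
```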
Showing 11 changed files with 714 additions and 10 deletions.
@@ -0,0 +1,142 @@
// Copyright 2024 The Cockroach Authors.
//
// Use of this software is governed by the CockroachDB Software License
// included in the /LICENSE file.

package kvserver

import (
	"context"
	"time"

	"github.com/cockroachdb/cockroach/pkg/kv/kvserver/closedts"
	"github.com/cockroachdb/cockroach/pkg/settings"
	"github.com/cockroachdb/redact"
)

// laggingRangeFeedCTNudgeMultiple is the multiple of the closed timestamp target
// duration that a rangefeed's closed timestamp can lag behind the current time
// before the rangefeed is nudged to catch up.
const laggingRangeFeedCTNudgeMultiple = 5

// RangeFeedLaggingCTCancelMultiple is the multiple of the closed timestamp
// target duration that a rangefeed's closed timestamp can lag behind the
// current time before the rangefeed is cancelled, if the duration threshold is
// also met, see RangeFeedLaggingCTCancelDuration. When set to 0, cancelling is
// disabled.
var RangeFeedLaggingCTCancelMultiple = settings.RegisterIntSetting(
	settings.SystemOnly,
	"kv.rangefeed.lagging_closed_timestamp_cancel_multiple",
	"if a range's closed timestamp is more than this multiple of the "+
		"`kv.closed_timestamp.target_duration` behind the current time, "+
		"for at least `kv.rangefeed.lagging_closed_timestamp_cancel_min_lagging_duration`, "+
		"cancel the rangefeed; when set to 0, cancelling is disabled",
	20, /* 20x closed ts target, currently default 60s */
	// NB: We don't want users setting a value incongruent with the closed
	// timestamp target duration, as that would lead to thrashing of rangefeeds.
	// Also, the nudge multiple is a constant above, so we don't want users
	// setting a lower value than that, as nudging is a prerequisite for
	// cancelling.
	settings.IntInRangeOrZeroDisable(laggingRangeFeedCTNudgeMultiple, 10_000),
)

// RangeFeedLaggingCTCancelDuration is the duration threshold for lagging
// rangefeeds to be cancelled when the closed timestamp is lagging behind the
// current time by more than:
//
//	`kv.rangefeed.lagging_closed_timestamp_cancel_multiple` *
//	`kv.closed_timestamp.target_duration`
//
// e.g., if the closed timestamp target duration is 3s (current default) and
// the multiple is 2, then the lagging rangefeed will be canceled if the closed
// timestamp is more than 6s behind the current time, for at least
// laggingRangeFeedCTCancelDurationThreshold:
//
//	closed_ts                  = -7s (relative to now)
//	target_closed_ts           = -3s
//	multiple                   = 2.0
//	lagging_duration_threshold = 60s
//
// In the above example, the rangefeed will be canceled if this state is
// sustained for at least 60s. Visually (and abstractly) it looks like this:
//
//	lag=0          ─────────────────────────────────────────────────────
//
//	observed lag   ─────────┐
//	                        │
//	                        │
//	                        │     ┌───────┐
//	lag threshold  ─────────┼─────┼───────┼──────────────────────────────
//	                        │     │       └───┐
//	                        │     │           └─────┐
//	                        └─────┘                 └──────┐
//	                                                       └────────────
//	                              ◄────────────────────────────►
//	                               exceeds duration threshold
//
// Where time is moving from left to right, and the y-axis represents the
// closed timestamp lag relative to the current time.
var RangeFeedLaggingCTCancelDuration = settings.RegisterDurationSetting(
	settings.SystemOnly,
	"kv.rangefeed.lagging_closed_timestamp_cancel_min_lagging_duration",
	"if a range's closed timestamp is more than "+
		"`kv.rangefeed.lagging_closed_timestamp_cancel_multiple` of the "+
		"`kv.closed_timestamp.target_duration` behind the current time, "+
		"for at least this duration, cancel the rangefeed",
	time.Minute,
)

type rangeFeedCTLagObserver struct {
	exceedsCancelLagStartTime time.Time
}

func newRangeFeedCTLagObserver() *rangeFeedCTLagObserver {
	return &rangeFeedCTLagObserver{}
}

func (r *rangeFeedCTLagObserver) observeClosedTimestampUpdate(
	ctx context.Context, closedTS, now time.Time, sv *settings.Values,
) rangeFeedCTLagSignal {
	signal := rangeFeedCTLagSignal{targetLag: closedts.TargetDuration.Get(sv)}
	nudgeLagThreshold := signal.targetLag * laggingRangeFeedCTNudgeMultiple
	cancelLagThreshold := signal.targetLag * time.Duration(RangeFeedLaggingCTCancelMultiple.Get(sv))
	cancelLagMinDuration := RangeFeedLaggingCTCancelDuration.Get(sv)

	signal.lag = now.Sub(closedTS)
	if signal.lag <= cancelLagThreshold {
		// The closed timestamp is no longer lagging behind the current time by
		// more than the cancel threshold, so reset the start time, as we only want
		// to signal on sustained lag above the threshold.
		r.exceedsCancelLagStartTime = time.Time{}
	} else if r.exceedsCancelLagStartTime.IsZero() {
		r.exceedsCancelLagStartTime = now
	}
	signal.exceedsNudgeLagThreshold = signal.lag > nudgeLagThreshold
	signal.exceedsCancelLagThreshold = !r.exceedsCancelLagStartTime.IsZero() &&
		now.Sub(r.exceedsCancelLagStartTime) > cancelLagMinDuration
	return signal
}

type rangeFeedCTLagSignal struct {
	lag                       time.Duration
	targetLag                 time.Duration
	exceedsNudgeLagThreshold  bool
	exceedsCancelLagThreshold bool
}

func (rfls rangeFeedCTLagSignal) String() string {
	return redact.StringWithoutMarkers(rfls)
}

var _ redact.SafeFormatter = rangeFeedCTLagSignal{}

// SafeFormat implements the redact.SafeFormatter interface.
func (rfls rangeFeedCTLagSignal) SafeFormat(w redact.SafePrinter, _ rune) {
	w.Printf(
		"behind=%v target=%v nudge=%t cancel=%t",
		rfls.lag,
		rfls.targetLag,
		rfls.exceedsNudgeLagThreshold,
		rfls.exceedsCancelLagThreshold,
	)
}
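For context, here is a sketch (not part of the commit) of how a caller in the same package might act on the returned signal; both booleans can be true at once, and a caller would reasonably let cancellation take precedence over a nudge. The action type below is illustrative only, standing in for the existing nudge and cancel paths.

```
// rangefeedLagAction is a hypothetical summary of what a caller might do
// with a rangeFeedCTLagSignal.
type rangefeedLagAction int

const (
	rangefeedLagActionNone rangefeedLagAction = iota
	rangefeedLagActionNudge
	rangefeedLagActionCancel
)

// actionForSignal prefers cancellation over a nudge: sustained lag above the
// cancel threshold replans the rangefeed on another replica, while lag above
// only the nudge threshold prompts this replica to catch up in place.
func actionForSignal(sig rangeFeedCTLagSignal) rangefeedLagAction {
	switch {
	case sig.exceedsCancelLagThreshold:
		return rangefeedLagActionCancel
	case sig.exceedsNudgeLagThreshold:
		return rangefeedLagActionNudge
	default:
		return rangefeedLagActionNone
	}
}
```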
@@ -0,0 +1,186 @@
// Copyright 2024 The Cockroach Authors.
//
// Use of this software is governed by the CockroachDB Software License
// included in the /LICENSE file.

package kvserver

import (
	"context"
	"testing"
	"time"

	"github.com/cockroachdb/cockroach/pkg/settings/cluster"
	"github.com/cockroachdb/cockroach/pkg/util/leaktest"
	"github.com/cockroachdb/cockroach/pkg/util/log"
	"github.com/stretchr/testify/require"
)

// TestObserveClosedTimestampUpdate asserts that the expected signal is
// generated for each closed timestamp observation over time.
func TestObserveClosedTimestampUpdate(t *testing.T) {
	defer leaktest.AfterTest(t)()
	defer log.Scope(t).Close(t)

	ctx := context.Background()
	st := cluster.MakeTestingClusterSettings()
	sv := &st.SV

	baseTime := time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC)
	targetDuration := 3 * time.Second
	cancelMultiple := int64(2)
	cancelMinDuration := 60 * time.Second

	RangeFeedLaggingCTCancelMultiple.Override(ctx, sv, cancelMultiple)
	RangeFeedLaggingCTCancelDuration.Override(ctx, sv, cancelMinDuration)

	tests := []struct {
		name    string
		updates []struct {
			closedTS time.Time
			now      time.Time
		}
		expected []rangeFeedCTLagSignal
	}{
		{
			name: "no lag",
			updates: []struct {
				closedTS time.Time
				now      time.Time
			}{
				{
					closedTS: baseTime,
					now:      baseTime.Add(targetDuration),
				},
			},
			expected: []rangeFeedCTLagSignal{
				{
					lag:                       targetDuration,
					targetLag:                 targetDuration,
					exceedsNudgeLagThreshold:  false,
					exceedsCancelLagThreshold: false,
				},
			},
		},
		{
			name: "exceeds nudge threshold",
			updates: []struct {
				closedTS time.Time
				now      time.Time
			}{
				{
					closedTS: baseTime,
					now:      baseTime.Add(targetDuration * (laggingRangeFeedCTNudgeMultiple + 1)),
				},
			},
			expected: []rangeFeedCTLagSignal{
				{
					lag:                       targetDuration * (laggingRangeFeedCTNudgeMultiple + 1),
					targetLag:                 targetDuration,
					exceedsNudgeLagThreshold:  true,
					exceedsCancelLagThreshold: false,
				},
			},
		},
		{
			name: "exceeds cancel threshold but not duration",
			updates: []struct {
				closedTS time.Time
				now      time.Time
			}{
				{
					closedTS: baseTime,
					now:      baseTime.Add(targetDuration * (laggingRangeFeedCTNudgeMultiple + 1)),
				},
				{
					closedTS: baseTime,
					now:      baseTime.Add(targetDuration*(laggingRangeFeedCTNudgeMultiple+1) + cancelMinDuration/2),
				},
			},
			expected: []rangeFeedCTLagSignal{
				{
					lag:                       targetDuration * (laggingRangeFeedCTNudgeMultiple + 1),
					targetLag:                 targetDuration,
					exceedsNudgeLagThreshold:  true,
					exceedsCancelLagThreshold: false,
				},
				{
					lag:                       targetDuration*(laggingRangeFeedCTNudgeMultiple+1) + cancelMinDuration/2,
					targetLag:                 targetDuration,
					exceedsNudgeLagThreshold:  true,
					exceedsCancelLagThreshold: false,
				},
			},
		},
		{
			name: "exceeds cancel threshold and duration",
			updates: []struct {
				closedTS time.Time
				now      time.Time
			}{
				{
					closedTS: baseTime,
					now:      baseTime.Add(targetDuration * (laggingRangeFeedCTNudgeMultiple + 1)),
				},
				{
					closedTS: baseTime,
					now:      baseTime.Add(targetDuration*(laggingRangeFeedCTNudgeMultiple+1) + cancelMinDuration + time.Second),
				},
			},
			expected: []rangeFeedCTLagSignal{
				{
					lag:                       targetDuration * (laggingRangeFeedCTNudgeMultiple + 1),
					targetLag:                 targetDuration,
					exceedsNudgeLagThreshold:  true,
					exceedsCancelLagThreshold: false,
				},
				{
					lag:                       targetDuration*(laggingRangeFeedCTNudgeMultiple+1) + cancelMinDuration + time.Second,
					targetLag:                 targetDuration,
					exceedsNudgeLagThreshold:  true,
					exceedsCancelLagThreshold: true,
				},
			},
		},
		{
			name: "recovers from lag",
			updates: []struct {
				closedTS time.Time
				now      time.Time
			}{
				{
					closedTS: baseTime,
					now:      baseTime.Add(targetDuration * (laggingRangeFeedCTNudgeMultiple + 1)),
				},
				{
					closedTS: baseTime.Add(targetDuration * (laggingRangeFeedCTNudgeMultiple + 1)),
					now:      baseTime.Add(targetDuration*(laggingRangeFeedCTNudgeMultiple+1) + targetDuration),
				},
			},
			expected: []rangeFeedCTLagSignal{
				{
					lag:                       targetDuration * (laggingRangeFeedCTNudgeMultiple + 1),
					targetLag:                 targetDuration,
					exceedsNudgeLagThreshold:  true,
					exceedsCancelLagThreshold: false,
				},
				{
					lag:                       targetDuration,
					targetLag:                 targetDuration,
					exceedsNudgeLagThreshold:  false,
					exceedsCancelLagThreshold: false,
				},
			},
		},
	}

	for _, tc := range tests {
		t.Run(tc.name, func(t *testing.T) {
			observer := newRangeFeedCTLagObserver()
			for i, update := range tc.updates {
				signal := observer.observeClosedTimestampUpdate(ctx, update.closedTS, update.now, sv)
				require.Equal(t, tc.expected[i], signal, "update %d produced unexpected signal", i)
			}
		})
	}
}
pkg/kv/kvserver/testdata/flow_control_integration_v2/send_queue_range_feed
102 changes: 102 additions & 0 deletions
@@ -0,0 +1,102 @@
echo
----
----
(Rangefeed on n3)


-- We will exhaust the tokens across all streams while admission is blocked on
-- n3, using 4x1 MiB (deduction, the write itself is small) writes. Then,
-- we will write 1 MiB to the range and wait for the closedTS to fall
-- behind on n3. We expect that the closedTS falling behind will trigger
-- an error that is returned to the mux rangefeed client, which will in turn
-- allow the rangefeed request to be re-routed to another replica.


(Sending 1 MiB put request to develop a send queue)


(Sent 1 MiB put request)


-- Send queue metrics from n1, n3's send queue should have 1 MiB for s3.
SELECT name, crdb_internal.humanize_bytes(value::INT8)
  FROM crdb_internal.node_metrics
 WHERE name LIKE '%kvflowcontrol%send_queue%'
   AND name != 'kvflowcontrol.send_queue.count'
ORDER BY name ASC;

kvflowcontrol.send_queue.bytes | 1.0 MiB
kvflowcontrol.send_queue.prevent.count | 0 B
kvflowcontrol.send_queue.scheduled.deducted_bytes | 0 B
kvflowcontrol.send_queue.scheduled.force_flush | 0 B
kvflowcontrol.tokens.send.elastic.deducted.force_flush_send_queue | 0 B
kvflowcontrol.tokens.send.elastic.deducted.prevent_send_queue | 0 B
kvflowcontrol.tokens.send.regular.deducted.prevent_send_queue | 0 B


-- Observe the total tracked tokens per-stream on n1, s3's entries will still
-- be tracked here.
SELECT range_id, store_id, crdb_internal.humanize_bytes(total_tracked_tokens::INT8)
  FROM crdb_internal.kv_flow_control_handles_v2

range_id | store_id | total_tracked_tokens
-----------+----------+-----------------------
75 | 1 | 0 B
75 | 2 | 0 B
75 | 3 | 4.0 MiB


-- Per-store tokens available from n1, these should reflect the lack of tokens
-- for s3.
SELECT store_id,
       crdb_internal.humanize_bytes(available_eval_regular_tokens),
       crdb_internal.humanize_bytes(available_eval_elastic_tokens),
       crdb_internal.humanize_bytes(available_send_regular_tokens),
       crdb_internal.humanize_bytes(available_send_elastic_tokens)
  FROM crdb_internal.kv_flow_controller_v2
 ORDER BY store_id ASC;

store_id | eval_regular_available | eval_elastic_available | send_regular_available | send_elastic_available
-----------+------------------------+------------------------+------------------------+-------------------------
1 | 4.0 MiB | 2.0 MiB | 4.0 MiB | 2.0 MiB
2 | 4.0 MiB | 2.0 MiB | 4.0 MiB | 2.0 MiB
3 | 0 B | -3.0 MiB | 0 B | -2.0 MiB


(Rangefeed moved to n1)


-- (Allowing below-raft admission to proceed on n3.)


-- Send queue and flow token metrics from n1. All tokens should be returned.
SELECT name, crdb_internal.humanize_bytes(value::INT8)
  FROM crdb_internal.node_metrics
 WHERE name LIKE '%kvflowcontrol%send_queue%'
   AND name != 'kvflowcontrol.send_queue.count'
ORDER BY name ASC;

kvflowcontrol.send_queue.bytes | 0 B
kvflowcontrol.send_queue.prevent.count | 0 B
kvflowcontrol.send_queue.scheduled.deducted_bytes | 0 B
kvflowcontrol.send_queue.scheduled.force_flush | 0 B
kvflowcontrol.tokens.send.elastic.deducted.force_flush_send_queue | 0 B
kvflowcontrol.tokens.send.elastic.deducted.prevent_send_queue | 0 B
kvflowcontrol.tokens.send.regular.deducted.prevent_send_queue | 0 B
SELECT store_id,
       crdb_internal.humanize_bytes(available_eval_regular_tokens),
       crdb_internal.humanize_bytes(available_eval_elastic_tokens),
       crdb_internal.humanize_bytes(available_send_regular_tokens),
       crdb_internal.humanize_bytes(available_send_elastic_tokens)
  FROM crdb_internal.kv_flow_controller_v2
 ORDER BY store_id ASC;

store_id | eval_regular_available | eval_elastic_available | send_regular_available | send_elastic_available
-----------+------------------------+------------------------+------------------------+-------------------------
1 | 4.0 MiB | 2.0 MiB | 4.0 MiB | 2.0 MiB
2 | 4.0 MiB | 2.0 MiB | 4.0 MiB | 2.0 MiB
3 | 4.0 MiB | 2.0 MiB | 4.0 MiB | 2.0 MiB
----
----

# vim:ft=sql