kv: rebalance snapshots can starve recovery snapshots with asymmetric settings #81832
Labels
A-kv
Anything in KV that doesn't belong in a more specific category.
C-bug
Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
O-postmortem
Originated from a Postmortem action item.
T-kv
KV Team
In a customer issue, we saw that rebalance snapshots could starve recovery snapshots. At a high level, this is because a node will only receive one snapshot at a time and both kinds of snapshot share the same receiver-side semaphore. This alone is an issue because it means the receiver does not recognize the difference in importance between recovery and rebalance snapshots.
The issue is more severe when the kv.snapshot_recovery.max_rate and kv.snapshot_rebalance.max_rate settings are given different values. This is because these values inform the timeouts assigned to snapshots. If the recovery rate is high and the rebalance rate is low, recovery snapshots can have a lower timeout than the expected duration of a single rebalance snapshot. This means that any steady rebalance load can starve recovery snapshots. One potential mitigation for this last issue is to set the timeout for a snapshot based on min(kv.snapshot_recovery.max_rate, kv.snapshot_rebalance.max_rate).
Jira issue: CRDB-16088
Epic CRDB-16160