kvserver: compute MVCCStats in AdminSplit instead of splitTrigger #119499

miraradeva · 2024-02-22T02:48:37Z

In order to mitigate the latency impact of computing MVCC stats during a split (#22348), we can move some of the stats computation from the splitTrigger (which holds latches while computing the stats) to AdminSplit (where we just need a pebble snapshot). The resulting stats will be inaccurate wrt writes concurrent with the split but experiments have shown that this discrepancy is fairly small. More details on this approach and alternatives in Split Latency due to MVCC Stats Calculation.

The resulting stats will be marked as containing estimates, which will trigger stats re-computation next time the range is processed by the consistency checker queue. In the mean time, the stats are close to accurate, and the estimates do not drift (see below).

For this change, as part of AdminSplit compute the following:

If the range doesn't contain estimates, before creating the split txn, iterate over the new LHS and compute its MVCC stats as preSplitLeftStats.
Otherwise, iterate over the entire range and compute the stats of the LHS as preSplitLeftStats and the stats of the entire range as preSplitFullStats. Since the range contains estimates preSplitFullStats will differ from the stored stats for the range. We can either correct them now, in AdminSplit, or pass in a delta to splitTrigger, as antiDriftDelta. This delta accounts for estimates introduced during a previous split, or AddSST, or any any other operation that introduces estimates. Correcting this delta ensures stats estimates do not compound over time.
Pass in preSplitLeftStats and potentially antiDriftDelta to splitTrigger.

In the split trigger:

Use preSplitLeftStats as the post-split LHS stats.
Apply antiDriftDelta to the stored full-range stats (if we didn't already do it in AdminSplit).
Derive the RHS stats by subtracting preSplitLeftStats from the full-range stats.
Mark the stats as containing estimates because these pre-split stats may not include any write concurrent with the split.
If any of the new LHS or RHS ranges are empty, don't mark them as containing estimates to avoid extra stats re-computation later.

Jira issue: CRDB-36253

Epic CRDB-34215

The text was updated successfully, but these errors were encountered:

…-user spans ComputeStatsForRange iterates over an entire range and computes its MVCC stats. This patch adds ComputeStatsForRangeUserOnly and ComputeStatsForRangeExcludingUser, which compute the MVCC stats corresponding to the user-only and non-user spans in the range, respectively. These functions are useful for computing the estimated stats proposed in cockroachdb#119499 by allowing to quickly iterate only over the non-user key spans in the range, and separately iterate over the user-only key spans. Part of: cockroachdb#119499 Release note: None

Previously, we computed MVCC stats for the LHS and RHS of a split in splitTrigger, while holding latches. This resulted in high tail latencies due to foreground work not being able to proceed as long as the split holds these latches. More details in cockroachdb#119499. This change moves some of the stats computation to AdminSplit, where no latches are held. In particular, in AdminSplit, we pre-compute the part of the LHS stats corresponding to user-data spans, and pass that to the splitTrigger. In the splitTrigger, we iterate only over the non-user spans in the range (which are small), and combine the resulting stats with the passed in user stats to arrive at the new LHS stats. The RHS stats are derived by substituting the LHS stats from the total on-disk stats. This computation does not produce 100% correct stats because any writes concurrent with the split may not be attributed correctly to the LHS or RHS. Therefore, both LHS and RHS stats are marked with ContainsEstimates. To prevent the stats from drifting after successive splits, we kick off a stats re-computation in AdminSplit (only if the on-disk stats contain estimates). Thus, each split may introduce some stats estimates, but before doing so it will correct any previous estimates. There are two invariants that the stats computation guarantees: 1. Non-user stats (SysBytes, SysCount, etc.) are always correct. This is because we scan those spans while holding latches. 2. If there are no writes concurrent with the split, the stats are always correct. Fixes: cockroachdb#119499 Release note (performance improvement): Splits no longer hold latches for time proportional to the range size while computing MVCC stats. Instead, MVCC stats are pre-computed before the critical section of the split. As a side effect, the resulting stats are no longer 100% accurate because they may correctly distribute writes concurrent with the split. To mitigate this, and to prevent the stats from drifting after successive splits, the existing stored stats are re-computed and corrected (if needed) as part of the non-critical section of the split.

…-user spans ComputeStatsForRange iterates over an entire range and computes its MVCC stats. This patch adds ComputeStatsForRangeUserOnly and ComputeStatsForRangeExcludingUser, which compute the MVCC stats corresponding to the user-only and non-user spans in the range, respectively. These functions are useful for computing the estimated stats proposed in cockroachdb#119499 by allowing to quickly iterate only over the non-user key spans in the range, and separately iterate over the user-only key spans. Part of: cockroachdb#119499 Release note: None

Previously, we computed MVCC stats for the LHS and RHS of a split in splitTrigger, while holding latches. This resulted in high tail latencies due to foreground work not being able to proceed as long as the split holds these latches. More details in cockroachdb#119499. This change moves some of the stats computation to AdminSplit, where no latches are held. In particular, in AdminSplit, we pre-compute the part of the LHS stats corresponding to user-data spans, and pass that to the splitTrigger. In the splitTrigger, we iterate only over the non-user spans in the range (which are small), and combine the resulting stats with the passed in user stats to arrive at the new LHS stats. The RHS stats are derived by substituting the LHS stats from the total on-disk stats. This computation does not produce 100% correct stats because any writes concurrent with the split may not be attributed correctly to the LHS or RHS. Therefore, both LHS and RHS stats are marked with ContainsEstimates. To prevent the stats from drifting after successive splits, we kick off a stats re-computation in AdminSplit (only if the on-disk stats contain estimates). Thus, each split may introduce some stats estimates, but before doing so it will correct any previous estimates. There are two invariants that the stats computation guarantees: 1. Non-user stats (SysBytes, SysCount, etc.) are always correct. This is because we scan those spans while holding latches. 2. If there are no writes concurrent with the split, the stats are always correct. Fixes: cockroachdb#119499 Release note (performance improvement): Splits no longer hold latches for time proportional to the range size while computing MVCC stats. Instead, MVCC stats are pre-computed before the critical section of the split. As a side effect, the resulting stats are no longer 100% accurate because they may correctly distribute writes concurrent with the split. To mitigate this, and to prevent the stats from drifting after successive splits, the existing stored stats are re-computed and corrected (if needed) as part of the non-critical section of the split.

…ccurate-stats split This patch is a followup to cockroachdb#119499, and allows to add thresholds on the number of keys and bytes of the difference between the pre-split MVCC stats (retrieved in AdminSplit) and the stats when the split holds latches (retrieved in splitTrigger). This difference corresponds to writes concurrent with the split. If the difference is too large, the split falls back to computing LHS and RHS stats accurately. The difference is computed only for stats corresponding to user-data; system stats are always kept accurate. The thresholds are tunable by cluster settings. Fixes: cockroachdb#119503 Release note: None

…-user spans ComputeStatsForRange iterates over an entire range and computes its MVCC stats. This patch adds ComputeStatsForRangeUserOnly and ComputeStatsForRangeExcludingUser, which compute the MVCC stats corresponding to the user-only and non-user spans in the range, respectively. These functions are useful for computing the estimated stats proposed in cockroachdb#119499 by allowing to quickly iterate only over the non-user key spans in the range, and separately iterate over the user-only key spans. Part of: cockroachdb#119499 Release note: None

Previously, we computed MVCC stats for the LHS and RHS of a split in splitTrigger, while holding latches. This resulted in high tail latencies due to foreground work not being able to proceed as long as the split holds these latches. More details in cockroachdb#119499. This change moves some of the stats computation to AdminSplit, where no latches are held. In particular, in AdminSplit, we pre-compute the part of the LHS stats corresponding to user-data spans, and pass that to the splitTrigger. In the splitTrigger, we iterate only over the non-user spans in the range (which are small), and combine the resulting stats with the passed in user stats to arrive at the new LHS stats. The RHS stats are derived by substituting the LHS stats from the total on-disk stats. This computation does not produce 100% correct stats because any writes concurrent with the split may not be attributed correctly to the LHS or RHS. Therefore, both LHS and RHS stats are marked with ContainsEstimates. To prevent the stats from drifting after successive splits, we kick off a stats re-computation in AdminSplit (only if the on-disk stats contain estimates). Thus, each split may introduce some stats estimates, but before doing so it will correct any previous estimates. There are two invariants that the stats computation guarantees: 1. Non-user stats (SysBytes, SysCount, etc.) are always correct. This is because we scan those spans while holding latches. 2. If there are no writes concurrent with the split, the stats are always correct. Fixes: cockroachdb#119499 Release note (performance improvement): Splits no longer hold latches for time proportional to the range size while computing MVCC stats. Instead, MVCC stats are pre-computed before the critical section of the split. As a side effect, the resulting stats are no longer 100% accurate because they may correctly distribute writes concurrent with the split. To mitigate this, and to prevent the stats from drifting after successive splits, the existing stored stats are re-computed and corrected (if needed) as part of the non-critical section of the split.

…ccurate-stats split This patch is a followup to cockroachdb#119499, and allows to add thresholds on the number of keys and bytes of the difference between the pre-split MVCC stats (retrieved in AdminSplit) and the stats when the split holds latches (retrieved in splitTrigger). This difference corresponds to writes concurrent with the split. If the difference is too large, the split falls back to computing LHS and RHS stats accurately. The difference is computed only for stats corresponding to user-data; system stats are always kept accurate. These thresholds are tunable by two new cluster settings, MaxMVCCStatCountDiff and MaxMVCCStatBytesDiff, which denote the maximum number of entities (e.g. keys, intents) and maximum number of bytes (e.g. value bytes, range value bytes), respectively, that are acceptable as the difference between the pre- and post-split values of an individual stat. The former defaults to 1000, and the latter to 512KB (0.1% of the maximum range size). Fixes: cockroachdb#119503 Release note: None

…-user spans ComputeStatsForRange iterates over an entire range and computes its MVCC stats. This patch adds ComputeStatsForRangeUserOnly and ComputeStatsForRangeExcludingUser, which compute the MVCC stats corresponding to the user-only and non-user spans in the range, respectively. These functions are useful for computing the estimated stats proposed in cockroachdb#119499 by allowing to quickly iterate only over the non-user key spans in the range, and separately iterate over the user-only key spans. Part of: cockroachdb#119499 Release note: None

Previously, we computed MVCC stats for the LHS and RHS of a split in splitTrigger, while holding latches. This resulted in high tail latencies due to foreground work not being able to proceed as long as the split holds these latches. More details in cockroachdb#119499. This change moves some of the stats computation to AdminSplit, where no latches are held. In particular, in AdminSplit, we pre-compute the part of the LHS stats corresponding to user-data spans, and pass that to the splitTrigger. In the splitTrigger, we iterate only over the non-user spans in the range (which are small), and combine the resulting stats with the passed in user stats to arrive at the new LHS stats. The RHS stats are derived by substituting the LHS stats from the total on-disk stats. This computation does not produce 100% correct stats because any writes concurrent with the split may not be attributed correctly to the LHS or RHS. Therefore, both LHS and RHS stats are marked with ContainsEstimates. To prevent the stats from drifting after successive splits, we kick off a stats re-computation in AdminSplit (only if the on-disk stats contain estimates). Thus, each split may introduce some stats estimates, but before doing so it will correct any previous estimates. There are two invariants that the stats computation guarantees: 1. Non-user stats (SysBytes, SysCount, etc.) are always correct. This is because we scan those spans while holding latches. 2. If there are no writes concurrent with the split, the stats are always correct. Fixes: cockroachdb#119499 Release note (performance improvement): Splits no longer hold latches for time proportional to the range size while computing MVCC stats. Instead, MVCC stats are pre-computed before the critical section of the split. As a side effect, the resulting stats are no longer 100% accurate because they may correctly distribute writes concurrent with the split. To mitigate this, and to prevent the stats from drifting after successive splits, the existing stored stats are re-computed and corrected (if needed) as part of the non-critical section of the split.

…ccurate-stats split This patch is a followup to cockroachdb#119499, and allows to add thresholds on the number of keys and bytes of the difference between the pre-split MVCC stats (retrieved in AdminSplit) and the stats when the split holds latches (retrieved in splitTrigger). This difference corresponds to writes concurrent with the split. If the difference is too large, the split falls back to computing LHS and RHS stats accurately. The difference is computed only for stats corresponding to user-data; system stats are always kept accurate. These thresholds are tunable by two new cluster settings, MaxMVCCStatCountDiff and MaxMVCCStatBytesDiff, which denote the maximum number of entities (e.g. keys, intents) and maximum number of bytes (e.g. value bytes, range value bytes), respectively, that are acceptable as the difference between the pre- and post-split values of an individual stat. The former defaults to 5000, and the latter to 5.12MB (1% of the maximum range size). Fixes: cockroachdb#119503 Release note: None

…-user spans ComputeStatsForRange iterates over an entire range and computes its MVCC stats. This patch adds ComputeStatsForRangeUserOnly and ComputeStatsForRangeExcludingUser, which compute the MVCC stats corresponding to the user-only and non-user spans in the range, respectively. These functions are useful for computing the estimated stats proposed in cockroachdb#119499 by allowing to quickly iterate only over the non-user key spans in the range, and separately iterate over the user-only key spans. Part of: cockroachdb#119499 Release note: None

…ccurate-stats split This patch is a followup to cockroachdb#119499, and allows to add thresholds on the number of keys and bytes of the difference between the pre-split MVCC stats (retrieved in AdminSplit) and the stats when the split holds latches (retrieved in splitTrigger). This difference corresponds to writes concurrent with the split. If the difference is too large, the split falls back to computing LHS and RHS stats accurately. The difference is computed only for stats corresponding to user-data; system stats are always kept accurate. These thresholds are tunable by two new cluster settings, MaxMVCCStatCountDiff and MaxMVCCStatBytesDiff, which denote the maximum number of entities (e.g. keys, intents) and maximum number of bytes (e.g. value bytes, range value bytes), respectively, that are acceptable as the difference between the pre- and post-split values of an individual stat. The former defaults to 5000, and the latter to 5.12MB (1% of the maximum range size). Fixes: cockroachdb#119503 Release note: None

miraradeva added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-kv Anything in KV that doesn't belong in a more specific category. T-kv KV Team labels Feb 22, 2024

miraradeva self-assigned this Feb 22, 2024

craig bot closed this as completed in 5ce9052 Mar 21, 2024

kvoli mentioned this issue Apr 12, 2024

kvserver: cpu spike during import due to split stats computation #122309

Closed

github-project-automation bot added this to KV Aug 28, 2024

github-project-automation bot moved this to Closed in KV Aug 28, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

kvserver: compute MVCCStats in AdminSplit instead of splitTrigger #119499

kvserver: compute MVCCStats in AdminSplit instead of splitTrigger #119499

miraradeva commented Feb 22, 2024 •

edited by exalate-issue-sync bot

Loading

kvserver: compute MVCCStats in AdminSplit instead of splitTrigger #119499

kvserver: compute MVCCStats in AdminSplit instead of splitTrigger #119499

Comments

miraradeva commented Feb 22, 2024 • edited by exalate-issue-sync bot Loading

miraradeva commented Feb 22, 2024 •

edited by exalate-issue-sync bot

Loading