kvserver: limit Raft ready memory usage across replicas #102840
Labels
- A-kv-replication — Relating to Raft, consensus, and coordination.
- C-enhancement — Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception).
- O-support — Would prevent or help troubleshoot a customer escalation: bugs, missing observability/tooling, docs.
- P-2 — Issues/test failures with a fix SLA of 3 months.
- T-kv — KV Team.
Comments
erikgrinaker added the C-enhancement, A-kv-replication, and T-kv-replication labels on May 6, 2023.

cc @cockroachdb/replication
erikgrinaker changed the title from "kvserver: limit Raft ready memory usage" to "kvserver: limit Raft ready memory usage across replicas" on May 6, 2023.
The exalate-issue-sync bot removed the O-postmortem (Originated from a Postmortem action item) label on May 16, 2023.
shralex added the O-support label on Jul 17, 2023.
The craig bot pushed a commit referencing this issue on Mar 22, 2024:
105286: kvserver: introduce raft memory tracking r=nvanbenschoten a=pav-kv

This PR introduces byte tracking/limiting for raft entries. At the moment, it only tracks entries returned by raft `Ready()`. This only includes committed entries pulled in order to be applied to the state machine. The lifetime of these entries is limited to the duration of `handleRaftReadyRaftMuLocked`, so the acquired byte quota is released at the end of this call.

Currently, `Ready()` calls `Storage.Entries()` only to fetch committed entries to be applied. In the future, it will also fetch entries when sending `MsgApp` messages to followers, and the tracking will be extended correspondingly.

The limit is controlled by the `COCKROACH_RAFT_ENTRIES_MEMORY_LIMIT` env variable, which is currently disabled by default.

### Testing

The effect has been demonstrated on test #120886, which creates a large backlog of unapplied committed entries across multiple ranges. When the server is started, it eagerly tries to apply all these entries. See the screenshots of a few metrics during the unconstrained run, and a run with memory limiting. The memory footprint dropped from 10 GB to 6 GB on this test, without negatively impacting throughput (in fact, there was a throughput improvement; this may be accidental, but it was observed multiple times).
[Screenshots: three metric graphs labeled "Before" and three labeled "After (with COCKROACH_RAFT_ENTRIES_MEMORY_LIMIT=24M)".]

---

Part of #102840

Release note (performance improvement): This change introduces pacing for pulling raft entries from storage when applying them to the state machine. This helps avoid OOMs or excessive resource usage when a node catches up on many committed entries, which ultimately affects availability and performance (e.g., tail latencies). The new metrics `raft.loaded_entries.bytes` and `raft.loaded_entries.reserved.bytes` gauge the total size of log entries pulled from raft for application to the state machine, and can be used to figure out a reasonable soft limit. The soft limit can be set via the `COCKROACH_RAFT_ENTRIES_MEMORY_LIMIT` env var, which is currently disabled by default.

Co-authored-by: Pavel Kalinnikov <[email protected]>
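The acquire-on-pull, release-at-end-of-Ready pattern described in the PR can be sketched roughly as follows. This is an illustrative toy, not the actual CockroachDB API: the `memoryTracker` type and its methods are hypothetical, and only model the "soft limit" behavior (a reservation may overshoot the limit by one batch, after which further pulls are denied until quota is released):

```go
package main

import (
	"fmt"
	"sync"
)

// memoryTracker is an illustrative node-wide soft limit on the bytes of
// raft entries pulled into memory across all replicas.
type memoryTracker struct {
	mu    sync.Mutex
	limit int64 // soft limit in bytes; 0 means disabled
	used  int64 // bytes currently reserved
}

// tryAcquire reserves size bytes. It returns false when the tracker is
// already at or above the soft limit, signalling the caller to defer
// pulling more entries until quota is released.
func (t *memoryTracker) tryAcquire(size int64) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.limit > 0 && t.used >= t.limit {
		return false
	}
	t.used += size // soft limit: a single reservation may overshoot
	return true
}

// release returns previously reserved bytes, e.g. at the end of the
// Ready handling once the entries have been applied.
func (t *memoryTracker) release(size int64) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.used -= size
}

func main() {
	tr := &memoryTracker{limit: 24 << 20} // 24 MiB, as in the test run above
	fmt.Println(tr.tryAcquire(20 << 20))  // first batch fits under the limit
	fmt.Println(tr.tryAcquire(10 << 20))  // overshoots, but allowed (soft limit)
	fmt.Println(tr.tryAcquire(1))         // now at/above the limit: denied
	tr.release(20 << 20)                  // quota released after applying
	fmt.Println(tr.tryAcquire(1))         // back under the limit: allowed
}
```

The soft-limit design means a replica never deadlocks waiting for quota it can never get: one oversized batch is always admitted, and back-pressure only kicks in for subsequent pulls.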
When applying committed Raft entries, we pull 64 MB of commands into memory per Raft ready cycle (see `cockroach/pkg/base/config.go`, lines 261 to 262 at commit 8136e8e).
However, this limit is per range, so with e.g. a `COCKROACH_SCHEDULER_CONCURRENCY` of 96 (>= 12 CPU cores) we can pull up to 6 GB into memory at once. And since the scheduler concurrency is computed per store (#102838), a 10-store node can pull 60 GB into memory at once.

Additionally, this does not take into account per-entry overhead (e.g. `replicatedCmdBufNode` and `raftpb.Entry`), which can be significant in the case of small commands and has been seen to amplify memory usage by a factor of 12.

We need a global memory budget for command application across all replicas on a node, which also takes into account struct overhead.
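The worst-case arithmetic above can be checked with a small calculation. The 64 MB per-cycle budget, the concurrency of 96, the 10-store node, and the 12x overhead factor are all taken from the issue text; everything else here is just arithmetic:

```go
package main

import "fmt"

func main() {
	const perRangeBudgetMB = 64 // per-Ready pull budget from pkg/base/config.go
	schedulerConcurrency := 96  // COCKROACH_SCHEDULER_CONCURRENCY for >= 12 cores
	stores := 10                // scheduler concurrency is computed per store

	perStoreGB := perRangeBudgetMB * schedulerConcurrency / 1024
	perNodeGB := perStoreGB * stores
	fmt.Println(perStoreGB, perNodeGB) // 6 60

	// Per-entry struct overhead has been observed to amplify memory
	// usage by up to 12x for small commands, so the effective footprint
	// can be far larger than the raw entry bytes alone.
	fmt.Println(perStoreGB * 12) // 72
}
```

This is why the issue asks for a node-global budget: any per-range cap multiplies with scheduler concurrency and store count, so no per-range value can bound node-wide memory.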
Jira issue: CRDB-27680
Epic CRDB-39898