kvserver: limit Raft ready memory usage across replicas #102840

Open
erikgrinaker opened this issue May 6, 2023 · 1 comment
Labels
A-kv-replication Relating to Raft, consensus, and coordination. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) O-support Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs P-2 Issues/test failures with a fix SLA of 3 months T-kv KV Team

Comments


erikgrinaker commented May 6, 2023

When applying committed Raft entries, we pull up to 64 MB of commands into memory per Raft ready cycle.

```go
defaultRaftMaxCommittedSizePerReady = envutil.EnvOrDefaultInt(
	"COCKROACH_RAFT_MAX_COMMITTED_SIZE_PER_READY", 64<<20 /* 64 MB */)
```

However, this is per range, so with e.g. COCKROACH_SCHEDULER_CONCURRENCY of 96 (>=12 CPU cores) we can pull up to 6 GB into memory at once. And since the scheduler concurrency is computed per store (#102838), a 10-store node can pull 60 GB into memory at once.

Additionally, this does not take into account per-entry overhead (e.g. replicatedCmdBufNode and raftpb.Entry), which can be significant in the case of small commands and has been seen to amplify memory usage by a factor of 12.

We need a global memory budget for command application across all replicas on a node, which also takes into account struct overhead.
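
As a rough illustration only (a minimal sketch; `appBudget`, `perEntryOverhead`, and all numbers here are hypothetical, not CockroachDB APIs), such a budget could be a node-wide byte pool that charges both payload bytes and an estimate of per-entry struct overhead:

```go
// A minimal sketch (not CockroachDB code) of a node-wide byte budget
// shared by all replicas' command application. It charges both payload
// bytes and an assumed fixed per-entry struct overhead.
package main

import (
	"fmt"
	"sync"
)

// perEntryOverhead is an assumed fixed cost per entry (list nodes,
// raftpb.Entry headers, etc.); a real implementation would measure it.
const perEntryOverhead = 200 // bytes

type appBudget struct {
	mu    sync.Mutex
	limit int64 // soft limit across all replicas on the node
	used  int64
}

// tryAcquire reserves memory for n entries totalling payloadBytes. It is
// a soft limit: an acquisition against an empty budget always succeeds,
// so a single oversized batch cannot stall forever.
func (b *appBudget) tryAcquire(n int, payloadBytes int64) (int64, bool) {
	cost := payloadBytes + int64(n)*perEntryOverhead
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.used > 0 && b.used+cost > b.limit {
		return 0, false // caller backs off and retries later
	}
	b.used += cost
	return cost, true
}

func (b *appBudget) release(cost int64) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.used -= cost
}

func main() {
	budget := &appBudget{limit: 256 << 20 /* 256 MB */}
	if cost, ok := budget.tryAcquire(1000, 32<<20); ok {
		fmt.Printf("applying batch, reserved %d bytes\n", cost)
		budget.release(cost)
	}
}
```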

Jira issue: CRDB-27680

Epic: CRDB-39898

@erikgrinaker erikgrinaker added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-kv-replication Relating to Raft, consensus, and coordination. T-kv-replication labels May 6, 2023

blathers-crl bot commented May 6, 2023

cc @cockroachdb/replication

@erikgrinaker erikgrinaker added the O-postmortem Originated from a Postmortem action item. label May 6, 2023
@erikgrinaker erikgrinaker changed the title kvserver: limit Raft ready memory usage kvserver: limit Raft ready memory usage across replicas May 6, 2023
@exalate-issue-sync exalate-issue-sync bot removed the O-postmortem Originated from a Postmortem action item. label May 16, 2023
@shralex shralex added the O-support Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs label Jul 17, 2023
@nvanbenschoten nvanbenschoten added the P-2 Issues/test failures with a fix SLA of 3 months label Jan 17, 2024
craig bot pushed a commit that referenced this issue Mar 22, 2024
105286: kvserver: introduce raft memory tracking r=nvanbenschoten a=pav-kv

This PR introduces byte tracking/limiting for raft entries. At the moment, it only tracks entries returned by raft `Ready()`, which only includes committed entries pulled in order to be applied to the state machine. The lifetime of these entries is limited to the duration of `handleRaftReadyRaftMuLocked`, so the acquired byte quota is released at the end of this call.

Currently, `Ready()` calls `Storage.Entries()` only to fetch committed entries to be applied. In the future, it will also fetch entries when sending `MsgApp` messages to followers. The tracking will be extended correspondingly.

The limit is controlled by the `COCKROACH_RAFT_ENTRIES_MEMORY_LIMIT` environment variable, and is currently disabled by default.
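
As a sketch of the acquire/release lifetime described above (names like `tracker`, `handleReady`, and `entry` are illustrative, not the PR's actual code):

```go
// Sketch of the acquire/release lifetime described above: bytes are
// reserved for the committed entries returned by Ready() and released
// when the ready-handling call returns. Names (tracker, handleReady,
// entry) are illustrative, not the PR's actual code.
package main

import "sync"

type entry struct{ data []byte } // stand-in for raftpb.Entry

type tracker struct {
	mu   sync.Mutex
	used int64 // compared against a soft limit when deciding to fetch more entries
}

func (t *tracker) reserve(n int64) { t.mu.Lock(); t.used += n; t.mu.Unlock() }
func (t *tracker) release(n int64) { t.mu.Lock(); t.used -= n; t.mu.Unlock() }

// handleReady mirrors the shape of handleRaftReadyRaftMuLocked: the
// reservation covers the committed entries only for the duration of this
// call, and the quota is returned on exit.
func handleReady(t *tracker, committed []entry, apply func(entry)) {
	var size int64
	for _, e := range committed {
		size += int64(len(e.data))
	}
	t.reserve(size)
	defer t.release(size)

	for _, e := range committed {
		apply(e) // command application to the state machine
	}
}

func main() {
	t := &tracker{}
	handleReady(t, []entry{{data: []byte("put k=v")}}, func(entry) {})
}
```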

### Testing

The effect has been demonstrated on test #120886, which creates a large backlog of unapplied committed entries across multiple ranges. When the server is started, it eagerly tries to apply all of these entries. See the screenshots of a few metrics during an unconstrained run, and a run with memory limiting.

The memory footprint dropped from 10 GB to 6 GB on this test, without negatively impacting throughput (in fact, there was a throughput improvement; this could be incidental, but I observed it multiple times).

<details>
<summary>Before</summary>

<img width="1062" alt="Screenshot 2024-03-22 at 13 09 56" src="https://github.com/cockroachdb/cockroach/assets/3757441/09cd14c0-54ab-4585-8983-5562d5a5b49e">
<img width="1050" alt="Screenshot 2024-03-22 at 13 10 10" src="https://github.com/cockroachdb/cockroach/assets/3757441/93d89241-5f51-41d8-9b81-c52fe428f869">
<img width="1061" alt="Screenshot 2024-03-22 at 13 10 22" src="https://github.com/cockroachdb/cockroach/assets/3757441/446e589b-8cef-455e-bd53-75fc98191791">

</details>

<details>
<summary>After (with COCKROACH_RAFT_ENTRIES_MEMORY_LIMIT=24M)</summary>

<img width="1053" alt="Screenshot 2024-03-22 at 13 13 30" src="https://github.com/cockroachdb/cockroach/assets/3757441/9b1e7d67-0d46-48b4-af5c-f1f381b208f1">
<img width="1048" alt="Screenshot 2024-03-22 at 13 13 38" src="https://github.com/cockroachdb/cockroach/assets/3757441/79a91937-46a7-4461-aca2-27ed6387f44c">
<img width="1051" alt="Screenshot 2024-03-22 at 13 13 48" src="https://github.com/cockroachdb/cockroach/assets/3757441/7b02344f-5494-4e3a-9d04-48c6bbedd097">

</details>

---

Part of #102840

Release note (performance improvement): This change introduces pacing for pulling raft entries from storage when applying them to the state machine. This helps avoid OOMs and excessive resource usage when a node catches up on many committed entries, which ultimately affects availability and performance (e.g., tail latencies). The new metrics `raft.loaded_entries.bytes` and `raft.loaded_entries.reserved.bytes` gauge the total size of log entries pulled from raft for application to the state machine, and can be used to figure out a reasonable soft limit. The soft limit can be set via the `COCKROACH_RAFT_ENTRIES_MEMORY_LIMIT` environment variable, which is currently disabled by default.

Co-authored-by: Pavel Kalinnikov <[email protected]>
@exalate-issue-sync exalate-issue-sync bot added T-kv KV Team and removed T-kv-replication labels Jun 28, 2024
@github-project-automation github-project-automation bot moved this to Incoming in KV Aug 28, 2024