-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: reduce RaftMaxInflightBytes
to 32 MB
#101437
base: reduce RaftMaxInflightBytes
to 32 MB
#101437
Conversation
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR? 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
This still needs some benchmarking, but initial results in #100341 are promising. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, thanks for experimenting!
This patch reduces the default `RaftMaxInflightBytes` from 256 MB to 32 MB, to reduce the out-of-memory incidence during bulk operations like `RESTORE` on clusters with overloaded disks. `RaftMaxInflightBytes` specifies the maximum aggregate byte size of Raft log entries that a leader will send to a follower without hearing responses. As such, it also bounds the amount of replication data buffered in memory on the receiver. Individual messages can still exceed this limit (consider the default command size limit at 64 MB). Normally, `RaftMaxInflightMsgs` * `RaftMaxSizePerMsg` will bound this at 4 MB (128 messages at 32 KB each). However, individual messages are allowed to exceed the 32 KB limit, typically large AddSSTable commands that can be around 10 MB each. To prevent followers running out of memory, we place an additional total byte limit of 32 MB, which is 8 times more than normal. A survey of CC clusters over the past 30 days showed that, excluding a single outlier cluster, the total outstanding `raft.rcvd.queued_bytes` of any individual node never exceeded 500 MB, and was roughly 0 across all clusters for the majority of time. Epic: none Release note (ops change): the amount of replication traffic in flight from a single Raft leader to a follower has been reduced from 256 MB to 32 MB, in order to reduce the chance of running out of memory during bulk write operations. This can be controlled via the environment variable `COCKROACH_RAFT_MAX_INFLIGHT_BYTES`.
42f50f2
to
8136e8e
Compare
I ran some kv0 experiments with batches of 10x 1MB blocks at concurrency 32 on a 5-node cluster with 100 GB pd-ssd, which saturated write bandwidth, and we didn't really see any queueing (running without this patch). Even when I starved IO on one node, the queues didn't increase too much (never beyond 100 MB total per node). So it doesn't really seem like this should affect regular SQL traffic. |
This PR successfully prevented OOMs in 10 runs of Thanks for the review, and for implementing bors r+ |
Build succeeded: |
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool. error setting reviewers, but backport branch blathers/backport-release-23.1-101437 is ready: POST https://api.github.com/repos/cockroachdb/cockroach/pulls/101507/requested_reviewers: 422 Reviews may only be requested from collaborators. One or more of the teams you specified is not a collaborator of the cockroachdb/cockroach repository. [] Backport to branch 23.1.x failed. See errors above. error setting reviewers, but backport branch blathers/backport-release-23.1.0-101437 is ready: POST https://api.github.com/repos/cockroachdb/cockroach/pulls/101508/requested_reviewers: 422 Reviews may only be requested from collaborators. One or more of the teams you specified is not a collaborator of the cockroachdb/cockroach repository. [] Backport to branch 23.1.0 failed. See errors above. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
This patch reduces the default
RaftMaxInflightBytes
from 256 MB to 32 MB, to reduce the out-of-memory incidence during bulk operations likeRESTORE
on clusters with overloaded disks.RaftMaxInflightBytes
specifies the maximum aggregate byte size of Raft log entries that a leader will send to a follower without hearing responses. As such, it also bounds the amount of replication data buffered in memory on the receiver. Individual messages can still exceed this limit (consider the default command size limit at 64 MB).Normally,
RaftMaxInflightMsgs
*RaftMaxSizePerMsg
will bound this at 4 MB (128 messages at 32 KB each). However, individual messages are allowed to exceed the 32 KB limit, typically large AddSSTable commands that can be around 10 MB each. To prevent followers running out of memory, we place an additional total byte limit of 32 MB, which is 8 times more than normal.A survey of CC clusters over the past 30 days showed that, excluding a single outlier cluster, the total outstanding
raft.rcvd.queued_bytes
of any individual node never exceeded 500 MB, and was roughly 0 across all clusters for the majority of time.Touches #71805.
Resolves #100341.
Resolves #100804.
Resolves #100983.
Resolves #101426.
Epic: none
Release note (ops change): the amount of replication traffic in flight from a single Raft leader to a follower has been reduced from 256 MB to 32 MB, in order to reduce the chance of running out of memory during bulk write operations. This can be controlled via the environment variable
COCKROACH_RAFT_MAX_INFLIGHT_BYTES
.