
kvserver: unbounded memory use when falling behind on sideloaded MsgApp #73376

Open
2 tasks
tbg opened this issue Dec 2, 2021 · 2 comments
Assignees
Labels
A-kv-replication Relating to Raft, consensus, and coordination. C-investigation Further steps needed to qualify. C-label will change. T-kv KV Team

Comments

@tbg
Member

tbg commented Dec 2, 2021

Description

In #71802 (comment), we are seeing occasional failures due to nodes running out of memory. The heap profiles show large amounts of memory allocated by loading sideloaded SSTs into memory for appending to followers. Each individual raft leader will only pull roughly one SST per append (due to our 32 KiB max-append-size target), but it may do so for each follower, so for every leader in the system we can expect at most num_followers * sst_size to be pulled into memory per raft cycle. Unfortunately, outgoing messages are buffered (in a queue bounded at 10k messages), so even a single group could in theory hold up to 10k SSTs in memory.
We don't have a single group but potentially tens of thousands of them, and in theory each of them can do the above (though they all share the 10k limit, beyond which messages are dropped wholesale). In practice, the quota pool should, on each leader, prevent too many SSTs from entering the raft layer before they've been fully distributed to the followers. The quota pool size is half the raft log truncation threshold of 16 MB, i.e. an 8 MB proposal quota. So, assuming SSTs no larger than 8 MB, we expect at most 8 MB * num_followers in flight at any given time, per local raft leader.
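
For a rough sense of scale, here is a back-of-the-envelope sketch of the worst case implied by the numbers above. This is illustrative only, not code from the repo; the truncation threshold mirrors the default referenced above, while the follower and leader counts are assumptions.

```go
// Illustrative worst-case arithmetic only; constants are assumptions.
package main

import "fmt"

func main() {
	const (
		raftLogTruncationThreshold = 16 << 20                       // 16 MiB, per the default referenced above
		proposalQuota              = raftLogTruncationThreshold / 2 // 8 MiB quota pool per range
		numFollowers               = 2                              // e.g. a 3x-replicated range (assumption)
		numLocalLeaders            = 10000                          // hypothetical leaseholder count on one node
	)

	// With the quota pool as the only bound, each local leader can have up to
	// proposalQuota bytes in flight per follower.
	perLeader := int64(proposalQuota * numFollowers)
	fmt.Printf("per-leader worst case: %d MiB\n", perLeader>>20)

	// Across many local leaders the theoretical aggregate dwarfs available RAM,
	// which is why the quota pool and the 10k message cap alone don't bound memory.
	fmt.Printf("aggregate worst case: ~%d GiB\n", (perLeader*numLocalLeaders)>>30)
}
```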

Here we saw the heap profile track 2.11 GiB. Unfortunately, we no longer have the artifacts, and even with them it might be difficult to tell whether we are dealing with a small number of extraordinarily large SSTs or a homogeneous flood of reasonably sized SSTs. Still, investigating another occurrence would be helpful, in particular with an eye toward when during the restore the problem occurs.


Action items

  • add a histogram of raft append sizes (making sure it lets us distinguish a few large messages from many reasonably sized ones)
  • switch the outgoing message queue from cardinality-based to message-size-based bounding, and selectively drop messages that don't fit into the queue (how to size the queue is an open question; see the sketch after this list)
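
A minimal sketch of what the size-based queueing in the second item could look like, assuming the etcd-io raftpb message types; the names (sizedQueue, maxBytes) are hypothetical and this is not the CockroachDB implementation:

```go
package raftq

import (
	"sync"

	"go.etcd.io/etcd/raft/v3/raftpb"
)

// sizedQueue bounds the outgoing raft message queue by total byte size rather
// than by message count, dropping messages that would exceed the budget.
type sizedQueue struct {
	mu       sync.Mutex
	msgs     []raftpb.Message
	bytes    int64
	maxBytes int64 // sizing this budget is the open question noted above
}

// tryEnqueue accepts m if it fits into the remaining byte budget and reports
// whether it was queued; callers would drop (and ideally count) rejections.
func (q *sizedQueue) tryEnqueue(m raftpb.Message) bool {
	sz := int64(m.Size()) // proto-generated size; dominated by sideloaded SST payloads once inlined
	q.mu.Lock()
	defer q.mu.Unlock()
	if q.bytes+sz > q.maxBytes {
		return false
	}
	q.msgs = append(q.msgs, m)
	q.bytes += sz
	return true
}

// drain returns everything queued so far and resets the byte budget.
func (q *sizedQueue) drain() []raftpb.Message {
	q.mu.Lock()
	defer q.mu.Unlock()
	out := q.msgs
	q.msgs, q.bytes = nil, 0
	return out
}
```

Recording a histogram of the same m.Size() values at this point would also cover the first action item.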

Jira issue: CRDB-11564

Epic CRDB-39898

@irfansharif
Contributor

irfansharif commented Aug 18, 2022

Tracking this in the 22.2 stability list since it has tripped up several roachtests through opaque OOMs (backlinks above) and users are likely to run into it. Needs an owner, ideally someone from the Repl side.

@tbg
Member Author

tbg commented Mar 16, 2023

We now understand what appears to be the dominant class of memory build-up better, thanks to #98576. In short, when a follower is overloaded, SSTs pile up both in the raft receive queue and, to a larger degree, in the RawNode.raft.raftLog.unstable slice. We see many GBs of data in these locations combined, but no individual replica seems to dominate - it's death by a few dozen moderately severe cuts (i.e. 30 * 200 MB or thereabouts).
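
As a sketch of the kind of byte accounting that would make this visible, one could sum the sizes of a replica's not-yet-stable entries (illustrative only, assuming the etcd-io raftpb types; this is not how CockroachDB instruments it today):

```go
package raftmem

import "go.etcd.io/etcd/raft/v3/raftpb"

// unstableBytes approximates the in-memory footprint of a replica's unstable
// raft log, which for sideloaded proposals is dominated by the SST payloads
// carried in Entry.Data.
func unstableBytes(ents []raftpb.Entry) int64 {
	var n int64
	for i := range ents {
		n += int64(ents[i].Size())
	}
	return n
}
```

Summed over a few dozen such replicas at ~200 MB each, this kind of accounting lines up with the multi-GB totals observed.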

@tbg tbg changed the title kvserver: unbounded memory use when appending sideloaded proposals kvserver: unbounded memory use when falling behind on sideloaded MsgApp Mar 16, 2023
tbg added a commit to tbg/cockroach that referenced this issue Mar 16, 2023
We see that on 2xlarge this test likely runs into its EBS bandwidth
limits. The easiest way to avoid that is to switch to a beefier machine,
which doubles the bandwidth limits.

We should also survive being bandwidth-limited, but currently don't do so
reliably - this is tracked in cockroachdb#73376.

Epic: CRDB-25503
Release note: None