engine: Limit size of merged rocksdb batches #27865
Comments
I'm a bit surprised this actually occurred in practice. I'd really love to see this super-large batch and verify it contains keys from multiple different ranges. Failing that, I'll probably make an attempt at reproducing the problem first.
Yeah, this is speculation on my part since I don't know anything about what's actually in this batch. It's definitely true that there's no limit there now, so it could produce a batch of unbounded size. It doesn't sound too far-fetched to me that all the node restarts could provide a forcing function that would result in more simultaneous disk writes that would hit this batching mechanism.
I'm pretty sure RocksDB doesn't have a limit on the size of batches that it will merge either. Let me check that.
I was wrong; RocksDB does have a limit on the size of batches it groups together. I'll add something similar.
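For reference, here is a minimal Go rendering of the grouping heuristic in RocksDB's `write_thread.cc` (the function name `maxGroupSize` and its packaging are mine; the constants reflect RocksDB's logic as of the version current at the time):

```go
package main

import "fmt"

// maxGroupSize returns the byte budget for a batch group, given the
// size of the group leader's batch. Groups are capped at 1 MB overall,
// but a small leader (<= 128 KB) admits only another 128 KB of
// followers, so that a small write isn't held up while an enormous
// group is assembled behind it.
func maxGroupSize(leaderSize int) int {
	const (
		maxBytes             = 1 << 20   // 1 MB hard cap on a group
		smallLeaderThreshold = 128 << 10 // 128 KB
	)
	if leaderSize <= smallLeaderThreshold {
		return leaderSize + smallLeaderThreshold
	}
	return maxBytes
}

func main() {
	fmt.Println(maxGroupSize(4 << 10))   // 135168: 4 KB leader + 128 KB of followers
	fmt.Println(maxGroupSize(512 << 10)) // 1048576: the 1 MB cap
}
```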
Limit the size of batch groups to 1 MB. A larger batch can still be committed, but it won't be grouped with any other batches.

Fixes cockroachdb#27865

Release note (bug fix): Limit the size of "batch groups" when committing a batch to RocksDB to avoid rare scenarios in which multi-gigabyte batch groups are created, which can cause a server to run out of memory when replaying the RocksDB log at startup.
27895: storage/engine: limit the size of batch groups r=bdarnell a=petermattis

Limit the size of batch groups to 1 MB. A larger batch can still be committed, but it won't be grouped with any other batches.

Fixes #27865

Release note (bug fix): Limit the size of "batch groups" when committing a batch to RocksDB to avoid rare scenarios in which multi-gigabyte batch groups are created, which can cause a server to run out of memory when replaying the RocksDB log at startup.

27902: roachpb: replace `gogoproto.onlyone` with `oneof` in ErrorDetail r=nvanbenschoten a=tamird

According to @nvanbenschoten: Improves throughput by 1.9% against a patched version of `kv --read-percent=0 --seed=<something-fixed>` that used INSERTs instead of UPSERTs to create all primary key violations. With the change to avoid the allocation of `roachpb.Error.ErrorDetail`, this change is, in theory, strictly an improvement, because we're simply trading one big allocation (`ErrorDetail`) for a much smaller one (`isErrorDetail_Value`). I also suspect the constant-time dynamic dispatch of the interface is faster than the linear-time "dynamic" dispatch of the union struct.

Co-authored-by: Peter Mattis <[email protected]>
Co-authored-by: Tamir Duberstein <[email protected]>
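For context on the 27902 half, here is a compact Go sketch of the representational difference being described; the types below are illustrative stand-ins, not the real roachpb definitions:

```go
package main

import "fmt"

// Illustrative stand-ins for two of the many error detail payloads.
type notLeaseHolderError struct{}
type rangeNotFoundError struct{}

// Union-struct style (the old gogoproto.onlyone representation): one
// pointer field per detail type, at most one of which is non-nil. The
// struct is a single large allocation, and finding the set field is a
// linear scan over every field.
type errorDetailUnion struct {
	NotLeaseHolder *notLeaseHolderError
	RangeNotFound  *rangeNotFoundError
	// ...dozens more pointer fields in the real message.
}

func (u *errorDetailUnion) value() interface{} {
	switch {
	case u.NotLeaseHolder != nil:
		return u.NotLeaseHolder
	case u.RangeNotFound != nil:
		return u.RangeNotFound
	}
	return nil
}

// oneof style (the new representation): a small interface value that
// wraps exactly one detail. Setting it costs one small wrapper
// allocation, and dispatch is a constant-time type assertion.
type isErrorDetailValue interface{ isErrorDetailValue() }

type detailNotLeaseHolder struct{ NotLeaseHolder *notLeaseHolderError }

func (*detailNotLeaseHolder) isErrorDetailValue() {}

func main() {
	var d isErrorDetailValue = &detailNotLeaseHolder{&notLeaseHolderError{}}
	_, ok := d.(*detailNotLeaseHolder)
	fmt.Println(ok) // true: one small allocation, constant-time dispatch
}
```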
28009: release-2.0: storage/engine: limit the size of batch groups r=benesch a=petermattis

Backport 1/1 commits from #27895. /cc @cockroachdb/release

Queuing this up so we don't forget about it.

---

Limit the size of batch groups to 1 MB. A larger batch can still be committed, but it won't be grouped with any other batches.

Fixes #27865

Release note (bug fix): Limit the size of "batch groups" when committing a batch to RocksDB to avoid rare scenarios in which multi-gigabyte batch groups are created, which can cause a server to run out of memory when replaying the RocksDB log at startup.

Co-authored-by: Peter Mattis <[email protected]>
When many goroutines are writing to RocksDB at once, a syncer goroutine collects all of their write batches and commits them together, minimizing the number of distinct disk writes. However, this can result in very large RocksDB WAL entries, and replaying a RocksDB WAL entry during recovery appears to require substantially more memory than writing it the first time. A crash while a very large entry is in the uncompacted portion of the WAL can render the node unable to restart. Empirically, we've seen a node with 8 GB of RAM write a 1.6 GB entry to the log and then be unable to play it back on restart.

The syncer goroutine should monitor the size of the batch groups it builds and limit them to a reasonable size, as in the sketch below.
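A minimal Go sketch of that kind of guard, assuming a queue-draining syncer; `batch`, `collectGroup`, and `maxBatchGroupSize` are illustrative names, not CockroachDB's actual identifiers:

```go
package main

import "fmt"

// maxBatchGroupSize caps how many bytes of pending batches are merged
// into a single group commit; 1 MB matches the limit from the fix.
const maxBatchGroupSize = 1 << 20

// batch stands in for a pending write batch; only its encoded size
// matters for grouping.
type batch struct{ size int }

// collectGroup drains queued batches into one group to commit together,
// stopping before the group would exceed maxBatchGroupSize. The first
// batch is always taken, so a single oversized batch still commits; it
// just isn't merged with any other batches.
func collectGroup(queue []*batch) (group, rest []*batch) {
	groupSize := 0
	for i, b := range queue {
		if i > 0 && groupSize+b.size > maxBatchGroupSize {
			return queue[:i], queue[i:]
		}
		groupSize += b.size
	}
	return queue, nil
}

func main() {
	queue := []*batch{{600 << 10}, {600 << 10}, {2 << 20}}
	group, rest := collectGroup(queue)
	fmt.Println(len(group), len(rest)) // 1 2: merging the second batch would cross 1 MB
}
```

The key property matches the fix's description: a batch larger than the limit can still be committed, it just forms a group of one instead of being grouped with other batches.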
cc @petermattis