engine: Limit size of merged rocksdb batches #27865

Closed
bdarnell opened this issue Jul 23, 2018 · 4 comments · Fixed by #27895
Labels
A-storage: Relating to our storage engine (Pebble) on-disk storage.
C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
O-support: Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs.
S-1-stability: Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting.

Comments

@bdarnell
Contributor

When many threads are writing to RocksDB at once, a syncer goroutine collects all of their write batches and writes them in a single operation, minimizing the number of distinct disk writes. However, this can produce very large RocksDB WAL entries, and processing a WAL entry during recovery appears to require substantially more memory than writing it the first time. A crash while a very large entry is in the uncompacted portion of the WAL can therefore render the node unable to restart. Empirically, we've seen a node with 8 GB of RAM write a 1.6 GB entry to the log and be unable to play it back on restart.

The syncer goroutine should track the size of the batch groups it builds and cap them at a reasonable limit.

cc @petermattis
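
For concreteness, here is a minimal sketch of the proposed cap, assuming a simplified syncer loop (the names and structure are hypothetical, not the actual CockroachDB code): keep absorbing queued batches, but stop once the merged group reaches a size threshold so a single WAL entry can't grow without bound.

```go
// Hypothetical sketch only; not the real storage/engine syncer.
package batching

const maxSyncGroupBytes = 1 << 20 // illustrative cap; the exact value is a tuning choice

type writeReq struct {
	repr []byte     // serialized write batch
	done chan error // signaled once the batch is durably written
}

// syncerLoop merges queued write batches into a single commit, but caps
// how large the merged group may grow. Batches left in the queue are
// picked up by the next iteration.
func syncerLoop(queue chan *writeReq, commit func(repr []byte) error) {
	for req := range queue {
		group := []*writeReq{req}
		merged := append([]byte(nil), req.repr...) // concatenation stands in for real batch merging
	merge:
		for len(merged) < maxSyncGroupBytes {
			select {
			case next, ok := <-queue:
				if !ok {
					break merge
				}
				group = append(group, next)
				merged = append(merged, next.repr...)
			default:
				break merge
			}
		}
		err := commit(merged)
		for _, r := range group {
			r.done <- err
		}
	}
}
```

A batch that is by itself larger than the cap would still be committed; it just wouldn't be merged with anything else.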

@bdarnell added the C-bug, S-1-stability, O-support, and A-storage labels Jul 23, 2018
@petermattis
Collaborator

I'm a bit surprised this actually occurred in practice. I'd really love to see this super-large batch and verify it contains keys from multiple different ranges. Failing that, I'll probably make an attempt at reproducing the problem first.

@bdarnell
Contributor Author

Yeah, this is speculation on my part since I don't know anything about what's actually in this batch. It's definitely true that there's no limit there now, so it could produce a batch of unbounded size. It doesn't sound too far-fetched to me that all the node restarts acted as a forcing function, producing more simultaneous disk writes and thus exercising this batching mechanism.

@petermattis
Collaborator

I'm pretty sure RocksDB doesn't have a limit on the size of batches that it will merge either. Let me check that.

@petermattis self-assigned this Jul 24, 2018
@petermattis
Collaborator

I was wrong: RocksDB does have a limit on the size of the batches it groups together. I'll add something similar.

```cpp
// Allow the group to grow up to a maximum size, but if the
// original write is small, limit the growth so we do not slow
// down the small write too much.
size_t max_size = 1 << 20;
if (size <= (128 << 10)) {
  max_size = size + (128 << 10);
}
```
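
For comparison, here is a rough Go rendering of that heuristic (illustrative only; the constants mirror the C++ snippet above):

```go
package batching

// maxBatchGroupSize mirrors the RocksDB heuristic: cap the group at 1 MB,
// but if the leader's write is small, only allow 128 KB of growth so the
// small write isn't delayed too much.
func maxBatchGroupSize(leaderSize int) int {
	const (
		hardCap   = 1 << 20   // 1 MB
		smallSize = 128 << 10 // 128 KB
	)
	if leaderSize <= smallSize {
		return leaderSize + smallSize
	}
	return hardCap
}
```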

petermattis added a commit to petermattis/cockroach that referenced this issue Jul 24, 2018
Limit the size of batch groups to 1 MB. A larger batch can still be
committed, but it won't be grouped with any other batches.

Fixes cockroachdb#27865

Release note (bug fix): Limit the size of "batch groups" when committing
a batch to RocksDB to avoid rare scenarios in which multi-gigabyte batch
groups are created which can cause a server to run out of memory when
replaying the RocksDB log at startup.
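
A small sketch of the grouping behavior that commit message describes (a hypothetical helper, not the actual change): batches are packed into groups of at most 1 MB, and a batch that is itself larger than the cap still commits, it just forms a group of its own.

```go
package batching

const maxGroupSize = 1 << 20 // 1 MB

// groupBatches packs serialized batches into groups whose combined size
// stays at or below maxGroupSize (except for a single batch that is
// itself larger than the cap, which gets a group of its own and is
// still committed).
func groupBatches(pending [][]byte) [][][]byte {
	var groups [][][]byte
	var cur [][]byte
	var size int
	for _, b := range pending {
		if len(cur) > 0 && size+len(b) > maxGroupSize {
			groups = append(groups, cur)
			cur, size = nil, 0
		}
		cur = append(cur, b)
		size += len(b)
	}
	if len(cur) > 0 {
		groups = append(groups, cur)
	}
	return groups
}
```
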
craig bot pushed a commit that referenced this issue Jul 26, 2018
27895: storage/engine: limit the size of batch groups r=bdarnell a=petermattis

Limit the size of batch groups to 1 MB. A larger batch can still be
committed, but it won't be grouped with any other batches.

Fixes #27865

Release note (bug fix): Limit the size of "batch groups" when committing
a batch to RocksDB to avoid rare scenarios in which multi-gigabyte batch
groups are created which can cause a server to run out of memory when
replaying the RocksDB log at startup.

27902: roachpb: replace `gogoproto.onlyone` with `oneof` in ErrorDetail r=nvanbenschoten a=tamird

According to @nvanbenschoten:

Improves throughput by 1.9% against a patched version of `kv --read-percent=0 --seed=<something-fixed>` that used INSERTs instead of UPSERTs so that every write produced a primary key violation. Combined with the change to avoid allocating `roachpb.Error.ErrorDetail`, this is in theory strictly an improvement, because we're simply trading one big allocation (`ErrorDetail`) for a much smaller one (`isErrorDetail_Value`). I also suspect the constant-time dynamic dispatch of the interface is faster than the linear-time "dynamic" dispatch of the union struct.

Co-authored-by: Peter Mattis <[email protected]>
Co-authored-by: Tamir Duberstein <[email protected]>
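
As a rough illustration of the allocation and dispatch tradeoff described above (the type names are simplified stand-ins, not the generated roachpb code):

```go
package errdetail

// Stand-in detail payloads (simplified; not the actual roachpb messages).
type notLeaseHolderError struct{}
type rangeNotFoundError struct{}

// gogoproto.onlyone style: one wide struct with a nullable pointer field
// per detail type. Every error detail allocates the whole struct, and
// finding the populated field means scanning the fields one by one.
type errorDetailUnion struct {
	NotLeaseHolder *notLeaseHolderError
	RangeNotFound  *rangeNotFoundError
	// ... many more pointer fields in the real message ...
}

// oneof style: each detail gets a small wrapper implementing a marker
// interface; the error carries a single interface value, and dispatch is
// a type switch instead of a linear field scan.
type isErrorDetailValue interface{ isErrorDetailValue() }

type errorDetailNotLeaseHolder struct{ v *notLeaseHolderError }
type errorDetailRangeNotFound struct{ v *rangeNotFoundError }

func (*errorDetailNotLeaseHolder) isErrorDetailValue() {}
func (*errorDetailRangeNotFound) isErrorDetailValue()  {}

func detailName(d isErrorDetailValue) string {
	switch d.(type) {
	case *errorDetailNotLeaseHolder:
		return "NotLeaseHolderError"
	case *errorDetailRangeNotFound:
		return "RangeNotFoundError"
	default:
		return "unknown"
	}
}
```
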
craig bot closed this as completed in #27895 Jul 26, 2018
craig bot pushed a commit that referenced this issue Aug 2, 2018
28009: release-2.0: storage/engine: limit the size of batch groups r=benesch a=petermattis

Backport 1/1 commits from #27895.

/cc @cockroachdb/release

Queuing this up so we don't forget about it.

Co-authored-by: Peter Mattis <[email protected]>