engine: Limit size of merged rocksdb batches #27865

Closed
bdarnell opened this issue Jul 23, 2018 · 4 comments · Fixed by #27895
Labels
A-storage: Relating to our storage engine (Pebble) on-disk storage.
C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
O-support: Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs.
S-1-stability: Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting.

Comments

@bdarnell
Contributor

When many threads are writing to RocksDB at once, a syncer goroutine collects all of their write batches and writes them in a single operation, minimizing the number of distinct disk writes. However, this can produce very large RocksDB WAL entries, and processing a WAL entry during recovery appears to require substantially more memory than writing it the first time. A crash while a very large entry is in the uncompacted portion of the WAL can therefore render the node unable to restart. Empirically, we've seen a node with 8 GB of RAM write a 1.6 GB entry to the log and be unable to play it back on restart.

The syncer goroutine should track the size of the batch groups it builds and cap them at a reasonable limit.

cc @petermattis
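
For concreteness, here is a minimal sketch of the proposed cap, assuming a simplified syncer loop (the names and structure are hypothetical, not the actual CockroachDB code): keep absorbing queued batches, but stop once the merged group reaches a size threshold so a single WAL entry can't grow without bound.

```go
// Hypothetical sketch only; not the real storage/engine syncer.
package batching

const maxSyncGroupBytes = 1 << 20 // illustrative cap; the exact value is a tuning choice

type writeReq struct {
	repr []byte     // serialized write batch
	done chan error // signaled once the batch is durably written
}

// syncerLoop merges queued write batches into a single commit, but caps
// how large the merged group may grow. Batches left in the queue are
// picked up by the next iteration.
func syncerLoop(queue chan *writeReq, commit func(repr []byte) error) {
	for req := range queue {
		group := []*writeReq{req}
		merged := append([]byte(nil), req.repr...) // concatenation stands in for real batch merging
	merge:
		for len(merged) < maxSyncGroupBytes {
			select {
			case next, ok := <-queue:
				if !ok {
					break merge
				}
				group = append(group, next)
				merged = append(merged, next.repr...)
			default:
				break merge
			}
		}
		err := commit(merged)
		for _, r := range group {
			r.done <- err
		}
	}
}
```

A batch that is by itself larger than the cap would still be committed; it just wouldn't be merged with anything else.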

@bdarnell added the C-bug, S-1-stability, O-support, and A-storage labels Jul 23, 2018
@petermattis
Collaborator

I'm a bit surprised this actually occurred in practice. I'd really love to see this super-large batch and verify it contains keys from multiple different ranges. Failing that, I'll probably make an attempt at reproducing the problem first.

@bdarnell
Contributor Author

Yeah, this is speculation on my part since I don't know anything about what's actually in this batch. It's definitely true that there's no limit there now, so it could produce a batch of unbounded size. It doesn't sound too far-fetched to me that all the node restarts acted as a forcing function, producing more simultaneous disk writes and thus exercising this batching mechanism.

@petermattis
Collaborator

I'm pretty sure RocksDB doesn't have a limit on the size of batches that it will merge either. Let me check that.

@petermattis self-assigned this Jul 24, 2018
@petermattis
Collaborator

I was wrong: RocksDB does have a limit on the size of the batches it groups together. I'll add something similar.

```cpp
// Allow the group to grow up to a maximum size, but if the
// original write is small, limit the growth so we do not slow
// down the small write too much.
size_t max_size = 1 << 20;
if (size <= (128 << 10)) {
  max_size = size + (128 << 10);
}
```
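
For comparison, here is a rough Go rendering of that heuristic (illustrative only; the constants mirror the C++ snippet above):

```go
package batching

// maxBatchGroupSize mirrors the RocksDB heuristic: cap the group at 1 MB,
// but if the leader's write is small, only allow 128 KB of growth so the
// small write isn't delayed too much.
func maxBatchGroupSize(leaderSize int) int {
	const (
		hardCap   = 1 << 20   // 1 MB
		smallSize = 128 << 10 // 128 KB
	)
	if leaderSize <= smallSize {
		return leaderSize + smallSize
	}
	return hardCap
}
```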

petermattis added a commit to petermattis/cockroach that referenced this issue Jul 24, 2018
Limit the size of batch groups to 1 MB. A larger batch can still be
committed, but it won't be grouped with any other batches.

Fixes cockroachdb#27865

Release note (bug fix): Limit the size of "batch groups" when committing
a batch to RocksDB to avoid rare scenarios in which multi-gigabyte batch
groups are created which can cause a server to run out of memory when
replaying the RocksDB log at startup.
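
A small sketch of the grouping behavior that commit message describes (a hypothetical helper, not the actual change): batches are packed into groups of at most 1 MB, and a batch that is itself larger than the cap still commits, it just forms a group of its own.

```go
package batching

const maxGroupSize = 1 << 20 // 1 MB

// groupBatches packs serialized batches into groups whose combined size
// stays at or below maxGroupSize (except for a single batch that is
// itself larger than the cap, which gets a group of its own and is
// still committed).
func groupBatches(pending [][]byte) [][][]byte {
	var groups [][][]byte
	var cur [][]byte
	var size int
	for _, b := range pending {
		if len(cur) > 0 && size+len(b) > maxGroupSize {
			groups = append(groups, cur)
			cur, size = nil, 0
		}
		cur = append(cur, b)
		size += len(b)
	}
	if len(cur) > 0 {
		groups = append(groups, cur)
	}
	return groups
}
```
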
craig bot pushed a commit that referenced this issue Jul 26, 2018
27895: storage/engine: limit the size of batch groups r=bdarnell a=petermattis

Limit the size of batch groups to 1 MB. A larger batch can still be
committed, but it won't be grouped with any other batches.

Fixes #27865

Release note (bug fix): Limit the size of "batch groups" when committing
a batch to RocksDB to avoid rare scenarios in which multi-gigabyte batch
groups are created which can cause a server to run out of memory when
replaying the RocksDB log at startup.

27902: roachpb: replace `gogoproto.onlyone` with `oneof` in ErrorDetail r=nvanbenschoten a=tamird

According to @nvanbenschoten:

Improves throughput by 1.9% against a patched version of `kv --read-percent=0 --seed=<something-fixed>` that used INSERTs instead of UPSERTs so that every write produced a primary key violation. Combined with the change to avoid allocating `roachpb.Error.ErrorDetail`, this is in theory strictly an improvement, because we're simply trading one big allocation (`ErrorDetail`) for a much smaller one (`isErrorDetail_Value`). I also suspect the constant-time dynamic dispatch of the interface is faster than the linear-time "dynamic" dispatch of the union struct.

Co-authored-by: Peter Mattis <[email protected]>
Co-authored-by: Tamir Duberstein <[email protected]>
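
As a rough illustration of the allocation and dispatch tradeoff described above (the type names are simplified stand-ins, not the generated roachpb code):

```go
package errdetail

// Stand-in detail payloads (simplified; not the actual roachpb messages).
type notLeaseHolderError struct{}
type rangeNotFoundError struct{}

// gogoproto.onlyone style: one wide struct with a nullable pointer field
// per detail type. Every error detail allocates the whole struct, and
// finding the populated field means scanning the fields one by one.
type errorDetailUnion struct {
	NotLeaseHolder *notLeaseHolderError
	RangeNotFound  *rangeNotFoundError
	// ... many more pointer fields in the real message ...
}

// oneof style: each detail gets a small wrapper implementing a marker
// interface; the error carries a single interface value, and dispatch is
// a type switch instead of a linear field scan.
type isErrorDetailValue interface{ isErrorDetailValue() }

type errorDetailNotLeaseHolder struct{ v *notLeaseHolderError }
type errorDetailRangeNotFound struct{ v *rangeNotFoundError }

func (*errorDetailNotLeaseHolder) isErrorDetailValue() {}
func (*errorDetailRangeNotFound) isErrorDetailValue()  {}

func detailName(d isErrorDetailValue) string {
	switch d.(type) {
	case *errorDetailNotLeaseHolder:
		return "NotLeaseHolderError"
	case *errorDetailRangeNotFound:
		return "RangeNotFoundError"
	default:
		return "unknown"
	}
}
```
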
craig bot closed this as completed in #27895 Jul 26, 2018
craig bot pushed a commit that referenced this issue Aug 2, 2018
28009: release-2.0: storage/engine: limit the size of batch groups r=benesch a=petermattis

Backport 1/1 commits from #27895.

/cc @cockroachdb/release

Queuing this up so we don't forget about it.

Co-authored-by: Peter Mattis <[email protected]>