
storage: investigate larger default range sizes #39717

Closed
jeffrey-xiao opened this issue Aug 16, 2019 · 6 comments · Fixed by #44440
Labels
C-investigation Further steps needed to qualify. C-label will change.

Comments

@jeffrey-xiao
Contributor

jeffrey-xiao commented Aug 16, 2019

#16954 was a major blocker to increasing the default range size from 64 MiB (another potential problem is #22348). Now that #16954 has been closed, we can experiment with larger default range sizes.

I ran tpcc with 2500 warehouses on a 3 node cluster with 16 vCPUs, using an average range size of 500 MiB and a max range size of 1 GiB, and nothing seemed to blow up.


There were a couple of latency spikes during the ramp-up period of tpcc, but I think they were caused by the destruction of old replicas when the initial splits created while setting up the workload were merged away (not too sure about this).

@petermattis
Collaborator

Not a blocker, but I recall there is a timeout for snapshot send operations. If we increase the range size significantly, we may need to increase that timeout. For future-proofing, it may be a good idea to adjust the timeout based on the size of the snapshot. Or, since we send the snapshot in chunks, we could put the timeout on the sending of each chunk instead of on the operation as a whole.
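A minimal sketch of the per-chunk idea, with a hypothetical `snapshotStream` interface and per-chunk budget standing in for the real snapshot sender in the storage package:

```go
package example

import (
	"context"
	"fmt"
	"time"
)

// snapshotStream is a hypothetical chunked transport; the real snapshot
// sender has a different interface.
type snapshotStream interface {
	Send(ctx context.Context, chunk []byte) error
}

// sendSnapshot applies a timeout to each chunk send rather than to the
// operation as a whole, so the total allowed time scales with snapshot size.
func sendSnapshot(ctx context.Context, stream snapshotStream, chunks [][]byte) error {
	const chunkTimeout = 10 * time.Second // illustrative per-chunk budget
	for i, chunk := range chunks {
		chunkCtx, cancel := context.WithTimeout(ctx, chunkTimeout)
		err := stream.Send(chunkCtx, chunk)
		cancel()
		if err != nil {
			return fmt.Errorf("sending chunk %d/%d: %w", i+1, len(chunks), err)
		}
	}
	return nil
}
```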

awoods187 added the C-investigation label Aug 30, 2019
@ajwerner
Contributor

> Not a blocker, but I recall there is a timeout for snapshot send operations.

This timeout isn't directly on the snapshot itself; it comes from the timeout on queue processing in the storage package. I'm going to type up a patch to allow queues to independently control that timeout. For the raft snapshot queue, my sense is that it'd be best to set the queue process timeout quite high and then do what you suggest with a timeout per chunk or per snapshot.

@ajwerner
Contributor

Err, we do have the ability to set the timeout per queue, but it was set to the default of 1m.
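For comparison, a sketch of the other option mentioned above, scaling the per-queue timeout with the amount of data to send; the assumed send rate and the 1m floor are illustrative, not actual CockroachDB settings:

```go
package example

import "time"

// snapshotQueueTimeout scales the queue-processing timeout with snapshot size
// instead of using a flat default, keeping the old 1m value as a floor.
func snapshotQueueTimeout(snapshotBytes int64) time.Duration {
	const (
		minTimeout         = time.Minute // the previous flat default
		assumedBytesPerSec = 8 << 20     // assumed conservative send rate: 8 MiB/s
	)
	d := time.Duration(snapshotBytes/assumedBytesPerSec) * time.Second
	if d < minTimeout {
		return minTimeout
	}
	return d
}
```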

@ajwerner
Contributor

ajwerner commented Dec 13, 2019

Concerns:

  • Snapshot timeouts
  • Backup memory usage
    • Backups generate SSTs which are range-sized. These SSTs are built in C++ and handed back to Go. A 512 MB slice is a big slice in memory-constrained environments.
  • CDC catch-up scans
    • CDC uses export requests to pull large amounts of data to perform catch-up scans. We need a mechanism to limit the size that is retrieved.
  • RESTORE
    • There might also be a memory blowup on the RESTORE side, since all the SSTs that cover a given span being restored need to be downloaded and opened as iterators that can be wrapped in one merging iterator, from which the RESTORE’d span SST is produced. The SST index is at the end of the file, so we have to download the whole thing before we can create an iterator over it. Right now that means all the SSTs that contain keys for a given span being imported are held in RAM while we construct, also in RAM, the resulting SST that covers that range (see the sketch after this list). We could spill them to disk, but encryption at rest makes that somewhat more involved.
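To make the RESTORE bullet concrete, here is a toy sketch of the memory shape it describes: every input SST covering the span is fully resident, and the merged output is built in memory as well. The kv type and the naive merge are stand-ins, not the real restore code:

```go
package main

import (
	"bytes"
	"fmt"
)

// kv is a stand-in for an SST entry; the real code operates on encoded SST
// blocks, but the memory shape is the same: all inputs plus the output in RAM.
type kv struct {
	key, value []byte
}

// mergeInputs naively k-way merges already-sorted input "SSTs" into a single
// output. Every input slice stays resident for the whole merge, and the
// output grows in memory too, which is the blowup described above once
// ranges (and hence the SSTs derived from them) get much larger.
func mergeInputs(inputs [][]kv) []kv {
	var out []kv
	idx := make([]int, len(inputs)) // cursor into each sorted input
	for {
		best := -1
		for i, pos := range idx {
			if pos >= len(inputs[i]) {
				continue
			}
			if best == -1 || bytes.Compare(inputs[i][pos].key, inputs[best][idx[best]].key) < 0 {
				best = i
			}
		}
		if best == -1 {
			return out // all inputs exhausted
		}
		out = append(out, inputs[best][idx[best]])
		idx[best]++
	}
}

func main() {
	a := []kv{{[]byte("a"), []byte("1")}, {[]byte("c"), []byte("3")}}
	b := []kv{{[]byte("b"), []byte("2")}, {[]byte("d"), []byte("4")}}
	for _, e := range mergeInputs([][]kv{a, b}) {
		fmt.Printf("%s=%s\n", e.key, e.value)
	}
}
```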

@dt
Member

dt commented Dec 13, 2019

Another reason RESTORE might prefer not to spill to disk is that disk bandwidth is often a hot commodity during a RESTORE or bulk ingestion -- using any of it to write something other than the final ingested SST data could potentially make RESTORE slower.

@ajwerner
Contributor

@dt correct me if I'm wrong, but from the discussion on Slack it seems like we can mitigate all of the above concerns if we create SSTs during backup with some target size rather than at the full range size.

SSTs created prior to increasing the range size will target 64 MB. If we continued to target 64 MB exported SSTs for BACKUP, then that would continue to be the size used for RESTORE. The potential downside of this approach is that RESTORE will create smaller ranges. We're not too concerned about this downside in the long run because the ranges will get merged together. The problem with waiting for the merge queue to merge the ranges together is that it will need to move data around to colocate the ranges for merging, which could lead to a bunch of extra data movement following the RESTORE. @dt does limiting the size of the SSTs created during backup seem challenging?
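A rough sketch of that proposal from the caller's side, with a hypothetical `exportChunk` function standing in for the real ExportRequest/ExportToSst plumbing:

```go
package example

// exportChunk is a hypothetical helper: it exports roughly targetSize bytes of
// SST data starting at startKey and returns the produced SST plus the key to
// resume from ("" once the span is exhausted).
type exportChunk func(startKey, endKey string, targetSize int64) (sst []byte, resumeKey string, err error)

// backupSpan writes a sequence of ~targetSize SSTs for [startKey, endKey)
// instead of a single SST sized to the (possibly much larger) range.
func backupSpan(export exportChunk, write func(sst []byte) error, startKey, endKey string) error {
	const targetSize = 64 << 20 // keep the historical 64 MB SST size
	for cur := startKey; cur != ""; {
		sst, resume, err := export(cur, endKey, targetSize)
		if err != nil {
			return err
		}
		if err := write(sst); err != nil {
			return err
		}
		cur = resume
	}
	return nil
}
```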

ajwerner added a commit to ajwerner/cockroach that referenced this issue Jan 28, 2020
This commit extends the engine interface to take a targetSize parameter in
the ExportToSst method. The iteration will stop if the first version of a key
to be added to the SST would lead to targetSize being exceeded. If
exportAllRevisions is false, the targetSize will not be exceeded unless the
first kv pair exceeds it.

This commit additionally fixes a bug in the rocksdb implementation of
DBExportToSst whereby the first key in the export request would be skipped.
This case likely never occurred because the key passed to Export was rarely
exactly the first key to be included (see the change related to seek_key in
db.cc).

The exportccl.TestRandomKeyAndTimestampExport test was extended to exercise
various targetSize limits. That test, run under stress with the tee engine,
inspires some confidence and did catch the above-mentioned bug. More testing
would likely be good.

This commit leaves the task of adopting the targetSize parameter for later.

Fixes cockroachdb#39717.

Release note: None
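A sketch of the stopping rule described in this commit message, over a simplified in-memory representation (each key carries all of its versions); the real implementation works against an MVCC iterator and has a different signature:

```go
package example

// versionedKV is a simplified stand-in for an MVCC key with all of its
// versions, newest first. Every key is assumed to have at least one version.
type versionedKV struct {
	key      string
	versions [][]byte
}

// exportUpTo collects keys into one "SST" and stops when adding the first
// version of the next key would push the total past targetSize. With
// exportAllRevisions, all versions of a started key are included, so the
// target can be exceeded; without it, only the very first kv can exceed it.
func exportUpTo(kvs []versionedKV, targetSize int64, exportAllRevisions bool) (included []versionedKV, resumeIdx int) {
	var size int64
	for i, kv := range kvs {
		first := int64(len(kv.key) + len(kv.versions[0]))
		if i > 0 && size+first > targetSize {
			return included, i // resume from this key in the next SST
		}
		size += first
		if exportAllRevisions {
			for _, v := range kv.versions[1:] {
				size += int64(len(kv.key) + len(v))
			}
		}
		included = append(included, kv)
	}
	return included, len(kvs)
}
```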
ajwerner added a commit to ajwerner/cockroach that referenced this issue Jan 28, 2020
craig bot pushed a commit that referenced this issue Jan 28, 2020
44440: libroach,engine: support pagination of ExportToSst r=ajwerner a=ajwerner

craig bot closed this as completed in 05f91e5 Jan 28, 2020
ajwerner reopened this Jan 28, 2020
ajwerner added a commit to ajwerner/cockroach that referenced this issue Feb 3, 2020
This PR is motivated by the desire to get the memory usage of CDC under
control in the presence of much larger ranges. Currently, when a changefeed
decides it needs to do a backfill, it breaks the spans up along range
boundaries and then fetches the data (with some parallelism) for the backfill.

The memory overhead was somewhat bounded by the range size. If we want to make
the range size dramatically larger, the memory usage would become a function of
that new, much larger range size.

Fortunately, we don't have much need for these `ExportRequest`s any more.
Another fascinating revelation of late is that the `ScanResponse` does indeed
include MVCC timestamps (not that we necessarily needed them, but it's a good
idea to keep them for compatibility).

The `ScanRequest` currently permits a limit on `NumRows`, which this commit
utilizes. I wanted to get this change typed in anticipation of cockroachdb#44341,
which will provide a limit on `NumBytes`.

I retained the existing parallelism, as ScanRequests with limits are not
parallelized.

I would like to do some benchmarking, but I feel pretty okay about the
testing we have in place already. @danhhz what do you want to see here?

Relates to cockroachdb#39717.

Release note: None.
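A sketch of the paginated-scan shape described above, with a hypothetical `scanPage` function in place of the real ScanRequest/ScanResponse plumbing and an illustrative row limit:

```go
package example

// row is a simplified scan result entry; the real ScanResponse also carries
// the MVCC timestamp, which the changefeed keeps.
type row struct {
	key, value []byte
}

// scanPage stands in for issuing a ScanRequest with a NumRows limit: it
// returns up to maxRows rows starting at startKey plus the key to resume
// from ("" once the span is exhausted).
type scanPage func(startKey, endKey string, maxRows int64) (rows []row, resumeKey string, err error)

// backfillSpan fetches a span for a changefeed backfill in bounded pages, so
// memory usage becomes a function of the page size rather than the range size.
func backfillSpan(scan scanPage, emit func([]row) error, startKey, endKey string) error {
	const maxRows = 1024 // illustrative page size, not a CockroachDB setting
	for cur := startKey; cur != ""; {
		rows, resume, err := scan(cur, endKey, maxRows)
		if err != nil {
			return err
		}
		if err := emit(rows); err != nil {
			return err
		}
		cur = resume
	}
	return nil
}
```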
ajwerner added a commit to ajwerner/cockroach that referenced this issue Feb 4, 2020
craig bot pushed a commit that referenced this issue Feb 4, 2020
44663: changefeedccl: use ScanRequest instead of ExportRequest during backfills r=danhhz a=ajwerner


44719: sql: add telemetry for uses of alter primary key r=otan a=rohany

Fixes #44716.

This PR adds a telemetry counter for uses
of the alter primary key command.

Release note (sql change): This PR adds collected telemetry
from clusters upon using the alter primary key command.

44721: vendor: bump pebble to 89adc50375ffd11c8e62f46f1a5c320012cffafe r=petermattis a=petermattis

* db: additional tweak to the sstable boundary generation
* db: add memTable.logSeqNum
* db: force flushing of overlapping queued memtables during ingestion
* tool: lsm visualization tool
* db: consistently handle key decoding failure
* cmd/pebble: fix lint for maxOpsPerSec's type inference
* tool: add "find" command
* cmd/pebble: fix --rate flag
* internal/metamorphic: use the strict MemFS and add an operation to reset the DB
* db: make DB.Close() wait for any ongoing deletion of obsolete files
* sstable: encode varints directly into buf in blockWriter.store
* sstable: micro-optimize Writer.addPoint()
* sstable: minor cleanup of Writer/blockWriter
* sstable: micro-optimize blockWriter.store

Fixes #44631 

Release note: None

Co-authored-by: Andrew Werner <[email protected]>
Co-authored-by: Rohan Yadav <[email protected]>
Co-authored-by: Peter Mattis <[email protected]>