storage: investigate larger default range sizes #39717
#16954 was a major blocker to increasing the default range size from 64 MiB (another potential problem is #22348). Since #16954 has been closed, we can experiment with larger default range sizes.
I ran tpcc with 2500 warehouses on a 3-node cluster with 16 vCPUs, an average range size of 500 MiB, and a max range size of 1 GiB, and nothing seemed to blow up.
There were a couple of latency spikes during the ramp-up period of tpcc, but I think they were caused by the destruction of old replicas when the initial splits from setting up the workload were merged (not too sure about this).
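For context on reproducing such an experiment: in CockroachDB, range sizes are controlled through zone configurations. Below is a hedged sketch of applying larger range sizes; the connection string and exact byte values are assumptions for illustration, not what was actually run.

```go
// A hedged sketch, not the actual experiment setup: range sizes are
// controlled through zone configurations, applied here over SQL.
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // CockroachDB speaks the Postgres wire protocol.
)

func main() {
	// Assumed local, insecure single-node cluster for illustration.
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Raise the default max range size to 1 GiB (the experiment's max);
	// ranges split near range_max_bytes and merge below range_min_bytes.
	_, err = db.Exec(`ALTER RANGE default CONFIGURE ZONE USING
		range_min_bytes = 268435456,
		range_max_bytes = 1073741824`) // 256 MiB and 1 GiB
	if err != nil {
		log.Fatal(err)
	}
}
```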
Not a blocker, but I recall there is a timeout for snapshot send operations. If we increase the range size significantly, we may need to increase that timeout. For future-proofing, it may be a good idea to adjust the timeout based on the size of the snapshot. Or, since we send the snapshot in chunks, put the timeout on the sending of each chunk instead of on the operation as a whole.
This timeout isn't directly on the snapshot itself but rather comes from the timeout on queue processing in the storage package. I'm going to type up a patch to allow queues to independently control that timeout. For the raft snapshot queue, my sense is it may be best to set the queue processing timeout quite high and then do what you suggest with a timeout per chunk or per snapshot.
Err, we already have the ability to set the timeout per queue, but it was left at the default of 1m.
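As a concrete illustration of the per-chunk idea, here is a minimal sketch in Go; the sendChunk callback and the 10s budget are assumptions, not the actual storage-package code.

```go
// A sketch of the per-chunk timeout idea from the comments above; names and
// the timeout value are hypothetical.
package snapshot

import (
	"context"
	"fmt"
	"time"
)

// perChunkTimeout bounds each chunk send instead of the whole snapshot, so
// the effective budget grows with snapshot size rather than being fixed at
// the queue's 1m default.
const perChunkTimeout = 10 * time.Second

func sendSnapshot(ctx context.Context, chunks [][]byte,
	sendChunk func(context.Context, []byte) error) error {
	for i, c := range chunks {
		// Each chunk gets its own deadline; a stalled stream fails fast
		// without capping the total duration of a large snapshot.
		chunkCtx, cancel := context.WithTimeout(ctx, perChunkTimeout)
		err := sendChunk(chunkCtx, c)
		cancel() // release the timer before the next iteration
		if err != nil {
			return fmt.Errorf("sending chunk %d/%d: %w", i+1, len(chunks), err)
		}
	}
	return nil
}
```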
Concerns:
Another reason RESTORE might prefer not to spill to disk is that disk bandwidth is often a hot commodity during a RESTORE or other bulk ingestion -- using any of it to write something other than the final ingested SST data could potentially make RESTORE slower.
@dt correct me if I'm wrong, but from the discussion on Slack it seems like we can mitigate all of the above concerns if we create SSTs during backup with some target size rather than sized to the entire range. SSTs created prior to increasing the range size will target 64MB. If we continued to target 64MB exported SSTs for …
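To make that concrete, here is a minimal sketch of a paginated export loop, assuming a simplified Exporter interface rather than the engine's real signature.

```go
// A minimal sketch of target-size export pagination; the Exporter interface
// is illustrative, not the engine's actual API.
package export

// Exporter abstracts an engine that can export a span into bounded-size SSTs.
type Exporter interface {
	// ExportToSst writes keys from [start, end) into an SST, stopping once
	// adding another key would exceed targetSize. It returns the SST bytes
	// and the key to resume from; resume is nil when the span is exhausted.
	ExportToSst(start, end []byte, targetSize int64) (sst []byte, resume []byte, err error)
}

// targetSize keeps exported SSTs near the pre-existing 64 MiB size even when
// the underlying ranges are much larger.
const targetSize = 64 << 20

// exportSpan produces a sequence of ~64 MiB SSTs for one span.
func exportSpan(e Exporter, start, end []byte) ([][]byte, error) {
	var ssts [][]byte
	for start != nil {
		sst, resume, err := e.ExportToSst(start, end, targetSize)
		if err != nil {
			return nil, err
		}
		ssts = append(ssts, sst)
		start = resume
	}
	return ssts, nil
}
```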
44440: libroach,engine: support pagination of ExportToSst r=ajwerner a=ajwerner

This commit extends the engine interface to take a targetSize parameter in the ExportToSst method. Iteration stops if the first version of a key to be added to the SST would cause targetSize to be exceeded. If exportAllRevisions is false, targetSize will not be exceeded unless the first kv pair itself exceeds it.

This commit additionally fixes a bug in the rocksdb implementation of DBExportToSst whereby the first key in the export request would be skipped. This case likely never occurred in practice because the key passed to Export was rarely exactly the first key to be included (see the change related to seek_key in db.cc).

exportccl.TestRandomKeyAndTimestampExport was extended to exercise various targetSize limits. Running that test under stress with the tee engine inspires some confidence, and it did catch the above-mentioned bug. More testing would likely be good. This commit leaves the task of adopting the targetSize parameter for later.

Fixes #39717.

Release note: None

Co-authored-by: Andrew Werner <[email protected]>
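To make the pagination semantics above concrete, here is a minimal sketch of the stopping rule; the kv struct and shouldStop helper are hypothetical stand-ins, not the libroach code.

```go
// An illustrative restatement of the stopping rule described in the commit
// message; names are hypothetical.
package export

type kv struct {
	key, value []byte
	// firstVersion is true when this pair is the newest version of a key not
	// yet in the SST. When exportAllRevisions is false, every pair qualifies.
	firstVersion bool
}

// shouldStop reports whether iteration stops before adding next, given the
// bytes already written and the target size (0 disables the limit).
func shouldStop(written, targetSize int64, next kv) bool {
	if targetSize == 0 {
		return false
	}
	if !next.firstVersion {
		// Never split a key's versions across SSTs, so the target can be
		// exceeded while finishing out a key's revisions.
		return false
	}
	// Always make progress: the first pair goes in even if it alone exceeds
	// the target, which is the only way targetSize is exceeded when
	// exportAllRevisions is false.
	return written > 0 && written+int64(len(next.key)+len(next.value)) > targetSize
}
```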
44663: changefeedccl: use ScanRequest instead of ExportRequest during backfills r=danhhz a=ajwerner

This PR is motivated by the desire to get the memory usage of CDC under control in the presence of much larger ranges. Currently, when a changefeed decides it needs to do a backfill, it breaks the spans up along range boundaries and then fetches the data (with some parallelism) for the backfill. The memory overhead was somewhat bounded by the range size. If we make the range size dramatically larger, memory usage would become a function of that new, much larger range size. Fortunately, we don't have much need for these `ExportRequest`s any more. Another fascinating revelation of late is that the `ScanResponse` does indeed include MVCC timestamps (not that we necessarily needed them, but it's a good idea to keep them for compatibility). The `ScanRequest` currently permits a limit on `NumRows`, which this commit utilizes. I wanted to get this change typed in anticipation of #44341, which will provide a limit on `NumBytes`. I retained the existing parallelism, as ScanRequests with limits are not parallel. I would like to do some benchmarking, but I feel pretty okay about the testing we have in place already. @danhhz what do you want to see here? Relates to #39717. Release note: None.

44719: sql: add telemetry for uses of alter primary key r=otan a=rohany Fixes #44716. This PR adds a telemetry counter for uses of the alter primary key command. Release note (sql change): This PR adds collected telemetry from clusters upon using the alter primary key command.

44721: vendor: bump pebble to 89adc50375ffd11c8e62f46f1a5c320012cffafe r=petermattis a=petermattis
* db: additional tweak to the sstable boundary generation
* db: add memTable.logSeqNum
* db: force flushing of overlapping queued memtables during ingestion
* tool: lsm visualization tool
* db: consistently handle key decoding failure
* cmd/pebble: fix lint for maxOpsPerSec's type inference
* tool: add "find" command
* cmd/pebble: fix --rate flag
* internal/metamorphic: use the strict MemFS and add an operation to reset the DB
* db: make DB.Close() wait for any ongoing deletion of obsolete files
* sstable: encode varints directly into buf in blockWriter.store
* sstable: micro-optimize Writer.addPoint()
* sstable: minor cleanup of Writer/blockWriter
* sstable: micro-optimize blockWriter.store
Fixes #44631 Release note: None

Co-authored-by: Andrew Werner <[email protected]> Co-authored-by: Rohan Yadav <[email protected]> Co-authored-by: Peter Mattis <[email protected]>
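Returning to the ScanRequest-based backfill in 44663: here is a minimal sketch of the paging loop, assuming a hypothetical scanner interface in place of the actual KV client API.

```go
// A hedged sketch of the ScanRequest-based backfill; the scanner interface
// is an assumption standing in for the real KV client API.
package backfill

type row struct {
	key, value []byte
	// ScanResponses carry MVCC timestamps, so each row can be emitted at the
	// time it was written.
	wallNanos int64
}

type scanner interface {
	// scan returns up to limit rows from [start, end) and a resume key;
	// resume is nil once the span is exhausted.
	scan(start, end []byte, limit int64) (rows []row, resume []byte, err error)
}

// backfillSpan pages through a span with a fixed NumRows-style limit, so
// memory use is bounded by the page size rather than by the (possibly much
// larger) range size.
func backfillSpan(s scanner, start, end []byte, limit int64, emit func(row)) error {
	for start != nil {
		rows, resume, err := s.scan(start, end, limit)
		if err != nil {
			return err
		}
		for _, r := range rows {
			emit(r)
		}
		start = resume
	}
	return nil
}
```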