
storage: implement SST snapshot strategy #38873

Closed

Conversation

@jeffrey-xiao jeffrey-xiao commented Jul 15, 2019

Heavily based on #25134 with some minor refactoring and changes to ensure that tests pass.

Fixes #16954.

This PR implements the SST snapshot strategy discussed in #16954. The sender creates SSTs and streams them in chunks to the receiver, which ingests them directly into RocksDB. This strategy uses a constant amount of memory with respect to the size of the range, whereas KV_BATCH uses memory linear in the range size. (A rough sketch of the receive-and-ingest flow follows the SST breakdown below.)

The maximum number of SSTs created using this strategy is 4 + SR + 2, where SR is the number of subsumed replicas.

  • Three SSTs get streamed from the sender (range local keys, replicated range-id local keys, and data keys)
  • One SST is constructed for the unreplicated state.
  • One SST is constructed for every subsumed replica to clear the range-id local keys. This SST consists of one range deletion tombstone and one RaftTombstone key.
  • A maximum of two SSTs for all subsumed replicas to account for the case of not fully contained subsumed replicas. Note that currently, subsumed replicas can have keys right of the current replica, but not left of, so there will be a maximum of one SST created for the range-local keys and one for the data keys. These SSTs consist of one range deletion tombstone.
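
As a rough illustration of the receive-and-ingest flow described above, here is a minimal Go sketch. The `chunkStream`, `fileSink`, and `ingester` interfaces and the `sstChunk` message are stand-ins I made up for illustration, not the PR's actual types:

```go
package sketch

// Minimal stand-in types; the real PR uses gRPC snapshot streams and the
// storage engine's SST ingestion API.
type sstChunk struct {
	File     string // which SST file this chunk belongs to
	Data     []byte
	EndOfSST bool // true when this chunk completes an SST
	Final    bool // true when the whole snapshot has been sent
}

type chunkStream interface {
	Recv() (sstChunk, error)
}

type fileSink interface {
	Append(path string, data []byte) error
}

type ingester interface {
	// Hypothetical stand-in for RocksDB's IngestExternalFile.
	IngestExternalFiles(paths []string) error
}

// receiveSSTSnapshot appends streamed chunks to SST files on disk and then
// ingests the completed files, so memory use stays constant in the range size.
func receiveSSTSnapshot(stream chunkStream, files fileSink, eng ingester) error {
	var paths []string
	for {
		c, err := stream.Recv()
		if err != nil {
			return err
		}
		if c.Final {
			break
		}
		if err := files.Append(c.File, c.Data); err != nil {
			return err
		}
		if c.EndOfSST {
			paths = append(paths, c.File)
		}
	}
	// Ingest the completed SSTs directly instead of replaying a write batch.
	return eng.IngestExternalFiles(paths)
}
```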

TODO and Open Questions

  1. Should there be a compaction suggestion for the data range? Note that this compaction suggestion would overlap with the data range of subsumed replicas, but not necessarily contain it.

Metrics

One way to evaluate this change is with the following steps:

  1. Set up a 3-node cluster.
  2. Set range_min_bytes to 0 and range_max_bytes to some large number.
  3. Increase kv.snapshot_recovery.max_rate and kv.snapshot_rebalance.max_rate to some large number.
  4. Disable load-based splitting (see the settings sketch after this list).
  5. Stop node 2.
  6. Run an insert heavy workload (kv0) on the cluster.
  7. Start node 2.
  8. Time how long it takes for node 2 to have all the ranges.
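
A hedged sketch of applying the settings in steps 2–4 over a SQL connection from Go. The connection string, the chosen values, and the `kv.range_split.by_load_enabled` setting name for disabling load-based splitting are assumptions, not taken from the PR:

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // CockroachDB speaks the Postgres wire protocol.
)

func main() {
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	stmts := []string{
		// Step 2: make ranges effectively unbounded so they only split manually.
		"ALTER RANGE default CONFIGURE ZONE USING range_min_bytes = 0, range_max_bytes = 68719476736",
		// Step 3: take the snapshot rate limits out of the picture.
		"SET CLUSTER SETTING kv.snapshot_recovery.max_rate = '1 GiB'",
		"SET CLUSTER SETTING kv.snapshot_rebalance.max_rate = '1 GiB'",
		// Step 4: disable load-based splitting (setting name is an assumption).
		"SET CLUSTER SETTING kv.range_split.by_load_enabled = false",
	}
	for _, s := range stmts {
		if _, err := db.Exec(s); err != nil {
			log.Fatalf("%s: %v", s, err)
		}
	}
}
```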

We can vary two independent variables:

  1. Fixed total data size, variable number of splits
  2. Fixed number of splits, variable total data size

We also probably want to evaluate the performance of the cluster after the recovery is complete to see the impact of ingesting small SSTs.

@cockroach-teamcity

This change is Reviewable

@jeffrey-xiao jeffrey-xiao force-pushed the streaming-snapshots branch 4 times, most recently from 52d2bdf to d6909ff Compare July 15, 2019 21:33
@tbg tbg requested review from tbg and nvanbenschoten July 16, 2019 15:24
@jeffrey-xiao jeffrey-xiao reopened this Jul 25, 2019
This method truncates the SST file being written and returns the data
that was truncated. It can be called multiple times while writing an
SST file and can be used to chunk an SST file into pieces. Since SSTs
are built in an append-only manner, the concatenation of the chunks is
equivalent to an SST built without using Truncate, using only Finish.

Release note: None
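A minimal sketch of how a sender might use such a Truncate method to chunk an SST as it is built; the `sstWriter` and `chunkSender` interfaces below are hypothetical simplifications of the real writer and stream:

```go
package sketch

// Hypothetical SST writer surface; the real writer in the PR has
// engine-specific signatures.
type sstWriter interface {
	Put(key, value []byte) error
	Truncate() ([]byte, error) // returns and drops the bytes buffered so far
	Finish() ([]byte, error)   // returns the final bytes and closes the SST
}

type chunkSender interface {
	Send(chunk []byte) error
}

// streamSST writes key/value pairs into an SST and flushes a chunk to the
// stream whenever roughly chunkSize bytes have been buffered. Concatenating
// all sent chunks yields the same bytes as building the SST without Truncate.
func streamSST(w sstWriter, out chunkSender, kvs [][2][]byte, chunkSize int) error {
	var buffered int
	for _, kv := range kvs {
		if err := w.Put(kv[0], kv[1]); err != nil {
			return err
		}
		buffered += len(kv[0]) + len(kv[1])
		if buffered >= chunkSize {
			chunk, err := w.Truncate()
			if err != nil {
				return err
			}
			if err := out.Send(chunk); err != nil {
				return err
			}
			buffered = 0
		}
	}
	last, err := w.Finish()
	if err != nil {
		return err
	}
	return out.Send(last)
}
```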
Instead of storing the state for applying a snapshot directly in
IncomingSnapshot, this change stores it in snapshotStrategy. This
enables different strategies to be used when applying snapshots.

Release note: None
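Roughly, the abstraction looks like the following sketch; the method names and the `incomingStream` type are my guesses at the shape, not the PR's exact interface:

```go
package sketch

import "context"

// Each strategy (KV_BATCH, SST) keeps its own receive-side state instead of
// storing it in IncomingSnapshot.
type snapshotStrategy interface {
	// Receive consumes the inbound stream and accumulates the data in a
	// strategy-specific way (in-memory write batches vs. SST files on disk).
	Receive(ctx context.Context, stream incomingStream) error
	// Apply writes the received data into the local engine, e.g. by
	// applying batches or ingesting SSTs.
	Apply(ctx context.Context) error
}

type incomingStream interface {
	Recv() ([]byte, error)
}
```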
@jeffrey-xiao jeffrey-xiao changed the title [WIP] storage: implement SST snapshot strategy storage: implement SST snapshot strategy Jul 25, 2019
@jeffrey-xiao jeffrey-xiao marked this pull request as ready for review July 25, 2019 18:51
@jeffrey-xiao jeffrey-xiao requested review from a team July 25, 2019 18:51
craig bot pushed a commit that referenced this pull request Aug 9, 2019
38932: storage: build SSTs from KV_BATCH snapshot r=jeffrey-xiao a=jeffrey-xiao

Implements the SST snapshot strategy discussed in #16954 and partially implemented in #25134 and #38873, but only with the receiver-side logic, for ease of testing and compatibility. This PR also handles the complications of subsumed replicas that are not fully contained by the current replica.

The maximum number of SSTs created using this strategy is 4 + SR + 2, where SR is the number of subsumed replicas.

- Three SSTs get streamed from the sender (range local keys, replicated range-id local keys, and data keys)
- One SST is constructed for the unreplicated range-id local keys.
- One SST is constructed for every subsumed replica to clear the range-id local keys. These SSTs consist of one range deletion tombstone and one `RaftTombstone` key.
- A maximum of two SSTs for all subsumed replicas to account for the case of not fully contained subsumed replicas. Note that currently, subsumed replicas can have keys right of the current replica, but not left of, so there will be a maximum of one SST created for the range-local keys and one for the data keys. These SSTs consist of one range deletion tombstone.

This number can be further reduced to 3 + SR if we pass the file handles and SST writers from the receiving step to the application step. We could combine the unreplicated range-id SST with the replicated range-id SST, and the range-local SST of the subsumed replicas with their data SSTs. We probably don't want to do this optimization, since we'd have to undo it if we start constructing the SSTs on the sender or start chunking large SSTs into smaller SSTs.

Blocked by facebook/rocksdb#5649.

# Test Plan

- [x] Testing knob to inspect SSTs before ingestion. Ensure that expected SSTs for subsumed replicas are ingested.
- [x] Unit tests for `SSTSnapshotStorage`.
 
# Metrics and Evaluation

One way to evaluate this change is with the following steps:

1. Set up a 3-node cluster.
2. Set default Raft log truncation threshold to some low constant:
```go
defaultRaftLogTruncationThreshold = envutil.EnvOrDefaultInt64(
    "COCKROACH_RAFT_LOG_TRUNCATION_THRESHOLD", 128<<10 /* 128 KB */)
```
3. Set `range_min_bytes` to 0 and `range_max_bytes` to some large number.
4. Increase `kv.snapshot_recovery.max_rate` and `kv.snapshot_rebalance.max_rate` to some large number.
5. Disable load-based splitting.
6. Stop node 2.
7. Run an insert heavy workload (kv0) on the cluster.
8. Start node 2.
9. Time how long it takes for node 2 to have all the ranges.
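
Step 9 can be automated by polling until every range has a replica on node 2 and reporting the elapsed time; a hedged sketch, assuming a `crdb_internal.ranges` table with a `replicas` array of store IDs (and one store per node, so store IDs match node IDs):

```go
package main

import (
	"database/sql"
	"fmt"
	"log"
	"time"

	_ "github.com/lib/pq"
)

func main() {
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	start := time.Now()
	for {
		// Count ranges that do not yet have a replica on node 2.
		var missing int
		err := db.QueryRow(
			`SELECT count(*) FROM crdb_internal.ranges WHERE array_position(replicas, 2) IS NULL`,
		).Scan(&missing)
		if err != nil {
			log.Fatal(err)
		}
		if missing == 0 {
			fmt.Printf("node 2 caught up in %s\n", time.Since(start))
			return
		}
		time.Sleep(time.Second)
	}
}
```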

Roachtest: https://gist.github.com/jeffrey-xiao/e69fcad04968822d603f6807ca77ef3b

We can vary two independent variables:

1. Fixed total data size (4000000 ops; ~3.81 GiB), variable number of splits
- 32 splits (~121 MiB ranges)
- 64 splits (~61.0 MiB ranges)
- 128 splits (~31.2 MiB ranges)
- 256 splits (~15.7 MiB ranges)
- 512 splits (~7.9 MiB ranges)
- 1024 splits (~3.9 MiB ranges)
2. Fixed number of splits (32), variable total data size
- 125000 ops (~3.7 MiB ranges)
- 250000 ops (~7.5 MiB ranges)
- 500000 ops (~15 MiB ranges)
- 1000000 ops (~30 MiB ranges)
- 2000000 ops (~60 MiB ranges)
- 4000000 ops (~121 MiB ranges)

# Fsync Chunk Size

The size of the SST chunk that we write before fsync-ing impacts how quickly node 2 regains all the ranges. I experimented with 32 splits and a median range size of 121 MB: no fsync-ing (~27s recovery), fsync-ing in 8 MB chunks (~30s), fsync-ing in 2 MB chunks (~40s), and fsync-ing in 256 KB chunks (~42s). The default bulk SST sync rate is 2 MB and #20352 sets `bytes_per_sync` to 512 KB, so something between those options is probably good. The reason to fsync is to prevent the OS from accumulating such a large buffer that it blocks unrelated small/fast writes for a long time when it flushes.
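
A minimal sketch of the chunked-fsync pattern, assuming a plain `os.File`; the real code writes through the snapshot's SST storage rather than directly to a file, and the default chunk size is a tuning choice, not taken from the PR:

```go
package sketch

import "os"

// writeWithChunkedSync writes data to a file and fsyncs after every syncSize
// bytes, so the OS never accumulates a large dirty buffer that could stall
// unrelated small writes when it finally flushes.
func writeWithChunkedSync(f *os.File, data []byte, syncSize int) error {
	for len(data) > 0 {
		n := syncSize
		if n > len(data) {
			n = len(data)
		}
		if _, err := f.Write(data[:n]); err != nil {
			return err
		}
		if err := f.Sync(); err != nil {
			return err
		}
		data = data[n:]
	}
	return nil
}
```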

# Impact on Foreground Traffic

For testing the impact on foreground traffic, I ran kv0 on a four-node cluster with the merge queue and split queue disabled, starting with a constant number of splits. After 5 minutes, I decommissioned node 1 so its replicas would drain to other nodes using snapshots.

Roachtest: https://gist.github.com/jeffrey-xiao/5d9443a37b0929884aca927f9c320b6c

**Average Range Size of 3 MiB**
- [Before](https://user-images.githubusercontent.com/8853434/62398633-41a2bb00-b547-11e9-9e3d-747ee724943b.png)
- [After](https://user-images.githubusercontent.com/8853434/62398634-41a2bb00-b547-11e9-85e7-445b7989d173.png)

**Average Range Size of 32 MiB**
- [Before](https://user-images.githubusercontent.com/8853434/62398631-410a2480-b547-11e9-9019-86d3bd2e6f73.png)
- [After](https://user-images.githubusercontent.com/8853434/62398632-410a2480-b547-11e9-9513-8763e132e76b.png)

**Average Range Size 128 MiB**
- [Before](https://user-images.githubusercontent.com/8853434/62398558-15873a00-b547-11e9-8ab6-2e8e9bae658c.png)
- [After](https://user-images.githubusercontent.com/8853434/62398559-15873a00-b547-11e9-9c72-b3e90fce1acc.png)

We see p99 latency wins for larger range sizes and comparable performance for smaller range sizes.

Release note (performance improvement): Snapshots sent between replicas are now applied more performantly and use less memory.

Co-authored-by: Jeffrey Xiao <[email protected]>