kvflowcontrol,admission: productionize replication admission control #98703

irfansharif · 2023-03-15T19:01:57Z

Is your feature request related to a problem? Please describe.

Tracking issue to productionize #95563 and rolling it out into the wild (enabled by default, made safe-to-opt-into for production clusters):

Jira issue: CRDB-25455

Epic CRDB-25348

The text was updated successfully, but these errors were encountered:

103757: roach{prod,test}: add first-class support for disk snapshots r=irfansharif a=irfansharif Part of #89978. Pre-cursor to #83826. Part of #98703. Long-lived disk snapshots can drastically reduce testing time for scale tests. Tests, whether run by hand or through CI, need only run the long running fixture generating code (importing some dataset, generating it organically through workload, etc.) once snapshot fingerprints are changed, fingerprints that incorporate the major crdb version that generated them. Here's an example run that freshly generates disk snapshots: === RUN admission-control/index-backfill no existing snapshots found for admission-control/index-backfill (ac-index-backfill), doing pre-work created volume snapshot ac-index-backfill-0001-vunknown-1-n2-standard-8 for volume irfansharif-snapshot-0001-1 on irfansharif-snapshot-0001-1/n1 using 1 newly created snapshot(s) with prefix "ac-index-backfill" detached and deleted volume irfansharif-snapshot-0001-1 from irfansharif-snapshot-0001 created volume irfansharif-snapshot-0001-1 attached volume irfansharif-snapshot-0001-1 to irfansharif-snapshot-0001 mounted irfansharif-snapshot-0001-1 to irfansharif-snapshot-0001 --- PASS: admission-control/index-backfill (79.14s) Here's a subsequent run that makes use of the aforementioned disk snapshots: === RUN admission-control/index-backfill using 1 pre-existing snapshot(s) with prefix "ac-index-backfill" detached and deleted volume irfansharif-snapshot-0001-1 from irfansharif-snapshot-0001 created volume irfansharif-snapshot-0001-1 attached volume irfansharif-snapshot-0001-1 to irfansharif-snapshot-0001 mounted irfansharif-snapshot-0001-1 to irfansharif-snapshot-0001 --- PASS: admission-control/index-backfill (43.47s) We add the following APIs to the roachtest.Cluster interface, for tests to interact with disk snapshots. admission-control/index-backfill is an example test making use of these APIs. ```go type Cluster interface { // ... // CreateSnapshot creates volume snapshots of the cluster using // the given prefix. These snapshots can later be retrieved, // deleted or applied to already instantiated clusters. CreateSnapshot(ctx context.Context, snapshotPrefix string) error // ListSnapshots lists the individual volume snapshots that // satisfy the search criteria. ListSnapshots( ctx context.Context, vslo vm.VolumeSnapshotListOpts, ) ([]vm.VolumeSnapshot, error) // DeleteSnapshots permanently deletes the given snapshots. DeleteSnapshots( ctx context.Context, snapshots ...vm.VolumeSnapshot, ) error // ApplySnapshots applies the given volume snapshots to the // underlying cluster. This is a destructive operation as far as // existing state is concerned - all already-attached volumes are // detached and deleted to make room for new snapshot-derived // volumes. The new volumes are created using the same specs // (size, disk type, etc.) as the original cluster. ApplySnapshots( ctx context.Context, snapshots []vm.VolumeSnapshot, ) error } ``` These Cluster APIs are in turn is powered by the following additions to the vm.Provider interface, implemented by each cloud provider. GCE is the fully spec-ed out one for now. ```go type Provider interface { // ... // CreateVolume creates a new volume using the given options. CreateVolume(l *logger.Logger, vco VolumeCreateOpts) (Volume, error) // ListVolumes lists all volumes already attached to the given VM. ListVolumes(l *logger.Logger, vm *VM) ([]Volume, error) // DeleteVolume detaches and deletes the given volume from the // given VM. DeleteVolume(l *logger.Logger, volume Volume, vm *VM) error // AttachVolume attaches the given volume to the given VM. AttachVolume(l *logger.Logger, volume Volume, vm *VM) (string, error) // CreateVolumeSnapshot creates a snapshot of the given volume, // using the given options. CreateVolumeSnapshot( l *logger.Logger, volume Volume, vsco VolumeSnapshotCreateOpts, ) (VolumeSnapshot, error) // ListVolumeSnapshots lists the individual volume snapshots that // satisfy the search criteria. ListVolumeSnapshots( l *logger.Logger, vslo VolumeSnapshotListOpts, ) ([]VolumeSnapshot, error) // DeleteVolumeSnapshot permanently deletes the given snapshot. DeleteVolumeSnapshot(l *logger.Logger, snapshot VolumeSnapshot) error } ``` Since these snapshots necessarily outlive the tests, and we don't want them dangling perpetually, we introduce a prune-dangling roachtest that acts as a poor man's cron job, sifting through expired snapshots (>30days) and deleting them. For GCE at least it's not obvious to me how to create these snapshots in cloud buckets with a TTL built in, hence this hack. It looks like this (with change to the TTL): === RUN prune-dangling pruned old snapshot ac-index-backfill-0001-vunknown-1-n2-standard-8 --- PASS: prune-dangling (8.59s) --- We add expose some of these APIs through the roachprod binary directly. ``` $ roachprod snapshot --help snapshot enables creating/listing/deleting/applying cluster snapshots Usage: roachprod snapshot [command] Available Commands: create snapshot a named cluster, using the given snapshot name and description list list all snapshots for the given cloud provider, optionally filtering by the given name delete delete all snapshots for the given cloud provider optionally filtering by the given name apply apply the named snapshots from the given cloud provider to the named cluster ``` --- About admission-control/index-backfill. It's a fully featured test that uses the TPC-C 100k dataset and runs a foreground load for 20k customers. It takes >4hrs to import this data set; with disk snapshots this step is skipped entirely and takes a few minutes. The actual test is trivial, we run the foreground load for 1hr and run a large index backfill concurrently. Before #98308, this results in wild performance oscillations. It's still a bit wild after flow control, but less so. We slightly extend the tpc-e harness to make this happen, adding a few smarts: exposing a 'during' helper to run backfills concurrently with foreground load, integrate with --skip-init, estimated setup times, prometheus, and disk snapshots of course. Release note: None Co-authored-by: irfan sharif <[email protected]>

Fixes cockroachdb#104696. Fixes cockroachdb#104697. Fixes cockroachdb#104698. Part of cockroachdb#98703. In 072c16d (added as part of cockroachdb#95637) we re-worked the locking structure around the RaftTransport's per-RPC class level send queues. When new send queues are instantiated or old ones deleted, we now also maintain the kvflowcontrol connection tracker, so such maintenance now needs to happen while holding a kvflowcontrol mutex. When rebasing \cockroachdb#95637 onto master, we accidentally included earlier queue deletion code without holding the appropriate mutex. Queue deletions now happened twice which made it possible to hit a RaftTransport assertion about expecting the right send queue to already exist. Specifically, the following sequence was possible: - (*RaftTransport).SendAsync is invoked, observes no queue for <nodeid,class>, creates it, and tracks it in the queues map. - It invokes an async worker W1 to process that send queue through (*RaftTransport).startProcessNewQueue. The async worker is responsible for clearing the tracked queue in the queues map once done. - W1 expects to find the tracked queue in the queues map, finds it, proceeds. - W1 is done processing. On its way out, W1 clears <nodeid,class> from the queues map the first time. - (*RaftTransport).SendAsync is invoked by another goroutine, observes no queue for <nodeid,class>, creates it, and tracks it in the queues map. - It invokes an async worker W2 to process that send queue through (*RaftTransport).startProcessNewQueue. The async worker is responsible for clearing the tracked queue in the queues map once done. - W1 blindly clears the <nodeid,class> raft send queue the second time. - W2 expects to find the queue in the queues map, but doesn't, and fatals. Release note: None

104699: kvserver: fix clearrange/* tests r=irfansharif a=irfansharif Fixes #104696. Fixes #104697. Fixes #104698. Part of #98703. In 072c16d (added as part of #95637) we re-worked the locking structure around the RaftTransport's per-RPC class level send queues. When new send queues are instantiated or old ones deleted, we now also maintain the kvflowcontrol connection tracker, so such maintenance now needs to happen while holding a kvflowcontrol mutex. When rebasing \#95637 onto master, we accidentally included earlier queue deletion code without holding the appropriate mutex. Queue deletions now happened twice which made it possible to hit a RaftTransport assertion about expecting the right send queue to already exist. Specifically, the following sequence was possible: - `(*RaftTransport).SendAsync` is invoked, observes no queue for `<nodeid,class>`, creates it, and tracks it in the queues map. - It invokes an async worker W1 to process that send queue through `(*RaftTransport).startProcessNewQueue`. The async worker is responsible for clearing the tracked queue in the queues map once done. - W1 expects to find the tracked queue in the queues map, finds it, proceeds. - W1 is done processing. On its way out, W1 clears `<nodeid,class>` from the queues map the first time. - `(*RaftTransport).SendAsync` is invoked by another goroutine, observes no queue for <nodeid,class>, creates it, and tracks it in the queues map. - It invokes an async worker W2 to process that send queue through `(*RaftTransport).startProcessNewQueue`. The async worker is responsible for clearing the tracked queue in the queues map once done. - W1 blindly clears the `<nodeid,class>` raft send queue the second time. - W2 expects to find the queue in the queues map, but doesn't, and fatals. Release note: None Co-authored-by: irfan sharif <[email protected]>

irfansharif · 2023-09-05T17:09:08Z

#110036 is the last of it.

irfansharif added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-admission-control labels Mar 15, 2023

irfansharif mentioned this issue Mar 15, 2023

*: implement replication admission control #95563

Closed

10 tasks

exalate-issue-sync bot added the T-kv KV Team label Mar 15, 2023

irfansharif mentioned this issue May 23, 2023

roach{prod,test}: add first-class support for disk snapshots #103757

Merged

irfansharif mentioned this issue Jun 10, 2023

kvserver: fix clearrange/* tests #104699

Merged

irfansharif mentioned this issue Jun 11, 2023

roachtest: tpccbench/nodes=12/cpu=16/lease=expiration failed #104704

Closed

irfansharif closed this as completed Sep 5, 2023

github-project-automation bot added this to KV Aug 28, 2024

github-project-automation bot moved this to Closed in KV Aug 28, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

kvflowcontrol,admission: productionize replication admission control #98703

kvflowcontrol,admission: productionize replication admission control #98703

irfansharif commented Mar 15, 2023 •

edited

Loading

irfansharif commented Sep 5, 2023

kvflowcontrol,admission: productionize replication admission control #98703

kvflowcontrol,admission: productionize replication admission control #98703

Comments

irfansharif commented Mar 15, 2023 • edited Loading

irfansharif commented Sep 5, 2023

irfansharif commented Mar 15, 2023 •

edited

Loading