roachtest,admission: add libraries for better tests/experiments #89978

Open
2 of 6 tasks
irfansharif opened this issue Oct 14, 2022 · 0 comments
Labels: C-enhancement (Solution expected to add code/behavior + preserve backward-compat; pg compat issues are exception), T-admission-control (Admission Control)

irfansharif commented Oct 14, 2022

Is your feature request related to a problem? Please describe.

We need a set of library functions, either in the roachtest package or usable when writing roachtests, that make it easier to author admission control integration tests and experiments (#89208). For example:

  • summary metrics for the test run so we don't have to examine graphs: e.g. throughput, throughput variance, latency percentiles, etc. for each workload. Some existing or new library components in the roachtest package (clusterstats, for example) could be developed for this, maybe emitting to roachperf or its replacement. We want to measure things like '% of time latency was over Xms' for a few values of X, or '% of time between T and T+t where throughput was below Y'. Both are ways to evaluate the latency and performance isolation provided by admission control.
    • bonus points if this can be segmented by tenant in multi-tenant roachtests, or by workload in multi-workload experiments
  • a helper struct in the roachtest package that wraps manual cgroup manipulation for different resources like CPU, disk bandwidth, and IOPS
  • a helper struct in the roachtest package to automatically collect outlier traces, which is now possible with #83020 (stmtdiagnostics: support continuous bundle collection) and #82750 (stmtdiagnostics: support probabilistic bundle collection). It would be good to end a test with a set of outlier traces from that test run
  • library functions in roachtest to attach VMs to disk images pre-loaded with previous CRDB cluster state. This is a better way to run experiments with full-ish LSMs that take time to fill up (possibly useful in pebble#1865, db: benchmark framework for compaction heuristic comparisons). There should be a parallel where roachtests actually generate disk images to use for future experiments. These have to be captured from a CRDB cluster that's wound down entirely, to get a consistent disk snapshot
  • roachprod,roachtest: support tenants #78484

Jira issue: CRDB-20528

irfansharif added the C-enhancement label on Oct 14, 2022
irfansharif changed the title from "roachtest,admission: improve library functions for better tests/experiments" to "roachtest,admission: add libraries for better tests/experiments" on Oct 14, 2022
exalate-issue-sync bot added the T-kv (KV Team) label on Mar 22, 2023
craig bot pushed a commit that referenced this issue May 26, 2023
103757: roach{prod,test}: add first-class support for disk snapshots r=irfansharif a=irfansharif

Part of #89978. Precursor to #83826. Part of #98703.

Long-lived disk snapshots can drastically reduce testing time for scale tests. Tests, whether run by hand or through CI, need to run the long-running fixture-generating code (importing some dataset, generating it organically through workload, etc.) only when snapshot fingerprints change. These fingerprints incorporate the major crdb version that generated them.

Here's an example run that freshly generates disk snapshots:

    === RUN   admission-control/index-backfill
    no existing snapshots found for admission-control/index-backfill (ac-index-backfill), doing pre-work
    created volume snapshot ac-index-backfill-0001-vunknown-1-n2-standard-8 for volume irfansharif-snapshot-0001-1 on irfansharif-snapshot-0001-1/n1
    using 1 newly created snapshot(s) with prefix "ac-index-backfill"
    detached and deleted volume irfansharif-snapshot-0001-1 from irfansharif-snapshot-0001
    created volume irfansharif-snapshot-0001-1
    attached volume irfansharif-snapshot-0001-1 to irfansharif-snapshot-0001
    mounted irfansharif-snapshot-0001-1 to irfansharif-snapshot-0001
    --- PASS: admission-control/index-backfill (79.14s)

Here's a subsequent run that makes use of the aforementioned disk snapshots:

    === RUN   admission-control/index-backfill
    using 1 pre-existing snapshot(s) with prefix "ac-index-backfill"
    detached and deleted volume irfansharif-snapshot-0001-1 from irfansharif-snapshot-0001
    created volume irfansharif-snapshot-0001-1
    attached volume irfansharif-snapshot-0001-1 to irfansharif-snapshot-0001
    mounted irfansharif-snapshot-0001-1 to irfansharif-snapshot-0001
    --- PASS: admission-control/index-backfill (43.47s)

We add the following APIs to the roachtest.Cluster interface, for tests to interact with disk snapshots. admission-control/index-backfill is an example test making use of these APIs.
```go
  type Cluster interface {
      // ...

      // CreateSnapshot creates volume snapshots of the cluster using
      // the given prefix. These snapshots can later be retrieved,
      // deleted or applied to already instantiated clusters.
      CreateSnapshot(ctx context.Context, snapshotPrefix string) error

      // ListSnapshots lists the individual volume snapshots that
      // satisfy the search criteria.
      ListSnapshots(
        ctx context.Context, vslo vm.VolumeSnapshotListOpts,
      ) ([]vm.VolumeSnapshot, error)

      // DeleteSnapshots permanently deletes the given snapshots.
      DeleteSnapshots(
        ctx context.Context, snapshots ...vm.VolumeSnapshot,
      ) error

      // ApplySnapshots applies the given volume snapshots to the
      // underlying cluster. This is a destructive operation as far as
      // existing state is concerned - all already-attached volumes are
      // detached and deleted to make room for new snapshot-derived
      // volumes. The new volumes are created using the same specs
      // (size, disk type, etc.) as the original cluster.
      ApplySnapshots(
        ctx context.Context, snapshots []vm.VolumeSnapshot,
      ) error
  }
```
These Cluster APIs are in turn powered by the following additions to the vm.Provider interface, implemented by each cloud provider. GCE is the only fully spec-ed out one for now.
```go
  type Provider interface {
      // ...

      // CreateVolume creates a new volume using the given options.
      CreateVolume(l *logger.Logger, vco VolumeCreateOpts) (Volume, error)

      // ListVolumes lists all volumes already attached to the given VM.
      ListVolumes(l *logger.Logger, vm *VM) ([]Volume, error)

      // DeleteVolume detaches and deletes the given volume from the
      // given VM.
      DeleteVolume(l *logger.Logger, volume Volume, vm *VM) error

      // AttachVolume attaches the given volume to the given VM.
      AttachVolume(l *logger.Logger, volume Volume, vm *VM) (string, error)

      // CreateVolumeSnapshot creates a snapshot of the given volume,
      // using the given options.
      CreateVolumeSnapshot(
        l *logger.Logger, volume Volume, vsco VolumeSnapshotCreateOpts,
      ) (VolumeSnapshot, error)

      // ListVolumeSnapshots lists the individual volume snapshots that
      // satisfy the search criteria.
      ListVolumeSnapshots(
        l *logger.Logger, vslo VolumeSnapshotListOpts,
      ) ([]VolumeSnapshot, error)

      // DeleteVolumeSnapshot permanently deletes the given snapshot.
      DeleteVolumeSnapshot(l *logger.Logger, snapshot VolumeSnapshot) error
  }
```
Since these snapshots necessarily outlive the tests, and we don't want them dangling perpetually, we introduce a prune-dangling roachtest that acts as a poor man's cron job, sifting through snapshots and deleting the expired ones (>30 days old). For GCE at least, it's not obvious to me how to create these snapshots in cloud buckets with a TTL built in, hence this hack. It looks like this (with the TTL changed):

    === RUN   prune-dangling
    pruned old snapshot ac-index-backfill-0001-vunknown-1-n2-standard-8
    --- PASS: prune-dangling (8.59s)

---

We also expose some of these APIs through the roachprod binary directly.
```
$ roachprod snapshot --help
  snapshot enables creating/listing/deleting/applying cluster snapshots

  Usage:
    roachprod snapshot [command]

  Available Commands:
    create      snapshot a named cluster, using the given snapshot name and description
    list        list all snapshots for the given cloud provider, optionally filtering by the given name
    delete      delete all snapshots for the given cloud provider optionally filtering by the given name
    apply       apply the named snapshots from the given cloud provider to the named cluster
```
---

About admission-control/index-backfill: it's a fully featured test that uses the TPC-E 100k dataset and runs a foreground load for 20k customers. It takes >4hrs to import this data set; with disk snapshots the import is skipped entirely and setup takes a few minutes. The actual test is trivial: we run the foreground load for 1hr and run a large index backfill concurrently. Before #98308, this resulted in wild performance oscillations. It's still a bit wild after flow control, but less so.

We slightly extend the tpc-e harness to make this happen, adding a few smarts: exposing a 'during' helper to run backfills concurrently with foreground load, and integrating with --skip-init, estimated setup times, prometheus, and of course disk snapshots.

Release note: None

Co-authored-by: irfan sharif <[email protected]>
exalate-issue-sync bot added the T-admission-control (Admission Control) label and removed the T-kv (KV Team) label on Nov 14, 2023