roachtest: add admission-control/index-backfill #103816

irfansharif
Contributor

And make it use disk snapshots. Add a few smarts to the TPC-E harness
(exposing a 'during' helper, integrating with --skip-init, --local,
estimated setup times, prometheus, and disk snapshots) while here.

Release note: None

Pure code movement. We'll make use of it outside this file in subsequent
commits.

Release note: None

These tests have been stable for a few months now. Reduce to a weekly
cadence.

Release note: None

Long-lived disk snapshots can drastically reduce testing time for scale
tests. Tests, whether run by hand or through CI, need only run the
long-running fixture-generating code (importing some dataset, generating
it organically through workload, etc.) when the snapshot fingerprints
change. These fingerprints incorporate the major crdb version that
generated the snapshots.

Here's an example run that freshly generates disk snapshots:

    === RUN   admission-control/index-backfill
    03:57:19 admission_control_index_backfill.go:53: no existing snapshots found for admission-control/index-backfill (ac-index-backfill), doing pre-work
    03:57:54 roachprod.go:1626: created volume snapshot ac-index-backfill-0001-vunknown-1-n2-standard-8 (id=6426236595187320652) for volume irfansharif-snapshot-0001-1 on irfansharif-snapshot-0001-1/n1
    03:57:55 admission_control_index_backfill.go:61: using 1 newly created snapshot(s) with prefix "ac-index-backfill"
    03:58:02 roachprod.go:1716: detached and deleted volume irfansharif-snapshot-0001-1 from irfansharif-snapshot-0001
    03:58:28 roachprod.go:1764: created volume irfansharif-snapshot-0001-1
    03:58:33 roachprod.go:1770: attached volume irfansharif-snapshot-0001-1 to irfansharif-snapshot-0001
    03:58:36 roachprod.go:1783: mounted irfansharif-snapshot-0001-1 to irfansharif-snapshot-0001
    --- PASS: admission-control/index-backfill (79.14s)

Here's a subsequent run that makes use of the aforementioned disk
snapshot:

    === RUN   admission-control/index-backfill
    04:00:40 admission_control_index_backfill.go:63: using 1 pre-existing snapshot(s) with prefix "ac-index-backfill"
    04:00:47 roachprod.go:1716: detached and deleted volume irfansharif-snapshot-0001-1 from irfansharif-snapshot-0001
    04:01:14 roachprod.go:1763: created volume irfansharif-snapshot-0001-1
    04:01:19 roachprod.go:1769: attached volume irfansharif-snapshot-0001-1 to irfansharif-snapshot-0001
    04:01:22 roachprod.go:1782: mounted irfansharif-snapshot-0001-1 to irfansharif-snapshot-0001
    --- PASS: admission-control/index-backfill (43.47s)

We add the following APIs to the roachtest.Cluster interface, for tests
to interact with disk snapshots. admission-control/index-backfill is a
placeholder test making use of these APIs.

  type Cluster interface {
      // ...

      // CreateSnapshot creates volume snapshots of the cluster using
      // the given prefix. These snapshots can later be retrieved,
      // deleted or applied to already instantiated clusters.
      CreateSnapshot(ctx context.Context, snapshotPrefix string) error

      // ListSnapshots lists the individual volume snapshots that
      // satisfy the search criteria.
      ListSnapshots(
        ctx context.Context, vslo vm.VolumeSnapshotListOpts,
      ) ([]vm.VolumeSnapshot, error)

      // DeleteSnapshots permanently deletes the given snapshots.
      DeleteSnapshots(
        ctx context.Context, snapshots ...vm.VolumeSnapshot,
      ) error

      // ApplySnapshots applies the given volume snapshots to the
      // underlying cluster. This is a destructive operation as far as
      // existing state is concerned - all already-attached volumes are
      // detached and deleted to make room for new snapshot-derived
      // volumes. The new volumes are created using the same specs
      // (size, disk type, etc.) as the original cluster.
      ApplySnapshots(
        ctx context.Context, snapshots []vm.VolumeSnapshot,
      ) error
  }

This in turn is powered by the following additions to the vm.Provider
interface, implemented by each cloud provider.

  type Provider interface {
      // ...

      // CreateVolume creates a new volume using the given options.
      CreateVolume(l *logger.Logger, vco VolumeCreateOpts) (Volume, error)

      // ListVolumes lists all volumes already attached to the given VM.
      ListVolumes(l *logger.Logger, vm *VM) ([]Volume, error)

      // DeleteVolume detaches and deletes the given volume from the
      // given VM.
      DeleteVolume(l *logger.Logger, volume Volume, vm *VM) error

      // AttachVolume attaches the given volume to the given VM.
      AttachVolume(l *logger.Logger, volume Volume, vm *VM) (string, error)

      // CreateVolumeSnapshot creates a snapshot of the given volume,
      // using the given options.
      CreateVolumeSnapshot(
        l *logger.Logger, volume Volume, vsco VolumeSnapshotCreateOpts,
      ) (VolumeSnapshot, error)

      // ListVolumeSnapshots lists the individual volume snapshots that
      // satisfy the search criteria.
      ListVolumeSnapshots(
        l *logger.Logger, vslo VolumeSnapshotListOpts,
      ) ([]VolumeSnapshot, error)

      // DeleteVolumeSnapshot permanently deletes the given snapshot.
      DeleteVolumeSnapshot(l *logger.Logger, snapshot VolumeSnapshot) error
  }
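
As a rough illustration of how ApplySnapshots might be composed from these provider primitives, the sketch below follows the order seen in the logs above: detach and delete the existing volume, create a new one from the snapshot, then attach it. It is an assumption-laden sketch, not the PR's implementation: the types are simplified stand-ins, the `*logger.Logger` parameters are dropped, the `SourceSnapshot` field and the volume naming scheme are hypothetical, and the string returned by `AttachVolume` is treated as a device path.

```go
package main

import (
	"errors"
	"fmt"
)

// Simplified stand-ins for the vm package types. Fields are assumptions.
type Volume struct{ Name string }
type VM struct{ Name string }
type VolumeSnapshot struct{ Name string }
type VolumeCreateOpts struct {
	Name           string
	SourceSnapshot string // hypothetical: snapshot to create the volume from
}

// volumeProvider mirrors a subset of the Provider additions, minus the logger.
type volumeProvider interface {
	CreateVolume(vco VolumeCreateOpts) (Volume, error)
	ListVolumes(vm *VM) ([]Volume, error)
	DeleteVolume(volume Volume, vm *VM) error
	AttachVolume(volume Volume, vm *VM) (string, error)
}

// fakeProvider keeps attached volumes in memory and records each operation.
type fakeProvider struct {
	attached map[string][]Volume
	log      []string
}

func (p *fakeProvider) CreateVolume(vco VolumeCreateOpts) (Volume, error) {
	if vco.SourceSnapshot == "" {
		return Volume{}, errors.New("missing source snapshot")
	}
	p.log = append(p.log, "create "+vco.Name)
	return Volume{Name: vco.Name}, nil
}

func (p *fakeProvider) ListVolumes(vm *VM) ([]Volume, error) {
	// Return a copy so callers can delete while iterating.
	return append([]Volume(nil), p.attached[vm.Name]...), nil
}

func (p *fakeProvider) DeleteVolume(volume Volume, vm *VM) error {
	var kept []Volume
	for _, v := range p.attached[vm.Name] {
		if v.Name != volume.Name {
			kept = append(kept, v)
		}
	}
	p.attached[vm.Name] = kept
	p.log = append(p.log, "delete "+volume.Name)
	return nil
}

func (p *fakeProvider) AttachVolume(volume Volume, vm *VM) (string, error) {
	p.attached[vm.Name] = append(p.attached[vm.Name], volume)
	p.log = append(p.log, "attach "+volume.Name)
	return "/dev/disk/by-id/" + volume.Name, nil
}

// applySnapshot destructively replaces all volumes on vm with a single new
// volume derived from snap, and returns the attached device path.
func applySnapshot(p volumeProvider, vm *VM, snap VolumeSnapshot) (string, error) {
	vols, err := p.ListVolumes(vm)
	if err != nil {
		return "", err
	}
	for _, v := range vols {
		if err := p.DeleteVolume(v, vm); err != nil {
			return "", err
		}
	}
	vol, err := p.CreateVolume(VolumeCreateOpts{
		Name:           fmt.Sprintf("%s-1", vm.Name), // naming is an assumption
		SourceSnapshot: snap.Name,
	})
	if err != nil {
		return "", err
	}
	return p.AttachVolume(vol, vm)
}

func main() {
	vm := &VM{Name: "n1"}
	p := &fakeProvider{attached: map[string][]Volume{"n1": {{Name: "old"}}}}
	dev, err := applySnapshot(p, vm, VolumeSnapshot{Name: "ac-index-backfill"})
	fmt.Println(dev, err, p.log)
}
```

Deleting before creating keeps the sketch simple, but it also means a failure partway through leaves the VM without a data volume, which is consistent with the interface comment describing ApplySnapshots as destructive.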

Since these snapshots necessarily outlive the tests, and we don't want
them dangling perpetually, we introduce a prune-dangling roachtest that
acts as a poor man's cron job, sifting through expired snapshots
(>30 days old) and deleting them. For GCE at least, it's not obvious to
me how to create these snapshots in cloud buckets with a TTL built in,
hence this hack. It looks like this (with a modified TTL):

    === RUN   prune-dangling
    06:22:48 prune_dangling_snapshots_and_disks.go:54: pruned old snapshot ac-index-backfill-0001-vunknown-1-n2-standard-8 (id=7962137245497025996)
    06:22:48 test_runner.go:1023: tearing down after success; see teardown.log
    --- PASS: prune-dangling (8.59s)

Subsequent commits will:
- [ ] Fill out admission-control/index-backfill, a non-trivial use of
      disk snapshots. It will cut down the test time from >4hrs to <25m.
- [ ] Expose top-level commands in roachprod to manipulate these
      snapshots.

Release note: None

@irfansharif irfansharif force-pushed the 230523.index-backfill-roachtest branch 3 times, most recently from c05f987 to 01a4c74 Compare May 24, 2023 09:51
And make it use disk snapshots. Add a few smarts to the TPC-E harness
(exposing a 'during' helper to run backfills concurrently with
foreground load, integrating with --skip-init, --local, estimated setup
times, prometheus, and disk snapshots of course).

Release note: None
@irfansharif irfansharif force-pushed the 230523.index-backfill-roachtest branch 2 times, most recently from 7bd072d to 7ce8dfb Compare May 24, 2023 22:55
@irfansharif
Contributor Author

Pulled it into #103757.
