roachtest: add admission-control/index-backfill #103816
Closed
irfansharif wants to merge 5 commits into cockroachdb:master from irfansharif:230523.index-backfill-roachtest
Pure code movement. We'll make use of it outside this file in subsequent commits. Release note: None
These tests have been stable for a few months now. Reduce to a weekly cadence. Release note: None
Long-lived disk snapshots can drastically reduce testing time for scale tests. Tests, whether run by hand or through CI, need only run the long-running fixture-generating code (importing some dataset, generating it organically through workload, etc.) when snapshot fingerprints change; the fingerprints incorporate the major crdb version that generated them.

Here's an example run that freshly generates disk snapshots:

    === RUN admission-control/index-backfill
    03:57:19 admission_control_index_backfill.go:53: no existing snapshots found for admission-control/index-backfill (ac-index-backfill), doing pre-work
    03:57:54 roachprod.go:1626: created volume snapshot ac-index-backfill-0001-vunknown-1-n2-standard-8 (id=6426236595187320652) for volume irfansharif-snapshot-0001-1 on irfansharif-snapshot-0001-1/n1
    03:57:55 admission_control_index_backfill.go:61: using 1 newly created snapshot(s) with prefix "ac-index-backfill"
    03:58:02 roachprod.go:1716: detached and deleted volume irfansharif-snapshot-0001-1 from irfansharif-snapshot-0001
    03:58:28 roachprod.go:1764: created volume irfansharif-snapshot-0001-1
    03:58:33 roachprod.go:1770: attached volume irfansharif-snapshot-0001-1 to irfansharif-snapshot-0001
    03:58:36 roachprod.go:1783: mounted irfansharif-snapshot-0001-1 to irfansharif-snapshot-0001
    --- PASS: admission-control/index-backfill (79.14s)

Here's a subsequent run that makes use of the aforementioned disk snapshot:

    === RUN admission-control/index-backfill
    04:00:40 admission_control_index_backfill.go:63: using 1 pre-existing snapshot(s) with prefix "ac-index-backfill"
    04:00:47 roachprod.go:1716: detached and deleted volume irfansharif-snapshot-0001-1 from irfansharif-snapshot-0001
    04:01:14 roachprod.go:1763: created volume irfansharif-snapshot-0001-1
    04:01:19 roachprod.go:1769: attached volume irfansharif-snapshot-0001-1 to irfansharif-snapshot-0001
    04:01:22 roachprod.go:1782: mounted irfansharif-snapshot-0001-1 to irfansharif-snapshot-0001
    --- PASS: admission-control/index-backfill (43.47s)

We add the following APIs to the roachtest.Cluster interface, for tests to interact with disk snapshots. admission-control/index-backfill is a placeholder test making use of these APIs.

    type Cluster interface {
        // ...

        // CreateSnapshot creates volume snapshots of the cluster using
        // the given prefix. These snapshots can later be retrieved,
        // deleted or applied to already instantiated clusters.
        CreateSnapshot(ctx context.Context, snapshotPrefix string) error

        // ListSnapshots lists the individual volume snapshots that
        // satisfy the search criteria.
        ListSnapshots(
            ctx context.Context, vslo vm.VolumeSnapshotListOpts,
        ) ([]vm.VolumeSnapshot, error)

        // DeleteSnapshots permanently deletes the given snapshots.
        DeleteSnapshots(
            ctx context.Context, snapshots ...vm.VolumeSnapshot,
        ) error

        // ApplySnapshots applies the given volume snapshots to the
        // underlying cluster. This is a destructive operation as far as
        // existing state is concerned - all already-attached volumes are
        // detached and deleted to make room for new snapshot-derived
        // volumes. The new volumes are created using the same specs
        // (size, disk type, etc.) as the original cluster.
        ApplySnapshots(
            ctx context.Context, snapshots []vm.VolumeSnapshot,
        ) error
    }

This in turn is powered by the following additions to the vm.Provider interface, implemented by each cloud provider.

    type Provider interface {
        // ...

        // CreateVolume creates a new volume using the given options.
        CreateVolume(l *logger.Logger, vco VolumeCreateOpts) (Volume, error)

        // ListVolumes lists all volumes already attached to the given VM.
        ListVolumes(l *logger.Logger, vm *VM) ([]Volume, error)

        // DeleteVolume detaches and deletes the given volume from the
        // given VM.
        DeleteVolume(l *logger.Logger, volume Volume, vm *VM) error

        // AttachVolume attaches the given volume to the given VM.
        AttachVolume(l *logger.Logger, volume Volume, vm *VM) (string, error)

        // CreateVolumeSnapshot creates a snapshot of the given volume,
        // using the given options.
        CreateVolumeSnapshot(
            l *logger.Logger, volume Volume, vsco VolumeSnapshotCreateOpts,
        ) (VolumeSnapshot, error)

        // ListVolumeSnapshots lists the individual volume snapshots that
        // satisfy the search criteria.
        ListVolumeSnapshots(
            l *logger.Logger, vslo VolumeSnapshotListOpts,
        ) ([]VolumeSnapshot, error)

        // DeleteVolumeSnapshot permanently deletes the given snapshot.
        DeleteVolumeSnapshot(l *logger.Logger, snapshot VolumeSnapshot) error
    }

Since these snapshots necessarily outlive the tests, and we don't want them dangling perpetually, we introduce a prune-dangling roachtest that acts as a poor man's cron job, sifting through expired snapshots (older than 30 days) and deleting them. For GCE at least it's not obvious to me how to create these snapshots in cloud buckets with a TTL built in, hence this hack. It looks like this (with a change to the TTL):

    === RUN prune-dangling
    06:22:48 prune_dangling_snapshots_and_disks.go:54: pruned old snapshot ac-index-backfill-0001-vunknown-1-n2-standard-8 (id=7962137245497025996)
    06:22:48 test_runner.go:1023: tearing down after success; see teardown.log
    --- PASS: prune-dangling (8.59s)

Subsequent commits will:
- [ ] Fill out admission-control/index-backfill, a non-trivial use of disk snapshots. It will cut down the test time from >4hrs to <25m.
- [ ] Expose top-level commands in roachprod to manipulate these snapshots.

Release note: None
irfansharif force-pushed the 230523.index-backfill-roachtest branch 3 times, most recently from c05f987 to 01a4c74 (May 24, 2023 09:51)
And make it use disk snapshots. Add a few smarts to the TPC-E harness (exposing a 'during' helper to run backfills concurrently with foreground load, integrate with --skip-init, --local, estimated setup times, prometheus, and disk snapshots of course). Release note: None
irfansharif force-pushed the 230523.index-backfill-roachtest branch 2 times, most recently from 7bd072d to 7ce8dfb (May 24, 2023 22:55)
Release note: None
irfansharif force-pushed the 230523.index-backfill-roachtest branch from 7ce8dfb to 7b176c1 (May 24, 2023 23:07)
Pulled it into #103757.