roach{prod,test}: add first-class support for disk snapshots
Long-lived disk snapshots can drastically reduce testing time for scale
tests. Tests, whether run by hand or through CI, need only run the
long-running fixture-generating code (importing some dataset, generating
it organically through workload, etc.) when snapshot fingerprints
change; the fingerprints incorporate the major crdb version that
generated them.
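
As a hedged sketch, here is how such a fingerprint-style snapshot name might be composed. The field order below (prefix, volume ordinal, crdb version, node count, machine type) is inferred from snapshot names like `ac-index-backfill-0001-vunknown-1-n2-standard-8` in the log output that follows, not taken from the roachprod source:

```go
package main

import (
	"fmt"
	"strings"
)

// snapshotFingerprint composes a snapshot name from the pieces that make
// up the fingerprint. The field order is an assumption inferred from
// example snapshot names; the real naming logic lives in roachprod.
func snapshotFingerprint(prefix string, ordinal int, version string, nodes int, machineType string) string {
	return strings.Join([]string{
		prefix,
		fmt.Sprintf("%04d", ordinal), // zero-padded per-volume ordinal
		version,                      // major crdb version that generated the fixture
		fmt.Sprintf("%d", nodes),     // node count
		machineType,                  // e.g. n2-standard-8
	}, "-")
}

func main() {
	// → ac-index-backfill-0001-vunknown-1-n2-standard-8
	fmt.Println(snapshotFingerprint("ac-index-backfill", 1, "vunknown", 1, "n2-standard-8"))
}
```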

Here's an example run that freshly generates disk snapshots:

    === RUN   admission-control/index-backfill
    03:57:19 admission_control_index_backfill.go:53: no existing snapshots found for admission-control/index-backfill (ac-index-backfill), doing pre-work
    03:57:54 roachprod.go:1626: created volume snapshot ac-index-backfill-0001-vunknown-1-n2-standard-8 (id=6426236595187320652) for volume irfansharif-snapshot-0001-1 on irfansharif-snapshot-0001-1/n1
    03:57:55 admission_control_index_backfill.go:61: using 1 newly created snapshot(s) with prefix "ac-index-backfill"
    03:58:02 roachprod.go:1716: detached and deleted volume irfansharif-snapshot-0001-1 from irfansharif-snapshot-0001
    03:58:28 roachprod.go:1764: created volume irfansharif-snapshot-0001-1
    03:58:33 roachprod.go:1770: attached volume irfansharif-snapshot-0001-1 to irfansharif-snapshot-0001
    03:58:36 roachprod.go:1783: mounted irfansharif-snapshot-0001-1 to irfansharif-snapshot-0001
    --- PASS: admission-control/index-backfill (79.14s)

Here's a subsequent run that makes use of the aforementioned disk
snapshot:

    === RUN   admission-control/index-backfill
    04:00:40 admission_control_index_backfill.go:63: using 1 pre-existing snapshot(s) with prefix "ac-index-backfill"
    04:00:47 roachprod.go:1716: detached and deleted volume irfansharif-snapshot-0001-1 from irfansharif-snapshot-0001
    04:01:14 roachprod.go:1763: created volume irfansharif-snapshot-0001-1
    04:01:19 roachprod.go:1769: attached volume irfansharif-snapshot-0001-1 to irfansharif-snapshot-0001
    04:01:22 roachprod.go:1782: mounted irfansharif-snapshot-0001-1 to irfansharif-snapshot-0001
    --- PASS: admission-control/index-backfill (43.47s)

We add the following APIs to the roachtest.Cluster interface for tests
to interact with disk snapshots. admission-control/index-backfill is a
placeholder test that makes use of these APIs.

  type Cluster interface {
      // ...

      // CreateSnapshot creates volume snapshots of the cluster using
      // the given prefix. These snapshots can later be retrieved,
      // deleted or applied to already instantiated clusters.
      CreateSnapshot(ctx context.Context, snapshotPrefix string) error

      // ListSnapshots lists the individual volume snapshots that
      // satisfy the search criteria.
      ListSnapshots(
        ctx context.Context, vslo vm.VolumeSnapshotListOpts,
      ) ([]vm.VolumeSnapshot, error)

      // DeleteSnapshots permanently deletes the given snapshots.
      DeleteSnapshots(
        ctx context.Context, snapshots ...vm.VolumeSnapshot,
      ) error

      // ApplySnapshots applies the given volume snapshots to the
      // underlying cluster. This is a destructive operation as far as
      // existing state is concerned - all already-attached volumes are
      // detached and deleted to make room for new snapshot-derived
      // volumes. The new volumes are created using the same specs
      // (size, disk type, etc.) as the original cluster.
      ApplySnapshots(
        ctx context.Context, snapshots []vm.VolumeSnapshot,
      ) error
  }

This in turn is powered by the following additions to the vm.Provider
interface, implemented by each cloud provider.

  type Provider interface {
      // ...

      // CreateVolume creates a new volume using the given options.
      CreateVolume(l *logger.Logger, vco VolumeCreateOpts) (Volume, error)

      // ListVolumes lists all volumes already attached to the given VM.
      ListVolumes(l *logger.Logger, vm *VM) ([]Volume, error)

      // DeleteVolume detaches and deletes the given volume from the
      // given VM.
      DeleteVolume(l *logger.Logger, volume Volume, vm *VM) error

      // AttachVolume attaches the given volume to the given VM.
      AttachVolume(l *logger.Logger, volume Volume, vm *VM) (string, error)

      // CreateVolumeSnapshot creates a snapshot of the given volume,
      // using the given options.
      CreateVolumeSnapshot(
        l *logger.Logger, volume Volume, vsco VolumeSnapshotCreateOpts,
      ) (VolumeSnapshot, error)

      // ListVolumeSnapshots lists the individual volume snapshots that
      // satisfy the search criteria.
      ListVolumeSnapshots(
        l *logger.Logger, vslo VolumeSnapshotListOpts,
      ) ([]VolumeSnapshot, error)

      // DeleteVolumeSnapshot permanently deletes the given snapshot.
      DeleteVolumeSnapshot(l *logger.Logger, snapshot VolumeSnapshot) error
  }
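
A self-contained sketch of how ApplySnapshots might compose these provider primitives, mirroring the log sequence in the runs above: detach and delete the already-attached volumes, create snapshot-derived replacements, then attach them (mounting happens as a separate step after attach). The types here are illustrative stand-ins, not the roachprod API:

```go
package main

import "fmt"

type volume struct{ name string }
type snapshot struct{ name string }

// provider records the sequence of primitive operations, standing in for a
// real vm.Provider implementation.
type provider struct{ log []string }

func (p *provider) deleteVolume(v volume) {
	p.log = append(p.log, "detached and deleted "+v.name)
}
func (p *provider) createVolume(s snapshot) volume {
	p.log = append(p.log, "created volume from "+s.name)
	return volume{name: s.name + "-vol"}
}
func (p *provider) attachVolume(v volume) {
	p.log = append(p.log, "attached "+v.name)
}

// applySnapshots is destructive: every existing volume is detached and
// deleted to make room for the new snapshot-derived volumes.
func applySnapshots(p *provider, attached []volume, snaps []snapshot) []volume {
	for _, v := range attached {
		p.deleteVolume(v)
	}
	var out []volume
	for _, s := range snaps {
		v := p.createVolume(s)
		p.attachVolume(v)
		out = append(out, v)
	}
	return out
}

func main() {
	p := &provider{}
	vols := applySnapshots(p,
		[]volume{{name: "irfansharif-snapshot-0001-1"}},
		[]snapshot{{name: "ac-index-backfill-0001"}})
	fmt.Println(len(vols), len(p.log)) // → 1 3
}
```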

Since these snapshots necessarily outlive the tests, and we don't want
them dangling perpetually, we introduce a prune-dangling roachtest that
acts as a poor man's cron job, sifting through expired snapshots
(>30 days old) and deleting them. For GCE at least it's not obvious to me
how to create these snapshots in cloud buckets with a TTL built in, hence
this hack. It looks like this (with a change to the TTL):

    === RUN   prune-dangling
    06:22:48 prune_dangling_snapshots_and_disks.go:54: pruned old snapshot ac-index-backfill-0001-vunknown-1-n2-standard-8 (id=7962137245497025996)
    06:22:48 test_runner.go:1023: tearing down after success; see teardown.log
    --- PASS: prune-dangling (8.59s)

Subsequent commits will:
- [ ] Fill out admission-control/index-backfill, a non-trivial use of
      disk snapshots. It will cut down the test time from >4hrs to <25m.
- [ ] Expose top-level commands in roachprod to manipulate these
      snapshots.

Release note: None
irfansharif committed May 23, 2023
1 parent f201f9e commit f13b01d
Showing 19 changed files with 848 additions and 148 deletions.
5 changes: 4 additions & 1 deletion pkg/cmd/roachprod/main.go
@@ -1125,7 +1125,10 @@ var storageSnapshotCmd = &cobra.Command{
cluster := args[0]
name := args[1]
desc := args[2]
return roachprod.SnapshotVolume(context.Background(), config.Logger, cluster, name, desc)
return roachprod.CreateSnapshot(context.Background(), config.Logger, cluster, vm.VolumeSnapshotCreateOpts{
Name: name,
Description: desc,
})
}),
}

34 changes: 34 additions & 0 deletions pkg/cmd/roachtest/cluster.go
@@ -574,6 +574,9 @@ func MachineTypeToCPUs(s string) int {
if _, err := fmt.Sscanf(s, "n1-standard-%d", &v); err == nil {
return v
}
if _, err := fmt.Sscanf(s, "n2-standard-%d", &v); err == nil {
return v
}
if _, err := fmt.Sscanf(s, "n1-highcpu-%d", &v); err == nil {
return v
}
@@ -1690,6 +1693,37 @@ func (c *clusterImpl) doDestroy(ctx context.Context, l *logger.Logger) <-chan st
return ch
}

func (c *clusterImpl) ListSnapshots(
ctx context.Context, vslo vm.VolumeSnapshotListOpts,
) ([]vm.VolumeSnapshot, error) {
return roachprod.ListSnapshots(ctx, c.l, c.name, vslo)
}

func (c *clusterImpl) DeleteSnapshots(ctx context.Context, snapshots ...vm.VolumeSnapshot) error {
return roachprod.DeleteSnapshots(ctx, c.l, c.name, snapshots...)
}

func (c *clusterImpl) CreateSnapshot(ctx context.Context, snapshotPrefix string) error {
return roachprod.CreateSnapshot(ctx, c.l, c.name, vm.VolumeSnapshotCreateOpts{
Name: snapshotPrefix,
Description: fmt.Sprintf("snapshot for test: %s", c.t.Name()),
Labels: map[string]string{
vm.TagUsage: "roachtest",
},
})
}

func (c *clusterImpl) ApplySnapshots(ctx context.Context, snapshots []vm.VolumeSnapshot) error {
opts := vm.VolumeCreateOpts{
Size: c.spec.VolumeSize,
Type: c.spec.GCEVolumeType, // TODO(irfansharif): This is only applicable to GCE. Change that.
Labels: map[string]string{
"usage": "roachtest",
},
}
return roachprod.ApplySnapshots(ctx, c.l, c.name, snapshots, opts)
}

// Put a local file to all of the machines in a cluster.
// Put is DEPRECATED. Use PutE instead.
func (c *clusterImpl) Put(ctx context.Context, src, dest string, nodes ...option.Option) {
1 change: 1 addition & 0 deletions pkg/cmd/roachtest/cluster/BUILD.bazel
@@ -16,6 +16,7 @@ go_library(
"//pkg/roachprod/install",
"//pkg/roachprod/logger",
"//pkg/roachprod/prometheus",
"//pkg/roachprod/vm",
"@com_github_cockroachdb_errors//:errors",
],
)
19 changes: 19 additions & 0 deletions pkg/cmd/roachtest/cluster/cluster_interface.go
@@ -20,6 +20,7 @@ import (
"github.com/cockroachdb/cockroach/pkg/roachprod/install"
"github.com/cockroachdb/cockroach/pkg/roachprod/logger"
"github.com/cockroachdb/cockroach/pkg/roachprod/prometheus"
"github.com/cockroachdb/cockroach/pkg/roachprod/vm"
)

// Cluster is the interface through which a given roachtest interacts with the
@@ -134,4 +135,22 @@ type Cluster interface {

StartGrafana(ctx context.Context, l *logger.Logger, promCfg *prometheus.Config) error
StopGrafana(ctx context.Context, l *logger.Logger, dumpDir string) error

// Volume snapshot related APIs.

// CreateSnapshot creates volume snapshots of the cluster using the given
// prefix. These snapshots can later be retrieved, deleted or applied to
// already instantiated clusters.
CreateSnapshot(ctx context.Context, snapshotPrefix string) error
// ListSnapshots lists the individual volume snapshots that satisfy the
// search criteria.
ListSnapshots(ctx context.Context, vslo vm.VolumeSnapshotListOpts) ([]vm.VolumeSnapshot, error)
// DeleteSnapshots permanently deletes the given snapshots.
DeleteSnapshots(ctx context.Context, snapshots ...vm.VolumeSnapshot) error
// ApplySnapshots applies the given volume snapshots to the underlying
// cluster. This is a destructive operation as far as existing state is
// concerned - all already-attached volumes are detached and deleted to make
// room for new snapshot-derived volumes. The new volumes are created using
// the same specs (size, disk type, etc.) as the original cluster.
ApplySnapshots(ctx context.Context, snapshots []vm.VolumeSnapshot) error
}
14 changes: 13 additions & 1 deletion pkg/cmd/roachtest/spec/cluster_spec.go
@@ -85,6 +85,13 @@ type ClusterSpec struct {
RandomlyUseZfs bool

GatherCores bool

// GCE-specific arguments.
//
// TODO(irfansharif): This cluster spec type suffers the curse of
// generality. Make it easier to just inject cloud-specific arguments.
GCEMinCPUPlatform string
GCEVolumeType string
}

// MakeClusterSpec makes a ClusterSpec.
@@ -155,6 +162,7 @@ func getGCEOpts(
localSSD bool,
RAID0 bool,
terminateOnMigration bool,
minCPUPlatform, volumeType string,
) vm.ProviderOpts {
opts := gce.DefaultProviderOpts()
opts.MachineType = machineType
@@ -173,6 +181,8 @@
opts.UseMultipleDisks = !RAID0
}
opts.TerminateOnMigration = terminateOnMigration
opts.MinCPUPlatform = minCPUPlatform
opts.PDVolumeType = volumeType

return opts
}
@@ -289,7 +299,9 @@ func (s *ClusterSpec) RoachprodOpts(
providerOpts = getAWSOpts(machineType, zones, s.VolumeSize, createVMOpts.SSDOpts.UseLocalSSD)
case GCE:
providerOpts = getGCEOpts(machineType, zones, s.VolumeSize, ssdCount,
createVMOpts.SSDOpts.UseLocalSSD, s.RAID0, s.TerminateOnMigration)
createVMOpts.SSDOpts.UseLocalSSD, s.RAID0, s.TerminateOnMigration,
s.GCEMinCPUPlatform, s.GCEVolumeType,
)
case Azure:
providerOpts = getAzureOpts(machineType, zones)
}
11 changes: 11 additions & 0 deletions pkg/cmd/roachtest/spec/option.go
@@ -17,6 +17,17 @@ type Option interface {
apply(spec *ClusterSpec)
}

type cloudOption string

func (o cloudOption) apply(spec *ClusterSpec) {
spec.Cloud = string(o)
}

// Cloud controls what cloud is used to create the cluster.
func Cloud(s string) Option {
return cloudOption(s)
}

type nodeCPUOption int

func (o nodeCPUOption) apply(spec *ClusterSpec) {
3 changes: 3 additions & 0 deletions pkg/cmd/roachtest/tests/BUILD.bazel
@@ -11,6 +11,7 @@ go_library(
"admission_control.go",
"admission_control_elastic_backup.go",
"admission_control_elastic_cdc.go",
"admission_control_index_backfill.go",
"admission_control_index_overload.go",
"admission_control_multi_store_overload.go",
"admission_control_multitenant_fairness.go",
@@ -118,6 +119,7 @@ go_library(
"pgx_blocklist.go",
"pop.go",
"process_lock.go",
"prune_dangling_snapshots_and_disks.go",
"psycopg.go",
"psycopg_blocklist.go",
"query_comparison_util.go",
@@ -211,6 +213,7 @@ go_library(
"//pkg/roachprod/install",
"//pkg/roachprod/logger",
"//pkg/roachprod/prometheus",
"//pkg/roachprod/vm",
"//pkg/server",
"//pkg/server/serverpb",
"//pkg/sql",
1 change: 1 addition & 0 deletions pkg/cmd/roachtest/tests/admission_control.go
@@ -36,4 +36,5 @@ func registerAdmission(r registry.Registry) {
registerTPCCOverload(r)
registerTPCCSevereOverload(r)
registerIndexOverload(r)
registerIndexBackfill(r)
}
83 changes: 83 additions & 0 deletions pkg/cmd/roachtest/tests/admission_control_index_backfill.go
@@ -0,0 +1,83 @@
// Copyright 2023 The Cockroach Authors.
//
// Use of this software is governed by the Business Source License
// included in the file licenses/BSL.txt.
//
// As of the Change Date specified in that file, in accordance with
// the Business Source License, use of this software will be governed
// by the Apache License, Version 2.0, included in the file
// licenses/APL.txt.

package tests

import (
"context"

"github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster"
"github.com/cockroachdb/cockroach/pkg/cmd/roachtest/registry"
"github.com/cockroachdb/cockroach/pkg/cmd/roachtest/spec"
"github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test"
"github.com/cockroachdb/cockroach/pkg/roachprod/vm"
)

func registerIndexBackfill(r registry.Registry) {
clusterSpec := r.MakeClusterSpec(
1, /* nodeCount */
spec.CPU(8),
spec.Zones("us-east1-b"),
spec.VolumeSize(500),
spec.Cloud(spec.GCE),
)
clusterSpec.InstanceType = "n2-standard-8"
clusterSpec.GCEMinCPUPlatform = "Intel Ice Lake"
clusterSpec.GCEVolumeType = "pd-ssd"

r.Add(registry.TestSpec{
Name: "admission-control/index-backfill",
Owner: registry.OwnerAdmissionControl,
// TODO(irfansharif): Reduce to weekly cadence once stabilized.
// Tags: registry.Tags(`weekly`),
Cluster: clusterSpec,
RequiresLicense: true,
Run: func(ctx context.Context, t test.Test, c cluster.Cluster) {
// TODO(irfansharif): Make a registry of these prefix strings. It's
// important no registered name is a prefix of another.
const snapshotPrefix = "ac-index-backfill"

var snapshots []vm.VolumeSnapshot
snapshots, err := c.ListSnapshots(ctx, vm.VolumeSnapshotListOpts{
// TODO(irfansharif): Search by taking in the other parts of the
// snapshot fingerprint, i.e. the node count, the version, etc.
Name: snapshotPrefix,
})
if err != nil {
t.Fatal(err)
}
if len(snapshots) == 0 {
t.L().Printf("no existing snapshots found for %s (%s), doing pre-work", t.Name(), snapshotPrefix)
// TODO(irfansharif): Add validation that we're some released
// version, probably the predecessor one. Also ensure that any
// running CRDB processes have been stopped since we're taking
// raw disk snapshots. Also later we'll be unmounting/mounting
// attached volumes.
if err := c.CreateSnapshot(ctx, snapshotPrefix); err != nil {
t.Fatal(err)
}
snapshots, err = c.ListSnapshots(ctx, vm.VolumeSnapshotListOpts{Name: snapshotPrefix})
if err != nil {
t.Fatal(err)
}
t.L().Printf("using %d newly created snapshot(s) with prefix %q", len(snapshots), snapshotPrefix)
} else {
t.L().Printf("using %d pre-existing snapshot(s) with prefix %q", len(snapshots), snapshotPrefix)
}

if err := c.ApplySnapshots(ctx, snapshots); err != nil {
t.Fatal(err)
}

// TODO(irfansharif): Actually do something using TPC-E, index
// backfills and replication admission control.
},
})
}
63 changes: 63 additions & 0 deletions pkg/cmd/roachtest/tests/prune_dangling_snapshots_and_disks.go
@@ -0,0 +1,63 @@
// Copyright 2023 The Cockroach Authors.
//
// Use of this software is governed by the Business Source License
// included in the file licenses/BSL.txt.
//
// As of the Change Date specified in that file, in accordance with
// the Business Source License, use of this software will be governed
// by the Apache License, Version 2.0, included in the file
// licenses/APL.txt.

package tests

import (
"context"

"github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster"
"github.com/cockroachdb/cockroach/pkg/cmd/roachtest/registry"
"github.com/cockroachdb/cockroach/pkg/cmd/roachtest/spec"
"github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test"
"github.com/cockroachdb/cockroach/pkg/roachprod"
"github.com/cockroachdb/cockroach/pkg/roachprod/vm"
"github.com/cockroachdb/cockroach/pkg/util/timeutil"
)

// This test exists only to prune expired snapshots. Not all cloud providers
// (GCE) let you store volume snapshots in buckets with a pre-configured TTL. So
// we use this nightly roachtest as a poor man's cron job.
func registerPruneDanglingSnapshotsAndDisks(r registry.Registry) {
clusterSpec := r.MakeClusterSpec(
1, /* nodeCount */
spec.Cloud(spec.GCE),
)

r.Add(registry.TestSpec{
Name: "prune-dangling",
Owner: registry.OwnerTestEng,
Cluster: clusterSpec,
RequiresLicense: true,
Run: func(ctx context.Context, t test.Test, c cluster.Cluster) {
snapshots, err := c.ListSnapshots(ctx, vm.VolumeSnapshotListOpts{
CreatedBefore: timeutil.Now().Add(-1 * roachprod.SnapshotTTL),
Labels: map[string]string{
vm.TagUsage: "roachtest", // only prune out snapshots created in tests
},
})
if err != nil {
t.Fatal(err)
}

for _, snapshot := range snapshots {
if err := c.DeleteSnapshots(ctx, snapshot); err != nil {
t.Fatal(err)
}
t.L().Printf("pruned old snapshot %s (id=%s)", snapshot.Name, snapshot.ID)
}

// TODO(irfansharif): Also prune out unattached disks. Use something
// like:
//
// gcloud compute --project $project disks list --filter="-users:*"
},
})
}
1 change: 1 addition & 0 deletions pkg/cmd/roachtest/tests/registry.go
@@ -100,6 +100,7 @@ func RegisterTests(r registry.Registry) {
registerPop(r)
registerProcessLock(r)
registerPsycopg(r)
registerPruneDanglingSnapshotsAndDisks(r)
registerQueue(r)
registerQuitTransfersLeases(r)
registerRebalanceLoad(r)
22 changes: 5 additions & 17 deletions pkg/roachprod/cloud/cluster_cloud.go
@@ -15,7 +15,6 @@ import (
"fmt"
"regexp"
"sort"
"strings"
"time"

"github.com/cockroachdb/cockroach/pkg/roachprod/config"
@@ -164,21 +163,6 @@ func (c *Cluster) IsLocal() bool {
return config.IsLocalClusterName(c.Name)
}

const vmNameFormat = "user-<clusterid>-<nodeid>"

// namesFromVM determines the user name and the cluster name from a VM.
func namesFromVM(v vm.VM) (userName string, clusterName string, _ error) {
if v.IsLocal() {
return config.Local, v.LocalClusterName, nil
}
name := v.Name
parts := strings.Split(name, "-")
if len(parts) < 3 {
return "", "", fmt.Errorf("expected VM name in the form %s, got %s", vmNameFormat, name)
}
return parts[0], strings.Join(parts[:len(parts)-1], "-"), nil
}

// ListCloud returns information about all instances (across all available
// providers).
func ListCloud(l *logger.Logger, options vm.ListOptions) (*Cloud, error) {
@@ -207,7 +191,11 @@ func ListCloud(l *logger.Logger, options vm.ListOptions) (*Cloud, error) {
for _, vms := range providerVMs {
for _, v := range vms {
// Parse cluster/user from VM name, but only for non-local VMs
userName, clusterName, err := namesFromVM(v)
userName, err := v.UserName()
if err != nil {
v.Errors = append(v.Errors, vm.ErrInvalidName)
}
clusterName, err := v.ClusterName()
if err != nil {
v.Errors = append(v.Errors, vm.ErrInvalidName)
}