backupccl: change default restore roachtest configuration #92699
Labels: C-enhancement (Solution expected to add code/behavior + preserve backward-compat; pg compat issues are exception), T-disaster-recovery
Comments
msbutler added the C-enhancement and T-disaster-recovery labels on Nov 29, 2022
cc @cockroachdb/disaster-recovery
msbutler added a commit to msbutler/cockroach that referenced this issue on Dec 22, 2022:

This patch introduces a new framework for writing restore roachtests that minimizes code duplication and leverages our new backup fixture organization. The framework makes it easy to write a new test using a variety of knobs, such as:

- hardware: cloud provider, disk volume, number of nodes, number of CPUs
- backup fixture: workload, workload scale

The patch is the first in an ongoing effort to redo our roachtests, and introduces two new roachtests:

- restore/nodes=4: the default configuration: 4 nodes, 8 vCPUs, 1000 GB EBS, restoring a tpce backup fixture (25,000 customers, around 400 GB).
- restore/gce: the same configuration as above, run on GCE.

Future patches will add more tests that use this framework.

Informs cockroachdb#92699

Release note: None
msbutler added a commit to msbutler/cockroach that referenced this issue on Jan 4, 2023, with the same commit message as above.
msbutler added a commit to msbutler/cockroach that referenced this issue on Jan 5, 2023, with the same commit message as above plus a new paragraph:

Notice that this patch also introduces a new naming convention for restore tests. The default test is named `restore/nodes=4`, and each test that deviates from that config highlights the deviation in its name. For example, `restore/gce` only switches the cloud provider and holds all other variables constant; thus only 'gce' is needed in the name.
msbutler added a commit to msbutler/cockroach that referenced this issue on Jan 6, 2023. This revision of the commit message renames the tests and adds a third one:

The patch introduces 3 new roachtests:

- restore/tpce/400GB: the default configuration: 4 nodes, 8 vCPUs, 1000 GB EBS, restoring a tpce backup fixture (25,000 customers, around 400 GB).
- restore/tpce/400GB/gce: the same configuration as above, run on GCE.
- restore/tpce/8TB/nodes=10: the big one!

Notice that this patch also introduces a new naming convention for restore tests. The default test is named `restore/tpce/400GB` and only contains the basic workload. Every other test name contains the workload and any specs that deviate from the default config. For example, `restore/tpce/400GB/gce` only switches the cloud provider and holds all other variables constant; thus only the workload and 'gce' are needed in the name.
craig bot pushed a commit that referenced this issue on Jan 7, 2023:

94143: backupccl: introduce new restore roachtest framework r=lidorcarmel a=msbutler

The merged PR carries the final commit message shown in the Jan 6 commit above, introducing the `restore/tpce/400GB`, `restore/tpce/400GB/gce`, and `restore/tpce/8TB/nodes=10` roachtests and the new naming convention. Informs #92699.

Release note: None

Co-authored-by: Michael Butler <[email protected]>
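To make the naming convention concrete, here is a minimal, self-contained Go sketch. It is not the actual roachtest framework code; the `restoreSpec` type, its field set, and `testName` are hypothetical, and the defaults simply mirror the configuration described in the commit message (4 nodes, 8 vCPUs, AWS/EBS, tpce 400 GB). The sketch derives a test name from the workload plus only those knobs that deviate from the default:

```go
package main

import (
	"fmt"
	"strings"
)

// restoreSpec captures the knobs the new framework exposes.
// The field set is illustrative only.
type restoreSpec struct {
	workload string // backup fixture, e.g. "tpce/400GB"
	cloud    string // cloud provider
	nodes    int    // cluster size
	cpus     int    // vCPUs per node
}

// defaultSpec mirrors the default configuration from the commit message.
var defaultSpec = restoreSpec{workload: "tpce/400GB", cloud: "aws", nodes: 4, cpus: 8}

// testName builds "restore/<workload>[/<deviation>...]", listing only the
// knobs that differ from the default, per the naming convention.
func testName(s restoreSpec) string {
	parts := []string{"restore", s.workload}
	if s.cloud != defaultSpec.cloud {
		parts = append(parts, s.cloud)
	}
	if s.nodes != defaultSpec.nodes {
		parts = append(parts, fmt.Sprintf("nodes=%d", s.nodes))
	}
	if s.cpus != defaultSpec.cpus {
		parts = append(parts, fmt.Sprintf("cpus=%d", s.cpus))
	}
	return strings.Join(parts, "/")
}

func main() {
	fmt.Println(testName(defaultSpec))                                                          // restore/tpce/400GB
	fmt.Println(testName(restoreSpec{workload: "tpce/400GB", cloud: "gce", nodes: 4, cpus: 8})) // restore/tpce/400GB/gce
	fmt.Println(testName(restoreSpec{workload: "tpce/8TB", cloud: "aws", nodes: 10, cpus: 8}))  // restore/tpce/8TB/nodes=10
}
```

The printed names match the three tests introduced by the PR, which is the point of the convention: the default name stays short, and each deviation is visible at a glance.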
craig bot pushed a commit that referenced this issue on Mar 7, 2023:
97587: allocator: check IO overload on lease transfer r=andrewbaptist a=kvoli

Previously, the allocator would return lease transfer targets without considering the IO overload of the stores involved. When leases transferred to IO-overloaded stores, service latency tended to degrade. This commit adds IO overload checks prior to lease transfers, similar to the IO overload checks for allocating replicas in #97142.

The checks work by comparing a candidate store against `kv.allocator.lease_io_overload_threshold` (default 0.5) and the mean of the other candidates. If the candidate store is greater than or equal to both values, it is considered IO overloaded. The current leaseholder has to meet a higher bar to be considered IO overloaded: it must have an IO overload score greater than or equal to `kv.allocator.lease_shed_io_overload_threshold` (default 0.9).

The level of enforcement is controlled by `kv.allocator.lease_io_overload_threshold_enforcement`, which determines the action taken when a candidate store for a lease transfer is IO overloaded:

- `ignore`: ignore IO overload scores entirely during lease transfers (effectively disabling this mechanism);
- `block_transfer_to`: lease transfers only consider stores that aren't IO overloaded (existing leases on IO-overloaded stores are left as is);
- `shed`: actively shed leases from IO-overloaded stores to less IO-overloaded stores (a superset of `block_transfer_to`).

The default is `block_transfer_to`. This commit also prefixes the existing replica IO overload checks with `Replica`, to avoid confusion between lease and replica IO overload checks.

Resolves: #96508

Release note (ops change): Range leases will no longer be transferred to stores which are IO overloaded.

98041: backupccl: fix off by one index in fileSSTSink file extension r=rhu713 a=rhu713

Currently, the logic that extends the last flushed file in fileSSTSink does not trigger if there is only one flushed file. This failure to extend the first flushed file can result in file entries in the backup manifest with duplicate start keys. For example, if the first export response written to the sink contains partial entries of a single key `a`, then the span of the first file will be `a-a`, and the span of the subsequent file will always be `a-<end_key>`. The presence of these duplicate start keys breaks the encoding of the external manifest files list SST, because the file path + start key combination in the manifest is assumed to be unique.

Fixes #97953

Release note: None

98072: backupccl: replace restore2TB and restoretpccInc tests r=lidorcarmel a=msbutler

This patch removes the restore2TB* roachtests, which ran a 2TB bank restore to benchmark restore performance across a few hardware configurations, and replaces the `restoreTPCCInc/nodes=10` test, which tested our ability to handle a backup with a long chain. It also adds:

1. `restore/tpce/400GB/aws/nodes=4/cpus=16`, to measure how per-node throughput scales when the per-node vCPU count doubles relative to the default.
2. `restore/tpce/400GB/aws/nodes=8/cpus=8`, to measure how per-node throughput scales when the number of nodes doubles relative to the default.
3. `restore/tpce/400GB/aws/backupsIncluded=48/nodes=4/cpus=8`, to measure restore reliability and performance on a 48-length backup chain relative to the default.

A future patch will update the fixtures used in the restore node shutdown scripts and add more perf-based tests.

Fixes #92699

Release note: None

Co-authored-by: Austen McClernon <[email protected]>
Co-authored-by: Rui Hu <[email protected]>
Co-authored-by: Michael Butler <[email protected]>
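For illustration, here is a minimal Go sketch of the lease-transfer IO overload rule described in the allocator commit above (the fileSSTSink and roachtest changes are not sketched). It is not the actual allocator code: the function names are hypothetical, and the constants simply restate the documented defaults of `kv.allocator.lease_io_overload_threshold` and `kv.allocator.lease_shed_io_overload_threshold`. Only the decision logic from the description is reproduced: a candidate is blocked when its score is at or above both the threshold and the mean of the other candidates, and the current leaseholder sheds only at the higher shed threshold.

```go
package main

import "fmt"

// Defaults taken from the commit message; in the real system these are
// cluster settings rather than constants.
const (
	leaseIOOverloadThreshold     = 0.5 // kv.allocator.lease_io_overload_threshold
	leaseShedIOOverloadThreshold = 0.9 // kv.allocator.lease_shed_io_overload_threshold
)

// mean returns the average IO overload score of a set of stores.
func mean(scores []float64) float64 {
	if len(scores) == 0 {
		return 0
	}
	var sum float64
	for _, s := range scores {
		sum += s
	}
	return sum / float64(len(scores))
}

// blockTransferTo reports whether a candidate store should be excluded as a
// lease transfer target: its score must be at or above both the threshold and
// the mean score of the other candidates.
func blockTransferTo(candidate float64, others []float64) bool {
	return candidate >= leaseIOOverloadThreshold && candidate >= mean(others)
}

// shouldShed reports whether the current leaseholder is overloaded enough to
// actively shed its leases (the higher bar).
func shouldShed(leaseholderScore float64) bool {
	return leaseholderScore >= leaseShedIOOverloadThreshold
}

func main() {
	scores := []float64{0.1, 0.3, 0.7}
	for i, c := range scores {
		others := append(append([]float64{}, scores[:i]...), scores[i+1:]...)
		fmt.Printf("store with score %.1f excluded as transfer target: %t\n", c, blockTransferTo(c, others))
	}
	fmt.Printf("leaseholder at 0.95 sheds leases: %t\n", shouldShed(0.95))
}
```

With the default `block_transfer_to` enforcement, only the exclusion check applies; the shed check matters only when the operator opts into the `shed` mode.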
Our current restore roachtest suite should be updated to better reflect customer workloads/topologies. I propose creating a default topology/workload and refactoring our existing tests to be more intentional in how they branch from the default configuration. Ideally, each test that deviates from the default configuration should explicitly test how this deviation affects performance. The new default configuration is described in detail here.
This issue will track work to:
Epic CRDB-20915
Jira issue: CRDB-21924
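To make the "branch from a default configuration" idea in the proposal above concrete, here is a small Go sketch, assuming a functional-options style; the `config`, `option`, and `with*` names are hypothetical and not the actual roachtest code. Every test starts from the default spec and applies only the deviations it intends to measure:

```go
package main

import "fmt"

// config is a hypothetical restore test configuration; the defaults below
// mirror the proposed default topology/workload (4 nodes, 8 vCPUs, tpce 400 GB).
type config struct {
	cloud    string
	nodes    int
	cpus     int
	workload string
}

// An option applies one explicit deviation from the default configuration.
type option func(*config)

func withCloud(cloud string) option { return func(c *config) { c.cloud = cloud } }
func withNodes(n int) option        { return func(c *config) { c.nodes = n } }
func withWorkload(w string) option  { return func(c *config) { c.workload = w } }

// newConfig starts from the default and applies only the stated deviations,
// so each registered test documents exactly what it is varying.
func newConfig(opts ...option) config {
	c := config{cloud: "aws", nodes: 4, cpus: 8, workload: "tpce/400GB"}
	for _, opt := range opts {
		opt(&c)
	}
	return c
}

func main() {
	fmt.Printf("%+v\n", newConfig())                                        // the default test
	fmt.Printf("%+v\n", newConfig(withCloud("gce")))                        // vary only the cloud
	fmt.Printf("%+v\n", newConfig(withWorkload("tpce/8TB"), withNodes(10))) // the larger fixture on more nodes
}
```

A shape like this makes each deviation explicit at the call site, which is what makes a performance comparison against the default configuration meaningful.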