c2c: write-rate roachtests for cluster-to-cluster streaming #89176

Open
1 of 4 tasks
stevendanna opened this issue Oct 3, 2022 · 2 comments
Labels
C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-disaster-recovery

Comments

@stevendanna
Collaborator

stevendanna commented Oct 3, 2022

We want a series of roachtests that provide us with information about the write rate we can sustain while maintaining a constant replication lag.

  • The test should run the KV95 workload. We can use the existing KV95 tests to determine an appropriate load level.
  • The test should fail if the replication lag grows beyond some pre-defined bound. The CDC tests have an example of doing this (see the sketch after this list).
  • The test should export performance data from the workload to roachperf.
  • Roachperf should be modified to render the new graphs for our tests.
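
Concretely, the lag check could look roughly like the following. This is only a sketch under assumed names: `queryReplicationLag`, the `my_replication_status` query inside it, and the two-minute bound are placeholders, not real roachtest helpers or a real CockroachDB introspection surface.

```go
package c2ctest

import (
	"context"
	"database/sql"
	"fmt"
	"time"
)

// maxLag is the pre-defined bound the test would enforce while KV95 runs;
// the exact value here is a placeholder.
const maxLag = 2 * time.Minute

// queryReplicationLag is a placeholder for however the test reads the
// replicated timestamp of the ingestion job; the query below is illustrative,
// not a real CockroachDB surface.
func queryReplicationLag(ctx context.Context, db *sql.DB) (time.Duration, error) {
	var replicatedTime time.Time
	if err := db.QueryRowContext(ctx,
		`SELECT replicated_time FROM my_replication_status`).Scan(&replicatedTime); err != nil {
		return 0, err
	}
	return time.Since(replicatedTime), nil
}

// monitorLag polls until ctx is cancelled (e.g. when the workload finishes)
// and returns an error as soon as the bound is violated, which the roachtest
// would treat as a failure.
func monitorLag(ctx context.Context, db *sql.DB) error {
	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return nil
		case <-ticker.C:
			lag, err := queryReplicationLag(ctx, db)
			if err != nil {
				return err
			}
			if lag > maxLag {
				return fmt.Errorf("replication lag %s exceeded bound %s", lag, maxLag)
			}
		}
	}
}
```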

Jira issue: CRDB-20159

Epic CRDB-18751

@stevendanna stevendanna added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Oct 3, 2022
@blathers-crl

blathers-crl bot commented Oct 3, 2022

cc @cockroachdb/disaster-recovery

@exalate-issue-sync exalate-issue-sync bot assigned msbutler and unassigned stevendanna Jan 4, 2023
msbutler added a commit to msbutler/cockroach that referenced this issue Jan 13, 2023
Previously in c2c roachtests, the foreground workload on the src cluster would
run for a predefined amount of time, based on the expected initial scan time.
But if this estimated initial scan time wasn't accurate, the roachtest would
not properly simulate a c2c customer workload. For example, if the initial scan
actually took much longer than expected, the workload would finish before the
initial scan!

This patch removes the need to specify a duration for the src cluster workload.
Instead, the goroutine running the workload will get cancelled at cutover time,
determined by the `replicationTestSpec.additionalDuration` field, which
specifies how long the workload should run after the initial scan completes.

This patch also adds additional logging which provides instructions for opening
a SQL session to the tenant and opening the tenant's DB Console.

Informs cockroachdb#89176

Release note: None
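
For illustration, the cancel-at-cutover pattern this commit describes could be sketched as below. `runWorkload` and `waitForInitialScan` are hypothetical stand-ins passed in as parameters, not the actual roachtest helpers.

```go
package c2ctest

import (
	"context"
	"errors"
	"time"
)

// runUntilCutover illustrates the pattern described above: the src-cluster
// workload no longer runs for a fixed duration; instead its goroutine is
// cancelled once additionalDuration has elapsed after the initial scan
// completes.
func runUntilCutover(
	ctx context.Context,
	additionalDuration time.Duration,
	runWorkload func(context.Context) error,
	waitForInitialScan func(context.Context) error,
) error {
	workloadCtx, cancelWorkload := context.WithCancel(ctx)
	defer cancelWorkload()

	errCh := make(chan error, 1)
	go func() { errCh <- runWorkload(workloadCtx) }()

	// Wait for the initial scan so additionalDuration measures steady-state
	// replication rather than scan time.
	if err := waitForInitialScan(ctx); err != nil {
		return err
	}

	select {
	case <-time.After(additionalDuration):
		// Cutover time: stop the workload and wait for its goroutine to exit.
		cancelWorkload()
		if err := <-errCh; err != nil && !errors.Is(err, context.Canceled) {
			return err
		}
		return nil
	case err := <-errCh:
		// The workload ended (or failed) on its own before cutover time.
		return err
	case <-ctx.Done():
		return ctx.Err()
	}
}
```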
craig bot pushed a commit that referenced this issue Jan 19, 2023
95161: kv: Test to measure slowdown after a node restart r=irfansharif a=andrewbaptist

After a node is down for a few minutes and then starts up again, there is a slowdown related to it catching up on Raft messages it missed while down. This can cause an IO Overload scenario and greatly impact performance on the cluster.

This adds a test for the issue; a separate PR will be created to enable this test and fix the issue.

Informs: #95159

Epic: none
Release note: None

95191: c2c: increase c2c roachtest workload flexibility r=stevendanna,renatolabs a=msbutler

Previously in c2c roachtests, the foreground workload on the src cluster would run for a predefined amount of time, based on the expected initial scan time. But if this estimated initial scan time wasn't accurate, the roachtest would not properly simulate a c2c customer workload. For example, if the initial scan actually took much longer than expected, the workload would finish before the initial scan!

This patch removes the need to specify a duration for the src cluster workload. Instead, the goroutine running the workload will get cancelled at cutover time, determined by the `replicationTestSpec.additionalDuration` field, which specifies how long the workload should run after the initial scan completes.

This patch also adds additional logging which provides instructions for opening a SQL session to the tenant and opening the tenant's DB Console.

Informs #89176

Release note: None

95407: pkg/cloud/azure: migrate to new azure sdk r=benbardin,dt a=msbutler

This patch replaces the deprecated Azure SDK with Azure's new SDK, which is used
to read and write backups to Azure. This PR also reverts the custom Azure put
uploader, which, if necessary, can be re-added to use the new SDK in the future.

This PR should unblock Azure KMS work.

Informs #86903

Epic CRDB-18954

Co-authored-by: Andrew Baptist <[email protected]>
Co-authored-by: Michael Butler <[email protected]>
Co-authored-by: Steven Danna <[email protected]>
craig bot pushed a commit that referenced this issue Feb 27, 2023
97465: c2c: gather perf metrics from prometheus r=stevendanna a=msbutler

c2c roachtest performance metrics are now gathered by a prom/grafana instance running locally on the roachprod cluster. This change allows us to gather and process any metrics exposed to the crdb prom endpoint. Specifically, we now gather: `capacity_used`, `replication_logical_bytes`, `replication_sst_bytes` at various points during the c2c roachtest, allowing us to measure:
- Initial Scan Throughput: initial scan size / initial scan duration
- Workload Throughput: data ingested during workload / workload duration
- Cutover Throughput: (data ingested between cutover time and cutover cmd) / (cutover process duration)

where the size of these operations can be measured as either physical replicated bytes, logical ingested bytes, or physical ingested bytes on the source cluster.
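
For reference, the throughput arithmetic above amounts to something like the following sketch (names are illustrative, not the actual roachtest code): a cumulative byte counter such as `replication_logical_bytes` is sampled at the start and end of a phase and divided by the phase duration.

```go
package c2ctest

import "time"

// throughputMBPerSec computes (bytes ingested during a phase) / (phase
// duration) in MB/s, given two samples of a cumulative byte counter.
func throughputMBPerSec(startBytes, endBytes int64, start, end time.Time) float64 {
	elapsed := end.Sub(start).Seconds()
	if elapsed <= 0 {
		return 0
	}
	return float64(endBytes-startBytes) / (1 << 20) / elapsed
}
```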

This patch also fixes a recent bug which mislabeled src cluster throughput as initial scan throughput.

Informs #89176

Release note: None

97505: server, ui: remove interpreted jobs retrying status  r=xinhaoz a=xinhaoz

This commit removes the 'Retrying' status from the jobs UX.
Previously, we were interpolating this status from the running
status. This just added confusion and incorrectness to the status
of the job being displayed. The status being surfaced now aligns
directly with what is shown in the `crdb_internal.jobs` table.

Some missing job statuses were also added as request options to
the 'Status' dropdown, including:
- Pause Requested
- Cancel Requested
- Revert Failed

Fixes: #95712

Release note (ui change): Retrying is no longer a status shown
in the jobs page.



Co-authored-by: Michael Butler <[email protected]>
Co-authored-by: Xin Hao Zhang <[email protected]>
msbutler added a commit to msbutler/cockroach that referenced this issue Mar 13, 2023
This patch refactors the roachtest driver such that:
1) the streamingWorkload interface can run a custom workload with arbitrary sql
queries (a rough sketch of this interface follows this commit message).
2) to reduce helper function signature bloat, many helper functions are now
replicationTestSpec methods.
3) the test writer can specify an `additionalDuration` of 0, which allows the
workload to terminate on its own.
4) a health monitor will fail the test if it cannot connect to a node.

This patch also adds two new roachtests:
- c2c/BulkOps: runs the backup/mvcc-range-tombstones roachtest on
  the source cluster (without the backup-restore roundtrips for now), and
  streams it to the destination.
- c2c/UnitTest: a quick roachtest that can be used to debug the c2c roachtest
  infrastructure.

Informs cockroachdb#89176

Release note: None
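
For orientation, one possible shape of the streamingWorkload interface mentioned in this commit is sketched below; the method set is an assumption for illustration only, not the actual definition.

```go
package c2ctest

// streamingWorkload is named in the commit above; the methods below are an
// assumed shape. The idea is that any workload able to produce an init
// command and a run command against the source tenant can be plugged into
// the c2c roachtest driver.
type streamingWorkload interface {
	// sourceInitCmd returns the command that seeds the workload's schema/data
	// on the source tenant (may be empty for workloads that need no init).
	sourceInitCmd(tenantPGURL string) string
	// sourceRunCmd returns the command that drives the foreground workload
	// against the source tenant until the driver cancels it at cutover time.
	sourceRunCmd(tenantPGURL string) string
}
```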
craig bot pushed a commit that referenced this issue Mar 14, 2023
98295: c2c: refactor roachtest driver to run and stream arbitrary workloads r=stevendanna a=msbutler

This patch refactors the roachtest driver such that:
1) the streamingWorkload interface can run a custom workload with arbitrary sql
queries.
2) to reduce helper function signature bloat, many helper functions are now
replicationTestSpec methods.
3) the test writer can specify an `additionalDuration` of 0, which allows the
workload to terminate on its own.
4) a health monitor will fail the test if it cannot connect to a node

This patch also adds two new roachtests:
- c2c/BulkOps: runs the backup/mvcc-range-tombstones roachtest on
  the source cluster (without the backup-restore roundtrips for now), and
  streams it to the destination.
- c2c/UnitTest: a quick roachtest that can be used to debug the c2c roachtest
  infrastructure.

Informs #89176

Release note: None

98445: sql_instance: migrate to rbr compatible index r=JeffSwenson a=JeffSwenson

Migrate sql_instance to a regional by row compatible index. The version
gates are intended to follow the protocol discussed in the comment at
the top of upgrades/system_rbr_indexes.go

The crdb_region column ID was changed from 5 to 6 in order to match the
logical order in which the sql_addr and crdb_region columns were added.
The exact ID doesn't really matter in this case since the sql_addr
column was added in v23.1.

Most of the rbr migration work is the same for sqlliveness, lease, and
sql_instances. The main exception to that is the migration cache used by
the sql instance reader. The cache is backed by a range feed and we need
to switch implementations when the version setting changes.

Part of #94843

Release note: None

Co-authored-by: Michael Butler <[email protected]>
Co-authored-by: Jeff <[email protected]>
@msbutler
Collaborator

not actively working on this. unassigning.

@msbutler msbutler removed their assignment Jun 14, 2023