c2c: write-rate roachtests for cluster-to-cluster streaming #89176

Open
1 of 4 tasks
stevendanna opened this issue Oct 3, 2022 · 2 comments
Labels
C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-disaster-recovery

Comments

@stevendanna
Collaborator

stevendanna commented Oct 3, 2022

We want a series of roachtests that provide us with information about the write rate we can sustain while maintaining a constant replication lag.

  • The test should run the KV95 workload. We can use the existing KV95 tests to determine an appropriate load level.
  • The test should fail if the replication lag grows beyond some pre-defined bound. The CDC tests have an example of doing this (see the sketch after this list).
  • The test should export performance data from the workload to roachperf.
  • Roachperf should be modified to render the new graphs for our tests.
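
Concretely, the lag check could look roughly like the following. This is only a sketch under assumed names: `queryReplicationLag`, the `my_replication_status` query inside it, and the two-minute bound are placeholders, not real roachtest helpers or a real CockroachDB introspection surface.

```go
package c2ctest

import (
	"context"
	"database/sql"
	"fmt"
	"time"
)

// maxLag is the pre-defined bound the test would enforce while KV95 runs;
// the exact value here is a placeholder.
const maxLag = 2 * time.Minute

// queryReplicationLag is a placeholder for however the test reads the
// replicated timestamp of the ingestion job; the query below is illustrative,
// not a real CockroachDB surface.
func queryReplicationLag(ctx context.Context, db *sql.DB) (time.Duration, error) {
	var replicatedTime time.Time
	if err := db.QueryRowContext(ctx,
		`SELECT replicated_time FROM my_replication_status`).Scan(&replicatedTime); err != nil {
		return 0, err
	}
	return time.Since(replicatedTime), nil
}

// monitorLag polls until ctx is cancelled (e.g. when the workload finishes)
// and returns an error as soon as the bound is violated, which the roachtest
// would treat as a failure.
func monitorLag(ctx context.Context, db *sql.DB) error {
	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return nil
		case <-ticker.C:
			lag, err := queryReplicationLag(ctx, db)
			if err != nil {
				return err
			}
			if lag > maxLag {
				return fmt.Errorf("replication lag %s exceeded bound %s", lag, maxLag)
			}
		}
	}
}
```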

Jira issue: CRDB-20159

Epic CRDB-18751

@stevendanna stevendanna added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Oct 3, 2022
@blathers-crl

blathers-crl bot commented Oct 3, 2022

cc @cockroachdb/disaster-recovery

@exalate-issue-sync exalate-issue-sync bot assigned msbutler and unassigned stevendanna Jan 4, 2023
msbutler added a commit to msbutler/cockroach that referenced this issue Jan 13, 2023
Previously in c2c roachtests, the foreground workload on the src cluster would
run for a predefined amount of time, based on the expected initial scan time.
But if this estimated initial scan time wasn't accurate, the roachtest would
not properly simulate a c2c customer workload. For example, if the initial scan
actually took much longer than expected, the workload would finish before the
initial scan!

This patch removes the need to specify a duration for the src cluster workload.
Instead, the goroutine running the workload will get cancelled at cutover time,
determined by the `replicationTestSpec.additionalDuration` field, which
specifies how long the workload should run after the initial scan completes.

This patch also adds additional logging which provides instructions for opening
a SQL session to the tenant and opening the tenant's DB Console.

Informs cockroachdb#89176

Release note: None
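
For illustration, the cancel-at-cutover pattern this commit describes could be sketched as below. `runWorkload` and `waitForInitialScan` are hypothetical stand-ins passed in as parameters, not the actual roachtest helpers.

```go
package c2ctest

import (
	"context"
	"errors"
	"time"
)

// runUntilCutover illustrates the pattern described above: the src-cluster
// workload no longer runs for a fixed duration; instead its goroutine is
// cancelled once additionalDuration has elapsed after the initial scan
// completes.
func runUntilCutover(
	ctx context.Context,
	additionalDuration time.Duration,
	runWorkload func(context.Context) error,
	waitForInitialScan func(context.Context) error,
) error {
	workloadCtx, cancelWorkload := context.WithCancel(ctx)
	defer cancelWorkload()

	errCh := make(chan error, 1)
	go func() { errCh <- runWorkload(workloadCtx) }()

	// Wait for the initial scan so additionalDuration measures steady-state
	// replication rather than scan time.
	if err := waitForInitialScan(ctx); err != nil {
		return err
	}

	select {
	case <-time.After(additionalDuration):
		// Cutover time: stop the workload and wait for its goroutine to exit.
		cancelWorkload()
		if err := <-errCh; err != nil && !errors.Is(err, context.Canceled) {
			return err
		}
		return nil
	case err := <-errCh:
		// The workload ended (or failed) on its own before cutover time.
		return err
	case <-ctx.Done():
		return ctx.Err()
	}
}
```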
craig bot pushed a commit that referenced this issue Jan 19, 2023
95161: kv: Test to measure slowdown after a node restart r=irfansharif a=andrewbaptist

After a node is down for a few minutes and then starts up again, there is a slowdown related to it catching up on Raft messages it missed while down. This can cause an IO Overload scenario and greatly impact performance on the cluster.

This adds a test for the issue; a separate PR will be created to enable this test and fix the issue.

Informs: #95159

Epic: none
Release note: None

95191: c2c: increase c2c roachtest workload flexibility r=stevendanna,renatolabs a=msbutler

Previously in c2c roachtests, the foreground workload on the src cluster would run for a predefined amount of time, based on the expected initial scan time. But if this estimated initial scan time wasn't accurate, the roachtest would not properly simulate a c2c customer workload. For example, if the initial scan actually took much longer than expected, the workload would finish before the initial scan!

This patch removes the need to specify a duration for the src cluster workload. Instead, the goroutine running the workload will get cancelled at cutover time, determined by the `replicationTestSpec.additionalDuration` field, which specifies how long the workload should run after the initial scan completes.

This patch also adds additional logging which provides instructions for opening a SQL session to the tenant and opening the tenant's DB Console.

Informs #89176

Release note: None

95407: pkg/cloud/azure: migrate to new azure sdk r=benbardin,dt a=msbutler

This patch replaces the deprecated Azure SDK with Azure's new SDK, which is used
to read and write backups to Azure. This PR also reverts the custom Azure put
uploader, which, if necessary, can be re-added to use the new SDK in the future.

This PR should unblock Azure KMS work.

Informs #86903

Epic CRDB-18954

Co-authored-by: Andrew Baptist <[email protected]>
Co-authored-by: Michael Butler <[email protected]>
Co-authored-by: Steven Danna <[email protected]>
craig bot pushed a commit that referenced this issue Feb 27, 2023
97465: c2c: gather perf metrics from prometheus r=stevendanna a=msbutler

c2c roachtest performance metrics are now gathered by a prom/grafana instance running locally on the roachprod cluster. This change allows us to gather and process any metrics exposed to the crdb prom endpoint. Specifically, we now gather: `capacity_used`, `replication_logical_bytes`, `replication_sst_bytes` at various points during the c2c roachtest, allowing us to measure:
- Initial Scan Throughput: initial scan size / initial scan duration
- Workload Throughput: data ingested during workload / workload duration
- Cutover Throughput: (data ingested between cutover time and cutover cmd) / (cutover process duration)

where the size of these operations can be measured as either physical replicated bytes, logical ingested bytes, or physical ingested bytes on the source cluster.
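
For reference, the throughput arithmetic above amounts to something like the following sketch (names are illustrative, not the actual roachtest code): a cumulative byte counter such as `replication_logical_bytes` is sampled at the start and end of a phase and divided by the phase duration.

```go
package c2ctest

import "time"

// throughputMBPerSec computes (bytes ingested during a phase) / (phase
// duration) in MB/s, given two samples of a cumulative byte counter.
func throughputMBPerSec(startBytes, endBytes int64, start, end time.Time) float64 {
	elapsed := end.Sub(start).Seconds()
	if elapsed <= 0 {
		return 0
	}
	return float64(endBytes-startBytes) / (1 << 20) / elapsed
}
```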

This patch also fixes a recent bug which mislabeled src cluster throughput as initial scan throughput.

Informs #89176

Release note: None

97505: server, ui: remove interpreted jobs retrying status  r=xinhaoz a=xinhaoz

This commit removes the 'Retrying' status from the jobs UX.
Previously, we were interpolating this status from the running
status. This just added confusion and incorrectness to the status
of the job being displayed. The status being surfaced now aligns
directly with what is shown in the `crdb_internal.jobs` table.

Some missing job statuses were also added as request options to
the 'Status' dropdown, including:
- Pause Requested
- Cancel Requested
- Revert Failed

Fixes: #95712

Release note (ui change): Retrying is no longer a status shown
in the jobs page.



Co-authored-by: Michael Butler <[email protected]>
Co-authored-by: Xin Hao Zhang <[email protected]>
msbutler added a commit to msbutler/cockroach that referenced this issue Mar 13, 2023
This patch refactors the roachtest driver such that:
1) the streamingWorkload interface can run a custom workload with arbitrary sql
queries (a rough sketch of this interface follows this commit message).
2) to reduce helper function signature bloat, many helper functions are now
replicationTestSpec methods.
3) the test writer can specify an `additionalDuration` of 0, which allows the
workload to terminate on its own.
4) a health monitor will fail the test if it cannot connect to a node.

This patch also adds two new roachtests:
- c2c/BulkOps: runs the backup/mvcc-range-tombstones roachtest on
  the source cluster (without the backup-restore roundtrips for now), and
  streams it to the destination.
- c2c/UnitTest: a quick roachtest that can be used to debug the c2c roachtest
  infrastructure.

Informs cockroachdb#89176

Release note: None
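
For orientation, one possible shape of the streamingWorkload interface mentioned in this commit is sketched below; the method set is an assumption for illustration only, not the actual definition.

```go
package c2ctest

// streamingWorkload is named in the commit above; the methods below are an
// assumed shape. The idea is that any workload able to produce an init
// command and a run command against the source tenant can be plugged into
// the c2c roachtest driver.
type streamingWorkload interface {
	// sourceInitCmd returns the command that seeds the workload's schema/data
	// on the source tenant (may be empty for workloads that need no init).
	sourceInitCmd(tenantPGURL string) string
	// sourceRunCmd returns the command that drives the foreground workload
	// against the source tenant until the driver cancels it at cutover time.
	sourceRunCmd(tenantPGURL string) string
}
```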
craig bot pushed a commit that referenced this issue Mar 14, 2023
98295: c2c: refactor roachtest driver to run and stream arbitrary workloads r=stevendanna a=msbutler

This patch refactors the roachtest driver such that:
1) the streamingWorkload interface can run a custom workload with arbitrary sql
queries.
2) to reduce helper function signature bloat, many helper functions are now
replicationTestSpec methods.
3) the test writer can specify an `additionalDuration` of 0, which allows the
workload to terminate on its own.
4) a health monitor will fail the test if it cannot connect to a node

This patch also adds two new roachtests:
- c2c/BulkOps: runs the backup/mvcc-range-tombstones roachtest on
  the source cluster (without the backup-restore roundtrips for now), and
  streams it to the destination.
- c2c/UnitTest: a quick roachtest that can be used to debug the c2c roachtest
  infrastructure.

Informs #89176

Release note: None

98445: sql_instance: migrate to rbr compatible index r=JeffSwenson a=JeffSwenson

Migrate sql_instance to a regional by row compatible index. The version
gates are intended to follow the protocol discussed in the comment at
the top of upgrades/system_rbr_indexes.go

The crdb_region column ID was changed from 5 to 6 in order to match the
logical order in which the sql_addr and crdb_region columns were added.
The exact ID doesn't really matter in this case since the sql_addr
column was added in v23.1.

Most of the rbr migration work is the same for sqlliveness, lease, and
sql_instances. The main exception to that is the migration cache used by
the sql instance reader. The cache is backed by a range feed and we need
to switch implementations when the version setting changes.

Part of #94843

Release note: None

Co-authored-by: Michael Butler <[email protected]>
Co-authored-by: Jeff <[email protected]>
@msbutler
Collaborator

not actively working on this. unassigning.

@msbutler msbutler removed their assignment Jun 14, 2023