c2c: write-rate roachtests for cluster-to-cluster streaming #89176
Labels: C-enhancement (solution expected to add code/behavior + preserve backward-compat; pg compat issues are the exception), T-disaster-recovery

We want a series of roachtests that provide us with information about the write rate we can sustain while maintaining a constant replication lag.

Jira issue: CRDB-20159
Epic: CRDB-18751
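As a rough illustration of the measurement the issue asks for, here is a minimal Go sketch: ramp the foreground write rate and record the highest rate at which replication lag stays bounded. `replicationLagSeconds` and `applyWriteRate` are hypothetical stand-ins (for querying the stream's frontier and retuning the workload), not functions in the roachtest driver.

```go
// Sketch: find the highest write rate sustainable under a lag bound.
// All helper names here are illustrative placeholders.
package main

import (
	"fmt"
	"time"
)

func replicationLagSeconds() float64 { return 5 /* e.g. now - frontier timestamp */ }
func applyWriteRate(opsPerSec int)   { /* retune the foreground workload */ }

func main() {
	const maxLag = 60.0 // seconds of lag we are willing to sustain
	sustained := 0

	for rate := 1000; rate <= 16000; rate *= 2 {
		applyWriteRate(rate)
		time.Sleep(10 * time.Minute) // let the stream reach steady state
		if replicationLagSeconds() > maxLag {
			break // lag is growing; the previous rate was the sustainable one
		}
		sustained = rate
	}
	fmt.Printf("sustained write rate: %d ops/sec at <= %.0fs lag\n", sustained, maxLag)
}
```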
stevendanna added the C-enhancement label on Oct 3, 2022.

cc @cockroachdb/disaster-recovery
msbutler added a commit to msbutler/cockroach that referenced this issue on Jan 13, 2023:

Previously in c2c roachtests, the foreground workload on the src cluster would run for a predefined amount of time, based on the expected initial scan time. But if this estimated initial scan time wasn't accurate, the roachtest would not properly simulate a c2c customer workload; e.g., if the initial scan actually took much longer than expected, the workload would finish before the initial scan! This patch removes the need to specify a duration for the src cluster workload. Instead, the goroutine running the workload gets cancelled at cutover time, determined by the `replicationTestSpec.additionalDuration` field, which specifies how long the workload should run after the initial scan completes. This patch also adds additional logging which provides instructions for opening a sql session to the tenant and opening the tenant's dbconsole.

Informs cockroachdb#89176
Release note: None
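The cancellation pattern the commit describes can be sketched in a few lines of Go. The names below (`runWorkload`, `waitForInitialScan`, the `additionalDuration` variable) are illustrative stand-ins for the roachtest driver's internals, not its actual API:

```go
// Sketch of cutover-driven workload cancellation, assuming hypothetical
// helper names; the real driver wires this through replicationTestSpec.
package main

import (
	"context"
	"fmt"
	"time"
)

func runWorkload(ctx context.Context) error {
	for {
		select {
		case <-ctx.Done():
			return ctx.Err() // cancelled at cutover time
		case <-time.After(100 * time.Millisecond):
			// issue one unit of foreground work against the src cluster
		}
	}
}

func waitForInitialScan() { /* poll the replication job until the scan finishes */ }

func main() {
	additionalDuration := 10 * time.Minute // replicationTestSpec.additionalDuration

	ctx, cancel := context.WithCancel(context.Background())
	workloadDone := make(chan error, 1)
	go func() { workloadDone <- runWorkload(ctx) }()

	// Rather than guessing the initial scan time up front, wait for the
	// scan to actually complete, then run for additionalDuration more.
	waitForInitialScan()
	time.Sleep(additionalDuration)
	cancel() // cutover: stop the workload goroutine

	fmt.Println("workload exited:", <-workloadDone)
}
```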
msbutler added further commits to msbutler/cockroach that referenced this issue on Jan 13 and Jan 19, 2023, each carrying the same commit message as above.
craig bot pushed a commit that referenced this issue on Jan 19, 2023:

95161: kv: Test to measure slowdown after a node restart r=irfansharif a=andrewbaptist

After a node is down for a few minutes and then starts up again, there is a slowdown related to it catching up on Raft messages it missed while down. This can cause an IO overload scenario and greatly impact performance on the cluster. This adds a test for the issue; a separate PR will be created to enable this test and fix the issue.

Informs: #95159
Epic: none
Release note: None

95191: c2c: increase c2c roachtest workload flexibility r=stevendanna,renatolabs a=msbutler

Same commit message as the msbutler commits above.

Informs #89176
Release note: None

95407: pkg/cloud/azure: migrate to new azure sdk r=benbardin,dt a=msbutler

This patch replaces the deprecated azure sdk with azure's new sdk, used to read and write backups to azure. This PR also reverts the custom azure put uploader, which, if necessary, can be re-added to use the new sdk in the future. This PR should unblock azure kms work.

Informs #86903
Epic: CRDB-18954

Co-authored-by: Andrew Baptist <[email protected]>
Co-authored-by: Michael Butler <[email protected]>
Co-authored-by: Steven Danna <[email protected]>
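For context on the sdk migration in #95407, a hedged sketch of a blob write with the new Azure SDK for Go (`github.com/Azure/azure-sdk-for-go/sdk/storage/azblob`), which replaced the deprecated `azure-storage-blob-go` package. The connection string, container, and blob names are placeholders; this is not the `pkg/cloud/azure` code itself:

```go
// Minimal azblob upload sketch; names and paths are placeholders.
package main

import (
	"context"
	"log"
	"os"
	"strings"

	"github.com/Azure/azure-sdk-for-go/sdk/storage/azblob"
)

func main() {
	client, err := azblob.NewClientFromConnectionString(
		os.Getenv("AZURE_STORAGE_CONNECTION_STRING"), nil)
	if err != nil {
		log.Fatal(err)
	}

	// UploadStream lets the sdk handle chunking and retries, which is
	// roughly the role the reverted custom put uploader used to play.
	_, err = client.UploadStream(context.Background(),
		"backups", "2023/backup.sst", strings.NewReader("sst bytes..."), nil)
	if err != nil {
		log.Fatal(err)
	}
}
```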
craig bot pushed a commit that referenced this issue on Feb 27, 2023:

97465: c2c: gather perf metrics from prometheus r=stevendanna a=msbutler

c2c roachtest performance metrics are now gathered by a prom/grafana instance running locally on the roachprod cluster. This change allows us to gather and process any metrics exposed to the crdb prom endpoint. Specifically, we now gather `capacity_used`, `replication_logical_bytes`, and `replication_sst_bytes` at various points during the c2c roachtest, allowing us to measure:

- Initial Scan Throughput: initial scan size / initial scan duration
- Workload Throughput: data ingested during workload / workload duration
- Cutover Throughput: (data ingested between cutover time and cutover cmd) / (cutover process duration)

where the size of these operations can be measured as either physical replicated bytes, logical ingested bytes, or physical ingested bytes on the source cluster. This patch also fixes a recent bug which mislabeled src cluster throughput as initial scan throughput.

Informs #89176
Release note: None

97505: server, ui: remove interpreted jobs retrying status r=xinhaoz a=xinhaoz

This commit removes the 'Retrying' status from the jobs UX. Previously, we were interpolating this status from the running status, which just added confusion and incorrectness to the status of the job being displayed. The status being surfaced now aligns directly with what is shown in the `crdb_internal.jobs` table. Some missing job statuses were also added as request options to the 'Status' dropdown, including:

- Pause Requested
- Cancel Requested
- Revert Failed

Fixes: #95712
Release note (ui change): Retrying is no longer a status shown in the jobs page.

Co-authored-by: Michael Butler <[email protected]>
Co-authored-by: Xin Hao Zhang <[email protected]>
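The throughput arithmetic in #97465 reduces to a delta of a cumulative byte counter over a time window. A minimal sketch, assuming a hypothetical snapshot helper around the prom endpoint (the metric name matches the commit message; the job phases and numbers are made up for illustration):

```go
// Sketch: throughput = (bytes ingested between two samples) / elapsed time.
package main

import (
	"fmt"
	"time"
)

// metricSnapshot holds one poll of the crdb prom endpoint (hypothetical).
type metricSnapshot struct {
	at                      time.Time
	replicationLogicalBytes float64 // cumulative logical bytes ingested
}

func throughputMBps(start, end metricSnapshot) float64 {
	elapsed := end.at.Sub(start.at).Seconds()
	return (end.replicationLogicalBytes - start.replicationLogicalBytes) / elapsed / 1e6
}

func main() {
	scanStart := metricSnapshot{at: time.Now(), replicationLogicalBytes: 0}
	scanEnd := metricSnapshot{
		at:                      scanStart.at.Add(20 * time.Minute),
		replicationLogicalBytes: 120e9, // 120 GB ingested by end of scan
	}

	// Initial Scan Throughput: initial scan size / initial scan duration.
	fmt.Printf("initial scan: %.1f MB/s\n", throughputMBps(scanStart, scanEnd)) // 100.0 MB/s
}
```

The same calculation applies to the workload and cutover windows by swapping in the corresponding start and end snapshots, or substituting `capacity_used` or `replication_sst_bytes` for the physical-bytes variants.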
msbutler added a commit to msbutler/cockroach that referenced this issue on Mar 13, 2023:

This patch refactors the roachtest driver such that:

1) the streamingWorkload interface can run a custom workload with arbitrary sql queries;
2) to reduce helper function signature bloat, many helper functions are now replicationTestSpec methods;
3) the test writer can specify an `additionalDuration` of 0, which allows the workload to terminate on its own;
4) a health monitor will fail the test if it cannot connect to a node.

This patch also adds two new roachtests:

- c2c/BulkOps: runs the backup/mvcc-range-tombstones roachtest on the source cluster (without the backup-restore roundtrips for now), and streams it to the destination.
- c2c/UnitTest: a quick roachtest that can be used to debug the c2c roachtest infrastructure.

Informs cockroachdb#89176
Release note: None
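To make the refactor concrete, here is a sketch of what a pluggable workload abstraction like the streamingWorkload interface mentioned above could look like. The interface name comes from the commit message, but the method set is a guess for illustration, not the actual driver API:

```go
// Guessed shape of a pluggable workload interface; only the name
// streamingWorkload comes from the commit message.
package main

import (
	"context"
	"database/sql"
)

type streamingWorkload interface {
	// sourceInitCmd runs setup on the src cluster before streaming starts.
	sourceInitCmd(db *sql.DB) error
	// run drives foreground traffic until ctx is cancelled (or returns on
	// its own when additionalDuration is 0).
	run(ctx context.Context, db *sql.DB) error
}

// sqlStmtWorkload runs an arbitrary list of sql statements in a loop,
// which is what point (1) of the refactor enables.
type sqlStmtWorkload struct {
	initStmts, loopStmts []string
}

var _ streamingWorkload = sqlStmtWorkload{}

func (w sqlStmtWorkload) sourceInitCmd(db *sql.DB) error {
	for _, s := range w.initStmts {
		if _, err := db.Exec(s); err != nil {
			return err
		}
	}
	return nil
}

func (w sqlStmtWorkload) run(ctx context.Context, db *sql.DB) error {
	for {
		for _, s := range w.loopStmts {
			if _, err := db.ExecContext(ctx, s); err != nil {
				return err // includes ctx cancellation at cutover
			}
		}
	}
}
```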
msbutler added a commit to msbutler/cockroach that referenced this issue on Mar 14, 2023, with the same commit message as above.
craig bot pushed a commit that referenced this issue on Mar 14, 2023:

98295: c2c: refactor roachtest driver to run and stream arbitrary workloads r=stevendanna a=msbutler

Same commit message as the msbutler commits above.

Informs #89176
Release note: None

98445: sql_instance: migrate to rbr compatible index r=JeffSwenson a=JeffSwenson

Migrate sql_instance to a regional by row compatible index. The version gates are intended to follow the protocol discussed in the comment at the top of upgrades/system_rbr_indexes.go. The crdb_region column ID was changed from 5 to 6 in order to match the logical order in which the sql_addr and crdb_region columns were added. The exact ID doesn't really matter in this case, since the sql_addr column was added in v23.1. Most of the rbr migration work is the same for sqlliveness, lease, and sql_instances. The main exception is the migration cache used by the sql instance reader: the cache is backed by a range feed, and we need to switch implementations when the version setting changes.

Part of #94843
Release note: None

Co-authored-by: Michael Butler <[email protected]>
Co-authored-by: Jeff <[email protected]>
not actively working on this. unassigning.