roachtest: Add new roachtest to test fast rebalance #103030

Closed
itsbilal opened this issue May 10, 2023 · 1 comment · Fixed by #107394

Labels
A-storage Relating to our storage engine (Pebble) on-disk storage. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-storage Storage Team

Comments

@itsbilal
Member

itsbilal commented May 10, 2023

Once #103028 is complete, a roachtest that creates a cluster, loads a fixture, and adds a node or two (and ensures they catch up in replica count without crashing) would be good to have as an end-to-end test for disaggregated ingestions / fast rebalances.

Jira issue: CRDB-27802

@itsbilal itsbilal added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-storage Relating to our storage engine (Pebble) on-disk storage. T-storage Storage Team labels May 10, 2023
@itsbilal
Member Author

One challenge with this is going to be orchestrating access to an S3 or GCS bucket to use as shared storage, and also handling things like cleanup, quotas/billing, etc. One way to work around this could be to spin up one extra node in the cluster and use it as the blob storage, through minio or something similar on that node.

cc @RaduBerinde (thanks for bringing this issue up)

itsbilal added a commit to itsbilal/cockroach that referenced this issue Jul 21, 2023
This test adds a roachtest that spins up a cluster with
3 nodes using S3 as the --experimental-shared-storage, and then
adds a fourth node after loading a tpcc fixture and with a foreground
workload running on it. It confirms the fourth node gets hydrated
without transferring all live bytes over the wire.

Epic: none
Fixes: cockroachdb#103030

Release note: None
itsbilal added a commit to itsbilal/cockroach that referenced this issue Jul 24, 2023
itsbilal added a commit to itsbilal/cockroach that referenced this issue Jul 25, 2023
craig bot pushed a commit that referenced this issue Aug 17, 2023
107394: cmd/roachtest: add disagg-rebalance roachtest r=renatolabs a=itsbilal

This test adds a roachtest that spins up a cluster with 3 nodes using S3 as the --experimental-shared-storage, and then adds a fourth node after loading a tpcc fixture and with a foreground workload running on it. It confirms the fourth node gets hydrated without transferring all live bytes over the wire.

Epic: none
Fixes: #103030

Release note: None
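
The pass condition described for #107394 (the fourth node is hydrated without transferring all live bytes over the wire) can be sketched as a simple ratio check. This is a minimal illustration, not the test's actual code: the `hydrationStats` type, the `hydratedViaSharedStorage` function, and the 25% threshold are hypothetical, and how the two byte counts are collected from the cluster is left out.

```go
package main

import "fmt"

// hydrationStats holds two hypothetical measurements for the newly added node.
type hydrationStats struct {
	LiveBytes        int64 // live bytes on the new node once rebalancing settles
	BytesOverNetwork int64 // bytes the new node received from peers over the wire
}

// hydratedViaSharedStorage reports whether the node received at most
// maxFraction of its live bytes over the network, which suggests the rest
// came from shared storage. The threshold is an assumption for illustration.
func hydratedViaSharedStorage(s hydrationStats, maxFraction float64) (bool, error) {
	if s.LiveBytes <= 0 {
		return false, fmt.Errorf("new node holds no live bytes; it was not hydrated")
	}
	if maxFraction <= 0 || maxFraction > 1 {
		return false, fmt.Errorf("maxFraction must be in (0, 1], got %v", maxFraction)
	}
	fraction := float64(s.BytesOverNetwork) / float64(s.LiveBytes)
	return fraction <= maxFraction, nil
}

func main() {
	// Placeholder numbers: 10 GiB live, 1 GiB sent over the wire.
	stats := hydrationStats{LiveBytes: 10 << 30, BytesOverNetwork: 1 << 30}
	ok, err := hydratedViaSharedStorage(stats, 0.25)
	fmt.Println(ok, err)
}
```

Treating an empty node as an error rather than a pass guards against a run where the new node never received any replicas at all.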

108154: kvcoord: refactor ambiguous commit tests r=AlexTalks a=AlexTalks

In #107323, tests for the ambiguous write case that leads to the "transaction unexpectedly committed" bug were introduced; however, to increase test coverage of the fix, multiple schedules of operations need to be tested. This change simply refactors the framework of the existing test to enable the addition of multiple subtests. The subtests are included in a separate patch.

Part of: #103817

Release note: None

108819: roachtest: add a c2c cutover `TO LATEST` test r=lidorcarmel a=lidorcarmel

We only have c2c roachtests that cut over to the past, so this adds one that cuts over to LATEST. It uses the `TO LATEST` SQL syntax because we expect that to be used more in production.

Epic: none

Release note: None

108910: streamingccl: minor log updates and code reorg r=lidorcarmel a=stevendanna

See individual commits.

Epic: none

108914: sqlproxyccl: do not report BackendDown metrics on throttle and routing errors r=JeffSwenson,andy-kimball a=jaylim-crl

#### sqlproxyccl: do not report BackendDown metrics on throttle and routing errors

Previously, we were reporting the backend_down metric on the following errors:
- codeProxyRefusedConnection
- codeParamsRoutingFailed
- codeUnavailable

These errors do not imply that the backend is down. We originally introduced
this in #57431, but looking at the PR, it appears unintentional. This commit
fixes that by not reporting the backend_down metric when the proxy returns
such errors.
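
The intended behavior after this commit can be sketched as a small predicate over error codes. The code names are taken from the description above, but the `errorCode` type, its values, and the `countsAsBackendDown` function are hypothetical stand-ins rather than the proxy's real definitions, and the sketch assumes `codeBackendDown` is the code that still feeds the metric.

```go
package main

import "fmt"

// errorCode is a hypothetical stand-in for the proxy's error code type.
type errorCode int

const (
	codeProxyRefusedConnection errorCode = iota
	codeParamsRoutingFailed
	codeUnavailable
	codeBackendDown
)

// countsAsBackendDown reports whether an error code should increment the
// backend_down metric. Throttling, routing, and unavailability errors do
// not imply that the backend itself is down, so they are excluded.
func countsAsBackendDown(c errorCode) bool {
	switch c {
	case codeProxyRefusedConnection, codeParamsRoutingFailed, codeUnavailable:
		return false
	case codeBackendDown:
		return true
	default:
		return false
	}
}

func main() {
	codes := []errorCode{
		codeProxyRefusedConnection, codeParamsRoutingFailed, codeUnavailable, codeBackendDown,
	}
	for _, c := range codes {
		fmt.Printf("code %d counts as backend down: %t\n", c, countsAsBackendDown(c))
	}
}
```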

Release note: None

Epic: none

#### sqlproxyccl: rename codeBackendDown to codeBackendDialFailed

This commit renames codeBackendDown to codeBackendDialFailed to avoid
confusing developers. Note that we don't rename the metric here, to avoid
breaking downstream consumers. At the same time, we remove the old
codeBackendRefusedTLS code, as it does not serve any purpose and there was
no metric for it either.

Release note: None

Epic: none



Release justification: This fixes accuracy issues with SQL Proxy metrics.

108920: util/log: add custom crash tags to sentry r=dhartunian a=pjtatlow

In #106786 we added the ability to provide an environment variable that was meant to add custom tags to sentry crash reports. That change added the function that would create the map of crash report tags / values, but it was never actually used. This change ensures that tags from that environment variable will actually show up in the sentry reports.

Release note: None

Epic: None
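
To illustrate the idea of building a tag map from an environment variable, here is a minimal sketch. The text above gives neither the variable's name nor its format, so both are assumptions here: `EXAMPLE_CRASH_REPORT_TAGS` is a hypothetical name, the comma-separated `key=value` format is a guess, and `parseCrashTags` is not the function from #106786. Attaching the resulting map to a crash report is left out.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// tagsEnvVar is a hypothetical variable name used only for this sketch.
const tagsEnvVar = "EXAMPLE_CRASH_REPORT_TAGS"

// parseCrashTags turns "k1=v1,k2=v2" into a map, skipping entries that
// have no "=" or an empty key. The format is an assumption.
func parseCrashTags(raw string) map[string]string {
	tags := make(map[string]string)
	for _, pair := range strings.Split(raw, ",") {
		key, value, found := strings.Cut(pair, "=")
		key = strings.TrimSpace(key)
		if !found || key == "" {
			continue
		}
		tags[key] = strings.TrimSpace(value)
	}
	return tags
}

func main() {
	tags := parseCrashTags(os.Getenv(tagsEnvVar))
	fmt.Println(len(tags), "custom tags would be attached to a crash report")
}
```

The bug described above was that such a map was built but never consumed, so a test of the real change would need to assert on the outgoing report rather than on the parser alone.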

Co-authored-by: Bilal Akhtar
Co-authored-by: Alex Sarkesian
Co-authored-by: Lidor Carmel
Co-authored-by: Steven Danna
Co-authored-by: Jay
Co-authored-by: PJ Tatlow
@craig craig bot closed this as completed in 97f17ff Aug 17, 2023
@jbowens jbowens moved this to Done in [Deprecated] Storage Jun 4, 2024