Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/roachtest: add disagg-rebalance roachtest #107394

Merged
merged 1 commit into from
Aug 17, 2023

Conversation

itsbilal
Copy link
Member

This test adds a roachtest that spins up a cluster with 3 nodes using S3 as the --experimental-shared-storage, and then adds a fourth node after loading a tpcc fixture and with a foreground workload running on it. It confirms the fourth node gets hydrated without transferring all live bytes over the wire.

Epic: none
Fixes: #103030

Release note: None

@itsbilal itsbilal requested a review from a team July 21, 2023 21:35
@itsbilal itsbilal self-assigned this Jul 21, 2023
@itsbilal itsbilal requested a review from a team as a code owner July 21, 2023 21:35
@itsbilal itsbilal requested review from srosenberg and renatolabs and removed request for a team July 21, 2023 21:35
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@itsbilal
Copy link
Member Author

Tested this with #105839 and it reliably passes. Tested it without that change and one of the assertions fails. Won't merge this until that PR is merged.

@itsbilal itsbilal removed their assignment Jul 21, 2023
@itsbilal itsbilal force-pushed the disagg-rebalance-roachtest branch from 386c2f2 to b2e861b Compare July 21, 2023 22:44
Copy link
Member

@RaduBerinde RaduBerinde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: Nice!

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @renatolabs and @srosenberg)

Copy link
Member

@srosenberg srosenberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @itsbilal and @renatolabs)


pkg/cmd/roachtest/tests/disagg-rebalance.go line 41 at r1 (raw file):

			c.Put(ctx, t.Cockroach(), "./cockroach")
			c.Put(ctx, t.DeprecatedWorkload(), "./workload")
			s3dir := "s3://cockroachdb-backup-testing/disagg-rebalance/" + c.Name() + "?AUTH=implicit"

Please use testutils.BackupTestingBucket() [1]. @renatolabs Need to think how to surface this better; we also need to start discouraging (maybe a linter?) test authors from using any hardcoded buckets.

[1] #107073

Copy link
Member

@srosenberg srosenberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @itsbilal and @renatolabs)


pkg/cmd/roachtest/tests/disagg-rebalance.go line 54 at r1 (raw file):

				warehouses,
			)}
			c.Run(ctx, c.Node(1), cmd...)

Please use c.NewMonitor(ctx, c.Range(1, 3)) [1], this way we catch silent deaths. The pattern is typically as follows,

m := c.NewMonitor(ctx, crdbNodes)
...
m.Go(doSomething())
...
m.Go(doSomethingElse())
...
m.Wait()

@renatolabs I wonder if we should make it mandatory? i.e., every time a test starts a cluster, the (roachtest) API automatically adds a monitor.

[1] #36224

@itsbilal itsbilal force-pushed the disagg-rebalance-roachtest branch from b2e861b to e0e242c Compare July 24, 2023 15:20
Copy link
Member Author

@itsbilal itsbilal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @renatolabs)


pkg/cmd/roachtest/tests/disagg-rebalance.go line 41 at r1 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

Please use testutils.BackupTestingBucket() [1]. @renatolabs Need to think how to surface this better; we also need to start discouraging (maybe a linter?) test authors from using any hardcoded buckets.

[1] #107073

Done.


pkg/cmd/roachtest/tests/disagg-rebalance.go line 54 at r1 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

Please use c.NewMonitor(ctx, c.Range(1, 3)) [1], this way we catch silent deaths. The pattern is typically as follows,

m := c.NewMonitor(ctx, crdbNodes)
...
m.Go(doSomething())
...
m.Go(doSomethingElse())
...
m.Wait()

@renatolabs I wonder if we should make it mandatory? i.e., every time a test starts a cluster, the (roachtest) API automatically adds a monitor.

[1] #36224

Done.

@itsbilal
Copy link
Member Author

Will merge this after #105839 merges.

Copy link
Contributor

@renatolabs renatolabs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @itsbilal and @srosenberg)


pkg/cmd/roachtest/tests/disagg-rebalance.go line 30 at r2 (raw file):

func registerDisaggRebalance(r registry.Registry) {
	disaggRebalanceSpec := r.MakeClusterSpec(4)
	r.Add(registry.TestSpec{

You need to add the aws tag for this to be picked up by the nightly build.


pkg/cmd/roachtest/tests/disagg-rebalance.go line 51 at r2 (raw file):

			t.Status(`workload initialization`)
			cmd := []string{fmt.Sprintf(

Nit: this can be a plain string.


pkg/cmd/roachtest/tests/disagg-rebalance.go line 52 at r2 (raw file):

			t.Status(`workload initialization`)
			cmd := []string{fmt.Sprintf(
				"./workload fixtures import tpcc --warehouses=%d {pgurl:1}",

We can replace the uses of ./workload in this test with ./cockroach workload. As an added bonus, we'll then no longer need to upload the workload binary to all nodes.


pkg/cmd/roachtest/tests/disagg-rebalance.go line 61 at r2 (raw file):

			m.Wait()

			ctx, cancel := context.WithCancel(ctx)

Do we need a child context? It is only canceled when the function returns, so the waits for <-ctx.Done() below would only trigger if the parent context canceled. If there is a reason for it that I missed, a comment should make it clearer.

If we don't need it, it's probably better to remove this, since it can get kind of confusing with all the shadowing (ctx passed to Run, this local ctx, then ctx passed to monitor.Go.


pkg/cmd/roachtest/tests/disagg-rebalance.go line 64 at r2 (raw file):

			defer cancel()
			cmdDone := make(chan error)
			m2 := c.NewMonitor(ctx, c.Range(1, 3))

Any reason why we need a new monitor? I thought they could be reused.


pkg/cmd/roachtest/tests/disagg-rebalance.go line 74 at r2 (raw file):

				err := c.RunE(ctx, c.Node(1), cmd)
				cmdDone <- err

If I'm reading this test right, if one of the assertions below fail (e.g., one of the db.QueryRows), the test would fail and this goroutine will stay blocked in this channel send. I believe we can fix this by changing the wait loop at the end of this test, see my comment there.


pkg/cmd/roachtest/tests/disagg-rebalance.go line 97 at r2 (raw file):

			if err := waitForRebalance(ctx, t.L(), db, 10 /* maxStdDev */, 20 /* stableSeconds */); err != nil {
				t.Fatal(err)
				return

Nit: unreachable.


pkg/cmd/roachtest/tests/disagg-rebalance.go line 107 at r2 (raw file):

			).Scan(&count); err != nil {
				t.Fatal(err)
				return

Nit: unreachable.


pkg/cmd/roachtest/tests/disagg-rebalance.go line 143 at r2 (raw file):

				return
			}
			m2.Wait()

Couldn't this select block be replaced by just a Wait on the monitor? Then we can even do away with cmdDone. If the context is canceled (most likely reason would be a test timeout), Wait should return since the RunE should observe that cancelation as well and return.

Copy link
Contributor

@renatolabs renatolabs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @itsbilal and @srosenberg)


pkg/cmd/roachtest/tests/disagg-rebalance.go line 41 at r1 (raw file):

Need to think how to surface this better; we also need to start discouraging (maybe a linter?) test authors from using any hardcoded buckets.

Agree, but not sure what the best approach would be. A general purpose linter is tricky, and there would probably be false positives.


pkg/cmd/roachtest/tests/disagg-rebalance.go line 54 at r1 (raw file):

I wonder if we should make it mandatory? i.e., every time a test starts a cluster, the (roachtest) API automatically adds a monitor.

Not a bad idea; we would then also need to lift the ExpectDeath API to the (already crowded) cluster interface. IMO, most tests would fail if a node suddenly crashed, but the error message would be less direct than unexpected node death.

What we could add a linter for would be using go in roachtests.

@itsbilal itsbilal force-pushed the disagg-rebalance-roachtest branch 2 times, most recently from 0425be1 to c403fc9 Compare July 25, 2023 15:29
Copy link
Member Author

@itsbilal itsbilal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR!

Dismissed @srosenberg from 2 discussions.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @renatolabs and @srosenberg)


pkg/cmd/roachtest/tests/disagg-rebalance.go line 30 at r2 (raw file):

Previously, renatolabs (Renato Costa) wrote…

You need to add the aws tag for this to be picked up by the nightly build.

Done.


pkg/cmd/roachtest/tests/disagg-rebalance.go line 51 at r2 (raw file):

Previously, renatolabs (Renato Costa) wrote…

Nit: this can be a plain string.

Done.


pkg/cmd/roachtest/tests/disagg-rebalance.go line 52 at r2 (raw file):

Previously, renatolabs (Renato Costa) wrote…

We can replace the uses of ./workload in this test with ./cockroach workload. As an added bonus, we'll then no longer need to upload the workload binary to all nodes.

Done.


pkg/cmd/roachtest/tests/disagg-rebalance.go line 61 at r2 (raw file):

Previously, renatolabs (Renato Costa) wrote…

Do we need a child context? It is only canceled when the function returns, so the waits for <-ctx.Done() below would only trigger if the parent context canceled. If there is a reason for it that I missed, a comment should make it clearer.

If we don't need it, it's probably better to remove this, since it can get kind of confusing with all the shadowing (ctx passed to Run, this local ctx, then ctx passed to monitor.Go.

Done.


pkg/cmd/roachtest/tests/disagg-rebalance.go line 64 at r2 (raw file):

Previously, renatolabs (Renato Costa) wrote…

Any reason why we need a new monitor? I thought they could be reused.

Done.


pkg/cmd/roachtest/tests/disagg-rebalance.go line 74 at r2 (raw file):

Previously, renatolabs (Renato Costa) wrote…

If I'm reading this test right, if one of the assertions below fail (e.g., one of the db.QueryRows), the test would fail and this goroutine will stay blocked in this channel send. I believe we can fix this by changing the wait loop at the end of this test, see my comment there.

Done.


pkg/cmd/roachtest/tests/disagg-rebalance.go line 143 at r2 (raw file):

Previously, renatolabs (Renato Costa) wrote…

Couldn't this select block be replaced by just a Wait on the monitor? Then we can even do away with cmdDone. If the context is canceled (most likely reason would be a test timeout), Wait should return since the RunE should observe that cancelation as well and return.

Done.

@itsbilal itsbilal force-pushed the disagg-rebalance-roachtest branch from c403fc9 to bb24031 Compare July 25, 2023 16:16
@itsbilal
Copy link
Member Author

pkg/cmd/roachtest/tests/disagg-rebalance.go line 52 at r2 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

Done.

Ended up reverting this, for some reason it was erroring out with cockroach workload but not with workload ?

@itsbilal itsbilal force-pushed the disagg-rebalance-roachtest branch from bb24031 to 8b18f99 Compare July 25, 2023 16:29
Copy link
Member Author

@itsbilal itsbilal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @renatolabs and @srosenberg)


pkg/cmd/roachtest/tests/disagg-rebalance.go line 52 at r2 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

Ended up reverting this, for some reason it was erroring out with cockroach workload but not with workload ?

nevermind, the error was due to monitor reuse. Brought back cockroach workload.


pkg/cmd/roachtest/tests/disagg-rebalance.go line 64 at r2 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

Done.

Had to revert this; two monitors are necessary because Wait() cancels its context so calling Go() after a wait is not allowed.

Copy link
Contributor

@renatolabs renatolabs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @itsbilal and @srosenberg)


pkg/cmd/roachtest/tests/disagg-rebalance.go line 51 at r2 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

Done.

I still see it as []string?


pkg/cmd/roachtest/tests/disagg-rebalance.go line 64 at r2 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

Had to revert this; two monitors are necessary because Wait() cancels its context so calling Go() after a wait is not allowed.

Ah yes, makes sense. Sorry for the misleading comment!


pkg/cmd/roachtest/tests/disagg_rebalance.go line 42 at r5 (raw file):

			}
			c.Put(ctx, t.Cockroach(), "./cockroach")
			c.Put(ctx, t.DeprecatedWorkload(), "./workload")

Nit: no longer needed.

@itsbilal itsbilal force-pushed the disagg-rebalance-roachtest branch from 8b18f99 to b9df926 Compare July 25, 2023 17:07
Copy link
Member Author

@itsbilal itsbilal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR!

Gonna wait for the dependent changes to merge before I merge this, which could take ~2 weeks.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @renatolabs and @srosenberg)


pkg/cmd/roachtest/tests/disagg-rebalance.go line 51 at r2 (raw file):

Previously, renatolabs (Renato Costa) wrote…

I still see it as []string?

Ah I thought you were referring to the t.Status string above which was previously a raw string literal. Fixed the []string now.

This test adds a roachtest that spins up a cluster with
3 nodes using S3 as the --experimental-shared-storage, and then
adds a fourth node after loading a tpcc fixture and with a foreground
workload running on it. It confirms the fourth node gets hydrated
without transferring all live bytes over the wire.

Epic: none
Fixes: cockroachdb#103030

Release note: None
@itsbilal itsbilal force-pushed the disagg-rebalance-roachtest branch from b9df926 to 97f17ff Compare August 16, 2023 17:20
itsbilal added a commit to itsbilal/cockroach that referenced this pull request Aug 17, 2023
We require sstables to be written with obsolete bits
for shared sstable ingestion to work correctly. This
change moves to writing sstables with the new format
if the cluster version is high enough.

Unblocks cockroachdb#107394.

Epic: none

Release note: None
@itsbilal
Copy link
Member Author

bors r=renatolabs

@craig
Copy link
Contributor

craig bot commented Aug 17, 2023

Build failed (retrying...):

craig bot pushed a commit that referenced this pull request Aug 17, 2023
…108925

107863: sql: add bitwise operation for varbit with variable length r=rharding6373 a=cty123

As raised in #107821, the existing bitwise operations in CRDB follow the same standard as Postgresql that require the two operands of varbit type to have the same length. The restriction makes it easier for the implementation but harder for users. As of today, we need to apply casting and left padding to make sure the operands have the same length. Instead, it might be useful to have CRDB built-in functions that can handle this. Here I have implemented 2 basic bitwise operation functions that are tailored for varbit typed data of arbitrary length.

108525: sql: TestRandomSyntaxFunctions: skip function call and fix CREATE STATS use of span after finish r=fqazi a=fqazi

Previously, this test could depending on the concurrency,
could spawn a large number of schema telemetry jobs via
(ctdb_internal.create_sql_schema_telemetry_job).
This could lead to contention that will eventually
cause this test to time out. To address this, this
patch limits calling the telemetry job creation function.

Fixes: #108153

Release note: None

108807: server,log: report tenant names in logging output r=stevendanna,abargainier a=knz

Epic: CRDB-27982
Fixes #103406.

Prior to this change, the tenant details reported in logging output
were limited to just the tenant ID. This is because the original
tenant server initialization code only had access to the tenant
ID (provided as CLI argument).

Since recently, the tenant name is also becoming known during tenant
server initialization. This is currently done via the tenant
connector.

This commit expands on this foundation by injecting the tenant name
into the logging output as soon as it is available.

For example:

```
I230815 19:31:07.290757 922 sql/temporary_schema.go:554 ⋮ [T4,Vblah,n1] 148  found 0 temporary schemas
                                                              ^^^^^ here
```

or using JSON:
```
{"channel_numeric":0,"channel":"DEV",...,"tenant_id":4,"tenant_name":"blah",...}
                                                       ^^^^^^^^^^^^^^^^^^^^ here
```


Note: we are choosing to report the tenant name *in addition* to the
tenant ID to preserve compatibilit with automation (eg. logspy) which
needs to filter entries based on ID.


Release note (cluster virtualization): The name of the virtual
cluster (tenant), when known, is now reported in logging events.
Refer to the documentation of individual logging format for details.

108823: row: remove no longer nil txn check r=yuzefovich a=yuzefovich

I believe this non-nil txn check is no longer needed as of 8f7f2f4 (which fixed the only known place where we forgot to set the txn field on the FlowCtx).

Epic: None

Release note: None

108858: *: bump some core dependencies r=knz a=rickystewart

Unfortunately we're still getting `nogo` crashes. Try bumping these
dependencies even further.

Informs #99988.

Epic: none
Release note: None

108866: storage: Write sstables in v4 format for shared ingestions r=RaduBerinde a=itsbilal

We require sstables to be written with obsolete bits for shared sstable ingestion to work correctly. This change moves to writing sstables with the new format if the cluster version is high enough.

Unblocks #107394.

Epic: none

Release note: None

108886: kvstreamer, row: adjust all tests to work with the test tenant r=yuzefovich a=yuzefovich

Epic: CRDB-18499
Informs #76378

Release note: None

108924: logictest: avoid FATAL during cleanup r=rafiss a=rafiss

This prevents the test from failing if there is an error closing a connection at the end of the test. There is no strong need for us to assert that closing the connection worked.

fixes #108833
Release note: None

108925: streamingest: speed-up c2c cutover r=lidorcarmel a=lidorcarmel

Each c2c processor checks the job info table every 30 seconds to see whether a cutover has started. Normally a cutover should take about 10 seconds. This pr changes the default to 10 seconds, to reduce the total time for cutting over.

Epic: none

Release note: None

Co-authored-by: cty123 <[email protected]>
Co-authored-by: Faizan Qazi <[email protected]>
Co-authored-by: Raphael 'kena' Poss <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
Co-authored-by: Ricky Stewart <[email protected]>
Co-authored-by: Bilal Akhtar <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
Co-authored-by: Lidor Carmel <[email protected]>
@craig
Copy link
Contributor

craig bot commented Aug 17, 2023

Build succeeded:

@craig craig bot merged commit 07266c3 into cockroachdb:master Aug 17, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

roachtest: Add new roachtest to test fast rebalance
5 participants