-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
cmd/roachtest: add disagg-rebalance roachtest #107394
Conversation
Tested this with #105839 and it reliably passes. Tested it without that change and one of the assertions fails. Won't merge this until that PR is merged. |
386c2f2
to
b2e861b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @renatolabs and @srosenberg)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @itsbilal and @renatolabs)
pkg/cmd/roachtest/tests/disagg-rebalance.go
line 41 at r1 (raw file):
c.Put(ctx, t.Cockroach(), "./cockroach") c.Put(ctx, t.DeprecatedWorkload(), "./workload") s3dir := "s3://cockroachdb-backup-testing/disagg-rebalance/" + c.Name() + "?AUTH=implicit"
Please use testutils.BackupTestingBucket()
[1]. @renatolabs Need to think how to surface this better; we also need to start discouraging (maybe a linter?) test authors from using any hardcoded buckets.
[1] #107073
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @itsbilal and @renatolabs)
pkg/cmd/roachtest/tests/disagg-rebalance.go
line 54 at r1 (raw file):
warehouses, )} c.Run(ctx, c.Node(1), cmd...)
Please use c.NewMonitor(ctx, c.Range(1, 3))
[1], this way we catch silent deaths. The pattern is typically as follows,
m := c.NewMonitor(ctx, crdbNodes)
...
m.Go(doSomething())
...
m.Go(doSomethingElse())
...
m.Wait()
@renatolabs I wonder if we should make it mandatory? i.e., every time a test starts a cluster, the (roachtest) API automatically adds a monitor.
[1] #36224
b2e861b
to
e0e242c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TFTR!
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @renatolabs)
pkg/cmd/roachtest/tests/disagg-rebalance.go
line 41 at r1 (raw file):
Previously, srosenberg (Stan Rosenberg) wrote…
Please use
testutils.BackupTestingBucket()
[1]. @renatolabs Need to think how to surface this better; we also need to start discouraging (maybe a linter?) test authors from using any hardcoded buckets.[1] #107073
Done.
pkg/cmd/roachtest/tests/disagg-rebalance.go
line 54 at r1 (raw file):
Previously, srosenberg (Stan Rosenberg) wrote…
Please use
c.NewMonitor(ctx, c.Range(1, 3))
[1], this way we catch silent deaths. The pattern is typically as follows,m := c.NewMonitor(ctx, crdbNodes) ... m.Go(doSomething()) ... m.Go(doSomethingElse()) ... m.Wait()
@renatolabs I wonder if we should make it mandatory? i.e., every time a test starts a cluster, the (roachtest) API automatically adds a monitor.
[1] #36224
Done.
Will merge this after #105839 merges. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @itsbilal and @srosenberg)
pkg/cmd/roachtest/tests/disagg-rebalance.go
line 30 at r2 (raw file):
func registerDisaggRebalance(r registry.Registry) { disaggRebalanceSpec := r.MakeClusterSpec(4) r.Add(registry.TestSpec{
You need to add the aws tag for this to be picked up by the nightly build.
pkg/cmd/roachtest/tests/disagg-rebalance.go
line 51 at r2 (raw file):
t.Status(`workload initialization`) cmd := []string{fmt.Sprintf(
Nit: this can be a plain string.
pkg/cmd/roachtest/tests/disagg-rebalance.go
line 52 at r2 (raw file):
t.Status(`workload initialization`) cmd := []string{fmt.Sprintf( "./workload fixtures import tpcc --warehouses=%d {pgurl:1}",
We can replace the uses of ./workload
in this test with ./cockroach workload
. As an added bonus, we'll then no longer need to upload the workload
binary to all nodes.
pkg/cmd/roachtest/tests/disagg-rebalance.go
line 61 at r2 (raw file):
m.Wait() ctx, cancel := context.WithCancel(ctx)
Do we need a child context? It is only canceled when the function returns, so the waits for <-ctx.Done()
below would only trigger if the parent context canceled. If there is a reason for it that I missed, a comment should make it clearer.
If we don't need it, it's probably better to remove this, since it can get kind of confusing with all the shadowing (ctx
passed to Run
, this local ctx
, then ctx
passed to monitor.Go
.
pkg/cmd/roachtest/tests/disagg-rebalance.go
line 64 at r2 (raw file):
defer cancel() cmdDone := make(chan error) m2 := c.NewMonitor(ctx, c.Range(1, 3))
Any reason why we need a new monitor? I thought they could be reused.
pkg/cmd/roachtest/tests/disagg-rebalance.go
line 74 at r2 (raw file):
err := c.RunE(ctx, c.Node(1), cmd) cmdDone <- err
If I'm reading this test right, if one of the assertions below fail (e.g., one of the db.QueryRow
s), the test would fail and this goroutine will stay blocked in this channel send. I believe we can fix this by changing the wait loop at the end of this test, see my comment there.
pkg/cmd/roachtest/tests/disagg-rebalance.go
line 97 at r2 (raw file):
if err := waitForRebalance(ctx, t.L(), db, 10 /* maxStdDev */, 20 /* stableSeconds */); err != nil { t.Fatal(err) return
Nit: unreachable.
pkg/cmd/roachtest/tests/disagg-rebalance.go
line 107 at r2 (raw file):
).Scan(&count); err != nil { t.Fatal(err) return
Nit: unreachable.
pkg/cmd/roachtest/tests/disagg-rebalance.go
line 143 at r2 (raw file):
return } m2.Wait()
Couldn't this select
block be replaced by just a Wait on the monitor? Then we can even do away with cmdDone
. If the context is canceled (most likely reason would be a test timeout), Wait
should return since the RunE
should observe that cancelation as well and return.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @itsbilal and @srosenberg)
pkg/cmd/roachtest/tests/disagg-rebalance.go
line 41 at r1 (raw file):
Need to think how to surface this better; we also need to start discouraging (maybe a linter?) test authors from using any hardcoded buckets.
Agree, but not sure what the best approach would be. A general purpose linter is tricky, and there would probably be false positives.
pkg/cmd/roachtest/tests/disagg-rebalance.go
line 54 at r1 (raw file):
I wonder if we should make it mandatory? i.e., every time a test starts a cluster, the (roachtest) API automatically adds a monitor.
Not a bad idea; we would then also need to lift the ExpectDeath
API to the (already crowded) cluster
interface. IMO, most tests would fail if a node suddenly crashed, but the error message would be less direct than unexpected node death
.
What we could add a linter for would be using go
in roachtests.
0425be1
to
c403fc9
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TFTR!
Dismissed @srosenberg from 2 discussions.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @renatolabs and @srosenberg)
pkg/cmd/roachtest/tests/disagg-rebalance.go
line 30 at r2 (raw file):
Previously, renatolabs (Renato Costa) wrote…
You need to add the aws tag for this to be picked up by the nightly build.
Done.
pkg/cmd/roachtest/tests/disagg-rebalance.go
line 51 at r2 (raw file):
Previously, renatolabs (Renato Costa) wrote…
Nit: this can be a plain string.
Done.
pkg/cmd/roachtest/tests/disagg-rebalance.go
line 52 at r2 (raw file):
Previously, renatolabs (Renato Costa) wrote…
We can replace the uses of
./workload
in this test with./cockroach workload
. As an added bonus, we'll then no longer need to upload theworkload
binary to all nodes.
Done.
pkg/cmd/roachtest/tests/disagg-rebalance.go
line 61 at r2 (raw file):
Previously, renatolabs (Renato Costa) wrote…
Do we need a child context? It is only canceled when the function returns, so the waits for
<-ctx.Done()
below would only trigger if the parent context canceled. If there is a reason for it that I missed, a comment should make it clearer.If we don't need it, it's probably better to remove this, since it can get kind of confusing with all the shadowing (
ctx
passed toRun
, this localctx
, thenctx
passed tomonitor.Go
.
Done.
pkg/cmd/roachtest/tests/disagg-rebalance.go
line 64 at r2 (raw file):
Previously, renatolabs (Renato Costa) wrote…
Any reason why we need a new monitor? I thought they could be reused.
Done.
pkg/cmd/roachtest/tests/disagg-rebalance.go
line 74 at r2 (raw file):
Previously, renatolabs (Renato Costa) wrote…
If I'm reading this test right, if one of the assertions below fail (e.g., one of the
db.QueryRow
s), the test would fail and this goroutine will stay blocked in this channel send. I believe we can fix this by changing the wait loop at the end of this test, see my comment there.
Done.
pkg/cmd/roachtest/tests/disagg-rebalance.go
line 143 at r2 (raw file):
Previously, renatolabs (Renato Costa) wrote…
Couldn't this
select
block be replaced by just a Wait on the monitor? Then we can even do away withcmdDone
. If the context is canceled (most likely reason would be a test timeout),Wait
should return since theRunE
should observe that cancelation as well and return.
Done.
c403fc9
to
bb24031
Compare
Previously, itsbilal (Bilal Akhtar) wrote…
Ended up reverting this, for some reason it was erroring out with |
bb24031
to
8b18f99
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @renatolabs and @srosenberg)
pkg/cmd/roachtest/tests/disagg-rebalance.go
line 52 at r2 (raw file):
Previously, itsbilal (Bilal Akhtar) wrote…
Ended up reverting this, for some reason it was erroring out with
cockroach workload
but not withworkload
?
nevermind, the error was due to monitor reuse. Brought back cockroach workload
.
pkg/cmd/roachtest/tests/disagg-rebalance.go
line 64 at r2 (raw file):
Previously, itsbilal (Bilal Akhtar) wrote…
Done.
Had to revert this; two monitors are necessary because Wait()
cancels its context so calling Go() after a wait is not allowed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @itsbilal and @srosenberg)
pkg/cmd/roachtest/tests/disagg-rebalance.go
line 51 at r2 (raw file):
Previously, itsbilal (Bilal Akhtar) wrote…
Done.
I still see it as []string
?
pkg/cmd/roachtest/tests/disagg-rebalance.go
line 64 at r2 (raw file):
Previously, itsbilal (Bilal Akhtar) wrote…
Had to revert this; two monitors are necessary because
Wait()
cancels its context so calling Go() after a wait is not allowed.
Ah yes, makes sense. Sorry for the misleading comment!
pkg/cmd/roachtest/tests/disagg_rebalance.go
line 42 at r5 (raw file):
} c.Put(ctx, t.Cockroach(), "./cockroach") c.Put(ctx, t.DeprecatedWorkload(), "./workload")
Nit: no longer needed.
8b18f99
to
b9df926
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TFTR!
Gonna wait for the dependent changes to merge before I merge this, which could take ~2 weeks.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @renatolabs and @srosenberg)
pkg/cmd/roachtest/tests/disagg-rebalance.go
line 51 at r2 (raw file):
Previously, renatolabs (Renato Costa) wrote…
I still see it as
[]string
?
Ah I thought you were referring to the t.Status string above which was previously a raw string literal. Fixed the []string now.
This test adds a roachtest that spins up a cluster with 3 nodes using S3 as the --experimental-shared-storage, and then adds a fourth node after loading a tpcc fixture and with a foreground workload running on it. It confirms the fourth node gets hydrated without transferring all live bytes over the wire. Epic: none Fixes: cockroachdb#103030 Release note: None
b9df926
to
97f17ff
Compare
We require sstables to be written with obsolete bits for shared sstable ingestion to work correctly. This change moves to writing sstables with the new format if the cluster version is high enough. Unblocks cockroachdb#107394. Epic: none Release note: None
bors r=renatolabs |
Build failed (retrying...): |
…108925 107863: sql: add bitwise operation for varbit with variable length r=rharding6373 a=cty123 As raised in #107821, the existing bitwise operations in CRDB follow the same standard as Postgresql that require the two operands of varbit type to have the same length. The restriction makes it easier for the implementation but harder for users. As of today, we need to apply casting and left padding to make sure the operands have the same length. Instead, it might be useful to have CRDB built-in functions that can handle this. Here I have implemented 2 basic bitwise operation functions that are tailored for varbit typed data of arbitrary length. 108525: sql: TestRandomSyntaxFunctions: skip function call and fix CREATE STATS use of span after finish r=fqazi a=fqazi Previously, this test could depending on the concurrency, could spawn a large number of schema telemetry jobs via (ctdb_internal.create_sql_schema_telemetry_job). This could lead to contention that will eventually cause this test to time out. To address this, this patch limits calling the telemetry job creation function. Fixes: #108153 Release note: None 108807: server,log: report tenant names in logging output r=stevendanna,abargainier a=knz Epic: CRDB-27982 Fixes #103406. Prior to this change, the tenant details reported in logging output were limited to just the tenant ID. This is because the original tenant server initialization code only had access to the tenant ID (provided as CLI argument). Since recently, the tenant name is also becoming known during tenant server initialization. This is currently done via the tenant connector. This commit expands on this foundation by injecting the tenant name into the logging output as soon as it is available. For example: ``` I230815 19:31:07.290757 922 sql/temporary_schema.go:554 ⋮ [T4,Vblah,n1] 148 found 0 temporary schemas ^^^^^ here ``` or using JSON: ``` {"channel_numeric":0,"channel":"DEV",...,"tenant_id":4,"tenant_name":"blah",...} ^^^^^^^^^^^^^^^^^^^^ here ``` Note: we are choosing to report the tenant name *in addition* to the tenant ID to preserve compatibilit with automation (eg. logspy) which needs to filter entries based on ID. Release note (cluster virtualization): The name of the virtual cluster (tenant), when known, is now reported in logging events. Refer to the documentation of individual logging format for details. 108823: row: remove no longer nil txn check r=yuzefovich a=yuzefovich I believe this non-nil txn check is no longer needed as of 8f7f2f4 (which fixed the only known place where we forgot to set the txn field on the FlowCtx). Epic: None Release note: None 108858: *: bump some core dependencies r=knz a=rickystewart Unfortunately we're still getting `nogo` crashes. Try bumping these dependencies even further. Informs #99988. Epic: none Release note: None 108866: storage: Write sstables in v4 format for shared ingestions r=RaduBerinde a=itsbilal We require sstables to be written with obsolete bits for shared sstable ingestion to work correctly. This change moves to writing sstables with the new format if the cluster version is high enough. Unblocks #107394. Epic: none Release note: None 108886: kvstreamer, row: adjust all tests to work with the test tenant r=yuzefovich a=yuzefovich Epic: CRDB-18499 Informs #76378 Release note: None 108924: logictest: avoid FATAL during cleanup r=rafiss a=rafiss This prevents the test from failing if there is an error closing a connection at the end of the test. There is no strong need for us to assert that closing the connection worked. fixes #108833 Release note: None 108925: streamingest: speed-up c2c cutover r=lidorcarmel a=lidorcarmel Each c2c processor checks the job info table every 30 seconds to see whether a cutover has started. Normally a cutover should take about 10 seconds. This pr changes the default to 10 seconds, to reduce the total time for cutting over. Epic: none Release note: None Co-authored-by: cty123 <[email protected]> Co-authored-by: Faizan Qazi <[email protected]> Co-authored-by: Raphael 'kena' Poss <[email protected]> Co-authored-by: Yahor Yuzefovich <[email protected]> Co-authored-by: Ricky Stewart <[email protected]> Co-authored-by: Bilal Akhtar <[email protected]> Co-authored-by: Rafi Shamim <[email protected]> Co-authored-by: Lidor Carmel <[email protected]>
Build succeeded: |
This test adds a roachtest that spins up a cluster with 3 nodes using S3 as the --experimental-shared-storage, and then adds a fourth node after loading a tpcc fixture and with a foreground workload running on it. It confirms the fourth node gets hydrated without transferring all live bytes over the wire.
Epic: none
Fixes: #103030
Release note: None