roachtest: benchmark node decommission #81565
Conversation
Force-pushed from 2d1b84c to 5744487
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @AlexTalks)
pkg/cmd/roachtest/tests/decommissionbench.go
line 283 at r1 (raw file):
}
// TODO(sarkesian): wipe the node and re-add to cluster
note: long running benchmark test in follow-up commit #81874
Force-pushed from 04659d6 to 0709829
Force-pushed from 0709829 to 3bb96e9
Force-pushed from 6795f05 to 32195ab
this is awesome to have, do you already have numbers to show? if yes, it would be great to put that in the commit/pr message.
Reviewed 4 of 4 files at r2, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @AlexTalks and @irfansharif)
pkg/cmd/roachtest/tests/decommission.go
line 1167 at r2 (raw file):
}
// pinRangesOnNode forces all default ranges to be replicated to the target node.
can we say "forces all ranges"? not sure I know what's a default range.
Code quote:
forces all default ranges
pkg/cmd/roachtest/tests/decommission.go
line 1168 at r2 (raw file):
// pinRangesOnNode forces all default ranges to be replicated to the target node.
func (h *decommTestHelper) pinRangesOnNode(ctx context.Context, target, runNode int) error {
for symmetry it would be nice to name these targetNode and runNode.
Code quote:
target, runNode int
pkg/cmd/roachtest/tests/decommission.go
line 1183 at r2 (raw file):
ctx context.Context, target, runNode int, unpin bool,
) error {
	db := h.c.Conn(ctx, h.t.L(), runNode)
looks like this is always node 1? if yes, then maybe just put 1 here and avoid propagating this all the way from runDecommissionBench?
Code quote:
runNode)
pkg/cmd/roachtest/tests/decommission.go
line 1246 at r2 (raw file):
defer db.Close()
var membership string
if err := db.QueryRow("SELECT membership FROM crdb_internal.gossip_liveness WHERE node_id = $1", downNodeID).Scan(&membership); err != nil {
line too long?
Code quote:
./cockroach quit --insecure --host
pkg/cmd/roachtest/tests/decommission.go
line 1264 at r2 (raw file):
db := h.c.Conn(ctx, h.t.L(), runNode)
defer func() {
	_ = db.Close()
is defer db.Close() good enough here? same below.
(I know it's not new code but still asking..)
Code quote:
defer func() {
_ = db.Close()
}()
pkg/cmd/roachtest/tests/decommission.go
line 1316 at r2 (raw file):
return err
}
h.t.L().Printf("node%d (n%d) awaiting %d replica(s)", targetNode, targetNodeID, count)
can node%d be different from n%d? why? it would be great to have just one of these (the n%d only).
Code quote:
node%d (n%d)
pkg/cmd/roachtest/tests/decommissionbench.go
line 283 at r1 (raw file):
Previously, AlexTalks (Alex Sarkesian) wrote…
note: long running benchmark test in follow-up commit
why do you need to reuse the same node?
First, do we really need a test that decommissions a node, and then adds a node and then decommissions again? I'm assuming this is roughly what you have in mind.
Second, even if this is needed, you can use a new VM - or more than that - you will need the ability to spin up a new VM anyway to test things like decommissioning and adding a new node at the same time.
I would drop this reuse feature in this commit and only introduce it when it's really needed.
pkg/cmd/roachtest/tests/decommissionbench.go
line 9 at r2 (raw file):
// the Business Source License, use of this software will be governed
// by the Apache License, Version 2.0, included in the file
// licenses/APL.txt.
should the filename be decommission_bench.go?
pkg/cmd/roachtest/tests/decommissionbench.go
line 34 at r2 (raw file):
const (
	defaultTimeout        = 1 * time.Hour
	defaultSnapshotRateMb = 32
do you think it makes sense to use the default we have in the replicate queue? meaning, treat this as an implementation detail, so that when we change the prod default we might see roachtests failing, which would be great to see. If we have a default here, the test may still succeed when the prod code is updated.
pkg/cmd/roachtest/tests/decommissionbench.go
line 43 at r2 (raw file):
load             bool
admissionControl bool
whileDown        bool
I'm assuming whileDown means we first take the node down and then decommission? vs decommission a live node? a comment would be nice IMO.
pkg/cmd/roachtest/tests/decommissionbench.go
line 54 at r2 (raw file):
nodes:      4,
cpus:       4,
warehouses: 100,
why only 100? do you get interesting results with such a small number? how many ranges and bytes will the decommissioned node have?
I think roachtests need to be more realistic and interesting, like a test a user would run. You're not verifying a decommission works, you're benchmarking whether it takes an hour or two, and I think that with 100 warehouses it will be way faster than that? I might be wrong..
pkg/cmd/roachtest/tests/decommissionbench.go
line 69 at r2 (raw file):
nodes:      4,
cpus:       16,
warehouses: 1000,
looks like you have 16 cores because you're using 1000 warehouses, right? I think we can load a lot more warehouses on 3x16 cores. It would be nice to know the cpu usage roughly, to make sure we don't benchmark with 90% idle machines, both to avoid waste and to be more realistic.
also - I think all test cases can use the same number of warehouses and cores to reduce scenarios and simplify things, but that's definitely up to you so feel free to ignore me. (Unless you want a single huge decommission where you have a node with many TBs)
pkg/cmd/roachtest/tests/decommissionbench.go
line 81 at r2 (raw file):
},
{
	nodes: 4,
I said it in slack - I really think you need to benchmark more than 4 nodes. 4 is special because after the decommission you only have 3 nodes, which means all replicas have exactly one node they can be moved to, so it's an easy job for the allocator. 5 nodes might be tricky because system ranges will need to be down replicated to 3 replicas (right? maybe not important?), I would try 4 and 8 or something like that.
pkg/cmd/roachtest/tests/decommissionbench.go
line 85 at r2 (raw file):
warehouses:       1000,
load:             false,
admissionControl: false,
I'd consider admission off as a rare thing and have 0 or 1 test cases for it, but definitely feel free to ignore me.. (also it depends how heavy/slow these test cases are).
pkg/cmd/roachtest/tests/decommissionbench.go
line 144 at r2 (raw file):
t test.Test, c cluster.Cluster, tc decommissionBenchSpec,
maybe s/tc/spec/ ? I know tc is for test case.. but tc decommissionBenchSpec looks very random to me.
Code quote:
tc decommissionBenchSpec,
pkg/cmd/roachtest/tests/decommissionbench.go
line 162 at r2 (raw file):
database := "tpcc" importCmd := fmt.Sprintf(`./cockroach workload fixtures import tpcc --warehouses=%d`, tc.warehouses) workloadCmd := fmt.Sprintf("./workload run tpcc --warehouses=%d --duration=%s --histograms=%s/stats.json --tolerate-errors {pgurl:1-%d}", tc.warehouses, testTimeout, t.PerfArtifactsDir(), tc.nodes)
long line?
pkg/cmd/roachtest/tests/decommissionbench.go
line 210 at r2 (raw file):
tick, perfBuf := initBulkJobPerfArtifacts(t.Name(), defaultTimeout)
recorder := &decommBenchTicker{pre: tick, post: tick}
can you put a comment about what these lines are doing?
Code quote:
tick, perfBuf := initBulkJobPerfArtifacts(t.Name(), defaultTimeout)
recorder := &decommBenchTicker{pre: tick, post: tick}
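For what it's worth, one possible reading of these two lines (hedged; the exact semantics of initBulkJobPerfArtifacts are assumed from its use in the bulk-job roachtests), written as the requested comments:
```
// initBulkJobPerfArtifacts (shared with the bulk-job roachtests) registers a
// histogram under the test's name and returns a tick function plus the buffer
// that is later written out as stats.json in the perf artifacts directory.
tick, perfBuf := initBulkJobPerfArtifacts(t.Name(), defaultTimeout)
// Ticking once immediately before and once immediately after the decommission
// brackets the operation so its elapsed time can be plotted on roachperf.
recorder := &decommBenchTicker{pre: tick, post: tick}
```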
pkg/cmd/roachtest/tests/decommissionbench.go
line 253 at r2 (raw file):
h.t.Status(fmt.Sprintf("targeting node%d (n%d) for decommission", target, target)) // Pin all default ranges (all warehouses) to the node, for uniformity.
I think this is not great, but let's see if I understand this test well and I'll explain what I have in mind:
Say you have 6 nodes, and you write to 1000 ranges. You then pin those ranges to have one replica on the node you want to later decommission. This means that that node will have 1000 replicas, and the other 5 nodes will have the other 2000 replicas, 400 on each. Now the cluster is imbalanced. Then you unpin replicas and decommission the node with the largest number of replicas. Note that the allocator now has 2 things to do: rebalance because some replicas are on a decommissioned node, and rebalance because some nodes are imbalanced due to range count (and load, in the test cases with load).
Is that fair to say?
I see why this would work, and I'm sure this would be useful, but I still think it would be better to have a more realistic scenario where the cluster is balanced, for example:
have 6 nodes, write to 1000 ranges, wait until the allocator does the right thing and things are balanced (roughly 500 replicas per node), decommission a node, wait until things are balanced again. This would work with any number of nodes.
Alternatively you can still pin stuff but in a balanced way. The problem still might be that once you unpin the allocator will start reshuffling which is not desired. We can probably make it work but it would be nice to first understand why the straightforward approach (no pinning) is not good.
Anyway, maybe we should have a quick chat.
Code quote:
default
pkg/cmd/roachtest/tests/decommissionbench.go
line 268 at r2 (raw file):
defer dbNode.Close()
if err := dbNode.QueryRow(`SELECT sum(range_count), sum(used) FROM crdb_internal.kv_store_status where node_id = $1`, target).Scan(&rangeCount, &bytesUsed); err != nil {
long line.
pkg/cmd/roachtest/tests/decommissionbench.go
line 296 at r2 (raw file):
return err
}
if err := h.checkDecommissioned(ctx, target, pinnedNode); err != nil {
can this be flaky? i.e. the replicas all migrated so waitReplicatedAwayFrom returns, but the status of the node is not yet decommissioned, and will be in a bit. should we waitForDecommissioned() instead?
Have to take a closer look at the body of runSingleDecommission
but this looks great! Like Lidor I'm also curious about results from your runs, how long things take for these configurations.
db := h.c.Conn(ctx, h.t.L(), runNode)
defer db.Close()
var membership string
if err := db.QueryRow("SELECT membership FROM crdb_internal.gossip_liveness WHERE node_id = $1", downNodeID).Scan(&membership); err != nil {
Instead of crdb_internal.gossip_liveness, let's use crdb_internal.kv_node_liveness. It does a consistent read against KV as opposed to waiting on gossip propagations. I don't think the difference will really matter in practice, but the KV state is authoritative and the gossip state follows from it.
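A minimal sketch of what that swap could look like, assuming the same membership column, the conventional gosql ("database/sql") alias, and the cockroachdb/errors package used elsewhere in these tests:
```
// checkFullyDecommissioned is a sketch (not the PR's helper) that reads the
// authoritative KV liveness record rather than the gossip-propagated copy.
func checkFullyDecommissioned(ctx context.Context, db *gosql.DB, downNodeID int) error {
	var membership string
	if err := db.QueryRowContext(ctx,
		`SELECT membership FROM crdb_internal.kv_node_liveness WHERE node_id = $1`,
		downNodeID,
	).Scan(&membership); err != nil {
		return err
	}
	if membership != "decommissioned" {
		return errors.Newf("n%d not fully decommissioned yet: membership=%s", downNodeID, membership)
	}
	return nil
}
```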
@@ -1230,6 +1164,43 @@ func newDecommTestHelper(t test.Test, c cluster.Cluster) *decommTestHelper {
	}
}

// pinRangesOnNode forces all default ranges to be replicated to the target node.
[nit] I find the primitive earlier slightly easier to follow/trace. How about a test helper to run arbitrary SQL + helpful comments?
h.run(fmt.Sprintf(`ALTER RANGE default CONFIGURE ZONE = 'constraints: {"-node%d"}'`, node)) // commentary about whether we're pinning or unpinning.
Feel free to ignore.
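A sketch of what that helper-based shape could look like; the runSQL name and the zone-config constraint syntax here are illustrative, built on the decommTestHelper methods quoted elsewhere in this review:
```
// runSQL executes a single statement against the given roachprod node.
func (h *decommTestHelper) runSQL(ctx context.Context, runNode int, stmt string) error {
	db := h.c.Conn(ctx, h.t.L(), runNode)
	defer db.Close()
	_, err := db.ExecContext(ctx, stmt)
	return err
}

// pinRangesOnNode requires one replica of every range in the default zone to
// live on the target node, identified by the node<N> attribute set at startup.
func (h *decommTestHelper) pinRangesOnNode(ctx context.Context, targetNode, runNode int) error {
	return h.runSQL(ctx, runNode, fmt.Sprintf(
		`ALTER RANGE default CONFIGURE ZONE USING constraints = '{"+node%d": 1}'`, targetNode))
}
```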
type decommissionBenchSpec struct {
	nodes int
	cpus  int
A few suggestions to reduce the test variants here:
- Do we like environments where AC is disabled? If not, I'm curious what the utility is for this knob in particular. Roachtests are expensive given they run nightly and with real VMs; here we're also importing a non-trivial dataset and with enough moving parts, the fewer variants we run the better.
- Ditto for CPUs, I see below for some of the variants we're using 4 vCPUs, which feels too little for realistic CRDB deployments. If we want to get data for constrained environments, with this benchmark test we can edit things accordingly, but for nightly runs I'm not sure we'd get as much value.
- Ditto for snapshot rate (where I don't think we're using non-default anywhere below).
- Ditto for warehouse count, would we get more value from nightly runs compared to pinning a single count and having the results for it plotted over time?
database := "tpcc" | ||
importCmd := fmt.Sprintf(`./cockroach workload fixtures import tpcc --warehouses=%d`, tc.warehouses) | ||
workloadCmd := fmt.Sprintf("./workload run tpcc --warehouses=%d --duration=%s --histograms=%s/stats.json --tolerate-errors {pgurl:1-%d}", tc.warehouses, testTimeout, t.PerfArtifactsDir(), tc.nodes) |
Curious: How long do these runs take currently?
h.t.Status(fmt.Sprintf("decommissioned node%d (n%d) in %s (ranges: %d, size: %s)", | ||
target, target, elapsed, rangeCount, humanizeutil.IBytes(bytesUsed))) | ||
|
||
if reuse { |
Let's add+litigate this bit in the subsequent PR: #81874.
pinnedNode int,
database string,
stopFirst, reuse bool,
decommTicker, upreplicateTicker *decommBenchTicker,
Did we mean to use the upreplicateTicker data somewhere?
LGTM mod comments here + earlier.
importCmd := fmt.Sprintf(`./cockroach workload fixtures import tpcc --warehouses=%d`, tc.warehouses)
workloadCmd := fmt.Sprintf("./workload run tpcc --warehouses=%d --duration=%s --histograms=%s/stats.json --tolerate-errors {pgurl:1-%d}", tc.warehouses, testTimeout, t.PerfArtifactsDir(), tc.nodes)
t.Status(fmt.Sprintf("initializing cluster with %d warehouses", tc.warehouses))
c.Run(ctx, c.Node(pinnedNode), importCmd)
Perhaps also worth plotting how long this import takes, I imagine it's a non-trivial portion of an individual run.
m := c.NewMonitor(workloadCtx, crdbNodes)

if tc.load {
	m.Go(
Do these workloads have a ramp time? Should they? If so, should we wait until that ramp period is done before initiating the decommission?
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @AlexTalks, @irfansharif, and @lidorcarmel)
pkg/cmd/roachtest/tests/decommission.go
line 1167 at r2 (raw file):
Previously, lidorcarmel (Lidor Carmel) wrote…
can we say "forces all ranges"? not sure I know what's a default range.
"default" ranges means all ranges in the "default" replication zone config which all non-system ranges are in by default, but I'll elaborate a little more here.
pkg/cmd/roachtest/tests/decommission.go
line 1264 at r2 (raw file):
Previously, lidorcarmel (Lidor Carmel) wrote…
is defer db.Close() good enough here? same below.
(I know it's not new code but still asking..)
Based on this documentation it sounds like even that is not strictly required, though it is a common pattern of resource cleanup in our roachtests.
If anything, it would be more performant to create a single db handle through the cluster.Conn() method and then use it as a connection pool, grabbing connections with conn := db.Conn(ctx) and closing them with defer conn.Close(), but I don't really see that pattern in use in roachtests and in this case in particular, given that we are using connections to multiple nodes and we are simulating a user running cockroach node decommission via a CLI, it doesn't hurt to create new connections each time (the overhead is fairly minimal).
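For reference, the pooled pattern being described would look roughly like this with database/sql (a sketch, not what the test does):
```
db := h.c.Conn(ctx, h.t.L(), runNode) // *sql.DB acts as a connection pool
defer db.Close()

conn, err := db.Conn(ctx) // check a single connection out of the pool
if err != nil {
	return err
}
defer conn.Close() // returns the connection to the pool, not a teardown
```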
pkg/cmd/roachtest/tests/decommission.go
line 1316 at r2 (raw file):
Previously, lidorcarmel (Lidor Carmel) wrote…
can node%d be different from n%d? why? it would be great to have just one of these (the n%d only).
Yes - I actually tripped on this initially, and unfortunately it only becomes more clear in the following PR with repeated decommissions. It is because if you have 4 VMs running Cockroach, initially they will get node IDs 1-4. When you decommission node 3, that ID 3 will never be reused... assuming you wipe the VM and start Cockroach anew, the Cockroach process on VM 3 will now have node ID 5. I attempted to detangle this by calling the ID of the VM "nodeID" and the ID of the Cockroach process on that VM the "logicalNodeID".
pkg/cmd/roachtest/tests/decommissionbench.go
line 9 at r2 (raw file):
Previously, lidorcarmel (Lidor Carmel) wrote…
should the filename be decommission_bench.go?
I would actually prefer that, though the other benchmark roachtests seem to be of the form tpccbench.go, kvbench.go!
pkg/cmd/roachtest/tests/decommissionbench.go
line 34 at r2 (raw file):
Previously, lidorcarmel (Lidor Carmel) wrote…
do you think it makes sense to use the default we have in the replicate queue? meaning, treat this as implementation details, so that when we change the prod default we might have roachtests failing, which will be great to see. If we have a default here then the test may succeed when updating the prod code.
Sure - I can change this so that if snapshot rate isn't specified in the test config, we don't try to set the cluster settings. That way a change in the default will be reflected in changes in roachperf.
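A sketch of that "only override when the spec asks for it" behavior, assuming the snapshot-rate cluster setting names current at the time of this PR and the db/benchSpec identifiers quoted elsewhere in the thread:
```
// Only touch the snapshot rate settings when the test spec overrides them;
// otherwise leave production defaults so roachperf reflects default changes.
if benchSpec.snapshotRate != 0 {
	for _, stmt := range []string{
		fmt.Sprintf(`SET CLUSTER SETTING kv.snapshot_rebalance.max_rate = '%dMiB'`, benchSpec.snapshotRate),
		fmt.Sprintf(`SET CLUSTER SETTING kv.snapshot_recovery.max_rate = '%dMiB'`, benchSpec.snapshotRate),
	} {
		if _, err := db.ExecContext(ctx, stmt); err != nil {
			t.Fatal(err)
		}
	}
}
```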
pkg/cmd/roachtest/tests/decommissionbench.go
line 39 at r2 (raw file):
Previously, irfansharif (irfan sharif) wrote…
A few suggestions to reduce the test variants here:
- Do we like environments where AC is disabled? If not, I'm curious what the utility is for this knob in particular. Roachtests are expensive given they run nightly and with real VMs; here we're also importing a non-trivial dataset and with enough moving parts, the fewer variants we run the better.
- Ditto for CPUs, I see below for some of the variants we're using 4 vCPUs, which feels too little for realistic CRDB deployments. If we want to get data for constrained environments, with this benchmark test we can edit things accordingly, but for nightly runs I'm not sure we'd get as much value.
- Ditto for snapshot rate (where I don't think we're using non-default anywhere below).
- Ditto for warehouse count, would we get more value from nightly runs compared to pinning a single count and having the results for it plotted over time?
I like the AC enabled/disabled knob, and the snapshot rate knob is also useful for experiments, but as for the variance in the number of CPUs and warehouses, they could probably be reduced.
pkg/cmd/roachtest/tests/decommissionbench.go
line 43 at r2 (raw file):
Previously, lidorcarmel (Lidor Carmel) wrote…
I'm assuming whileDown means we first take the node down and then decommission? vs decommission a live node? a comment would be nice IMO.
Yes - can add a comment.
pkg/cmd/roachtest/tests/decommissionbench.go
line 54 at r2 (raw file):
Previously, lidorcarmel (Lidor Carmel) wrote…
why only 100? do you get interesting results with such a small number? how many ranges and bytes will the decommissioned node have?
I think roachtests need to be more realistic and interesting, like a test a user would run. You're not verifying a decommission works, you're benchmarking whether it takes an hour or two, and I think that with 100 warehouses it will be way faster than that? I might be wrong..
Mostly useful for quick experiments. I suppose I can comment this out though...
Or I wonder if I can make it subject to an environment variable that has a default?
pkg/cmd/roachtest/tests/decommissionbench.go
line 81 at r2 (raw file):
Previously, lidorcarmel (Lidor Carmel) wrote…
I said it in slack - I really think you need to benchmark more than 4 nodes. 4 is special because after the decommission you only have 3 nodes, which means all replicas have exactly one node they can be moved to, so it's an easy job for the allocator. 5 nodes might be tricky because system ranges will need to be down replicated to 3 replicas (right? maybe not important?), I would try 4 and 8 or something like that.
Good to know, will add.
pkg/cmd/roachtest/tests/decommissionbench.go
line 144 at r2 (raw file):
Previously, lidorcarmel (Lidor Carmel) wrote…
maybe s/tc/spec/ ? I know tc is for test case.. but tc decommissionBenchSpec looks very random to me.
I think tc is often roachtest shorthand for "test config", but I could write that out, sure.
pkg/cmd/roachtest/tests/decommissionbench.go
line 162 at r2 (raw file):
Previously, irfansharif (irfan sharif) wrote…
Curious: How long do these runs take currently?
cpu=4, warehouses=100: ~3.5 minutes.
cpu=16, warehouses=1000: ~20 minutes.
pkg/cmd/roachtest/tests/decommissionbench.go
line 164 at r2 (raw file):
Previously, irfansharif (irfan sharif) wrote…
Perhaps also worth plotting how long this import takes, I imagine it's a non-trivial portion of an individual run.
True - can use the upreplicate ticker, though I may want to do this in a follow-up commit.
pkg/cmd/roachtest/tests/decommissionbench.go
line 192 at r2 (raw file):
Previously, irfansharif (irfan sharif) wrote…
Do these workloads have a ramp time? Should they? If so, should we wait until that ramp period is done before initiating the decommission?
Good question - in fact, I have seen some strange perf results at the start of the run, so I think it would be good to incorporate that, thanks!
pkg/cmd/roachtest/tests/decommissionbench.go
line 210 at r2 (raw file):
Previously, lidorcarmel (Lidor Carmel) wrote…
can you put a comment about what these lines are doing?
Will do!
pkg/cmd/roachtest/tests/decommissionbench.go
line 248 at r2 (raw file):
Previously, irfansharif (irfan sharif) wrote…
Did we mean to use the upreplicateTicker data somewhere?
I have it used in the follow-up PR (which runs decommissions in a loop).
pkg/cmd/roachtest/tests/decommissionbench.go
line 253 at r2 (raw file):
Previously, lidorcarmel (Lidor Carmel) wrote…
I think this is not great, but let's see if I understand this test well and I'll explain what I have in mind:
Say you have 6 nodes, and you write to 1000 ranges. You then pin those ranges to have one replica on the node you want to later decommission. This means that that node will have 1000 replicas, and the other 5 nodes will have the other 2000 replicas, 400 on each. Now the cluster is imbalanced. Then you unpin replicas and decommission the node with the largest number of replicas. Note that the allocator now has 2 things to do: rebalance because some replicas are on a decommissioned node, and rebalance because some nodes are imbalanced due to range count (and load, in the test cases with load).
Is that fair to say?
I see why this would work, and I'm sure this would be useful, but I still think it would be better to have a more realistic scenario where the cluster is balanced, for example:
have 6 nodes, write to 1000 ranges, wait until the allocator does the right thing and things are balanced (roughly 500 replicas per node), decommission a node, wait until things are balanced again. This would work with any number of nodes. Alternatively you can still pin stuff but in a balanced way. The problem still might be that once you unpin the allocator will start reshuffling which is not desired. We can probably make it work but it would be nice to first understand why the straightforward approach (no pinning) is not good.
Anyway, maybe we should have a quick chat.
Let's chat about this, I'm curious about your thoughts - my thought was about making the decommission time stable across runs, though perhaps as you mentioned it would be better to attempt to decommission on a balanced cluster.
I'm not entirely sure how to define "balanced" however - the allocator roachtest seems to do it based on a hard-coded limit in the stddev of range counts. I'll set up some time to chat.
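For concreteness, one possible definition of "balanced" (a sketch only; the threshold name is made up, and the real allocator roachtest may do this differently), reusing the db handle and the cockroachdb/errors helpers from the surrounding test code:
```
// Consider the cluster balanced when the per-node range counts are within a
// small standard deviation of each other.
var stdDev float64
if err := db.QueryRowContext(ctx, `
	SELECT stddev(range_count)
	  FROM (SELECT node_id, sum(range_count) AS range_count
	          FROM crdb_internal.kv_store_status GROUP BY node_id)`,
).Scan(&stdDev); err != nil {
	return err
}
if stdDev > maxRangeCountStdDev { // maxRangeCountStdDev: hypothetical tuning knob
	return errors.Newf("cluster not yet balanced: range-count stddev %.1f", stdDev)
}
```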
pkg/cmd/roachtest/tests/decommissionbench.go
line 296 at r2 (raw file):
Previously, lidorcarmel (Lidor Carmel) wrote…
can this be flaky? i.e. the replicas all migrated so waitReplicatedAwayFrom returns, but the status of the node is not yet decommissioned, and will be in a bit. should we waitForDecommissioned() instead?
It shouldn't be - and if it is, it would be an error. This is because h.decommission(..) a few lines above runs cockroach node decommission --wait=all, and as such it should not return without transitioning the liveness status of the node to decommissioned. So I think this check is ok - waitReplicatedAwayFrom() is really just a sanity check.
Oh and yes, can put some of the results in the commit message too!
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @AlexTalks, @irfansharif, and @lidorcarmel)
Force-pushed from 32195ab to af7ba4b
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @AlexTalks, @irfansharif, and @lidorcarmel)
pkg/cmd/roachtest/tests/decommission.go
line 1168 at r2 (raw file):
Previously, lidorcarmel (Lidor Carmel) wrote…
for symmetry it would be nice to name these targetNode and runNode.
Done. It's a little tricky as there's sort of 3 different concepts of "node" going on here:
- the Roachprod node, i.e. the index of the VM in the Roachprod cluster.
- the CockroachDB node ID, i.e. the logical NodeID assigned to the running Cockroach process at the time when it joins the cluster.
- the node# attribute we give each Cockroach process at startup so we have the ability to pin/unpin ranges using constraints like +node3. These IDs match the roachprod node IDs.
It might make the most sense to remove this entirely, however, if we don't want to force all the ranges onto our decommissioning node before starting the decommission.
pkg/cmd/roachtest/tests/decommission.go
line 1183 at r2 (raw file):
Previously, lidorcarmel (Lidor Carmel) wrote…
looks like this is always node 1? if yes, then maybe just put 1 here and avoid propagating this all the way from runDecommissionBench?
I think it might be safer to always pass in the target node we want to run against - that said based on our discussion I may remove this entirely.
pkg/cmd/roachtest/tests/decommission.go
line 1246 at r2 (raw file):
Previously, irfansharif (irfan sharif) wrote…
Instead of crdb_internal.gossip_liveness, let's use crdb_internal.kv_node_liveness. It does a consistent read against KV as opposed to waiting on gossip propagations. I don't think the difference will really matter in practice, but the KV state is authoritative and the gossip state follows from it.
Changed to use this, though don't we receive the gossip liveness after the trigger on updating kv_node_liveness is fired? Which I think means we'd see the state transition to decommissioned only after it is received by gossip, no? (I don't think it matters too much either way, though.)
pkg/cmd/roachtest/tests/decommission.go
line 1246 at r2 (raw file):
Previously, lidorcarmel (Lidor Carmel) wrote…
line too long?
Done.
pkg/cmd/roachtest/tests/decommissionbench.go
line 69 at r2 (raw file):
Previously, lidorcarmel (Lidor Carmel) wrote…
looks like you have 16 cores because you're using 1000 warehouses, right? I think we can load a lot more warehouses on 3x16 cores. It would be nice to know the cpu usage roughly, to make sure we don't benchmark with a 90% idle machines, both because of waste and being more realistic.
also - I think all test cases can use the same number of warehouses and cores to reduce scenarios and simplify things, but that's definitely up to you so feel free to ignore me. (Unless you want a single huge decommission where you have a node with many TBs)
According to https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/tests/tpcc.go#L719 our estimated max warehouses on GCP 3x16cores is only 2400 warehouses - these tests (so far) are mostly to benchmark with moderate load, not overloaded, so I went with just 1000. If this doesn't seem right though and I should bump it up let me know! Maybe should have chosen KV as a more straightforward workload, but I figured the amount of data in TPCC made for a better fit.
pkg/cmd/roachtest/tests/decommissionbench.go
line 144 at r2 (raw file):
Previously, AlexTalks (Alex Sarkesian) wrote…
I think tc is often roachtest shorthand for "test config", but I could write that out, sure.
Done.
pkg/cmd/roachtest/tests/decommissionbench.go
line 162 at r2 (raw file):
Previously, lidorcarmel (Lidor Carmel) wrote…
long line?
Done.
pkg/cmd/roachtest/tests/decommissionbench.go
line 210 at r2 (raw file):
Previously, AlexTalks (Alex Sarkesian) wrote…
Will do!
Done.
pkg/cmd/roachtest/tests/decommissionbench.go
line 268 at r2 (raw file):
Previously, lidorcarmel (Lidor Carmel) wrote…
long line.
Done.
Reviewed 4 of 4 files at r3, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @AlexTalks, @irfansharif, and @lidorcarmel)
pkg/cmd/roachtest/tests/decommission.go
line 1264 at r2 (raw file):
Previously, AlexTalks (Alex Sarkesian) wrote…
Based on this documentation it sounds like even that is not strictly required, though it is a common pattern of resource cleanup in our roachtests.
If anything, it would be more performant to create a single db handle through the cluster.Conn() method and then use it as a connection pool, grabbing connections with conn := db.Conn(ctx) and closing them with defer conn.Close(), but I don't really see that pattern in use in roachtests and in this case in particular, given that we are using connections to multiple nodes and we are simulating a user running cockroach node decommission via a CLI, it doesn't hurt to create new connections each time (the overhead is fairly minimal).
agree it's ok to recreate connections in a test, no need to reuse.
we do have this patterns, but I think we use the defer db.Close()
more often, and it's only 33% of the lines!
pkg/cmd/roachtest/tests/decommission.go
line 1316 at r2 (raw file):
Previously, AlexTalks (Alex Sarkesian) wrote…
Yes - I actually tripped on this initially, and unfortunately it only becomes more clear in the following PR with repeated decommissions. It is because if you have 4 VMs running Cockroach, initially they will get node IDs 1-4. When you decommission node 3, that ID 3 will never be reused... assuming you wipe the VM and start Cockroach anew, the Cockroach process on VM 3 will now have node ID 5. I attempted to detangle this by calling the ID of the VM "nodeID" and the ID of the Cockroach process on that VM the "logicalNodeID".
I see, thanks.
I think we discussed it a bit - if you cannot spin a new VM, how can you test something like decommissioning and adding a node at the same time?
If you will need to spin a new VM anyway then maybe never reuse VMs and only have one set of ids? (similar to how clusters work in prod).
If you want to proceed with this then probably you need a better comment, and maybe a TODO if you plan to get rid of this in the future.
pkg/cmd/roachtest/tests/decommissionbench.go
line 9 at r2 (raw file):
Previously, AlexTalks (Alex Sarkesian) wrote…
I would actually prefer that, though the other benchmark roachtests seem to be of the form tpccbench.go, kvbench.go!
I still vote for the _ but up to you..
pkg/cmd/roachtest/tests/decommissionbench.go
line 43 at r2 (raw file):
Previously, AlexTalks (Alex Sarkesian) wrote…
Yes - can add a comment.
forgot maybe?
pkg/cmd/roachtest/tests/decommissionbench.go
line 54 at r2 (raw file):
Previously, AlexTalks (Alex Sarkesian) wrote…
Mostly useful for quick experiments. I suppose I can comment this out though...
Or I wonder if I can make it subject to a environment variable that has a default?
I'd avoid env vars..
pkg/cmd/roachtest/tests/decommissionbench.go
line 69 at r2 (raw file):
Previously, AlexTalks (Alex Sarkesian) wrote…
According to https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/tests/tpcc.go#L719 our estimated max warehouses on GCP 3x16cores is only 2400 warehouses - these tests (so far) are mostly to benchmark with moderate load, not overloaded, so I went with just 1000. If this doesn't seem right though and I should bump it up let me know! Maybe should have chosen KV as a more straightforward workload, but I figured the amount of data in TPCC made for a better fit.
good question about tpcc vs kv load.. I guess we can change later if needed. And we can also always change the load, so 1000 sounds like a good start, thanks!
pkg/cmd/roachtest/tests/decommissionbench.go
line 296 at r2 (raw file):
Previously, AlexTalks (Alex Sarkesian) wrote…
It shouldn't be - and if it is, it would be an error. This is because h.decommission(..) a few lines above runs cockroach node decommission --wait=all, and as such it should not return without transitioning the liveness status of the node to decommissioned. So I think this check is ok - waitReplicatedAwayFrom() is really just a sanity check.
sg, thanks.
pkg/cmd/roachtest/tests/decommissionbench.go
line 174 at r3 (raw file):
defer db.Close()
if benchSpec.snapshotRate != 0 {
looks like you're not using snapshotRate in the test cases, intentional? if not using it then maybe remove?
pkg/cmd/roachtest/tests/decommissionbench.go
line 265 at r3 (raw file):
}
// Get the workload perf artifacts and move them to the pinned node, so that
is the term pinned node still valid after your recent update? here and maybe other places (a few lines above for example).
Code quote:
pinned node
pkg/cmd/roachtest/tests/decommissionbench.go
line 330 at r3 (raw file):
h.t.Status("waiting for cluster balance") if err := waitForRebalance(ctx, h.t.L(), dbNode, float64(totalRanges)/3.0); err != nil {
why total/3? pretty large stddev?
Code quote:
float64(totalRanges)/3.0
Force-pushed from 65f81f4 to 7b1cab0
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @AlexTalks, @irfansharif, and @lidorcarmel)
pkg/cmd/roachtest/tests/decommission.go
line 1264 at r2 (raw file):
Previously, lidorcarmel (Lidor Carmel) wrote…
agree it's ok to recreate connections in a test, no need to reuse.
we do have this pattern, but I think we use the defer db.Close() more often, and it's only 33% of the lines!
Done.
pkg/cmd/roachtest/tests/decommission.go
line 1316 at r2 (raw file):
Previously, lidorcarmel (Lidor Carmel) wrote…
I see, thanks.
I think we discussed it a bit - if you cannot spin a new VM, how can you test something like decommissioning and adding a node at the same time?
If you will need to spin a new VM anyway then maybe never reuse VMs and only have one set of ids? (similar to how clusters work in prod). If you want to proceed with this then probably you need a better comment, and maybe a TODO if you plan to get rid of this in the future.
If we want to do simultaneous decommissioning and adding a node we just have to increase our allocated Roachprod nodes by 1, and then just start it later - then immediately decommission our target node (I.e. before the new node reaches range count balance). Can add a TODO about this though.
Also I think the code this comment is attached to isn't actually used by this new test anymore and is only for existing tests, heh!
pkg/cmd/roachtest/tests/decommissionbench.go
line 43 at r2 (raw file):
Previously, lidorcarmel (Lidor Carmel) wrote…
forgot maybe?
I did, thanks!
pkg/cmd/roachtest/tests/decommissionbench.go
line 54 at r2 (raw file):
Previously, lidorcarmel (Lidor Carmel) wrote…
I'd avoid env vars..
Done.
pkg/cmd/roachtest/tests/decommissionbench.go
line 69 at r2 (raw file):
Previously, lidorcarmel (Lidor Carmel) wrote…
good question about tpcc vs kv load.. I guess we can change later if needed. And we can also always change the load, so 1000 sounds like a good start, thanks!
Done.
pkg/cmd/roachtest/tests/decommissionbench.go
line 85 at r2 (raw file):
Previously, lidorcarmel (Lidor Carmel) wrote…
I'd consider admission off as a rare thing and have 0 or 1 test cases for it, but definitely feel free to ignore me.. (also it depends how heavy/slow these test cases are).
Done.
pkg/cmd/roachtest/tests/decommissionbench.go
line 296 at r2 (raw file):
Previously, lidorcarmel (Lidor Carmel) wrote…
sg, thanks.
Done.
pkg/cmd/roachtest/tests/decommissionbench.go
line 305 at r2 (raw file):
Previously, irfansharif (irfan sharif) wrote…
Lets add+litigate this bit in the subsequent PR: #81874.
Done.
pkg/cmd/roachtest/tests/decommissionbench.go
line 174 at r3 (raw file):
Previously, lidorcarmel (Lidor Carmel) wrote…
looks like you're not using snapshotRate in the test cases, intentional? if not using it then maybe remove?
I've been mostly using it for manual testing right now but may add some configs that use it if we need to soon...
pkg/cmd/roachtest/tests/decommissionbench.go
line 265 at r3 (raw file):
Previously, lidorcarmel (Lidor Carmel) wrote…
is the term pinned node still valid after your recent update? here and maybe other places (a few lines above for example).
Yeah, I think it is still valid - it's basically the node we run our metadata queries and such against for the most part.
pkg/cmd/roachtest/tests/decommissionbench.go
line 330 at r3 (raw file):
Previously, lidorcarmel (Lidor Carmel) wrote…
why total/3? pretty large stddev?
It is - if I should make it tighter let me know, but for now I mainly just wanted something that I knew wouldn't error - the main thing necessary was waiting for the rebalance operations to stop. I figure the allocator test is the right place for the stddev bounds checking rather than here.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @AlexTalks, @irfansharif, and @lidorcarmel)
pkg/cmd/roachtest/tests/decommissionbench.go
line 39 at r2 (raw file):
Previously, AlexTalks (Alex Sarkesian) wrote…
I like the AC enabled/disabled knob, and the snapshot rate knob is also useful for experiments, but as for the variance in the number of CPUs and warehouses, they could probably be reduced.
Did you mean to reduce the variance in CPUs+warehouses? For the snapshot rate knob, if our intent is to use it during experiments, I wonder if we should just edit the code when running them instead of checking this in (we'd have to do this anyway with the current changes). As it stands, the unused knob is a bit confusing to readers. I also noticed that load is set to true for all of the variants. Do we want to get rid of it, or try a variant with/without load.
I want to press you slightly on whether it's a useful benchmark setup to run with AC disabled -- have we observed different results based on it? I ask all these because each variant is costly to run nightly and introduces more possibility for flakes to debug/diagnose. In my view just the following set (two variants) will be sufficient. We can always add more if we learn they provide additional value, or our replica movement considers different states differently.
nodes=4, cpu=16, warehouses=1000, whileDown={true,false}, load=true
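Spelled out against the decommissionBenchSpec fields quoted earlier, that reduced matrix would be roughly as follows (the registration helper name is illustrative, not the PR's actual function):
```
for _, benchSpec := range []decommissionBenchSpec{
	{nodes: 4, cpus: 16, warehouses: 1000, load: true, admissionControl: true, whileDown: false},
	{nodes: 4, cpus: 16, warehouses: 1000, load: true, admissionControl: true, whileDown: true},
} {
	registerDecommissionBench(r, benchSpec) // hypothetical registration helper
}
```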
Reviewed 1 of 1 files at r5, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @AlexTalks, @irfansharif, and @lidorcarmel)
pkg/cmd/roachtest/tests/decommission.go
line 1316 at r2 (raw file):
Previously, AlexTalks (Alex Sarkesian) wrote…
If we want to do simultaneous decommissioning and adding a node we just have to increase our allocated Roachprod nodes by 1, and then just start it later - then immediately decommission our target node (I.e. before the new node reaches range count balance). Can add a TODO about this though.
Also I think the code this comment is attached to isn't actually used by this new test anymore and is only for existing tests, heh!
- can you do the same strategy with repeated decommissions? say you want to add and decommission 10 nodes to a 4 node cluster, just create a 14 node cluster.
- the line here is new in this PR, right? this PR introduces the notion of 2 different node ids. if you have to do that - sure, but explain it well somewhere.
- I definitely see why you want this but the direction you're taking means you'll sometimes reuse an existing VM and then there will be 2 ids (hard to debug, someone will fall there for sure), and sometimes you will add an extra VM ahead and not reuse ids. picking one approach will simplify things IMO.
- one thing I'm not sure I understand - do we really care about the vm id? can we use the node id only? so that things will look the same whether we reuse an existing node or get a fresh one.
maybe a quick chat would be easier.
pkg/cmd/roachtest/tests/decommissionbench.go
line 174 at r3 (raw file):
Previously, AlexTalks (Alex Sarkesian) wrote…
I've been mostly using it for manual testing right now but may add some configs that use it if we need to soon...
add it now? just reduce the snapshot rate for one of the configs.
I don't like code that doesn't run under any test. unfortunately we don't have unit tests for roachtests, so if someone breaks the way we use snapshotRate here nothing will fail until we run this manually.
pkg/cmd/roachtest/tests/decommissionbench.go
line 265 at r3 (raw file):
Previously, AlexTalks (Alex Sarkesian) wrote…
Yeah, I think it is still valid - it's basically the node we run our metadata queries and such against for the most part.
ah, thanks, my misunderstanding - I thought this is the node with the pinned replicas.
so I guess we call it "pinned" because we don't decommission it? can be operator node but fine, maybe I'm just still thinking about pinning replicas, up to you..
pkg/cmd/roachtest/tests/decommissionbench.go
line 330 at r3 (raw file):
Previously, AlexTalks (Alex Sarkesian) wrote…
It is - if I should make it tighter let me know, but for now I mainly just wanted something that I knew wouldn't error - the main thing necessary was waiting for the rebalance operations to stop. I figure the allocator test is the right place for the stddev bounds checking rather than here.
I was thinking it should be a function of the ranges per node, not total. imagine we have 100 nodes, then the stddev is so high that we immediately call success (but maybe we also wait for no rebalancing? anyway, it means the stddev will not be used).
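A sketch of the per-node scaling being suggested, reusing the waitForRebalance call quoted above (numNodes and the 10% factor are illustrative, not from the PR):
```
// Scale the allowed stddev by the mean ranges per node rather than by the
// cluster-wide total, so larger clusters still get a meaningful bound.
meanRangesPerNode := float64(totalRanges) / float64(numNodes)
maxStdDev := 0.1 * meanRangesPerNode // 10% of the per-node mean (a guess)
if err := waitForRebalance(ctx, h.t.L(), dbNode, maxStdDev); err != nil {
	return err
}
```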
Force-pushed from 7b1cab0 to 46671fd
While previously some roachtests existed for the purposes of testing the decommission process, we have not had any benchmarks to track how long it takes to decommission a node, making it difficult to reason about what makes decommission slow. This change adds benchmarks for node decommission under a number of configurations, including variable numbers of nodes/cpus, TPCC warehouses, and with admission control enabled vs. disabled.

Some initial runs of the test have shown the following averages:

```
decommissionBench/nodes=4/cpu=16/warehouses=1000: 16m14s
decommissionBench/nodes=4/cpu=16/warehouses=1000/no-admission: 15m48s
decommissionBench/nodes=4/cpu=16/warehouses=1000/while-down: 20m36s
decommissionBench/nodes=8/cpu=16/warehouses=3000: 18m30s
```

Release note: None
Force-pushed from 46671fd to 90eb477
bors r+
Build succeeded:
81874: roachtest: benchmark repeated decommissions over time r=AlexTalks a=AlexTalks

This adds new long-running configurations to the `decommissionBench` roachtest, so that repeated decommissions over time in a cluster can be benchmarked individually. With these long-running tests, we've seen the following averages:

```
decommissionBench/nodes=4/cpu=16/warehouses=1000/duration=1h0m0s: avg decommission: 16m41s, avg upreplication: 33m17s
decommissionBench/nodes=4/cpu=4/warehouses=100: avg decommission: 2m44s, avg upreplication: 4m3s
```

Depends on #81565.

Release Note: None

84869: sql: don't embed NodeInfo into ExecutorConfig r=andreimatei a=andreimatei

This patch turns an embedded struct into a field. With the NodeInfo being embedded, it was hard to grep for all the places that use its member fields, and I found myself wanting to know that.

Release note: None

Co-authored-by: Alex Sarkesian <[email protected]>
Co-authored-by: Andrei Matei <[email protected]>
roachtest: benchmark node decommission
While previously some roachtests existed for the purposes of
testing the decommission process, we have not had any benchmarks to
track how long it takes to decommission a node, making it difficult to
reason about what makes decommission slow. This
change adds benchmarks for node decommission under a number of
configurations, including variable numbers of nodes/cpus, TPCC
warehouses, and with admission control enabled vs. disabled.
Some initial runs of the test have shown the following averages:
Release note: None