
roachtest: benchmark node decommission #81565

Merged 1 commit into cockroachdb:master from benchmark_decomm on Jun 11, 2022

Conversation

Contributor

@AlexTalks commented May 20, 2022

roachtest: benchmark node decommission

While some roachtests already existed to exercise the decommission
process, we have not had any benchmarks tracking how long it takes to
decommission a node, making it difficult to understand what makes
decommission slow. This change adds benchmarks for node decommission
under a number of configurations, including varying numbers of
nodes/CPUs, TPCC warehouses, and admission control enabled vs.
disabled.

Some initial runs of the test have shown the following averages:

```
decommissionBench/nodes=4/cpu=16/warehouses=1000: 16m14s
decommissionBench/nodes=4/cpu=16/warehouses=1000/no-admission: 15m48s
decommissionBench/nodes=4/cpu=16/warehouses=1000/while-down: 20m36s
decommissionBench/nodes=8/cpu=16/warehouses=3000: 18m30s
```

Release note: None

@AlexTalks added the do-not-merge label (bors won't merge a PR with this label) on May 20, 2022
@cockroach-teamcity
Member

This change is Reviewable

Contributor Author

@AlexTalks left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @AlexTalks)


pkg/cmd/roachtest/tests/decommissionbench.go line 283 at r1 (raw file):

		}

		// TODO(sarkesian): wipe the node and re-add to cluster

note: long running benchmark test in follow-up commit #81874

@AlexTalks force-pushed the benchmark_decomm branch 2 times, most recently from 04659d6 to 0709829 on June 1, 2022 17:23
@AlexTalks removed the do-not-merge label (bors won't merge a PR with this label) on Jun 2, 2022
@AlexTalks AlexTalks requested a review from irfansharif June 2, 2022 20:27
@AlexTalks force-pushed the benchmark_decomm branch 3 times, most recently from 6795f05 to 32195ab on June 2, 2022 23:24
Contributor

@lidorcarmel left a comment

this is awesome to have, do you already have numbers to show? if yes, it would be great to put that in the commit/pr message.

Reviewed 4 of 4 files at r2, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @AlexTalks and @irfansharif)


pkg/cmd/roachtest/tests/decommission.go line 1167 at r2 (raw file):

}

// pinRangesOnNode forces all default ranges to be replicated to the target node.

can we say "forces all ranges"? not sure I know what's a default range.

Code quote:

forces all default ranges

pkg/cmd/roachtest/tests/decommission.go line 1168 at r2 (raw file):

// pinRangesOnNode forces all default ranges to be replicated to the target node.
func (h *decommTestHelper) pinRangesOnNode(ctx context.Context, target, runNode int) error {

for symmetry it would be nice to name these targetNode and runNode.

Code quote:

target, runNode int

pkg/cmd/roachtest/tests/decommission.go line 1183 at r2 (raw file):

	ctx context.Context, target, runNode int, unpin bool,
) error {
	db := h.c.Conn(ctx, h.t.L(), runNode)

looks like this is always node 1? if yes, then maybe just put 1 here and avoid propagating this all the way from runDecommissionBench?

Code quote:

runNode)

pkg/cmd/roachtest/tests/decommission.go line 1246 at r2 (raw file):

	defer db.Close()
	var membership string
	if err := db.QueryRow("SELECT membership FROM crdb_internal.gossip_liveness WHERE node_id = $1", downNodeID).Scan(&membership); err != nil {

line too long?

Code quote:

./cockroach quit --insecure --host

pkg/cmd/roachtest/tests/decommission.go line 1264 at r2 (raw file):

	db := h.c.Conn(ctx, h.t.L(), runNode)
	defer func() {
		_ = db.Close()

is

defer db.Close()

good enough here? same below.
(I know it's not new code but still asking..)

Code quote:

	defer func() {
		_ = db.Close()
	}()

pkg/cmd/roachtest/tests/decommission.go line 1316 at r2 (raw file):

			return err
		}
		h.t.L().Printf("node%d (n%d) awaiting %d replica(s)", targetNode, targetNodeID, count)

can node%d be different from n%d? why? it would be great to have just one of these (the n%d only).

Code quote:

node%d (n%d)

pkg/cmd/roachtest/tests/decommissionbench.go line 283 at r1 (raw file):

Previously, AlexTalks (Alex Sarkesian) wrote…

note: long running benchmark test in follow-up commit

why do you need to reuse the same node?
First, do we really need a test that decommissions a node, and then adds a node and then decommissions again? I'm assuming this is roughly what you have in mind.
Second, even if this is needed, you can use a new VM - or more than that - you will need the ability to spin a new VM anyway to test things like, decommission and add a new node at the same time.
I would drop this reuse feature in this commit and only introduce it when it's really needed.


pkg/cmd/roachtest/tests/decommissionbench.go line 9 at r2 (raw file):

// the Business Source License, use of this software will be governed
// by the Apache License, Version 2.0, included in the file
// licenses/APL.txt.

should the filename be decommission_bench.go?


pkg/cmd/roachtest/tests/decommissionbench.go line 34 at r2 (raw file):

const (
	defaultTimeout        = 1 * time.Hour
	defaultSnapshotRateMb = 32

do you think it makes sense to use the default we have in the replicate queue? meaning, treat this as implementation details, so that when we change the prod default we might have roachtests failing, which will be great to see. If we have a default here then the test may succeed when updating the prod code.


pkg/cmd/roachtest/tests/decommissionbench.go line 43 at r2 (raw file):

	load             bool
	admissionControl bool
	whileDown        bool

I'm assuming whileDown means we first take the node down and then decommission? vs decommission a live node? a comment would be nice IMO.


pkg/cmd/roachtest/tests/decommissionbench.go line 54 at r2 (raw file):

			nodes:            4,
			cpus:             4,
			warehouses:       100,

why only 100? do you get interesting results with such a small number? how many ranges and bytes will the decommissioned node have?
I think roachtests need to be more realistic and interesting, like a test a user would run. You're not verifying a decommission works, you're benchmarking whether it takes an hour or two, and I think that with 100 warehouses it will be way faster than that? I might be wrong..


pkg/cmd/roachtest/tests/decommissionbench.go line 69 at r2 (raw file):

			nodes:            4,
			cpus:             16,
			warehouses:       1000,

looks like you have 16 cores because you're using 1000 warehouses, right? I think we can load a lot more warehouses on 3x16 cores. It would be nice to know the cpu usage roughly, to make sure we don't benchmark with a 90% idle machines, both because of waste and being more realistic.

also - I think all test cases can use the same number of warehouses and cores to reduce scenarios and simplify things, but that's definitely up to you so feel free to ignore me. (Unless you want a single huge decommission where you have a node with many TBs)


pkg/cmd/roachtest/tests/decommissionbench.go line 81 at r2 (raw file):

		},
		{
			nodes:            4,

I said it in slack - I really think you need to benchmark more than 4 nodes. 4 is special because after the decommission you only have 3 nodes, which means all replicas have exactly one node they can be moved to, so it's an easy job for the allocator. 5 nodes might be tricky because system ranges will need to be down replicated to 3 replicas (right? maybe not important?), I would try 4 and 8 or something like that.


pkg/cmd/roachtest/tests/decommissionbench.go line 85 at r2 (raw file):

			warehouses:       1000,
			load:             false,
			admissionControl: false,

I'd consider admission off as a rare thing and have 0 or 1 test cases for it, but definitely feel free to ignore me.. (also it depends how heavy/slow these test cases are).


pkg/cmd/roachtest/tests/decommissionbench.go line 144 at r2 (raw file):

	t test.Test,
	c cluster.Cluster,
	tc decommissionBenchSpec,

maybe s/tc/spec/ ? I know tc is for test case.. but tc decommissionBenchSpec looks very random to me.

Code quote:

tc decommissionBenchSpec,

pkg/cmd/roachtest/tests/decommissionbench.go line 162 at r2 (raw file):

	database := "tpcc"
	importCmd := fmt.Sprintf(`./cockroach workload fixtures import tpcc --warehouses=%d`, tc.warehouses)
	workloadCmd := fmt.Sprintf("./workload run tpcc --warehouses=%d --duration=%s --histograms=%s/stats.json --tolerate-errors {pgurl:1-%d}", tc.warehouses, testTimeout, t.PerfArtifactsDir(), tc.nodes)

long line?


pkg/cmd/roachtest/tests/decommissionbench.go line 210 at r2 (raw file):

	tick, perfBuf := initBulkJobPerfArtifacts(t.Name(), defaultTimeout)
	recorder := &decommBenchTicker{pre: tick, post: tick}

can you put a comment about what these lines are doing?

Code quote:

	tick, perfBuf := initBulkJobPerfArtifacts(t.Name(), defaultTimeout)
	recorder := &decommBenchTicker{pre: tick, post: tick}

pkg/cmd/roachtest/tests/decommissionbench.go line 253 at r2 (raw file):

	h.t.Status(fmt.Sprintf("targeting node%d (n%d) for decommission", target, target))

	// Pin all default ranges (all warehouses) to the node, for uniformity.

I think this is not great, but let's see if I understand this test well and I'll explain what I have in mind:

Say you have 6 nodes, and you write to 1000 ranges. You then pin those ranges to have one replica on the node you want to later decommission. This means that that node will have 1000 replicas, and the other 5 nodes will have the other 2000 replicas, 400 on each. Now the cluster is imbalanced. Then you unpin replicas and decommission the node with the largest number of replicas. Note that the allocator now has 2 things to do: rebalance because some replicas are on a decommissioned node, and rebalanced because some nodes are imbalanced due to range count (and load, in the test cases with load).

Is that fair to say?

I see why this would work, and I'm sure this would be useful, but I still think it would be better to have a more realistic scenario where the cluster is balanced, for example:
have 6 nodes, write to 1000 ranges, wait until the allocator does the right thing and things are balanced (roughly 500 replicas per node), decommission a node, wait until things are balanced again. This would work with any number of nodes.

Alternatively you can still pin stuff but in a balanced way. The problem still might be that once you unpin the allocator will start reshuffling which is not desired. We can probably make it work but it would be nice to first understand why the straightforward approach (no pinning) is not good.

Anyway, maybe we should have a quick chat.

Code quote:

default

pkg/cmd/roachtest/tests/decommissionbench.go line 268 at r2 (raw file):

		defer dbNode.Close()

		if err := dbNode.QueryRow(`SELECT sum(range_count), sum(used) FROM crdb_internal.kv_store_status where node_id = $1`, target).Scan(&rangeCount, &bytesUsed); err != nil {

long line.


pkg/cmd/roachtest/tests/decommissionbench.go line 296 at r2 (raw file):

		return err
	}
	if err := h.checkDecommissioned(ctx, target, pinnedNode); err != nil {

can this be flaky? i.e. the replicas all migrated so waitReplicatedAwayFrom returns, but the status of the node is not yet decommissioned, and will be in a bit. should we waitForDecommissioned() instead?

Contributor

@irfansharif left a comment

Have to take a closer look at the body of runSingleDecommission but this looks great! Like Lidor, I'm also curious about results from your runs, and how long things take for these configurations.

db := h.c.Conn(ctx, h.t.L(), runNode)
defer db.Close()
var membership string
if err := db.QueryRow("SELECT membership FROM crdb_internal.gossip_liveness WHERE node_id = $1", downNodeID).Scan(&membership); err != nil {

Instead of crdb_internal.gossip_liveness, let's use crdb_internal.kv_node_liveness. It does a consistent read against KV as opposed to waiting on gossip propagations. I don't think the difference will really matter in practice, but the KV state is authoritative and the gossip state follows from it.
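
A minimal sketch of what that swap looks like, using only context and database/sql, and assuming crdb_internal.kv_node_liveness exposes the same membership column as the gossip table (the helper name is illustrative, not from this PR's diff):

```go
import (
	"context"
	"database/sql"
)

// checkIsDecommissioned reads the node's membership status from the
// authoritative KV liveness record (crdb_internal.kv_node_liveness) rather
// than the gossip-propagated copy. Sketch only.
func checkIsDecommissioned(ctx context.Context, db *sql.DB, nodeID int) (bool, error) {
	var membership string
	if err := db.QueryRowContext(ctx,
		`SELECT membership FROM crdb_internal.kv_node_liveness WHERE node_id = $1`,
		nodeID,
	).Scan(&membership); err != nil {
		return false, err
	}
	return membership == "decommissioned", nil
}
```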

@@ -1230,6 +1164,43 @@ func newDecommTestHelper(t test.Test, c cluster.Cluster) *decommTestHelper {
}
}

// pinRangesOnNode forces all default ranges to be replicated to the target node.

[nit] I find the primitive earlier slightly easier to follow/trace. How about a test helper to run arbitrary SQL + helpful comments?

h.run(fmt.Sprintf(`ALTER RANGE default CONFIGURE ZONE = 'constraints: {"-node%d"}'`, node)) // commentary about whether we're pinning or unpinning.

Feel free to ignore.


type decommissionBenchSpec struct {
nodes int
cpus int

A few suggestions to reduce the test variants here:

  • Do we like environments where AC is disabled? If not, I'm curious what the utility is for this knob in particular. Roachtests are expensive given they run nightly and with real VMs; here we're also importing a non-trivial dataset and with enough moving parts, the fewer variants we run the better.
  • Ditto for CPUs, I see below for some of the variants we're using 4 vCPUs, which feels too little for realistic CRDB deployments. If we want to get data for constrained environments, with this benchmark test we can edit things accordingly, but for nightly runs I'm not sure we'd get as much value.
  • Ditto for snapshot rate (where I don't think we're using non-default anywhere below).
  • Ditto for warehouse count, would we get more value from nightly runs compared to pinning a single count and having the results for it plotted over time?


database := "tpcc"
importCmd := fmt.Sprintf(`./cockroach workload fixtures import tpcc --warehouses=%d`, tc.warehouses)
workloadCmd := fmt.Sprintf("./workload run tpcc --warehouses=%d --duration=%s --histograms=%s/stats.json --tolerate-errors {pgurl:1-%d}", tc.warehouses, testTimeout, t.PerfArtifactsDir(), tc.nodes)

Curious: How long do these runs take currently?

h.t.Status(fmt.Sprintf("decommissioned node%d (n%d) in %s (ranges: %d, size: %s)",
target, target, elapsed, rangeCount, humanizeutil.IBytes(bytesUsed)))

if reuse {

Let's add+litigate this bit in the subsequent PR: #81874.

pinnedNode int,
database string,
stopFirst, reuse bool,
decommTicker, upreplicateTicker *decommBenchTicker,

Did we mean to use the upreplicateTicker data somewhere?

Contributor

@irfansharif left a comment

LGTM mod comments here + earlier.

importCmd := fmt.Sprintf(`./cockroach workload fixtures import tpcc --warehouses=%d`, tc.warehouses)
workloadCmd := fmt.Sprintf("./workload run tpcc --warehouses=%d --duration=%s --histograms=%s/stats.json --tolerate-errors {pgurl:1-%d}", tc.warehouses, testTimeout, t.PerfArtifactsDir(), tc.nodes)
t.Status(fmt.Sprintf("initializing cluster with %d warehouses", tc.warehouses))
c.Run(ctx, c.Node(pinnedNode), importCmd)

Perhaps also worth plotting how long this import takes, I imagine it's a non-trivial portion of an individual run.

m := c.NewMonitor(workloadCtx, crdbNodes)

if tc.load {
m.Go(

Do these workloads have a ramp time? Should they? If so, should we wait until that ramp period is done before initiating the decommission?

Contributor Author

@AlexTalks left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @AlexTalks, @irfansharif, and @lidorcarmel)


pkg/cmd/roachtest/tests/decommission.go line 1167 at r2 (raw file):

Previously, lidorcarmel (Lidor Carmel) wrote…

can we say "forces all ranges"? not sure I know what's a default range.

"default" ranges means all ranges in the "default" replication zone config which all non-system ranges are in by default, but I'll elaborate a little more here.


pkg/cmd/roachtest/tests/decommission.go line 1264 at r2 (raw file):

Previously, lidorcarmel (Lidor Carmel) wrote…

is

defer db.Close()

good enough here? same below.
(I know it's not new code but still asking..)

Based on this documentation it sounds like even that is not strictly required, though it is a common pattern of resource cleanup in our roachtests.

If anything, it would be more performant to create a single db handle through the cluster.Conn() method and then use it as a connection pool, grabbing connections with conn := db.Conn(ctx) and closing them with defer conn.Close(). I don't really see that pattern in use in roachtests, though, and in this case in particular, given that we are using connections to multiple nodes and are simulating a user running cockroach node decommission via the CLI, it doesn't hurt to create new connections each time (the overhead is fairly minimal).
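
For reference, the pooled variant described above would look roughly like this, a sketch against the standard database/sql API rather than code from this PR:

```go
import (
	"context"
	"database/sql"
)

// runWithPooledConn keeps one *sql.DB per node and checks individual
// connections in and out of it, instead of dialing a fresh connection for
// every statement. Sketch only; the test sticks with one connection per call.
func runWithPooledConn(ctx context.Context, db *sql.DB, stmt string) error {
	conn, err := db.Conn(ctx) // borrow a connection from the pool
	if err != nil {
		return err
	}
	defer conn.Close() // returns the connection to the pool, does not close the pool

	_, err = conn.ExecContext(ctx, stmt)
	return err
}
```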


pkg/cmd/roachtest/tests/decommission.go line 1316 at r2 (raw file):

Previously, lidorcarmel (Lidor Carmel) wrote…

can node%d be different from n%d? why? it would be great to have just one of these (the n%d only).

Yes - I actually tripped on this initially, and unfortunately it only becomes more clear in the following PR with repeated decommissions. It is because if you have 4 VMs running Cockroach, initially they will get node IDs 1-4. When you decommission node 3, that ID 3 will never be reused... assuming you wipe the VM and start Cockroach anew, the Cockroach process on VM 3 will now have node ID 5. I attempted to detangle this by calling the ID of the VM "nodeID" and the ID of the Cockroach process on that VM the "logicalNodeID".


pkg/cmd/roachtest/tests/decommissionbench.go line 9 at r2 (raw file):

Previously, lidorcarmel (Lidor Carmel) wrote…

should the filename be decommission_bench.go?

I would actually prefer that, though the other benchmark roachtests seem to be of the form tpccbench.go, kvbench.go!


pkg/cmd/roachtest/tests/decommissionbench.go line 34 at r2 (raw file):

Previously, lidorcarmel (Lidor Carmel) wrote…

do you think it makes sense to use the default we have in the replicate queue? meaning, treat this as implementation details, so that when we change the prod default we might have roachtests failing, which will be great to see. If we have a default here then the test may succeed when updating the prod code.

Sure - I can change this so that if snapshot rate isn't specified in the test config, we don't try to set the cluster settings. That way a change in the default will be reflected in changes in roachperf.
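
A sketch of what that conditional could look like; benchSpec, db, ctx, and t are assumed from the surrounding test function, and the two setting names are the snapshot rate cluster settings as of this era of CockroachDB:

```go
// Only override the snapshot rates when the spec explicitly asks for it;
// otherwise the benchmark runs against the production defaults, so a change
// to those defaults shows up on roachperf. Sketch only.
if benchSpec.snapshotRate != 0 {
	for _, setting := range []string{
		"kv.snapshot_rebalance.max_rate",
		"kv.snapshot_recovery.max_rate",
	} {
		stmt := fmt.Sprintf(`SET CLUSTER SETTING %s = '%dMiB'`, setting, benchSpec.snapshotRate)
		if _, err := db.ExecContext(ctx, stmt); err != nil {
			t.Fatal(err)
		}
	}
}
```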


pkg/cmd/roachtest/tests/decommissionbench.go line 39 at r2 (raw file):

Previously, irfansharif (irfan sharif) wrote…

A few suggestions to reduce the test variants here:

  • Do we like environments where AC is disabled? If not, I'm curious what the utility is for this knob in particular. Roachtests are expensive given they run nightly and with real VMs; here we're also importing a non-trivial dataset and with enough moving parts, the fewer variants we run the better.
  • Ditto for CPUs, I see below for some of the variants we're using 4 vCPUs, which feels too little for realistic CRDB deployments. If we want to get data for constrained environments, with this benchmark test we can edit things accordingly, but for nightly runs I'm not sure we'd get as much value.
  • Ditto for snapshot rate (where I don't think we're using non-default anywhere below).
  • Ditto for warehouse count, would we get more value from nightly runs compared to pinning a single count and having the results for it plotted over time?

I like the AC enabled/disabled knob, and the snapshot rate knob is also useful for experiments, but as for the variance in the number of CPUs and warehouses, they could probably be reduced.


pkg/cmd/roachtest/tests/decommissionbench.go line 43 at r2 (raw file):

Previously, lidorcarmel (Lidor Carmel) wrote…

I'm assuming whileDown means we first take the node down and then decommission? vs decommission a live node? a comment would be nice IMO.

Yes - can add a comment.


pkg/cmd/roachtest/tests/decommissionbench.go line 54 at r2 (raw file):

Previously, lidorcarmel (Lidor Carmel) wrote…

why only 100? do you get interesting results with such a small number? how many ranges and bytes will the decommissioned node have?
I think roachtests need to be more realistic and interesting, like a test a user would run. You're not verifying a decommission works, you're benchmarking whether it takes an hour or two, and I think that with 100 warehouses it will be way faster than that? I might be wrong..

Mostly useful for quick experiments. I suppose I can comment this out though...

Or I wonder if I can make it subject to an environment variable that has a default?


pkg/cmd/roachtest/tests/decommissionbench.go line 81 at r2 (raw file):

Previously, lidorcarmel (Lidor Carmel) wrote…

I said it in slack - I really think you need to benchmark more than 4 nodes. 4 is special because after the decommission you only have 3 nodes, which means all replicas have exactly one node they can be moved to, so it's an easy job for the allocator. 5 nodes might be tricky because system ranges will need to be down replicated to 3 replicas (right? maybe not important?), I would try 4 and 8 or something like that.

Good to know, will add.


pkg/cmd/roachtest/tests/decommissionbench.go line 144 at r2 (raw file):

Previously, lidorcarmel (Lidor Carmel) wrote…

maybe s/tc/spec/ ? I know tc is for test case.. but tc decommissionBenchSpec looks very random to me.

I think tc is often roachtest shorthand for "test config", but I could write that out, sure.


pkg/cmd/roachtest/tests/decommissionbench.go line 162 at r2 (raw file):

Previously, irfansharif (irfan sharif) wrote…

Curious: How long do these runs take currently?

cpu=4, warehouses=100: ~3.5 minutes.
cpu=16, warehouses=1000: ~20 minutes.


pkg/cmd/roachtest/tests/decommissionbench.go line 164 at r2 (raw file):

Previously, irfansharif (irfan sharif) wrote…

Perhaps also worth plotting how long this import takes, I imagine it's a non-trivial portion of an individual run.

True - can use the upreplicate ticker, though I may want to do this in a follow-up commit.


pkg/cmd/roachtest/tests/decommissionbench.go line 192 at r2 (raw file):

Previously, irfansharif (irfan sharif) wrote…

Do these workloads have a ramp time? Should they? If so, should we wait until that ramp period is done before initiating the decommission?

Good question - in fact, I have seen some strange perf results at the start of the run, so I think it would be good to incorporate that, thanks!
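
A hedged sketch of wiring that in: the tpcc workload accepts a --ramp flag, but the rampDuration variable and the explicit wait below are illustrative rather than the code that eventually landed:

```go
// Give the workload a ramp period and wait it out before starting the
// decommission, so cold-start noise doesn't land in the measurement.
// rampDuration is a hypothetical variable; benchSpec/testTimeout come from
// the surrounding test.
rampDuration := 5 * time.Minute
workloadCmd := fmt.Sprintf(
	"./workload run tpcc --warehouses=%d --ramp=%s --duration=%s "+
		"--histograms=%s/stats.json --tolerate-errors {pgurl:1-%d}",
	benchSpec.warehouses, rampDuration, testTimeout, t.PerfArtifactsDir(), benchSpec.nodes,
)
_ = workloadCmd // started on the workload node elsewhere in the test

// Block until the ramp has elapsed (or the test context is cancelled).
select {
case <-time.After(rampDuration):
case <-ctx.Done():
	return ctx.Err()
}
```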


pkg/cmd/roachtest/tests/decommissionbench.go line 210 at r2 (raw file):

Previously, lidorcarmel (Lidor Carmel) wrote…

can you put a comment about what these lines are doing?

Will do!


pkg/cmd/roachtest/tests/decommissionbench.go line 248 at r2 (raw file):

Previously, irfansharif (irfan sharif) wrote…

Did we mean to use the upreplicateTicker data somewhere?

I have it used in the follow-up PR (which runs decommissions in a loop).


pkg/cmd/roachtest/tests/decommissionbench.go line 253 at r2 (raw file):

Previously, lidorcarmel (Lidor Carmel) wrote…

I think this is not great, but let's see if I understand this test well and I'll explain what I have in mind:

Say you have 6 nodes, and you write to 1000 ranges. You then pin those ranges to have one replica on the node you want to later decommission. This means that that node will have 1000 replicas, and the other 5 nodes will have the other 2000 replicas, 400 on each. Now the cluster is imbalanced. Then you unpin replicas and decommission the node with the largest number of replicas. Note that the allocator now has 2 things to do: rebalance because some replicas are on a decommissioned node, and rebalanced because some nodes are imbalanced due to range count (and load, in the test cases with load).

Is that fair to say?

I see why this would work, and I'm sure this would be useful, but I still think it would be better to have a more realistic scenario where the cluster is balanced, for example:
have 6 nodes, write to 1000 ranges, wait until the allocator does the right thing and things are balanced (roughly 500 replicas per node), decommission a node, wait until things are balanced again. This would work with any number of nodes.

Alternatively you can still pin stuff but in a balanced way. The problem still might be that once you unpin the allocator will start reshuffling which is not desired. We can probably make it work but it would be nice to first understand why the straightforward approach (no pinning) is not good.

Anyway, maybe we should have a quick chat.

Let's chat about this, I'm curious about your thoughts - my thought was about making the decommission time stable across runs, though perhaps as you mentioned it would be better to attempt to decommission on a balanced cluster.

I'm not entirely sure how to define "balanced" however - the allocator roachtest seems to do it based on a hard-coded limit in the stddev of range counts. I'll set up some time to chat.


pkg/cmd/roachtest/tests/decommissionbench.go line 296 at r2 (raw file):

Previously, lidorcarmel (Lidor Carmel) wrote…

can this be flaky? i.e. the replicas all migrated so waitReplicatedAwayFrom returns, but the status of the node is not yet decommissioned, and will be in a bit. should we waitForDecommissioned() instead?

It shouldn't be - and if it is, it would be an error. This is because h.decommission(..) a few lines above runs cockroach node decommission --wait=all, and as such it should not return without transitioning the liveness status of the node to decommissioned. So I think this check is ok - waitReplicatedAwayFrom() is really just a sanity check.

Contributor Author

@AlexTalks left a comment

Oh and yes, can put some of the results in the commit message too!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @AlexTalks, @irfansharif, and @lidorcarmel)

@blathers-crl blathers-crl bot requested a review from irfansharif June 9, 2022 00:23
Contributor Author

@AlexTalks left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @AlexTalks, @irfansharif, and @lidorcarmel)


pkg/cmd/roachtest/tests/decommission.go line 1168 at r2 (raw file):

Previously, lidorcarmel (Lidor Carmel) wrote…

for symmetry it would be nice to name these targetNode and runNode.

Done. It's a little tricky as there's sort of 3 different concepts of "node" going on here:

  1. the Roachprod node, i.e. the index of the VM in the Roachprod cluster.
  2. the CockroachDB node ID, i.e. the logical NodeID assigned to the running Cockroach process at the time when it joins the cluster.
  3. the node# attribute we give each Cockroach process at startup so we have the ability to pin/unpin ranges using constraints like +node3. These IDs match the roachprod node IDs.

It might make the most sense to remove this entirely, however, if we don't want to force all the ranges onto our decommissioning node before starting the decommission.


pkg/cmd/roachtest/tests/decommission.go line 1183 at r2 (raw file):

Previously, lidorcarmel (Lidor Carmel) wrote…

looks like this is always node 1? if yes, then maybe just put 1 here and avoid propagating this all the way from runDecommissionBench?

I think it might be safer to always pass in the target node we want to run against - that said based on our discussion I may remove this entirely.


pkg/cmd/roachtest/tests/decommission.go line 1246 at r2 (raw file):

Previously, irfansharif (irfan sharif) wrote…

Instead of crdb_internal.gossip_liveness, let's use crdb_internal.kv_node_liveness. It does a consistent read against KV as opposed to waiting on gossip propagations. I don't think the difference will really matter in practice, but the KV state is authoritative and the gossip state follows from it.

Changed to use this, though don't we receive the gossip liveness after the trigger on updating kv_node_liveness is fired? Which I think means we'd see the state transition to decommissioned only after it is received by gossip, no? (I don't think it matters too much either way, though.)


pkg/cmd/roachtest/tests/decommission.go line 1246 at r2 (raw file):

Previously, lidorcarmel (Lidor Carmel) wrote…

line too long?

Done.


pkg/cmd/roachtest/tests/decommissionbench.go line 69 at r2 (raw file):

Previously, lidorcarmel (Lidor Carmel) wrote…

looks like you have 16 cores because you're using 1000 warehouses, right? I think we can load a lot more warehouses on 3x16 cores. It would be nice to know the cpu usage roughly, to make sure we don't benchmark with a 90% idle machines, both because of waste and being more realistic.

also - I think all test cases can use the same number of warehouses and cores to reduce scenarios and simplify things, but that's definitely up to you so feel free to ignore me. (Unless you want a single huge decommission where you have a node with many TBs)

According to https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/tests/tpcc.go#L719 our estimated max warehouses on GCP 3x16cores is only 2400 warehouses - these tests (so far) are mostly to benchmark with moderate load, not overloaded, so I went with just 1000. If this doesn't seem right though and I should bump it up let me know! Maybe should have chosen KV as a more straightforward workload, but I figured the amount of data in TPCC made for a better fit.


pkg/cmd/roachtest/tests/decommissionbench.go line 144 at r2 (raw file):

Previously, AlexTalks (Alex Sarkesian) wrote…

I think tc is often roachtest shorthand for "test config", but I could write that out, sure.

Done.


pkg/cmd/roachtest/tests/decommissionbench.go line 162 at r2 (raw file):

Previously, lidorcarmel (Lidor Carmel) wrote…

long line?

Done.


pkg/cmd/roachtest/tests/decommissionbench.go line 210 at r2 (raw file):

Previously, AlexTalks (Alex Sarkesian) wrote…

Will do!

Done.


pkg/cmd/roachtest/tests/decommissionbench.go line 268 at r2 (raw file):

Previously, lidorcarmel (Lidor Carmel) wrote…

long line.

Done.

Contributor

@lidorcarmel left a comment

Reviewed 4 of 4 files at r3, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @AlexTalks, @irfansharif, and @lidorcarmel)


pkg/cmd/roachtest/tests/decommission.go line 1264 at r2 (raw file):

Previously, AlexTalks (Alex Sarkesian) wrote…

Based on this documentation it sounds like even that is not strictly required, though it is a common pattern of resource cleanup in our roachtests.

If anything, it would be more performant to create a single db handle through the cluster.Conn() method and then use it as a connection pool, grabbing connections with conn := db.Conn(ctx) and closing them with defer conn.Close(), but I don't really see that pattern in use in roachtests and in this case in particular, given that we are using connections to multiple nodes and we are simulating a user running cockroach node decommission via a CLI, it doesn't hurt to create new connections each time (the overhead is fairly minimal).

agree it's ok to recreate connections in a test, no need to reuse.
we do have this pattern, but I think we use the defer db.Close() more often, and it's only 33% of the lines!


pkg/cmd/roachtest/tests/decommission.go line 1316 at r2 (raw file):

Previously, AlexTalks (Alex Sarkesian) wrote…

Yes - I actually tripped on this initially, and unfortunately it only becomes more clear in the following PR with repeated decommissions. It is because if you have 4 VMs running Cockroach, initially they will get node IDs 1-4. When you decommission node 3, that ID 3 will never be reused... assuming you wipe the VM and start Cockroach anew, the Cockroach process on VM 3 will now have node ID 5. I attempted to detangle this by calling the ID of the VM "nodeID" and the ID of the Cockroach process on that VM the "logicalNodeID".

I see, thanks.
I think we discussed it a bit - if you cannot spin a new VM, how can you test something like decommissioning and adding a node at the same time?
If you will need to spin a new VM anyway then maybe never reuse VMs and only have one set of ids? (similar to how clusters work in prod).

If you want to proceed with this then probably you need a better comment, and maybe a TODO if you plan to get rid of this in the future.


pkg/cmd/roachtest/tests/decommissionbench.go line 9 at r2 (raw file):

Previously, AlexTalks (Alex Sarkesian) wrote…

I would actually prefer that, though the other benchmark roachtests seems to be of the form tpccbench.go, kvbench.go!

I still vote for the _ but up to you..


pkg/cmd/roachtest/tests/decommissionbench.go line 43 at r2 (raw file):

Previously, AlexTalks (Alex Sarkesian) wrote…

Yes - can add a comment.

forgot maybe?


pkg/cmd/roachtest/tests/decommissionbench.go line 54 at r2 (raw file):

Previously, AlexTalks (Alex Sarkesian) wrote…

Mostly useful for quick experiments. I suppose I can comment this out though...

Or I wonder if I can make it subject to a environment variable that has a default?

I'd avoid env vars..


pkg/cmd/roachtest/tests/decommissionbench.go line 69 at r2 (raw file):

Previously, AlexTalks (Alex Sarkesian) wrote…

According to https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/tests/tpcc.go#L719 our estimated max warehouses on GCP 3x16cores is only 2400 warehouses - these tests (so far) are mostly to benchmark with moderate load, not overloaded, so I went with just 1000. If this doesn't seem right though and I should bump it up let me know! Maybe should have chosen KV as a more straightforward workload, but I figured the amount of data in TPCC made for a better fit.

good question about tpcc vs kv load.. I guess we can change later if needed. And we can also always change the load, so 1000 sounds like a good start, thanks!


pkg/cmd/roachtest/tests/decommissionbench.go line 296 at r2 (raw file):

Previously, AlexTalks (Alex Sarkesian) wrote…

It shouldn't be - and if it is, it would be an error. This is because h.decommission(..) a few lines above runs cockroach node decommission --wait=all, and as such it should not return without transitioning the liveness status of the node to decommissioned. So I think this check is ok - waitReplicatedAwayFrom() is really just a sanity check.

sg, thanks.


pkg/cmd/roachtest/tests/decommissionbench.go line 174 at r3 (raw file):

		defer db.Close()

		if benchSpec.snapshotRate != 0 {

looks like you're not using snapshotRate in the test cases, intentional? if not using it then maybe remove?


pkg/cmd/roachtest/tests/decommissionbench.go line 265 at r3 (raw file):

	}

	// Get the workload perf artifacts and move them to the pinned node, so that

is the term pinned node still valid after your recent update? here and maybe other places (a few lines above for example).

Code quote:

pinned node

pkg/cmd/roachtest/tests/decommissionbench.go line 330 at r3 (raw file):

		h.t.Status("waiting for cluster balance")
		if err := waitForRebalance(ctx, h.t.L(), dbNode, float64(totalRanges)/3.0); err != nil {

why total/3? pretty large stddev?

Code quote:

float64(totalRanges)/3.0

@AlexTalks force-pushed the benchmark_decomm branch 4 times, most recently from 65f81f4 to 7b1cab0 on June 10, 2022 05:28
Contributor Author

@AlexTalks left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @AlexTalks, @irfansharif, and @lidorcarmel)


pkg/cmd/roachtest/tests/decommission.go line 1264 at r2 (raw file):

Previously, lidorcarmel (Lidor Carmel) wrote…

agree it's ok to recreate connections in a test, no need to reuse.
we do have this patterns, but I think we use the defer db.Close() more often, and it's only 33% of the lines!

Done.


pkg/cmd/roachtest/tests/decommission.go line 1316 at r2 (raw file):

Previously, lidorcarmel (Lidor Carmel) wrote…

I see, thanks.
I think we discussed it a bit - if you cannot spin a new VM, how can you test something like decommissioning and adding a node at the same time?
If you will need to spin a new VM anyway then maybe never reuse VMs and only have one set of ids? (similar to how clusters work in prod).

If you want to proceed with this then probably you need a better comment, and maybe a TODO if you plan to get rid of this in the future.

If we want to do simultaneous decommissioning and adding a node, we just have to increase our allocated Roachprod nodes by 1 and start the extra node later, then immediately decommission our target node (i.e. before the new node reaches range-count balance). Can add a TODO about this though.

Also I think the code this comment is attached to isn't actually used by this new test anymore and is only for existing tests, heh!


pkg/cmd/roachtest/tests/decommissionbench.go line 43 at r2 (raw file):

Previously, lidorcarmel (Lidor Carmel) wrote…

forgot maybe?

I did, thanks!


pkg/cmd/roachtest/tests/decommissionbench.go line 54 at r2 (raw file):

Previously, lidorcarmel (Lidor Carmel) wrote…

I'd avoid env vars..

Done.


pkg/cmd/roachtest/tests/decommissionbench.go line 69 at r2 (raw file):

Previously, lidorcarmel (Lidor Carmel) wrote…

good question about tpcc vs kv load.. I guess we can change later if needed. And we can also always change the load, so 1000 sounds like a good start, thanks!

Done.


pkg/cmd/roachtest/tests/decommissionbench.go line 85 at r2 (raw file):

Previously, lidorcarmel (Lidor Carmel) wrote…

I'd consider admission off as a rare thing and have 0 or 1 test cases for it, but definitely feel free to ignore me.. (also it depends how heavy/slow these test cases are).

Done.


pkg/cmd/roachtest/tests/decommissionbench.go line 296 at r2 (raw file):

Previously, lidorcarmel (Lidor Carmel) wrote…

sg, thanks.

Done.


pkg/cmd/roachtest/tests/decommissionbench.go line 305 at r2 (raw file):

Previously, irfansharif (irfan sharif) wrote…

Let's add+litigate this bit in the subsequent PR: #81874.

Done.


pkg/cmd/roachtest/tests/decommissionbench.go line 174 at r3 (raw file):

Previously, lidorcarmel (Lidor Carmel) wrote…

looks like you're not using snapshotRate in the test cases, intentional? if not using it then maybe remove?

I've been mostly using it for manual testing right now but may add some configs that use it if we need to soon...


pkg/cmd/roachtest/tests/decommissionbench.go line 265 at r3 (raw file):

Previously, lidorcarmel (Lidor Carmel) wrote…

is the term pinned node still valid after your recent update? here and maybe other places (a few lines above for example).

Yeah, I think it is still valid - it's basically the node we run our metadata queries and such against for the most part.


pkg/cmd/roachtest/tests/decommissionbench.go line 330 at r3 (raw file):

Previously, lidorcarmel (Lidor Carmel) wrote…

why total/3? pretty large stddev?

It is - if I should make it tighter let me know, but for now I mainly just wanted something that I knew wouldn't error - the main thing necessary was waiting for the rebalance operations to stop. I figure the allocator test is the right place for the stddev bounds checking rather than here.

Contributor

@irfansharif left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @AlexTalks, @irfansharif, and @lidorcarmel)


pkg/cmd/roachtest/tests/decommissionbench.go line 39 at r2 (raw file):

Previously, AlexTalks (Alex Sarkesian) wrote…

I like the AC enabled/disabled knob, and the snapshot rate knob is also useful for experiments, but as for the variance in the number of CPUs and warehouses, they could probably be reduced.

Did you mean to reduce the variance in CPUs+warehouses? For the snapshot rate knob, if our intent is to use it during experiments, I wonder if we should just edit the code when running them instead of checking this in (we'd have to do this anyway with the current changes). As it stands, the unused knob is a bit confusing to readers. I also noticed that load is set to true for all of the variants. Do we want to get rid of it, or try a variant with/without load?

I want to press you slightly on whether it's a useful benchmark setup to run with AC disabled -- have we observed different results based on it? I ask all these because each variant is costly to run nightly and introduces more possibility for flakes to debug/diagnose. In my view just the following set (two variants) will be sufficient. We can always add more if we learn they provide additional value, or our replica movement considers different states differently.

nodes=4, cpu=16, warehouses=1000, whileDown={true,false}, load=true
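
Spelled out against the spec struct quoted earlier in this review, that reduced set would be roughly the following (field names from r2; the registration helper name is a placeholder, not the PR's actual function):

```go
// The two nightly variants suggested above, expressed with the
// decommissionBenchSpec fields quoted earlier. registerDecommissionBench is
// a stand-in for whatever registration helper the test file actually uses.
for _, benchSpec := range []decommissionBenchSpec{
	{nodes: 4, cpus: 16, warehouses: 1000, load: true, admissionControl: true, whileDown: false},
	{nodes: 4, cpus: 16, warehouses: 1000, load: true, admissionControl: true, whileDown: true},
} {
	registerDecommissionBench(r, benchSpec)
}
```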

Contributor

@lidorcarmel left a comment

Reviewed 1 of 1 files at r5, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @AlexTalks, @irfansharif, and @lidorcarmel)


pkg/cmd/roachtest/tests/decommission.go line 1316 at r2 (raw file):

Previously, AlexTalks (Alex Sarkesian) wrote…

If we want to do simultaneous decommissioning and adding a node we just have to increase our allocated Roachprod nodes by 1, and then just start it later - then immediately decommission our target node (I.e. before the new node reaches range count balance). Can add a TODO about this though.

Also I think the code this comment is attached to isn't actually used by this new test anymore and is only for existing tests, heh!

  1. can you do the same strategy with repeated decommissions? say you want to add and decommission 10 nodes to a 4 node cluster, just create a 14 node cluster.

  2. the line here is new in this PR, right? this PR introduces the notion of 2 different node ids. if you have to do that - sure, but explain it well somewhere.

  3. I definitely see why you want this but the direction you're taking means you'll sometimes reuse an existing VM and then there will be 2 ids (hard to debug, someone will fall there for sure), and sometimes you will add an extra VM ahead and not reuse ids. picking one approach will simplify things IMO.

  4. one thing I'm not sure I understand - do we really care about the vm id? can we use the node id only? so that things will look the same whether we reuse an existing node or get a fresh one.

maybe a quick chat would be easier.


pkg/cmd/roachtest/tests/decommissionbench.go line 174 at r3 (raw file):

Previously, AlexTalks (Alex Sarkesian) wrote…

I've been mostly using it for manual testing right now but may add some configs that use it if we need to soon...

add it now? just reduce the snapshot rate for one of the configs.

I don't like code that doesn't run under any test. Unfortunately we don't have unit tests for roachtests, so if someone breaks the way we use snapshotRate here, nothing will fail until we run this manually.


pkg/cmd/roachtest/tests/decommissionbench.go line 265 at r3 (raw file):

Previously, AlexTalks (Alex Sarkesian) wrote…

Yeah, I think it is still valid - it's basically the node we run our metadata queries and such against for the most part.

ah, thanks, my misunderstanding - I thought this is the node with the pinned replicas.

so I guess we call it "pinned" because we don't decommission it? can be operator node but fine, maybe I'm just still thinking about pinning replicas, up to you..


pkg/cmd/roachtest/tests/decommissionbench.go line 330 at r3 (raw file):

Previously, AlexTalks (Alex Sarkesian) wrote…

It is - if I should make it tighter let me know, but for now I mainly just wanted something that I knew wouldn't error - the main thing necessary was waiting for the rebalance operations to stop. I figure the allocator test is the right place for the stddev bounds checking rather than here.

I was thinking it should be a function of the ranges per node, not total. imagine we have 100 nodes, then the stddev is so high that we immediately call success (but maybe we also wait for no rebalancing? anyway, it means the stddev will not be used).
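
Concretely, the suggestion amounts to something like this sketch, where numNodes and the 0.5 factor are illustrative and waitForRebalance is the helper quoted above:

```go
// Scale the tolerated stddev of per-store range counts with the mean ranges
// per node instead of the cluster-wide total, so the bound stays meaningful
// as node count grows. Sketch only.
meanRangesPerNode := float64(totalRanges) / float64(numNodes)
maxStdDev := 0.5 * meanRangesPerNode
if err := waitForRebalance(ctx, h.t.L(), dbNode, maxStdDev); err != nil {
	return err
}
```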

@blathers-crl blathers-crl bot requested a review from irfansharif June 10, 2022 22:00
@AlexTalks
Contributor Author

bors r+

@craig
Contributor

craig bot commented Jun 11, 2022

Build succeeded:

@craig craig bot merged commit f2cc0b3 into cockroachdb:master Jun 11, 2022
craig bot pushed a commit that referenced this pull request Jul 26, 2022
81874: roachtest: benchmark repeated decommissions over time r=AlexTalks a=AlexTalks

This adds new long-running configurations to the `decommissionBench`
roachtest, so that repeated decommissions over time in a cluster can be
benchmarked individually.

With these long-running tests, we've seen the following averages:
```
decommissionBench/nodes=4/cpu=16/warehouses=1000/duration=1h0m0s:
avg decommission: 16m41s, avg upreplication: 33m17s

decommissionBench/nodes=4/cpu=4/warehouses=100:
avg decommission: 2m44s, avg upreplication: 4m3s
```

Depends on #81565.

Release Note: None

84869: sql: don't embed NodeInfo into ExecutorConfig r=andreimatei a=andreimatei

This patch turns an embedded struct into a field. With the NodeInfo
being embedded, it was hard to grep for all the places that uses its
member fields, and I found myself wanting to know that.

Release note: None

Co-authored-by: Alex Sarkesian <[email protected]>
Co-authored-by: Andrei Matei <[email protected]>