roachtest: rebalance/by-load/replicas/mixed-version failed #107247

cockroach-teamcity · 2023-07-20T09:42:55Z

roachtest.rebalance/by-load/replicas/mixed-version failed with artifacts on release-23.1 @ d62a2f3ab1dff86bd80cac037b215121537a2a9a:

(rebalance_load.go:180).func1: CPU not evenly balanced after timeout: outside bounds mean=78.9 tolerance=15.0% (±11.8) bounds=[67.0, 90.7]
	below  = []
	within = [s2: 72 (-8.5%), s3: 76 (-3.2%), s4: 77 (-1.4%), s5: 74 (-5.1%), s6: 76 (-2.5%)]
	above  = [s1: 95 (+20.8%)]
test artifacts and logs in: /artifacts/rebalance/by-load/replicas/mixed-version/run_1

Parameters: ROACHTEST_arch=amd64 , ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_encrypted=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

/cc @cockroachdb/kv-triage _{This test on roachdash | Improve this report!

Jira issue: CRDB-29962}

The text was updated successfully, but these errors were encountered:

kvoli · 2023-07-20T16:52:42Z

I'll take a look at this later today. We've made some improvements recently to reduce noise on this test – so I'm treating this failure as legit.

kvoli · 2023-07-20T19:31:28Z

This is a mixed version roachtest so the rebalancing objective is qps. The QPS is balanced, but CPU utilization is not:

I looked at the heap profiles captured in the artifacts, and they don't look too crazy. Its odd though, the average key size read (at batch response level) is different between nodes, while the average keys read per second is the same. s1 was servicing more read bytes in kv batches than the other stores - but around the same keys.

Its a KV workload with an unbounded rate, fixed concurrency. The different in read bytes, despite balanced leaseholders + balanced read keys, indicates that the CPU could be coming from outside the workload.

I guess I'll give this test a run and see if it repros the behavior.

Also removing the release blocker, this wasn't so far away from 15% target (20.8%) and the QPS/leases were balanced.

kvoli · 2023-07-21T15:03:10Z

This failed after 2 runs, with slightly different behavior. 3/4 nodes running 23.1 had high CPU, whilst the other 3 did not.

It also looks like this test now runs with 6 nodes, rather than the intended 7? Probably something to address separately -- we want it to work for most cluster sizes.

kvoli · 2023-07-27T20:48:32Z

I'm testing this on an earlier release-23.1 commit 5a53114 and haven't seen any failures (0/10). There have also been no failures for about a week.

I'm suspecting something got backported, then fixed. In either case, there's additional CPU usage on the failing commit which I'm not finding elsewhere.

kvoli · 2023-07-28T15:22:05Z

0/15 failures on d62a2f3 - the SHA the test failure happened on
0/15 failures on 7f8e0a9 - recent SHA on release 23.1

hmmm I'm now wondering how I got it to repro so quickly last week.

kvoli · 2023-07-28T20:51:13Z

I revisited the logs and found something which I think explains the failure:

09:36:54 rebalance_load.go:130: starting load generator
#
# s1 store rebalancer attempt 
I230720 09:37:27.398340 215 13@kv/kvserver/store_rebalancer.go:451 ⋮ [T1,n1,s1,store-rebalancer,obj=‹qps›] 5168  considering load-based lease transfers for s1 with ‹(queries-per-second=1673.9 cpu-per-second=204ms)› load, mean=‹(queries-per-second=1274.1 cpu-per-second=34ms)›, upperThreshold=‹(queries-per-second=1401.5 cpu-per-second=34ms)›
I230720 09:37:27.410553 215 13@kv/kvserver/store_rebalancer.go:564 ⋮ [T1,n1,s1,store-rebalancer,obj=‹qps›] 5185  load-based lease transfers successfully brought s1 down to ‹(queries-per-second=1360.5 cpu-per-second=177ms)› load, mean=‹(queries-per-second=1274.1 cpu-per-second=34ms)›, upperThreshold=‹(queries-per-second=1401.5 cpu-per-second=34ms)›)
#
# s1 store rebalancer attempt 
I230720 09:38:30.126920 215 13@kv/kvserver/store_rebalancer.go:451 ⋮ [T1,n1,s1,store-rebalancer,obj=‹qps›] 5289  considering load-based lease transfers for s1 with ‹(queries-per-second=2958.1 cpu-per-second=364ms)› load, mean=‹(queries-per-second=2290.7 cpu-per-second=61ms)›, upperThreshold=‹(queries-per-second=2519.7 cpu-per-second=61ms)›
I230720 09:38:30.151003 215 13@kv/kvserver/store_rebalancer.go:564 ⋮ [T1,n1,s1,store-rebalancer,obj=‹qps›] 5304  load-based lease transfers successfully brought s1 down to ‹(queries-per-second=2277.3 cpu-per-second=302ms)› load, mean=‹(queries-per-second=2232.9 cpu-per-second=55ms)›, upperThreshold=‹(queries-per-second=2456.2 cpu-per-second=55ms)›)
#
# s1 store rebalancer attempt 
I230720 09:39:40.747823 215 13@kv/kvserver/store_rebalancer.go:442 ⋮ [T1,n1,s1,store-rebalancer,obj=‹qps›] 5405  local load ‹(queries-per-second=3052.2 cpu-per-second=396ms)› is below max threshold ‹(queries-per-second=3081.3 cpu-per-second=66ms)› mean=‹(queries-per-second=2801.1 cpu-per-second=66ms)›; no rebalancing needed
#
# s1 store rebalancer attempt
I230720 09:40:55.553639 215 13@kv/kvserver/store_rebalancer.go:451 ⋮ [T1,n1,s1,store-rebalancer,obj=‹qps›] 5508  considering load-based lease transfers for s1 with ‹(queries-per-second=3580.4 cpu-per-second=452ms)› load, mean=‹(queries-per-second=3047.3 cpu-per-second=75ms)›, upperThreshold=‹(queries-per-second=3352.0 cpu-per-second=75ms)›
I230720 09:40:55.561197 215 13@kv/kvserver/store_rebalancer.go:564 ⋮ [T1,n1,s1,store-rebalancer,obj=‹qps›] 5519  load-based lease transfers successfully brought s1 down to ‹(queries-per-second=3068.0 cpu-per-second=405ms)› load, mean=‹(queries-per-second=3047.3 cpu-per-second=75ms)›, upperThreshold=‹(queries-per-second=3352.0 cpu-per-second=75ms)›)
#
# s1 store rebalancer attempt
I230720 09:41:54.142812 215 13@kv/kvserver/store_rebalancer.go:451 ⋮ [T1,n1,s1,store-rebalancer,obj=‹qps›] 5607  considering load-based lease transfers for s1 with ‹(queries-per-second=3952.0 cpu-per-second=502ms)› load, mean=‹(queries-per-second=3161.8 cpu-per-second=84ms)›, upperThreshold=‹(queries-per-second=3478.0 cpu-per-second=84ms)›
I230720 09:41:54.154695 215 13@kv/kvserver/store_rebalancer.go:564 ⋮ [T1,n1,s1,store-rebalancer,obj=‹qps›] 5615  load-based lease transfers successfully brought s1 down to ‹(queries-per-second=3258.1 cpu-per-second=433ms)› load, mean=‹(queries-per-second=3161.8 cpu-per-second=84ms)›, upperThreshold=‹(queries-per-second=3478.0 cpu-per-second=84ms)›)
#
# 5 minute timeout hits, test fails.
09:41:55 test_impl.go:345: test failure #1: full stack retained in failure_1.log: (rebalance_load.go:180).func1: CPU not evenly balanced after timeout: outside bounds mean=78.9 tolerance=15.0% (±11.8) bounds=[67.0, 90.7]
	below  = []
	within = [s2: 72 (-8.5%), s3: 76 (-3.2%), s4: 77 (-1.4%), s5: 74 (-5.1%), s6: 76 (-2.5%)]
	above  = [s1: 95 (+20.8%)]

If the test had run for another 30 seconds, the load would have been balanced.

It is surprising that it takes 5 rebalancing cycles to get there, although I suppose it is very possible with initial racing/trashing. I didn't confirm that was happening here, however.

kvoli · 2023-07-28T21:41:03Z

I suspect there might be some quick improvements mentioned in #104292.

Ideally, it should only take 2 rebalance cycles after beginning the workload, to balance. From stressing the test, the majority of the time this appears the case. However, not always.

The `rebalance/by-load` roachtests collect timeseries data from the cluster. If these tests started quickly enough, no datapoints could exist which would cause an out of bounds array error. Handle the no datapoints case, and log a message without failing the test. Informs: cockroachdb#107247 Release note: None

The `rebalance/by-load` roachtests would immediately pass once the balance target was hit. However, it was possible that the cluster was only transiently balanced. Bump the timeout of all tests to be uniformly 10 minutes, from a previous 5-10 minutes, require that the load remains balanced for at least 1 minute to pass. Informs: cockroachdb#107247 Release note: None

The `rebalance/by-load` roachtests collect timeseries data from the cluster. If these tests started quickly enough, no datapoints could exist which would cause an out of bounds array error. Handle the no datapoints case, and log a message without failing the test. Informs: cockroachdb#107247 Release note: None

The `rebalance/by-load` roachtests would immediately pass once the balance target was hit. However, it was possible that the cluster was only transiently balanced. Bump the timeout of all tests to be uniformly 10 minutes, from a previous 5-10 minutes, require that the load remains balanced for at least 1 minute to pass. Informs: cockroachdb#107247 Release note: None

107279: sql: remove isInternalPlanner boolean r=yuzefovich a=yuzefovich This boolean is very confusing and can mislead people, so let's remove it. It was introduced in bf9ffe9 with one specific use case in mind: lease management for user defined types when the execution flow occurs in the context of a user transaction. In that scenario, the execution engine uses the planner's `descs.Collection` when running the local plan. However, I don't see the reason why we cannot do the same thing for queries outside of the user transaction (where we have the "internal planner" object), so this commit removes that extra conditional. That said, I don't quite understand how these things work, but we can always go back to not using the planner's collection for "internal planners" (we'd name the boolean differently though) if we see any fallout. Epic: None Release note: None 107834: roachtest: require stability in rebalance by load roachtests r=andrewbaptist a=kvoli The `rebalance/by-load` roachtests collect timeseries data from the cluster. If these tests started quickly enough, no datapoints could exist which would cause a out of bounds array exception. Handle the no datapoints case, and log a message without failing the test. Informs: #107247 ---- The `rebalance/by-load` roachtests would immediately pass once the balance target was hit. However, it was possible that the cluster was only transiently balanced. Bump the timeout of all tests to be uniformly 10 minutes, from a previous 5 minutes, require that the load remains balanced for at least 1 minute to pass. Informs: #107247 Release note: None 107892: sql: fix formatting of import, backup and create tenant r=chengxiong-ruan a=chengxiong-ruan Informs: #99185 This commit cherry-picked changes from #107723 and add fixes for formatting `EXPORT` and `CHANGEFEED` options. Release note: None 107955: roachtest: increase the concurrency for gracefuldraining r=kvoli a=andrewbaptist This set of commits changes gracefuldraining to use the `measureQPS` method instead of looking at the stats from the stats database and additionally increases the concurrency so it no longer fails. Co-authored-by: Yahor Yuzefovich <[email protected]> Co-authored-by: Austen McClernon <[email protected]> Co-authored-by: Chengxiong Ruan <[email protected]> Co-authored-by: Andrew Baptist <[email protected]>

The `rebalance/by-load` roachtests collect timeseries data from the cluster. If these tests started quickly enough, no datapoints could exist which would cause an out of bounds array error. Handle the no datapoints case, and log a message without failing the test. Informs: #107247 Release note: None

The `rebalance/by-load` roachtests would immediately pass once the balance target was hit. However, it was possible that the cluster was only transiently balanced. Bump the timeout of all tests to be uniformly 10 minutes, from a previous 5-10 minutes, require that the load remains balanced for at least 1 minute to pass. Informs: #107247 Release note: None

kvoli · 2023-08-02T20:06:38Z

With #108000 merged -- any new failure will be on a sufficiently different test. Closing.

cockroach-teamcity added this to the 23.1 milestone Jul 20, 2023

kvoli self-assigned this Jul 20, 2023

kvoli added A-kv-distribution Relating to rebalancing and leasing. and removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Jul 20, 2023

irfansharif added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Jul 28, 2023

kvoli mentioned this issue Jul 28, 2023

roachtest: require stability in rebalance by load roachtests #107834

Merged

blathers-crl bot mentioned this issue Aug 2, 2023

release-23.1: roachtest: require stability in rebalance by load roachtests #108000

Merged

kvoli closed this as completed Aug 2, 2023

kvoli linked a pull request Aug 2, 2023 that will close this issue

release-23.1: roachtest: require stability in rebalance by load roachtests #108000

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

roachtest: rebalance/by-load/replicas/mixed-version failed #107247

roachtest: rebalance/by-load/replicas/mixed-version failed #107247

cockroach-teamcity commented Jul 20, 2023 •

edited by cockroach-jira-scripts

Loading

kvoli commented Jul 20, 2023 •

edited

Loading

kvoli commented Jul 20, 2023 •

edited

Loading

kvoli commented Jul 21, 2023

kvoli commented Jul 27, 2023

kvoli commented Jul 28, 2023

kvoli commented Jul 28, 2023

kvoli commented Jul 28, 2023 •

edited

Loading

kvoli commented Aug 2, 2023

roachtest: rebalance/by-load/replicas/mixed-version failed #107247

roachtest: rebalance/by-load/replicas/mixed-version failed #107247

Comments

cockroach-teamcity commented Jul 20, 2023 • edited by cockroach-jira-scripts Loading

kvoli commented Jul 20, 2023 • edited Loading

kvoli commented Jul 20, 2023 • edited Loading

kvoli commented Jul 21, 2023

kvoli commented Jul 27, 2023

kvoli commented Jul 28, 2023

kvoli commented Jul 28, 2023

kvoli commented Jul 28, 2023 • edited Loading

kvoli commented Aug 2, 2023

cockroach-teamcity commented Jul 20, 2023 •

edited by cockroach-jira-scripts

Loading

kvoli commented Jul 20, 2023 •

edited

Loading

kvoli commented Jul 20, 2023 •

edited

Loading

kvoli commented Jul 28, 2023 •

edited

Loading