roachtest: rebalance/by-load/replicas/mixed-version failed #107247
Comments
I'll take a look at this later today. We've made some improvements recently to reduce noise on this test – so I'm treating this failure as legit.
This is a mixed version roachtest, so the rebalancing objective is QPS. I looked at the heap profiles captured in the artifacts, and they don't look too crazy. It's odd, though: the average key size read (at batch response level) is different between nodes, while the average keys read per second is the same. s1 was servicing more read bytes in kv batches than the other stores - but around the same keys. It's a KV workload with an unbounded rate and fixed concurrency. The difference in read bytes, despite balanced leaseholders and balanced read keys, indicates that the CPU could be coming from outside the workload. I guess I'll give this test a run and see if it repros the behavior. Also removing the release blocker, this wasn't so far away from the 15% target (20.8%) and the QPS/leases were balanced.
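To make the per-store comparison above concrete, here is a small hypothetical sketch (made-up store names and rates, not values from the artifacts) of the derived metric being discussed: read bytes per second divided by keys read per second for each store. Equal keys/sec with unequal bytes/sec points to larger reads on s1, which a uniform fixed-size KV workload would not explain on its own.

```go
// Hypothetical illustration of the reasoning above: derive average bytes
// per key read for each store from two rates. The store names and numbers
// are made up; they are not taken from the test artifacts.
package main

import "fmt"

func main() {
	readBytesPerSec := map[string]float64{"s1": 30e6, "s2": 20e6, "s3": 20e6}
	keysReadPerSec := map[string]float64{"s1": 10e3, "s2": 10e3, "s3": 10e3}

	for store, bytesRate := range readBytesPerSec {
		perKey := bytesRate / keysReadPerSec[store]
		fmt.Printf("%s: %.0f bytes/key\n", store, perKey)
	}
	// Equal keys/sec but unequal bytes/sec means s1 is serving larger
	// reads, so the extra CPU may not be coming from the workload itself.
}
```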
This failed after 2 runs, with slightly different behavior. 3/4 nodes running 23.1 had high CPU, whilst the other 3 did not. It also looks like this test now runs with 6 nodes, rather than the intended 7? Probably something to address separately -- we want it to work for most cluster sizes.
I'm testing this on an earlier commit. I'm suspecting something got backported, then fixed. In either case, there's additional CPU usage on the failing commit which I'm not finding elsewhere.
I revisited the logs and found something which I think explains the failure:

09:36:54 rebalance_load.go:130: starting load generator
#
# s1 store rebalancer attempt
I230720 09:37:27.398340 215 13@kv/kvserver/store_rebalancer.go:451 ⋮ [T1,n1,s1,store-rebalancer,obj=‹qps›] 5168 considering load-based lease transfers for s1 with ‹(queries-per-second=1673.9 cpu-per-second=204ms)› load, mean=‹(queries-per-second=1274.1 cpu-per-second=34ms)›, upperThreshold=‹(queries-per-second=1401.5 cpu-per-second=34ms)›
I230720 09:37:27.410553 215 13@kv/kvserver/store_rebalancer.go:564 ⋮ [T1,n1,s1,store-rebalancer,obj=‹qps›] 5185 load-based lease transfers successfully brought s1 down to ‹(queries-per-second=1360.5 cpu-per-second=177ms)› load, mean=‹(queries-per-second=1274.1 cpu-per-second=34ms)›, upperThreshold=‹(queries-per-second=1401.5 cpu-per-second=34ms)›)
#
# s1 store rebalancer attempt
I230720 09:38:30.126920 215 13@kv/kvserver/store_rebalancer.go:451 ⋮ [T1,n1,s1,store-rebalancer,obj=‹qps›] 5289 considering load-based lease transfers for s1 with ‹(queries-per-second=2958.1 cpu-per-second=364ms)› load, mean=‹(queries-per-second=2290.7 cpu-per-second=61ms)›, upperThreshold=‹(queries-per-second=2519.7 cpu-per-second=61ms)›
I230720 09:38:30.151003 215 13@kv/kvserver/store_rebalancer.go:564 ⋮ [T1,n1,s1,store-rebalancer,obj=‹qps›] 5304 load-based lease transfers successfully brought s1 down to ‹(queries-per-second=2277.3 cpu-per-second=302ms)› load, mean=‹(queries-per-second=2232.9 cpu-per-second=55ms)›, upperThreshold=‹(queries-per-second=2456.2 cpu-per-second=55ms)›)
#
# s1 store rebalancer attempt
I230720 09:39:40.747823 215 13@kv/kvserver/store_rebalancer.go:442 ⋮ [T1,n1,s1,store-rebalancer,obj=‹qps›] 5405 local load ‹(queries-per-second=3052.2 cpu-per-second=396ms)› is below max threshold ‹(queries-per-second=3081.3 cpu-per-second=66ms)› mean=‹(queries-per-second=2801.1 cpu-per-second=66ms)›; no rebalancing needed
#
# s1 store rebalancer attempt
I230720 09:40:55.553639 215 13@kv/kvserver/store_rebalancer.go:451 ⋮ [T1,n1,s1,store-rebalancer,obj=‹qps›] 5508 considering load-based lease transfers for s1 with ‹(queries-per-second=3580.4 cpu-per-second=452ms)› load, mean=‹(queries-per-second=3047.3 cpu-per-second=75ms)›, upperThreshold=‹(queries-per-second=3352.0 cpu-per-second=75ms)›
I230720 09:40:55.561197 215 13@kv/kvserver/store_rebalancer.go:564 ⋮ [T1,n1,s1,store-rebalancer,obj=‹qps›] 5519 load-based lease transfers successfully brought s1 down to ‹(queries-per-second=3068.0 cpu-per-second=405ms)› load, mean=‹(queries-per-second=3047.3 cpu-per-second=75ms)›, upperThreshold=‹(queries-per-second=3352.0 cpu-per-second=75ms)›)
#
# s1 store rebalancer attempt
I230720 09:41:54.142812 215 13@kv/kvserver/store_rebalancer.go:451 ⋮ [T1,n1,s1,store-rebalancer,obj=‹qps›] 5607 considering load-based lease transfers for s1 with ‹(queries-per-second=3952.0 cpu-per-second=502ms)› load, mean=‹(queries-per-second=3161.8 cpu-per-second=84ms)›, upperThreshold=‹(queries-per-second=3478.0 cpu-per-second=84ms)›
I230720 09:41:54.154695 215 13@kv/kvserver/store_rebalancer.go:564 ⋮ [T1,n1,s1,store-rebalancer,obj=‹qps›] 5615 load-based lease transfers successfully brought s1 down to ‹(queries-per-second=3258.1 cpu-per-second=433ms)› load, mean=‹(queries-per-second=3161.8 cpu-per-second=84ms)›, upperThreshold=‹(queries-per-second=3478.0 cpu-per-second=84ms)›)
#
# 5 minute timeout hits, test fails.
09:41:55 test_impl.go:345: test failure #1: full stack retained in failure_1.log: (rebalance_load.go:180).func1: CPU not evenly balanced after timeout: outside bounds mean=78.9 tolerance=15.0% (±11.8) bounds=[67.0, 90.7]
below = []
within = [s2: 72 (-8.5%), s3: 76 (-3.2%), s4: 77 (-1.4%), s5: 74 (-5.1%), s6: 76 (-2.5%)]
above = [s1: 95 (+20.8%)]

If the test had run for another 30 seconds, the load would have been balanced. It is surprising that it takes 5 rebalancing cycles to get there, although I suppose it is very possible with initial racing/thrashing. I didn't confirm that was happening here, however.
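For reference, a minimal sketch of the balance check the failure message describes, written here with made-up helper names rather than the actual roachtest code: compute the mean per-store CPU, derive bounds of mean*(1 ± tolerance), and classify each store against them. With the reported mean of 78.9 and a 15% tolerance, the band is roughly ±11.8, i.e. about [67.1, 90.7], and s1 at 95 lands above it.

```go
// A minimal sketch (not the actual roachtest code) of the balance check
// in the failure message: derive bounds of mean*(1 ± tolerance) from the
// per-store CPU utilization and classify each store against them.
package main

import "fmt"

func classify(cpus map[string]float64, tol float64) (below, within, above []string) {
	var sum float64
	for _, v := range cpus {
		sum += v
	}
	mean := sum / float64(len(cpus))
	lo, hi := mean*(1-tol), mean*(1+tol)
	fmt.Printf("mean=%.1f bounds=[%.1f, %.1f]\n", mean, lo, hi)
	for store, v := range cpus {
		switch {
		case v < lo:
			below = append(below, store)
		case v > hi:
			above = append(above, store)
		default:
			within = append(within, store)
		}
	}
	return below, within, above
}

func main() {
	// Approximate (rounded) per-store CPU values from the failure above.
	cpus := map[string]float64{"s1": 95, "s2": 72, "s3": 76, "s4": 77, "s5": 74, "s6": 76}
	below, within, above := classify(cpus, 0.15)
	fmt.Println("below:", below, "within:", within, "above:", above)
	// s1 at roughly +21% over the mean lands outside the ±15% band, so
	// the check fails.
}
```

As an aside, the upperThreshold values in the store rebalancer log lines above appear to sit roughly 10% above the reported mean (for example 1401.5 against a 1274.1 mean), so the rebalancer's own trigger band looks tighter than the test's ±15% tolerance.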
I suspect there might be some quick improvements mentioned in #104292. Ideally, it should only take 2 rebalance cycles after beginning the workload to balance. From stressing the test, this appears to be the case the majority of the time, but not always.
The `rebalance/by-load` roachtests collect timeseries data from the cluster. If these tests started quickly enough, no datapoints would exist yet, which would cause an out of bounds array error. Handle the no datapoints case, and log a message without failing the test. Informs: cockroachdb#107247 Release note: None
The `rebalance/by-load` roachtests would immediately pass once the balance target was hit. However, it was possible that the cluster was only transiently balanced. Bump the timeout of all tests to be uniformly 10 minutes, from the previous 5-10 minutes, and require that the load remains balanced for at least 1 minute to pass. Informs: cockroachdb#107247 Release note: None
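A minimal sketch of what the two behaviours described in these commit messages could look like, using hypothetical types and helper names rather than the actual roachtest code: treat a poll with no timeseries datapoints as not-yet-balanced and log it instead of indexing out of bounds, and only declare success once the cluster has stayed balanced for the full stability window.

```go
package main

import (
	"fmt"
	"time"
)

// sample is one polled observation of per-store CPU utilization.
// These types and values are hypothetical, not the roachtest's own.
type sample struct {
	ts   time.Time
	cpus []float64
}

// isBalanced reports whether every store is within tol of the mean.
// An empty slice models the "no datapoints yet" case: log and treat the
// cluster as not-yet-balanced instead of indexing out of bounds.
func isBalanced(cpus []float64, tol float64) bool {
	if len(cpus) == 0 {
		fmt.Println("no timeseries datapoints yet; skipping this check")
		return false
	}
	var sum float64
	for _, v := range cpus {
		sum += v
	}
	mean := sum / float64(len(cpus))
	for _, v := range cpus {
		if v < mean*(1-tol) || v > mean*(1+tol) {
			return false
		}
	}
	return true
}

func main() {
	const tol = 0.15
	const stableFor = time.Minute // the 1 minute stability requirement
	var balancedSince time.Time

	start := time.Now()
	samples := []sample{
		{ts: start, cpus: nil}, // polled before any datapoints exist
		{ts: start.Add(10 * time.Second), cpus: []float64{95, 72, 76}},
		{ts: start.Add(80 * time.Second), cpus: []float64{78, 77, 79}},
		{ts: start.Add(150 * time.Second), cpus: []float64{79, 78, 80}},
	}
	for _, s := range samples {
		if !isBalanced(s.cpus, tol) {
			balancedSince = time.Time{} // any imbalance resets the window
			continue
		}
		if balancedSince.IsZero() {
			balancedSince = s.ts
		}
		if s.ts.Sub(balancedSince) >= stableFor {
			fmt.Println("balanced and stable for", stableFor, "- test passes")
			return
		}
	}
	fmt.Println("timed out without a sustained balanced window")
}
```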
107279: sql: remove isInternalPlanner boolean r=yuzefovich a=yuzefovich

This boolean is very confusing and can mislead people, so let's remove it. It was introduced in bf9ffe9 with one specific use case in mind: lease management for user defined types when the execution flow occurs in the context of a user transaction. In that scenario, the execution engine uses the planner's `descs.Collection` when running the local plan. However, I don't see the reason why we cannot do the same thing for queries outside of the user transaction (where we have the "internal planner" object), so this commit removes that extra conditional. That said, I don't quite understand how these things work, but we can always go back to not using the planner's collection for "internal planners" (we'd name the boolean differently though) if we see any fallout.

Epic: None
Release note: None

107834: roachtest: require stability in rebalance by load roachtests r=andrewbaptist a=kvoli

The `rebalance/by-load` roachtests collect timeseries data from the cluster. If these tests started quickly enough, no datapoints would exist yet, which would cause an out of bounds array error. Handle the no datapoints case, and log a message without failing the test.

Informs: #107247

----

The `rebalance/by-load` roachtests would immediately pass once the balance target was hit. However, it was possible that the cluster was only transiently balanced. Bump the timeout of all tests to be uniformly 10 minutes, from the previous 5 minutes, and require that the load remains balanced for at least 1 minute to pass.

Informs: #107247

Release note: None

107892: sql: fix formatting of import, backup and create tenant r=chengxiong-ruan a=chengxiong-ruan

Informs: #99185

This commit cherry-picks changes from #107723 and adds fixes for formatting `EXPORT` and `CHANGEFEED` options.

Release note: None

107955: roachtest: increase the concurrency for gracefuldraining r=kvoli a=andrewbaptist

This set of commits changes gracefuldraining to use the `measureQPS` method instead of looking at the stats from the stats database, and additionally increases the concurrency so it no longer fails.

Co-authored-by: Yahor Yuzefovich <[email protected]>
Co-authored-by: Austen McClernon <[email protected]>
Co-authored-by: Chengxiong Ruan <[email protected]>
Co-authored-by: Andrew Baptist <[email protected]>
With #108000 merged, any new failure will be on a sufficiently different test. Closing.
roachtest.rebalance/by-load/replicas/mixed-version failed with artifacts on release-23.1 @ d62a2f3ab1dff86bd80cac037b215121537a2a9a:
Parameters:
ROACHTEST_arch=amd64
ROACHTEST_cloud=gce
ROACHTEST_cpu=4
ROACHTEST_encrypted=false
ROACHTEST_ssd=0
Help
See: roachtest README
See: How To Investigate (internal)
Jira issue: CRDB-29962