Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: rebalance/by-load/replicas/mixed-version failed #107247

Closed
cockroach-teamcity opened this issue Jul 20, 2023 · 8 comments · Fixed by #108000
Closed

roachtest: rebalance/by-load/replicas/mixed-version failed #107247

cockroach-teamcity opened this issue Jul 20, 2023 · 8 comments · Fixed by #108000
Assignees
Labels
A-kv-distribution Relating to rebalancing and leasing. branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. T-kv KV Team
Milestone

Comments

@cockroach-teamcity
Copy link
Member

cockroach-teamcity commented Jul 20, 2023

roachtest.rebalance/by-load/replicas/mixed-version failed with artifacts on release-23.1 @ d62a2f3ab1dff86bd80cac037b215121537a2a9a:

(rebalance_load.go:180).func1: CPU not evenly balanced after timeout: outside bounds mean=78.9 tolerance=15.0% (±11.8) bounds=[67.0, 90.7]
	below  = []
	within = [s2: 72 (-8.5%), s3: 76 (-3.2%), s4: 77 (-1.4%), s5: 74 (-5.1%), s6: 76 (-2.5%)]
	above  = [s1: 95 (+20.8%)]
test artifacts and logs in: /artifacts/rebalance/by-load/replicas/mixed-version/run_1

Parameters: ROACHTEST_arch=amd64 , ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_encrypted=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

/cc @cockroachdb/kv-triage

This test on roachdash | Improve this report!

Jira issue: CRDB-29962

@cockroach-teamcity cockroach-teamcity added branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-kv KV Team labels Jul 20, 2023
@cockroach-teamcity cockroach-teamcity added this to the 23.1 milestone Jul 20, 2023
@kvoli kvoli self-assigned this Jul 20, 2023
@kvoli
Copy link
Collaborator

kvoli commented Jul 20, 2023

I'll take a look at this later today. We've made some improvements recently to reduce noise on this test – so I'm treating this failure as legit.

@kvoli
Copy link
Collaborator

kvoli commented Jul 20, 2023

This is a mixed version roachtest so the rebalancing objective is qps. The QPS is balanced, but CPU utilization is not:

image

I looked at the heap profiles captured in the artifacts, and they don't look too crazy. Its odd though, the average key size read (at batch response level) is different between nodes, while the average keys read per second is the same. s1 was servicing more read bytes in kv batches than the other stores - but around the same keys.

Its a KV workload with an unbounded rate, fixed concurrency. The different in read bytes, despite balanced leaseholders + balanced read keys, indicates that the CPU could be coming from outside the workload.

image

I guess I'll give this test a run and see if it repros the behavior.

Also removing the release blocker, this wasn't so far away from 15% target (20.8%) and the QPS/leases were balanced.

@kvoli kvoli added A-kv-distribution Relating to rebalancing and leasing. and removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Jul 20, 2023
@kvoli
Copy link
Collaborator

kvoli commented Jul 21, 2023

This failed after 2 runs, with slightly different behavior. 3/4 nodes running 23.1 had high CPU, whilst the other 3 did not.

It also looks like this test now runs with 6 nodes, rather than the intended 7? Probably something to address separately -- we want it to work for most cluster sizes.

@kvoli
Copy link
Collaborator

kvoli commented Jul 27, 2023

I'm testing this on an earlier release-23.1 commit 5a53114 and haven't seen any failures (0/10). There have also been no failures for about a week.

I'm suspecting something got backported, then fixed. In either case, there's additional CPU usage on the failing commit which I'm not finding elsewhere.

@irfansharif irfansharif added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Jul 28, 2023
@kvoli
Copy link
Collaborator

kvoli commented Jul 28, 2023

0/15 failures on d62a2f3 - the SHA the test failure happened on
0/15 failures on 7f8e0a9 - recent SHA on release 23.1

hmmm I'm now wondering how I got it to repro so quickly last week.

@kvoli
Copy link
Collaborator

kvoli commented Jul 28, 2023

I revisited the logs and found something which I think explains the failure:

09:36:54 rebalance_load.go:130: starting load generator
#
# s1 store rebalancer attempt 
I230720 09:37:27.398340 215 13@kv/kvserver/store_rebalancer.go:451 ⋮ [T1,n1,s1,store-rebalancer,obj=qps›] 5168  considering load-based lease transfers for s1 with ‹(queries-per-second=1673.9 cpu-per-second=204ms)› load, mean=‹(queries-per-second=1274.1 cpu-per-second=34ms)›, upperThreshold=‹(queries-per-second=1401.5 cpu-per-second=34ms)›
I230720 09:37:27.410553 215 13@kv/kvserver/store_rebalancer.go:564 ⋮ [T1,n1,s1,store-rebalancer,obj=qps›] 5185  load-based lease transfers successfully brought s1 down to ‹(queries-per-second=1360.5 cpu-per-second=177ms)› load, mean=‹(queries-per-second=1274.1 cpu-per-second=34ms)›, upperThreshold=‹(queries-per-second=1401.5 cpu-per-second=34ms)›)
#
# s1 store rebalancer attempt 
I230720 09:38:30.126920 215 13@kv/kvserver/store_rebalancer.go:451 ⋮ [T1,n1,s1,store-rebalancer,obj=qps›] 5289  considering load-based lease transfers for s1 with ‹(queries-per-second=2958.1 cpu-per-second=364ms)› load, mean=‹(queries-per-second=2290.7 cpu-per-second=61ms)›, upperThreshold=‹(queries-per-second=2519.7 cpu-per-second=61ms)›
I230720 09:38:30.151003 215 13@kv/kvserver/store_rebalancer.go:564 ⋮ [T1,n1,s1,store-rebalancer,obj=qps›] 5304  load-based lease transfers successfully brought s1 down to ‹(queries-per-second=2277.3 cpu-per-second=302ms)› load, mean=‹(queries-per-second=2232.9 cpu-per-second=55ms)›, upperThreshold=‹(queries-per-second=2456.2 cpu-per-second=55ms)›)
#
# s1 store rebalancer attempt 
I230720 09:39:40.747823 215 13@kv/kvserver/store_rebalancer.go:442 ⋮ [T1,n1,s1,store-rebalancer,obj=qps›] 5405  local load ‹(queries-per-second=3052.2 cpu-per-second=396ms)› is below max threshold ‹(queries-per-second=3081.3 cpu-per-second=66ms)› mean=‹(queries-per-second=2801.1 cpu-per-second=66ms)›; no rebalancing needed
#
# s1 store rebalancer attempt
I230720 09:40:55.553639 215 13@kv/kvserver/store_rebalancer.go:451 ⋮ [T1,n1,s1,store-rebalancer,obj=qps›] 5508  considering load-based lease transfers for s1 with ‹(queries-per-second=3580.4 cpu-per-second=452ms)› load, mean=‹(queries-per-second=3047.3 cpu-per-second=75ms)›, upperThreshold=‹(queries-per-second=3352.0 cpu-per-second=75ms)›
I230720 09:40:55.561197 215 13@kv/kvserver/store_rebalancer.go:564 ⋮ [T1,n1,s1,store-rebalancer,obj=qps›] 5519  load-based lease transfers successfully brought s1 down to ‹(queries-per-second=3068.0 cpu-per-second=405ms)› load, mean=‹(queries-per-second=3047.3 cpu-per-second=75ms)›, upperThreshold=‹(queries-per-second=3352.0 cpu-per-second=75ms)›)
#
# s1 store rebalancer attempt
I230720 09:41:54.142812 215 13@kv/kvserver/store_rebalancer.go:451 ⋮ [T1,n1,s1,store-rebalancer,obj=qps›] 5607  considering load-based lease transfers for s1 with ‹(queries-per-second=3952.0 cpu-per-second=502ms)› load, mean=‹(queries-per-second=3161.8 cpu-per-second=84ms)›, upperThreshold=‹(queries-per-second=3478.0 cpu-per-second=84ms)›
I230720 09:41:54.154695 215 13@kv/kvserver/store_rebalancer.go:564 ⋮ [T1,n1,s1,store-rebalancer,obj=qps›] 5615  load-based lease transfers successfully brought s1 down to ‹(queries-per-second=3258.1 cpu-per-second=433ms)› load, mean=‹(queries-per-second=3161.8 cpu-per-second=84ms)›, upperThreshold=‹(queries-per-second=3478.0 cpu-per-second=84ms)›)
#
# 5 minute timeout hits, test fails.
09:41:55 test_impl.go:345: test failure #1: full stack retained in failure_1.log: (rebalance_load.go:180).func1: CPU not evenly balanced after timeout: outside bounds mean=78.9 tolerance=15.0% (±11.8) bounds=[67.0, 90.7]
	below  = []
	within = [s2: 72 (-8.5%), s3: 76 (-3.2%), s4: 77 (-1.4%), s5: 74 (-5.1%), s6: 76 (-2.5%)]
	above  = [s1: 95 (+20.8%)]

If the test had run for another 30 seconds, the load would have been balanced.

It is surprising that it takes 5 rebalancing cycles to get there, although I suppose it is very possible with initial racing/trashing. I didn't confirm that was happening here, however.

@kvoli
Copy link
Collaborator

kvoli commented Jul 28, 2023

I suspect there might be some quick improvements mentioned in #104292.

Ideally, it should only take 2 rebalance cycles after beginning the workload, to balance. From stressing the test, the majority of the time this appears the case. However, not always.

kvoli added a commit to kvoli/cockroach that referenced this issue Jul 31, 2023
The `rebalance/by-load` roachtests collect timeseries data from the
cluster. If these tests started quickly enough, no datapoints could
exist which would cause an out of bounds array error.

Handle the no datapoints case, and log a message without failing the
test.

Informs: cockroachdb#107247

Release note: None
kvoli added a commit to kvoli/cockroach that referenced this issue Jul 31, 2023
The `rebalance/by-load` roachtests would immediately pass once the
balance target was hit. However, it was possible that the cluster was
only transiently balanced.

Bump the timeout of all tests to be uniformly 10 minutes, from a
previous 5-10 minutes, require that the load remains balanced for at
least 1 minute to pass.

Informs: cockroachdb#107247

Release note: None
kvoli added a commit to kvoli/cockroach that referenced this issue Jul 31, 2023
The `rebalance/by-load` roachtests collect timeseries data from the
cluster. If these tests started quickly enough, no datapoints could
exist which would cause an out of bounds array error.

Handle the no datapoints case, and log a message without failing the
test.

Informs: cockroachdb#107247

Release note: None
kvoli added a commit to kvoli/cockroach that referenced this issue Jul 31, 2023
The `rebalance/by-load` roachtests would immediately pass once the
balance target was hit. However, it was possible that the cluster was
only transiently balanced.

Bump the timeout of all tests to be uniformly 10 minutes, from a
previous 5-10 minutes, require that the load remains balanced for at
least 1 minute to pass.

Informs: cockroachdb#107247

Release note: None
craig bot pushed a commit that referenced this issue Aug 2, 2023
107279: sql: remove isInternalPlanner boolean r=yuzefovich a=yuzefovich

This boolean is very confusing and can mislead people, so let's remove it.

It was introduced in bf9ffe9 with one specific use case in mind: lease management for user defined types when the execution flow occurs in the context of a user transaction. In that scenario, the execution engine uses the planner's `descs.Collection` when running the local plan. However, I don't see the reason why we cannot do the same thing for queries outside of the user transaction (where we have the "internal planner" object), so this commit removes that extra conditional. That said, I don't quite understand how these things work, but we can always go back to not using the planner's collection for "internal planners" (we'd name the boolean differently though) if we see any fallout.

Epic: None
Release note: None

107834: roachtest: require stability in rebalance by load roachtests r=andrewbaptist a=kvoli

The `rebalance/by-load` roachtests collect timeseries data from the
cluster. If these tests started quickly enough, no datapoints could
exist which would cause a out of bounds array exception.

Handle the no datapoints case, and log a message without failing the
test.

Informs: #107247

----

The `rebalance/by-load` roachtests would immediately pass once the
balance target was hit. However, it was possible that the cluster was
only transiently balanced.

Bump the timeout of all tests to be uniformly 10 minutes, from a
previous 5 minutes, require that the load remains balanced for at
least 1 minute to pass.

Informs: #107247

Release note: None

107892: sql: fix formatting of import, backup and create tenant r=chengxiong-ruan a=chengxiong-ruan

Informs: #99185

This commit cherry-picked changes from #107723 and add fixes for formatting `EXPORT` and `CHANGEFEED` options.

Release note: None

107955: roachtest: increase the concurrency for gracefuldraining r=kvoli a=andrewbaptist

This set of commits changes gracefuldraining to use the `measureQPS` method instead of looking at the stats from the stats database and additionally increases the concurrency so it no longer fails.

Co-authored-by: Yahor Yuzefovich <[email protected]>
Co-authored-by: Austen McClernon <[email protected]>
Co-authored-by: Chengxiong Ruan <[email protected]>
Co-authored-by: Andrew Baptist <[email protected]>
blathers-crl bot pushed a commit that referenced this issue Aug 2, 2023
The `rebalance/by-load` roachtests collect timeseries data from the
cluster. If these tests started quickly enough, no datapoints could
exist which would cause an out of bounds array error.

Handle the no datapoints case, and log a message without failing the
test.

Informs: #107247

Release note: None
blathers-crl bot pushed a commit that referenced this issue Aug 2, 2023
The `rebalance/by-load` roachtests would immediately pass once the
balance target was hit. However, it was possible that the cluster was
only transiently balanced.

Bump the timeout of all tests to be uniformly 10 minutes, from a
previous 5-10 minutes, require that the load remains balanced for at
least 1 minute to pass.

Informs: #107247

Release note: None
@kvoli
Copy link
Collaborator

kvoli commented Aug 2, 2023

With #108000 merged -- any new failure will be on a sufficiently different test. Closing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-kv-distribution Relating to rebalancing and leasing. branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. T-kv KV Team
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants