
roachtest: rebalance/by-load/leases failed #93105

Closed
cockroach-teamcity opened this issue Dec 6, 2022 · 4 comments


cockroach-teamcity commented Dec 6, 2022

roachtest.rebalance/by-load/leases failed with artifacts on release-22.2 @ adf24a585553397a7ad8e4bb6da0609bb5d2e930:

test artifacts and logs in: /artifacts/rebalance/by-load/leases/run_1
	rebalance_load.go:148,rebalance_load.go:167,test_runner.go:930: timed out before leases were evenly spread

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_encrypted=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

Same failure on other branches

/cc @cockroachdb/kv-triage

This test on roachdash | Improve this report!

Jira issue: CRDB-22159

Epic CRDB-18656

@cockroach-teamcity cockroach-teamcity added branch-release-22.2 Used to mark GA and release blockers, technical advisories, and bugs for 22.2 C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Dec 6, 2022
@cockroach-teamcity cockroach-teamcity added this to the 22.2 milestone Dec 6, 2022
@blathers-crl blathers-crl bot added the T-kv KV Team label Dec 6, 2022

aliher1911 commented Dec 6, 2022

From the test log:

06:28:21 cluster.go:677: test status: uploading file
06:28:21 cluster_synced.go:1472: teamcity-7852350-1670307581-08-n4cpu4: putting (dist) /go/src/github.com/cockroachdb/cockroach/bin/workload ./workload on nodes [4]
06:28:34 cluster_synced.go:1670: 
06:28:34 cluster_synced.go:1673:    4: done
06:28:34 cluster.go:677: test status: 
06:28:36 rebalance_load.go:113: test status: checking for lease balance
06:28:36 rebalance_load.go:97: starting load generator
06:28:36 rebalance_load.go:118: test status: disable load based splitting
06:28:36 rebalance_load.go:271: not all stores have a lease yet: [s1: 3]
06:28:41 rebalance_load.go:271: not all stores have a lease yet: [s1: 3]
06:28:46 rebalance_load.go:271: not all stores have a lease yet: [s1: 3]
06:28:51 rebalance_load.go:271: not all stores have a lease yet: [s1: 3]
06:28:56 rebalance_load.go:271: not all stores have a lease yet: [s1: 3]
06:29:01 rebalance_load.go:271: not all stores have a lease yet: [s1: 3]
06:29:06 rebalance_load.go:271: not all stores have a lease yet: [s1: 3]
06:29:12 rebalance_load.go:271: not all stores have a lease yet: [s1: 3]
06:29:17 rebalance_load.go:271: not all stores have a lease yet: [s1: 3]
06:29:22 rebalance_load.go:271: not all stores have a lease yet: [s1: 3]
06:29:27 rebalance_load.go:271: not all stores have a lease yet: [s1: 2, s3: 1]
06:29:32 rebalance_load.go:271: not all stores have a lease yet: [s1: 1, s3: 2]
06:29:37 rebalance_load.go:271: not all stores have a lease yet: [s1: 1, s3: 2]
06:29:43 rebalance_load.go:271: not all stores have a lease yet: [s1: 1, s3: 2]
06:29:48 rebalance_load.go:271: not all stores have a lease yet: [s1: 1, s3: 2]
06:29:53 rebalance_load.go:271: not all stores have a lease yet: [s1: 1, s3: 2]
06:29:58 rebalance_load.go:271: not all stores have a lease yet: [s1: 1, s3: 2]
06:30:03 rebalance_load.go:271: not all stores have a lease yet: [s3: 3]
06:30:09 rebalance_load.go:271: not all stores have a lease yet: [s3: 3]
06:30:14 rebalance_load.go:271: not all stores have a lease yet: [s3: 3]
06:30:19 rebalance_load.go:271: not all stores have a lease yet: [s3: 3]
06:30:24 rebalance_load.go:271: not all stores have a lease yet: [s3: 3]
06:30:30 rebalance_load.go:271: not all stores have a lease yet: [s1: 1, s3: 2]
06:30:35 rebalance_load.go:271: not all stores have a lease yet: [s1: 1, s3: 2]
06:30:40 rebalance_load.go:271: not all stores have a lease yet: [s1: 1, s3: 2]
06:30:45 rebalance_load.go:271: not all stores have a lease yet: [s1: 1, s3: 2]
06:30:50 rebalance_load.go:271: not all stores have a lease yet: [s1: 1, s3: 2]
06:30:56 rebalance_load.go:271: not all stores have a lease yet: [s1: 1, s3: 2]
06:31:01 rebalance_load.go:271: not all stores have a lease yet: [s1: 1, s3: 2]
06:31:06 rebalance_load.go:271: not all stores have a lease yet: [s1: 1, s3: 2]
06:31:11 rebalance_load.go:271: not all stores have a lease yet: [s1: 1, s3: 2]
06:31:17 rebalance_load.go:271: not all stores have a lease yet: [s1: 1, s3: 2]
06:31:22 rebalance_load.go:271: not all stores have a lease yet: [s1: 1, s3: 2]
06:31:27 rebalance_load.go:271: not all stores have a lease yet: [s1: 1, s3: 2]
06:31:32 rebalance_load.go:271: not all stores have a lease yet: [s1: 1, s3: 2]
06:31:37 test_impl.go:323: test failure: 	rebalance_load.go:148,rebalance_load.go:167,test_runner.go:930: timed out before leases were evenly spread
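
For context, a minimal sketch of the kind of polling loop that produces the "not all stores have a lease yet" lines above (this is not the actual rebalance_load.go code; countLeases and its return shape are assumptions for illustration):

```go
// Hypothetical sketch of the lease-balance check loop. The real check lives in
// rebalance_load.go; countLeases is stubbed out here for illustration only.
package main

import (
	"fmt"
	"time"
)

// countLeases would normally query the cluster (e.g. crdb_internal.ranges) for
// the number of leases each store holds on the workload's table.
func countLeases() map[int]int {
	return map[int]int{1: 3} // e.g. all three leases still on s1
}

func waitForLeaseBalance(numStores int, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		leases := countLeases()
		if len(leases) == numStores {
			return nil // every store holds at least one lease
		}
		fmt.Printf("not all stores have a lease yet: %v\n", leases)
		time.Sleep(5 * time.Second)
	}
	return fmt.Errorf("timed out before leases were evenly spread")
}

func main() {
	if err := waitForLeaseBalance(3, 3*time.Minute); err != nil {
		fmt.Println("test failure:", err)
	}
}
```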

There are a few warnings in the logs regarding replica moves which may be related:

I221206 06:28:36.046118 3635 sql/schemachanger/scbuild/build.go:39 ⋮ [n1,client=10.142.1.129:56166,user=root] 92  building declarative schema change targets for DROP DATABASE
I221206 06:28:36.047546 3635 sql/schemachanger/scbuild/build.go:39 ⋮ [n1,client=10.142.1.129:56166,user=root] 93  building declarative schema change targets for CREATE DATABASE
W221206 06:28:36.047620 3635 runtime/panic.go:884 ⋮ [n1,client=10.142.1.129:56166,user=root] 94  failed building declarative schema change targets for CREATE DATABASE with error: ‹*tree.CreateDatabase not implemented in the new schema changer›
I221206 06:28:36.093974 3635 kv/kvserver/replica_command.go:416 ⋮ [n1,s1,r52/1:‹/{Table/52-Max}›,*roachpb.AdminSplitRequest] 95  initiating a split of this range at key /Table/106/1/‹3074457345618257920› [r53] (‹manual›)‹›
I221206 06:28:36.103876 3641 kv/kvserver/replica_command.go:1715 ⋮ [n1,replicate,s1,r52/1:‹/Table/{52-106/1/307…}›] 96  could not successfully add and upreplicate LEARNER replica(s) on [n3,s3], rolling back: change replicas of r52 failed: descriptor changed: [expected] r52:‹/{Table/52-Max}› [(n1,s1):1, next=2, gen=0] != [actual] r52:‹/Table/{52-106/1/3074457345618257920}› [(n1,s1):1, next=2, gen=1]
I221206 06:28:36.110887 3671 kv/kvserver/replica_command.go:1715 ⋮ [n1,replicate,s1,r53/1:‹/{Table/106/1/…-Max}›] 97  could not successfully add and upreplicate LEARNER replica(s) on [n3,s3], rolling back: change replicas of r53 failed: descriptor changed: [expected] r53:‹/{Table/106/1/3074457345618257920-Max}› [(n1,s1):1, next=2, gen=1, sticky=9223372036.854775807,2147483647] != [actual] r53:‹/{Table/106/1/3074457345618257920-Max}› [(n1,s1):1, (n3,s3):2LEARNER, next=3, gen=2, sticky=9223372036.854775807,2147483647]
I221206 06:28:36.120387 3671 kv/kvserver/replica_command.go:1994 ⋮ [n1,replicate,s1,r53/1:‹/{Table/106/1/…-Max}›] 98  rolled back LEARNER n3,s3 in r53:‹/{Table/106/1/3074457345618257920-Max}› [(n1,s1):1, (n3,s3):2LEARNER, next=3, gen=2, sticky=9223372036.854775807,2147483647]
I221206 06:28:36.125667 3635 kv/kvserver/replica_command.go:1098 ⋮ [n1,s1,r53/1:‹/{Table/106/1/…-Max}›,*roachpb.AdminScatterRequest] 99  could not promote [n3,s3] to voter, rolling back: change replicas of r53 failed: descriptor changed: [expected] r53:‹/{Table/106/1/3074457345618257920-Max}› [(n1,s1):1, (n3,s3):2LEARNER, next=3, gen=2, sticky=9223372036.854775807,2147483647] != [actual] r53:‹/{Table/106/1/3074457345618257920-Max}› [(n1,s1):1, next=3, gen=3, sticky=9223372036.854775807,2147483647]
I221206 06:28:36.128390 3635 kv/kvserver/replica_command.go:416 ⋮ [n1,s1,r52/1:‹/Table/{52-106/1/307…}›,*roachpb.AdminSplitRequest] 100  initiating a split of this range at key /Table/106/1/‹-3074457345618258944› [r54] (‹manual›)‹›
W221206 06:28:36.131824 3635 kv/kvserver/replica_command.go:557 ⋮ [n1,s1,r52/1:‹/Table/{52-106/1/307…}›,*roachpb.AdminSplitRequest] 101  retrying split after err: split at key /Table/106/1/‹-3074457345618258944› failed: descriptor changed: [expected] r52:‹/Table/{52-106/1/3074457345618257920}› [(n1,s1):1, next=2, gen=1] != [actual] r52:‹/Table/{52-106/1/3074457345618257920}› [(n1,s1):1, (n2,s2):2LEARNER, next=3, gen=2]: ‹unexpected value: raw_bytes:"\216\032P\277\003\0104\022\001\274\032\013\362\211\375*\252\252\252\252\252\250\000\"\010\010\001\020\001\030\001 \000\"\010\010\002\020\002\030\002 \001(\0030\002:\000" timestamp:<wall_time:1670308116121420005 > ›
I221206 06:28:36.204386 3635 kv/kvserver/replica_command.go:416 ⋮ [n1,s1,r52/1:‹/Table/{52-106/1/307…}›,*roachpb.AdminSplitRequest] 102  initiating a split of this range at key /Table/106/1/‹-3074457345618258944› [r55] (‹manual›)‹›
I221206 06:28:36.275073 3754 util/log/event_log.go:32 ⋮ [n1,client=34.85.143.77:60326,user=root] 103 ={"Timestamp":1670308116268494125,"EventType":"set_cluster_setting","Statement":"SET CLUSTER SETTING \"kv.range_split.by_load_enabled\" = false","Tag":"SET CLUSTER SETTING","User":"root","SettingName":"kv.range_split.by_load_enabled","Value":"‹false›"}
I221206 06:28:36.305019 3754 util/log/event_log.go:32 ⋮ [n1,client=34.85.143.77:60326,user=root] 104 ={"Timestamp":1670308116300035928,"EventType":"set_cluster_setting","Statement":"SET CLUSTER SETTING \"kv.allocator.load_based_rebalancing\" = $1::STRING","Tag":"SET CLUSTER SETTING","User":"root","PlaceholderValues":["‹'leases'›"],"SettingName":"kv.allocator.load_based_rebalancing","Value":"‹leases›"}
I221206 06:28:36.800124 4159 kv/kvserver/replica_consistency.go:283 ⋮ [n1,consistencyChecker,s1,r4/1:‹/System{/tsd-tse}›] 105  triggering stats recomputation to resolve delta of {ContainsEstimates:22652 LastUpdateNanos:1670308115988187101 IntentAge:0 GCBytesAge:0 LiveBytes:-175532 LiveCount:-4854 KeyBytes:-259668 KeyCount:-4854 ValBytes:84136 ValCount:-4854 IntentBytes:0 IntentCount:0 SeparatedIntentCount:0 RangeKeyCount:0 RangeKeyBytes:0 RangeValCount:0 RangeValBytes:0 SysBytes:0 SysCount:0 AbortSpanBytes:0}
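
These warnings follow the compare-and-swap pattern for range descriptor changes: each split or replica change carries the descriptor it expects, and if a concurrent operation bumped the descriptor in the meantime, the change rolls back and is retried against the refreshed descriptor. A minimal sketch of that pattern, assuming illustrative names (rangeDescriptor, adminOp) rather than CockroachDB's actual API:

```go
// Minimal sketch of the "descriptor changed" rollback-and-retry pattern seen in
// the warnings above. All names here are illustrative, not the real kvserver API.
package main

import (
	"errors"
	"fmt"
)

type rangeDescriptor struct {
	generation int
}

var errDescriptorChanged = errors.New("descriptor changed")

// adminOp stands in for AdminSplit / ChangeReplicas: it succeeds only if the
// expected descriptor still matches the current one, and bumps the generation.
func adminOp(expected, current *rangeDescriptor) error {
	if expected.generation != current.generation {
		return fmt.Errorf("%w: expected gen=%d, actual gen=%d",
			errDescriptorChanged, expected.generation, current.generation)
	}
	current.generation++
	return nil
}

func main() {
	current := &rangeDescriptor{generation: 1}
	expected := &rangeDescriptor{generation: 0} // stale view taken before a concurrent split

	for attempt := 1; ; attempt++ {
		if err := adminOp(expected, current); err != nil {
			if errors.Is(err, errDescriptorChanged) {
				fmt.Printf("attempt %d: retrying after err: %v\n", attempt, err)
				expected = &rangeDescriptor{generation: current.generation} // refresh and retry
				continue
			}
			panic(err)
		}
		fmt.Printf("attempt %d: succeeded at gen=%d\n", attempt, current.generation)
		break
	}
}
```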

@aliher1911 aliher1911 removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Dec 6, 2022
@kvoli kvoli self-assigned this Dec 9, 2022

kvoli commented Dec 9, 2022

Seems like it does eventually resolve itself, but not before the test times out:

I221206 06:31:37.906302 219 13@kv/kvserver/store_rebalancer.go:464 ⋮ [n3,s3,store-rebalancer] 384  transferring lease for r55 (qps=4095.79) to store s2 (qps=25.37) from local store s3 (qps=8181.25)
I221206 06:31:37.906328 219 13@kv/kvserver/replicate_queue.go:1862 ⋮ [n3,s3,store-rebalancer] 385  transferring lease to s2
I221206 06:31:37.925788 219 13@kv/kvserver/store_rebalancer.go:279 ⋮ [n3,s3,store-rebalancer] 386  load-based lease transfers successfully brought s3 down to 4085.46 qps (mean=4141.94, upperThreshold=4556.13)

s2 needs to grab a lease but doesn't until right at the end. Downloaded the logs; will dig in further later.

@lunevalex lunevalex added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-testing Testing tools and infrastructure labels Dec 9, 2022

kvoli commented Dec 13, 2022

This test failure appears to be due to two causes. Had either occurred in
isolation, the test probably would have met the requirement of balancing
leases within 3 minutes. With both occurring together, and with poor timing,
it did not.

  1. The replicate queue transfers leases without considering QPS, i.e. it
    competes against the store rebalancer.
  2. The estimated impact of a previous transfer is overwritten by new gossip
    information that doesn't always reflect actions taken in prior steps.

For 1:

The replicate queue undoes the work of another store's store rebalancer
component by transferring a high-QPS lease back, despite this causing greater
imbalance. The relevant log lines, spliced together, are shown below.

06:29:32 rebalance_load.go:271: not all stores have a lease yet: [s1: 1, s3: 2]
I221206 06:30:00.450118 450015 13@kv/kvserver/replicate_queue.go:1862 ⋮ [n1,replicate,s1,r53/1:‹/{Table/106/1/…-Max}›] 2902  transferring lease to s3
06:30:03 rebalance_load.go:271: not all stores have a lease yet: [s3: 3]
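
To illustrate the competition in (1), here is a hypothetical sketch assuming deliberately caricatured heuristics (pickByLeaseCount and pickByQPS are illustrative names, not the real code): a count-based lease transfer can pick a target that a QPS-based store rebalancer would never choose, so the two components can undo each other's work.

```go
// Hypothetical sketch of why a count-based and a QPS-based heuristic can fight.
// Names and heuristics are caricatures for illustration, not the real allocator.
package main

import "fmt"

type store struct {
	id     int
	leases int
	qps    float64
}

// pickByLeaseCount mimics a count-based heuristic: move a lease to the store
// with the fewest leases, regardless of how hot that store already is.
func pickByLeaseCount(stores []store) int {
	best := stores[0]
	for _, s := range stores[1:] {
		if s.leases < best.leases {
			best = s
		}
	}
	return best.id
}

// pickByQPS mimics a load-based heuristic: move a lease to the store with the
// lowest QPS, so hot leases spread load rather than lease counts.
func pickByQPS(stores []store) int {
	best := stores[0]
	for _, s := range stores[1:] {
		if s.qps < best.qps {
			best = s
		}
	}
	return best.id
}

func main() {
	stores := []store{
		{id: 1, leases: 2, qps: 6929.37},
		{id: 2, leases: 2, qps: 25.00},
		{id: 3, leases: 1, qps: 8181.25}, // already the hottest store
	}
	// The heuristics disagree: count-based picks the overloaded s3, QPS-based picks s2.
	fmt.Println("count-based target:", pickByLeaseCount(stores))
	fmt.Println("QPS-based target:  ", pickByQPS(stores))
}
```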

For 2:

The replicate queue transfers a lease from s1 to s3. The estimated impact of
this transfer is then lost: when the store rebalancer begins its loop a few
seconds later, s3 shows only 3 QPS whilst s1 shows 6929.37 (the s1 figure is
correct), so the newly gossiped information must have overridden the updated
estimate for s3.

This leads to s1 transferring another lease to s3, which it believes has 3
QPS but which must actually have >3k.

The relevant timeline of spliced-together log lines is shown below.

06:29:22 rebalance_load.go:271: not all stores have a lease yet: [s1: 3]
I221206 06:29:24.778492 342366 13@kv/kvserver/replicate_queue.go:1862 ⋮ [n1,replicate,s1,r55/1:‹/Table/106/1/{-30744…-307445…}›] 2569  transferring lease to s3
06:29:27 rebalance_load.go:271: not all stores have a lease yet: [s1: 2, s3: 1]
I221206 06:29:27.864824 245 13@kv/kvserver/store_rebalancer.go:464 ⋮ [n1,s1,store-rebalancer] 2595  transferring lease for r56 (qps=3268.15) to store s3 (qps=3.72) from local store s1 (qps=6929.37)
06:29:32 rebalance_load.go:271: not all stores have a lease yet: [s1: 1, s3: 2]
I221206 06:30:00.450118 450015 13@kv/kvserver/replicate_queue.go:1862 ⋮ [n1,replicate,s1,r53/1:‹/{Table/106/1/…-Max}›] 2902  transferring lease to s3
06:30:03 rebalance_load.go:271: not all stores have a lease yet: [s3: 3]
I221206 06:30:29.799630 219 13@kv/kvserver/store_rebalancer.go:464 ⋮ [n3,s3,store-rebalancer] 213  transferring lease for r56 (qps=4053.37) to store s1 (qps=20.26) from local store s3 (qps=11973.62)
06:30:30 rebalance_load.go:271: not all stores have a lease yet: [s1: 1, s3: 2]
06:31:37 test_impl.go:323: test failure: 	rebalance_load.go:148,rebalance_load.go:167,test_runner.go:930: timed out before leases were evenly spread
I221206 06:31:37.906302 219 13@kv/kvserver/store_rebalancer.go:464 ⋮ [n3,s3,store-rebalancer] 384  transferring lease for r55 (qps=4095.79) to store s2 (qps=25.37) from local store s3 (qps=8181.25)

The resolution to (1) would ideally be to eliminate lease transfers from the
replicate queue, or to align the two components' objectives.

The resolution to (2) is more complex: it would involve timestamping the
estimated impacts applied to the store pool, then invalidating those estimates
based on the timestamps of newly received gossip infos.
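
A minimal sketch of that bookkeeping, assuming illustrative types (storePoolEntry, storeEstimate) rather than the store pool's actual API: each locally applied estimate is stamped, and an incoming gossip reading only clears estimates older than the gossip capture time.

```go
// Minimal sketch of timestamped impact estimates; all names are assumptions
// for illustration, not the store pool's actual API.
package main

import (
	"fmt"
	"time"
)

type storeEstimate struct {
	qpsDelta  float64   // locally estimated QPS change from a pending transfer
	appliedAt time.Time // when the estimate was applied
}

type storePoolEntry struct {
	gossipedQPS float64
	gossipedAt  time.Time
	estimates   []storeEstimate
}

// effectiveQPS is the gossiped value plus any estimates gossip has not yet reflected.
func (e *storePoolEntry) effectiveQPS() float64 {
	qps := e.gossipedQPS
	for _, est := range e.estimates {
		qps += est.qpsDelta
	}
	return qps
}

// applyGossip installs a new gossiped reading and drops only the estimates that
// predate it; newer estimates survive, so a transfer made just after the stats
// were captured is not forgotten.
func (e *storePoolEntry) applyGossip(qps float64, capturedAt time.Time) {
	e.gossipedQPS, e.gossipedAt = qps, capturedAt
	kept := e.estimates[:0]
	for _, est := range e.estimates {
		if est.appliedAt.After(capturedAt) {
			kept = append(kept, est)
		}
	}
	e.estimates = kept
}

func main() {
	s3 := &storePoolEntry{gossipedQPS: 3}
	// Replicate queue transfers a hot lease to s3 and records the estimated impact.
	s3.estimates = append(s3.estimates, storeEstimate{qpsDelta: 3268, appliedAt: time.Now()})
	// A gossip update captured *before* the transfer arrives a moment later; with
	// timestamps it no longer wipes the newer estimate.
	s3.applyGossip(3, time.Now().Add(-10*time.Second))
	fmt.Printf("effective QPS for s3: %.2f\n", s3.effectiveQPS())
}
```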

I filed #93532 for (2).

@github-actions

We have marked this test failure issue as stale because it has been
inactive for 1 month. If this failure is still relevant, removing the
stale label or adding a comment will keep it active. Otherwise,
we'll close it in 5 days to keep the test failure queue tidy.
