
kvserver: make config change failures less scary #72546

Closed
ajwerner opened this issue Nov 8, 2021 · 5 comments
Labels
A-kv Anything in KV that doesn't belong in a more specific category. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) no-issue-activity X-stale

Comments

@ajwerner
Contributor

ajwerner commented Nov 8, 2021

Is your feature request related to a problem? Please describe.

Concurrent replication changes can race with one another and fail. These failures are benign; we should mark them as such and avoid the scary logging.

Consider:

E211108 22:30:25.824924 2036041 kv/kvserver/queue.go:1090  [n1,raftsnapshot,s1,r46/1:/Table/53/1/{13353-38300}] snapshot failed: (n4,s4):2LEARNER: remote couldn't accept LEARNER snapshot ce114580 at applied index 13 with error: [n4,s4],r46: cannot apply snapshot: snapshot intersects existing range; initiated GC: [n4,s4,r38/2:/Table/53/1/{5197-38300}] (incoming /Table/53/1/{13353-38300})

and

E211108 22:30:25.825287 822 kv/kvserver/store_rebalancer.go:315  [n1,s1,store-rebalancer] unable to relocate range to [n4,s4]: change replicas of r38 failed: descriptor changed: [expected] r38:/Table/53/1/{5197-38300} [(n1,s1):1, (n4,s4):2LEARNER, next=3, gen=13] != [actual] r38:/Table/53/1/{5197-13353} [(n1,s1):1, next=3, gen=15]
while carrying out changes [{ADD_REPLICA n4,s4}]

log.Errorf(ctx, "unable to relocate range to %v: %+v", voterTargets, err)

Describe the solution you'd like
We can keep logging these errors if we'd like, but at worst at warning severity, and better yet at info.

Additional context
Relates to #41392.

Jira issue: CRDB-11200

@ajwerner ajwerner added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-kv Anything in KV that doesn't belong in a more specific category. labels Nov 8, 2021
@LKaemmerling

FYI: I ran into this with my 6-node cluster. We recently moved to different servers (dedicated hardware instead of cloud VMs) and performance on the new hardware was dramatically bad. The only suspicious thing I can find in the logs is lines similar to the one above.

E211121 14:17:17.059139 403 kv/kvserver/store_rebalancer.go:328 ⋮ [n27,s27,store-rebalancer] 119886  unable to relocate range to [n32,s32 n28,s28 n31,s31]: while carrying out changes [{ADD_VOTER n32,s32} {REMOVE_VOTER n27,s27}]: change replicas of r4 failed: descriptor changed: [expected] r4:‹/System/tsd{-/cr.node.changefeed.checkpoint_hist_nanos-p99.999/1/30m/2021-09-18T00:00:00Z}› [(n31,s31):1049, (n27,s27):1046, (n28,s28):1047, (n32,s32):1059LEARNER, next=1060, gen=2804] != [actual] r4:‹/System/tsd{-/cr.node.changefeed.checkpoint_hist_nanos-p99.999/1/30m/2021-09-18T00:00:00Z}› [(n31,s31):1049, (n27,s27):1046, (n28,s28):1047, next=1060, gen=2805]

We have now moved back to the cloud servers, and it looks like the log lines are gone there too, which makes it even scarier. Even though the new hardware is much faster and should work extremely well out of the box, for a reason we cannot find, performance is far worse, with SQL latency of up to 10 seconds.

@ajwerner
Contributor Author

I can assure you that this log line has nothing to do with your bad performance.

@ajwerner
Contributor Author

Consider looking at the Storage dashboard for high latency or various other hardware dashboards. Additionally, make sure that the latency between your nodes is reasonable.

@LKaemmerling

Consider looking at the Storage dashboard for high latency or various other hardware dashboards. Additionally, make sure that the latency between your nodes is reasonable.

Thanks for your hint. The network latency was around 0.5 ms between the nodes (they are plugged into the same switch), and the rest of the dashboards looked even better than with the cloud instances. The only log entries I can find that look "suspicious" are the ones I posted above. Unfortunately, the metrics are no longer visible in the dashboards because the nodes were decommissioned due to the significant performance issues.

@github-actions

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
10 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!
