
kvserver: make config change failures less scary #72546

Closed
ajwerner opened this issue Nov 8, 2021 · 5 comments
Labels
A-kv Anything in KV that doesn't belong in a more specific category. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) no-issue-activity X-stale

Comments

@ajwerner
Contributor

ajwerner commented Nov 8, 2021

Is your feature request related to a problem? Please describe.

Concurrent replication changes can race with one another and fail. These failures are benign; we should mark them as such and avoid the scary logging.

Consider:

E211108 22:30:25.824924 2036041 kv/kvserver/queue.go:1090  [n1,raftsnapshot,s1,r46/1:/Table/53/1/{13353-38300}] snapshot failed: (n4,s4):2LEARNER: remote couldn't accept LEARNER snapshot ce114580 at applied index 13 with error: [n4,s4],r46: cannot apply snapshot: snapshot intersects existing range; initiated GC: [n4,s4,r38/2:/Table/53/1/{5197-38300}] (incoming /Table/53/1/{13353-38300})

and

E211108 22:30:25.825287 822 kv/kvserver/store_rebalancer.go:315  [n1,s1,store-rebalancer] unable to relocate range to [n4,s4]: change replicas of r38 failed: descriptor changed: [expected] r38:/Table/53/1/{5197-38300} [(n1,s1):1, (n4,s4):2LEARNER, next=3, gen=13] != [actual] r38:/Table/53/1/{5197-13353} [(n1,s1):1, next=3, gen=15]
while carrying out changes [{ADD_REPLICA n4,s4}]

log.Errorf(ctx, "unable to relocate range to %v: %+v", voterTargets, err)

Describe the solution you'd like
We can keep logging these errors if we'd like, but at worst at warning severity, and better yet at info.

Additional context
Relates to #41392.

Jira issue: CRDB-11200

@ajwerner ajwerner added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-kv Anything in KV that doesn't belong in a more specific category. labels Nov 8, 2021
@LKaemmerling

FYI: I ran into this with my 6-node cluster. We recently moved to different servers (dedicated hardware instead of cloud VMs) and performance on the new hardware was dramatically bad. The only suspicious thing I can find in the logs is lines similar to the one above.

E211121 14:17:17.059139 403 kv/kvserver/store_rebalancer.go:328 ⋮ [n27,s27,store-rebalancer] 119886  unable to relocate range to [n32,s32 n28,s28 n31,s31]: while carrying out changes [{ADD_VOTER n32,s32} {REMOVE_VOTER n27,s27}]: change replicas of r4 failed: descriptor changed: [expected] r4:‹/System/tsd{-/cr.node.changefeed.checkpoint_hist_nanos-p99.999/1/30m/2021-09-18T00:00:00Z}› [(n31,s31):1049, (n27,s27):1046, (n28,s28):1047, (n32,s32):1059LEARNER, next=1060, gen=2804] != [actual] r4:‹/System/tsd{-/cr.node.changefeed.checkpoint_hist_nanos-p99.999/1/30m/2021-09-18T00:00:00Z}› [(n31,s31):1049, (n27,s27):1046, (n28,s28):1047, next=1060, gen=2805]

We have now moved back to the cloud servers, and it looks like the log lines are gone there too, which makes it even scarier. Even though the new hardware is much faster and should work extremely well out of the box, for a reason we cannot find, performance is far worse, with SQL latency of up to 10 seconds.

@ajwerner
Contributor Author

I can assure you that this log line has nothing to do with your bad performance.

@ajwerner
Contributor Author

Consider looking at the Storage dashboard for high latency or various other hardware dashboards. Additionally, make sure that the latency between your nodes is reasonable.

@LKaemmerling

Consider looking at the Storage dashboard for high latency or various other hardware dashboards. Additionally, make sure that the latency between your nodes is reasonable.

Thanks for your hint. The network latency was around 0.5 ms between the nodes (they are plugged into the same switch), and the rest of the dashboards looked even better than with the cloud instances. The only log entries I can find that look "suspicious" are the ones I posted above. Unfortunately, the metrics are no longer visible in the dashboards because the nodes were decommissioned due to the significant performance issues.

@github-actions

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
10 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!
