Ranges configured with num_replicas=5 reported as underreplicated in a cluster with 4 live nodes but not in a cluster with 3 live nodes #52528
Comments
Hello, I am Blathers. I am here to help you get the issue triaged. Hoot - a bug! Though bugs are the bane of my existence, rest assured the wretched thing will get the best of care here. I have CC'd a few people who may be able to assist you:
If we have not gotten back to your issue within a few business days, you can try the following:
🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan. |
I cannot reproduce this issue. |
I haven't tried reproducing this specific set of steps, but FWIW today all I had to do to get my ranges erroneously not counted as underreplicated was to create a 1 node cluster using v20.1.3, create a couple tables, configure the default replication zone to require 5 replicas, and scale the cluster up to 4 nodes. There were 34 ranges reported as underreplicated when the cluster only had the one node, but once it got up to 4 nodes they were each reporting 0 ranges as underreplicated. |
You are right, I will fix this issue.
The under-replicated count does not reflect the gap between the number of replicas configured by the user (num_replicas=5) and the actual number of replicas.
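For illustration, here is a minimal sketch of how an adaptive replication target could produce this. It is an assumption for discussion, not code taken from CockroachDB: the names `effective_replicas` and `is_underreplicated` are hypothetical, and the rule (cap the target at the live node count, round down to an odd number, never go below 3 or above the configured value) is one plausible reading of the behavior described in this thread.

```python
def effective_replicas(configured: int, live_nodes: int) -> int:
    """Hypothetical adaptive target; an assumption, not CockroachDB's code."""
    # Cap the target at the number of live nodes.
    need = min(configured, live_nodes)
    if need == configured:
        # Enough nodes to honor the zone config as written.
        return need
    # Avoid an even replica count the user did not ask for.
    if need % 2 == 0:
        need -= 1
    # Never aim below 3, and never above what the user configured.
    need = max(need, 3)
    return min(need, configured)


def is_underreplicated(actual: int, configured: int, live_nodes: int) -> bool:
    """Compare against the adapted target rather than the configured value."""
    return actual < effective_replicas(configured, live_nodes)


if __name__ == "__main__":
    for nodes, actual in [(1, 1), (3, 3), (4, 3), (5, 3)]:
        flagged = is_underreplicated(actual, configured=5, live_nodes=nodes)
        print(f"live_nodes={nodes} actual={actual} underreplicated={flagged}")
```

Under these assumptions, a range with 3 replicas and num_replicas=5 is only flagged once the cluster has 5 live nodes, because with 3 or 4 live nodes the adapted target is already 3. A single-node cluster is flagged because the target never drops below 3.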
In hindsight, we should have never introduced this unexpected behavior. I also think we added it just to be able to run a five-replica-default for the system ranges, but there would've been more targeted approaches to getting that behavior there without an unfortunate UX across the board. |
Can you explain in detail what "phase out the adaptive zone config behavior" means? @tbg
We discovered another footgun with the adaptive replication factor in #54444. Short example:
I believe we have some protections that try to stop the allocator from downreplicating into unavailability, but they are based on liveness and gossiped information, so they don't kick in until some time after the cluster restart. The adaptive replication factor basically automatically triggers this potential problem in this scenario. Granted, there is a pretty unlikely sequence of events (the full-cluster restart and tight timing of everything) but it's disconcerting and further evidence that the adaptive repl factor was a bad idea. |
We have marked this issue as stale because it has been inactive for |
Describe the problem
See issue title
To Reproduce
ALTER RANGE default CONFIGURE ZONE USING num_replicas = 5
With 4 live nodes, the ranges_underreplicated metric value increases.
With 3 live nodes, ranges_underreplicated drops back down to 0.
Note that the nodes might have to be decommissioned rather than killed. I didn't try reproducing this myself and am not sure which it was when I saw this, or whether it makes a difference either way.
Expected behavior
For the ranges to still be reported as underreplicated.
Environment:
Additional context
I get that there was a special case put in place to avoid complaining about system ranges being underreplicated in a 3-node cluster, but I don't think our intent was ever for it to apply to other ranges in the cluster, or in a cluster that had previously had more than 3 live nodes in it.
Jira issue: CRDB-3941