
Ranges configured with num_replicas=5 reported as underreplicated in a cluster with 4 live nodes but not in a cluster with 3 live nodes #52528

Closed
a-robinson opened this issue Aug 8, 2020 · 9 comments
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. no-issue-activity O-community Originated from the community T-kv KV Team X-stale

Comments

@a-robinson
Contributor

a-robinson commented Aug 8, 2020

Describe the problem

See issue title

To Reproduce

  1. Set up a cluster with at least 5 nodes
  2. Configure the default replication zone with num_replicas=5 (ALTER RANGE default CONFIGURE ZONE USING num_replicas = 5)
  3. Create a table
  4. Kill a node. Observe that the ranges_underreplicated metric value increases
  5. Kill a second node. Observe that the ranges_underreplicated metric drops back down to 0.

Note that the nodes might have to be decommissioned rather than killed; I didn't try reproducing this myself, so I'm not sure which it was when I saw this, or whether it makes a difference either way.

Expected behavior

For the ranges to still be reported as underreplicated.

Environment:

  • CockroachDB version: v20.1.3

Additional context

I get that there was a special case put in place to avoid complaining about system ranges being underreplicated in a 3-node cluster, but I don't think our intent was ever for it to apply to other ranges in the cluster, or in a cluster that had previously had more than 3 live nodes in it.
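For illustration, here is a hypothetical Python sketch of the kind of clamping that would produce the reported behavior. This is not the actual CockroachDB source, and the function and parameter names are invented: the idea is that the "needed" replica count gets capped at the number of available nodes and rounded down to an odd number, so with num_replicas = 5 and 4 nodes the effective target becomes 3.

```python
# Hypothetical sketch of adaptive replication-factor clamping; names are
# invented for illustration and are not taken from the CockroachDB source.
def needed_voters(configured: int, available_nodes: int) -> int:
    """Cap the configured replica count at the node count, rounded down to odd."""
    need = min(configured, available_nodes)
    if need % 2 == 0:  # an even replica count adds no fault tolerance
        need -= 1
    return max(need, 1)

def is_under_replicated(configured: int, available_nodes: int,
                        live_replicas: int) -> bool:
    """A range counts as under-replicated only against the clamped target."""
    return live_replicas < needed_voters(configured, available_nodes)

# With num_replicas = 5 in a 4-node cluster the target is clamped to 3,
# so a range with only 3 live replicas is not counted as under-replicated.
print(needed_voters(5, 4))           # 3
print(is_under_replicated(5, 4, 3))  # False
print(is_under_replicated(5, 5, 4))  # True
```

Under this (assumed) logic, the metric tracks the clamped target rather than the zone config the user actually wrote, which matches the behavior described above.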

Jira issue: CRDB-3941

@blathers-crl

blathers-crl bot commented Aug 8, 2020

Hello, I am Blathers. I am here to help you get the issue triaged.

Hoot - a bug! Though bugs are the bane of my existence, rest assured the wretched thing will get the best of care here.

I have CC'd a few people who may be able to assist you:

  • @florence-crl (member of the technical support engineering team)

If we have not gotten back to your issue within a few business days, you can try the following:

  • Join our community slack channel and ask on #cockroachdb.
  • Try to find someone from here if you know they worked closely on the area and CC them.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@blathers-crl blathers-crl bot added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-community Originated from the community X-blathers-oncall labels Aug 8, 2020
@ymxdgyx
Contributor

ymxdgyx commented Aug 8, 2020

I cannot reproduce this issue.

@a-robinson
Contributor Author

I haven't tried reproducing this specific set of steps, but FWIW, today all I had to do to get ranges erroneously not counted as underreplicated was: create a 1-node cluster using v20.1.3, create a couple of tables, configure the default replication zone to require 5 replicas, and scale the cluster up to 4 nodes. There were 34 ranges reported as underreplicated when the cluster had only the one node, but once it got up to 4 nodes, each node reported 0 underreplicated ranges.

@ymxdgyx
Contributor

ymxdgyx commented Aug 12, 2020

You are right; I will fix this issue.

@ymxdgyx
Contributor

ymxdgyx commented Aug 12, 2020

The under-replicated metric does not represent the relationship between the number of replicas the user configured (num_replicas=5) and the actual number of replicas.
I plan to add an under-replicated-for-config metric to represent that.
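The distinction being proposed here can be sketched as follows. This is purely illustrative, not CockroachDB source; "under_replicated_for_config" follows the name proposed in this comment, and everything else is invented:

```python
# Illustrative only; not CockroachDB code.
def under_replicated(live_replicas: int, clamped_target: int) -> bool:
    # Today's behavior: measured against the adaptively clamped target.
    return live_replicas < clamped_target

def under_replicated_for_config(live_replicas: int, num_replicas: int) -> bool:
    # Proposed: measured against the zone config's num_replicas as written.
    return live_replicas < num_replicas

# num_replicas = 5, target clamped to 3 (4-node cluster), 3 live replicas:
print(under_replicated(3, 3))             # False
print(under_replicated_for_config(3, 5))  # True
```

The same range would report healthy on the existing metric while still being flagged by the config-relative one.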

@tbg
Member

tbg commented Aug 18, 2020

In hindsight, we should never have introduced this unexpected behavior. I also think we added it just to be able to run a five-replica default for the system ranges, but there would have been more targeted ways to get that behavior without an unfortunate UX across the board.
That said, I don't think the right fix is to slap another metric in there and to call it a day. My intuition is that we should phase out the adaptive zone config behavior.

@ymxdgyx
Contributor

ymxdgyx commented Aug 19, 2020

Can you explain in more detail what "phase out the adaptive zone config behavior" means? @tbg

@tbg
Member

tbg commented Sep 24, 2020

We discovered another footgun with the adaptive replication factor in #54444. Short example:

  • 5x replicated
  • stop all nodes, restart only nodes 1-3
  • adaptive replication factor changes replication factor to 3
  • allocator will try to downreplicate, but due to technical restrictions isn't able to remove n4 or n5 (since the store desc isn't gossiped any more; we lost that due to the full-cluster restart)
  • allocator will remove, say, n3 (leaving n1 n2 n4 n5), still ok
  • allocator will remove, say, n2
  • loss of quorum: of n1 n4 n5, only one node remains live.

I believe we have some protections that try to stop the allocator from downreplicating into unavailability, but they are based on liveness and gossiped information, so they don't kick in until some time after the cluster restart.

The adaptive replication factor basically triggers this potential problem automatically in this scenario. Granted, it requires a pretty unlikely sequence of events (the full-cluster restart and tight timing of everything), but it's disconcerting, and further evidence that the adaptive replication factor was a bad idea.
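The sequence above can be replayed as a toy simulation. This is purely illustrative Python, not CockroachDB code; the node names and the "only live stores are removable" restriction come from the comment, everything else is invented:

```python
# Toy replay of the downreplication footgun described above. Not CockroachDB code.
replicas = ["n1", "n2", "n3", "n4", "n5"]   # 5x-replicated range
live = {"n1", "n2", "n3"}                   # only nodes 1-3 were restarted

# Adaptive factor: min(configured 5, 3 live nodes) -> target of 3 replicas.
target = 3

# The allocator cannot remove n4/n5 (their store descriptors are no longer
# gossiped after the full-cluster restart), so it removes live replicas instead.
while len(replicas) > target:
    victim = next(n for n in reversed(replicas) if n in live)
    replicas.remove(victim)

live_members = [n for n in replicas if n in live]
quorum = len(replicas) // 2 + 1              # 2 of 3 needed
print(replicas)                              # ['n1', 'n4', 'n5']
print(len(live_members) >= quorum)           # False: quorum lost
```

Downreplicating to the clamped target while only live replicas are removable leaves the range with a single live member out of three, which is exactly the unavailability described in the bullet list.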

@github-actions

github-actions bot commented Sep 7, 2023

We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Sep 18, 2023
5 participants