
Ranges configured with num_replicas=5 reported as underreplicated in a cluster with 4 live nodes but not in a cluster with 3 live nodes #52528

Closed
a-robinson opened this issue Aug 8, 2020 · 9 comments
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. no-issue-activity O-community Originated from the community T-kv KV Team X-stale

Comments

@a-robinson
Contributor

a-robinson commented Aug 8, 2020

Describe the problem

See issue title

To Reproduce

  1. Set up a cluster with at least 5 nodes
  2. Configure the default replication zone with num_replicas=5 (ALTER RANGE default CONFIGURE ZONE USING num_replicas = 5)
  3. Create a table
  4. Kill a node. Observe that the ranges_underreplicated metric value increases
  5. Kill a second node. Observe that the ranges_underreplicated metric drops back down to 0.

Note that the nodes might have to be decommissioned rather than killed; I didn't try reproducing this myself, so I'm not sure which it was when I saw this, or whether it makes a difference either way.

Expected behavior

For the ranges to still be reported as underreplicated.

Environment:

  • CockroachDB version: v20.1.3

Additional context

I get that there was a special case put in place to avoid complaining about system ranges being underreplicated in a 3-node cluster, but I don't think our intent was ever for it to apply to other ranges in the cluster, or in a cluster that had previously had more than 3 live nodes in it.
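For illustration, here is a hypothetical Python sketch of the kind of clamping that would produce the reported behavior. This is not the actual CockroachDB source, and the function and parameter names are invented: the idea is that the "needed" replica count gets capped at the number of available nodes and rounded down to an odd number, so with num_replicas = 5 and 4 nodes the effective target becomes 3.

```python
# Hypothetical sketch of adaptive replication-factor clamping; names are
# invented for illustration and are not taken from the CockroachDB source.
def needed_voters(configured: int, available_nodes: int) -> int:
    """Cap the configured replica count at the node count, rounded down to odd."""
    need = min(configured, available_nodes)
    if need % 2 == 0:  # an even replica count adds no fault tolerance
        need -= 1
    return max(need, 1)

def is_under_replicated(configured: int, available_nodes: int,
                        live_replicas: int) -> bool:
    """A range counts as under-replicated only against the clamped target."""
    return live_replicas < needed_voters(configured, available_nodes)

# With num_replicas = 5 in a 4-node cluster the target is clamped to 3,
# so a range with only 3 live replicas is not counted as under-replicated.
print(needed_voters(5, 4))           # 3
print(is_under_replicated(5, 4, 3))  # False
print(is_under_replicated(5, 5, 4))  # True
```

Under this (assumed) logic, the metric tracks the clamped target rather than the zone config the user actually wrote, which matches the behavior described above.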

Jira issue: CRDB-3941

@blathers-crl

blathers-crl bot commented Aug 8, 2020

Hello, I am Blathers. I am here to help you get the issue triaged.

Hoot - a bug! Though bugs are the bane of my existence, rest assured the wretched thing will get the best of care here.

I have CC'd a few people who may be able to assist you:

  • @florence-crl (member of the technical support engineering team)

If we have not gotten back to your issue within a few business days, you can try the following:

  • Join our community slack channel and ask on #cockroachdb.
  • Try to find someone from here if you know they worked closely on the area and CC them.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@blathers-crl blathers-crl bot added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-community Originated from the community X-blathers-oncall labels Aug 8, 2020
@ymxdgyx
Contributor

ymxdgyx commented Aug 8, 2020

I cannot reproduce this issue.

@a-robinson
Contributor Author

I haven't tried reproducing this specific set of steps, but FWIW, today all I had to do to get ranges erroneously not counted as underreplicated was: create a 1-node cluster using v20.1.3, create a couple of tables, configure the default replication zone to require 5 replicas, and scale the cluster up to 4 nodes. There were 34 ranges reported as underreplicated when the cluster had only the one node, but once it got up to 4 nodes, each node reported 0 underreplicated ranges.

@ymxdgyx
Contributor

ymxdgyx commented Aug 12, 2020

You are right; I will fix this issue.

@ymxdgyx
Contributor

ymxdgyx commented Aug 12, 2020

The under-replicated metric does not represent the relationship between the number of replicas the user configured (num_replicas=5) and the actual number of replicas.
I plan to add an under-replicated-for-config metric to represent that.
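The distinction being proposed here can be sketched as follows. This is purely illustrative, not CockroachDB source; "under_replicated_for_config" follows the name proposed in this comment, and everything else is invented:

```python
# Illustrative only; not CockroachDB code.
def under_replicated(live_replicas: int, clamped_target: int) -> bool:
    # Today's behavior: measured against the adaptively clamped target.
    return live_replicas < clamped_target

def under_replicated_for_config(live_replicas: int, num_replicas: int) -> bool:
    # Proposed: measured against the zone config's num_replicas as written.
    return live_replicas < num_replicas

# num_replicas = 5, target clamped to 3 (4-node cluster), 3 live replicas:
print(under_replicated(3, 3))             # False
print(under_replicated_for_config(3, 5))  # True
```

The same range would report healthy on the existing metric while still being flagged by the config-relative one.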

@tbg
Member

tbg commented Aug 18, 2020

In hindsight, we should never have introduced this unexpected behavior. I also think we added it just to be able to run a five-replica default for the system ranges, but there would have been more targeted ways to get that behavior without an unfortunate UX across the board.
That said, I don't think the right fix is to slap another metric in there and to call it a day. My intuition is that we should phase out the adaptive zone config behavior.

@ymxdgyx
Contributor

ymxdgyx commented Aug 19, 2020

Can you explain in more detail what "phase out the adaptive zone config behavior" means? @tbg

@tbg
Member

tbg commented Sep 24, 2020

We discovered another footgun with the adaptive replication factor in #54444. Short example:

  • 5x replicated
  • stop all nodes, restart only nodes 1-3
  • adaptive replication factor changes replication factor to 3
  • allocator will try to downreplicate, but due to technical restrictions isn't able to remove n4 or n5 (since the store desc isn't gossiped any more; we lost that due to the full-cluster restart)
  • allocator will remove, say, n3 (leaving n1 n2 n4 n5), still ok
  • allocator will remove, say, n2
  • loss of quorum: of n1 n4 n5, only one node remains live.

I believe we have some protections that try to stop the allocator from downreplicating into unavailability, but they are based on liveness and gossiped information, so they don't kick in until some time after the cluster restart.

The adaptive replication factor basically triggers this potential problem automatically in this scenario. Granted, it requires a pretty unlikely sequence of events (the full-cluster restart and tight timing of everything), but it's disconcerting, and further evidence that the adaptive replication factor was a bad idea.
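The sequence above can be replayed as a toy simulation. This is purely illustrative Python, not CockroachDB code; the node names and the "only live stores are removable" restriction come from the comment, everything else is invented:

```python
# Toy replay of the downreplication footgun described above. Not CockroachDB code.
replicas = ["n1", "n2", "n3", "n4", "n5"]   # 5x-replicated range
live = {"n1", "n2", "n3"}                   # only nodes 1-3 were restarted

# Adaptive factor: min(configured 5, 3 live nodes) -> target of 3 replicas.
target = 3

# The allocator cannot remove n4/n5 (their store descriptors are no longer
# gossiped after the full-cluster restart), so it removes live replicas instead.
while len(replicas) > target:
    victim = next(n for n in reversed(replicas) if n in live)
    replicas.remove(victim)

live_members = [n for n in replicas if n in live]
quorum = len(replicas) // 2 + 1              # 2 of 3 needed
print(replicas)                              # ['n1', 'n4', 'n5']
print(len(live_members) >= quorum)           # False: quorum lost
```

Downreplicating to the clamped target while only live replicas are removable leaves the range with a single live member out of three, which is exactly the unavailability described in the bullet list.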

@github-actions

github-actions bot commented Sep 7, 2023

We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Sep 18, 2023
5 participants