
kv,server: nodes started with different default replication factors report inconsistent default zone configs in RPCs #39382

Open
knz opened this issue Aug 6, 2019 · 6 comments
Labels
A-kv-server Relating to the KV-level RPC server C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-server-and-security DB Server & Security

Comments

knz (Contributor) commented Aug 6, 2019

Found while investigating #39379: it's possible to start two or more nodes that report different default zone configs through the admin RPC.

Specifically, `(s *adminServer) TableDetails()` and `(s *adminServer) DatabaseDetails()`.

To see this in action:

```
$ cockroach start-single-node
$ cockroach start --join=:26257 --port=26258 --http-port=8081
```

Then if I create a table/range on n1, the table details report replication factor 1, and if I create one on n2 they report replication factor 3 (the sketch below shows one way to observe this).
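
One way to observe the discrepancy, as a sketch: ask each node's admin API for the details of the same table and compare. The database/table names (`defaultdb`, `t`) are placeholders, and the exact response shape may vary by version:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Query the admin API's TableDetails route on both nodes' HTTP ports
	// and compare the reported zone config.
	for _, port := range []string{"8080", "8081"} {
		url := "http://localhost:" + port + "/_admin/v1/databases/defaultdb/tables/t"
		resp, err := http.Get(url)
		if err != nil {
			fmt.Println(port, err)
			continue
		}
		body, _ := io.ReadAll(resp.Body)
		resp.Body.Close()
		// The two nodes should agree; per this issue, they may not.
		fmt.Printf("node on :%s reports: %s\n", port, body)
	}
}
```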

That's no good!

The correct behavior is to scan system.zones and grab the default zone config from there, instead of grabbing it from the in-memory server Config object.
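
For illustration, here is a minimal sketch of reading the persisted default directly, assuming a local insecure cluster; decoding the ZoneConfig proto stored in the `config` column is elided:

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // CockroachDB speaks the PostgreSQL wire protocol
)

func main() {
	// Connection URL is illustrative; adjust user/host/certs as needed.
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Zone ID 0 holds the persisted default zone config in system.zones.
	// Reading it here yields the same answer on every node, unlike the
	// per-node, in-memory server Config object.
	var id int
	var raw []byte
	if err := db.QueryRow(
		"SELECT id, config FROM system.zones WHERE id = 0",
	).Scan(&id, &raw); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("default zone config row: id=%d, %d bytes of serialized proto\n", id, len(raw))
}
```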

Jira issue: CRDB-5580

@knz knz added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-kv-server Relating to the KV-level RPC server labels Aug 6, 2019
knz (Contributor, Author) commented Aug 6, 2019

cc @tbg

@jeffrey-xiao can you have a look? Something tells me you looked at this most recently.

knz (Contributor, Author) commented Aug 7, 2019

Maybe it's a bit more than that though. It appears that the in-memory zone config is also used in `(s *Store) systemGossipUpdate()` to update replicas without a zone config:

  newStoreReplicaVisitor(s).Visit(func(repl *Replica) bool {
    key := repl.Desc().StartKey
    zone, err := sysCfg.GetZoneConfigForKey(key)
    if err != nil {
      if log.V(1) {
        log.Infof(context.TODO(), "failed to get zone config for key %s", key)
      }
      // Fall back to the node-local, in-memory default: this is the
      // value that can differ from node to node.
      zone = s.cfg.DefaultZoneConfig
    }
    repl.SetZoneConfig(zone)
    return true // continue visiting the remaining replicas
  })

and also in `newReplica()`.

@tbg: can you explain what the rationale was for using the global-scope, in-memory zone config as input to various things, as opposed to storing it in the system config range and fetching it from there on demand?

A side question: does the current design in turn mandate that the default config be consistent across all nodes? What bad things would happen if it is not?

knz (Contributor, Author) commented Aug 7, 2019

discussed offline: cfg.DefaultZoneConfig is used as a fallback in various places when the lookup in the gossiped zone information (presumably coming from persisted data in system.zones) fails. The lookup logic should (if everything is implemented right) reach zone ID 0 (systemRangeID) in the fallback path, and that entry is stored. So the cfg.DefaultZoneConfig fallback should only be used rarely, if at all.

The two adminServer methods identified above do not go through the regular lookup logic, so they need to be corrected.

It remains unclear (to me) what happens in other places when the cfg vars are updated. I haven't yet found a common implementation of this fallback logic; different places seem to access zone configs and fall back in their own diverse ways. It is not clear (to me) that all the variants of the fallback logic go through zone ID 0 and look it up from KV. A minimal sketch of the intended chain follows.
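
To make the intent concrete, here is a minimal sketch of what a single, shared fallback chain could look like. The types and helpers are illustrative stand-ins, not CockroachDB's actual API; the point is that the last resort is the stored root zone (ID 0), never a per-node in-memory value:

```go
package main

import (
	"errors"
	"fmt"
)

// ZoneConfig and SystemConfig are illustrative stand-ins for the real types.
type ZoneConfig struct{ NumReplicas int }

type SystemConfig struct {
	// zones maps zone ID -> config, as gossiped from system.zones.
	// ID 0 holds the persisted cluster-wide default.
	zones map[int]ZoneConfig
}

func (c *SystemConfig) zoneForID(id int) (ZoneConfig, bool) {
	z, ok := c.zones[id]
	return z, ok
}

// zoneForObject resolves the zone for an object (table/range), falling back
// to the *stored* default under zone ID 0 rather than to a node-local
// cfg.DefaultZoneConfig, so every node resolves the same answer.
func zoneForObject(c *SystemConfig, objectID int) (ZoneConfig, error) {
	if z, ok := c.zoneForID(objectID); ok {
		return z, nil // the object has its own zone entry
	}
	if z, ok := c.zoneForID(0); ok {
		return z, nil // persisted default from system.zones
	}
	return ZoneConfig{}, errors.New("no zone config available yet")
}

func main() {
	cfg := &SystemConfig{zones: map[int]ZoneConfig{0: {NumReplicas: 3}}}
	z, err := zoneForObject(cfg, 53) // table 53 has no explicit zone
	fmt.Println(z, err)              // {3} <nil>: the stored default wins
}
```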

knz (Contributor, Author) commented Aug 7, 2019

I'll side-step the issue (and prevent it from becoming a real problem) by using the new SQL syntax from #39404 to initialize the replication factor via SQL, without changing the default zone config. However, the current (mis)design will remain, and this issue will remain relevant.
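
For reference, the SQL route looks roughly like this; a sketch, since the exact statement form that #39404 introduces may differ:

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq"
)

func main() {
	// Connection URL is illustrative; adjust for your cluster.
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Persist the replication factor through a zone config change recorded
	// in system.zones, instead of mutating the in-memory default zone config.
	if _, err := db.Exec(
		"ALTER RANGE default CONFIGURE ZONE USING num_replicas = 1",
	); err != nil {
		log.Fatal(err)
	}
}
```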

tbg (Member) commented Aug 8, 2019

> @tbg: can you explain what the rationale was for using the global-scope, in-memory zone config as input to various things, as opposed to storing it in the system config range and fetching it from there on demand?

I think it really all just boils down to poor design (usually when I say that, I was involved, but I don't think I was this time, so I'm not too sure). It was possibly a mistake to have an in-memory fallback instead of "just" handling the case in which no zone information is available yet wherever zones are involved. I don't think you need to know which zone you're in to start serving commands, so I don't think starting the cluster needs the fallback.

knz added a commit to knz/cockroach that referenced this issue Aug 9, 2019
As established in cockroachdb#39382, it is not safe to modify the global
default zone config objects to non-standard values.

This commit changes `start-single-node` and `demo` to use
zone configs via SQL instead.

Also, as requested by @piyush-singh, it now informs the user that
replication was disabled, for example:

```
*
* INFO: Replication was disabled for this cluster.
* When/if adding nodes in the future, update zone configurations to increase the replication factor.
*
```

Release note: None
knz added a commit to knz/cockroach that referenced this issue Aug 10, 2019 (same commit message as above)
knz added a commit to knz/cockroach that referenced this issue Aug 13, 2019 (same commit message as above)
craig bot pushed a commit that referenced this issue Aug 13, 2019
39492: cli: use zone configs to disable replication r=knz a=knz

Fixes #39379.

(same commit message as above)

Co-authored-by: Raphael 'kena' Poss <[email protected]>
@knz knz added C-cleanup Tech debt, refactors, loose ends, etc. Solution not expected to significantly change behavior. and removed C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. labels Sep 9, 2020
@knz knz changed the title server: the admin RPC is confused about the default zone config server: nodes started with different default replication factors report inconsistent default zone configs in RPCs May 28, 2021
@knz knz added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. and removed C-cleanup Tech debt, refactors, loose ends, etc. Solution not expected to significantly change behavior. labels May 28, 2021
knz (Contributor, Author) commented May 28, 2021

@irfansharif if you're trying to make something that works well with multitenant, you may want to take this issue into account as well.

@knz knz changed the title server: nodes started with different default replication factors report inconsistent default zone configs in RPCs kv,server: nodes started with different default replication factors report inconsistent default zone configs in RPCs May 28, 2021
@jlinder jlinder added the T-server-and-security DB Server & Security label Jun 16, 2021