
kv,server: nodes started with different default replication factors report inconsistent default zone configs in RPCs #39382

Open
knz opened this issue Aug 6, 2019 · 6 comments
Labels
A-kv-server Relating to the KV-level RPC server C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-server-and-security DB Server & Security

Comments

knz (Contributor) commented Aug 6, 2019

Found while investigating #39379: it's possible to start two or more nodes that report different default zone configs through the admin RPC.

Specifically, `(s *adminServer) TableDetails()` and `(s *adminServer) DatabaseDetails()`.

To see this in action:

```
$ cockroach start-single-node
$ cockroach start --join=:26257 --port=26258 --http-port=8081
```

Then if I create a table/range on n1, the table details report replication factor 1, and if I create one on n2 they report replication factor 3 (the sketch below shows one way to observe this).
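
One way to observe the discrepancy, as a sketch: ask each node's admin API for the details of the same table and compare. The database/table names (`defaultdb`, `t`) are placeholders, and the exact response shape may vary by version:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Query the admin API's TableDetails route on both nodes' HTTP ports
	// and compare the reported zone config.
	for _, port := range []string{"8080", "8081"} {
		url := "http://localhost:" + port + "/_admin/v1/databases/defaultdb/tables/t"
		resp, err := http.Get(url)
		if err != nil {
			fmt.Println(port, err)
			continue
		}
		body, _ := io.ReadAll(resp.Body)
		resp.Body.Close()
		// The two nodes should agree; per this issue, they may not.
		fmt.Printf("node on :%s reports: %s\n", port, body)
	}
}
```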

That's no good!

The correct behavior is to scan system.zones and grab the default zone config from there, instead of grabbing it from the in-memory server Config object.
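
For illustration, here is a minimal sketch of reading the persisted default directly, assuming a local insecure cluster; decoding the ZoneConfig proto stored in the `config` column is elided:

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // CockroachDB speaks the PostgreSQL wire protocol
)

func main() {
	// Connection URL is illustrative; adjust user/host/certs as needed.
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Zone ID 0 holds the persisted default zone config in system.zones.
	// Reading it here yields the same answer on every node, unlike the
	// per-node, in-memory server Config object.
	var id int
	var raw []byte
	if err := db.QueryRow(
		"SELECT id, config FROM system.zones WHERE id = 0",
	).Scan(&id, &raw); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("default zone config row: id=%d, %d bytes of serialized proto\n", id, len(raw))
}
```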

Jira issue: CRDB-5580

@knz knz added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-kv-server Relating to the KV-level RPC server labels Aug 6, 2019
knz (Contributor, Author) commented Aug 6, 2019

cc @tbg

@jeffrey-xiao can you have a look? Something tells me you looked at this most recently.

knz (Contributor, Author) commented Aug 7, 2019

Maybe it's a bit more than that though. It appears that the in-memory zone config is also used in `(s *Store) systemGossipUpdate()` to update replicas without a zone config:

  newStoreReplicaVisitor(s).Visit(func(repl *Replica) bool {
    key := repl.Desc().StartKey
    zone, err := sysCfg.GetZoneConfigForKey(key)
    if err != nil {
      if log.V(1) {
        log.Infof(context.TODO(), "failed to get zone config for key %s", key)
      }
      // Fall back to the node-local, in-memory default: this is the
      // value that can differ from node to node.
      zone = s.cfg.DefaultZoneConfig
    }
    repl.SetZoneConfig(zone)
    return true // continue visiting the remaining replicas
  })

and also in `newReplica()`.

@tbg: can you explain what the rationale was for using the global-scope, in-memory zone config as input to various things, as opposed to storing it in the system config range and fetching it from there on demand?

A side question: does the current design in turn mandate that the default config be consistent across all nodes? What bad things would happen if it is not?

knz (Contributor, Author) commented Aug 7, 2019

discussed offline: cfg.DefaultZoneConfig is used as a fallback in various places when the lookup in the gossiped zone information (presumably coming from persisted data in system.zones) fails. The lookup logic should (if everything is implemented right) reach zone ID 0 (systemRangeID) in the fallback path, and that entry is stored. So the cfg.DefaultZoneConfig fallback should only be used rarely, if at all.

The two adminServer methods identified above do not go through the regular lookup logic, so they need to be corrected.

It remains unclear (to me) what happens in other places when the cfg vars are updated. I haven't yet found a common implementation of this fallback logic; different places seem to access zone configs and fall back in their own diverse ways. It is not clear (to me) that all the variants of the fallback logic go through zone ID 0 and look it up from KV. A minimal sketch of the intended chain follows.
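
To make the intent concrete, here is a minimal sketch of what a single, shared fallback chain could look like. The types and helpers are illustrative stand-ins, not CockroachDB's actual API; the point is that the last resort is the stored root zone (ID 0), never a per-node in-memory value:

```go
package main

import (
	"errors"
	"fmt"
)

// ZoneConfig and SystemConfig are illustrative stand-ins for the real types.
type ZoneConfig struct{ NumReplicas int }

type SystemConfig struct {
	// zones maps zone ID -> config, as gossiped from system.zones.
	// ID 0 holds the persisted cluster-wide default.
	zones map[int]ZoneConfig
}

func (c *SystemConfig) zoneForID(id int) (ZoneConfig, bool) {
	z, ok := c.zones[id]
	return z, ok
}

// zoneForObject resolves the zone for an object (table/range), falling back
// to the *stored* default under zone ID 0 rather than to a node-local
// cfg.DefaultZoneConfig, so every node resolves the same answer.
func zoneForObject(c *SystemConfig, objectID int) (ZoneConfig, error) {
	if z, ok := c.zoneForID(objectID); ok {
		return z, nil // the object has its own zone entry
	}
	if z, ok := c.zoneForID(0); ok {
		return z, nil // persisted default from system.zones
	}
	return ZoneConfig{}, errors.New("no zone config available yet")
}

func main() {
	cfg := &SystemConfig{zones: map[int]ZoneConfig{0: {NumReplicas: 3}}}
	z, err := zoneForObject(cfg, 53) // table 53 has no explicit zone
	fmt.Println(z, err)              // {3} <nil>: the stored default wins
}
```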

knz (Contributor, Author) commented Aug 7, 2019

I'll side-step the issue (and prevent it from becoming a real problem) by using the new SQL syntax from #39404 to initialize the replication factor via SQL, without changing the default zone config. However, the current (mis)design will remain, and this issue will remain relevant.
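
For reference, the SQL route looks roughly like this; a sketch, since the exact statement form that #39404 introduces may differ:

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq"
)

func main() {
	// Connection URL is illustrative; adjust for your cluster.
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Persist the replication factor through a zone config change recorded
	// in system.zones, instead of mutating the in-memory default zone config.
	if _, err := db.Exec(
		"ALTER RANGE default CONFIGURE ZONE USING num_replicas = 1",
	); err != nil {
		log.Fatal(err)
	}
}
```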

tbg (Member) commented Aug 8, 2019

> @tbg: can you explain what the rationale was for using the global-scope, in-memory zone config as input to various things, as opposed to storing it in the system config range and fetching it from there on demand?

I think it really all just boils down to poor design (usually when I say that, I was involved, but I don't think I was this time, so I'm not too sure). It was possibly a mistake to have an in-memory fallback instead of "just" handling the case in which no zone information is available yet wherever zones are involved. I don't think you need to know which zone you're in to start serving commands, so I don't think starting the cluster needs the fallback.

knz added a commit to knz/cockroach that referenced this issue Aug 9, 2019
As established in cockroachdb#39382, it is not safe to modify the global
default zone config objects to non-standard values.

This commit changes `start-single-node` and `demo` to use
zone configs via SQL instead.

Also, as requested by @piyush-singh, it now informs the user that
replication was disabled, for example:

```
*
* INFO: Replication was disabled for this cluster.
* When/if adding nodes in the future, update zone configurations to increase the replication factor.
*
```

Release note: None
knz added a commit to knz/cockroach that referenced this issue Aug 10, 2019 (same commit message as above)
knz added a commit to knz/cockroach that referenced this issue Aug 13, 2019 (same commit message as above)
craig bot pushed a commit that referenced this issue Aug 13, 2019
39492: cli: use zone configs to disable replication r=knz a=knz

Fixes #39379.

(same commit message as above)

Co-authored-by: Raphael 'kena' Poss <[email protected]>
@knz knz added C-cleanup Tech debt, refactors, loose ends, etc. Solution not expected to significantly change behavior. and removed C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. labels Sep 9, 2020
@knz knz changed the title server: the admin RPC is confused about the default zone config server: nodes started with different default replication factors report inconsistent default zone configs in RPCs May 28, 2021
@knz knz added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. and removed C-cleanup Tech debt, refactors, loose ends, etc. Solution not expected to significantly change behavior. labels May 28, 2021
knz (Contributor, Author) commented May 28, 2021

@irfansharif if you're trying to make something that works well with multitenant, you may want to take this issue into account as well.

@knz knz changed the title server: nodes started with different default replication factors report inconsistent default zone configs in RPCs kv,server: nodes started with different default replication factors report inconsistent default zone configs in RPCs May 28, 2021
@jlinder jlinder added the T-server-and-security DB Server & Security label Jun 16, 2021