webui: update to handle removal of decommissioned nodes #61812

Closed
erikgrinaker opened this issue Mar 10, 2021 · 8 comments
Labels
A-kv-observability · A-webui · C-enhancement

Comments

erikgrinaker (Contributor) commented Mar 10, 2021

In #56529 we remove a node's status entry once it's decommissioned. This means it no longer shows up in either "Recently Decommissioned Nodes" or "Decommissioned Node History" in the UI. We may want to either remove these views, or change them to use ephemeral info from gossip liveness (e.g. crdb_internal.gossip_liveness) rather than the status entry (e.g. crdb_internal.kv_node_status).
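
A minimal sketch of what such a view could query instead, assuming the membership and updated_at columns exposed by crdb_internal.gossip_liveness (column names may vary by version):

```sql
-- Hypothetical data source for "Recently Decommissioned Nodes":
-- list nodes whose gossiped membership has left 'active'.
SELECT node_id, membership, draining, updated_at
FROM crdb_internal.gossip_liveness
WHERE membership IN ('decommissioning', 'decommissioned')
ORDER BY updated_at DESC;
```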

A docs update was also requested in #61808; we should coordinate any follow-up actions.

Epic CRDB-10792

erikgrinaker added the C-enhancement and A-webui labels on Mar 10, 2021
irfansharif (Contributor) commented:

Better yet, we should use the contents of node liveness directly (not the ephemeral gossip state): crdb_internal.kv_node_liveness. We should probably expose a proper API for it if the UI is going to depend on it.
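
A sketch of what reading liveness directly might look like, assuming the node_id, epoch, draining, and membership columns of crdb_internal.kv_node_liveness (a proper API would presumably wrap something equivalent):

```sql
-- Read the durable liveness records rather than gossip state.
-- The membership value set ('active', 'decommissioning',
-- 'decommissioned') is an assumption; verify against the target version.
SELECT node_id, epoch, draining, membership
FROM crdb_internal.kv_node_liveness
WHERE membership != 'active';
```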

irfansharif (Contributor) commented:

Related is #50707, which in hindsight is a very badly worded issue. Really, we should purge all uses of these status entries in favor of liveness entries.

tbg (Member) commented Dec 21, 2021

Internal Slack thread on the issue. I think the current presentation is actively dangerous. The UI will tell users that nodes actively undergoing the decommissioning process are "decommissioned", which users may interpret as those nodes being safely removable from the cluster. This could lead to loss-of-quorum scenarios, since removing enough live nodes that still house replicas (which a decommissioning node is) does exactly that. As a quick fix, we can hide this section altogether.
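
To make the hazard concrete, here is a hypothetical query (assuming crdb_internal.kv_store_status exposes a per-store range_count) that flags nodes the UI might present as "decommissioned" while they still hold replicas:

```sql
-- Nodes whose membership says they are leaving the cluster
-- but which still hold replicas; removing enough of these
-- risks quorum loss.
SELECT l.node_id, l.membership, sum(s.range_count) AS replicas
FROM crdb_internal.gossip_liveness AS l
JOIN crdb_internal.kv_store_status AS s ON s.node_id = l.node_id
WHERE l.membership != 'active'
GROUP BY l.node_id, l.membership
HAVING sum(s.range_count) > 0;
```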

zachlite (Contributor) commented Feb 8, 2022

I ran into some confusion while going through the decommissioning lifecycle. @erikgrinaker, can you advise?

I started to decommission a node, but when the node reached zero replicas, it disappeared from the Node List on the Overview page. I would have expected it to remain on the Node List with a DECOMMISSIONING label. The decommissioning node was also absent from the status report given by cockroach node status. Again, I would have expected the report to include all 4 nodes. Looking at my process manager, I could see all 4 nodes running.

Additionally, querying crdb_internal.kv_node_liveness yielded 4 nodes, with the decommissioning node's membership value set as decommissioned. Isn't a node supposed to be decommissioning until the node is stopped?

This was tested using a local 4-node cluster running 21.2.3 via roachprod.

tbg (Member) commented Feb 9, 2022

No, the decommissioning process starts out by marking the node as decommissioning, meaning replicas are moved off of it. Once zero replicas are reached, a transition of decommissioning -> decommissioned takes place. At this point, the node is hard-excluded from talking to the cluster, and the cluster considers this node to no longer exist. Whether the node is stopped or not plays no role in that. Recall that a node may be stopped throughout the entire decommissioning process (say, if it has a hardware failure).

tbg (Member) commented Feb 9, 2022

The way to interpret

meaning replicas are moved off of it

in the latter case of a down node is that the rest of the cluster realizes that this node is not coming back, and will update its replica placement to create additional replicas elsewhere.

zachlite (Contributor) commented Feb 9, 2022

Once zero replicas are reached, a transition of decommissioning -> decommissioned takes place.

Is the documentation using different semantics? It seems like a conflicting definition is given:

A node is considered to be decommissioned when it meets two criteria:

  1. The node has completed the decommissioning process.
  2. The node has been stopped and has not updated its liveness record for the duration configured via server.time_until_store_dead, which defaults to 5 minutes.
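
(As an aside: whatever the docs end up saying, the timeout named in that quote is an ordinary cluster setting, so it can be inspected or tuned directly; a quick sketch:)

```sql
-- Inspect the current store-death timeout (defaults to 5 minutes).
SHOW CLUSTER SETTING server.time_until_store_dead;

-- Tune it if needed (example value; change deliberately).
SET CLUSTER SETTING server.time_until_store_dead = '10m';
```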

erikgrinaker (Contributor, Author) commented:

Yes, that is outdated. I opened a docs issue to fix it way back when we changed this, but it hasn't been done yet:

cockroachdb/docs#9968

I'll ping the docs team about it.
