webui: update to handle removal of decommissioned nodes #61812

Closed
erikgrinaker opened this issue Mar 10, 2021 · 8 comments
Labels
A-kv-observability · A-webui · C-enhancement

Comments

erikgrinaker (Contributor) commented Mar 10, 2021

In #56529 we remove a node's status entry once it's decommissioned. This means it no longer shows up in either "Recently Decommissioned Nodes" or "Decommissioned Node History" in the UI. We may want to either remove these views, or change them to use ephemeral info from gossip liveness (e.g. crdb_internal.gossip_liveness) rather than the status entry (e.g. crdb_internal.kv_node_status).
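
A minimal sketch of what such a view could query instead, assuming the membership and updated_at columns exposed by crdb_internal.gossip_liveness (column names may vary by version):

```sql
-- Hypothetical data source for "Recently Decommissioned Nodes":
-- list nodes whose gossiped membership has left 'active'.
SELECT node_id, membership, draining, updated_at
FROM crdb_internal.gossip_liveness
WHERE membership IN ('decommissioning', 'decommissioned')
ORDER BY updated_at DESC;
```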

A docs update was also requested in #61808; we should coordinate any follow-up actions.

Epic CRDB-10792

erikgrinaker added the C-enhancement and A-webui labels on Mar 10, 2021
irfansharif (Contributor) commented:

Better yet, we should use the contents of node liveness directly (not the ephemeral gossip state): crdb_internal.kv_node_liveness. We should probably expose a proper API for it if the UI is going to depend on it.
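
A sketch of what reading liveness directly might look like, assuming the node_id, epoch, draining, and membership columns of crdb_internal.kv_node_liveness (a proper API would presumably wrap something equivalent):

```sql
-- Read the durable liveness records rather than gossip state.
-- The membership value set ('active', 'decommissioning',
-- 'decommissioned') is an assumption; verify against the target version.
SELECT node_id, epoch, draining, membership
FROM crdb_internal.kv_node_liveness
WHERE membership != 'active';
```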

irfansharif (Contributor) commented:

Related is #50707, which in hindsight is a very badly worded issue. Really, we should purge all uses of these status entries in favor of liveness entries.

tbg (Member) commented Dec 21, 2021

Internal Slack thread on the issue. I think the current presentation is actively dangerous. The UI will tell users that nodes actively undergoing the decommissioning process are "decommissioned", which users may interpret as those nodes being safely removable from the cluster. This could lead to loss-of-quorum scenarios, since removing enough live nodes that still house replicas (which a decommissioning node is) does exactly that. As a quick fix, we can hide this section altogether.
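
To make the hazard concrete, here is a hypothetical query (assuming crdb_internal.kv_store_status exposes a per-store range_count) that flags nodes the UI might present as "decommissioned" while they still hold replicas:

```sql
-- Nodes whose membership says they are leaving the cluster
-- but which still hold replicas; removing enough of these
-- risks quorum loss.
SELECT l.node_id, l.membership, sum(s.range_count) AS replicas
FROM crdb_internal.gossip_liveness AS l
JOIN crdb_internal.kv_store_status AS s ON s.node_id = l.node_id
WHERE l.membership != 'active'
GROUP BY l.node_id, l.membership
HAVING sum(s.range_count) > 0;
```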

zachlite (Contributor) commented Feb 8, 2022

I ran into some confusion while going through the decommissioning lifecycle. @erikgrinaker, can you advise?

I started to decommission a node, but when the node reached zero replicas, it disappeared from the Node List on the Overview page. I would have expected it to remain on the Node List with a DECOMMISSIONING label. The decommissioning node was also absent from the status report given by cockroach node status. Again, I would have expected the report to include all 4 nodes. Looking at my process manager, I could see all 4 nodes running.

Additionally, querying crdb_internal.kv_node_liveness yielded 4 nodes, with the decommissioning node's membership value set as decommissioned. Isn't a node supposed to be decommissioning until the node is stopped?

This was tested using a local 4-node cluster running 21.2.3 via roachprod.

tbg (Member) commented Feb 9, 2022

No, the decommissioning process starts out by marking the node as decommissioning, meaning replicas are moved off of it. Once zero replicas are reached, a transition of decommissioning -> decommissioned takes place. At this point, the node is hard-excluded from talking to the cluster, and the cluster considers this node to no longer exist. Whether the node is stopped or not plays no role in that. Recall that a node may be stopped throughout the entire decommissioning process (say, if it has a hardware failure).

tbg (Member) commented Feb 9, 2022

The way to interpret

meaning replicas are moved off of it

in the latter case of a down node is that the rest of the cluster realizes that this node is not coming back, and will update its replica placement to create additional replicas elsewhere.

zachlite (Contributor) commented Feb 9, 2022

Once zero replicas are reached, a transition of decommissioning -> decommissioned takes place.

Is the documentation using different semantics? It seems like a conflicting definition is given:

A node is considered to be decommissioned when it meets two criteria:

  1. The node has completed the decommissioning process.
  2. The node has been stopped and has not updated its liveness record for the duration configured via server.time_until_store_dead, which defaults to 5 minutes.
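
(As an aside: whatever the docs end up saying, the timeout named in that quote is an ordinary cluster setting, so it can be inspected or tuned directly; a quick sketch:)

```sql
-- Inspect the current store-death timeout (defaults to 5 minutes).
SHOW CLUSTER SETTING server.time_until_store_dead;

-- Tune it if needed (example value; change deliberately).
SET CLUSTER SETTING server.time_until_store_dead = '10m';
```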

erikgrinaker (Contributor, Author) commented:

Yes, that is outdated. I opened a docs issue to fix it way back when we changed this, but it hasn't been done yet:

cockroachdb/docs#9968

I'll ping the docs team about it.
