cli: let commands return partial information when run against unavailable/broken clusters #16489

dianasaur323 · 2017-06-13T15:11:11Z

Users expect to be able to get some information after running cockroach node status, even in the case of partitioning or loss of quorum.

It would be helpful to reveal some information from the local node (basically a warning that the node is no longer able to communicate with the rest of the cluster).

davibo-oc · 2017-06-13T15:19:46Z

An example: Cassandra gives status information in the following form:


Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens       Owns    Host ID                               Rack
UN  127.0.0.1  4.08 GiB   1            ?       5e5167f6-94cd-4170-b295-262699afae16  rack1
DN  127.0.0.2  3.17 GiB   1            ?       025d4078-1ede-49ef-a0ae-6a3f5bd497d7  rack1
DN  127.0.0.3  3.98 GiB   1            ?       a54e0391-fa16-4025-acc2-f4bb13214527  rack1

Here, U and D indicate whether a node is reachable.

The current CockroachDB behaviour when there is a serious problem and quorum is lost is to hang indefinitely on `cockroach node status`, which is a bit unexpected.

rjnn · 2017-06-20T15:24:08Z

Thanks @davibo-oc for the context. That's very helpful, and appreciated!

shawnrichards · 2017-07-16T17:42:20Z

Hope you guys are going to give this the focus it deserves. Admin functions that fail to respond along with everything else can't be used in some of the crisis situations for which they were designed.

dianasaur323 · 2017-07-17T14:49:57Z

@shawnrichards thanks for the additional nudge here. Unfortunately, this isn't going to make it into our 1.1 release, but it is slotted for investigation in the 1.2 release.

shawnrichards · 2017-07-17T15:53:45Z

@dianasaur323 As long as it is on you guys' radar, that is perfect.

dianasaur323 · 2017-09-22T12:54:41Z

Draft Acceptance Criteria -- Accepting Comments

Rationale
Currently, we offer a confusing UX when quorum is lost. Users cannot access the admin UI, and also cannot check node status.

Feature Scope

Offer a solution similar to Cassandra: show the liveness of other nodes in the cluster from the perspective of the queried node. This could simply be an additional column that shows whether or not a given node can talk to another node. Leaving details of implementation to the engineering team.
This new column should be part of the generic `cockroach node status` command (not the `cockroach node status --all` command).

PM Acceptance Testing

Force a cluster to lose quorum
Run cockroach node status
Force a network partition
Run cockroach node status

dianasaur323 · 2017-10-17T15:50:07Z

@cuongdo I don't think @mrtracy is the right person to put on this issue anymore given how much is already on his plate. Can we triage to someone else?

tbg · 2017-11-27T18:39:57Z

Also: time out after trying for a while.

dianasaur323 · 2018-03-03T13:40:50Z

Thanks @tschottdorf! I've allocated time in our roadmap to close this out in 2.1, so modifying the milestone sounds good to me.

2nishantg · 2018-05-31T08:59:40Z

@tschottdorf If no one is working on this, I can pick this up.

tbg · 2018-05-31T20:56:12Z

@Nishant9 I think we're not quite at the point here where we know what exactly we want, so if you took this on you'd spend a lot of time stuck in limbo. However, there's another issue I could use your help with if you're interested.

tbg · 2018-06-05T12:18:09Z

@nstewart who's in charge of this now on the PM side? We're basically ready to change how node status works, but this entails changing the returned fields somewhat. Before we just go and do that I wanted to check in that that's still what we want to do, and that everyone is clear that this requires a docs update, and changes what is returned to users, possibly breaking any homegrown scripts they use to interpret the output.

Note to self: #20403 has some WIP.

nstewart · 2018-06-05T12:59:47Z

@piyush-singh is tackling this. @piyush-singh can you follow up here?

piyush-singh · 2018-07-17T18:04:45Z

Spoke to @tschottdorf, and we'll prioritize this as part of the CLI team's upcoming 2-day bugfixing cycle.

Expand the `crdb_internal.gossip_liveness` and `crdb_internal.gossip_nodes` tables to include columns needed to satisfy the basic usage of `node status`. Specifically, added `address`, `build`, `started_at`, `updated_at` and `replicas` columns.

Changed `node status` to use `gossip_{liveness,nodes}` instead of `kv_node_status`. The latter table requires the range containing the consistent node status descriptors to be available, while `gossip_{liveness,nodes}` only retrieves info from gossip.

`node status` and `node status --decommission` will work on unavailable/broken clusters as long as the node they are pointed to is up. `node status {--stats,--ranges,--all}` continue to require a reasonably healthy cluster.

Fixes cockroachdb#16489

Release note (cli change): Enhance `node status` to work on unavailable/broken clusters.

28249: cli: allow `node status` to work in unavailable/broken clusters r=bdarnell a=petermattis

Expand the `crdb_internal.gossip_liveness` table to include columns needed to satisfy the basic usage of `node status`. Specifically, added `address`, `build`, `started_at`, `updated_at` and `replicas` columns.

Changed `node status` to use `gossip_liveness` instead of `kv_node_status`. The latter table requires the range containing the consistent node status descriptors to be available, while `gossip_liveness` only retrieves info from gossip.

`node status` and `node status --decommission` will work on unavailable/broken clusters as long as the node they are pointed to is up. `node status {--stats,--ranges,--all}` continue to require a reasonably healthy cluster.

Fixes #16489

Release note (cli change): Enhance `node status` to work on unavailable/broken clusters.

Co-authored-by: Peter Mattis <[email protected]>
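The split described in the PR message can be summarized with a small Go sketch. The `statusFlags` type and the `dataSource` function are hypothetical names invented for this illustration; only the flag names and the gossip/KV distinction come from the message above.

```go
package main

import "fmt"

// statusFlags mirrors the `node status` flags named in the PR message.
// The struct itself is a hypothetical construct for this sketch.
type statusFlags struct {
	stats, ranges, all, decommission bool
}

// requiresHealthyCluster reports whether the requested output depends on
// data that, per the PR message, still needs a reasonably healthy cluster.
func requiresHealthyCluster(f statusFlags) bool {
	return f.stats || f.ranges || f.all
}

// dataSource names where the rows would come from: gossip alone for the
// basic and --decommission forms, KV-backed status otherwise.
func dataSource(f statusFlags) string {
	if requiresHealthyCluster(f) {
		return "kv"
	}
	return "gossip"
}

func main() {
	fmt.Println(dataSource(statusFlags{}))                   // basic `node status`
	fmt.Println(dataSource(statusFlags{decommission: true})) // `node status --decommission`
	fmt.Println(dataSource(statusFlags{all: true}))          // `node status --all`
}
```

Under this reading, pointing `node status` at any node that is up should return the gossip-derived rows even when quorum is lost, while the richer flags can still fail.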

dianasaur323 added O-community Originated from the community fruit and removed O-community Originated from the community labels Jun 13, 2017

dianasaur323 modified the milestones: Later, 1.2 Jun 13, 2017

vivekmenezes assigned mrtracy Jun 15, 2017

a-robinson mentioned this issue Aug 24, 2017

stability: make debug tools useful for wedged/underreplicated clusters #17904

Closed

tbg mentioned this issue Aug 24, 2017

sql: add crdb_internal.set_vmodule, remove server.remote_debug.vmodule #17914

Merged

dianasaur323 mentioned this issue Oct 30, 2017

cli: improve ux when cluster is unavailable #19646

Closed

tbg changed the title ~~return some information about the local node in partition / loss of quorum scenarios~~ cli: let commands return partial information when run against unavailable/broken clusters Oct 30, 2017

cuongdo assigned andreimatei and unassigned mrtracy Oct 30, 2017

tbg mentioned this issue Nov 27, 2017

cockroach node status should show when a node is dead #19924

Closed

tbg mentioned this issue Nov 28, 2017

cli: add timeouts to node status and node ls #20308

Merged

tbg added this to the 2.1 milestone Mar 3, 2018

petermattis assigned petermattis and unassigned tbg Mar 28, 2018

knz added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Apr 27, 2018

tbg added the A-cli label Jun 5, 2018

petermattis mentioned this issue Aug 3, 2018

cli: allow node status to work in unavailable/broken clusters #28249

Merged

craig bot closed this as completed in #28249 Aug 10, 2018
