cli: let commands return partial information when run against unavailable/broken clusters #16489

Closed
dianasaur323 opened this issue Jun 13, 2017 · 16 comments
Labels
C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) O-community Originated from the community

Comments

@dianasaur323
Contributor

Users expect to be able to get some information after running cockroach node status, even in the case of partitioning or loss of quorum.

It would be helpful to reveal some information from the local node (basically a warning that the node is no longer able to communicate with the rest of the cluster).

@dianasaur323 dianasaur323 added O-community Originated from the community fruit and removed O-community Originated from the community labels Jun 13, 2017
@dianasaur323 dianasaur323 modified the milestones: Later, 1.2 Jun 13, 2017
@davibo-oc

An example: Cassandra reports status information in the following form:


Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens       Owns    Host ID                               Rack
UN  127.0.0.1  4.08 GiB   1            ?       5e5167f6-94cd-4170-b295-262699afae16  rack1
DN  127.0.0.2  3.17 GiB   1            ?       025d4078-1ede-49ef-a0ae-6a3f5bd497d7  rack1
DN  127.0.0.3  3.98 GiB   1            ?       a54e0391-fa16-4025-acc2-f4bb13214527  rack1

Here U and D signal whether a node is reachable.

When there is a serious problem and quorum is lost, CockroachDB currently hangs indefinitely on cockroach node status, which is a bit unexpected.
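
As an illustration of how the two-letter code in that output can be read, here is a minimal sketch that parses the sample rows above into reachable and unreachable nodes. The names `NodeRow` and `parse_rows` are hypothetical, and the parser only handles the row layout shown in this comment, not Cassandra's output in general.

```python
from typing import List, NamedTuple


class NodeRow(NamedTuple):
    address: str
    up: bool      # first letter of the code: U (up) or D (down)
    state: str    # second letter: Normal/Leaving/Joining/Moving


# Sample rows copied from the output quoted above.
SAMPLE = """\
--  Address    Load       Tokens       Owns    Host ID                               Rack
UN  127.0.0.1  4.08 GiB   1            ?       5e5167f6-94cd-4170-b295-262699afae16  rack1
DN  127.0.0.2  3.17 GiB   1            ?       025d4078-1ede-49ef-a0ae-6a3f5bd497d7  rack1
DN  127.0.0.3  3.98 GiB   1            ?       a54e0391-fa16-4025-acc2-f4bb13214527  rack1
"""

STATES = {"N": "Normal", "L": "Leaving", "J": "Joining", "M": "Moving"}


def parse_rows(text: str) -> List[NodeRow]:
    """Parse status rows; lines without a valid two-letter code are skipped."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2 or len(fields[0]) != 2:
            continue
        status, state = fields[0][0], fields[0][1]
        if status not in ("U", "D") or state not in STATES:
            continue  # e.g. the "--  Address ..." header line
        rows.append(NodeRow(fields[1], status == "U", STATES[state]))
    return rows


if __name__ == "__main__":
    for row in parse_rows(SAMPLE):
        print(row.address, "up" if row.up else "down", row.state)
```

The point of the format is that the queried node still answers and marks its peers as down, rather than blocking.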

@rjnn
Contributor

rjnn commented Jun 20, 2017

Thanks @davibo-oc for the context. That's very helpful, and appreciated!

@shawnrichards

Hope you guys are going to give this the focus it deserves. Admin functions that fail to respond along with everything else cannot be used in some of the crisis situations for which they were designed.

@dianasaur323
Contributor Author

@shawnrichards thanks for the additional nudge here. Unfortunately, this isn't going to make it into our 1.1 release, but it is slotted for investigation in the 1.2 release.

@shawnrichards

@dianasaur323 As long as it is on you guys' radar, that is perfect.

@dianasaur323
Contributor Author

Draft Acceptance Criteria -- Accepting Comments

Rationale
Currently, we offer a confusing UX when quorum is lost. Users cannot access the admin UI, and also cannot check node status.

Feature Scope

  • Offer a solution similar to Cassandra: show the liveness of other nodes in the cluster from the perspective of the queried node. This could simply be an additional column that shows whether or not a given node can talk to another node. Implementation details are left to the engineering team.
  • This new column should be part of the generic cockroach node status command (not the cockroach node status --all command).
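
To make the proposed column concrete, here is a rough sketch of a status table with one extra liveness column computed from the queried node's point of view. Everything in it is an assumption for illustration: the function names, the `is_live` column name, and the idea that the local node keeps a set of peers it can currently reach. It is not how cockroach node status is implemented.

```python
from typing import Dict, List, Set, Tuple

Row = Tuple[int, str, bool]


def liveness_rows(known_nodes: Dict[int, str], reachable: Set[int]) -> List[Row]:
    """Build (node_id, address, is_live) rows as seen by the queried node.

    known_nodes: node ID -> address, as known locally (hypothetical input).
    reachable:   node IDs the queried node can currently talk to.
    """
    return [(nid, addr, nid in reachable) for nid, addr in sorted(known_nodes.items())]


def format_table(rows: List[Row]) -> str:
    """Render the rows with the additional liveness column."""
    lines = [f"{'id':<4}{'address':<12}is_live"]
    for nid, addr, live in rows:
        lines.append(f"{nid:<4}{addr:<12}{str(live).lower()}")
    return "\n".join(lines)


if __name__ == "__main__":
    nodes = {1: "127.0.0.1", 2: "127.0.0.2", 3: "127.0.0.3"}
    # Node 1 is the queried node and can only reach itself.
    print(format_table(liveness_rows(nodes, reachable={1})))
```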

PM Acceptance Testing

  • Force a cluster to lose quorum
  • Run cockroach node status
  • Force a network partition
  • Run cockroach node status

@dianasaur323
Contributor Author

@cuongdo I don't think @mrtracy is the right person to put on this issue anymore given how much is already on his plate. Can we triage to someone else?

@tbg tbg changed the title from "return some information about the local node in partition / loss of quorum scenarios" to "cli: let commands return partial information when run against unavailable/broken clusters" Oct 30, 2017
@cuongdo cuongdo assigned andreimatei and unassigned mrtracy Oct 30, 2017
@tbg
Member

tbg commented Nov 27, 2017

Also: time out after trying for a while.
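
Until the CLI itself times out, a caller can impose a deadline from the outside. The sketch below wraps any command in `subprocess.run` with a timeout and returns `None` when the deadline passes. `run_with_deadline` is a hypothetical helper, and the demo uses a sleeping Python child as a stand-in for a hanging command.

```python
import subprocess
import sys
from typing import List, Optional


def run_with_deadline(argv: List[str], timeout_s: float) -> Optional[str]:
    """Run a command and return its stdout, or None if it exceeds timeout_s.

    subprocess.run kills the child process when the timeout expires, so a
    command that would otherwise hang forever cannot block the caller.
    """
    try:
        done = subprocess.run(
            argv, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return None
    return done.stdout


if __name__ == "__main__":
    # Stand-in for a command that hangs, e.g. a status call against a
    # cluster that has lost quorum.
    slow = [sys.executable, "-c", "import time; time.sleep(5)"]
    print(run_with_deadline(slow, timeout_s=0.2))
```

A script would pass something like `["cockroach", "node", "status"]` as `argv` and treat `None` as "cluster did not answer in time".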

@tbg tbg added this to the 2.1 milestone Mar 3, 2018
@dianasaur323
Contributor Author

Thanks @tschottdorf! I've allocated time in our roadmap to close this out in 2.1, so modifying the milestone sounds good to me.

@petermattis petermattis assigned petermattis and unassigned tbg Mar 28, 2018
@knz knz added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Apr 27, 2018
@2nishantg

@tschottdorf If no one is working on this, I can pick this up.

@tbg
Member

tbg commented May 31, 2018

@Nishant9 I think we're not quite at the point here where we know what exactly we want, so if you took this on you'd spend a lot of time stuck in limbo. However, there's another issue I could use your help with if you're interested.

@tbg
Member

tbg commented Jun 5, 2018

@nstewart who's in charge of this now on the PM side? We're basically ready to change how node status works, but this entails changing the returned fields somewhat. Before we just go and do that I wanted to check in that that's still what we want to do, and that everyone is clear that this requires a docs update, and changes what is returned to users, possibly breaking any homegrown scripts they use to interpret the output.

Note to self: #20403 has some WIP.

@tbg tbg added the A-cli label Jun 5, 2018
@nstewart
Contributor

nstewart commented Jun 5, 2018

@piyush-singh is tackling this. @piyush-singh can you follow up here?

@piyush-singh

Spoke to @tschottdorf, and we'll prioritize this as part of the CLI team's upcoming two-day bug-fixing cycle.

petermattis added a commit to petermattis/cockroach that referenced this issue Aug 3, 2018
Expand the `crdb_internal.gossip_liveness` and
`crdb_internal.gossip_nodes` tables to include columns needed to satisfy
the basic usage of `node status`. Specifically, added `address`,
`build`, `started_at`, `updated_at` and `replicas` columns. Changed
`node status` to use `gossip_{liveness,nodes}` instead of
`kv_node_status`. The latter table requires the range containing the
consistent node status descriptors to be available, while
`gossip_{liveness,nodes}` only retrieves info from gossip.

`node status` and `node status --decommission` will work on
unavailable/broken clusters as long as the node they are pointed to is
up. `node status {--stats,--ranges,--all}` continue to require a
reasonably healthy cluster.

Fixes cockroachdb#16489

Release note (cli change): Enhance `node status` to work on
unavailable/broken clusters.
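
The distinction in this commit message can be pictured with a toy model: one data source needs a majority of replicas of a range to be reachable, while the other is a local copy of gossiped information that can always be read on a running node. The classes below are hypothetical stand-ins written for this illustration, not CockroachDB code, and raising an exception stands in for what would in practice be a hang.

```python
from typing import Dict, List


class RangeUnavailable(Exception):
    """Raised by the toy model when a range has lost quorum."""


class ConsistentStatusTable:
    """Toy stand-in for a table backed by a replicated range."""

    def __init__(self, rows: List[Dict], live_replicas: int, total_replicas: int):
        self.rows = rows
        self.live_replicas = live_replicas
        self.total_replicas = total_replicas

    def read(self) -> List[Dict]:
        # A consistent read needs a strict majority of replicas.
        if self.live_replicas <= self.total_replicas // 2:
            raise RangeUnavailable("range has lost quorum")
        return list(self.rows)


class GossipView:
    """Toy stand-in for node info the local node has learned via gossip."""

    def __init__(self, rows: List[Dict]):
        self.rows = rows

    def read(self) -> List[Dict]:
        # Purely local: works as long as this node is up.
        return list(self.rows)


def basic_node_status(gossip: GossipView) -> List[Dict]:
    """Basic status only needs the gossiped view."""
    return gossip.read()


def detailed_node_status(table: ConsistentStatusTable) -> List[Dict]:
    """Detailed status still depends on the replicated table."""
    return table.read()


if __name__ == "__main__":
    rows = [{"id": 1, "address": "127.0.0.1"}, {"id": 2, "address": "127.0.0.2"}]
    print(basic_node_status(GossipView(rows)))
    try:
        detailed_node_status(ConsistentStatusTable(rows, live_replicas=1, total_replicas=3))
    except RangeUnavailable as err:
        print("detailed status unavailable:", err)
```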
craig bot pushed a commit that referenced this issue Aug 10, 2018
28249: cli: allow `node status` to work in unavailable/broken clusters r=bdarnell a=petermattis

Expand the `crdb_internal.gossip_liveness` table to include columns
needed to satisfy the basic usage of `node status`. Specifically, added
`address`, `build`, `started_at`, `updated_at` and `replicas`
columns. Changed `node status` to use `gossip_liveness` instead of
`kv_node_status`. The latter table requires the range containing the
consistent node status descriptors to be available, while
`gossip_liveness` only retrieves info from gossip.

`node status` and `node status --decommission` will work on
unavailable/broken clusters as long as the node they are pointed to is
up. `node status {--stats,--ranges,--all}` continue to require a
reasonably healthy cluster.

Fixes #16489

Release note (cli change): Enhance `node status` to work on
unavailable/broken clusters.

Co-authored-by: Peter Mattis <[email protected]>
@craig craig bot closed this as completed in #28249 Aug 10, 2018