qa: Decommissioning nodes #21882
Comments
@m-schneider one thing I'd like you to check is that when you decommission a node (whether or not it is down before you do so), the rest of the cluster stops connecting to it. In particular, it shouldn't show up in any of the debug pages, the UI, ..., and there shouldn't be log entries for its IP.
In a four-node cluster, after decommissioning the first node in absentia (i.e. after terminating it cleanly), the graphs show one replica remaining on that node forever. This isn't actually true; @couchand, is that a UI bug? The node is down, so it can't possibly write that time series.
On the plus side, after waiting for the store dead timeout, the node disappeared from the graphs and wasn't shown as dead any more, which was expected and worked fine! 👍
Some graphs from my WIP node decommissioning test (gracefully quit the first of three nodes, decommission it, wipe it, bring it back, wait a minute; a rough sketch of the sequence is below). The test runs a kv workload capped at 500 qps on the side. This is all on my laptop. TL;DR: it looks pretty good. We seem to be doing a decent enough job of a) quitting gracefully and b) not messing things up during the rebalances when the node comes back. There are occasional dips in the graphs, correlated with a spike in node heartbeat latency (~500ms); that's something to investigate, though it may just be an artifact of running on my laptop while I do other stuff.
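For concreteness, here is a rough sketch of that manual sequence, written in Go shelling out to the `cockroach` CLI rather than the actual roachtest code. The ports, store path, node ID, and flags are assumptions for a local three-node insecure cluster; adjust them to your setup and CockroachDB version.

```go
// Rough sketch of the decommissioning test sequence described above, driving
// the cockroach CLI. All addresses, the store path, and the node ID are
// assumptions for a local three-node cluster. A kv workload capped at ~500 qps
// would be running alongside this (not shown).
package main

import (
	"log"
	"os"
	"os/exec"
	"time"
)

func run(name string, args ...string) {
	cmd := exec.Command(name, args...)
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	if err := cmd.Run(); err != nil {
		log.Fatalf("%s %v: %v", name, args, err)
	}
}

func main() {
	// 1. Gracefully quit the first node (assumed to listen on :26257).
	run("cockroach", "quit", "--insecure", "--host=localhost:26257")

	// 2. Decommission it in absentia via one of the surviving nodes.
	//    Depending on your version, a --wait setting may be needed so the
	//    command doesn't block on the dead node's replicas.
	run("cockroach", "node", "decommission", "1", "--insecure", "--host=localhost:26258")

	// 3. Wipe its store directory (assumed path).
	if err := os.RemoveAll("cockroach-data/node1"); err != nil {
		log.Fatal(err)
	}

	// 4. Bring it back as a fresh node and let it rejoin and rebalance.
	run("cockroach", "start", "--insecure", "--store=cockroach-data/node1",
		"--port=26257", "--http-port=8080", "--join=localhost:26258", "--background")

	// 5. Wait a minute, then look at the graphs / replica counts.
	time.Sleep(time.Minute)
}
```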
This is a known charting issue (#23480 I think, though it's manifesting a bit strangely in your case...) If you look at the big dot for node 1, it's not at the same point in time as the other ones. I think all that's happening is that the last recorded data point for that series was 1. It might be a good usability win to make sure decommissioned nodes always record one more datapoint before going offline, so that they are forever recorded as being clean of all replicas. I've created #23525 to keep track of that idea. |
A user observed that after decommissioning a node, connections would still be attempted to the node, seemingly forever. This was easily reproduced (gRPC warnings are logged when our onlyOnceDialer returns `grpcutil.ErrCannotReuseClientConn`).

The root cause is that when the node disappears, many affected range descriptors are not evicted by `DistSender`: when the decommissioned node was not the lease holder, there is no reason to evict. Similarly, if it was, we'll opportunistically send an RPC to the other (possibly stale, but often not) replicas, which frequently discovers the new lease holder. This doesn't seem like a problem; the replica is carried through and usually placed last (behind the lease holder and any other healthy replica). The only issue was that the transport layer connects to all of the replicas eagerly, which provokes the spew of log messages.

This commit changes the initialization process so that we don't actually GRPCDial replicas until they are needed. The previous call site of GRPCDial and the new one are very close to each other, so nothing is lost. In the common case, the rpc context already has a connection open and no actual work is done anyway.

I diagnosed this using an in-progress decommissioning roachtest (and printf debugging); I hope to be able to prevent regression of this bug in that test when it's done.

Touches cockroachdb#21882.

Release note (bug fix): Avoid connection attempts to former members of the cluster and the associated spurious log messages.
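To illustrate the idea (this is not the actual DistSender/Transport code, and the types and names below are hypothetical): instead of eagerly dialing every replica when the transport is built, keep only the addresses around and dial on first use, so a decommissioned replica sitting at the back of the ordered list is never contacted unless every healthier replica has failed.

```go
// Minimal sketch of lazy vs. eager dialing, with hypothetical types; not the
// actual CockroachDB transport code.
package transport

import (
	"context"

	"google.golang.org/grpc"
)

// Dialer stands in for an rpc.Context-style connection cache (assumed name).
type Dialer interface {
	Dial(ctx context.Context, addr string) (*grpc.ClientConn, error)
}

type replica struct {
	addr string
	conn *grpc.ClientConn // nil until first use
}

type lazyTransport struct {
	dialer   Dialer
	replicas []replica // ordered: lease holder first, suspect replicas last
	next     int
}

// NextConn dials the next replica only when we are about to send to it.
// With eager dialing, constructing the transport would already have attempted
// a connection to every replica, including former cluster members.
func (t *lazyTransport) NextConn(ctx context.Context) (*grpc.ClientConn, error) {
	r := &t.replicas[t.next]
	t.next++
	if r.conn == nil {
		conn, err := t.dialer.Dial(ctx, r.addr)
		if err != nil {
			return nil, err
		}
		r.conn = conn
	}
	return r.conn, nil
}
```

In the common case the dialer already has a healthy cached connection, so deferring the dial costs nothing; the unreachable replica at the end of the list is simply never touched, and the spurious warnings disappear.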