qa: Decommissioning nodes #21882
Comments
@m-schneider one thing I'd like you to check is that when you decommission a node (whether or not it is down before you do so), the rest of the cluster stops connecting to it. In particular, it shouldn't show up in any of the debug pages, the UI, ..., and there shouldn't be log entries for its IP.
In a four-node cluster, after decommissioning the first node in absentia (i.e. after terminating it cleanly), the graphs show one replica remaining on that node forever. This isn't actually true; @couchand, is that a UI bug? The node is down, so it can't possibly write that time series.
On the plus side, after waiting for the store dead timeout, the node disappeared from the graphs and wasn't shown as dead any more, which was expected and worked fine! 👍
Some graphs from my WIP node decommissioning test (gracefully quit the first of three nodes, decommission it, wipe it, bring it back, wait a minute; a rough sketch of the sequence is below). The test runs a kv workload capped at 500 qps on the side. This is all on my laptop. TL;DR: it looks pretty good. We seem to be doing a decent enough job of a) quitting gracefully and b) not messing things up during the rebalances when the node comes back. There are occasional dips in the graphs, correlated with a spike in node heartbeat latency (~500ms); that's something to investigate, though it may just be an artifact of running on my laptop while I do other stuff.
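For concreteness, here is a rough sketch of that manual sequence, written in Go shelling out to the `cockroach` CLI rather than the actual roachtest code. The ports, store path, node ID, and flags are assumptions for a local three-node insecure cluster; adjust them to your setup and CockroachDB version.

```go
// Rough sketch of the decommissioning test sequence described above, driving
// the cockroach CLI. All addresses, the store path, and the node ID are
// assumptions for a local three-node cluster. A kv workload capped at ~500 qps
// would be running alongside this (not shown).
package main

import (
	"log"
	"os"
	"os/exec"
	"time"
)

func run(name string, args ...string) {
	cmd := exec.Command(name, args...)
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	if err := cmd.Run(); err != nil {
		log.Fatalf("%s %v: %v", name, args, err)
	}
}

func main() {
	// 1. Gracefully quit the first node (assumed to listen on :26257).
	run("cockroach", "quit", "--insecure", "--host=localhost:26257")

	// 2. Decommission it in absentia via one of the surviving nodes.
	//    Depending on your version, a --wait setting may be needed so the
	//    command doesn't block on the dead node's replicas.
	run("cockroach", "node", "decommission", "1", "--insecure", "--host=localhost:26258")

	// 3. Wipe its store directory (assumed path).
	if err := os.RemoveAll("cockroach-data/node1"); err != nil {
		log.Fatal(err)
	}

	// 4. Bring it back as a fresh node and let it rejoin and rebalance.
	run("cockroach", "start", "--insecure", "--store=cockroach-data/node1",
		"--port=26257", "--http-port=8080", "--join=localhost:26258", "--background")

	// 5. Wait a minute, then look at the graphs / replica counts.
	time.Sleep(time.Minute)
}
```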
This is a known charting issue (#23480 I think, though it's manifesting a bit strangely in your case...) If you look at the big dot for node 1, it's not at the same point in time as the other ones. I think all that's happening is that the last recorded data point for that series was 1. It might be a good usability win to make sure decommissioned nodes always record one more datapoint before going offline, so that they are forever recorded as being clean of all replicas. I've created #23525 to keep track of that idea. |
A user observed that after decommissioning a node, connections would still be attempted to the node, seemingly forever. This was easily reproduced (gRPC warnings are logged when our onlyOnceDialer returns `grpcutil.ErrCannotReuseClientConn`).

The root cause is that when the node disappears, many affected range descriptors are not evicted by `DistSender`: when the decommissioned node was not the lease holder, there is no reason to evict. Similarly, if it was, we'll opportunistically send an RPC to the other (possibly stale, but often not) replicas, which frequently discovers the new lease holder. This doesn't seem like a problem; the replica is carried through and usually placed last (behind the lease holder and any other healthy replica). The only issue was that the transport layer connects to all of the replicas eagerly, which provokes the spew of log messages.

This commit changes the initialization process so that we don't actually GRPCDial replicas until they are needed. The previous call site of GRPCDial and the new one are very close to each other, so nothing is lost. In the common case, the rpc context already has a connection open and no actual work is done anyway.

I diagnosed this using an in-progress decommissioning roachtest (and printf debugging); I hope to be able to prevent regression of this bug in that test when it's done.

Touches cockroachdb#21882.

Release note (bug fix): Avoid connection attempts to former members of the cluster and the associated spurious log messages.
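To illustrate the idea (this is not the actual DistSender/Transport code, and the types and names below are hypothetical): instead of eagerly dialing every replica when the transport is built, keep only the addresses around and dial on first use, so a decommissioned replica sitting at the back of the ordered list is never contacted unless every healthier replica has failed.

```go
// Minimal sketch of lazy vs. eager dialing, with hypothetical types; not the
// actual CockroachDB transport code.
package transport

import (
	"context"

	"google.golang.org/grpc"
)

// Dialer stands in for an rpc.Context-style connection cache (assumed name).
type Dialer interface {
	Dial(ctx context.Context, addr string) (*grpc.ClientConn, error)
}

type replica struct {
	addr string
	conn *grpc.ClientConn // nil until first use
}

type lazyTransport struct {
	dialer   Dialer
	replicas []replica // ordered: lease holder first, suspect replicas last
	next     int
}

// NextConn dials the next replica only when we are about to send to it.
// With eager dialing, constructing the transport would already have attempted
// a connection to every replica, including former cluster members.
func (t *lazyTransport) NextConn(ctx context.Context) (*grpc.ClientConn, error) {
	r := &t.replicas[t.next]
	t.next++
	if r.conn == nil {
		conn, err := t.dialer.Dial(ctx, r.addr)
		if err != nil {
			return nil, err
		}
		r.conn = conn
	}
	return r.conn, nil
}
```

In the common case the dialer already has a healthy cached connection, so deferring the dial costs nothing; the unreachable replica at the end of the list is simply never touched, and the spurious warnings disappear.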