kv: operations failing when encountering decommissioned nodes #66586
Labels
A-kv-client
Relating to the KV client and the KV interface.
C-bug
Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
T-kv
KV Team
Comments
erikgrinaker added the C-bug, A-kv-client, and T-kv labels on Jun 17, 2021.
erikgrinaker changed the title from "kv: fix operations failing when encountering decommissioned nodes" to "kv: operations failing when encountering decommissioned nodes" on Jun 18, 2021.
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue on Jun 21, 2021:

Avoid non-active nodes (i.e. those that are decommissioning or decommissioned) when planning distributed SQL flows.

Informs cockroachdb#66586
Informs cockroachdb#66636

Release Notes: None
craig bot pushed a commit that referenced this issue on Jun 25, 2021:

66632: *: check node decommissioned/draining state for DistSQL/consistency r=tbg,knz a=erikgrinaker

The DistSQL planner and consistency queue did not take the nodes' decommissioned or draining states into account, which in particular could cause spurious errors when interacting with decommissioned nodes.

This patch adds convenience methods for checking node availability and draining states, and avoids scheduling DistSQL flows on unavailable nodes and consistency checks on unavailable/draining nodes.

Touches #66586, touches #45123.

Release note (bug fix): Avoid interacting with decommissioned nodes during DistSQL planning and consistency checking.

/cc @cockroachdb/kv

Co-authored-by: Erik Grinaker
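The availability and draining checks described in this commit message can be pictured with a minimal Go sketch. Everything in it is hypothetical: the `Node` struct, the `Membership` values, and the helper names are illustrative stand-ins, not CockroachDB's actual liveness types or methods, and the sketch assumes that "available" means live and not fully decommissioned.

```go
package main

import "fmt"

// Membership is a hypothetical stand-in for a node's decommissioning state.
type Membership int

const (
	Active Membership = iota
	Decommissioning
	Decommissioned
)

// Node is a hypothetical, simplified liveness record.
type Node struct {
	ID         int
	Live       bool
	Draining   bool
	Membership Membership
}

// IsAvailable reports whether the node can be contacted at all.
// Assumption: a node is available if it is live and not decommissioned.
func (n Node) IsAvailable() bool {
	return n.Live && n.Membership != Decommissioned
}

// flowCandidates returns the IDs of nodes that a planner could schedule
// flows on: every available node, in input order.
func flowCandidates(nodes []Node) []int {
	var ids []int
	for _, n := range nodes {
		if n.IsAvailable() {
			ids = append(ids, n.ID)
		}
	}
	return ids
}

// consistencyCandidates is stricter: it also skips draining nodes.
func consistencyCandidates(nodes []Node) []int {
	var ids []int
	for _, n := range nodes {
		if n.IsAvailable() && !n.Draining {
			ids = append(ids, n.ID)
		}
	}
	return ids
}

func main() {
	nodes := []Node{
		{ID: 1, Live: true, Membership: Decommissioned},
		{ID: 2, Live: true, Draining: true, Membership: Active},
		{ID: 3, Live: true, Membership: Active},
		{ID: 4, Live: false, Membership: Active},
	}
	fmt.Println("flow candidates:", flowCandidates(nodes))
	fmt.Println("consistency candidates:", consistencyCandidates(nodes))
}
```

Whether a node that is still decommissioning should also be skipped is a policy choice; the sketch keeps such nodes eligible, which may differ from what the actual patch does.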
craig bot pushed a commit that referenced this issue on Jun 29, 2021:

66910: kvcoord: fix rangefeed retries on transport errors r=miretskiy,tbg,aliher1911,rickystewart a=erikgrinaker

### rangecache: consider FailedPrecondition a retryable error

In #66199, the gRPC error `FailedPrecondition` (returned when contacting a decommissioned node) was incorrectly considered a permanent error during range descriptor lookups, via `IsRangeLookupErrorRetryable()`. These lookups are all trying to reach a meta-range (addressed by key), so when encountering a decommissioned node they should evict the cache token and retry the lookup.

This patch reverses that change, making only authentication errors permanent (i.e. when the local node is decommissioned or otherwise not part of the cluster).

Release note: None

### kvcoord: fix rangefeed retries on transport errors

`DistSender.RangeFeed()` was meant to retry transport errors after refreshing the range descriptor (invalidating the cached entry). However, due to an incorrect error type check (`*sendError` vs `sendError`), these errors failed the range feed without invalidating the cached range descriptor.

This was particularly severe in cases where a large number of nodes had been decommissioned, where some stale range descriptors on some nodes contained only decommissioned nodes. Since change feeds set up range feeds across many nodes and ranges in the cluster, they are likely to encounter these decommissioned nodes and return an error -- and since the descriptor cache wasn't invalidated they would keep erroring until the nodes were restarted such that the caches were flushed (often requiring a full cluster restart).

Resolves #66636, touches #66586.

Release note (bug fix): Change feeds now properly invalidate cached range descriptors and retry when encountering decommissioned nodes.

/cc @cockroachdb/kv

Co-authored-by: Erik Grinaker
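To illustrate how a pointer-versus-value mismatch in an error type check can silently skip a retry path, here is a small self-contained Go sketch. The `sendError` type below is a hypothetical stand-in, the cache is a plain map, and the direction of the mismatch (error returned by value, check written against the pointer type) is assumed for illustration only; the commit message does not say which way round the real bug was.

```go
package main

import (
	"errors"
	"fmt"
)

// sendError is a hypothetical stand-in for a transport-level error type.
// It is returned by value in this sketch.
type sendError struct{ msg string }

func (e sendError) Error() string { return e.msg }

// isTransportErrorBuggy looks for *sendError. Because the error in the
// chain is a sendError value, the check never matches.
func isTransportErrorBuggy(err error) bool {
	var target *sendError
	return errors.As(err, &target)
}

// isTransportError looks for the value type that is actually returned.
func isTransportError(err error) bool {
	var target sendError
	return errors.As(err, &target)
}

// handleError evicts the cached descriptor for key and reports whether the
// caller should retry, using the supplied transport-error check.
func handleError(err error, cache map[string]string, key string, isTransport func(error) bool) bool {
	if isTransport(err) {
		delete(cache, key) // invalidate the stale descriptor before retrying
		return true
	}
	return false
}

func main() {
	err := fmt.Errorf("rangefeed failed: %w", sendError{msg: "node decommissioned"})

	cache := map[string]string{"r1": "replicas on n1,n2,n3"}
	retry := handleError(err, cache, "r1", isTransportErrorBuggy)
	fmt.Println("buggy check: retry =", retry, "cached entries =", len(cache))

	retry = handleError(err, cache, "r1", isTransportError)
	fmt.Println("fixed check: retry =", retry, "cached entries =", len(cache))
}
```

With the mismatched check the stale entry stays in the cache and the caller gives up, which matches the symptom described above: the same error recurs until the cache is flushed some other way.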
I believe all of the known bugs should be fixed now.
To verify, I wrote up the following script which readily failed before these fixes with:
After these fixes, I did dozens of runs without seeing a single failure.
roachprod destroy local
set -euxo pipefail
roachprod create local -n 9
roachprod start local:1-3
roachprod sql local:1 -- -e 'CREATE TABLE t (id int primary key, value int);'
roachprod sql local:1 -- -e 'INSERT INTO t VALUES (1, 1), (2, 2), (3, 3);'
sleep 3
roachprod start local:4-9
for N in 4 5 6 7 8 9; do
roachprod sql local:$N -- -e 'SELECT * FROM t;'
done
cockroach node decommission --insecure 1 2 3
roachprod sql local:5 -- -e 'SET CLUSTER SETTING kv.rangefeed.enabled = true;'
roachprod sql local:7 -- -e "CREATE CHANGEFEED FOR TABLE t INTO 'experimental-nodelocal://6/tmp/changefeed';"
roachprod sql local:8 -- -e "BACKUP TABLE t INTO 'nodelocal://9/tmp/backup/t';"
for N in 4 5 6 7 8 9; do
roachprod sql local:$N -- -e 'SELECT * FROM t;'
done
This was referenced Jun 29, 2021
Relevant fixes have been backported to 20.2 and 21.1, and will be included in the next patch releases (20.2.13 and 21.1.5).
On 21.1.1, operations tend to fail when encountering decommissioned nodes. Users have reported that after doing a full cycle of spinning up new nodes and decommissioning all old nodes, `CHANGEFEED` and `BACKUP` operations would fail:

Restarting the affected nodes fixed the problem, so it seems related to some transitory state like range descriptor/leaseholder caches or gossip state.

This hasn't been reproducible, but other operations such as `SET CLUSTER SETTING` and `SELECT * FROM t` have occasionally been seen to fail this way.

There seem to be multiple reasons for this:

- Any RPC ping involving a decommissioned node (both to and from) returned `PermissionDenied`, which is treated as a permanent error. However, this was too symmetrical: when e.g. the DistSender tries to contact a range and encounters a decommissioned node, it shouldn't give up, but rather refresh its caches and retry against a different replica. Any KV operation could fail due to this. Fixed by "server: return FailedPrecondition when talking to decom node" #66199.
- Operations that explicitly choose nodes to operate on (in particular, DistSQL jobs) often only consider a node's liveness, but not its decommissioned state. Fixed by "*: check node decommissioned/draining state for DistSQL/consistency" #66632.
- "changefeedccl: Rangefeeds might fail due to stale range cache" #66636: Range feeds appear not to flush range descriptor caches or lease caches when encountering authentication errors. This was found to be due to a bug in range feed error handling. Fixed by "kvcoord: fix rangefeed retries on transport errors" #66910.
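The asymmetry described in the first cause (a decommissioned remote node should trigger a retry against another replica, while the local node being rejected is permanent) can be sketched in Go as follows. The `Code` constants only mirror the names of the gRPC status codes mentioned above; `rpcError`, `isRetryable`, and `sendToReplicas` are hypothetical helpers written for this sketch, not CockroachDB or gRPC APIs, and a real client would also refresh its range cache between attempts.

```go
package main

import (
	"errors"
	"fmt"
)

// Code is a hypothetical stand-in for a gRPC status code.
type Code int

const (
	FailedPrecondition Code = iota + 1 // remote node is decommissioned
	PermissionDenied                   // local node is not allowed in the cluster
)

// rpcError is a hypothetical error carrying a status code and target node.
type rpcError struct {
	Code Code
	Node int
}

func (e *rpcError) Error() string {
	return fmt.Sprintf("rpc to n%d failed with code %d", e.Node, e.Code)
}

// isRetryable treats FailedPrecondition as retryable against another
// replica and everything else, including PermissionDenied, as permanent.
func isRetryable(err error) bool {
	var re *rpcError
	if errors.As(err, &re) {
		return re.Code == FailedPrecondition
	}
	return false
}

// sendToReplicas tries each replica in order and returns the node that
// served the request. It stops at the first permanent error.
func sendToReplicas(replicas []int, send func(node int) error) (int, error) {
	if len(replicas) == 0 {
		return 0, errors.New("no replicas to try")
	}
	var lastErr error
	for _, n := range replicas {
		err := send(n)
		if err == nil {
			return n, nil
		}
		if !isRetryable(err) {
			return 0, err
		}
		lastErr = err // decommissioned node: move on to the next replica
	}
	return 0, fmt.Errorf("all replicas failed: %w", lastErr)
}

func main() {
	// Nodes 1-3 are decommissioned in this example; node 4 is healthy.
	send := func(node int) error {
		if node <= 3 {
			return &rpcError{Code: FailedPrecondition, Node: node}
		}
		return nil
	}
	n, err := sendToReplicas([]int{1, 2, 4}, send)
	fmt.Println("served by node:", n, "err:", err)
}
```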
/cc @cockroachdb/kv