Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kv: operations failing when encountering decommissioned nodes #66586

Closed
3 tasks done
erikgrinaker opened this issue Jun 17, 2021 · 3 comments
Closed
3 tasks done

kv: operations failing when encountering decommissioned nodes #66586

erikgrinaker opened this issue Jun 17, 2021 · 3 comments
Assignees
Labels
A-kv-client Relating to the KV client and the KV interface. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-kv KV Team

Comments

@erikgrinaker
Copy link
Contributor

erikgrinaker commented Jun 17, 2021

On 21.1.1, operations tend to fail when encountering decommissioned nodes. Users have reported that after doing a full cycle of spinning up new nodes and decommissioning all old nodes, CHANGEFEED and BACKUP operations would fail:

CREATE CHANGEFEED FOR TABLE t INTO 'experimental-s3://bucket/changefeed/?AUTH=implicit' WITH compression = 'gzip', diff, format = 'json', initial_scan, resolved = '5m', schema_change_events = 'column_changes', schema_change_policy = 'backfill', updated;
ERROR: failed to connect to n4 at 172.31.91.238:26257: initial connection heartbeat failed: rpc error: code = PermissionDenied desc = n4 was permanently removed from the cluster at 2021-05-27 21:08:12.282567682 +0000 UTC; it is not allowed to rejoin the cluster

Restarting the affected nodes fixed the problem, so it seems related to some transitory state like range descriptor/leaseholder caches or gossip state.

This hasn't been reproducible, but other operations such as SET CLUSTER SETTING and SELECT * FROM t have occasionally been seen to fail this way.

There seems to be multiple reasons for this:

/cc @cockroachdb/kv

@erikgrinaker erikgrinaker added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-kv-client Relating to the KV client and the KV interface. T-kv KV Team labels Jun 17, 2021
@erikgrinaker erikgrinaker self-assigned this Jun 17, 2021
@erikgrinaker erikgrinaker changed the title kv: fix operations failing when encountering decommissioned nodes kv: operations failing when encountering decommissioned nodes Jun 18, 2021
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Jun 21, 2021
Avoid non-active nodes (i.e. those that are decomissioning or
decomissioned) when planning distributed sql flows.

Informs cockroachdb#66586
Informs cockroachdb#66636

Release Notes: None
craig bot pushed a commit that referenced this issue Jun 25, 2021
66632: *: check node decommissioned/draining state for DistSQL/consistency  r=tbg,knz a=erikgrinaker

The DistSQL planner and consistency queue did not take the nodes'
decommissioned or draining states into account, which in particular
could cause spurious errors when interacting with decommissioned nodes.

This patch adds convenience methods for checking node availability and
draining states, and avoids scheduling DistSQL flows on
unavailable nodes and consistency checks on unavailable/draining nodes.

Touches #66586, touches #45123.

Release note (bug fix): Avoid interacting with decommissioned nodes
during DistSQL planning and consistency checking.

/cc @cockroachdb/kv 

Co-authored-by: Erik Grinaker <[email protected]>
craig bot pushed a commit that referenced this issue Jun 29, 2021
66910: kvcoord: fix rangefeed retries on transport errors r=miretskiy,tbg,aliher1911,rickystewart a=erikgrinaker

### rangecache: consider FailedPrecondition a retryable error

In #66199, the gRPC error `FailedPrecondition` (returned when contacting
a decommissioned node) was incorrectly considered a permanent error
during range descriptor lookups, via `IsRangeLookupErrorRetryable()`.
These lookups are all trying to reach a meta-range (addressed by key),
so when encountering a decommissioned node they should evict the cache
token and retry the lookup.

This patch reverses that change, making only authentication errors
permanent (i.e. when the local node is decommissioned or otherwise not
part of the cluster).

Release note: None

### kvcoord: fix rangefeed retries on transport errors

`DistSender.RangeFeed()` was meant to retry transport errors after
refreshing the range descriptor (invalidating the cached entry).
However, due to an incorrect error type check (`*sendError` vs
`sendError`), these errors failed the range feed without invalidating
the cached range descriptor.

This was particularly severe in cases where a large number of nodes had
been decommissioned, where some stale range descriptors on some nodes
contained only decommissioned nodes. Since change feeds set up range
feeds across many nodes and ranges in the cluster, they are likely to
encounter these decommissioned nodes and return an error -- and since
the descriptor cache wasn't invalidated they would keep erroring until
the nodes were restarted such that the caches were flushed (often
requiring a full cluster restart).

Resolves #66636, touches #66586.

Release note (bug fix): Change feeds now properly invalidate cached
range descriptors and retry when encountering decommissioned nodes.

/cc @cockroachdb/kv 

Co-authored-by: Erik Grinaker <[email protected]>
@erikgrinaker
Copy link
Contributor Author

I believe all of the known bugs should be fixed now.

@erikgrinaker
Copy link
Contributor Author

erikgrinaker commented Jun 29, 2021

To verify, I wrote up the following script which readily failed before these fixes with:

rpc error: code = PermissionDenied desc = n1 was permanently removed from the cluster at 2021-06-29 12:41:50.019428021 +0000 UTC; it is not allowed to rejoin the cluster

After these fixes, I did dozens of runs without seeing a single failure.

roachprod destroy local

set -euxo pipefail

roachprod create local -n 9
roachprod start local:1-3
roachprod sql local:1 -- -e 'CREATE TABLE t (id int primary key, value int);'
roachprod sql local:1 -- -e 'INSERT INTO t VALUES (1, 1), (2, 2), (3, 3);'
sleep 3

roachprod start local:4-9
for N in 4 5 6 7 8 9; do
  roachprod sql local:$N -- -e 'SELECT * FROM t;'
done

cockroach node decommission --insecure 1 2 3

roachprod sql local:5 -- -e 'SET CLUSTER SETTING kv.rangefeed.enabled = true;'
roachprod sql local:7 -- -e "CREATE CHANGEFEED FOR TABLE t INTO 'experimental-nodelocal://6/tmp/changefeed';"
roachprod sql local:8 -- -e "BACKUP TABLE t INTO 'nodelocal://9/tmp/backup/t';"
for N in 4 5 6 7 8 9; do
  roachprod sql local:$N -- -e 'SELECT * FROM t;'
done

@erikgrinaker
Copy link
Contributor Author

Relevant fixes have been backported to 20.2 and 21.1, will be included in the next patch releases (20.2.13 and 21.1.5).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-kv-client Relating to the KV client and the KV interface. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-kv KV Team
Projects
None yet
Development

No branches or pull requests

1 participant