Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kv: avoid internal retries for permanent errors #61470

Closed
erikgrinaker opened this issue Mar 4, 2021 · 0 comments · Fixed by #62608
Closed

kv: avoid internal retries for permanent errors #61470

erikgrinaker opened this issue Mar 4, 2021 · 0 comments · Fixed by #62608
Assignees
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.

Comments

@erikgrinaker
Copy link
Contributor

erikgrinaker commented Mar 4, 2021

The KV layer often retries RPC errors internally, but tries to error out early on permanent errors, e.g.:

func IsRangeLookupErrorRetryable(err error) bool {
// For now, all errors are retryable except authentication/authorization.
return !grpcutil.IsAuthError(err)
}

However, coverage of this is often spotty, which can cause operations to hang indefinitely. This can often be seen when decommissioning a node and then trying to run operations via that node -- these will often error internally with codes.PermissionDenied, but since the error isn't always considered permanent and returned early, it is retried internally and the operation hangs.

This problem is exacerbated by the RPC layer often wrapping errors in roachpb.Error internally:

if err != nil {
return response{pErr: roachpb.NewError(err)}
}

These errors do not propagate gRPC error codes because of cockroachdb/errors#63. As mentioned in #56208, we also often rely on checking the error message because of the lack of structured errors, which is brittle and should also be avoided.

@erikgrinaker erikgrinaker added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Mar 4, 2021
@erikgrinaker erikgrinaker self-assigned this Mar 4, 2021
erikgrinaker added a commit to erikgrinaker/cockroach that referenced this issue Mar 4, 2021
In cockroachdb#61356, `TestDecommissionedNodeCannotConnect` was extended to check
error propagation for a `Scan` operation after a node was decommissioned.
However, because of cockroachdb#61470 not all failure modes propagate errors
properly, causing the test to sometimes hang. This patch disables this
check until these problems are addresses.

Release justification: non-production code changes
Release note: None
craig bot pushed a commit that referenced this issue Mar 4, 2021
59288: rfc: declarative, state-based schema changer r=lucy-zhang a=lucy-zhang

https://github.com/lucy-zhang/cockroach/blob/schema-changer-rfc/docs/RFCS/20210115_new_schema_changer.md

Release note: none

61459: tracing: cap registry size at 5k r=irfansharif a=tbg

This caps the size of the span registry at 5k, evicting "random" (map
order) existing entries when at the limit.
The purpose of this is to serve as a guardrail against leaked spans,
which would otherwise lead to unbounded memory growth.

Touches #59188.

Release justification: low risk, high benefit changes to existing
functionality
Release note: None


61471: server: disable flaky check in TestDecommissionedNodeCannotConnect r=tbg a=erikgrinaker

In #61356, `TestDecommissionedNodeCannotConnect` was extended to check
error propagation for a `Scan` operation after a node was decommissioned.
However, because of #61470 not all failure modes propagate errors
properly, causing the test to sometimes hang. This patch disables this
check until these problems are addressed.

Resolves #61452.

Release justification: non-production code changes
Release note: None

Co-authored-by: Lucy Zhang <[email protected]>
Co-authored-by: Tobias Grieger <[email protected]>
Co-authored-by: Erik Grinaker <[email protected]>
@craig craig bot closed this as completed in 17dd168 Mar 31, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant