-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
*: improve handling of permanent errors #62608
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! One comment about unit testing the propagation but basically LGTM once that's done.
NB: when you say "might resolve XXX" it will get closed on merge.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)
pkg/roachpb/errors.go, line 25 at r1 (raw file):
"github.com/cockroachdb/errors" "github.com/cockroachdb/errors/errorspb" _ "github.com/cockroachdb/errors/extgrpc" // Register EncodeError support for gRPC Status.
nit: // register EncodeError support for gRPC Status
Can you add a unit test that proves that whatever this does "works"?
pkg/server/connectivity_test.go, line 400 at r1 (raw file):
// // TODO(erikgrinaker): until cockroachdb/errors preserves grpcstatus.Error // across errors.EncodeError() we'll have to disable this, since the return
Mind stressing this for min(~10k iterations, 30m)
or thereabouts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 5 of 5 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)
Reviewable is down for me, replying here.
I let it and the acceptance test run for a couple of hours more, and they both eventually failed/hung. There's still some (rare) code paths that obscure the original error. I'll mark this as draft for now and do a full pass of error handling for #61470. |
9e5ca65
to
61f1a49
Compare
Error
0c3ac40
to
c970f54
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pretty sure this resolves the acceptance test. The hang happened while writing the event record, which races with the node liveness entry propagation. In cases where it loses the race, it would hang due to internal RPC retries; now, the txn will return an error that will get logged but not propagated to the decommission caller. Recording this entry is best-effort due to the lack of txn support for liveness entries. This is also the case for the status entry removal.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker and @tbg)
pkg/roachpb/errors.go, line 25 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
nit:
// register EncodeError support for gRPC Status
Can you add a unit test that proves that whatever this does "works"?
Yep, done.
pkg/server/connectivity_test.go, line 400 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Mind stressing this for
min(~10k iterations, 30m)
or thereabouts?
I've added a bunch of other tests as well, and stressed each of them for 30m, as well as the acceptance test for 1h. The only failures I've seen are I/O timeouts and handshake timeouts caused by excessive load, stressing at lower parallelism resolves it (8 is generally fine, 16 isn't).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 6 of 6 files at r2, 11 of 11 files at r3.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)
pkg/roachpb/errors_test.go, line 214 at r2 (raw file):
decoded, ok := status.FromError(goErr) require.True(t, ok, "expected gRPC status error, got %T: %v", goErr, goErr)
nit: there's require.IsType
or something like that
Not worth another CI cycle, just keep it in mind for next time.
pkg/server/admin_test.go, line 2098 at r3 (raw file):
ReplicationMode: base.ReplicationManual, // saves time ServerArgs: base.TestServerArgs{ Insecure: true, // allows admin client without seting up certs
nit: setting
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @erikgrinaker)
c970f54
to
a7b6290
Compare
`roachpb.Error` uses `errors.EncodeError()` and `errors.DecodeError()` to preserve the original structured error. Unfortunately, these did not handle gRPC `Status` errors, resulting in a generic unstructured error after decoding. Since this is used while propagating errors through the KV layer, it could prevent detection of the original error type. Support for gRPC Status encoding was added in cockroachdb/errors 1.8.3, this patch registers the error en/decoder such that these errors are preserved across `roachpb.Error`. A later patch will extend existing error handling code to make better use of these. Release note: None
RPC errors are normally retried. However, in some cases the errors are permanent such that retries are futile, and this can cause operations to appear to hang as they keep retrying -- e.g. when running operations on a decommissioned node. There is already some detection of permanent errors, but it is incomplete. This patch attempts to improve coverage of permanent errors, in particular in the context of decommissioned nodes, and adds test cases for these scenarios. Release note: None
a7b6290
to
dd325f0
Compare
bors r=tbg |
Build failed (retrying...): |
Build failed (retrying...): |
Build succeeded: |
*: improve handling of permanent errors
RPC errors are normally retried. However, in some cases the errors are
permanent such that retries are futile, and this can cause operations to
appear to hang as they keep retrying -- e.g. when running operations on
a decommissioned node. There is already some detection of permanent
errors, but it is incomplete.
This patch attempts to improve coverage of permanent errors, in
particular in the context of decommissioned nodes, and adds test cases
for these scenarios.
Release note: None
roachpb: propagate gRPC Status errors across Error
roachpb.Error
useserrors.EncodeError()
anderrors.DecodeError()
to preserve the original structured error. Unfortunately, these did not
handle gRPC
Status
errors, resulting in a generic unstructured errorafter decoding. Since this is used while propagating errors through the
KV layer, it could prevent detection of the original error type.
Support for gRPC Status encoding was added in cockroachdb/errors 1.8.3,
this patch registers the error en/decoder such that these errors are
preserved across
roachpb.Error
. A later patch will extend existingerror handling code to make better use of these.
Release note: None
Resolves #62233, resolves #61470. Touches #56208.
Decommissioned nodes will keep running a bunch of async processes that
try (and fail) to communicate with the cluster. These have not been
addressed here, see #62693.