Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

*: improve handling of permanent errors #62608

Merged
merged 2 commits into from
Mar 31, 2021

Conversation

erikgrinaker
Copy link
Contributor

@erikgrinaker erikgrinaker commented Mar 25, 2021

*: improve handling of permanent errors

RPC errors are normally retried. However, in some cases the errors are
permanent such that retries are futile, and this can cause operations to
appear to hang as they keep retrying -- e.g. when running operations on
a decommissioned node. There is already some detection of permanent
errors, but it is incomplete.

This patch attempts to improve coverage of permanent errors, in
particular in the context of decommissioned nodes, and adds test cases
for these scenarios.

Release note: None

roachpb: propagate gRPC Status errors across Error

roachpb.Error uses errors.EncodeError() and errors.DecodeError()
to preserve the original structured error. Unfortunately, these did not
handle gRPC Status errors, resulting in a generic unstructured error
after decoding. Since this is used while propagating errors through the
KV layer, it could prevent detection of the original error type.

Support for gRPC Status encoding was added in cockroachdb/errors 1.8.3,
this patch registers the error en/decoder such that these errors are
preserved across roachpb.Error. A later patch will extend existing
error handling code to make better use of these.

Release note: None

Resolves #62233, resolves #61470. Touches #56208.

Decommissioned nodes will keep running a bunch of async processes that
try (and fail) to communicate with the cluster. These have not been
addressed here, see #62693.

@erikgrinaker erikgrinaker requested a review from tbg March 25, 2021 16:49
@erikgrinaker erikgrinaker requested a review from a team as a code owner March 25, 2021 16:49
@erikgrinaker erikgrinaker self-assigned this Mar 25, 2021
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! One comment about unit testing the propagation but basically LGTM once that's done.

NB: when you say "might resolve XXX" it will get closed on merge.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)


pkg/roachpb/errors.go, line 25 at r1 (raw file):

	"github.com/cockroachdb/errors"
	"github.com/cockroachdb/errors/errorspb"
	_ "github.com/cockroachdb/errors/extgrpc" // Register EncodeError support for gRPC Status.

nit: // register EncodeError support for gRPC Status

Can you add a unit test that proves that whatever this does "works"?


pkg/server/connectivity_test.go, line 400 at r1 (raw file):

	//
	// TODO(erikgrinaker): until cockroachdb/errors preserves grpcstatus.Error
	// across errors.EncodeError() we'll have to disable this, since the return

Mind stressing this for min(~10k iterations, 30m) or thereabouts?

Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 5 of 5 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)

@erikgrinaker
Copy link
Contributor Author

Reviewable is down for me, replying here.

Mind stressing this for min(~10k iterations, 30m) or thereabouts?

I let it and the acceptance test run for a couple of hours more, and they both eventually failed/hung. There's still some (rare) code paths that obscure the original error. I'll mark this as draft for now and do a full pass of error handling for #61470.

@erikgrinaker erikgrinaker marked this pull request as draft March 25, 2021 18:59
@erikgrinaker erikgrinaker force-pushed the grpc-errors branch 8 times, most recently from 9e5ca65 to 61f1a49 Compare March 28, 2021 21:13
@erikgrinaker erikgrinaker changed the title roachpb: propagate gRPC Status errors across Error *: improve handling of permanent errors Mar 28, 2021
@erikgrinaker erikgrinaker force-pushed the grpc-errors branch 2 times, most recently from 0c3ac40 to c970f54 Compare March 30, 2021 11:52
@erikgrinaker erikgrinaker marked this pull request as ready for review March 30, 2021 12:07
Copy link
Contributor Author

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pretty sure this resolves the acceptance test. The hang happened while writing the event record, which races with the node liveness entry propagation. In cases where it loses the race, it would hang due to internal RPC retries; now, the txn will return an error that will get logged but not propagated to the decommission caller. Recording this entry is best-effort due to the lack of txn support for liveness entries. This is also the case for the status entry removal.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker and @tbg)


pkg/roachpb/errors.go, line 25 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

nit: // register EncodeError support for gRPC Status

Can you add a unit test that proves that whatever this does "works"?

Yep, done.


pkg/server/connectivity_test.go, line 400 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Mind stressing this for min(~10k iterations, 30m) or thereabouts?

I've added a bunch of other tests as well, and stressed each of them for 30m, as well as the acceptance test for 1h. The only failures I've seen are I/O timeouts and handshake timeouts caused by excessive load, stressing at lower parallelism resolves it (8 is generally fine, 16 isn't).

Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 6 of 6 files at r2, 11 of 11 files at r3.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)


pkg/roachpb/errors_test.go, line 214 at r2 (raw file):

	decoded, ok := status.FromError(goErr)
	require.True(t, ok, "expected gRPC status error, got %T: %v", goErr, goErr)

nit: there's require.IsType or something like that
Not worth another CI cycle, just keep it in mind for next time.


pkg/server/admin_test.go, line 2098 at r3 (raw file):

		ReplicationMode: base.ReplicationManual, // saves time
		ServerArgs: base.TestServerArgs{
			Insecure: true, // allows admin client without seting up certs

nit: setting

Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @erikgrinaker)

`roachpb.Error` uses `errors.EncodeError()` and `errors.DecodeError()`
to preserve the original structured error. Unfortunately, these did not
handle gRPC `Status` errors, resulting in a generic unstructured error
after decoding. Since this is used while propagating errors through the
KV layer, it could prevent detection of the original error type.

Support for gRPC Status encoding was added in cockroachdb/errors 1.8.3,
this patch registers the error en/decoder such that these errors are
preserved across `roachpb.Error`. A later patch will extend existing
error handling code to make better use of these.

Release note: None
RPC errors are normally retried. However, in some cases the errors are
permanent such that retries are futile, and this can cause operations to
appear to hang as they keep retrying -- e.g. when running operations on
a decommissioned node. There is already some detection of permanent
errors, but it is incomplete.

This patch attempts to improve coverage of permanent errors, in
particular in the context of decommissioned nodes, and adds test cases
for these scenarios.

Release note: None
@erikgrinaker
Copy link
Contributor Author

bors r=tbg

@craig
Copy link
Contributor

craig bot commented Mar 31, 2021

Build failed (retrying...):

@craig
Copy link
Contributor

craig bot commented Mar 31, 2021

Build failed (retrying...):

@craig
Copy link
Contributor

craig bot commented Mar 31, 2021

Build succeeded:

@craig craig bot merged commit 17dd168 into cockroachdb:master Mar 31, 2021
@erikgrinaker erikgrinaker deleted the grpc-errors branch July 26, 2021 11:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

roachtest: acceptance/decommission-self failed kv: avoid internal retries for permanent errors
3 participants