Full recovery mode #14236
force-pushed from d023d19 to 413ff6a
New failures detected in https://buildkite.com/redpanda/redpanda/builds/39141#018b3fb7-001b-4617-b94f-da3a4fe4e894: "rptest.tests.tiered_storage_model_test.TieredStorageTest.test_tiered_storage.cloud_storage_type=CloudStorageType.S3.test_case=.TS_Read==True.TS_TxRangeMaterialized==True.SpilloverManifestUploaded==True"
The error doesn't look related to my changes; opened #14266.
LGTM, all nits/clarifications.
@@ -879,6 +879,10 @@ fetch_handler::handle(request_context rctx, ss::smp_service_group ssg) {
        octx.response.data.error_code = octx.session_ctx.error();
        return std::move(octx).send_response();
    }
    if (octx.rctx.recovery_mode_enabled()) {
        octx.response.data.error_code = error_code::policy_violation;
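The hunk adds an early return to the fetch path: when recovery mode is on, the handler sets `policy_violation` on the response before any partition data is read. The sketch below models only that gate. `request_context`, `fetch_response` and `handle_fetch` are hypothetical, simplified stand-ins rather than redpanda's real types; the numeric value 44 is the Kafka protocol code for POLICY_VIOLATION.

```cpp
#include <cstdint>

// Hypothetical stand-ins for the real redpanda types.
// Kafka protocol error codes: 0 = NONE, 44 = POLICY_VIOLATION.
enum class error_code : int16_t {
    none = 0,
    policy_violation = 44,
};

struct request_context {
    bool recovery_mode = false;
    bool recovery_mode_enabled() const { return recovery_mode; }
};

struct fetch_response {
    error_code error = error_code::none;
    bool served_data = false;
};

// Early-return gate: in recovery mode, answer with policy_violation
// before touching any partition data.
fetch_response handle_fetch(const request_context& rctx) {
    fetch_response resp;
    if (rctx.recovery_mode_enabled()) {
        resp.error = error_code::policy_violation;
        return resp;
    }
    resp.served_data = true; // placeholder for the normal fetch path
    return resp;
}
```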
Curious why policy_violation, and what are the client implications? Is it retryable, or does the client just give up?
It is a non-recoverable error that semantically looks close to what we need here. Looks like some clients will still retry even in the presence of non-recoverable errors (e.g. rpk produce fails, but rpk consume retries indefinitely), but the hope is that these retries are less eager than for recoverable errors.
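To illustrate the hope expressed above (clients retrying a non-recoverable error less eagerly than a recoverable one), here is a hypothetical client-side backoff policy. It does not describe rpk or any specific Kafka client; the retriable set and all delays are illustrative assumptions. The error code numbers follow the Kafka protocol (6 = NOT_LEADER_OR_FOLLOWER, 7 = REQUEST_TIMED_OUT, 44 = POLICY_VIOLATION).

```cpp
#include <chrono>
#include <cstdint>

// Kafka protocol error codes (subset).
constexpr int16_t not_leader_or_follower = 6; // retriable
constexpr int16_t request_timed_out = 7;      // retriable
constexpr int16_t policy_violation = 44;      // not retriable

// Hypothetical classification used by this sketch only.
bool is_retriable(int16_t code) {
    return code == not_leader_or_follower || code == request_timed_out;
}

// Capped exponential backoff. Non-retriable errors start from a much
// larger base delay, so a client that retries them anyway does so slowly.
std::chrono::milliseconds next_backoff(int16_t code, int attempt) {
    using std::chrono::milliseconds;
    const milliseconds base =
      is_retriable(code) ? milliseconds(100) : milliseconds(5000);
    const milliseconds cap = milliseconds(60000);
    milliseconds delay = base;
    for (int i = 0; i < attempt && delay < cap; ++i) {
        delay *= 2;
    }
    return delay < cap ? delay : cap;
}
```

With these assumed values, a retriable error waits 100 ms, 200 ms, 400 ms and so on, while `policy_violation` starts at 5 s and quickly reaches the 60 s cap.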
force-pushed from 413ff6a to 303c13f
Disable the following in recovery mode:
* HTTP proxy
* schema registry listener
* TS purger
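The commit above amounts to conditional startup driven by a single node flag. A minimal sketch, assuming a hypothetical `node_config` struct and illustrative service-name strings (these are not redpanda identifiers); keeping the Kafka API and admin API running in recovery mode is an assumption based on the later comment that metadata operations and most of the admin API still work.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical node configuration; only the relevant flag is modeled.
struct node_config {
    bool recovery_mode_enabled = false;
};

// Services that always start (assumed), plus those skipped in recovery mode.
std::vector<std::string> services_to_start(const node_config& cfg) {
    std::vector<std::string> services{"kafka_api", "admin_api"};
    if (!cfg.recovery_mode_enabled) {
        services.push_back("http_proxy");
        services.push_back("schema_registry");
        services.push_back("ts_purger"); // tiered storage purger
    }
    return services;
}

// Convenience check: would the named service be started?
bool will_start(const node_config& cfg, const std::string& name) {
    const auto services = services_to_start(cfg);
    return std::find(services.begin(), services.end(), name) != services.end();
}
```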
force-pushed from 303c13f to 354f06b
force-pushed from 354f06b to 9a409eb
Changes in force-push: addressed review comments and added group describe test checks in recovery mode.
@ztlpn this is coooool! Should this come with a list of admin API endpoints that are available? Perhaps admin API endpoints to trigger GC of segments, or something like that.
ducktape was retried in job https://buildkite.com/redpanda/redpanda/builds/39441#018b4dee-f6e7-4c66-9f2e-1a6cc2a0ffcc
/ci-repeat 1
ducktape was retried in job https://buildkite.com/redpanda/redpanda/builds/39570#018b5be3-0fc3-405f-9b4a-efaf08755d8f
@emaxerrno Metadata operations are still available, so most of the existing admin API should work without problems. Having additional endpoints for fixing problematic partitions makes sense but is a bit out of scope for this project (as a start, it would be great to at least have the ability to delete them, and recovery mode allows this).
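Taken together, the thread describes a split between the data path and metadata operations: fetch is rejected (per the diff), produce fails (per the rpk observation above), while group describe and partition deletion remain possible. The sketch below encodes that split; `request_kind` and both functions are hypothetical and do not correspond to redpanda's API key handling.

```cpp
// Hypothetical request categories, for illustration only.
enum class request_kind {
    produce,
    fetch,
    metadata,
    describe_groups,
    delete_topics,
};

// Data-path requests are rejected in recovery mode; metadata-style
// operations stay available.
bool allowed_in_recovery_mode(request_kind kind) {
    switch (kind) {
    case request_kind::produce:
    case request_kind::fetch:
        return false;
    case request_kind::metadata:
    case request_kind::describe_groups:
    case request_kind::delete_topics:
        return true;
    }
    return false; // unreachable for valid enumerators
}

// Outside recovery mode every request kind is allowed.
bool request_allowed(request_kind kind, bool recovery_mode_enabled) {
    return !recovery_mode_enabled || allowed_in_recovery_mode(kind);
}
```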
Add "full recovery mode":
Backports Required
Release Notes
Features
recovery_mode_enabled node config property.
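Since recovery_mode_enabled is a boolean node property, a node is either in recovery mode or not from startup. The sketch below reads such a flag from flat `key: value` text and treats a missing key as `false`. The default of `false` is an assumption, and `read_recovery_mode_enabled` is a hypothetical helper, not redpanda's YAML config loader.

```cpp
#include <sstream>
#include <string>

// Hypothetical reader: returns true only if a line contains
// "recovery_mode_enabled:" followed by "true"; absent key -> false.
bool read_recovery_mode_enabled(const std::string& config_text) {
    std::istringstream in(config_text);
    std::string line;
    const std::string key = "recovery_mode_enabled:";
    while (std::getline(in, line)) {
        const auto pos = line.find(key);
        if (pos == std::string::npos) {
            continue;
        }
        std::string value = line.substr(pos + key.size());
        value.erase(0, value.find_first_not_of(' ')); // trim leading spaces
        return value == "true";
    }
    return false;
}
```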